Prediction of Intrinsically Disordered Functional Regions by IDPFunNet

Training dataset

Training dataset (552 sequences)

Validation dataset

Validation dataset (227 sequences)

Independent test datasets

TE210 dataset (210 sequences)

TE83 dataset (83 sequences)

Dataset results

CAID datasets

CAID2 binding dataset (78 sequences)

CAID2 linker dataset (40 sequences)

CAID3 binding dataset (51 sequences)

CAID3 linker dataset (20 sequences)

The CAID2 and CAID3 datasets used in this work were obtained from the official Critical Assessment of Intrinsic Disorder (CAID) challenge website. The datasets correspond to the CAID Round 2 and Round 3 benchmark collections.

Del Conte A, Mehdiabadi M, Bouhraoua A, Miguel Monzon A, Tosatto SCE, Piovesan D. Critical assessment of protein intrinsic disorder prediction (CAID) – Results of round 2. Proteins. 2023 Dec;91(12):1925–1934. doi: 10.1002/prot.26582.
Mehdiabadi M, Del Conte A, Nugnes MV, Aspromonte MC, Tosatto SCE, Piovesan D. Critical Assessment of Protein Intrinsic Disorder Round 3 – Predicting Disorder in the Era of Protein Language Models. Proteins. 2026 Jan;94(1):414–424. doi: 10.1002/prot.70045.

Description of datasets

For Training, Validation, and Independent Test Datasets

The training, validation, and independent test datasets (TE210 and TE83) all share the same structure, with nine lines per protein:

Line 1: Protein ID - A unique identifier for each protein.
Line 2: Protein Sequence - Encoded using 1-letter amino acid representation.
Line 3: Annotations of Intrinsically Disordered Regions (IDR) - '1' indicates an IDR, '0' indicates a non-IDR.
Line 4: Annotations of Disordered Protein-binding Regions (PB) - '1' indicates a PB, '0' indicates a non-PB.
Line 5: Annotations of Disordered Nucleic Acid-binding Regions (NB) - '1' indicates an NB, '0' indicates a non-NB.
Line 6: Annotations of Disordered Lipid-binding Regions (LB) - '1' indicates an LB, '0' indicates a non-LB.
Line 7: Annotations of Disordered Ion-binding Regions (IB) - '1' indicates an IB, '0' indicates a non-IB.
Line 8: Annotations of Disordered Small Molecule-binding Regions (SB) - '1' indicates an SB, '0' indicates a non-SB.
Line 9: Annotations of Disordered Flexible Linkers (DFL) - '1' indicates a DFL, '0' indicates a non-DFL.
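
The nine-line layout above could be read with a short Python routine along the following lines. This is only a minimal sketch: it assumes each protein occupies nine consecutive non-blank lines in the order listed, and that every annotation string has the same length as the sequence. The function name `parse_records`, the dictionary keys, and the stripping of a leading `>` from the ID line are illustrative assumptions, not part of the IDPFunNet distribution.

```python
from typing import Dict, Iterable, List

# Field order follows the nine-line layout described above.
FIELDS = ["id", "sequence", "IDR", "PB", "NB", "LB", "IB", "SB", "DFL"]


def parse_records(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Group non-blank lines into nine-line protein records (assumed layout)."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % len(FIELDS) != 0:
        raise ValueError("line count is not a multiple of 9")
    records = []
    for start in range(0, len(rows), len(FIELDS)):
        rec = dict(zip(FIELDS, rows[start:start + len(FIELDS)]))
        # Assumption: the ID line may carry a FASTA-style '>' prefix.
        rec["id"] = rec["id"].lstrip(">")
        # Each per-residue annotation should align with the sequence.
        for name in FIELDS[2:]:
            if len(rec[name]) != len(rec["sequence"]):
                raise ValueError(f"{rec['id']}: {name} length mismatch")
        records.append(rec)
    return records


# Made-up example record for illustration; not taken from the datasets.
example = """>P00001
MKTAYIAK
11100000
11000000
00000000
00000000
00000000
00000000
00100000
""".splitlines()

records = parse_records(example)
print(records[0]["id"], records[0]["PB"])
```

To read a downloaded file, the same function can be given an open file handle, for example `parse_records(open(path))`, where `path` is wherever the dataset was saved.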

For CAID Datasets

All four datasets share the same structure:

Line 1: Protein ID - A unique identifier for each protein.
Line 2: Protein Sequence - Encoded using 1-letter amino acid representation.
Line 3: Annotations of Disordered Binding Regions (BR) or Disordered Flexible Linkers (DFL), depending on the dataset - '1' indicates a BR/DFL, '0' indicates a non-BR/non-DFL.
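
As a rough illustration of the three-line CAID layout, the Python sketch below groups lines into (ID, sequence, annotation) tuples and reports the fraction of residues labelled '1'. It assumes three consecutive non-blank lines per protein and annotation strings made up only of '0' and '1', as described above. The names `parse_caid` and `positive_fraction`, and the stripping of a leading `>` from the ID line, are hypothetical helpers introduced here for illustration.

```python
from typing import Iterable, List, Tuple


def parse_caid(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (protein_id, sequence, labels) tuples from three-line records."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("line count is not a multiple of 3")
    out = []
    for i in range(0, len(rows), 3):
        # Assumption: the ID line may carry a FASTA-style '>' prefix.
        pid, seq, labels = rows[i].lstrip(">"), rows[i + 1], rows[i + 2]
        if len(seq) != len(labels):
            raise ValueError(f"{pid}: annotation length mismatch")
        out.append((pid, seq, labels))
    return out


def positive_fraction(labels: str) -> float:
    """Fraction of residues labelled '1' (BR or DFL, depending on the file)."""
    return labels.count("1") / len(labels)


# Made-up example record for illustration; not taken from the CAID files.
example = [">X1", "MSEQ", "0110"]
parsed = parse_caid(example)
print(parsed[0][0], positive_fraction(parsed[0][2]))
```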

Reference

Liang et al., Hybrid deep learning with protein language models and dual-path architecture for predicting IDP functions, Briefings in Bioinformatics, 27: bbag126 (2026). (PDF)