training dataset (552 sequences)
validation dataset (227 sequences)
TE210 dataset (210 sequences)
TE83 dataset (83 sequences)
Dataset results
CAID2 binding dataset (78 sequences)
CAID2 linker dataset (40 sequences)
CAID3 binding dataset (51 sequences)
CAID3 linker dataset (20 sequences)
The CAID2 and CAID3 datasets used in this work were obtained from the official Critical Assessment of Intrinsic Disorder (CAID) challenge website. They correspond to the CAID Round 2 and Round 3 benchmark collections.
- Del Conte A, Mehdiabadi M, Bouhraoua A, Miguel Monzon A, Tosatto SCE, Piovesan D. Critical assessment of protein intrinsic disorder prediction (CAID) – Results of round 2. Proteins. 2023 Dec;91(12):1925–1934. doi: 10.1002/prot.26582.
- Mehdiabadi M, Del Conte A, Nugnes MV, Aspromonte MC, Tosatto SCE, Piovesan D. Critical Assessment of Protein Intrinsic Disorder Round 3 – Predicting Disorder in the Era of Protein Language Models. Proteins. 2026 Jan;94(1):414–424. doi: 10.1002/prot.70045.
For Training, Validation, and Independent Test Datasets
Each of these datasets uses the same structure, with nine lines per protein:
- Line 1: Protein ID - A unique identifier for each protein.
- Line 2: Protein Sequence - Encoded using 1-letter amino acid representation.
- Line 3: Annotations of Intrinsic Disorder Regions (IDR) - '1' indicates an IDR, '0' indicates a non-IDR.
- Line 4: Annotations of Disordered Protein-binding Regions (PB) - '1' indicates a PB, '0' indicates a non-PB.
- Line 5: Annotations of Disordered Nucleic Acid-binding Regions (NB) - '1' indicates an NB, '0' indicates a non-NB.
- Line 6: Annotations of Disordered Lipid-binding Regions (LB) - '1' indicates an LB, '0' indicates a non-LB.
- Line 7: Annotations of Disordered Ion-binding Regions (IB) - '1' indicates an IB, '0' indicates a non-IB.
- Line 8: Annotations of Disordered Small Molecule-binding Regions (SB) - '1' indicates an SB, '0' indicates a non-SB.
- Line 9: Annotations of Disordered Flexible Linkers (DFL) - '1' indicates a DFL, '0' indicates a non-DFL.
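The nine-line layout above can be read with a short parser. The following is a minimal sketch, not the authors' loader. It assumes that records follow one another without separator lines (blank lines are skipped), that every annotation line contains exactly one '0' or '1' per residue with no delimiters, and that a leading '>' on the ID line, if present, can be dropped. The names `parse_records`, `LABEL_NAMES`, and `LINES_PER_RECORD` are hypothetical.

```python
from typing import Dict, Iterable, List

# Annotation tracks in the order given above (lines 3 to 9 of each record).
LABEL_NAMES = ["IDR", "PB", "NB", "LB", "IB", "SB", "DFL"]
LINES_PER_RECORD = 2 + len(LABEL_NAMES)  # ID + sequence + 7 annotation lines


def parse_records(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse nine-line records into dicts with id, sequence, and labels."""
    # Assumption: blank lines carry no meaning and can be ignored.
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % LINES_PER_RECORD != 0:
        raise ValueError("line count is not a multiple of 9")

    records = []
    for start in range(0, len(rows), LINES_PER_RECORD):
        chunk = rows[start:start + LINES_PER_RECORD]
        # Assumption: a FASTA-style '>' prefix may or may not be present.
        protein_id = chunk[0].lstrip(">")
        sequence = chunk[1]
        labels: Dict[str, List[int]] = {}
        for name, track in zip(LABEL_NAMES, chunk[2:]):
            if len(track) != len(sequence):
                raise ValueError(f"{protein_id}: {name} length mismatch")
            if set(track) - {"0", "1"}:
                raise ValueError(f"{protein_id}: unexpected symbol in {name}")
            labels[name] = [int(c) for c in track]
        records.append({"id": protein_id, "sequence": sequence, "labels": labels})
    return records
```

To use it on a downloaded file, pass an open text file handle (which iterates over lines) to `parse_records`. The length check catches truncated records early, which is useful because a single missing line would otherwise shift every subsequent record.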
For CAID Datasets
All four datasets share the same structure:
- Line 1: Protein ID - A unique identifier for each protein.
- Line 2: Protein Sequence - Encoded using 1-letter amino acid representation.
- Line 3: Annotations of Disordered Binding Regions (BR) / Disordered Flexible Linkers (DFL) - '1' indicates a BR/DFL, '0' indicates a non-BR/DFL.
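The three-line CAID layout can be read in the same spirit. This is a minimal sketch under stated assumptions: records are consecutive, blank lines can be skipped, and the annotation line holds only '0' and '1', one symbol per residue. Any other symbol is rejected, so the check would need relaxing if the files turn out to contain additional markers. `CaidRecord` and `iter_caid_records` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class CaidRecord:
    protein_id: str
    sequence: str
    labels: List[int]  # 1 = BR (binding files) or DFL (linker files), 0 = neither


def iter_caid_records(lines: Iterable[str]) -> Iterator[CaidRecord]:
    """Yield one CaidRecord per three-line block (ID, sequence, annotation)."""
    # Assumption: blank lines are not significant.
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("line count is not a multiple of 3")

    for start in range(0, len(rows), 3):
        protein_id, sequence, track = rows[start:start + 3]
        if len(track) != len(sequence):
            raise ValueError(f"{protein_id}: annotation length mismatch")
        if set(track) - {"0", "1"}:
            raise ValueError(f"{protein_id}: unexpected annotation symbol")
        yield CaidRecord(protein_id, sequence, [int(c) for c in track])
```

Whether a '1' means a binding region or a flexible linker depends on which of the four CAID files is being read, so the caller is expected to keep track of that.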
Reference
Liang et al., Hybrid deep learning with protein language models and dual-path architecture for predicting IDP functions, Briefings in Bioinformatics, 27: bbag126 (2026).