training dataset (552 sequences)
validation dataset (227 sequences)
TE210 dataset (210 sequences)
TE83 dataset (83 sequences)
CAID2 binding dataset (78 sequences)
CAID2 linker dataset (40 sequences)
CAID3 binding dataset (51 sequences)
CAID3 linker dataset (20 sequences)
For Training, Validation, and Independent Test Datasets
All three dataset groups (training, validation, and the independent test sets TE210 and TE83) share the same structure:
- Line 1: Protein ID - A unique identifier for each protein.
- Line 2: Protein Sequence - Encoded using 1-letter amino acid representation.
- Line 3: Annotations of Intrinsically Disordered Regions (IDR) - '1' indicates an IDR, '0' indicates a non-IDR.
- Line 4: Annotations of Disordered Protein-binding Regions (PB) - '1' indicates a PB, '0' indicates a non-PB.
- Line 5: Annotations of Disordered Nucleic Acid-binding Regions (NB) - '1' indicates an NB, '0' indicates a non-NB.
- Line 6: Annotations of Disordered Lipid-binding Regions (LB) - '1' indicates an LB, '0' indicates a non-LB.
- Line 7: Annotations of Disordered Ion-binding Regions (IB) - '1' indicates an IB, '0' indicates a non-IB.
- Line 8: Annotations of Disordered Small Molecule-binding Regions (SB) - '1' indicates an SB, '0' indicates a non-SB.
- Line 9: Annotations of Disordered Flexible Linkers (DFL) - '1' indicates a DFL, '0' indicates a non-DFL.
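The nine-line layout above can be read with a short script. The following is a minimal sketch, not the authors' loader: it assumes records are stored back to back in a plain-text file, that blank lines (if any) can be skipped, and that every annotation line has the same length as the sequence. The function name `parse_records` and the dictionary keys are hypothetical, and the demo record is made up.

```python
from typing import Dict, Iterable, List

# Field order follows Lines 1-9 described above; the key names are illustrative.
FIELDS = ["id", "sequence", "IDR", "PB", "NB", "LB", "IB", "SB", "DFL"]


def parse_records(lines: Iterable[str], fields: List[str] = FIELDS) -> List[Dict[str, str]]:
    """Group non-empty lines into records of len(fields) lines each."""
    # Assumption: blank lines carry no meaning and can be dropped.
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % len(fields) != 0:
        raise ValueError("line count is not a multiple of the record size")

    records = []
    for i in range(0, len(rows), len(fields)):
        rec = dict(zip(fields, rows[i:i + len(fields)]))
        # Each annotation string should align residue by residue with the sequence.
        for name in fields[2:]:
            if len(rec[name]) != len(rec["sequence"]):
                raise ValueError(f"{rec['id']}: {name} length differs from sequence")
        records.append(rec)
    return records


# Made-up record used only to show the call; it is not taken from the datasets.
demo_lines = [
    "toy_protein_1",
    "MKTA",
    "1100",  # IDR
    "1000",  # PB
    "0100",  # NB
    "0000",  # LB
    "0000",  # IB
    "0000",  # SB
    "0010",  # DFL
]
demo_records = parse_records(demo_lines)
```

To read a real file, pass the open file object, for example `parse_records(open(path))`, where `path` points to one of the dataset files.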
For CAID Datasets
All four datasets share the same structure:
- Line 1: Protein ID - A unique identifier for each protein.
- Line 2: Protein Sequence - Encoded using 1-letter amino acid representation.
- Line 3: Annotations of Disordered Binding Regions (BR) / Disordered Flexible Linkers (DFL) - '1' indicates a BR/DFL, '0' indicates a non-BR/non-DFL.
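A per-residue label string such as Line 3 is often easier to inspect as a list of regions. The sketch below converts runs of '1' into 1-based, inclusive (start, end) spans. The function name `positive_regions` is hypothetical, and any character other than '1' is treated as a non-region, which is an assumption about how the labels should be read.

```python
from itertools import groupby
from typing import List, Tuple


def positive_regions(labels: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) spans of consecutive '1' labels."""
    regions = []
    pos = 0  # number of residues consumed so far
    for value, run in groupby(labels):
        length = len(list(run))
        if value == "1":
            regions.append((pos + 1, pos + length))
        pos += length
    return regions


print(positive_regions("0011100110"))  # [(3, 5), (8, 9)]
```

The same helper applies to any of the annotation lines (IDR, PB, NB, LB, IB, SB, DFL) in the nine-line format.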
Reference
- Liang et al., Hybrid Deep Learning with Protein Language Models and Dual-Path Architecture for Predicting IDP Functions, submitted, 2025.