The APD server is a web-based platform for predicted distance evaluation.
The APD server accepts predicted distances in CASP format, i.e., the prediction file should list, for each pair of residues i and j, the probabilities of their distance falling within each of the ten predefined distance bins.
The bins are defined as follows:
bin1: d≤4Å,
bin2: 4<d≤6Å,
bin3: 6<d≤8Å, ...,
bin10: d>20Å.
Each line consists of 13 numbers:
i j p0 p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
where
i and j are the residue numbers;
p0 is the probability of residues i and j being in contact, i.e., with distance≤8Å;
p1, ..., p10 are the probabilities of the 10 predefined bins.
Note that p1+p2+p3 should be equal to p0, and p1+..+p10 should be equal to 1, with a tolerance of 0.005.
See an example:
28 116 0.906 0.190 0.597 0.118 0.042 0.018 0.010 0.005 0.004 0.003 0.012
65 101 0.906 0.022 0.670 0.215 0.071 0.018 0.004 0.001 0.000 0.000 0.000
...
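As a rough illustration of the format rules above, the following Python sketch parses one prediction line and checks the two sum constraints. The names `PairPrediction` and `parse_line` are illustrative assumptions and are not part of the APD server.

```python
from typing import List, NamedTuple

TOLERANCE = 0.005  # tolerance stated in the format description


class PairPrediction(NamedTuple):
    i: int              # residue number i
    j: int              # residue number j
    p0: float           # contact probability (distance <= 8 Å)
    bins: List[float]   # p1..p10, one probability per distance bin


def parse_line(line: str) -> PairPrediction:
    """Parse one 'i j p0 p1 ... p10' line and check both sum rules."""
    fields = line.split()
    if len(fields) != 13:
        raise ValueError(f"expected 13 numbers, got {len(fields)}")
    i, j = int(fields[0]), int(fields[1])
    p0 = float(fields[2])
    bins = [float(x) for x in fields[3:]]
    eps = 1e-9  # guards against floating-point noise at the tolerance edge
    if abs(sum(bins[:3]) - p0) > TOLERANCE + eps:
        raise ValueError("p1+p2+p3 differs from p0 by more than the tolerance")
    if abs(sum(bins) - 1.0) > TOLERANCE + eps:
        raise ValueError("p1+...+p10 differs from 1 by more than the tolerance")
    return PairPrediction(i, j, p0, bins)
```

Calling `parse_line` on the first example line above yields a record with `i == 28`, `j == 116` and ten bin probabilities.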
The input protein structure file should be in PDB Format.
A detailed description of the PDB Format can be found here.
Evaluation Metrics
The evaluation can be done in three different flavors: prediction-oriented, native-oriented and full-list, depending on the set of residue pairs being assessed.
To facilitate the description, the following variables are defined:
L and S: the length of a target protein and the set of assessed residue pairs, respectively.
Dij and dij: the native and the predicted real-valued distances between the i-th and the j-th residues, respectively. Note that dij is calculated as a weighted average of the predicted distribution, using the probabilities of the first nine bins (bin1..bin9).
I(.): the indicator function, whose value is 1 when the corresponding event happens, and 0 otherwise.
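One possible reading of the weighted average used for dij is sketched below in Python. Two points are assumptions and are not stated on this page: each bin is represented by its midpoint (with 3.0 Å as a placeholder for bin1, which has no stated lower edge), and the weights are renormalized by the total probability of the first nine bins. The name `expected_distance` is hypothetical.

```python
from typing import Sequence

# ASSUMPTION: representative distance of bin1..bin9 is the bin midpoint.
# bin1 (d <= 4 Å) has no stated lower edge, so 3.0 Å is only a placeholder.
BIN_CENTERS = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0]


def expected_distance(bins: Sequence[float]) -> float:
    """Weighted average distance over the first nine bins (p1..p9)."""
    first_nine = list(bins[:9])
    mass = sum(first_nine)  # this is P(d <= 20 Å)
    if mass <= 0.0:
        raise ValueError("no probability mass within 20 Å")
    # ASSUMPTION: renormalize by the mass of the first nine bins.
    return sum(p * c for p, c in zip(first_nine, BIN_CENTERS)) / mass
```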
In the prediction-oriented case, the contact precision uses the top L residue pairs ranked by p1+p2+p3, whereas all other metrics use the top 15L residue pairs ranked by p1+..+p9.
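The ranking step can be sketched as follows. This is a minimal Python illustration, assuming the predictions are held in a dictionary that maps a residue pair (i, j) to its list of ten bin probabilities; the helper name `top_pairs` is hypothetical, and any filtering the server applies before ranking is not modeled.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]


def top_pairs(pred: Dict[Pair, Sequence[float]], n_bins: int, count: int) -> List[Pair]:
    """Return the `count` pairs with the largest p1+...+p{n_bins}."""
    ranked = sorted(pred, key=lambda pair: sum(pred[pair][:n_bins]), reverse=True)
    return ranked[:count]
```

With a target of length `L`, `top_pairs(pred, 3, L)` would give the set used for contact precision, and `top_pairs(pred, 9, 15 * L)` the set used for the other metrics.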
Distance precision (DP) A residue pair (i, j) is defined as being "correctly predicted" if the difference between Dij and dij is less than a tolerance threshold (2 Å here). Distance precision is defined as the fraction of residue pairs in the set S that are correctly predicted.
Here P(dij≤20) is the cumulative probability of the first nine bins (bin1..bin9), reflecting the confidence of the predicted distance dij.
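The prose definition of distance precision can be sketched in Python as below. This covers only the plain fraction of correctly predicted pairs; how P(dij≤20) enters the server's calculation is not spelled out here, so it is deliberately left out. The function name is hypothetical.

```python
from typing import Sequence


def distance_precision(native: Sequence[float], predicted: Sequence[float],
                       tolerance: float = 2.0) -> float:
    """Fraction of pairs in S with |D_ij - d_ij| < tolerance (in Å)."""
    if len(native) != len(predicted):
        raise ValueError("native and predicted must have the same length")
    if not native:
        raise ValueError("the set S is empty")
    correct = sum(1 for D, d in zip(native, predicted) if abs(D - d) < tolerance)
    return correct / len(native)
```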
Fuzzy certainty (FC) To make effective use of the predicted probabilities, we define the fuzzy certainty of a predicted distance distribution as follows. Similar to the fuzzy analysis, besides the native distance bin, its adjacent bins are also considered, but with a weight of 0.5, to reflect the dynamic nature of protein structures. Here P(.) is the predicted probability of the corresponding distance bin.
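For a single residue pair, the weighting described above might look like the following Python sketch. It assumes that bin10 counts as a neighbor of bin9 and that bins at either end simply have one neighbor; neither point is stated on this page. The per-pair scores would presumably then be averaged over the set S. Function names are hypothetical.

```python
from typing import Sequence

# Upper edges (in Å) of bin1..bin9; anything above 20 Å falls into bin10.
BIN_UPPER_EDGES = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]


def native_bin_index(distance: float) -> int:
    """0-based index of the bin containing `distance` (bin1 -> 0, bin10 -> 9)."""
    for k, edge in enumerate(BIN_UPPER_EDGES):
        if distance <= edge:
            return k
    return len(BIN_UPPER_EDGES)


def fuzzy_certainty(bins: Sequence[float], native_distance: float) -> float:
    """P(native bin) plus 0.5 times the probability of each adjacent bin."""
    k = native_bin_index(native_distance)
    score = bins[k]
    if k > 0:
        score += 0.5 * bins[k - 1]
    if k + 1 < len(bins):  # ASSUMPTION: bin10 is treated as adjacent to bin9
        score += 0.5 * bins[k + 1]
    return score
```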
Macro fuzzy precision/recall/F1 (MFP/MFR/MFF) For each of the first 9 distance bins (distance ≤ 20Å), a fuzzy precision (fPRE) and a fuzzy recall (fREC) are defined.
To define these metrics, the set S is first divided into at most 9 subsets, such that for every residue pair in the subset Sk, the k-th distance bin has the highest predicted probability among the first 9 distance bins. The word fuzzy has a meaning similar to that in fuzzy certainty: a weight of 0.5 is assigned when the predicted class (i.e., distance bin) is not correct but is adjacent to the native class (i.e., native distance bin). Here Nk is the number of residue pairs that belong to the k-th class according to the experimental structure, and lij is the native class label of the residue pair (i, j).
The fuzzy F1 score for each class is the harmonic mean of the corresponding fuzzy precision and fuzzy recall. The macro fuzzy precision/recall/F1 are calculated as the averages over the first nine bins.
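The following Python sketch shows one way the macro fuzzy scores described above could be computed from class labels. It takes, for each residue pair, the predicted class (the bin with the highest probability among the first nine) and the native class, both as 0-based indices. Treating a class with an empty subset or no native pairs as scoring 0 is an assumption, as are the function names.

```python
from typing import Sequence, Tuple

N_CLASSES = 9  # the first nine distance bins (distance <= 20 Å)


def fuzzy_weight(a: int, b: int) -> float:
    """1 for the same class, 0.5 for adjacent classes, 0 otherwise."""
    gap = abs(a - b)
    if gap == 0:
        return 1.0
    return 0.5 if gap == 1 else 0.0


def macro_fuzzy_scores(predicted: Sequence[int],
                       native: Sequence[int]) -> Tuple[float, float, float]:
    """Return (MFP, MFR, MFF) averaged over the first nine classes."""
    pre_sum = rec_sum = f1_sum = 0.0
    for k in range(N_CLASSES):
        # Native labels of the pairs predicted as class k (the subset S_k).
        in_subset = [l for c, l in zip(predicted, native) if c == k]
        # Predicted labels of the pairs whose native class is k (N_k pairs).
        in_class = [c for c, l in zip(predicted, native) if l == k]
        # ASSUMPTION: an empty subset or class contributes a score of 0.
        pre = sum(fuzzy_weight(l, k) for l in in_subset) / len(in_subset) if in_subset else 0.0
        rec = sum(fuzzy_weight(c, k) for c in in_class) / len(in_class) if in_class else 0.0
        f1 = 2 * pre * rec / (pre + rec) if pre + rec > 0 else 0.0
        pre_sum, rec_sum, f1_sum = pre_sum + pre, rec_sum + rec, f1_sum + f1
    return pre_sum / N_CLASSES, rec_sum / N_CLASSES, f1_sum / N_CLASSES
```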
Absolute/Relative error (AE/RE) The absolute error is computed as the absolute difference between the native and the predicted distance averaged over the set S.
The relative error is defined similarly but with a normalization by the native distance.
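Both error measures follow directly from the description above. The Python sketch below uses hypothetical function names and assumes the native and predicted distances are given as two equally long sequences over the set S.

```python
from typing import Sequence


def absolute_error(native: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of |D_ij - d_ij| over the set S."""
    if not native or len(native) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(D - d) for D, d in zip(native, predicted)) / len(native)


def relative_error(native: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of |D_ij - d_ij| / D_ij over the set S."""
    if not native or len(native) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(D - d) / D for D, d in zip(native, predicted)) / len(native)
```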
Pearson's correlation coefficient (PCC) Here DS and dS refer to the vectors containing the native and the predicted distances of the residue pairs in the set S, respectively. Cov(.)/Var(.) stands for the covariance/variance of the corresponding vectors.
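Using the standard definition of Pearson's correlation, Cov(DS, dS) divided by the square root of Var(DS)·Var(dS), the calculation can be sketched in plain Python as follows. The function name `pearson` is hypothetical.

```python
import math
from typing import Sequence


def pearson(native: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson's correlation between native and predicted distances."""
    n = len(native)
    if n < 2 or n != len(predicted):
        raise ValueError("need two equally long vectors with at least 2 entries")
    mean_D = sum(native) / n
    mean_d = sum(predicted) / n
    # The common 1/n factor cancels between numerator and denominator.
    cov = sum((D - mean_D) * (d - mean_d) for D, d in zip(native, predicted))
    var_D = sum((D - mean_D) ** 2 for D in native)
    var_d = sum((d - mean_d) ** 2 for d in predicted)
    if var_D == 0.0 or var_d == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_D * var_d)
```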
Contact precision Contact precision is the number of correctly predicted residue pairs divided by the number of residue pairs being evaluated. Note that here S refers to the set containing the top L residue pairs ranked by p1+p2+p3.
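A minimal Python sketch of contact precision is given below. It assumes that a selected pair counts as correctly predicted when its native distance is at most 8 Å, in line with the contact definition used for p0, and that predictions and native distances are stored in dictionaries keyed by residue pair. The function name is hypothetical.

```python
from typing import Dict, Sequence, Tuple

Pair = Tuple[int, int]


def contact_precision(pred: Dict[Pair, Sequence[float]],
                      native: Dict[Pair, float],
                      length: int,
                      cutoff: float = 8.0) -> float:
    """Precision of the top-`length` pairs ranked by p1+p2+p3."""
    ranked = sorted(pred, key=lambda pair: sum(pred[pair][:3]), reverse=True)
    selected = ranked[:length]
    if not selected:
        raise ValueError("no residue pairs to evaluate")
    # ASSUMPTION: "correct" means the native distance is within the cutoff.
    hits = sum(1 for pair in selected if native[pair] <= cutoff)
    return hits / len(selected)
```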
In the native-oriented case, the assessed residue pairs are those with native distance no more than a specified threshold (20 Å here), regardless of the predicted probabilities.
For the native-oriented assessment, five of the metrics defined above can also be calculated: distance precision, fuzzy certainty, macro fuzzy precision, macro fuzzy recall and macro fuzzy F1.
Distogram LDDT (DLDDT) DLDDT is calculated similarly to the model quality measure LDDT. Here Ri is the set of residues that are within a distance of 20Å of the i-th residue and have a sequence separation of no less than 12.
In the full-list case, all the residue pairs with sequence separation ≥ 12 are used for evaluation.
For the full-list assessment, three of the metrics defined above can also be calculated: macro fuzzy precision, macro fuzzy recall and macro fuzzy F1.
Macro fuzzy certainty (MFC) Macro fuzzy certainty is an extension of the fuzzy certainty metric defined above. First, the fuzzy certainty for each class is calculated as below.
Here Sk is the set of residue pairs in the k-th distance bin according to the native structure, and Pk(i, j) is the predicted probability of the k-th class for the residue pair (i, j).
The MFC is then calculated as the average of the fuzzy certainty over all classes.
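The two-step averaging can be sketched in Python as below. Each input item is a residue pair given as its native class (0-based) and its ten bin probabilities. The sketch assumes the same 0.5 weight for adjacent bins as in fuzzy certainty and averages only over classes that contain at least one native pair; the handling of empty classes is not stated on this page. The function name is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def macro_fuzzy_certainty(pairs: Sequence[Tuple[int, Sequence[float]]]) -> float:
    """Average per-class fuzzy certainty over the non-empty native classes."""
    per_class: Dict[int, List[float]] = {}
    for k, bins in pairs:
        # Fuzzy certainty of one pair: native bin plus half of each neighbor.
        score = bins[k]
        if k > 0:
            score += 0.5 * bins[k - 1]
        if k + 1 < len(bins):
            score += 0.5 * bins[k + 1]
        per_class.setdefault(k, []).append(score)
    if not per_class:
        raise ValueError("no residue pairs to evaluate")
    # First average within each class S_k, then across classes.
    class_means = [sum(v) / len(v) for v in per_class.values()]
    # ASSUMPTION: classes without any native pair are left out of the average.
    return sum(class_means) / len(class_means)
```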