A large and comprehensive benchmark for estimating the accuracy of protein complex structural models

BioinfoMachineLearning/PSBench


PSBench

A comprehensive benchmark for estimation of model accuracy (EMA) of protein complex structural models

I. Four datasets for training and testing EMA methods

PSBench consists of 4 complementary datasets:

    1. CASP15_inhouse_dataset
    2. CASP15_community_dataset
    3. CASP16_inhouse_dataset
    4. CASP16_community_dataset

For each of the four datasets, we provide 10 unique quality scores. The two in-house datasets additionally include a few AlphaFold features:

| Category | Quality scores / features |
| --- | --- |
| Global Quality Scores | tmscore (4 variants), rmsd |
| Local Quality Scores | lddt |
| Interface Quality Scores | ics, ics_precision, ics_recall, ips, qs_global, qs_best, dockq_wave |
| Additional Input Features (CASP15_inhouse_dataset and CASP16_inhouse_dataset) | type, afm_confidence_score, af3_ranking_score, iptm, num_inter_pae, mpDockQ/pDockQ |

For detailed explanations of each quality score and feature, please refer to Quality_Scores_Definitions
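The grouping in the table above can be written down as a small lookup, which is handy when selecting label columns by category. This is only a sketch: the score names are copied from this README, but whether the dataset files use exactly these column names is an assumption. Only three TM-score variant names appear in this README (`tmscore_usalign`, `tmscore_usalign_aligned`, `mmalign_tmscore`), so the fourth variant is left out here. `category_of` and `select_scores` are hypothetical helper names.

```python
# Score names copied from the README table. Using them as CSV column names
# is an assumption; check the dataset files before relying on it.
SCORE_CATEGORIES = {
    "global": ["tmscore_usalign", "tmscore_usalign_aligned", "mmalign_tmscore", "rmsd"],
    "local": ["lddt"],
    "interface": ["ics", "ics_precision", "ics_recall", "ips",
                  "qs_global", "qs_best", "dockq_wave"],
}


def category_of(score_name):
    """Return the category of a score name, or None if it is not listed."""
    for category, names in SCORE_CATEGORIES.items():
        if score_name in names:
            return category
    return None


def select_scores(row, category):
    """Pick the scores of one category from a dict-like row and cast them to float."""
    return {name: float(row[name])
            for name in SCORE_CATEGORIES[category]
            if name in row}
```

A row read with `csv.DictReader` can be passed directly as `row`; columns that are absent from the row are skipped rather than raising an error.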

i. CASP15_inhouse_dataset

CASP15_inhouse_dataset consists of a total of 7,885 models generated by MULTICOM3 during the 2022 CASP15 competition.

ii. CASP15_community_dataset

CASP15_community_dataset consists of a total of 10,942 models generated by all the participating groups during the 2022 CASP15 competition.

iii. CASP16_inhouse_dataset

CASP16_inhouse_dataset consists of a total of 1,009,050 models generated by MULTICOM4 during the 2024 CASP16 competition.

iv. CASP16_community_dataset

CASP16_community_dataset consists of a total of 12,904 models generated by all the participating groups during the 2024 CASP16 competition.

II. Scripts to evaluate EMA methods on a benchmark dataset

Generate various evaluation scores
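This section does not list which evaluation scores are produced. As an illustration only, the sketch below computes two measures that are commonly used to evaluate EMA methods for one target: the Pearson correlation between predicted and true quality scores, and the top-1 ranking loss. It is not the repository's evaluation script. The function names are hypothetical, and the code assumes a quality score where higher is better (so it does not apply to rmsd as-is).

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)


def ranking_loss(predicted, true):
    """True score of the best model minus true score of the model ranked first.

    A loss of 0 means the EMA method picked the best model of the target.
    """
    top = max(range(len(predicted)), key=predicted.__getitem__)
    return max(true) - true[top]
```

Both functions take the scores of all models of a single target; per-target values would then typically be averaged over the targets of a dataset.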

III. Scripts to generate labels for a new benchmark dataset

The following are the prerequisites for generating labels for a new benchmark dataset:

Data:

  • Predicted structures
  • Native structure
  • FASTA file

Tools:

  • OpenStructure
  • USalign

Download the PSBench repository and cd into scripts

    git clone https://github.com/BioinfoMachineLearning/PSBench.git
    cd PSBench
    cd scripts

OpenStructure installation (needs to be run only once)

    docker pull registry.scicore.unibas.ch/schwede/openstructure:latest

Check the Docker installation with

    # should print the latest version of OpenStructure
    docker run -it registry.scicore.unibas.ch/schwede/openstructure:latest --version

Structure alignment and filtration (required for tmscore_usalign_aligned)

Requires 6 arguments:

  • -f : path to the fasta file for the target
  • -pp : path to the predicted pdbs directory for the target
  • -np : path to the native pdb file for the target
  • -o : path to the output directory
  • -tmp : path to the temporary directory
  • -c : path to the clustalw binary (available in tools/clustalw1.83/clustalw)

    python filter_pdb.py -f /path/to/fasta_file -pp /path/to/predicted_pdbs_directory -np /path/to/native_pdb_file -o /path/to/output_directory -tmp /path/to/temporary_directory -c /path/to/clustalw_binary_file
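Before launching the filtering step on many targets, it can help to check that the inputs exist. The sketch below is a hypothetical pre-flight helper and not part of the PSBench scripts. It only verifies that the FASTA file, native pdb, and clustalw binary are files and that the predicted pdbs directory contains at least one file.

```python
from pathlib import Path


def check_filter_inputs(fasta, predicted_dir, native_pdb, clustalw):
    """Return a list of problems found; an empty list means the inputs look usable."""
    problems = []
    for label, path in (("fasta file", fasta),
                        ("native pdb", native_pdb),
                        ("clustalw binary", clustalw)):
        if not Path(path).is_file():
            problems.append(f"{label} not found: {path}")
    pred = Path(predicted_dir)
    if not pred.is_dir():
        problems.append(f"predicted pdbs directory not found: {predicted_dir}")
    elif not any(p.is_file() for p in pred.iterdir()):
        problems.append(f"predicted pdbs directory is empty: {predicted_dir}")
    return problems
```

The output and temporary directories are not checked because this README does not say whether the script creates them.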

Run openstructure (required for ics, ics_precision, ics_recall, ips, qs_global, qs_best, lddt, rmsd, dockq_wave, mmalign_tmscore)

Requires 3 arguments:

  • --indir : path to the folder containing predicted pdbs
  • --nativedir : path to the corresponding native pdb
  • --outdir : path to the output folder

    python run_openstructure.py --indir /path/to/predicted_pdb_folder/ --nativedir /path/to/native_pdb_file --outdir /path/to/output_folder

Run USalign for original predicted structure and original native structure (required for tmscore_usalign)

Requires 4 arguments:

  • --indir : path to the folder containing original predicted pdbs
  • --nativedir : path to the corresponding original native pdb
  • --outdir : path to the output folder
  • --usalign_program : path to the USalign binary (available at tools/USalign)

    python run_usalign.py --indir /path/to/predicted_pdb_folder/ --nativedir /path/to/native_pdb_file --outdir /path/to/output_folder --usalign_program /path/to/USalign_binary

Run USalign for filtered predicted structure and filtered native structure (required for tmscore_usalign_aligned)

Requires 4 arguments:

  • --indir : path to the folder containing filtered predicted pdbs
  • --nativedir : path to the corresponding filtered native pdb
  • --outdir : path to the output folder
  • --usalign_program : path to the USalign binary (available at tools/USalign)

    python run_usalign.py --indir /path/to/filtered_predicted_pdb_folder/ --nativedir /path/to/filtered_native_pdb_file --outdir /path/to/output_folder --usalign_program /path/to/USalign_binary

Create a CSV from the results

Requires 5 arguments:

  • -pp : path to the predicted pdbs directory for the target
  • -os : path to the openstructure results for the target
  • -tm_u : path to the tmscore_usalign results for the target
  • -tm_ua : path to the tmscore_usalign_aligned results for the target
  • -oc : path where the output csv is to be saved

    python create_csv.py -pp /path/to/predicted_pdbs_directory -os /path/to/openstructure_results_directory/ -tm_u /path/to/tmscore_usalign_results_directory -tm_ua /path/to/tmscore_usalign_aligned_results_directory -oc /path/to/output_csv_file
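The five steps above can be chained for one target with a small driver. The sketch below is not part of PSBench. It only assembles the commands from this section in order, using the flags listed for each script. The keys of `paths` are hypothetical names. Because this README does not describe where `filter_pdb.py` writes the filtered structures, the caller has to supply `filtered_predicted` and `filtered_native` explicitly.

```python
import subprocess


def build_label_commands(paths, python="python"):
    """Return the five labeling steps as argv lists, in the order given in this README."""
    return [
        [python, "filter_pdb.py", "-f", paths["fasta"], "-pp", paths["predicted"],
         "-np", paths["native"], "-o", paths["filtered_out"],
         "-tmp", paths["tmp"], "-c", paths["clustalw"]],
        [python, "run_openstructure.py", "--indir", paths["predicted"],
         "--nativedir", paths["native"], "--outdir", paths["ost_out"]],
        # USalign on the original structures (tmscore_usalign)
        [python, "run_usalign.py", "--indir", paths["predicted"],
         "--nativedir", paths["native"], "--outdir", paths["usalign_out"],
         "--usalign_program", paths["usalign"]],
        # USalign on the filtered structures (tmscore_usalign_aligned)
        [python, "run_usalign.py", "--indir", paths["filtered_predicted"],
         "--nativedir", paths["filtered_native"], "--outdir", paths["usalign_aligned_out"],
         "--usalign_program", paths["usalign"]],
        [python, "create_csv.py", "-pp", paths["predicted"], "-os", paths["ost_out"],
         "-tm_u", paths["usalign_out"], "-tm_ua", paths["usalign_aligned_out"],
         "-oc", paths["csv"]],
    ]


def run_all(commands, cwd="."):
    """Run the commands one after another; stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)
```

`run_all(build_label_commands(paths), cwd="PSBench/scripts")` would then execute the steps from the scripts directory. Passing argv lists instead of a shell string avoids quoting problems with paths that contain spaces.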

IV. Baseline EMA methods for comparison with a new EMA method

Reference
