Creating Graphs

Deeprank-GNN automatically generates residue-level graphs of protein-protein interfaces, in which each node corresponds to a single residue and two types of edges are defined:

  • External edges connect 2 residues (nodes) of chain A and chain B if they have at least 1 pairwise atomic distance < 8.5 Å (used to define neighbors)

  • Internal edges connect 2 residues (nodes) within a chain if they have at least 1 pairwise atomic distance < 3 Å (used to cluster nodes)
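The two edge definitions above can be illustrated with a minimal sketch. It is not the Deeprank-GNN implementation: the residue representation (a dict with a chain ID and a list of atom coordinates) and the function names are hypothetical, and only the two cutoffs come from the text.

```python
import math
from itertools import product

# Cutoffs taken from the edge definitions above (in Å).
EXTERNAL_CUTOFF = 8.5
INTERNAL_CUTOFF = 3.0


def min_atomic_distance(atoms_a, atoms_b):
    """Smallest distance between any atom of one residue and any atom of the other."""
    return min(math.dist(a, b) for a, b in product(atoms_a, atoms_b))


def edge_type(res_a, res_b):
    """Return 'external', 'internal', or None for a pair of residues.

    Each residue is a hypothetical dict: {'chain': str, 'atoms': [(x, y, z), ...]}.
    """
    distance = min_atomic_distance(res_a["atoms"], res_b["atoms"])
    if res_a["chain"] != res_b["chain"]:
        return "external" if distance < EXTERNAL_CUTOFF else None
    return "internal" if distance < INTERNAL_CUTOFF else None


# Toy residues with made-up coordinates.
r1 = {"chain": "A", "atoms": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]}
r2 = {"chain": "A", "atoms": [(3.0, 0.0, 0.0)]}   # 1.5 Å from r1's second atom
r3 = {"chain": "B", "atoms": [(9.0, 0.0, 0.0)]}   # 7.5 Å from r1's second atom
r4 = {"chain": "A", "atoms": [(10.0, 0.0, 0.0)]}  # 8.5 Å from r1's second atom

print(edge_type(r1, r2))  # internal
print(edge_type(r1, r3))  # external
print(edge_type(r1, r4))  # None
```

Both criteria use the minimum over all atom pairs, so a single close contact is enough to create an edge.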

Warning

Graph generation requires a set of PDB files, each containing two chains: chain A and chain B.

You can provide PSSM matrices to compute evolutionary conservation node features. Some pre-calculated PSSM matrices can be downloaded from http://3dcons.cnb.csic.es/. A converter, 3dcons_to_deeprank_pssm.py, can be found in the tool folder to convert the 3dcons PSSM format into the Deeprank-GNN PSSM format. Make sure the sequence numbering in the PSSM matches the residue numbering in the PDB files.
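The numbering requirement can be checked before generating graphs. The sketch below is only an illustration: it reads residue numbers and names from the standard fixed-column PDB ATOM records and compares them with PSSM rows that are assumed to be already parsed into (number, name) pairs, since the Deeprank-GNN PSSM layout is not described here. All function names are hypothetical.

```python
def pdb_residues(pdb_lines, chain_id):
    """Return the ordered (residue number, residue name) pairs of one chain.

    Relies on the fixed-column PDB layout: resName in columns 18-20,
    chainID in column 22, resSeq in columns 23-26.
    """
    residues = []
    for line in pdb_lines:
        if not line.startswith("ATOM") or line[21] != chain_id:
            continue
        key = (int(line[22:26]), line[17:20].strip())
        if not residues or residues[-1] != key:
            residues.append(key)
    return residues


def numbering_mismatches(pdb_res, pssm_res):
    """List the PDB residues that have no matching row in the PSSM."""
    pssm_set = set(pssm_res)
    return [res for res in pdb_res if res not in pssm_set]


def atom_line(serial, name, resname, chain, resseq):
    """Build a truncated ATOM record for the toy example (coordinates omitted)."""
    return f"ATOM  {serial:>5d} {name:<4s} {resname:>3s} {chain}{resseq:>4d}"


# Toy data: chain A has MET 1 and LYS 2, but the PSSM rows are numbered 1 and 3.
lines = [
    atom_line(1, "N", "MET", "A", 1),
    atom_line(2, "CA", "MET", "A", 1),
    atom_line(3, "N", "LYS", "A", 2),
    atom_line(4, "N", "GLY", "B", 1),
]
pssm_rows = [(1, "MET"), (3, "LYS")]

print(numbering_mismatches(pdb_residues(lines, "A"), pssm_rows))  # [(2, 'LYS')]
```

An empty list means every residue of the chain has a PSSM row with the same number and name.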

By default, the following features are assigned to each node of the graph:

  • pos: xyz coordinates

  • chain: chain ID

  • charge: residue charge

  • polarity: apolar/polar/neg_charged/pos_charged (one hot encoded)

  • bsa: buried surface area

  • type: residue type (one hot encoded)
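For readers unfamiliar with the term, one hot encoding turns a categorical value into a vector with a single 1. The sketch below uses the four polarity classes listed above; the order of the classes in the vector is an assumption made for illustration and may differ from the order used by Deeprank-GNN.

```python
# Class order is an assumption for this sketch, not taken from Deeprank-GNN.
POLARITY_CLASSES = ["apolar", "polar", "neg_charged", "pos_charged"]


def one_hot(value, classes):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in classes:
        raise ValueError(f"unknown class: {value!r}")
    return [1 if cls == value else 0 for cls in classes]


print(one_hot("polar", POLARITY_CLASSES))        # [0, 1, 0, 0]
print(one_hot("pos_charged", POLARITY_CLASSES))  # [0, 0, 0, 1]
```

The residue type feature follows the same idea, with one position per residue type.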

The following features are computed if PSSM data is provided:

  • pssm: PSSM scores of the residue position, one per amino-acid type

  • cons: PSSM score of the residue actually present at that position

  • ic: information content of the PSSM (~Shannon entropy)
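To give an intuition for the relation between information content and Shannon entropy, the sketch below uses one common definition: the information content of a position is the maximum possible entropy minus the observed entropy of its amino-acid distribution. This is an assumption for illustration only; the ic values read by Deeprank-GNN come from the PSSM files and may be computed differently (for example, against a non-uniform background distribution).

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def information_content(probabilities):
    """log2(N) minus the entropy: 0 for a uniform position, log2(N) if fully conserved."""
    return math.log2(len(probabilities)) - shannon_entropy(probabilities)


uniform = [1 / 20] * 20          # every amino acid equally likely
conserved = [1.0] + [0.0] * 19   # a single amino acid always observed

print(round(information_content(uniform), 3))    # 0.0
print(round(information_content(conserved), 3))  # 4.322
```

Under this definition, highly conserved positions carry high information content and variable positions carry little.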

The following features are optional and are computed only when the biopython option is enabled (see the examples in the next section):

  • depth: average atom depth of the atoms in a residue (distance to the surface)

  • hse: half sphere exposure

Generate your graphs

The PSSM information is optional; it is only used to compute the pssm, cons and ic node features.

In this example, all features are computed:

>>> from deeprank_gnn.GraphGenMP import GraphHDF5
>>>
>>> pdb_path = './data/pdb/1ATN/'
>>> pssm_path = './data/pssm/1ATN/'
>>>
>>> GraphHDF5(pdb_path=pdb_path, pssm_path=pssm_path, biopython=True,
...           graph_type='residue', outfile='1ATN_residue.hdf5', nproc=4)

In this example, the biopython features (hse and depth) are ignored:

>>> from deeprank_gnn.GraphGenMP import GraphHDF5
>>>
>>> pdb_path = './data/pdb/1ATN/' # path to the docking model in PDB format
>>> pssm_path = './data/pssm/1ATN/' # path to the pssm files
>>>
>>> GraphHDF5(pdb_path=pdb_path, pssm_path=pssm_path,
...           graph_type='residue', outfile='1ATN_residue.hdf5', nproc=4)

In this example, both the biopython features (hse and depth) and the PSSM information are ignored:

>>> from deeprank_gnn.GraphGenMP import GraphHDF5
>>>
>>> pdb_path = './data/pdb/1ATN/'
>>>
>>> GraphHDF5(pdb_path=pdb_path,
...           graph_type='residue', outfile='1ATN_residue.hdf5', nproc=4)

Add your target values

Use the add_target function of the CustomizeGraph module to add target values to the graphs.

If you are benchmarking docking models, skip to the Docking benchmark mode section, where target values are computed automatically.

>>> from deeprank_gnn.GraphGenMP import GraphHDF5
>>> from deeprank_gnn.tools.CustomizeGraph import add_target
>>>
>>> pdb_path = './data/pdb/1ATN/'
>>> pssm_path = './data/pssm/1ATN/'
>>>
>>> GraphHDF5(pdb_path=pdb_path, pssm_path=pssm_path,
...           graph_type='residue', outfile='1ATN_residue.hdf5', nproc=4)
>>>
>>> add_target(graph_path='.', target_name='new_target',
...            target_list='list_of_target_values.txt')

Note

The list of target values should respect the following format:

model_name_1 0
model_name_2 1
model_name_3 0
model_name_4 0

If you use another separator (e.g. a comma, a semicolon, or a tab), pass it with the sep argument:

>>> add_target(graph_path='.', target_name='new_target',
...            target_list='list_of_target_values.txt', sep=',')
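The target list is a plain text file with one model name and one value per line. The sketch below writes such a file and reads it back with a configurable separator. It only illustrates the file format described above: the helper names are hypothetical and this is not how add_target parses the file internally.

```python
import os
import tempfile


def write_target_list(path, targets, sep=" "):
    """Write one 'model_name<sep>value' line per model."""
    with open(path, "w") as handle:
        for model_name, value in targets.items():
            handle.write(f"{model_name}{sep}{value}\n")


def read_target_list(path, sep=" "):
    """Read the file back into a {model_name: value} dictionary."""
    targets = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            model_name, value = line.split(sep)
            targets[model_name] = float(value)
    return targets


targets = {"model_name_1": 0, "model_name_2": 1, "model_name_3": 0}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "list_of_target_values.txt")
    write_target_list(path, targets, sep=",")
    loaded = read_target_list(path, sep=",")

print(loaded)  # {'model_name_1': 0.0, 'model_name_2': 1.0, 'model_name_3': 0.0}
```

The model names in the list presumably need to match the names of the graphs stored in the HDF5 file.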

Docking benchmark mode

In docking benchmark mode, you can provide the path to the reference structures in the graph generation step. Given the reference structure, the following target values are automatically computed, based on the CAPRI quality criteria [1], and assigned to the graphs:

  • irmsd: interface RMSD (RMSD between the superimposed interface residues)

  • lrmsd: ligand RMSD (RMSD between chains B given that chains A are superimposed)

  • fnat: fraction of native contacts

  • dockQ: see Basu and Wallner, “DockQ: A Quality Measure for Protein-Protein Docking Models”, PLOS ONE, 2016

  • bin_class: binary classification (0: irmsd >= 4 Å, 1: irmsd < 4 Å)

  • capri_classes: 1: RMSD < 1 Å, 2: RMSD < 2 Å, 3: RMSD < 4 Å, 4: RMSD < 6 Å, 0: RMSD >= 6 Å

>>> from deeprank_gnn.GraphGenMP import GraphHDF5
>>>
>>> pdb_path = './data/pdb/1ATN/'
>>> pssm_path = './data/pssm/1ATN/'
>>> ref = './data/ref/1ATN/'
>>>
>>> GraphHDF5(pdb_path=pdb_path, ref_path=ref, pssm_path=pssm_path,
...           graph_type='residue', outfile='1ATN_residue.hdf5', nproc=4)
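The thresholds listed above can be expressed in a few lines of code. The sketch below is not the Deeprank-GNN implementation: the function names are hypothetical, bin_class is assumed to be based on irmsd, and the RMSD passed to capri_class is whichever RMSD the class is defined on. The dockq function follows the formula published in the DockQ paper cited above.

```python
def bin_class(irmsd):
    """1 if the interface RMSD is below 4 Å, else 0."""
    return 1 if irmsd < 4.0 else 0


def capri_class(rmsd):
    """Map an RMSD (in Å) to the class labels listed above."""
    for threshold, label in [(1.0, 1), (2.0, 2), (4.0, 3), (6.0, 4)]:
        if rmsd < threshold:
            return label
    return 0


def dockq(fnat, irmsd, lrmsd):
    """DockQ score: mean of fnat and the two scaled RMSD terms (d = 1.5 Å and 8.5 Å)."""
    def scaled(rmsd, d):
        return 1.0 / (1.0 + (rmsd / d) ** 2)

    return (fnat + scaled(irmsd, 1.5) + scaled(lrmsd, 8.5)) / 3.0


print(bin_class(3.2), capri_class(3.2))  # 1 3
print(capri_class(7.0))                  # 0
print(dockq(1.0, 0.0, 0.0))              # 1.0
```

A perfect model (fnat = 1, both RMSDs equal to 0) reaches the maximum DockQ score of 1, and the score decreases toward 0 as the model departs from the reference.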

Note

The different input files must respect the following nomenclature:

  • PDB files: 1ATN_xxx.pdb (xxx may be replaced by anything)

  • PSSM files: 1ATN.A.pdb.pssm and 1ATN.B.pdb.pssm, or 1ATN.A.pssm and 1ATN.B.pssm

  • Reference PDB files: 1ATN.pdb
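File names can be checked against this nomenclature before launching the graph generation. The regular expressions below are one possible reading of the three patterns listed above (for instance, they assume that the complex identifier contains no underscore or dot); they are written for this example and are not taken from Deeprank-GNN.

```python
import re

# Hypothetical patterns derived from the nomenclature above.
PATTERNS = (
    ("model", re.compile(r"^(?P<id>[^_.]+)_.*\.pdb$")),
    ("pssm", re.compile(r"^(?P<id>[^_.]+)\.(?P<chain>[AB])(\.pdb)?\.pssm$")),
    ("reference", re.compile(r"^(?P<id>[^_.]+)\.pdb$")),
)


def classify(filename):
    """Return (kind, complex_id) for a recognised file name, else None."""
    for kind, pattern in PATTERNS:
        match = pattern.match(filename)
        if match:
            return kind, match.group("id")
    return None


for name in ["1ATN_1w.pdb", "1ATN.A.pdb.pssm", "1ATN.B.pssm", "1ATN.pdb", "notes.txt"]:
    print(name, classify(name))
```

Grouping the files by the returned complex identifier makes it easy to spot a model without a matching PSSM or reference file.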

[1] Lensink MF, Méndez R, Wodak SJ. Docking and scoring protein complexes: CAPRI 3rd Edition. Proteins. 2007.