Training a model

1. Select node and edge features

1.1. Edge feature:

  • dist: distance between nodes

>>> edge_feature=['dist']

Note

External edges connect two residues of chains A and B if they have at least one pairwise atomic distance < 8.5 Å (used to define neighbors).

Internal edges connect two residues within a chain if they have at least one pairwise atomic distance < 3 Å (used to cluster nodes).
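
As an illustration, the cutoff rule can be sketched with NumPy and SciPy (illustrative only, not DeepRank-GNN's internal code; the coordinates are random placeholders):

>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>>
>>> # atoms_a, atoms_b: hypothetical (N, 3) atomic coordinates of two residues
>>> atoms_a = np.random.rand(8, 3) * 10
>>> atoms_b = np.random.rand(8, 3) * 10
>>>
>>> min_dist = cdist(atoms_a, atoms_b).min()   # smallest pairwise atomic distance
>>> external_edge = min_dist < 8.5             # residues on chains A and B (neighbors)
>>> internal_edge = min_dist < 3.0             # residues within the same chain (clustering)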

1.2. Node features:

  • pos: xyz coordinates

  • chain: chain ID

  • charge: residue charge

  • polarity: apolar/polar/neg_charged/pos_charged (one hot encoded)

  • bsa: buried surface area

  • pssm: PSSM score for each residue

  • cons: PSSM conservation score of the residue

  • ic: information content of the PSSM (~Shannon entropy)

  • type: residue type (one hot encoded)

  • depth (optional, computed at the graph generation step): average depth of the atoms in a residue (distance to the surface)

  • hse (optional, computed at the graph generation step): half-sphere exposure

>>> node_feature=['type', 'polarity', 'bsa',
>>>               'ic', 'pssm']

2. Select the target (benchmarking mode)

When using DeepRank-GNN in benchmarking mode, you must specify your target (often referred to as Y). The target values are pre-calculated during the graph generation step if a reference structure is provided.

Pre-calculated targets:

  • irmsd: interface RMSD (RMSD between the superimposed interface residues)

  • lrmsd: ligand RMSD (RMSD over chains B after superposition of chains A)

  • fnat: fraction of native contacts

  • dockQ: see Basu and Wallner, “DockQ: A Quality Measure for Protein-Protein Docking Models”, PLOS ONE, 2016

  • bin_class: binary classification (0: irmsd >= 4 Å, 1: irmsd < 4 Å)

  • capri_classes: 1: RMSD < 1 Å, 2: RMSD < 2 Å, 3: RMSD < 4 Å, 4: RMSD < 6 Å, 0: RMSD >= 6 Å
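
As an illustration, the bin_class and capri_classes mappings above can be written as a short NumPy sketch (illustrative values only, not DeepRank-GNN's internal code):

>>> import numpy as np
>>>
>>> irmsd = np.array([0.5, 1.5, 3.0, 5.0, 9.0])   # hypothetical values
>>>
>>> bin_class = (irmsd < 4.0).astype(int)          # 1: irmsd < 4 Å, 0: irmsd >= 4 Å
>>>
>>> capri = np.digitize(irmsd, bins=[1.0, 2.0, 4.0, 6.0]) + 1
>>> capri[capri == 5] = 0                          # 0: RMSD >= 6 Å
>>> print(capri)                                   # [1 2 3 4 0]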

Note

In classification mode (i.e. task='class'), you must provide the list of target classes to the NeuralNet (e.g. classes=[1,2,3,4])

>>> target='irmsd'
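
For classification, the target would instead be a class-based target, with the class list handed to NeuralNet in step 4, as the note above states (a sketch; not used in the rest of this example):

>>> target='capri_classes'
>>> classes=[1, 2, 3, 4]   # passed to NeuralNet together with task='class'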

3. Select hyperparameters

  • Regression ('reg') or classification ('class') mode

>>> task='reg'
  • Batch size

>>> batch_size=64
  • Shuffle the training dataset

>>> shuffle=True
  • Learning rate:

>>> lr=0.001

4. Load the network

This step requires pre-calculated graphs in HDF5 format.
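
If the graphs have not been generated yet, the graph generation step looks like the following (the directory paths are placeholders; the same call is used in the pretrained-model example at the end of this page):

>>> from deeprank_gnn.GraphGenMP import GraphHDF5
>>>
>>> # build residue-level graphs from PDB and PSSM inputs
>>> GraphHDF5(pdb_path='./pdb/', pssm_path='./pssm/',
>>>           graph_type='residue', outfile='1ATN_residue.hdf5', nproc=4)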

The user may:

  • option 1: input a single dataset and choose to automatically split it into a training set and an evaluation set

  • option 2: input distinct training/evaluation/test sets

4.1. Option 1

>>> from deeprank_gnn.NeuralNet import NeuralNet
>>> from deeprank_gnn.ginet import GINet
>>>
>>> database = './1ATN_residue.hdf5'
>>>
>>> model = NeuralNet(database, GINet,
>>>                node_feature=node_feature,
>>>                edge_feature=edge_feature,
>>>                target=target,
>>>                task=task,
>>>                lr=lr,
>>>                batch_size=batch_size,
>>>                shuffle=shuffle,
>>>                percent=[0.8, 0.2])
>>>

Note

The percent argument is required to split the input dataset into a training set and an evaluation set. With percent=[0.8, 0.2], 80% of the input dataset will constitute the training set and 20% the evaluation set.

4.2. Option 2

>>> from deeprank_gnn.NeuralNet import NeuralNet
>>> from deeprank_gnn.ginet import GINet
>>> import glob
>>>
>>> # load train dataset
>>> database_train = glob.glob('./hdf5/train*.hdf5')
>>> # load validation dataset
>>> database_eval = glob.glob('./hdf5/eval*.hdf5')
>>> # load test dataset
>>> database_test = glob.glob('./hdf5/test*.hdf5')
>>>
>>> model = NeuralNet(database_train, GINet,
>>>                node_feature=node_feature,
>>>                edge_feature=edge_feature,
>>>                target=target,
>>>                task=task,
>>>                lr=lr,
>>>                batch_size=batch_size,
>>>                shuffle=shuffle,
>>>                database_eval=database_eval)

5. Train the model

  • example 1:

train the network for 50 epochs without validation

>>> model.train(nepoch=50, validate=False)
  • example 2:

train the model, evaluate the model at each epoch, save the best model (i.e. the model with the lowest loss), and write all predictions to output.hdf5

>>> model.train(nepoch=50, validate=True, save_model='best', hdf5='output.hdf5')

Warning

The last model is saved by default.

When setting save_model='best', a model is saved only if its loss is lower than that of all models generated at previous epochs. By default, the epoch number is included in the output file name to avoid overwriting intermediate models.

6. Analysis

6.1. Plot the loss evolution over the epochs

>>> model.plot_loss(name='plot_loss')

6.2. Analyse the performance in benchmarking conditions

The following analyses only apply if a reference structure was provided during the graph generation step.

6.2.1. Plot accuracy evolution

>>> model.plot_acc(name='plot_accuracy')

6.2.2. Plot hitrate

A threshold value is required to binarise the target values.

>>> model.plot_hit_rate(data='eval', threshold=4.0, mode='percentage', name='hitrate_eval')
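
The hit rate at rank k is the fraction of hits (models whose target value falls below the threshold) recovered among the k best-ranked models. A minimal NumPy sketch of the idea (illustrative values, not DeepRank-GNN's internal code):

>>> import numpy as np
>>>
>>> pred = np.array([0.8, 1.2, 3.5, 0.5, 6.0])    # hypothetical predictions
>>> irmsd = np.array([1.0, 5.0, 3.0, 2.0, 8.0])   # hypothetical targets
>>>
>>> order = np.argsort(pred)                  # rank models, best first
>>> hits = irmsd[order] < 4.0                 # binarise with the threshold
>>> hit_rate = np.cumsum(hits) / hits.sum()   # cumulative fraction of hits found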

6.2.3. Get various metrics

The following metrics can be easily computed:

Classification metrics:

  • sensitivity: Sensitivity, hit rate, recall, or true positive rate

  • specificity: Specificity or true negative rate

  • precision: Precision or positive predictive value

  • NPV: Negative predictive value

  • FPR: Fall out or false positive rate

  • FNR: False negative rate

  • FDR: False discovery rate

  • accuracy: Accuracy

  • auc(): AUC

  • hitrate(): Hit rate

Regression metrics:

  • explained_variance: Explained variance regression score function

  • max_error: Max_error metric calculates the maximum residual error

  • mean_absolute_error: Mean absolute error regression loss

  • mean_squared_error: Mean squared error regression loss

  • root_mean_squared_error: Root mean squared error regression loss

  • mean_squared_log_error: Mean squared logarithmic error regression loss

  • median_squared_log_error: Median absolute error regression loss

  • r2_score: R^2 (coefficient of determination) regression score function

Note

All classification metrics can be calculated on continuous targets as long as a threshold is provided to binarise the data.

>>> train_metrics = model.get_metrics('train', threshold = 4.0)
>>> print('training set - accuracy:', train_metrics.accuracy)
>>> print('training set - sensitivity:', train_metrics.sensitivity)
>>>
>>> eval_metrics = model.get_metrics('eval', threshold = 4.0)
>>> print('evaluation set - accuracy:', eval_metrics.accuracy)
>>> print('evaluation set - sensitivity:', eval_metrics.sensitivity)

7. Save the model/network

>>> model.save_model("model_backup.pth.tar")

8. Test the model on an external dataset

8.1. On a loaded model

>>> model.test(database_test, threshold=4.0)

8.2. On a pre-trained model

>>> from deeprank_gnn.NeuralNet import NeuralNet
>>> from deeprank_gnn.ginet import GINet
>>>
>>> database_test = './1ATN_residue.hdf5'
>>>
>>> model = NeuralNet(database_test, GINet, pretrained_model = "model_backup.pth.tar")
>>> model.test(database_test)
>>>
>>> test_metrics = model.get_metrics('test', threshold = 4.0)
>>> print(test_metrics.accuracy)

In short

>>> from deeprank_gnn.NeuralNet import NeuralNet
>>> from deeprank_gnn.ginet import GINet
>>>
>>> database = './1ATN_residue.hdf5'
>>>
>>> edge_feature=['dist']
>>> node_feature=['type', 'polarity', 'bsa',
>>>               'depth', 'hse', 'ic', 'pssm']
>>> target='irmsd'
>>> task='reg'
>>> batch_size=64
>>> shuffle=True
>>> lr=0.001
>>>
>>> model = NeuralNet(database, GINet,
>>>                node_feature=node_feature,
>>>                edge_feature=edge_feature,
>>>                target=target,
>>>                index=None,
>>>                task=task,
>>>                lr=lr,
>>>                batch_size=batch_size,
>>>                shuffle=shuffle,
>>>                percent=[0.8, 0.2])
>>>
>>> model.train(nepoch=50, validate=True, save_model='best', hdf5='output.hdf5')
>>> model.plot_loss(name='plot_loss')
>>>
>>> train_metrics = model.get_metrics('train', threshold = 4.0)
>>> print('training set - accuracy:', train_metrics.accuracy)
>>> print('training set - sensitivity:', train_metrics.sensitivity)
>>>
>>> eval_metrics = model.get_metrics('eval', threshold = 4.0)
>>> print('evaluation set - accuracy:', eval_metrics.accuracy)
>>> print('evaluation set - sensitivity:', eval_metrics.sensitivity)
>>>
>>> model.save_model("model_backup.pth.tar")
>>> #model.test(database_test, threshold=4.0)

Using default settings

>>> from deeprank_gnn.NeuralNet import NeuralNet
>>> from deeprank_gnn.ginet import GINet
>>> import glob
>>>
>>> database = glob.glob('./hdf5/*_train.hdf5')
>>> database_test = glob.glob('./hdf5/*_test.hdf5')
>>>
>>> target='irmsd'
>>>
>>> model = NeuralNet(database, GINet,
>>>                target=target,
>>>                percent=[0.8, 0.2])
>>>
>>> model.train(nepoch=50, validate=True, save_model='best', hdf5='output.hdf5')
>>> model.plot_loss(name='plot_loss')
>>>
>>> train_metrics = model.get_metrics('train', threshold = 4.0)
>>> print('training set - accuracy:', train_metrics.accuracy)
>>> print('training set - sensitivity:', train_metrics.sensitivity)
>>>
>>> eval_metrics = model.get_metrics('eval', threshold = 4.0)
>>> print('evaluation set - accuracy:', eval_metrics.accuracy)
>>> print('evaluation set - sensitivity:', eval_metrics.sensitivity)
>>>
>>> model.save_model("model_backup.pth.tar")
>>> model.test(database_test, threshold=4.0)

Use DeepRank-GNN paper’s pretrained model

See: M. Réau, N. Renaud, L. C. Xue, A. M. J. J. Bonvin, “DeepRank-GNN: A Graph Neural Network Framework to Learn Patterns in Protein-Protein Interfaces”, bioRxiv 2021.12.08.471762; doi: https://doi.org/10.1101/2021.12.08.471762

You can get the pre-trained model from the DeepRank-GNN GitHub repository.

>>> import glob
>>> import time
>>>
>>> from deeprank_gnn.GraphGenMP import GraphHDF5
>>> from deeprank_gnn.NeuralNet import NeuralNet
>>> from deeprank_gnn.ginet import GINet
>>>
>>> ### Graph generation section
>>> pdb_path = '../tests/data/pdb/1ATN/'
>>> pssm_path = '../tests/data/pssm/1ATN/'
>>>
>>> GraphHDF5(pdb_path=pdb_path, pssm_path=pssm_path,
>>>         graph_type='residue', outfile='1ATN_residue.hdf5', nproc=4)
>>>
>>> ### Prediction section
>>> gnn = GINet
>>> pretrained_model = 'fold6_treg_yfnat_b128_e20_lr0.001_4.pt'
>>> database_test = glob.glob('1ATN_residue.hdf5')
>>>
>>> start_time = time.time()
>>> model = NeuralNet(database_test, gnn, pretrained_model = pretrained_model)
>>> model.test(threshold=None)
>>> end_time = time.time()
>>> print(f'Elapsed time: {end_time - start_time:.2f} s')
>>>
>>> ### The output is automatically stored in test_data.hdf5

Note

For storage convenience, all predictions are stored in an HDF5 file. A converter from HDF5 to CSV is provided in the tools directory.
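
To inspect the predictions directly, a generic walk over the HDF5 file with h5py also works (a minimal sketch; print the keys first, since the exact group layout depends on your run):

>>> import h5py
>>>
>>> with h5py.File('test_data.hdf5', 'r') as f5:
>>>     f5.visit(print)                  # print every group/dataset path
>>>     # once a path is known, load it as an array, e.g.:
>>>     # values = f5['some/dataset/path'][()]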