Protein-Protein Docking
Ref
- Protein-Protein Docking Using Bioinformatics Tools (PPDock) Module
- Protein Docking Analysis | DNASTAR
- High-resolution global peptide-protein docking using fragments-based PIPER-FlexPepDock
- Peptide-protein interactions contribute a significant fraction of the protein-protein interactome. Accurate modeling of these interactions is challenging due to the vast conformational space associated with interactions of highly flexible peptides with large receptor surfaces. To address this challenge we developed a fragment based high-resolution peptide-protein docking protocol. By streamlining the Rosetta fragment picker for accurate peptide fragment ensemble generation, the PIPER docking algorithm for exhaustive fragment-receptor rigid-body docking and Rosetta FlexPepDock for flexible full-atom refinement of PIPER docked models, we successfully addressed the challenge of accurate and efficient global peptide-protein docking at high-resolution with remarkable accuracy, as validated on a small but representative set of peptide-protein complex structures well resolved by X-ray crystallography. Our approach opens up the way to high-resolution modeling of many more peptide-protein interactions and to the detailed study of peptide-protein association in general. PIPER-FlexPepDock is freely available to the academic community as a server at http://piperfpd.furmanlab.cs.huji.ac.il.
- Protein-Protein Docking
- Docking Protein - an overview | ScienceDirect Topics
- Docking (molecular) - Wikipedia
- Protein–ligand docking - Wikipedia
- The HDOCK server for integrated protein–protein docking | Nature Protocols
- HDOCK Server
- The HDOCK server (http://hdock.phys.hust.edu.cn/) is a highly integrated suite of homology search, template-based modeling, structure prediction, macromolecular docking, biological information incorporation and job management for robust and fast protein–protein docking. With input information for receptor and ligand molecules (either amino acid sequences or Protein Data Bank structures), the server automatically predicts their interaction through a hybrid algorithm of template-based and template-free docking. The HDOCK server distinguishes itself from similar docking servers in its ability to support amino acid sequences as input and a hybrid docking strategy in which experimental information about the protein–protein binding site and small-angle X-ray scattering can be incorporated during the docking and post-docking processes. Moreover, HDOCK also supports protein–RNA/DNA docking with an intrinsic scoring function. The server delivers both template- and docking-based binding models of two molecules and allows for download and interactive visualization. The HDOCK server is user friendly and has processed >30,000 docking jobs since its official release in 2017. The server can normally complete a docking job within 30 min.
- The ClusPro web server for protein–protein docking | Nature Protocols
- ClusPro 2.0: protein-protein docking
- The ClusPro server (https://cluspro.org) is a widely used tool for protein–protein docking. The server provides a simple home page for basic use, requiring only two files in Protein Data Bank (PDB) format. However, ClusPro also offers a number of advanced options to modify the search; these include the removal of unstructured protein regions, application of attraction or repulsion, accounting for pairwise distance restraints, construction of homo-multimers, consideration of small-angle X-ray scattering (SAXS) data, and location of heparin-binding sites. Six different energy functions can be used, depending on the type of protein. Docking with each energy parameter set results in ten models defined by centers of highly populated clusters of low-energy docked structures. This protocol describes the use of the various options, the construction of auxiliary restraints files, the selection of the energy parameters, and the analysis of the results. Although the server is heavily used, runs are generally completed in <4 h.
- Protein-Protein Docking: From Interaction to Interactome
- Protein-Protein Docking in Drug Design and Discovery - PubMed
- Protein–Protein Docking in Drug Design and Discovery | SpringerLink
- Protein–protein interactions (PPIs) are responsible for a number of key physiological processes in the living cells and underlie the pathomechanism of many diseases. Nowadays, along with the concept of so-called “hot spots” in protein–protein interactions, which are well-defined interface regions responsible for most of the binding energy, these interfaces can be targeted with modulators. In order to apply structure-based design techniques to design PPIs modulators, a three-dimensional structure of protein complex has to be available. In this context in silico approaches, in particular protein–protein docking, are a valuable complement to experimental methods for elucidating 3D structure of protein complexes. Protein–protein docking is easy to use and does not require significant computer resources and time (in contrast to molecular dynamics) and it results in 3D structure of a protein complex (in contrast to sequence-based methods of predicting binding interfaces). However, protein–protein docking cannot address all the aspects of protein dynamics, in particular the global conformational changes during protein complex formation. In spite of this fact, protein–protein docking is widely used to model complexes of water-soluble proteins and less commonly to predict structures of transmembrane protein assemblies, including dimers and oligomers of G protein-coupled receptors (GPCRs). In this chapter we review the principles of protein–protein docking, available algorithms and software and discuss the recent examples, benefits, and drawbacks of protein–protein docking application to water-soluble proteins, membrane anchoring and transmembrane proteins, including GPCRs.
Protein-Protein Docking Using Bioinformatics Tools (PPDock) Module
I. Introduction
Because cancerous and normal cells differ in only a few biochemical respects, many anticancer drugs also affect normal, rapidly growing cells in the intestine and bone marrow, and are therefore toxic. The ability to determine drug-target binding affinities, and so achieve highly selective drug action on cancer cells, would be very useful for designing anticancer therapeutics. Since overexpression of Janus kinase 3 (JAK-3) has been implicated in cancerous disorders such as adult T-cell lymphoma/leukemia (ATLL), JAK-3 inhibition is expected to play a vital role in the treatment of cancer. The objective of this project is to use docking studies to identify potential JAK-3 inhibitors from a number of putative substrates, namely anaplastic lymphoma kinase (ALK), gene transcription factor II-I (TFII-I), etc. We start with ALK as a potential inhibitor, following the steps below.
II. Step 1: Gathering Protein Data Bank (PDB) Files
The Protein Data Bank (PDB) is a collection of experimentally determined three-dimensional structures of macromolecules, widely used by researchers and students. Each entry includes the atomic coordinates, crystallographic structure factors and NMR experimental data. It also includes the names of the molecules, primary and secondary structure information, ligand and biological assembly information, details about data collection, and bibliographic citations.
PDB files can be found at the following website: http://www.rcsb.org/pdb/home/home.do
Enter the PDB ID in the search bar.
Gather the crystal structures of Jak3 and ALK (anaplastic lymphoma receptor tyrosine kinase) from RCSB Protein Data Bank. The crystal structures are available in PDB format. The pdb file of Jak3 is 1YVJ. The pdb file of ALK is 4DCE.
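The two structures can also be fetched from a script instead of the search bar. The following is a minimal sketch, assuming the RCSB file-download address pattern https://files.rcsb.org/download/<ID>.pdb; the helper names rcsb_pdb_url and download_pdb are hypothetical and are not part of the original module.

```python
import urllib.request
from pathlib import Path


def rcsb_pdb_url(pdb_id):
    """Build the download address for a PDB entry (assumed RCSB URL pattern)."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


def download_pdb(pdb_id, out_dir="."):
    """Download one entry in PDB format and return the path of the saved file."""
    target = Path(out_dir) / f"{pdb_id.upper()}.pdb"
    urllib.request.urlretrieve(rcsb_pdb_url(pdb_id), str(target))
    return target


if __name__ == "__main__":
    # Jak3 kinase domain (1YVJ) and ALK (4DCE), as used in this module.
    for entry in ("1YVJ", "4DCE"):
        print("saved", download_pdb(entry))
```

The download only runs when the file is executed directly, so the helpers can be imported without network access.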
III. Step 2: Splitting PDB Files Using DECOMP
DECOMP is a web-based decomposition tool for splitting PDB files. It was developed by the Protein Information Technology Group of Eötvös University in Hungary. With this program, protein-ligand complexes can be identified reliably and the ligands are written to separate files. Missing residues and atoms in chains are handled properly and can be inserted into the chains. The DECOMP server can be found at the following website: http://decomp.pitgroup.org/
Enter the PDB ID in the input box. The PDB files of the two proteins above contain the proteins together with their ligands, so the ligands have to be removed. 1YVJ is the Jak3 kinase domain in complex with a staurosporine analogue, so the staurosporine analogue has to be separated from the Jak3 kinase domain. In the same way, 4DCE is anaplastic lymphoma kinase in complex with a piperidine-carboxamide inhibitor, so the inhibitor has to be removed.
Working with DECOMP:
We have to submit the PDB files of the proteins to the server. The server offers options to export ligands, ions, and water molecules, or to insert missing atoms or residues. Choose the option to export ligands and submit the files. Requests are queued (first in, first out), and the time until output depends on the traffic on the server. The output is a compressed tar.gz archive. After extracting the archive, there is one directory for each of the PDB entries listed. Each directory contains an error log with the “.Error” extension, the decomposed PDB file with the “.pdb” extension, and separate files for ligands or ions if the option to export ligands or ions was chosen.
From this directory take the decomposed PDB file, i.e., the file with the “.pdb” extension.
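To see roughly what the ligand removal amounts to, the following minimal sketch drops every HETATM record from PDB text. It is an illustration only, not a replacement for DECOMP: it does not repair missing residues, and it also discards modified residues that are stored as HETATM. The function name strip_hetero is hypothetical.

```python
def strip_hetero(pdb_text):
    """Keep protein ATOM records (plus TER/END markers) and drop HETATM records.

    Ligands, ions and water molecules are stored as HETATM records, so they
    are removed; every other record type (headers, remarks) is dropped too.
    """
    kept_prefixes = ("ATOM", "TER", "END")
    kept = [line for line in pdb_text.splitlines() if line.startswith(kept_prefixes)]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Assumes 1YVJ.pdb has already been downloaded into the working directory.
    with open("1YVJ.pdb") as handle:
        cleaned = strip_hetero(handle.read())
    with open("1YVJ_protein_only.pdb", "w") as handle:
        handle.write(cleaned)
```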
IV. Step 3: Use GRAMM to Predict the Interactions.
GRAMM (Global Range Molecular Matching) is a program for protein docking. GRAMM is open source software and can be installed on a personal computer. It was developed by the Vakser lab (Center for Bioinformatics) at the University of Kansas. It can be installed on Mac, Windows and Linux operating systems. The working instructions given in this guide pertain to the Windows version. It can be downloaded from the following website, which also gives the installation instructions.
http://vakser.bioinformatics.ku.edu/main/resources_gramm1.03.php
Working with GRAMM:
GRAMM has three parameter files: rpar.gr, rmol.gr and wlist.gr. All of them are plain text files and can be edited with any text editor, such as Notepad.
- Parameters to be considered for rpar.gr file:
The parameters for this file should be considered based on the type of molecules we work with. The options available are high resolution generic docking and low resolution generic docking. Let’s try low-resolution generic docking in this case as we are not sure about the structures of the given molecules and the parameters given for the low resolution generic docking are:
- The other file is rmol.gr file. The following is the format of the file with the values I have used:
The two molecules whose interaction we want to predict are specified here. Under the file name column, give the name of one of the files with atomic coordinates (PDB format). The file must first be renamed with the .ent extension (1YVJ.ent). To rename it on Windows, open the command prompt: press the Windows key, type “cmd” in the search bar, and press Enter. A black window appears; this is the command prompt. Type the following commands to rename the file.
C:\Users\Name> cd C:\ (This command takes you to the C drive.)
C:\> cd Gramm (This command takes you to the Gramm directory on the C drive.)
C:\Gramm> rename 1YVJ.pdb 1YVJ.ent (This command renames the file.)
Under the fragment column, give the range of atoms for which the interactions are to be found, or simply give ’*’ so that the entire molecule is considered. Under the ID column, give a string of characters without spaces to identify each molecule. These IDs are used by GRAMM to name the output files.
Run GRAMM with the parameter scan (gramm scan) from the command prompt in the GRAMM directory, i.e., type the command “gramm scan” and press Enter. It creates a .log file and a .res file. To find out which grid was chosen, see the output .log file. The grid is the potential docking area that the docking protein searches in order to release the maximum amount of energy upon interacting with the docked protein.
- The last step is giving parameters in wlist.gr file: The following is the format of the file with the values I have used:
The .res file is the one produced by gramm scan in the previous step.
First match and last match refer to the retrieval of the top 10 hits from the thousands of complexes generated by GRAMM. “separate” produces 10 separate PDB files instead of all ten conformations in the same file. If you want all 10 conformations in the same PDB file, use “joint”. Now run GRAMM with the parameter coord (gramm coord) from the command prompt in the GRAMM directory, i.e., type the command gramm coord and press Enter. The commands below assume that the GRAMM directory was placed on the C drive during installation.
C:\Users\Name> cd C:\ (This command takes you to the C drive.)
C:\> cd Gramm (This command takes you to the Gramm directory on the C drive.)
C:\Gramm> gramm scan (Creates the .res and .log files.)
C:\Gramm> gramm coord (Creates the PDB files.)
After this you will get a single file in .pdb format if you used the “joint” option in wlist.gr, or 10 separate PDB files (in this example) if you used the “separate” option. These files show the various ways in which the two given proteins can interact.
V. Step 4: Visualizing Protein Interactions
Three tools that can be used to visualize the final PDB files obtained from GRAMM are:
- Python Molecular Viewer (PMV): PMV is a powerful molecular viewer with a number of customizable features. It is distributed as part of MGLTools, so MGLTools must be downloaded and installed to get PMV. It is developed by the Molecular Graphics Laboratory of The Scripps Research Institute in La Jolla, California, and is freely available for download at http://mgltools.scripps.edu/ . It can be used on Windows, Linux and Mac operating systems.
- UCSF Chimera: A program for interactive visualization and analysis of molecular structures. High-quality images and animations can be generated. It is developed by the Resource for Biocomputing, Visualization and Informatics (RBVI) at the University of California, San Francisco. Download at http://www.cgl.ucsf.edu/chimera/download.html . It can be used on Windows, Linux and Mac operating systems.
- Swiss-PdbViewer: Swiss-PdbViewer can load and display several molecules simultaneously. Each molecule is loaded into its own layer. It was developed by the SIB Swiss Institute of Bioinformatics in Switzerland and is freely available for download at http://spdbv.vital-it.ch/disclaim.html . It can be used on Windows, Linux and Mac operating systems.
VI. Step 5: Showing Interfacial Amino Acids
Servers such as SPPIDER and PISA can be used to identify the interfacial amino acids and to report the number of hydrogen bonds, the interfacial area, and the amount of energy released. The PISA server can be accessed at http://www.ebi.ac.uk/msd-srv/prot_int/ . SPPIDER can be accessed at http://sppider.cchmc.org/ .
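For a quick local look at a docked model before submitting it to a server, interfacial residues can be approximated with a plain distance cutoff. The sketch below is an assumption-laden illustration, not the method used by PISA or SPPIDER: it assumes the two partners carry different chain IDs in the docked PDB file, uses an arbitrary 5 Å cutoff, and reports no hydrogen bonds, areas or energies. The function names are hypothetical.

```python
import math


def parse_atoms(pdb_text):
    """Read ATOM records using the fixed PDB columns (chain, residue, x, y, z)."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            chain = line[21]
            residue = (line[17:20].strip(), int(line[22:26]))
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((chain, residue, xyz))
    return atoms


def interface_residues(pdb_text, chain_a, chain_b, cutoff=5.0):
    """Return residues of each chain having any atom within `cutoff` Å of the other chain."""
    atoms = parse_atoms(pdb_text)
    side_a = [a for a in atoms if a[0] == chain_a]
    side_b = [a for a in atoms if a[0] == chain_b]
    found_a, found_b = set(), set()
    for _, res_a, xyz_a in side_a:
        for _, res_b, xyz_b in side_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                found_a.add(res_a)
                found_b.add(res_b)
    by_number = lambda res: res[1]
    return sorted(found_a, key=by_number), sorted(found_b, key=by_number)
```

The all-pairs loop is quadratic in the number of atoms, which is acceptable for a single complex but slow for many models.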
Protein-Protein Docking
KEYWORDS: DOCKING GENERAL STRUCTURE_PREDICTION
Written by Sebastian Rämisch (raemisch@scripps.edu). Edited by Shourya S. Roy Burman (ssrb@jhu.edu)
Edited Jun 24 2016
Summary
Rosetta can be used to predict the bound structure of two proteins starting from unbound structures. By the end of this tutorial, you should be able to understand:
- How to prepare structures for docking
- How to locally dock two proteins
- How to refine an already docked structure
- How to dock two proteins whose interface region is unknown
- How to dock flexible proteins
- How to dock a flexible peptide to a protein
- How to dock symmetric proteins
- How to analyse the best docked model
Docking in Rosetta
Docking in Rosetta is a two-stage protocol: the first stage, where aggressive sampling is done, runs in the centroid mode, and the second stage, where smaller movements take place, runs in the full atom mode. The protocol internally connects the centers of the two chains with a so-called jump (see Fold Tree). Along this jump the chains are pulled together (slide into contact). The Monte Carlo moves, which are selected randomly, are:
- Translations (in x,y or z direction)
- Rotations (around x,y, z axis)
By default, the docking protocol assumes a fixed backbone and only performs translation, rotation and sidechain packing. A detailed description of the algorithm is given in the Rosetta docking protocol documentation.
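The random rigid-body moves described above can be illustrated with a toy Monte Carlo loop. This is a sketch under stated assumptions, not Rosetta's implementation: perturbations are drawn uniformly, the rotation is about one randomly chosen coordinate axis through the centroid, the energy function is supplied by the caller, and the temperature value is arbitrary. All names are hypothetical.

```python
import math
import random


def random_rigid_move(coords, max_trans, max_rot_deg, rng):
    """Rotate about a random axis (x, y or z) through the centroid, then translate."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    axis = rng.choice("xyz")
    theta = math.radians(rng.uniform(-max_rot_deg, max_rot_deg))
    c, s = math.cos(theta), math.sin(theta)
    shift = [rng.uniform(-max_trans, max_trans) for _ in range(3)]
    moved = []
    for x, y, z in coords:
        x, y, z = x - cx, y - cy, z - cz
        if axis == "x":
            y, z = c * y - s * z, s * y + c * z
        elif axis == "y":
            x, z = c * x + s * z, -s * x + c * z
        else:
            x, y = c * x - s * y, s * x + c * y
        moved.append((x + cx + shift[0], y + cy + shift[1], z + cz + shift[2]))
    return moved


def metropolis_accept(delta_e, kT, rng):
    """Always accept downhill moves; accept uphill moves with probability exp(-dE/kT)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / kT)


def monte_carlo(coords, energy, steps, kT=0.8, seed=0):
    """Run `steps` trial moves of up to 3 Å and 8 degrees; return final pose and energy."""
    rng = random.Random(seed)
    current, e_cur = coords, energy(coords)
    for _ in range(steps):
        trial = random_rigid_move(current, 3.0, 8.0, rng)
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e_cur, kT, rng):
            current, e_cur = trial, e_trial
    return current, e_cur
```

Because the whole ligand is moved as one body, the distances between its atoms never change, which is what the fixed-backbone assumption means in this toy setting.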
Navigating to the Demos
The demos are available at `$ROSETTA3/demos/tutorials/Protein-Protein-Docking`. All demo commands listed in this tutorial should be executed when in this directory. All the demos here use the `linuxgccrelease` binary. You may be required to change it to whatever is appropriate given your operating system and compiler.
Compare your output files to the ones present in `output_files/expected_output`.
Preparing Structures for Docking
This tutorial will introduce you to the main steps required for predicting the bound structure of two interacting proteins starting from the unbound structures. For this example, we will dock Colicin-D with its inhibitor, IMM. You are provided with the two refined input files `COL_D.pdb` and `IMM_D.pdb`, and a native file `1v74.pdb`, in the folder `input_files`.
To prepare structures for docking, be sure to refine them as described in the preparing inputs tutorial.
Local Docking
Rosetta is most accurate when docking locally. **In local docking, we assume that we have some information about the binding pockets of the two proteins.** First, we must manually place the two proteins (within ~10 Å) with the binding pockets roughly facing each other.
We will pass the following options to indicate that i) chain B is being docked to chain A, and ii) we want to randomly perturb the ligand of the input structure (chain B) by 3 Å translation and 8° rotation before the start of every individual simulation.
-partners A_B
-dock_pert 3 8
If you are docking two multi-chain proteins (say one with chains A & B, and the other with chains L & H), the option becomes `-partners LH_AB`. Make sure that the input PDB has all the chains of each protein listed together.
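The requirement that the chains of each partner are listed together can be checked before a run. The sketch below is a hypothetical pre-flight helper, not part of Rosetta: it reads the chain ID column of the coordinate records, verifies that each partner's chains form one contiguous run in the file, and returns the option string in the form used above.

```python
def chain_order(pdb_text):
    """List chain IDs in the order their blocks of coordinate records appear."""
    order = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            chain = line[21]
            if not order or order[-1] != chain:
                order.append(chain)
    return order


def partners_flag(pdb_text, first, second):
    """Return '-partners FIRST_SECOND' after checking the chain layout of the input."""
    order = chain_order(pdb_text)
    if len(order) != len(set(order)):
        raise ValueError("a chain is split into several blocks: " + "".join(order))
    for group in (first, second):
        # order.index raises ValueError if a requested chain is absent.
        positions = sorted(order.index(chain) for chain in group)
        if positions != list(range(positions[0], positions[0] + len(positions))):
            raise ValueError("chains " + group + " are not listed together")
    return "-partners " + first + "_" + second
```

For a file whose chains appear in the order L, H, A, B, `partners_flag(text, "LH", "AB")` gives `-partners LH_AB`; for the order L, A, H, B it raises an error.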
We will also compare the input with the bound structure 1v74.pdb by passing it as native.
Ensure that the native has the same number of residues, same chain ordering and same chain IDs as the input structure. You may be required to use the `-ignore_unrecognized_res` option if the native contains unusual ligands.
Now to start docking, run:
$> $ROSETTA3/main/source/bin/docking_protocol.linuxgccrelease @flag_local_docking
This should take ~1 minute to run and produce a structure file and a score file in `output_files`. Make sure you use `-nstruct 500` or more in production runs.
Local Refinement of Docked Structures
Sometimes docked proteins score very high in Rosetta owing to small clashes. Before refining with `relax`, we need to fix the interface. Since the protein is already docked, we want to avoid large movements of the protein, so we skip the first (centroid mode) stage completely and run only the high-resolution full atom mode, which does not move the backbone of the docked proteins. We also want to add the rotamers of the input sidechains to our library of rotamers. We replace the flags used for local docking by:
-docking_local_refine
-use_input_sc
We will use the new set of flags to refine the bound structure `1v74.pdb` by running:
$> $ROSETTA3/main/source/bin/docking_protocol.linuxgccrelease @flag_local_refine
This should take ~45 seconds to run and produce a structure file and a score file in `output_files`. Verify that the input structure and the output structure have the same backbone. Now you can refine further with `relax` as described in the preparing inputs tutorial.
Global Docking
Only run global docking if absolutely no information is available about the binding sites in the protein. Global docking assumes a spherical general structure of the proteins and rotates the smaller protein (ligand) around the larger protein (receptor). It also randomizes the starting position of the unbound proteins in every run, so their position in the input structure does not matter as much.
Global docking works best for small complexes (<450 residues).
To do global docking, we add the following three options to the options already present in local docking.
-spin
-randomize1
-randomize2
We will use the same input PDB as in local docking to demonstrate the differences in the output. Run:
$> $ROSETTA3/main/source/bin/docking_protocol.linuxgccrelease @flag_global_docking
This should take ~1 minute to run and produce a structure file and a score file in `output_files`. You will notice that, unlike in the local docking case, global docking produces a docked structure that is significantly different from the native. Compare this to the local docking result, which should be much more similar to the native.
Due to the large space sampled, global docking requires a large number of runs to converge on a structure, typically 10,000-100,000.
An alternative to running global docking is to find the likely binding pockets using an FFT-based global docking program like ClusPro and then run multiple local docking runs.
Docking Flexible Proteins
As mentioned in the introduction, the docking protocol in Rosetta assumes a fixed backbone. If the backbone changes a lot between the unbound and the bound conformations, we dock conformational ensembles of the proteins. Instead of docking one ligand conformation to one receptor conformation, we constantly switch between the conformations in the respective ensembles while sampling in the first stage (in centroid mode). This enables the sampling of multiple backbones with Rosetta’s fixed backbone architecture.
These ensembles can be generated using unconstrained `relax`. For this tutorial, we will use small ensembles - 3 COL_D and 3 IMM_D conformations. Make sure that the number of residues and the chain naming (and order) in every conformation in the ensemble is the same as that of the corresponding partner in the input file. To pass the locations of the conformational ensemble of COL_D, we will pass an ensemble list which looks like:
input_files/COL_D_ensemble/COL_D_0001.pdb
input_files/COL_D_ensemble/COL_D_0002.pdb
input_files/COL_D_ensemble/COL_D_0003.pdb
A similar file is made for the IMM_D ensemble. These files are then passed on to docking by adding the following options to the local docking flags:
-ensemble1 COL_D_ensemblelist
-ensemble2 IMM_D_ensemblelist
`-ensemble1` must contain the list of conformations of the protein listed first in the input PDB. Even when docking one conformation against an ensemble of many, you must make ensemble lists for both partners.
However, before we can dock the ensembles, we need to perform an additional step called _prepacking_. Prepacking prepares the conformers in the ensemble for docking and scores them, so that the docking protocol can remove any energy contribution arising from the inherently different energies of the conformers. It uses the same flags as the ensemble docking protocol, except `-dock_pert`. Run:
$> $ROSETTA3/main/source/bin/docking_prepack_protocol.linuxgccrelease @flag_ensemble_prepack
This should run in ~30 seconds and produce prepacked PDBs in the same location as the original ensemble. It will also modify the ensemble lists with the location of the prepacked PDBs, normalized centroid scores and full atom scores.
input_files/COL_D_ensemble/COL_D_0001.pdb.ppk
input_files/COL_D_ensemble/COL_D_0002.pdb.ppk
input_files/COL_D_ensemble/COL_D_0003.pdb.ppk
0.77058
0
1.00377
-93.3588
-94.2715
-93.9065
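The modified ensemble list can be read back to inspect the per-conformer scores. The following sketch assumes the layout shown above, that is, N file paths followed by N normalized centroid scores and then N full atom scores, one value per line; the function name is hypothetical and the layout is inferred from this example rather than from a format specification.

```python
def parse_ensemble_list(text):
    """Split a prepacked ensemble list into (path, centroid_score, fullatom_score) tuples."""
    entries = [line.strip() for line in text.splitlines() if line.strip()]
    if len(entries) % 3 != 0:
        raise ValueError("expected paths, centroid scores and full atom scores in equal numbers")
    n = len(entries) // 3
    paths = entries[:n]
    centroid = [float(value) for value in entries[n:2 * n]]
    fullatom = [float(value) for value in entries[2 * n:]]
    return list(zip(paths, centroid, fullatom))


if __name__ == "__main__":
    # COL_D_ensemblelist is the list file modified by the prepacking run.
    with open("COL_D_ensemblelist") as handle:
        for path, cen, fa in parse_ensemble_list(handle.read()):
            print(path, cen, fa)
```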
Remember that `COL_D_ensemblelist` and `IMM_D_ensemblelist` need to be regenerated if you run docking prepack again; otherwise you will get uninformative error messages.
The docking protocol expects these prepacked conformations. The prepacking run also produces a score file and an output structure, but you can ignore both. Now run the docking protocol.
$> $ROSETTA3/main/source/bin/docking_protocol.linuxgccrelease @flag_ensemble_docking
While it is running, you will see the following in your log file:
...
protocols.docking.DockingLowRes: /// Ensemble 1: on ///
protocols.docking.DockingLowRes: /// Ensemble 2: on ///
...
protocols.docking.ConformerSwitchMover: Switching partner with conformer: 3
...
This indicates that your ensembles have been loaded and your backbones are being switched.
This should take about 3 minutes to run and produce a score file and an output.
You will typically use larger ensembles (25 to 100 conformations). This slows the protocol down significantly. Also, because the sampling per backbone is diluted, you will have to increase the number of structures; `-nstruct 5000` or more is suggested.
Symmetric Docking
Homomeric protein complexes are often symmetric. Rosetta provides a means to use symmetry information during docking (and many other applications). Check out the symmetry documentation and tutorials to know more.
Analyzing Docked Structures
Analyzing docked structures can be tricky. Let’s look at a score file generated from ensemble docking:
SEQUENCE:
SCORE: total_score rms Fnat I_sc Irms cen_dock_ens_conf cen_rms conf_num1 conf_num2 conf_score dock_ens_conf1 dock_ens_conf2 dslf_ca_dih dslf_cs_ang dslf_ss_dih dslf_ss_dst fa_atr fa_dun fa_elec fa_pair fa_rep fa_sol hbond_bb_sc hbond_lr_bb hbond_sc hbond_sr_bb interchain_contact interchain_env interchain_pair interchain_vdw st_rmsd description
SCORE: -202.141 13.143 0.061 -3.086 5.015 -1.004 14.583 1.000 1.000 -25.732 -93.359 -83.050 0.000 0.000 0.000 0.000 -326.487 12.899 -3.158 -8.467 8.982 139.078 -3.968 -4.390 -2.692 -13.937 -20.000 -23.791 -2.002 0.357 7.729 col_complex_ensemble_dock_0001
The `total_score` may not indicate the best docked model, as the primary contribution comes from the folding energies of the monomers. You should rely more on the interface score `I_sc`, which represents the energy of the interactions across the interface. `I_sc` scores are typically much smaller in magnitude than total scores (typically in the range of -2 to -10 REU). They depend on the size and geometry of the interface, so they are difficult to compare across different proteins.
If you provide a native structure, you get comparative data like the fraction of native contacts (`Fnat`), the Cα RMSD of the ligand (`rms`) and the RMSD of the interface atoms (`Irms`). If you run ensemble docking, you will also learn which of the conformers was finally selected for protein 1 (`conf1`) and protein 2 (`conf2`).
You can repeat the run with `-nstruct 20000` and output silent files. Then you could extract the best model and create a score vs. rmsd plot.
Look at Analysis for further information on how to analyze your results.
Protein-Protein Docking: From Interaction to Interactome
Abstract
The protein-protein docking problem is one of the focal points of activity in computational biophysics and structural biology. The three-dimensional structure of a protein-protein complex, generally, is more difficult to determine experimentally than the structure of an individual protein. Adequate computational techniques to model protein interactions are important because of the growing number of known protein structures, particularly in the context of structural genomics. Docking offers tools for fundamental studies of protein interactions and provides a structural basis for drug design. Protein-protein docking is the prediction of the structure of the complex, given the structures of the individual proteins. In the heart of the docking methodology is the notion of steric and physicochemical complementarity at the protein-protein interface. Originally, mostly high-resolution, experimentally determined (primarily by x-ray crystallography) protein structures were considered for docking. However, more recently, the focus has been shifting toward lower-resolution modeled structures. Docking approaches have to deal with the conformational changes between unbound and bound structures, as well as the inaccuracies of the interacting modeled structures, often in a high-throughput mode needed for modeling of large networks of protein interactions. The growing number of docking developers is engaged in the community-wide assessments of predictive methodologies. The development of more powerful and adequate docking approaches is facilitated by rapidly expanding information and data resources, growing computational capabilities, and a deeper understanding of the fundamental principles of protein interactions.
Introduction
Proteins recognize each other, typically in a crowded environment, and bind in a highly specific fashion. This process involves diffusion through a densely populated milieu of different proteins and other biomolecular structures, and binding (docking) to their designated protein partner in a structurally unique and precise way. Given the large size of these macromolecules, the great structural diversity, and the high density of the biomolecular environment, this constantly reoccurring process is truly remarkable.
Protein docking—prediction of the structure of a protein-protein complex from the structures of the individual proteins—has evolved significantly since its early days, by incorporating more adequate energy functions and powerful techniques to sample the energy landscapes, and by taking advantage of the rapidly growing body of knowledge on protein structures and interactions. Our current knowledge of protein interaction principles is far greater than before, helping design better docking approaches. The spectacular progress in computing hardware has obviously played a major role as well, opening new ways of thinking about modeling of protein interactions, and often allowing implementation of old but unfeasible at the time ideas. Still, some basic docking principles remain surprisingly unchanged, due to their true nature. Steric and physicochemical complementarity is still the foundation of most docking approaches, as it was in the beginning of the docking field.
Beginnings
The origins of the protein docking field can be traced to the earlier days of molecular modeling. Back then in the 70s, the force fields were simpler, the minds not clouded by the power of computers, and the goals clearer (e.g., to fold proteins from the sequence based on the physical forces alone).
The first docking approaches dealt not with protein-protein complexes per se, but rather with protein interactions with other ligands at predetermined binding sites (1–4). Despite the early times in molecular modeling, the approaches were remarkably sophisticated, implementing flexible docking, taking into account the internal coordinates of not only the ligand, but in some cases also the receptor—a challenging task largely avoided even in today’s community, with all its computing power and the history of methodology development. Protein-protein docking approaches followed shortly, implementing the global search for the docking pose in rigid-body approximation (5,6).
A significant uptick in the development of protein docking techniques (that continues to this day) occurred in the early 90s. Among the most influential and consequential approaches put forward at that time were those based on efficient sampling techniques borrowed from computer science (7,8). The docking approach based on correlation by fast Fourier transform, commonly known as FFT docking (7), developed back then by an interdisciplinary group of biologists, chemists, physicists, and computer scientists, has become arguably the most popular protein docking algorithm, implemented over the years in many groups (9). The reason for its popularity is that, as opposed to employing a particular search strategy that may or may not lead to the global minimum (native complex), it allows a computationally feasible exhaustive search of the full six-dimensional docking space. Although the space, for the purpose of the exhaustive sampling, has to be discretized, the atomic-size grid steps still provide the “comprehensive” solution to the rigid-body docking problem.
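The quantity that FFT docking evaluates is a grid correlation between the two digitized molecules. The sketch below computes that correlation by brute force on a toy two-dimensional grid with cyclic shifts; an FFT yields the same table of scores far faster, and a full search would repeat the scan for every ligand rotation. The grid values (1 for receptor surface, -15 for receptor core, 1 for ligand cells) are illustrative choices, and the function name is hypothetical.

```python
def correlation_scan(receptor, ligand):
    """Score every cyclic translation (dx, dy) of the ligand grid over the receptor grid."""
    n = len(receptor)
    scores = {}
    for dx in range(n):
        for dy in range(n):
            total = 0
            for x in range(n):
                for y in range(n):
                    total += receptor[x][y] * ligand[(x - dx) % n][(y - dy) % n]
            scores[(dx, dy)] = total
    return scores


n = 6
# Receptor: a 3x3 block whose centre is penalized core and whose rim is surface.
receptor = [[0] * n for _ in range(n)]
for x in (1, 2, 3):
    for y in (1, 2, 3):
        receptor[x][y] = 1
receptor[2][2] = -15

# Ligand: two adjacent occupied cells placed at the grid origin.
ligand = [[0] * n for _ in range(n)]
ligand[0][0] = 1
ligand[0][1] = 1

scores = correlation_scan(receptor, ligand)
best_shift = max(scores, key=scores.get)
```

Shifts that lay both ligand cells on the receptor surface score highest, while any shift that pushes a ligand cell into the core is heavily penalized, which is the steric complementarity idea in its simplest form.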
Docking Foundations
The protein-protein docking problem can be formulated as the prediction of the structure of the complex, given the structures of the individual proteins. In the general case, no information other than the structure of these individual proteins is available. Fig. 1 shows major steps in the proper development of a docking approach, involving scan (global search) of the docking space using simplified/coarse-grained protein representations, followed by a refinement to a higher resolution (local search), and systematic evaluation on comprehensive benchmark sets and blind community-wide assessments.
Fig. 1. The general scheme of protein docking methodology development. The scan (global search for complementarity) is performed on a simplified/coarse-grained representation of proteins (e.g., digitized on a grid, or discretized/approximated in other ways). The scan can be explicit (free) or based on similarity to known cocrystallized complexes (comparative). The refinement is supposed to bring back all or some structural resolution lost in the coarse-graining (e.g., by gradual transition from a smoothed intermolecular energy landscape to the one based on a physical force field, while tracking the position of the global minimum). The validity of the approach is determined by systematic benchmarking on representative sets of structures.
At the heart of the docking methodology is the notion of steric complementarity at the protein-protein interface. These interfaces are indeed tightly packed, as observed in cocrystallized complexes in the Protein Data Bank (PDB). The steric complementarity has been the major driving force in the development of docking approaches, often with the addition of physicochemical complementarity (hydrophobicity, electrostatics, etc.) (10,11), and statistics-based propensities (12,13). The structural complementarity has been observed at different resolutions, from the atomic to ultralow (14–18).
The conformation of the protein within the complex (bound structure) differs from the one outside the complex (unbound structure). In some cases, this difference can be neglected or approximated (rigid body docking), or taken into account through conformational search (flexible docking). Rigid body docking involves six degrees of freedom of the two-body system (e.g., three translations and three rotations in the Cartesian coordinates). Flexible docking involves a much greater number of coordinates, given the conformational search in the internal coordinates of the proteins. However, this search typically does not involve solving the elusive ‘protein folding problem’, but rather can be restricted to a much more tractable unbound-to-bound conformational transition.
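The six rigid-body degrees of freedom can be made concrete with a short sketch. It assumes one particular parameterization (three translations plus three rotation angles about the x, y, and z axes, applied about the ligand centroid). Other conventions are equally valid, and the function names are illustrative.

```python
import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """Rotation Rz(gamma) @ Ry(beta) @ Rx(alpha), angles in radians."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    rx = np.array([[1.0, 0.0, 0.0], [0.0, ca, -sa], [0.0, sa, ca]])
    ry = np.array([[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]])
    rz = np.array([[cg, -sg, 0.0], [sg, cg, 0.0], [0.0, 0.0, 1.0]])
    return rz @ ry @ rx

def apply_pose(coords, pose):
    """Move a rigid ligand: rotate about its centroid, then translate.

    coords is an (N, 3) array; pose is (tx, ty, tz, alpha, beta, gamma).
    """
    coords = np.asarray(coords, dtype=float)
    tx, ty, tz, alpha, beta, gamma = pose
    center = coords.mean(axis=0)
    rotated = (coords - center) @ rotation_matrix(alpha, beta, gamma).T
    return rotated + center + np.array([tx, ty, tz])

# Four made-up atom positions and an arbitrary pose.
ligand_atoms = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                         [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
moved = apply_pose(ligand_atoms, (5.0, -2.0, 1.0, 0.3, 1.1, -0.7))
```

Because the transformation is rigid, all interatomic distances within the ligand are unchanged. Only the six pose parameters vary during a rigid-body search.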
Originally, mostly the high-resolution, experimentally determined (primarily by x-ray crystallography) structures were considered. However, more recently, the focus has been shifting toward lower resolution modeled structures. The correct prediction of the complex does not mean the exact native (cocrystallized) complex per se, which is mathematically/computationally impossible, but rather a near-native approximation.
The general question is: what is the necessary level of structural accuracy for predicted protein complexes? In protein-protein interactions, many experimental and theoretical studies require simple knowledge of the residues at the interfaces (e.g., for further experimental analysis) and have no use for atomic resolution structural details of the complex (specific atom-atom, or even residue-residue contacts across the interface). For the interface (binding site) prediction, the high-resolution protein structures, generally, are not needed. That has been extensively shown by systematic studies over a number of years (15). Still, a high-resolution structure of the complex is required for a number of studies (e.g., for estimation of the binding affinity, certain approaches to inhibition of protein interactions, and such).
Bound and Unbound Docking
The bound docking problem, where the proteins within a cocrystallized complex are separated and redocked by a computational procedure, is a useful tool for the development of new docking approaches, but obviously has no practical value for biology. Docking becomes useful when it is able to predict complexes from the separately determined protein structures (unbound docking), thus becoming a tool for generating new knowledge.
Bound docking is the easiest docking case, because by definition it does not involve conformational change. Thus, the structures match ideally at the interface and the rigid body approach is the only tool required to deliver the correct solution. The bound docking problem has been considered solved for a number of years, in the sense that the existing docking approaches reliably and routinely deliver the near-native structures of the complex among the top predictions.
The approaches to the unbound docking problem have to deal with the conformational difference between the unbound and the bound structures. The change from the unbound to the bound conformation is the basis of the protein’s function in its interactions with other proteins. The intermolecular energy landscapes are characterized by conformational properties of the interacting proteins (19–21). One basic direction in the docking methodology involves coarse-graining (22,23). At lower levels of structural resolution, the difference between unbound and bound conformations is less significant (24,25), and ultimately disappears at ultralow (but still structurally meaningful) resolution (15,24,26). Such approaches allow prediction of the gross features of the complex, due to the large structural recognition factors, and the related funnel in the intermolecular energy landscape (27–29). However, prediction of the higher resolution structural details of the interface requires modeling of the structural flexibility, at least at the interface regions.
Still, the majority of protein complexes in the nonredundant benchmark sets have a small Cα root mean-square deviation (RMSD) between bound and unbound structures. Indeed, in the Dockground set (30,31), the RMSD between superimposed unbound and bound proteins is <2 Å for 71% of the complexes (31). The benchmark set from Weng’s group (32) has unbound/bound interface Cα RMSD (between Cα atoms of the interface residues only) <2.2 Å for 86% of complexes. In a number of cases, when the RMSD is large, the conformational change upon binding is a domain shift. The domains themselves do not undergo a significant conformational change. Thus, this docking still can be addressed by a rigid body approach (33).
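The RMSD values quoted above are computed after optimal superposition of the two structures. A minimal sketch of such a calculation is given here, assuming the Kabsch algorithm for the superposition and matched (N, 3) arrays of Cα coordinates. The cited benchmark studies may have used a different procedure, and the function name is illustrative.

```python
import numpy as np

def superimposed_rmsd(mobile, target):
    """RMSD between two matched (N, 3) coordinate sets after optimal
    rigid superposition (Kabsch algorithm, assumed here)."""
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    p = mobile - mobile.mean(axis=0)          # remove translation
    q = target - target.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)         # covariance matrix H = P^T Q
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0.0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = p @ rot.T - q
    return float(np.sqrt((diff ** 2).sum() / len(p)))

# Made-up coordinates: a rigidly rotated and shifted copy gives RMSD ~ 0.
rng = np.random.default_rng(0)
unbound = rng.normal(size=(12, 3))
angle = 0.5
rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle), np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
bound_like = unbound @ rz.T + np.array([3.0, -1.0, 2.0])
rigid_rmsd = superimposed_rmsd(unbound, bound_like)
```

For an interface RMSD, the same function would be applied to the Cα atoms of the interface residues only.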
Because most docking cases can be resolved by accounting for the flexibility of the surface side chains, the statistics of side-chain conformational changes is important. The results of a systematic large-scale study indicate that short and long side chains have different propensities for the conformational changes (34). Long side chains with three or more dihedral angles are often subject to large conformational transition. Shorter residues with one or two dihedral angles typically undergo local conformational changes not leading to a conformational transition. Most side chains undergo larger changes in the dihedral angle most distant from the backbone. The binding increases both polar and nonpolar interface areas. However, the increase of the nonpolar area is larger, suggesting that the protein association perturbs the unbound interfaces to increase the hydrophobic contribution to the binding free energy (34). Analysis of ensembles of bound and unbound conformations points to conformational selection as the binding mechanism for proteins. The bound and the unbound spectra of conformers also significantly overlap (35). An elastic network model, accounting for the mass distribution, was used to compare the binding site residues fluctuations with other surface residues, showing that, on average, the interface is more rigid (36).
Discretization of the conformational space into rotameric states is useful for the sampling of the conformational space in flexible docking (37,38). Such rotameric libraries for the surface side chains in bound and unbound proteins were generated and used to calculate the probabilities of the rotamer transitions upon binding (38). The stability of amino acids was quantified based on the transition maps. Most side chains changed conformation within the same rotamer or moved to an adjacent rotamer. The highest percentage of the transitions was observed primarily between the two most occupied rotamers (38).
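A transition map of this kind can be sketched in a few lines. The example assumes a simple three-state binning of a single dihedral angle into 120° windows with the labels g+, t, and g-. Both the binning and the input angles are hypothetical and do not reproduce the rotamer libraries of (38).

```python
from collections import Counter

def chi_to_rotamer(chi_deg):
    """Assign a dihedral angle (degrees) to one of three assumed rotamer bins."""
    angle = chi_deg % 360.0
    if angle < 120.0:
        return "g+"
    if angle < 240.0:
        return "t"
    return "g-"

def transition_probabilities(angle_pairs):
    """Estimate P(bound rotamer | unbound rotamer) from (unbound, bound) angles."""
    counts = Counter((chi_to_rotamer(u), chi_to_rotamer(b)) for u, b in angle_pairs)
    totals = Counter()
    for (unbound_state, _), n in counts.items():
        totals[unbound_state] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}

# Made-up chi1 angles (unbound, bound) for four side chains.
observed = [(62.0, 58.0), (65.0, 175.0), (-60.0, -65.0), (180.0, 178.0)]
probs = transition_probabilities(observed)
```

Diagonal entries such as ("t", "t") correspond to side chains that stay within the same rotamer upon binding.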
Docking of Models
The docking problem is further complicated if the interacting proteins are models rather than the experimentally determined structures. The errors in such ‘double modeling’ (first of the individual proteins, then of the complex) accumulate, which presents a greater challenge, especially in higher resolution docking (Fig. 2). Thus, the use of approaches to dock these structures should be assessed by thorough benchmarking, specifically designed for protein models (39). To be credible, such benchmarking has to be based on carefully curated sets of structures with levels of distortion typical for modeled proteins. A suite of models was generated for the benchmark set of the x-ray structures from the Dockground resource (http://dockground.bioinformatics.ku.edu) by a combination of homology modeling and the nudged elastic band method (40). For each monomer, six models were generated with predefined Cα RMSD from the native structure (1, 2, …, 6 Å). The sets and the accompanying data provide a comprehensive resource for the development of docking methodology for modeled proteins (41). A new approach, implementing only actual modeling of new protein structures as in the real case scenario, was used by the same group of authors to generate a larger set of models (165 complexes, with full arrays of models for each).
Fig. 2. Structures with the increasing level of inaccuracy. The model structures (cyan) are overlapped with the x-ray structure (light brown).
Physics Versus Knowledge-Based Docking
Solving the equations of motion for two proteins in arbitrary relative orientation, using atomic resolution force fields, does not dock them in the correct configuration. The reason is the extreme complexity of the energy landscape of the system—its span in multidimensional space of its coordinates and the multiplicity of the energy minima (24,42)—all compounded by the approximate nature of the landscape, with error bars that are often larger than the relative depth of the energy basins.
Still, if one is interested just in the location of the global minimum of the free energy, corresponding to the native structure of the complex, with no regard to the binding pathways, there are nonphysical sampling protocols that efficiently search the landscape and deliver the solution. These protocols treat the problem as global optimization, and find the global minimum through various nonlinear programming techniques, including the most trivial (and effective) one: systematic search (7). The main reason for their success, given the complexity of the landscape, is that the global minimum is significantly different from the local minima. It is not just deeper, as it is supposed to be by definition, but deeper by a significant margin (43), and has a number of other distinguishing characteristics, such as size and ruggedness (28,42). Thus, the unavoidable common approximations of the energy landscapes, although distorting the local minima hierarchy, are not approximate enough to eliminate the difference between the global and the local minima (or at least to remove the actual global minimum from the top candidates). Although such minimization protocols are nonphysical, because the landscape represents physics-based energy (even in its simplest form of steric complementarity, which is none other than the minimum of van der Waals energy) such approaches still pass as physical (often among nonphysicists, and physicists who know biology). Still, in many such approaches, the only physical concept is the trivial steric complementarity, and the rest are techniques borrowed from computer science and other engineering disciplines (pattern recognition, optimization, machine learning, etc.). Such approaches have been dominant in the protein docking field since its inception.
The whole notion of physics though goes out the window altogether with the recent docking approaches based solely on similarity to the existing experimentally determined complexes/templates. If two similar pairs of proteins generally bind in a similar way, and one of them is cocrystallized, for the other pair there may be no need to sample the extremely complex intermolecular landscape—one can simply get straight to the presumed global minimum (the correct structure of the complex) by assuming similarity to the experimentally determined complex.
Structural modeling by similarity (comparative modeling) of individual proteins has been around for a long time, since the establishment of the correlation between sequence and structure similarity in the 80s (44). Such similarity suggested that if the sequence of protein A, the structure of which is to be modeled, is similar to the sequence of protein A′, the structure of which is known, one can put protein A in the same fold as protein A′. That provided a dramatic improvement in terms of prediction reliability over the proverbial ‘protein folding problem’ where the protein structure is supposed to be modeled based on the amino acid sequence alone. The atomic resolution prediction of the protein structure was reduced to the repacking of the side chains, and tweaking of the backbone (often involving flexible loops), a difficult but quite tractable task, incomparable with the ultimate complexity of the structure prediction from the sequence alone. The critical aspect of such an approach is the availability of the experimentally determined (largely by x-ray crystallography) templates. One can date the emergence of the comparative modeling to the expansion of the PDB that at the time had become large enough to provide a meaningful pool of templates. Currently, with the rapid growth of PDB, the template-based modeling of individual proteins is a dominant approach to the prediction of protein structures (45). The modeling of the folding-related physical processes is still a big challenge, and may well remain so in the future. However, the ongoing expansion of PDB will arguably keep further simplifying the nonphysical prediction shortcut to the equilibrium structure, given the limited structural scope of the protein universe (46), which causes the reduction of the pool of proteins with the new fold (45).
In protein-protein docking, the similarity between proteins in complexes can be assessed through comparison/alignment of sequences (47–49), sequences and structures (threading) (50–52), or just the structures (52–58) because the structures of the proteins to be docked are assumed to be known by the very definition of docking. However, the protein docking field, as opposed to the prediction of individual proteins, largely has not been taking advantage of the template-based modeling. One reason is that protein-protein docking is younger and thus less advanced. The protein docking community is also significantly smaller than the one in modeling of individual proteins, based on the number of participants in the community-wide prediction assessments (∼200 in CASP vs. ∼40 in CAPRI) and prediction targets (45,59).
Another reason has been the relative success of the traditional template-free (ab initio) docking, as opposed to the ab initio modeling of the individual proteins. The rigid-body docking (six degrees of freedom) is a meaningful, working approximation for many complexes, whereas any practical approximation in protein folding involves the conformational search space of far greater dimensionality.
Still, the main reason for the almost comple