MaDCaT – Mapping of Distances for the Categorization of Topology

Getting MaDCaT

MaDCaT is written in C++, with the source code freely available to academic users under the terms of the GNU General Public License. Download the source code from here. Inquiries about commercial licensing should be directed to Dr. Gevorg Grigoryan.

Downloading the distance-map database

MaDCaT uses the distance-map representation of structure to search. This means it needs a database of pre-computed maps to search against. Included with the source code is a program for generating maps from PDB structures. In addition, a database of maps derived from the PDB can be downloaded from our server (synced with the PDB nightly; includes only X-ray entries). To do so, issue the following command:

rsync -varz arteni.cs.dartmouth.edu::BioUnitMaps/ /local/path

where /local/path is a path to the directory where you want the maps to reside. To search against a database of maps, you will need to pass to MaDCaT a file with a list of paths to individual maps. Below are some lists generated for this database, which include the list of all maps (all.list) as well as some smaller non-redundant lists generated by taking the first entry from each sequence cluster produced by blastclust when run on the entire PDB. For example, bc-30.list corresponds to sequence clusters produced at a 30% sequence-identity threshold. NOTE: you will need to replace the string /local/path with the base path to where you store your maps locally.
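
The path substitution described in the note can be scripted. The following is a minimal Python sketch, not part of MaDCaT; it assumes that each line of a downloaded list is a single map path beginning with the literal string /local/path. The function name rebase_map_list and the example base path /data/madcat are hypothetical.

```python
PLACEHOLDER = "/local/path"  # placeholder string named in the note above


def rebase_map_list(lines, local_base):
    """Return list entries with the placeholder prefix replaced by local_base.

    Assumption: one map path per line, each starting with PLACEHOLDER.
    Entries that do not start with the placeholder are kept unchanged.
    """
    base = local_base.rstrip("/")
    rebased = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if entry.startswith(PLACEHOLDER):
            entry = base + entry[len(PLACEHOLDER):]
        rebased.append(entry)
    return rebased


# Hypothetical list entries, for illustration only.
example = ["/local/path/maps/a0/4a0f.1.bin", "/local/path/maps/l1/1l1s.1.bin"]
for path in rebase_map_list(example, "/data/madcat/"):
    print(path)
```

To process a real list, read its lines from the downloaded file and write the returned entries to a new file, which can then be passed to MaDCaT.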

Documentation

All the executables and scripts that are part of MaDCaT print detailed usage statements when run without command-line arguments. These contain the description of all allowed options. Below we present a simple usage scenario, which should cover the most typical circumstance. Some of the additional non-default options are pointed out along the way, but for a full list of all capabilities see the usage summaries printed by each program.

Suppose we have constructed an ideal fragment of tertiary structure, consisting of an alpha helix packing against a parallel two-stranded beta sheet:

An ideal fragment of tertiary structure (left) and its corresponding distance map (right). Red lines in the distance map delineate locations of breaks between secondary-structure elements in the tertiary motif.

We would like to find out 1) whether this fragment is designable and 2) which sequence features tend to encode it. We begin by creating a PDB file corresponding to just this fragment; let's call it motif.pdb. Next, we use the program createDM to generate a distance map corresponding to this fragment (substitute $MADCAT_HOME with the path where you compiled MaDCaT):

$MADCAT_HOME/createDM --p motif.pdb --o motif.map

This generates a flat-text file motif.map which contains the distance matrix (in the first section of the file, before the END statement) as well as some additional information useful for MaDCaT. If you examine motif.map you will find that 1) values listed are inverse distances and 2) only atom pairs closer than 25 Å are listed (i.e. those with inverse distances above 0.04). Inverse distances below this cutoff value are assumed to be too small and are disregarded. 25 Å is the default cutoff, and it can be adjusted using the --dcut option to createDM. Note, however, that the pre-generated database of PDB maps from our server has been constructed using the 25 Å cutoff, so this value should be used when searching against this database. The resulting distance map is visualized in the figure above on the right.
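
To make the relationship between the distance cutoff and the stored values concrete, here is a small Python sketch of an inverse-distance map with a cutoff. It is an illustration only and does not reproduce createDM or the motif.map file format. It assumes one representative point per residue (for example, a CA atom); the function name inverse_distance_map and the coordinates are made up.

```python
import math


def inverse_distance_map(coords, dcut=25.0):
    """Return {(i, j): 1/d} for every pair of points closer than dcut (in Å).

    Pairs at or beyond the cutoff are simply omitted, which is equivalent to
    discarding inverse distances at or below 1/dcut (0.04 for 25 Å).
    """
    entries = {}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if 0.0 < d < dcut:
                entries[(i, j)] = 1.0 / d
    return entries


# Three made-up points on a line: only the first pair is within 25 Å.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (30.0, 0.0, 0.0)]
dmap = inverse_distance_map(coords)
for pair, value in dmap.items():
    print(pair, round(value, 4))
```

Every value kept by this sketch is larger than 1/25 = 0.04, which is the property to look for when inspecting a map generated with the default cutoff.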

The next step is to search a database with this map. Here the bc-30 database is used, which is highly non-redundant but still contains a large number of entries (~16,000 structures at the time of this writing):

$MADCAT_HOME/madcat --map motif.map --compList bc-30.list --diag 1 --topN 1000 --matchOut motif.match --seqOut motif.seq --structOut motif.struct --structOutType match

This tells MaDCaT to look into the file bc-30.list (assumed to be in the local directory here, but you can specify a path) for a list of database maps, search each for structural motifs similar to that given by motif.map, find the top 1,000 closest matches (in the sense of the distance-matrix difference norm) and produce several types of output. File motif.match (the "match" file) is the only mandatory output and it contains the exact specification of the 1,000 best matches found. This format is convenient for MaDCaT's internal use and this file can be used to reconstruct all of the other output information from a previous run. Here is a small fragment of the match file from the above run:

match: structure 'maps/a0/4a0f.1.bin' [(1: 906 x 906), (2: 906 x 769), (3: 906 x 787), (4: 769 x 906), (5: 769 x 769), (6: 769 x 787), (7: 787 x 906), (8: 787 x 769), (9: 787 x 787)] 0.0112288
match: structure 'maps/a0/4a0f.1.bin' [(1: 162 x 162), (2: 162 x 22), (3: 162 x 43), (4: 22 x 162), (5: 22 x 22), (6: 22 x 43), (7: 43 x 162), (8: 43 x 22), (9: 43 x 43)] 0.0112327
match: structure 'maps/l1/1l1s.1.bin' [(1: 1 x 1), (2: 1 x 14), (3: 1 x 32), (4: 14 x 1), (5: 14 x 14), (6: 14 x 32), (7: 32 x 1), (8: 32 x 14), (9: 32 x 32)] 0.0114376
...

Each line corresponds to a single match (ordered best to worst by distance-map difference score). So the top match came from the PDB entry 4A0F (map file maps/a0/4a0f.1.bin) and had a distance-map difference norm score of 0.0112288 (end of the line). The middle portion of each line (between square brackets) specifies the precise indices within the database map where the submaps of the query map align (there are as many index pairs as there are submaps in the query; in our case, 9). The second type of output file is the "sequence" file, here motif.seq. A short excerpt from this file in the current example:

0.810112 0.0112288 ASP LEU LEU CYS LEU VAL GLU LYS THR LEU VAL SER THR GLY ILE ALA ALA SER PHE LEU LEU THR LYS LEU LEU TYR LEU LYS
0.81016 0.0112327 ASP LEU LEU CYS LEU VAL GLU LYS THR LEU VAL SER THR GLY ILE ALA ALA SER PHE LEU LEU THR LYS LEU LEU TYR LEU LYS
0.797668 0.0114376 TYR ARG VAL VAL PHE HIS ILE ARG VAL LEU LEU LEU ILE SER ASN VAL ARG ASN LEU MSE ALA VAL ARG ILE GLU VAL VAL ALA
...

This file lists the sequence of each match (in the same order as in the match file) along with the RMSD from the query (first column) and the distance-matrix difference norm score (second column). Note that the first two matches have the same sequence (and come from the same PDB entry, as we learn from the match file above), but have slightly different scores. This is likely because there is some structural redundancy within the entry itself (e.g. it is a non-crystallographically symmetric homo-oligomer). MaDCaT's approach is to disregard such redundancy during the search step and to remove it during the analysis step (see below).
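
For downstream scripting, lines of the match and sequence files can be read with a few lines of Python. The sketch below is based only on the excerpts shown above, so the regular expressions are an assumption about the format rather than a specification of it. The helper names are hypothetical, and deriving the PDB code from the map file name follows the 4a0f.1.bin example.

```python
import os
import re

# Patterns inferred from the excerpts above (an assumption, not a format spec).
MATCH_RE = re.compile(r"^match: structure '([^']+)' \[(.*)\] (\S+)$")
PAIR_RE = re.compile(r"\((\d+): (\d+) x (\d+)\)")


def parse_match_line(line):
    """Split a match-file line into map path, submap index pairs, and score."""
    m = MATCH_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized match line: {line!r}")
    path, pairs, score = m.groups()
    submaps = [(int(a), int(b)) for _, a, b in PAIR_RE.findall(pairs)]
    pdb_id = os.path.basename(path).split(".")[0].upper()
    return {"map": path, "pdb": pdb_id, "submaps": submaps, "score": float(score)}


def parse_seq_line(line):
    """Split a sequence-file line into RMSD, score, and residue names."""
    fields = line.split()
    return {"rmsd": float(fields[0]), "score": float(fields[1]), "residues": fields[2:]}


match_line = ("match: structure 'maps/l1/1l1s.1.bin' [(1: 1 x 1), (2: 1 x 14), "
              "(3: 1 x 32), (4: 14 x 1), (5: 14 x 14), (6: 14 x 32), (7: 32 x 1), "
              "(8: 32 x 14), (9: 32 x 32)] 0.0114376")
seq_line = ("0.797668 0.0114376 TYR ARG VAL VAL PHE HIS ILE ARG VAL LEU LEU LEU ILE SER "
            "ASN VAL ARG ASN LEU MSE ALA VAL ARG ILE GLU VAL VAL ALA")

match = parse_match_line(match_line)
seq = parse_seq_line(seq_line)
print(match["pdb"], len(match["submaps"]), match["score"])
print(seq["rmsd"], len(seq["residues"]))
```

Because the two files list matches in the same order, records parsed from corresponding lines can be paired by position.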

The last category of output is structures. In the syntax above, MaDCaT was asked to place the matching portion of structure from each match into a separate file under the directory named motif.struct (which will be created if it does not exist). Individual file names of matches are numbered to make the correspondence between match structures and lines in the above output files obvious. If the value of the option --structOutType includes the string "file", then instead of one file per match, all matches will be put into a single NMR-style PDB file named according to the --structOut option.

Additional options exist that control whether and how information about sequences/structure intervening the disjoint query fragments in the matching structures is presented. See usage statements for more details.

By looking at the sequence output file, we can discern how close structures in real proteins come to our idealized query from above, as well as what sequences tend to stabilize them. As mentioned above, this file can (and often will) contain some redundancy (i.e. different matches may have the same sequence, usually indicating that they come from symmetric regions of a homo-oligomeric protein or a protein with an internal repeat). The perl script seqAnal.pl, included with MaDCaT, can be used to analyze these data by removing such redundancy. For example, the figure below shows the structures of all of the sequence-unique hits with an RMSD from the query below 0.8 Å:

*Query tertiary structure (left), matches from bc-30 that align with a CA RMSD below 0.8 Å (middle), sequence logo diagram from close matches (right).*
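
The idea of keeping only sequence-unique close hits can be sketched in Python as follows. This is not the seqAnal.pl implementation; in particular, keeping the lowest-RMSD representative of each repeated sequence is an assumption made here for illustration, and the function name unique_close_hits is hypothetical. The input lines are the three shown in the sequence-file excerpt above.

```python
def unique_close_hits(lines, rmsd_cutoff):
    """Return (rmsd, score, residues) records below rmsd_cutoff, one per sequence.

    Assumption: among matches with identical sequences, the one with the
    lowest RMSD is kept. Results are sorted by increasing RMSD.
    """
    best = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        rmsd, score, residues = float(fields[0]), float(fields[1]), tuple(fields[2:])
        if rmsd >= rmsd_cutoff:
            continue
        if residues not in best or rmsd < best[residues][0]:
            best[residues] = (rmsd, score, residues)
    return sorted(best.values(), key=lambda rec: rec[0])


seq_lines = [
    "0.810112 0.0112288 ASP LEU LEU CYS LEU VAL GLU LYS THR LEU VAL SER THR GLY "
    "ILE ALA ALA SER PHE LEU LEU THR LYS LEU LEU TYR LEU LYS",
    "0.81016 0.0112327 ASP LEU LEU CYS LEU VAL GLU LYS THR LEU VAL SER THR GLY "
    "ILE ALA ALA SER PHE LEU LEU THR LYS LEU LEU TYR LEU LYS",
    "0.797668 0.0114376 TYR ARG VAL VAL PHE HIS ILE ARG VAL LEU LEU LEU ILE SER "
    "ASN VAL ARG ASN LEU MSE ALA VAL ARG ILE GLU VAL VAL ALA",
]

for rmsd, score, residues in unique_close_hits(seq_lines, rmsd_cutoff=0.8):
    print(rmsd, score, " ".join(residues))
```

With the 0.8 Å cutoff only the third excerpt line qualifies; with a looser cutoff, the two identical sequences from 4A0F would collapse into a single record.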

As with other programs, running seqAnal.pl without arguments will print a full usage statement. Besides simply pointing out the unique close hits, the script can also draw sequence logo diagrams (provided a standalone version of WebLogo is installed on your machine and either a path to it is given to the script or the environment variable $WEBLOGO_EXEC is set to point to the installation). Such diagrams can reveal sequence features important for realizing the given motif (which can be necessary for stability, fold specificity, or other reasons). However, in cases where there exist two or more sequence sub-motifs, each of which encodes a subtly different variation of the query motif, a simple sequence logo may fail to reveal these important features (as the total information content in each position may appear low). For such cases, it may be useful to first cluster the close matches structurally, and then derive sequence logos for each cluster separately. seqAnal.pl provides this capability as well (the environment variable $MADCAT_HOME must be set to the directory where you compiled MaDCaT for this to work). For example, applying this to the above motif's close matches reveals several clusters, sequence logos for two of which are shown below:

*Sequence logos corresponding to two sub-clusters from within the close matches to the above query.*

To generate this information with seqAnal.pl, one could use a command like this:

perl -w $MADCAT_HOME/scripts/seqAnal.pl -s motif.seq -c 1.5 --sdir motif.struct --rbeg 1 --rlen 28 --rmsc 0.8 -o motif.eps

This asks seqAnal.pl to consider all unique matches to the query with an RMSD below 1.5 Å, then cluster those with a distance cutoff of 0.8 Å (greedy clustering is used), and output various information for the resulting clusters (the sequence logo, if WebLogo is installed, which matches make up the cluster, and their sequences) into files whose names are based on the base name provided in the -o option. Switches --rbeg and --rlen designate the beginning and length of the region of structure used in clustering. The query structure is in our case 28 residues, so the entire structure is considered in clustering. But sometimes it is convenient to focus on variations only in a certain part of the structure. Many more options are available; please see the usage statement.
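
For readers unfamiliar with greedy clustering, the Python sketch below shows one common variant: repeatedly pick the item with the most remaining neighbors within the cutoff, make it a cluster center, and remove it together with those neighbors. The exact procedure used by seqAnal.pl is not described here, so this is an assumption for illustration only. The function name greedy_cluster and the pairwise RMSD values are made up.

```python
def greedy_cluster(dist, cutoff):
    """Greedy clustering of items given a symmetric pairwise distance matrix.

    Returns a list of (center, members) tuples. Ties between candidate
    centers are broken in favor of the lowest index.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        best_center, best_members = None, []
        for i in sorted(remaining):
            # Neighbors of i among items not yet assigned (includes i itself).
            members = [j for j in sorted(remaining) if dist[i][j] <= cutoff]
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, best_members))
        remaining -= set(best_members)
    return clusters


# Made-up pairwise RMSD values (in Å) between four hypothetical matches.
rmsd = [
    [0.0, 0.5, 0.6, 2.0],
    [0.5, 0.0, 1.0, 2.1],
    [0.6, 1.0, 0.0, 1.9],
    [2.0, 2.1, 1.9, 0.0],
]
for center, members in greedy_cluster(rmsd, cutoff=0.8):
    print("center", center, "members", members)
```

Note that in this variant cluster membership is defined relative to the center, so two members of the same cluster (items 1 and 2 in the example) may be farther apart from each other than the cutoff.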

A web-based search tool

A web-based applet for MaDCaT lets you search for a structural fragment, specified as a PDB file, with optional breaks in the structure. To limit server load, the web interface is currently limited to searching against a randomly sub-sampled set of non-redundant structures from the PDB (currently 1,000 structures) and returns the top 1,000 hits. Results from such a search can be used to determine whether a given structural motif is common, but will not find every matching instance of it in the PDB. For the latter, use of the stand-alone program is recommended.

Reference

If you use MaDCaT in your research, please cite the following paper:

Zhang J., Grigoryan G., "Mining Tertiary Structural Motifs for Assessment of Designability", Methods in Enzymology, 523: 21-40, 2013.

Dartmouth College, Hanover, NH, 03755 USA
Departments of Computer Science, Biological Sciences, and Chemistry
Institute for Biomolecular Targeting