sstrajectory

SSTrajectory Class

Reading in trajectories starts with the SSTrajectory class. Upon reading, SOURSOP extracts out individual protein chains which can then be subsequently analyzed.

By way of a quickstart example:

from soursop.sstrajectory import SSTrajectory
import numpy as np

# read in a trajectory
TrajOb = SSTrajectory('traj.xtc', 'start.pdb')

# get the first protein chain (this is an SSProtein object)
ProtObj = TrajOb.proteinTrajectoryList[0]

# print the mean radius of gyration
print(np.mean(ProtObj.get_radius_of_gyration()))

The set of possible SSProtein functions can be found in the ssprotein documentation page (quick-link to the left in the navigation bar). The remainder of these documents walk through trajectory initialization, SSTrajectory class variables, SSTrajectory properties, and SSTrajectory functions.

class soursop.sstrajectory.SSTrajectory(trajectory_filename=None, pdb_filename=None, TRJ=None, protein_grouping=None, pdblead=False, debug=False, extra_valid_residue_names=None, explicit_residue_checking=False)[source]
__init__(trajectory_filename=None, pdb_filename=None, TRJ=None, protein_grouping=None, pdblead=False, debug=False, extra_valid_residue_names=None, explicit_residue_checking=False)[source]

SSTrajectory is the class used to read in a work with simulation trajectories in SOURSOP.

There are two ways new SSTrajectory objects can be generated;

  1. By reading from disk (i.e. passing in a trajectory file and a topology file.

  2. By passing in an existing trajectory object that has already been read in.

Both of these are described below.

SOURSOP will, by default, extract out the protein component from your trajectory automatically, which lets you ask questions about the protein only (i.e. without salt ions getting in the way).

You can also explicitly define which protein groups should be considered independently.

Note that by default the mechanism by which individual proteins are identified is by cycling through the unique chains and determining if they are protein or not. You can also provide manual grouping via the protein_grouping option, which lets you define which residues should make up an individual protein. This can be useful if you have multiple proteins associated with the same chain, which happens in CAMPARI if you have more than 26 separate chains (i.e. every protein after the 26th is the ‘Z’ chain).

Class Variables

.proteinTrajectoryListlist

List of SSProtein objects which themselves contain a large set of associated functions

.trajmdtraj.trajectory

The underlying mdtraj.trajectory object enables any/all mdtraj analyses to be performed as one might want normally.

Parameters:
  • trajectory_filename (str) – Filename which contains the trajectory file of interest. Normally this is __traj.xtc or __traj.dcd.

  • pdb_filename (str) – Filename which contains the pdb file associated with the trajectory of interest. Normally this is __START.pdb.

  • TRJ (mdtraj.Trajectory) – It is sometimes useful to re-defined a trajectory and create a new SSTrajectory object from that trajectory. This could be done by writing that new trajectory to file, but this is extremely slow due to the I/O impact of reading/writing from disk. If an mdtraj trajectory objected is passed, this is used as the new trajectory from which the SSTrajectory object is constructed. Default = None

  • protein_grouping (list of lists of ints) – Lets you manually define protein groups to be considered independently. Default = None

  • pdblead (bool) – Lets you set the PDB file (which is normally ONLY used as a topology file) to be the first frame of the trajectory. This is useful when the first PDB file holds some specific reference information which you want to use (e.g. RMSD or Q). Default = False

  • debug (book) – Prints warning/help information to help debug weird stuff during initial trajectory read-in. Default = False.

  • extra_valid_residue_names (list) –

    By default, SOURSOP identifies chains as proteins based on the a set of normally seen protein residue names. These are defined in soursop/ssdata, and are listed below:

    'ALA', 'CYS', 'ASP', 'ASH', 'GLU', 'GLU', 'PHE', 'GLY',
    'HIE', 'HIS', 'HID', 'HIP', 'ILE', 'LEU', 'LYS', 'LYD',
    'MET', 'ASN', 'PRO', 'GLN', 'ARG', 'SER', 'THR', 'VAL',
    'TRP', 'TYR', 'AIB', 'ABA', 'NVA', 'NLE', 'ORN', 'DAB',
    'PTR', 'TPO', 'SEP', 'KAC', 'KM1', 'KM2'  'KM3', 'ACE',
    'NME', 'FOR', 'NH2'
    

    This keyword allows the user to pass a list of ADDITIONAL residues that we want SOURSOP to recognize as valid residues to extract a chain as a protein molecule. This can be especially useful if you want to trick SOURSOP into analyzing polymer simulations where your PDB file may have non-standard residue names (e.g., XXX).

  • explicit_residue_checking (bool) – In early versions of SOURSOP we operate under the assumption that every residue in a chain will be protein if the first residue is a protein. In general this is a good assumption, especially because it means we can implicitly deal with non-standard residue names that are internal to a chain. However, if you’re reading in a .gro file, or any kind of topology file that lacks explicit chains, then any non-protein residues end up being brought in and counted which if you have 10000s of solvent molecules makes parsing impossible. If explicit_residue_checking is set to True, then each residue in each chain is explicitly checked, meaning solvent molecules/atoms in a single chain are discarded.

Example

Example of reading in an XTC trajectory file:

from soursop.sstrajectory import SSTrajectory

TrajOb = SSTrajector('traj.xtc','start.pdb')

SSTrajectory Properties

Properties are class variables that are dynamically calculated. They do not require parentheses at the end. For example:

from soursop.sstrajectory import SSTrajectory

# read in a trajectory
TrajOb = SSTrajectory('traj.xtc', 'start.pdb')

print(TrajOb.n_frames)

Will print the number of frames associated with this trajectory.

Properties

SSTrajectory.n_frames()
Property:

Returns the number of frames in the trajectory.

SSTrajectory.n_proteins()
Property:

Returns the number of individual proteins found.

SSTrajectory.length()[source]
Property:

Returns a tuple with number of proteins and number of frames.

In addition to these three functions, the Python builtin len() returns the number of frames in a trajectory.

SSTrajectory Functions

Functions enable operations to be performed on the entire system. Note if you wish to obtain information about a specific protein in isolation, you should use the SSProtein objects, but for information where multiple proteins are considered all operations should be performed at the level of the SSTrajectory object.

Functions

SSTrajectory.get_overall_radius_of_gyration()[source]

Function which returns the per-frame OVERALL radius of gyration for every protein residue in the trajectory. For systems with multiple protein chains, all chains are combined together. For systems with a single protein chain, this function offers no advantage over interacting directly with the SSProtein object in the .proteinTrajectoryList.

Parameters:

None

Returns:

Returns a numpy array with per-frame instantaneous radius of gyration

Return type:

np.ndarray

SSTrajectory.get_overall_hydrodynamic_radius()[source]

Function which returns the per-frame OVERALL hydrodynamic radius for everyprotein residue in the trajectory. For systems with multiple protein chains, all chains are combined together. For systems with a single protein chain, this function offers no advantage over interacting directly with the SSProtein object in the underlying .proteinTrajectoryList object.

Parameters:

None

Returns:

Returns a numpy array with per-frame instantaneous asphericity

Return type:

np.ndarray

SSTrajectory.get_overall_asphericity()[source]

Function which returns the per-frame OVERALL asphericity for every protein residue in the trajectory. For systems with multiple protein chains, all chains are combined together. For systems with a single protein chain, this function offers no advantage over interacting directly with the SSProtein object in the underlying .proteinTrajectoryList object.

Parameters:

None

Returns:

Returns a numpy array with per-frame instantaneous asphericity

Return type:

np.ndarray

SSTrajectory.get_interchain_distance_map(proteinID1, proteinID2, mode='CA')[source]

Function which returns two matrices with the mean and standard deviation distances between the residues in resID1 from proteinID1 and resID2 from proteinID2.

This computes the (full) intramolecular distance map, where the “distancemap” function computes the intermolecular distance map.

Specifically, this allows the user to define two distinct chains (i.e. an “interchain” distance map).

Obviously this only makes sense if your system has two separate protein objects defined, but in principle the output from:

TrajObj # TrajOb is an SSTrajectory object

TrajObj.get_interchain_distance_map(0,0)

would be the same as:

TrajObj # TrajOb is an SSTrajectory object

TrajObj.proteinTrajectoryList[0].get_distance_map()

This is actually a useful sanity check!

Parameters:
  • proteinID1 (int) – The ID of the first protein of the two being considered, where the ID is the proteins position in the self.proteinTrajectoryList list.

  • proteinID2 (int) – The ID of the second protein of the two being considered, where the ID is the proteins position in the self.proteinTrajectoryList list

  • mode (str (default = 'CA')) – String, must be one of either ‘CA’ or ‘COM’, where CA means alpha carbon and COM means center of mass.

Returns:

get_interchain_distance_map() returns a tuple containing two elements, distanceMap and STDMap.

distanceMap is an [n x m] numpy matrix where n and m are the number of proteinID1 residues and proteinID2 residues. Each position in the matrix corresponds to the mean distance between those two residues over the course of the simulation.

stdMap is an [n x m] numpy matrix where n and m are the number of proteinID1 residues and proteinID2 residues. Each position in the matrix corresponds to the standard devaiation associated with the distances between those two residues.

Return type:

tuple

SSTrajectory.get_interchain_contact_map(proteinID1, proteinID2, threshold=5.0, mode='atom', A1='CA', A2='CA', periodic=False, verbose=False)[source]

Function which returns a matrix with inter-residue contact fractions i.e., this returns the fraction of simulation that each residue from one chain is in contact with each residue from the other chain, where “contact” is defined as the inter-residue distance being below the passed threshold.

By default the mode here is CA-CA distance, which depending on the question may not be what you want. See the various options under ‘mode’ for alternatives.

Note that this analysis can take some time for large trajectories. If this is an issue consider setting the verbose flag to True, and the progress will be printed. By default this is off, but it can be reassuring to confirm things are running.

Parameters:
  • proteinID1 (int) – The ID of the first protein of the two being considered, where the ID is the proteins position in the self.proteinTrajectoryList list.

  • proteinID2 (int) – The ID of the second protein of the two being considered, where the ID is the proteins position in the self.proteinTrajectoryList list.

  • threshold (float) – Distance that is used as the distance cutoff to define something as a contact or not

  • mode (str (default = 'atom')) –

    Mode allows the user to define different modes for computing atomic distance.

    The default is atom whereby a pair of atoms (A1 and A2) are provided. Other options are detailed below and are identical to those offered by mdtraj in compute_contacts.

    Note that if modes other than atom are used the A1 and A2 options are ignored.

    • ca - same as setting atom and then defining atoms 1 and 2 (A1 and A2) as CA.

    • closest - closest atom associated with each of the residues, i.e. the point of closest approach between the two residues.

    • closest-heavy - same as ‘closest’, except only non- hydrogen atoms are considered.

    • sidechain - closest atom where that atom is in the sidechain.

    • sidechain-heavy - closest atom where that atom is in the sidechain and is heavy.

  • A1 (str (default = 'CA')) – Atom name of the atom in R1 we’re looking at.

  • A2 (str (default = 'CA')) – Atom name of the atom in R2 we’re looking at.

  • periodic (bool (default = False)) – Flag which if distances mode is passed as anything other than ‘atom’ then this determines if the minimum image convention should be used. Note that this is only available if pdb crystal dimensions are provided, and in general it’s better to set this to false and center the molecule first. Default = False.

  • verbose (bool) – Flag which, if set to true, will print each iteraction through the outer loop of each residue in the protein as the distances are calculated. Default = False.

Returns:

Returns an matrix with inter-residue contact fractions reported at each intersection.

Return type:

np.ndarray [n x m]

SSTrajectory.get_interchain_distance(proteinID1, proteinID2, R1, R2, A1='CA', A2='CA', mode='atom', periodic=False)[source]

Function which returns the distance between two specific atoms on two residues, or between two residues based on mdtraj’s atom selection mode rules (discussed below). Required input are protein ID selectors and the resid being used. Resids should be used as would be normally used for the SSProtein objects associated with proteinID1 and proteinID2.

For inter-atomic distances, atoms are selected from the passed residue and their ‘name’ field from the topology selection language (e.g. “CA”, “CB” “NH” etc). By default CA atoms are used, but one can define any residue of interest. We do not perform any sanity checking on the atom name - this gets really hard - so have an explicit try/except block which will warn you that you’ve probably selected an illegal atom name from the residues.

For inter-residue distances the associated rules are defined by the ‘mode’ selector. By default mode is set to ‘atom’, which means the variables A1 and A2 are used (with CA as default) to define inter- residue distance. However, if one of the other modes are used the A1/A2 parameters are ignored and alternative rules for computing inter-residue distance are used. These modes are detailed below.

Distance is returned in Angstroms.

Parameters:
  • proteinID1 (int) – Index of the first protein of interest

  • proteinID2 (int) – Index of the second protein of interest

  • R1 (int) – Residue index of first residue

  • R2 (int) – Residue index of second residue

  • A1 (str (default = 'CA')) – Atom name of the atom in R1 we’re looking at.

  • A2 (str (default = 'CA')) – Atom name of the atom in R2 we’re looking at.

  • mode (str (default = 'atom')) –

    Mode allows the user to define different modes for computing atomic distance.

    The default is atom whereby a pair of atoms (A1 and A2) are provided. Other options are detailed below and are identical to those offered by mdtraj in compute_contacts.

    Note that if modes other than atom are used the A1 and A2 options are ignored.

    • ca - same as setting atom and then defining atoms 1 and 2 (A1 and A2) as CA.

    • closest - closest atom associated with each of the residues, i.e. the point of closest approach between the two residues.

    • closest-heavy - same as ‘closest’, except only non- hydrogen atoms are considered.

    • sidechain - closest atom where that atom is in the sidechain.

    • sidechain-heavy - closest atom where that atom is in the sidechain and is heavy.

periodicbool (default = False)

Flag which if distances mode is passed as anything other than ‘atom’ then this determines if the minimum image convention should be used. Note that this is only available if pdb crystal dimensions are provided, and in general it’s better to set this to false and

center the molecule first. Default = False.

Returns:

Returns a 1D numpy array with the distance-per-frame betwee the specified residues

Return type:

np.array