sstrajectory

Overview

SSTrajectory is the entry point for all SOURSOP analyses. It reads a simulation trajectory from disk (or accepts a pre-loaded mdtraj.Trajectory object), identifies every protein chain present, and exposes each chain as an SSProtein object via the proteinTrajectoryList attribute.

Relationship to SSProtein. SSTrajectory operates at the system level; SSProtein operates at the chain level. For analyses of a single isolated chain — radius of gyration, end-to-end distance, secondary structure, internal scaling, etc. — retrieve the relevant SSProtein from proteinTrajectoryList and call methods on that object. SSTrajectory functions are reserved for analyses that span multiple chains, such as inter-chain distance maps and contact maps, or for system-wide observables that combine all chains into a single pseudo-chain (overall \(R_g\), \(R_h\), asphericity).

Supported formats. Any trajectory format supported by mdtraj can be read in (XTC, DCD, TRR, NetCDF, etc.), paired with a compatible topology file (PDB, GRO, etc.).

Typical workflow:

from soursop.sstrajectory import SSTrajectory
import numpy as np

# read in a trajectory and topology
TrajOb = SSTrajectory('traj.xtc', 'start.pdb')

# SSProtein objects for each chain
protein = TrajOb.proteinTrajectoryList[0]

# system-level observable (all chains combined)
rg_overall = TrajOb.get_overall_radius_of_gyration()
print(f"Mean overall Rg = {np.mean(rg_overall):.2f} Å")

# chain-level observable via SSProtein
print(f"Mean chain Rg   = {np.mean(protein.get_radius_of_gyration()):.2f} Å")

The SSProtein API is documented on the ssprotein page (quick-link in the left navigation bar). The remainder of this page covers trajectory initialisation, class properties, and SSTrajectory-level functions.

class soursop.sstrajectory.SSTrajectory(trajectory_filename=None, pdb_filename=None, TRJ=None, protein_grouping=None, pdblead=False, debug=False, extra_valid_residue_names=None, explicit_residue_checking=False, print_warnings=False)[source]
__init__(trajectory_filename=None, pdb_filename=None, TRJ=None, protein_grouping=None, pdblead=False, debug=False, extra_valid_residue_names=None, explicit_residue_checking=False, print_warnings=False)[source]

SSTrajectory is the class used to read in a work with simulation trajectories in SOURSOP.

There are two ways new SSTrajectory objects can be generated;

  1. By reading from disk (i.e. passing in a trajectory file and a topology file.

  2. By passing in an existing trajectory object that has already been read in.

Both of these are described below.

SOURSOP will, by default, extract out the protein component from your trajectory automatically, which lets you ask questions about the protein only (i.e. without salt ions getting in the way).

You can also explicitly define which protein groups should be considered independently.

Note that by default the mechanism by which individual proteins are identified is by cycling through the unique chains and determining if they are protein or not. You can also provide manual grouping via the protein_grouping option, which lets you define which residues should make up an individual protein. This can be useful if you have multiple proteins associated with the same chain, which happens in CAMPARI if you have more than 26 separate chains (i.e. every protein after the 26th is the ‘Z’ chain).

Class Variables

.proteinTrajectoryListlist

List of SSProtein objects which themselves contain a large set of associated functions

.trajmdtraj.trajectory

The underlying mdtraj.trajectory object enables any/all mdtraj analyses to be performed as one might want normally.

Parameters:
  • trajectory_filename (str) – Filename which contains the trajectory file of interest. Normally this is __traj.xtc or __traj.dcd.

  • pdb_filename (str) – Filename which contains the pdb file associated with the trajectory of interest. Normally this is __START.pdb.

  • TRJ (mdtraj.Trajectory) – It is sometimes useful to re-defined a trajectory and create a new SSTrajectory object from that trajectory. This could be done by writing that new trajectory to file, but this is extremely slow due to the I/O impact of reading/writing from disk. If an mdtraj trajectory objected is passed, this is used as the new trajectory from which the SSTrajectory object is constructed. Default = None

  • protein_grouping (list of lists of ints) – Lets you manually define protein groups to be considered independently. Default = None

  • pdblead (bool) – Lets you set the PDB file (which is normally ONLY used as a topology file) to be the first frame of the trajectory. This is useful when the first PDB file holds some specific reference information which you want to use (e.g. RMSD or Q). Default = False

  • debug (book) – Prints warning/help information to help debug weird stuff during initial trajectory read-in. Default = False.

  • extra_valid_residue_names (list) –

    By default, SOURSOP identifies chains as proteins based on the a set of normally seen protein residue names. These are defined in soursop/ssdata, and are listed below:

    'ALA', 'CYS', 'ASP', 'ASH', 'GLU', 'GLU', 'PHE', 'GLY',
    'HIE', 'HIS', 'HID', 'HIP', 'ILE', 'LEU', 'LYS', 'LYD',
    'MET', 'ASN', 'PRO', 'GLN', 'ARG', 'SER', 'THR', 'VAL',
    'TRP', 'TYR', 'AIB', 'ABA', 'NVA', 'NLE', 'ORN', 'DAB',
    'PTR', 'TPO', 'SEP', 'KAC', 'KM1', 'KM2'  'KM3', 'ACE',
    'NME', 'FOR', 'NH2'
    

    This keyword allows the user to pass a list of ADDITIONAL residues that we want SOURSOP to recognize as valid residues to extract a chain as a protein molecule. This can be especially useful if you want to trick SOURSOP into analyzing polymer simulations where your PDB file may have non-standard residue names (e.g., XXX).

  • explicit_residue_checking (bool) – In early versions of SOURSOP we operate under the assumption that every residue in a chain will be protein if the first residue is a protein. In general this is a good assumption, especially because it means we can implicitly deal with non-standard residue names that are internal to a chain. However, if you’re reading in a .gro file, or any kind of topology file that lacks explicit chains, then any non-protein residues end up being brought in and counted which if you have 10000s of solvent molecules makes parsing impossible. If explicit_residue_checking is set to True, then each residue in each chain is explicitly checked, meaning solvent molecules/atoms in a single chain are discarded.

  • print_warnings (bool) – Print warnings if the unit cell lengths are zero or not set. This is a common issue with old CAMPARI trajectories or trajectories generated without CRYSTAL records, but is generally not an issue. Default = False

Example

Example of reading in an XTC trajectory file:

from soursop.sstrajectory import SSTrajectory

TrajOb = SSTrajector('traj.xtc','start.pdb')

SSTrajectory Properties

Properties are class variables that are dynamically calculated. They do not require parentheses at the end. For example:

from soursop.sstrajectory import SSTrajectory

# read in a trajectory
TrajOb = SSTrajectory('traj.xtc', 'start.pdb')

print(TrajOb.n_frames)

Will print the number of frames associated with this trajectory.

Properties

SSTrajectory.n_frames

Number of frames in the underlying mdtraj trajectory.

Return type:

int

Example

>>> traj.n_frames
1000
SSTrajectory.n_proteins

Number of distinct protein chains identified during loading.

For a single-chain PDB this is 1; for multi-chain inputs (or when a manual protein_grouping list was provided to SSTrajectory) the value matches the number of chains.

Return type:

int

Example

>>> two_chain_traj.n_proteins
2
SSTrajectory.length

Convenience (n_proteins, n_frames) summary tuple.

Preserves the historic behaviour of __len__ from before SOURSOP switched its len(traj) semantics to match mdtraj (frame count only).

Returns:

(n_proteins, n_frames).

Return type:

tuple of (int, int)

Example

>>> traj.length
(2, 1000)

In addition to these three functions, the Python builtin len() returns the number of frames in a trajectory.

SSTrajectory Functions

Functions enable operations to be performed on the entire system. Note if you wish to obtain information about a specific protein in isolation, you should use the SSProtein objects, but for information where multiple proteins are considered all operations should be performed at the level of the SSTrajectory object.

Functions

SSTrajectory.get_overall_radius_of_gyration(weights=False, etol=1e-07)[source]

Per-frame radius of gyration computed across every protein chain.

For multi-chain systems the chains are combined into a single “pseudo-chain” before computing \(R_g\). For single-chain systems the result is identical to self.proteinTrajectoryList[0].get_radius_of_gyration().

Warning

This does NOT apply periodic-boundary corrections. If your chains are split across PBC images, centre the molecule first.

Parameters:
  • weights (array_like or False, optional) – Per-frame re-weighting vector forwarded to SSProtein.get_radius_of_gyration(). False (default) returns the per-frame array; if supplied the scalar deterministic weighted-mean \(R_g\) is returned.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame \(R_g\) (length n_frames), or the scalar weighted mean if weights is supplied. Angstroms.

Return type:

np.ndarray or float

Example

>>> rg_all = traj.get_overall_radius_of_gyration()
SSTrajectory.get_overall_hydrodynamic_radius(weights=False, etol=1e-07)[source]

Per-frame hydrodynamic radius computed across every protein chain.

For multi-chain systems the chains are combined into a single “pseudo-chain” before computing \(R_h\). Uses the default Nygaard-et-al estimator. For single-chain systems the result equals self.proteinTrajectoryList[0].get_hydrodynamic_radius().

Parameters:
  • weights (array_like or False, optional) – Per-frame re-weighting vector forwarded to SSProtein.get_hydrodynamic_radius(). False (default) returns the per-frame array; if supplied the scalar deterministic weighted-mean \(R_h\) is returned.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame \(R_h\) (length n_frames), or the scalar weighted mean if weights is supplied. Angstroms.

Return type:

np.ndarray or float

Example

>>> rh_all = traj.get_overall_hydrodynamic_radius()
SSTrajectory.get_overall_asphericity(weights=False, etol=1e-07)[source]

Per-frame asphericity computed across every protein chain.

For multi-chain systems the chains are combined into a single “pseudo-chain” before computing the asphericity. For single-chain systems the result equals self.proteinTrajectoryList[0].get_asphericity().

Warning

This does NOT apply periodic-boundary corrections.

Parameters:
  • weights (array_like or False, optional) – Per-frame re-weighting vector forwarded to SSProtein.get_asphericity(). False (default) returns the per-frame array; if supplied the scalar deterministic weighted-mean asphericity is returned.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame asphericity (length n_frames), or the scalar weighted mean if weights is supplied.

Return type:

np.ndarray or float

Example

>>> asph_all = traj.get_overall_asphericity()
SSTrajectory.get_interchain_distance_map(proteinID1, proteinID2, mode='CA', periodic=False, weights=False, etol=1e-07)[source]

Mean and std inter-chain distance map between two protein chains.

Returns an (n_res_P1, n_res_P2) matrix of per-pair mean inter-residue distances along with the matching per-pair standard deviations. Distances are in Angstroms.

Calling get_interchain_distance_map(i, i) produces the intra-chain distance map of protein i, which is the same as self.proteinTrajectoryList[i].get_distance_map() and makes a useful sanity check.

Parameters:
  • proteinID1 (int) – Index of the first protein in self.proteinTrajectoryList.

  • proteinID2 (int) – Index of the second protein in self.proteinTrajectoryList.

  • mode ({'CA', 'COM'}, optional) –

    • 'CA' (default) - distances between alpha-carbon atoms.

    • 'COM' - distances between residue centres of mass.

  • periodic (bool, optional) – If True, apply the minimum-image convention. Requires a recorded periodic box and a cubic cell. Generally it is better to centre the molecule first and leave this False. Default False.

  • weights (array_like or False, optional) – Per-frame re-weighting vector (validated against the shared trajectory frame count). False (default) gives the ordinary per-pair mean/std (byte-identical to before); if supplied each pair’s mean/std is the deterministic weighted mean / weighted (population) std over frames.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

(distance_map, std_map) each of shape (n_res_P1, n_res_P2), in Angstroms.

Return type:

tuple of (np.ndarray, np.ndarray)

Example

>>> dmap, smap = two_chain_traj.get_interchain_distance_map(0, 1)
>>> dmap.shape
(58, 58)
SSTrajectory.get_interchain_contact_map(proteinID1, proteinID2, threshold=5.0, mode='atom', A1='CA', A2='CA', periodic=False, stride=1, verbose=False)[source]

Inter-chain contact-fraction map between two protein chains.

Returns an (n_res_P1, n_res_P2) matrix where entry [i, j] is the fraction of (strided) frames in which residue i of chain proteinID1 is within threshold Angstroms of residue j of chain proteinID2 under the chosen distance mode. Calling with proteinID1 == proteinID2 produces the chain’s intra-chain contact map (caps and glycines contribute zero rows/columns for modes that can’t be evaluated for them).

Parameters:
  • proteinID1 (int) – Index of the first protein in self.proteinTrajectoryList.

  • proteinID2 (int) – Index of the second protein in self.proteinTrajectoryList.

  • threshold (float, optional) – Contact distance cutoff in Angstroms. Default 5.0.

  • mode ({'atom', 'ca', 'closest', 'closest-heavy', 'sidechain', 'sidechain-heavy'}, optional) – How inter-residue distance is defined (see SSProtein.get_inter_residue_atomic_distance() for full descriptions). For mode='atom' the A1 / A2 atom names are used; for every other mode they are ignored. Default 'atom'.

  • A1 (str, optional) – Atom name in residue R1 when mode='atom'. Default 'CA'.

  • A2 (str, optional) – Atom name in residue R2 when mode='atom'. Default 'CA'.

  • periodic (bool, optional) – If True, apply the minimum-image convention to the mdtraj-driven modes (closest, closest-heavy, sidechain, sidechain-heavy). Default False.

  • stride (int, optional) – Use every stride-th frame when computing the per-pair distances; the contact fraction is then sum(distances < threshold) / n_strided_frames. Default 1.

  • verbose (bool, optional) – If True, print one status line per residue in the outer loop. Default False.

Returns:

Array of shape (n_res_P1, n_res_P2) of contact fractions in [0, 1].

Return type:

np.ndarray

Raises:

SSException – If either proteinID is out of range, mode is invalid, stride is not a positive integer, or every per-residue inner call failed (e.g. an invalid atom name was supplied).

Example

>>> cmap = two_chain_traj.get_interchain_contact_map(0, 1, mode='closest-heavy')
>>> cmap.shape
(58, 58)
SSTrajectory.get_interchain_distance(proteinID1, proteinID2, R1, R2, A1='CA', A2='CA', mode='atom', periodic=False, stride=1)[source]

Per-frame distance between two residues, one in each chain.

Resids are interpreted as they would be inside the corresponding SSProtein (so the same R1=5 means “residue 5 of chain proteinID1” rather than a global topology index).

For mode='atom' the distance is between the named atoms A1 of R1 in chain 1 and A2 of R2 in chain 2. For every other mode the A1 / A2 arguments are ignored and the distance comes from mdtraj.compute_contacts.

Distance is returned in Angstroms.

Parameters:
  • proteinID1 (int) – Index of the first protein in self.proteinTrajectoryList.

  • proteinID2 (int) – Index of the second protein.

  • R1 (int) – Resids to measure between (R1 in chain 1, R2 in chain 2).

  • R2 (int) – Resids to measure between (R1 in chain 1, R2 in chain 2).

  • A1 (str, optional) – For mode='atom', the atom names. Defaults 'CA'.

  • A2 (str, optional) – For mode='atom', the atom names. Defaults 'CA'.

  • mode ({'atom', 'ca', 'closest', 'closest-heavy', 'sidechain', 'sidechain-heavy'}, optional) – How the distance is defined. See SSProtein.get_inter_residue_atomic_distance() for full descriptions. Default 'atom'.

  • periodic (bool, optional) – If True, apply the minimum-image convention for the mdtraj-driven modes. Default False.

  • stride (int, optional) – Use every stride-th frame. Returned array has length ceil(n_frames / stride). Must be a positive integer. Default 1.

Returns:

1D array of length ceil(n_frames / stride) with the inter-chain distance in each (strided) frame, in Angstroms.

Return type:

np.ndarray

Raises:

SSException – If mode is invalid, a protein ID is out of range, a resid has no atoms, an atom name cannot be found, or stride is not a positive integer.

Example

>>> d = two_chain_traj.get_interchain_distance(0, 1, R1=10, R2=15)
>>> d.shape
(1000,)