sstrajectory
Overview
SSTrajectory is the entry point for all SOURSOP analyses. It reads a simulation trajectory from disk (or accepts a pre-loaded mdtraj.Trajectory object), identifies every protein chain present, and exposes each chain as an SSProtein object via the proteinTrajectoryList attribute.
Relationship to SSProtein. SSTrajectory operates at the system level; SSProtein operates at the chain level. For analyses of a single isolated chain — radius of gyration, end-to-end distance, secondary structure, internal scaling, etc. — retrieve the relevant SSProtein from proteinTrajectoryList and call methods on that object. SSTrajectory functions are reserved for analyses that span multiple chains, such as inter-chain distance maps and contact maps, or for system-wide observables that combine all chains into a single pseudo-chain (overall \(R_g\), \(R_h\), asphericity).
Supported formats. Any trajectory format supported by mdtraj can be read in (XTC, DCD, TRR, NetCDF, etc.), paired with a compatible topology file (PDB, GRO, etc.).
Typical workflow:
from soursop.sstrajectory import SSTrajectory
import numpy as np
# read in a trajectory and topology
TrajOb = SSTrajectory('traj.xtc', 'start.pdb')
# SSProtein objects for each chain
protein = TrajOb.proteinTrajectoryList[0]
# system-level observable (all chains combined)
rg_overall = TrajOb.get_overall_radius_of_gyration()
print(f"Mean overall Rg = {np.mean(rg_overall):.2f} Å")
# chain-level observable via SSProtein
print(f"Mean chain Rg = {np.mean(protein.get_radius_of_gyration()):.2f} Å")
The SSProtein API is documented on the ssprotein page (quick-link in the left navigation bar). The remainder of this page covers trajectory initialisation, class properties, and SSTrajectory-level functions.
- class soursop.sstrajectory.SSTrajectory(trajectory_filename=None, pdb_filename=None, TRJ=None, protein_grouping=None, pdblead=False, debug=False, extra_valid_residue_names=None, explicit_residue_checking=False, print_warnings=False)[source]
- __init__(trajectory_filename=None, pdb_filename=None, TRJ=None, protein_grouping=None, pdblead=False, debug=False, extra_valid_residue_names=None, explicit_residue_checking=False, print_warnings=False)[source]
SSTrajectory is the class used to read in a work with simulation trajectories in SOURSOP.
There are two ways new SSTrajectory objects can be generated;
By reading from disk (i.e. passing in a trajectory file and a topology file.
By passing in an existing trajectory object that has already been read in.
Both of these are described below.
SOURSOP will, by default, extract out the protein component from your trajectory automatically, which lets you ask questions about the protein only (i.e. without salt ions getting in the way).
You can also explicitly define which protein groups should be considered independently.
Note that by default the mechanism by which individual proteins are identified is by cycling through the unique chains and determining if they are protein or not. You can also provide manual grouping via the protein_grouping option, which lets you define which residues should make up an individual protein. This can be useful if you have multiple proteins associated with the same chain, which happens in CAMPARI if you have more than 26 separate chains (i.e. every protein after the 26th is the ‘Z’ chain).
Class Variables
- .proteinTrajectoryListlist
List of SSProtein objects which themselves contain a large set of associated functions
- .trajmdtraj.trajectory
The underlying mdtraj.trajectory object enables any/all mdtraj analyses to be performed as one might want normally.
- Parameters:
trajectory_filename (str) – Filename which contains the trajectory file of interest. Normally this is __traj.xtc or __traj.dcd.
pdb_filename (str) – Filename which contains the pdb file associated with the trajectory of interest. Normally this is __START.pdb.
TRJ (mdtraj.Trajectory) – It is sometimes useful to re-defined a trajectory and create a new SSTrajectory object from that trajectory. This could be done by writing that new trajectory to file, but this is extremely slow due to the I/O impact of reading/writing from disk. If an mdtraj trajectory objected is passed, this is used as the new trajectory from which the SSTrajectory object is constructed. Default = None
protein_grouping (list of lists of ints) – Lets you manually define protein groups to be considered independently. Default = None
pdblead (bool) – Lets you set the PDB file (which is normally ONLY used as a topology file) to be the first frame of the trajectory. This is useful when the first PDB file holds some specific reference information which you want to use (e.g. RMSD or Q). Default = False
debug (book) – Prints warning/help information to help debug weird stuff during initial trajectory read-in. Default = False.
extra_valid_residue_names (list) –
By default, SOURSOP identifies chains as proteins based on the a set of normally seen protein residue names. These are defined in soursop/ssdata, and are listed below:
'ALA', 'CYS', 'ASP', 'ASH', 'GLU', 'GLU', 'PHE', 'GLY', 'HIE', 'HIS', 'HID', 'HIP', 'ILE', 'LEU', 'LYS', 'LYD', 'MET', 'ASN', 'PRO', 'GLN', 'ARG', 'SER', 'THR', 'VAL', 'TRP', 'TYR', 'AIB', 'ABA', 'NVA', 'NLE', 'ORN', 'DAB', 'PTR', 'TPO', 'SEP', 'KAC', 'KM1', 'KM2' 'KM3', 'ACE', 'NME', 'FOR', 'NH2'
This keyword allows the user to pass a list of ADDITIONAL residues that we want SOURSOP to recognize as valid residues to extract a chain as a protein molecule. This can be especially useful if you want to trick SOURSOP into analyzing polymer simulations where your PDB file may have non-standard residue names (e.g., XXX).
explicit_residue_checking (bool) – In early versions of SOURSOP we operate under the assumption that every residue in a chain will be protein if the first residue is a protein. In general this is a good assumption, especially because it means we can implicitly deal with non-standard residue names that are internal to a chain. However, if you’re reading in a .gro file, or any kind of topology file that lacks explicit chains, then any non-protein residues end up being brought in and counted which if you have 10000s of solvent molecules makes parsing impossible. If explicit_residue_checking is set to True, then each residue in each chain is explicitly checked, meaning solvent molecules/atoms in a single chain are discarded.
print_warnings (bool) – Print warnings if the unit cell lengths are zero or not set. This is a common issue with old CAMPARI trajectories or trajectories generated without CRYSTAL records, but is generally not an issue. Default = False
Example
Example of reading in an XTC trajectory file:
from soursop.sstrajectory import SSTrajectory TrajOb = SSTrajector('traj.xtc','start.pdb')
SSTrajectory Properties
Properties are class variables that are dynamically calculated. They do not require parentheses at the end. For example:
from soursop.sstrajectory import SSTrajectory
# read in a trajectory
TrajOb = SSTrajectory('traj.xtc', 'start.pdb')
print(TrajOb.n_frames)
Will print the number of frames associated with this trajectory.
Properties
- SSTrajectory.n_frames
Number of frames in the underlying mdtraj trajectory.
- Return type:
int
Example
>>> traj.n_frames 1000
- SSTrajectory.n_proteins
Number of distinct protein chains identified during loading.
For a single-chain PDB this is 1; for multi-chain inputs (or when a manual
protein_groupinglist was provided toSSTrajectory) the value matches the number of chains.- Return type:
int
Example
>>> two_chain_traj.n_proteins 2
- SSTrajectory.length
Convenience
(n_proteins, n_frames)summary tuple.Preserves the historic behaviour of
__len__from before SOURSOP switched itslen(traj)semantics to match mdtraj (frame count only).- Returns:
(n_proteins, n_frames).- Return type:
tuple of (int, int)
Example
>>> traj.length (2, 1000)
In addition to these three functions, the Python builtin len() returns the number of frames in a trajectory.
SSTrajectory Functions
Functions enable operations to be performed on the entire system. Note if you wish to obtain information about a specific protein in isolation, you should use the SSProtein objects, but for information where multiple proteins are considered all operations should be performed at the level of the SSTrajectory object.
Functions
- SSTrajectory.get_overall_radius_of_gyration(weights=False, etol=1e-07)[source]
Per-frame radius of gyration computed across every protein chain.
For multi-chain systems the chains are combined into a single “pseudo-chain” before computing \(R_g\). For single-chain systems the result is identical to
self.proteinTrajectoryList[0].get_radius_of_gyration().Warning
This does NOT apply periodic-boundary corrections. If your chains are split across PBC images, centre the molecule first.
- Parameters:
weights (array_like or False, optional) – Per-frame re-weighting vector forwarded to
SSProtein.get_radius_of_gyration().False(default) returns the per-frame array; if supplied the scalar deterministic weighted-mean \(R_g\) is returned.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
Per-frame \(R_g\) (length
n_frames), or the scalar weighted mean ifweightsis supplied. Angstroms.- Return type:
np.ndarray or float
Example
>>> rg_all = traj.get_overall_radius_of_gyration()
- SSTrajectory.get_overall_hydrodynamic_radius(weights=False, etol=1e-07)[source]
Per-frame hydrodynamic radius computed across every protein chain.
For multi-chain systems the chains are combined into a single “pseudo-chain” before computing \(R_h\). Uses the default Nygaard-et-al estimator. For single-chain systems the result equals
self.proteinTrajectoryList[0].get_hydrodynamic_radius().- Parameters:
weights (array_like or False, optional) – Per-frame re-weighting vector forwarded to
SSProtein.get_hydrodynamic_radius().False(default) returns the per-frame array; if supplied the scalar deterministic weighted-mean \(R_h\) is returned.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
Per-frame \(R_h\) (length
n_frames), or the scalar weighted mean ifweightsis supplied. Angstroms.- Return type:
np.ndarray or float
Example
>>> rh_all = traj.get_overall_hydrodynamic_radius()
- SSTrajectory.get_overall_asphericity(weights=False, etol=1e-07)[source]
Per-frame asphericity computed across every protein chain.
For multi-chain systems the chains are combined into a single “pseudo-chain” before computing the asphericity. For single-chain systems the result equals
self.proteinTrajectoryList[0].get_asphericity().Warning
This does NOT apply periodic-boundary corrections.
- Parameters:
weights (array_like or False, optional) – Per-frame re-weighting vector forwarded to
SSProtein.get_asphericity().False(default) returns the per-frame array; if supplied the scalar deterministic weighted-mean asphericity is returned.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
Per-frame asphericity (length
n_frames), or the scalar weighted mean ifweightsis supplied.- Return type:
np.ndarray or float
Example
>>> asph_all = traj.get_overall_asphericity()
- SSTrajectory.get_interchain_distance_map(proteinID1, proteinID2, mode='CA', periodic=False, weights=False, etol=1e-07)[source]
Mean and std inter-chain distance map between two protein chains.
Returns an
(n_res_P1, n_res_P2)matrix of per-pair mean inter-residue distances along with the matching per-pair standard deviations. Distances are in Angstroms.Calling
get_interchain_distance_map(i, i)produces the intra-chain distance map of proteini, which is the same asself.proteinTrajectoryList[i].get_distance_map()and makes a useful sanity check.- Parameters:
proteinID1 (int) – Index of the first protein in
self.proteinTrajectoryList.proteinID2 (int) – Index of the second protein in
self.proteinTrajectoryList.mode ({'CA', 'COM'}, optional) –
'CA'(default) - distances between alpha-carbon atoms.'COM'- distances between residue centres of mass.
periodic (bool, optional) – If True, apply the minimum-image convention. Requires a recorded periodic box and a cubic cell. Generally it is better to centre the molecule first and leave this False. Default False.
weights (array_like or False, optional) – Per-frame re-weighting vector (validated against the shared trajectory frame count).
False(default) gives the ordinary per-pair mean/std (byte-identical to before); if supplied each pair’s mean/std is the deterministic weighted mean / weighted (population) std over frames.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
(distance_map, std_map)each of shape(n_res_P1, n_res_P2), in Angstroms.- Return type:
tuple of (np.ndarray, np.ndarray)
Example
>>> dmap, smap = two_chain_traj.get_interchain_distance_map(0, 1) >>> dmap.shape (58, 58)
- SSTrajectory.get_interchain_contact_map(proteinID1, proteinID2, threshold=5.0, mode='atom', A1='CA', A2='CA', periodic=False, stride=1, verbose=False)[source]
Inter-chain contact-fraction map between two protein chains.
Returns an
(n_res_P1, n_res_P2)matrix where entry[i, j]is the fraction of (strided) frames in which residueiof chainproteinID1is withinthresholdAngstroms of residuejof chainproteinID2under the chosen distance mode. Calling withproteinID1 == proteinID2produces the chain’s intra-chain contact map (caps and glycines contribute zero rows/columns for modes that can’t be evaluated for them).- Parameters:
proteinID1 (int) – Index of the first protein in
self.proteinTrajectoryList.proteinID2 (int) – Index of the second protein in
self.proteinTrajectoryList.threshold (float, optional) – Contact distance cutoff in Angstroms. Default 5.0.
mode ({'atom', 'ca', 'closest', 'closest-heavy', 'sidechain', 'sidechain-heavy'}, optional) – How inter-residue distance is defined (see
SSProtein.get_inter_residue_atomic_distance()for full descriptions). Formode='atom'theA1/A2atom names are used; for every other mode they are ignored. Default'atom'.A1 (str, optional) – Atom name in residue R1 when
mode='atom'. Default'CA'.A2 (str, optional) – Atom name in residue R2 when
mode='atom'. Default'CA'.periodic (bool, optional) – If True, apply the minimum-image convention to the mdtraj-driven modes (closest, closest-heavy, sidechain, sidechain-heavy). Default False.
stride (int, optional) – Use every
stride-th frame when computing the per-pair distances; the contact fraction is thensum(distances < threshold) / n_strided_frames. Default 1.verbose (bool, optional) – If True, print one status line per residue in the outer loop. Default False.
- Returns:
Array of shape
(n_res_P1, n_res_P2)of contact fractions in[0, 1].- Return type:
np.ndarray
- Raises:
SSException – If either
proteinIDis out of range,modeis invalid,strideis not a positive integer, or every per-residue inner call failed (e.g. an invalid atom name was supplied).
Example
>>> cmap = two_chain_traj.get_interchain_contact_map(0, 1, mode='closest-heavy') >>> cmap.shape (58, 58)
- SSTrajectory.get_interchain_distance(proteinID1, proteinID2, R1, R2, A1='CA', A2='CA', mode='atom', periodic=False, stride=1)[source]
Per-frame distance between two residues, one in each chain.
Resids are interpreted as they would be inside the corresponding
SSProtein(so the same R1=5 means “residue 5 of chainproteinID1” rather than a global topology index).For
mode='atom'the distance is between the named atomsA1of R1 in chain 1 andA2of R2 in chain 2. For every other mode theA1/A2arguments are ignored and the distance comes frommdtraj.compute_contacts.Distance is returned in Angstroms.
- Parameters:
proteinID1 (int) – Index of the first protein in
self.proteinTrajectoryList.proteinID2 (int) – Index of the second protein.
R1 (int) – Resids to measure between (R1 in chain 1, R2 in chain 2).
R2 (int) – Resids to measure between (R1 in chain 1, R2 in chain 2).
A1 (str, optional) – For
mode='atom', the atom names. Defaults'CA'.A2 (str, optional) – For
mode='atom', the atom names. Defaults'CA'.mode ({'atom', 'ca', 'closest', 'closest-heavy', 'sidechain', 'sidechain-heavy'}, optional) – How the distance is defined. See
SSProtein.get_inter_residue_atomic_distance()for full descriptions. Default'atom'.periodic (bool, optional) – If True, apply the minimum-image convention for the mdtraj-driven modes. Default False.
stride (int, optional) – Use every
stride-th frame. Returned array has lengthceil(n_frames / stride). Must be a positive integer. Default 1.
- Returns:
1D array of length
ceil(n_frames / stride)with the inter-chain distance in each (strided) frame, in Angstroms.- Return type:
np.ndarray
- Raises:
SSException – If
modeis invalid, a protein ID is out of range, a resid has no atoms, an atom name cannot be found, orstrideis not a positive integer.
Example
>>> d = two_chain_traj.get_interchain_distance(0, 1, R1=10, R2=15) >>> d.shape (1000,)