ssprotein
Overview
SSProtein is the core analysis class in SOURSOP and the primary interface for characterizing the conformational behaviour of a single protein chain across a simulation trajectory.
SSProtein objects are not created directly. Instead, they are extracted from an SSTrajectory object after reading a trajectory from disk. Each protein chain in the system is represented by its own SSProtein instance, accessible via SSTrajectory.proteinTrajectoryList.
Analyses fall into several broad categories:
Residue utilities — sequence retrieval, atom/CA index lookup, residue center-of-mass positions and masses.
Inter-residue distances — pairwise CA or COM distance matrices, distance maps, and polymer-scaled distance maps.
Global size and shape — radius of gyration, hydrodynamic radius, end-to-end distance, asphericity, gyration tensor, and the \(t\)-parameter.
Secondary structure — per-frame DSSP assignments and BBSEG backbone-torsion-based classification.
Polymer scaling — internal scaling profiles (\(\langle r^2 \rangle\) vs sequence separation), the scaling exponent \(\nu\), and local heterogeneity in scaling behaviour.
Contact and RMSD analysis — contact maps (with configurable threshold and mode), RMSD to a reference structure, and fraction of native contacts \(Q\).
Solvent accessibility — per-residue and region-level SASA via
get_all_SASA,get_regional_SASA, andget_site_accessibility.Local dynamics — local collapse profiles, sidechain alignment angles, dihedral mutual information, local-to-global correlation, and angle decay.
Clustering and overlap — conformational clustering via
get_clusters, overlap concentration \(c^*\).
Most functions return NumPy arrays; per-frame results have shape (n_frames,) or (n_frames, ...) so standard NumPy operations (np.mean, np.std, etc.) apply directly.
By way of an example:
from soursop.sstrajectory import SSTrajectory
import numpy as np
# read in a trajectory
TrajOb = SSTrajectory('traj.xtc', 'start.pdb')
# get the first protein chain (this is an SSProtein object)
ProtObj = TrajOb.proteinTrajectoryList[0]
# print the mean end-to-end distance
mean_e2e = np.mean(ProtObj.get_end_to_end_distance())
print(mean_e2e)
SSProtein Properties
SSProtein objects have a set of object variables associated with them.
- class soursop.ssprotein.SSProtein(traj, debug=False, check_one_bead_per_residue=True)[source]
- resid_with_CA
Resids of residues that have a CA (alpha-carbon) atom.
ACE and NME capping residues, water, ions, and any non-standard residue without a CA are excluded. Use this whenever you need to iterate over “real” residues and not the cap pseudo-residues.
- Returns:
Resids (0-indexed) of residues with a CA atom, in ascending order.
- Return type:
list of int
Example
>>> protein.resid_with_CA [1, 2, 3, ..., 92] # ctl9_AA has ACE at 0 and NME at 93
- ncap
True if an N-terminal capping residue (e.g. ACE) is present.
Inferred from whether resid 0 has a CA atom: if it does not, a cap is assumed to occupy resid 0.
- Returns:
True if an N-terminal cap is present, False otherwise.
- Return type:
bool
Example
>>> protein.ncap True
- ccap
True if a C-terminal capping residue (e.g. NME) is present.
Inferred from whether the last resid has a CA atom: if it does not, a cap is assumed to occupy that resid.
- Returns:
True if a C-terminal cap is present, False otherwise.
- Return type:
bool
Example
>>> protein.ccap True
- n_frames
Number of frames in the underlying mdtraj trajectory.
- Returns:
Total frame count of the simulation trajectory.
- Return type:
int
Example
>>> protein.n_frames 1000
- n_residues
Number of residues in the protein, including any cap residues.
- Returns:
Total residue count, counting ACE / NME caps if present.
- Return type:
int
Example
>>> protein.n_residues 94
- residue_index_list
All resids associated with this protein, extracted from the topology.
As of v0.1.3 the list always starts at 0 and increments by one up to
n_residues - 1. It is extracted live from the underlyingmdtraj.topology(and then cached) so it is useful as a ground-truth reference when debugging residue-index issues.- Returns:
Resids in ascending order; length equals
n_residues.- Return type:
list of int
Example
>>> protein.residue_index_list[:5] [0, 1, 2, 3, 4]
SSProtein Functions
- class soursop.ssprotein.SSProtein(traj, debug=False, check_one_bead_per_residue=True)[source]
- reset_cache()[source]
Clear every value the SSProtein has memoised so far.
Many SSProtein methods cache their first-call result (sequences, atom index tables, per-residue centres of mass, SASA results, dihedral angles). This method discards those caches and re-initialises the
resid_with_CA/ cap flags from the underlying topology. It is rarely necessary in normal use; reach for it if you have mutated the trajectory in-place and need the SSProtein to re-derive its tables, or if you are profiling cold-vs-warm performance.- Return type:
None
Example
>>> protein.get_radius_of_gyration() # first call: populates caches >>> protein.reset_cache() # discard all cached values >>> protein.get_radius_of_gyration() # cold-cache call again
- print_residues(verbose=True)[source]
Map each zero-indexed residue position to its PDB resname-resid label.
Useful when debugging cap/non-cap offsets or aligning SOURSOP residue indices against a published PDB numbering.
- Parameters:
verbose (bool, optional) – If True (default), also prints each
"<idx> --> <resname-resid>"line to stdout. If False, the list is returned silently.- Returns:
One
[index, resname-resid]pair per residue, in topology order.- Return type:
list of [int, str]
Example
>>> mapping = protein.print_residues(verbose=False) >>> mapping[:3] [[0, 'ACE-0'], [1, 'MET-1'], [2, 'ALA-2']]
- get_amino_acid_sequence(oneletter=False, numbered=True)[source]
Return the amino-acid sequence of the protein in several formats.
Caps (ACE / NME) and other non-standard residues are included in the sequence using their three-letter or one-letter codes as recorded by the topology.
- Parameters:
oneletter (bool, optional) – If True, use single-letter amino-acid codes. If False (default), use three-letter codes.
numbered (bool, optional) – If True (default), include the resid in each entry as a
"RESNAME-RESID"string. If False, just the resname.
- Returns:
oneletter=False, numbered=True->['MET-0', 'ALA-1', ...]oneletter=False, numbered=False->['MET', 'ALA', ...]oneletter=True, numbered=True-> the full one-letter string, e.g."MA..."oneletter=True, numbered=False->['M', 'A', ...]
- Return type:
list of str OR str
Example
>>> protein.get_amino_acid_sequence(oneletter=True, numbered=True) 'MAAEELANAK...' >>> protein.get_amino_acid_sequence()[:3] ['MET-0', 'ALA-1', 'ALA-2']
- get_residue_atom_indices(resid, atom_name=None)[source]
Look up the atom indices that belong to a specific residue.
Results are memoised on first lookup so repeated calls for the same
(resid, atom_name)pair are essentially free. The previous name of this method wasresidue_atom_com().- Parameters:
resid (int) – Resid of the residue to look up.
atom_name (str, optional) – If given, restrict the result to atoms whose name matches this string (e.g.
"CA","CB","N"). If None (default), all atoms of the residue are returned.
- Returns:
Atom indices into the underlying mdtraj topology that satisfy the
resid/atom_namefilter.- Return type:
list of int
Example
>>> ca_atoms = protein.get_residue_atom_indices(5, atom_name='CA') >>> all_atoms = protein.get_residue_atom_indices(5)
- get_CA_index(resid)[source]
Return the atom index of the alpha-carbon (CA) for a residue.
Results are memoised on first lookup so repeated calls are essentially free. Raises
SSExceptionif the residue does not have exactly one CA atom (e.g. cap residues like ACE / NME have none).- Parameters:
resid (int) – Resid whose CA atom should be returned.
- Returns:
The mdtraj atom index of the CA atom of residue
resid.- Return type:
int
- Raises:
SSException – If the residue has no CA atom (caps, non-standard residues, or coarse-grained beads named differently).
Example
>>> protein.get_CA_index(5) 37
- get_multiple_CA_index(resID_list=None)[source]
Return CA atom indices for many residues at once.
Convenience wrapper around
get_CA_index()that handles three input forms: a single integer, a list of integers, orNone(meaning “every residue with a CA”). Residues without a CA are silently skipped rather than raising.- Parameters:
resID_list (int, list of int, or None, optional) –
None(default) -> useresid_with_CA(every residue that has a CA).A single
int-> wrap in a list, return one-element list.A
list[int]-> look up each in turn, skipping any that raise.
- Returns:
CA atom indices, in the same order as the input residues.
- Return type:
list of int
Example
>>> protein.get_multiple_CA_index([5, 10, 15]) [37, 86, 134] >>> all_ca = protein.get_multiple_CA_index() # every CA
- get_residue_COM(resid, atom_name=None)[source]
Per-frame centre of mass (COM) of a residue or a specific atom of it.
The COM is computed once and cached, so repeated calls for the same
(resid, atom_name)pair are essentially free. All positions are returned in Angstroms.- Parameters:
resid (int) – Resid of the residue to extract.
atom_name (str, optional) – If given, only the named atom contributes to the COM (a single-atom COM is just the atom’s position). If None (default), every atom of the residue contributes.
- Returns:
Array of shape
(n_frames, 3)giving the COM(x, y, z)of the residue in each frame.- Return type:
np.ndarray
Example
>>> com = protein.get_residue_COM(5) # full-residue COM >>> ca = protein.get_residue_COM(5, 'CA') # CA atom position
- get_residue_mass(R1)[source]
Total mass of every atom belonging to a residue.
The mass is summed from each atom’s
element.massas recorded in the underlying mdtraj topology, in atomic mass units (g/mol).- Parameters:
R1 (int) – Resid whose mass should be summed.
- Returns:
Total mass of residue
R1in atomic mass units.- Return type:
float
Example
>>> protein.get_residue_mass(5) # e.g. leucine 113.16
- calculate_all_CA_distances(residueIndex, mode='CA', only_C_terminal_residues=True, stride=1)[source]
Distances between a reference residue and every other CA residue.
Computes either CA-CA or COM-COM distances from
residueIndexto all other residues in the chain that have a CA atom. By default only residues C-terminal of the reference are considered, which is the right choice when building an all-vs-all upper-triangular matrix.Distance is returned in Angstroms.
- Parameters:
residueIndex (int) – Reference residue. Must have a CA atom; if not, the function returns
-1rather than raising.mode ({'CA', 'COM'}, optional) –
'CA'(default) - distance between alpha-carbon atoms.'COM'- distance between residue centres of mass.
only_C_terminal_residues (bool, optional) – If True (default) only consider residues with index strictly greater than
residueIndex. If False, also include residues N-terminal of the reference.stride (int, optional) – Use every
stride-th frame. Default is 1.
- Returns:
On success: array of shape
(n_frames_after_stride, M)whereMis the number of other residues considered.If
residueIndexhas no CA: the integer-1.
- Return type:
np.ndarray or int
- Raises:
SSException – If
modeis not one of'CA'or'COM'.
Example
>>> d = protein.calculate_all_CA_distances(5) >>> d.shape (1000, 86)
- get_inter_residue_COM_distance(R1, R2, stride=1, weights=False, etol=1e-07)[source]
Per-frame distance between two residues’ centres of mass.
Distances are in Angstroms.
- Parameters:
R1 (int) – Resid of the first residue.
R2 (int) – Resid of the second residue.
stride (int, optional) – Use every
stride-th frame. Default is 1.weights (array_like or False, optional) – Per-frame re-weighting vector, one entry per trajectory frame (
len == n_frames), validated bysoursop.ssutils.validate_weights()(stride-subsampled and renormalised to align with the strided frames).False(default) returns the per-frame array; if supplied the frame axis is collapsed and the scalar deterministic weighted-mean distance is returned.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
If
weights is False: 1D array of lengthceil(n_frames / stride)with the inter-residue COM distance for each (strided) frame. Ifweightsis supplied: the scalar weighted-mean distance. Angstroms.- Return type:
np.ndarray or float
Example
>>> d = protein.get_inter_residue_COM_distance(5, 45) >>> d.mean(), d.std() (12.3, 1.1)
- get_inter_residue_COM_vector(R1, R2)[source]
Per-frame displacement vector
COM(R1) - COM(R2)between residue COMs.Unlike
get_inter_residue_COM_distance(), this returns the full(dx, dy, dz)separation rather than its magnitude. The convention isCOM(R1) - COM(R2), i.e. the vector that points from residueR2’s COM to residueR1’s COM. Units are Angstroms.Note
Units changed in v0.2.0; prior to that release the vector was returned in nanometres.
- Parameters:
R1 (int) – Resid of the first residue (the “head” of the vector).
R2 (int) – Resid of the second residue (the “tail” of the vector).
- Returns:
Array of shape
(n_frames, 3)giving the displacementCOM(R1) - COM(R2)in each frame, in Angstroms.- Return type:
np.ndarray
Example
>>> v = protein.get_inter_residue_COM_vector(5, 45) >>> v.shape (1000, 3)
- get_inter_residue_atomic_distance(R1, R2, A1='CA', A2='CA', mode='atom', stride=1, weights=False, etol=1e-07)[source]
Per-frame distance between two residues, with mode-dependent semantics.
For
mode='atom'the distance is between the named atomsA1of residueR1andA2of residueR2. For all other modes theA1/A2arguments are ignored and the distance is the one returned bymdtraj.compute_contactsunder the chosen scheme.SOURSOP does no sanity checking on atom names — if a name is invalid, an SSException is raised with a hint about likely mistypes.
Distance is returned in Angstroms.
- Parameters:
R1 (int) – Resids of the two residues to measure between.
R2 (int) – Resids of the two residues to measure between.
A1 (str, optional) – For
mode='atom', the atom names withinR1andR2. Defaults are'CA'. Ignored for all other modes.A2 (str, optional) – For
mode='atom', the atom names withinR1andR2. Defaults are'CA'. Ignored for all other modes.mode ({'atom', 'ca', 'closest', 'closest-heavy', 'sidechain', 'sidechain-heavy'}, optional) –
'atom'(default) - distance between atomsA1andA2.'ca'- equivalent tomode='atom',A1=A2='CA'.'closest'- closest approach between any atom of R1 and any atom of R2 in each frame.'closest-heavy'- same as'closest'but ignoring hydrogens.'sidechain'- closest sidechain atom of R1 to closest sidechain atom of R2 (requires mdtraj >= 1.8.0).'sidechain-heavy'- sidechain heavy-atom version (requires mdtraj >= 1.8.0). Raises for glycine residues.
stride (int, optional) – Use every
stride-th frame. Default is 1.weights (array_like or False, optional) – Per-frame re-weighting vector, one entry per trajectory frame (
len == n_frames), validated bysoursop.ssutils.validate_weights()and stride-aligned.False(default) returns the per-frame array; if supplied the frame axis is collapsed to the scalar weighted mean.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
If
weights is False: 1D array of lengthceil(n_frames / stride)with the inter-residue distance for each (strided) frame. Ifweightsis supplied: the scalar deterministic weighted-mean distance. Angstroms.- Return type:
np.ndarray or float
- Raises:
SSException – If
modeis invalid, an atom name cannot be found in the corresponding residue, or the weight vector fails validation.
Example
>>> d = protein.get_inter_residue_atomic_distance(5, 45) # CA-CA by default >>> d_min = protein.get_inter_residue_atomic_distance(5, 45, mode='closest-heavy')
- get_distance_map(mode='CA', RMS=False, stride=1, return_instantaneous_maps=False, weights=False, verbose=True)[source]
Inter-residue distance map (and its standard deviation).
Builds the full N x N matrix of inter-residue distances where N is the number of residues with a CA atom (ACE / NME caps are therefore excluded). Distances are returned in Angstroms. The matrix is upper triangular (entries below the diagonal are zero).
- Parameters:
mode ({'CA', 'COM'}, optional) –
'CA'(default) - distances between alpha-carbon atoms.'COM'- distances between residue centres of mass.
RMS (bool, optional) – If True, report root-mean-square distances \(\sqrt{\langle r_{ij}^2 \rangle}\), the polymer-physics order parameter. If False (default), report the ensemble mean \(\langle r_{ij} \rangle\).
stride (int, optional) – Use every
stride-th frame. Default is 1 (every frame).return_instantaneous_maps (bool, optional) – If True, the first element of the returned tuple is a 3D array
(n_frames, N, N)of per-frame distance maps rather than the 2D ensemble-average map. Default is False.weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging (e.g. T-WHAM output). Default
Falsemeans uniform weighting; in that case the std-map is also returned, otherwise it is None.verbose (bool, optional) – If True (default), print one status line per row of the matrix. This calculation can be slow on long trajectories.
- Returns:
(distance_map, std_distance_map)where:distance_mapis(N, N)ensemble-mean distances (or(n_frames, N, N)per-frame distances ifreturn_instantaneous_maps=True).std_distance_mapis(N, N)per-pair standard deviations across frames;Noneifweightswas given.
- Return type:
tuple of (np.ndarray, np.ndarray)
Example
>>> mean_map, std_map = protein.get_distance_map(verbose=False) >>> mean_map.shape (92, 92) >>> rms_map, _ = protein.get_distance_map(mode='COM', RMS=True, verbose=False)
- get_polymer_scaled_distance_map(nu=None, A0=None, min_separation=10, mode='fractional-change', stride=1, weights=False, etol=1e-07, verbose=True)[source]
Quantify how each inter-residue distance deviates from a homopolymer model.
For a standard polymer the equilibrium distance scales as \(\langle r_{ij} \rangle = A_0 |i-j|^{\nu}\). This method builds an N x N matrix where each entry quantifies how far the observed mean distance is from that homopolymer prediction. The deviation can be expressed in several ways (see
mode).Note that \(A_0\) here is the prefactor of the inter-residue distance scaling and is NOT the same numerical value as the \(R_0\) prefactor that defines \(R_g = R_0 N^{\nu}\). The scaling exponent \(\nu\) should typically lie in
[0.33, 0.598]for real polymers.If
nuandA0are not supplied, the homopolymer model is fit automatically byget_scaling_exponent()(with default settings) and that fit is used as the reference model.- Parameters:
nu (float, optional) – Scaling exponent used as the homopolymer reference. If supplied,
A0must also be supplied. IfNone(default), both are obtained by callingget_scaling_exponent().A0 (float, optional) – Scaling prefactor for the homopolymer reference; paired with
nu.min_separation (int, optional) – Minimum sequence separation
|i-j|for which a deviation is computed. Entries with separations below this threshold are filled with the mode-dependent default value (0 for the change modes, 1 for'scaled'). Default is 10.mode ({'fractional-change', 'signed-fractional-change', 'signed-absolute-change', 'scaled'}, optional) –
How to quantify the deviation between the observed mean distance \(r_{ij}\) and the homopolymer prediction \(p_{ij}\):
'fractional-change'(default): \(|r_{ij} - p_{ij}| / p_{ij}\). Unsigned magnitude.'signed-fractional-change': \((r_{ij} - p_{ij}) / p_{ij}\). Positive => expansion vs. the homopolymer, negative => contraction.'signed-absolute-change': \(r_{ij} - p_{ij}\) (Angstroms).'scaled': \(r_{ij} / p_{ij}\) (default value 1 on near-diagonal).
stride (int, optional) – Use every
stride-th frame for the distance map. Default is 1.weights (list or np.ndarray, optional) – Per-frame weights for re-weighted analysis (e.g. T-WHAM output). Default
Falsemeans uniform weighting.etol (float, optional) – Tolerance for the weights-sum-to-one check. Default
1e-7.verbose (bool, optional) – If True (default), print status updates during the distance map and (when needed) the scaling-exponent fit.
- Returns:
(map, nu, A0, redchi)where:mapis an(N, N)numpy matrix of deviations (units depend onmode).nuis the homopolymer scaling exponent used.A0is the homopolymer scaling prefactor used.redchiis the reduced chi-squared of the homopolymer fit (-1ifnu/A0were supplied rather than fit).
- Return type:
tuple of (np.ndarray, float, float, float)
- Raises:
SSException – If
modeis not one of the four allowed values, ifmin_separation < 1, if exactly one ofnu/A0is given, or if the suppliednu/A0are out of physical range.
Example
>>> m, nu, A0, chi2 = protein.get_polymer_scaled_distance_map(verbose=False) >>> m.shape, round(nu, 2) ((92, 92), 0.6)
- get_radius_of_gyration(R1=None, R2=None, weights=False, etol=1e-07)[source]
Per-frame radius of gyration of the chain (or a sub-region).
\(R_g = \sqrt{(1/N) \sum_i (r_i - R)^2}\) is the canonical polymer-physics measure of overall chain size. It is computed by
mdtraj.compute_rgon every heavy/light atom in the selection and returned in Angstroms.- Parameters:
R1 (int, optional) – First residue of the region.
None(default) means the first residue including caps.R2 (int, optional) – Last residue of the region.
None(default) means the last residue including caps.weights (array_like or False, optional) – Per-frame re-weighting vector (validated by
soursop.ssutils.validate_weights()).False(default) returns the unmodified per-frame array. If supplied, the frame axis is collapsed and the single deterministic weighted-mean \(R_g\) is returned instead.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
If
weights is False: 1D array of lengthn_frameswith the per-frame \(R_g\) in Angstroms. Ifweightsis supplied: the scalar weighted-mean \(R_g\).- Return type:
np.ndarray or float
Example
>>> rg = protein.get_radius_of_gyration() >>> rg.mean() 33.4
- get_hydrodynamic_radius(R1=None, R2=None, mode='nygaard', alpha1=0.216, alpha2=4.06, alpha3=0.821, distance_mode='CA', weights=False, etol=1e-07)[source]
Per-frame apparent hydrodynamic radius \(R_h\) (in Angstroms).
Two estimators are available:
mode='nygaard'- empirical Rg-to-Rh conversion of Nygaard et al. (2017), using the three parametersalpha1,alpha2,alpha3.mode='kr'- Kirkwood-Riseman equation (Kirkwood & Riseman 1948), recommended for comparison with PFG-NMR-derived \(R_h\) values (Pesce et al. 2022).
Both approximations hold for fully flexible disordered proteins and become less accurate for larger / more folded domains.
- Parameters:
R1 (int, optional) – First residue of the region.
None(default) means the first residue including caps.R2 (int, optional) – Last residue of the region.
None(default) means the last residue including caps.mode ({'nygaard', 'kr'}, optional) – Which estimator to use. Default is
'nygaard'.alpha1 (float, optional) – Parameters of equation (7) of Nygaard et al. Defaults reproduce the published values (0.216, 4.06, 0.821). Ignored for
mode='kr'.alpha2 (float, optional) – Parameters of equation (7) of Nygaard et al. Defaults reproduce the published values (0.216, 4.06, 0.821). Ignored for
mode='kr'.alpha3 (float, optional) – Parameters of equation (7) of Nygaard et al. Defaults reproduce the published values (0.216, 4.06, 0.821). Ignored for
mode='kr'.distance_mode ({'CA', 'COM'}, optional) – For
mode='kr', which inter-residue distance to feed into the Kirkwood-Riseman sum. Default is'CA'. Ignored formode='nygaard'.weights (array_like or False, optional) – Per-frame re-weighting vector.
False(default) returns the per-frame array; if supplied the scalar deterministic weighted-mean \(R_h\) is returned.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
Per-frame \(R_h\) (length
n_frames), or the scalar weighted mean ifweightsis supplied. Angstroms.- Return type:
np.ndarray or float
Example
>>> rh_nyg = protein.get_hydrodynamic_radius() >>> rh_kr = protein.get_hydrodynamic_radius(mode='kr') >>> rh_nyg.mean(), rh_kr.mean() (24.1, 23.8)
References
Nygaard M, Kragelund BB, Papaleo E, Lindorff-Larsen K. An Efficient Method for Estimating the Hydrodynamic Radius of Disordered Protein Conformations. Biophys J. 2017;113:550-557.
Kirkwood JG, Riseman J. The Intrinsic Viscosities and Diffusion Constants of Flexible Macromolecules in Solution. J Chem Phys. 1948;16(6):565-573.
Pesce F et al. Assessment of models for calculating the hydrodynamic radius of intrinsically disordered proteins. Biophysical Journal. 2022. doi:10.1016/j.bpj.2022.12.013
- get_end_to_end_distance(mode='COM', weights=False, etol=1e-07)[source]
Per-frame end-to-end distance between the first and last CA residues.
Caps (ACE / NME) are excluded — the “ends” are the first and last residues with a CA atom, so the two modes operate on the same pair of residues.
End-to-end distance is returned in Angstroms.
- Parameters:
mode ({'CA', 'COM'}, optional) –
'COM'(default) - distance between residue centres of mass.'CA'- distance between alpha-carbon atoms.
weights (array_like or False, optional) – Per-frame re-weighting vector.
False(default) returns the per-frame array; if supplied the scalar deterministic weighted-mean end-to-end distance is returned.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
Per-frame end-to-end distance (length
n_frames), or the scalar weighted mean ifweightsis supplied. Angstroms.- Return type:
np.ndarray or float
Example
>>> ee = protein.get_end_to_end_distance(mode='CA') >>> ee.mean(), ee.std() (30.1, 4.4)
- get_end_to_end_vs_rg_correlation(mode='COM', weights=False, etol=1e-07)[source]
Pearson correlation between per-frame \(R_e^2\) and \(R_g^2\).
For a Gaussian chain Debye’s result gives \(R_e^2 = 6 R_g^2\), but in practice the correlation is a more useful diagnostic of how fractal-like (or otherwise) the chain is. This method returns the Pearson correlation across frames between the end-to-end distance squared and the radius of gyration squared.
- Parameters:
mode ({'COM', 'CA'}, optional) – Type of distance to use for the end-to-end distance. Default
'COM'.- Returns:
Pearson correlation coefficient in
[-1, 1].- Return type:
float
Example
>>> protein.get_end_to_end_vs_rg_correlation() 0.92
- get_asphericity(R1=None, R2=None, verbose=True, weights=False, etol=1e-07)[source]
Per-frame asphericity of the chain (or a sub-region).
Asphericity is a dimensionless shape descriptor computed from the eigenvalues of the gyration tensor: it is 0 for a perfectly spherical distribution of mass and approaches 1 as the chain becomes increasingly extended/rod-like.
See p. 65 of Andreas Vitalis’ thesis (Probing the Early Stages of Polyglutamine Aggregation with Computational Methods, 2009, WashU) for a canonical derivation.
- Parameters:
R1 (int, optional) – First residue of the region to consider.
None(default) means the first residue in the sequence (including caps).R2 (int, optional) – Last residue of the region to consider.
None(default) means the last residue in the sequence (including caps).verbose (bool, optional) – If True (default), print one status line every 500 frames during the gyration-tensor computation.
weights (array_like or False, optional) – Per-frame re-weighting vector.
False(default) returns the per-frame array; if supplied the frame axis is collapsed and the scalar deterministic weighted-mean asphericity is returned.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
Per-frame asphericity (length
n_frames), or the scalar weighted mean ifweightsis supplied.- Return type:
np.ndarray or float
Example
>>> a = protein.get_asphericity(verbose=False) >>> a.mean() 0.18
- get_t(R1=None, R2=None, weights=False, etol=1e-07)[source]
Per-frame dimensionless size parameter \(\langle t \rangle\).
\(\langle t \rangle\) is the Vitalis-thesis size parameter (Vitalis 2009) that scales the radius of gyration against the contour length \(L = 3.6 N\) (Angstroms). It is dimensionless and useful for comparing chains of different lengths.
- Parameters:
R1 (int, optional) – First residue of the region.
None(default) means the first residue.R2 (int, optional) – Last residue of the region.
None(default) means the last residue.weights (array_like or False, optional) – Per-frame re-weighting vector (
len == n_frames), validated bysoursop.ssutils.validate_weights().False(default) returns the per-frame array; if supplied the frame axis is collapsed and the scalar deterministic weighted-mean \(\langle t \rangle\) is returned.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
Per-frame \(\langle t \rangle\) (length
n_frames), or the scalar weighted mean ifweightsis supplied.- Return type:
np.ndarray or float
Example
>>> t = protein.get_t() >>> t.mean() 2.4
References
Vitalis, A. (2009). Probing the Early Stages of Polyglutamine Aggregation with Computational Methods (R. Pappu, ed.) [Ph.D. thesis from Washington University in St. Louis].
- get_gyration_tensor(R1=None, R2=None, verbose=True, weights=False, etol=1e-07)[source]
Per-frame
3 x 3gyration tensor of the chain (or a sub-region).The gyration tensor \(T_{ab} = (1/N) \sum_i (r_{i,a} - R_a)(r_{i,b} - R_b)\) is the second moment of mass about the centre of mass. Its eigenvalues are the principal moments of inertia divided by mass and underlie derived quantities like
get_radius_of_gyration()andget_asphericity(). Units are Angstroms^2.- Parameters:
R1 (int, optional) – First residue of the region to consider.
None(default) means the first residue (including caps).R2 (int, optional) – Last residue of the region to consider.
None(default) means the last residue (including caps).verbose (bool, optional) – If True (default), print one status line every 500 frames.
weights (array_like or False, optional) – Per-frame re-weighting vector.
False(default) returns the per-frame(n_frames, 3, 3)array; if supplied the frame axis is collapsed and a single deterministic weighted-mean(3, 3)tensor is returned.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
(n_frames, 3, 3)per-frame tensors, or a single(3, 3)weighted-mean tensor ifweightsis supplied.- Return type:
np.ndarray
Example
>>> T = protein.get_gyration_tensor(verbose=False) >>> T.shape (1000, 3, 3)
- get_angles(angle_name)[source]
Per-frame backbone or sidechain dihedral angles for every applicable residue.
Computes the named dihedral using the corresponding mdtraj helper (
compute_phi,compute_psi, …,compute_chi5) and converts the result to degrees. The first call for a givenangle_nameis memoised so repeated calls are essentially free.- Parameters:
angle_name ({'phi', 'psi', 'omega', 'chi1', 'chi2', 'chi3', 'chi4', 'chi5'}) – Dihedral to compute.
chi3-chi5only exist for residues with long sidechains (Arg, Lys, Met, etc.), so the result will contain fewer entries thann_residues.- Returns:
A two-element list:
[0]is a list-of-lists; each sub-list holds the fourmdtraj.Atomobjects that define one dihedral.[1]is an array of shape(n_dihedrals, n_frames)of the per-frame angles in degrees.
- Return type:
list of [list_of_mdtraj_atoms, np.ndarray]
- Raises:
SSException – If
angle_nameis not one of the eight allowed strings.
Example
For a protein with two arginine residues:
>>> chi5 = protein.get_angles('chi5') >>> chi5[0][1] # atoms defining 2nd arginine's chi5 >>> chi5[1][1] # per-frame chi5 angles for that residue
- get_secondary_structure_DSSP(R1=None, R2=None, return_per_frame=False, weights=False, etol=1e-07)[source]
Per-residue secondary structure (helix / extended / coil) via DSSP.
Calls
mdtraj.compute_dsspon the chosen residue range and summarises the output into per-residue fractional occupancies (the default) or per-frame binary masks (withreturn_per_frame=True).The three buckets are H (helix), E (extended/beta), C (coil). At every residue the three values sum to 1 (default mode).
- Parameters:
R1 (int, optional) – First residue of the region.
None(default) means the first CA-bearing residue.R2 (int, optional) – Last residue of the region.
None(default) means the last CA-bearing residue.return_per_frame (bool, optional) –
False (default): return per-residue fractional occupancies.
True: return per-frame binary 1/0 masks.
weights (array_like or False, optional) – Per-frame re-weighting vector (
len == n_frames), validated bysoursop.ssutils.validate_weights(). Only valid withreturn_per_frame=False: the per-residue fractional occupancies become the deterministic frame-weighted fractions. Supplyingweightswithreturn_per_frame=Trueraises anSSException(a per-frame mask is not an ensemble average). DefaultFalse(unweighted).etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
(resid_list, helix, extended, coil)where:resid_list(list of int): residue indices covered.If
return_per_frame=False:helix,extended,coilare 1D arrays of lengthlen(resid_list)with per-residue fractional time in each bucket.If
return_per_frame=True:helix,extended,coilare 2D arrays of shape(n_frames, len(resid_list))with 1 / 0 entries indicating presence of that bucket in that frame.
- Return type:
tuple of length 4
Example
>>> resids, h, e, c = protein.get_secondary_structure_DSSP() >>> h.mean() # mean helicity across the chain 0.32
- get_secondary_structure_BBSEG(R1=None, R2=None, return_per_frame=False, weights=False, etol=1e-07)[source]
Per-residue secondary structure via the BBSEG2 phi/psi classification.
Classifies each (phi, psi) pair into one of 9 BBSEG2 bins using a lookup table distributed with CAMPARI. Implementation is in pure Python, so the result is version-stable across mdtraj releases.
BBSEG2 classes:
0 - unclassified
1 - beta (turn/sheet)
2 - PII (polyproline type II helix)
3 - unusual region
4 - right-handed alpha helix
5 - inverse C7-equatorial (gamma-prime turn)
6 - classic C7-equatorial (gamma turn)
7 - 7-residue-per-turn helix
8 - left-handed alpha helix
- Parameters:
R1 (int, optional) – First residue of the region.
None(default) means the first residue with phi/psi defined.R2 (int, optional) – Last residue of the region.
None(default) means the last residue with phi/psi defined.return_per_frame (bool, optional) –
False (default): return per-residue fractional occupancies.
True: return per-frame binary masks.
weights (array_like or False, optional) – Per-frame re-weighting vector (
len == n_frames), validated bysoursop.ssutils.validate_weights(). Only valid withreturn_per_frame=False: the per-class per-residue fractional occupancies become the deterministic frame-weighted fractions. Supplyingweightswithreturn_per_frame=Trueraises anSSException. DefaultFalse(unweighted).etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
(resid_list, per_class)where:resid_list(list of int): residue indices covered.per_class(dict): keys 0..8 (int), values are either 1D fractional-occupancy arrays (lengthlen(resid_list)) or 2D(n_frames, len(resid_list))binary masks depending onreturn_per_frame.
- Return type:
tuple of (list, dict)
Example
>>> resids, classes = protein.get_secondary_structure_BBSEG() >>> classes[4].mean() # mean right-handed-alpha occupancy across chain
- get_internal_scaling(R1=None, R2=None, mode='COM', mean_vals=False, stride=1, weights=False, etol=1e-07, verbose=True)[source]
Internal scaling profile: sequence separation vs. inter-residue distance.
For every sequence separation \(|i-j|\) from 0 to
R2-R1, this method collects the inter-residue distances of every pair at that separation across every frame. The result is the raw material for an internal-scaling plot — a foundational polymer-physics observable for IDPs and IDRs (Mao et al. 2013; Pappu et al. 2008).ACE / NME peptide caps are excluded. (Note: this differs from CAMPARI, which includes them.)
Distances are in Angstroms.
- Parameters:
R1 (int, optional) – First residue of the region.
None(default) means the first CA-bearing residue.R2 (int, optional) – Last residue of the region.
None(default) means the last CA-bearing residue.mode ({'COM', 'CA'}, optional) –
'COM'(default) - distances between residue centres of mass.'CA'- distances between alpha-carbon atoms.
mean_vals (bool, optional) – If True, the per-separation distances are reduced to their mean before being returned. If False (default), every individual inter-residue distance is kept (useful for plotting the full distribution).
stride (int, optional) – Use every
stride-th frame. Default is 1. Larger values are recommended for long trajectories of large proteins.weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging. Default
Falsemeans uniform weighting.etol (float, optional) – Tolerance for the weights-sum-to-one check. Default
1e-7.verbose (bool, optional) – If True (default), print one status line per sequence separation.
- Returns:
(seq_sep, distances)where:seq_sepis a list of integer sequence separations[0, 1, ..., R2-R1].If
mean_vals=True,distancesis a list of per-separation mean distances (one float per separation).If
mean_vals=False(default),distancesis a list of 1D arrays — one array per separation containing every individual inter-residue distance at that separation.
- Return type:
tuple of (list, list)
Example
>>> sep, mean_d = protein.get_internal_scaling(mean_vals=True, verbose=False) >>> # plot sep vs mean_d for an internal-scaling profile
References
Mao AH, Lyle N, Pappu RV. Describing sequence-ensemble relationships for intrinsically disordered proteins. Biochem J. 2013;449:307-318.
Pappu RV, Wang X, Vitalis A, Crick SL. A polymer-physics perspective on driving forces and mechanisms for protein aggregation. Arch Biochem Biophys. 2008;469:132-141.
Notes
Computational cost scales rapidly with chain length; consider increasing
stridefor trajectories of long proteins.
- get_internal_scaling_RMS(R1=None, R2=None, mode='COM', stride=1, weights=False, etol=1e-07, verbose=True)[source]
Root-mean-square internal scaling profile: separation vs. RMS distance.
Like
get_internal_scaling()but reports \(\sqrt{\langle r_{ij}^2 \rangle}\) for each sequence separation — the formally correct order parameter for polymer-physics analyses (Mao et al. 2013; Pappu et al. 2008). ACE / NME caps are excluded. Distances in Angstroms.- Parameters:
R1 (int, optional) – First residue of the region.
None(default) means the first CA-bearing residue.R2 (int, optional) – Last residue of the region.
None(default) means the last CA-bearing residue.mode ({'COM', 'CA'}, optional) –
'COM'(default) - distances between residue centres of mass.'CA'- distances between alpha-carbon atoms.
stride (int, optional) – Use every
stride-th frame. Default is 1.weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging. Default
False.etol (float, optional) – Tolerance for the weights-sum-to-one check. Default
1e-7.verbose (bool, optional) – If True (default), print one status line per sequence separation.
- Returns:
(seq_sep, rms_distance)whereseq_sepis the list of integer separations andrms_distanceis the corresponding list of RMS inter-residue distances in Angstroms.- Return type:
tuple of (list, list)
Example
>>> sep, rms = protein.get_internal_scaling_RMS(verbose=False) >>> import numpy as np >>> # log-log fit of rms vs sep recovers the apparent scaling exponent >>> nu, _ = np.polyfit(np.log(sep[1:]), np.log(rms[1:]), 1)
References
Mao AH, Lyle N, Pappu RV. Describing sequence-ensemble relationships for intrinsically disordered proteins. Biochem J. 2013;449:307-318.
Pappu RV, Wang X, Vitalis A, Crick SL. A polymer-physics perspective on driving forces and mechanisms for protein aggregation. Arch Biochem Biophys. 2008;469:132-141.
- get_scaling_exponent(inter_residue_min=15, end_effect=5, subdivision_batch_size=20, mode='COM', num_fitting_points=40, fraction_of_points=0.5, fraction_override=False, etol=1e-07, stride=1, weights=False, verbose=True)[source]
Fit the apparent polymer scaling exponent and prefactor by loglog regression.
Fits the homopolymer scaling relationship \(\sqrt{\langle r_{ij}^2 \rangle} = A_0 |i-j|^{\nu_{app}}\) to the internal-scaling profile of the chain. The exponent \(\nu_{app}\) (“nu-app”) reports on solvent quality and the prefactor \(A_0\) captures chain stiffness / segment volume.
Bootstrap error estimates for
nuandA0come from randomly subdividing the trajectory intosubdivision_batch_sizechunks, fitting each, and taking the min/max across the chunks.Note
Despite the precision of the fit,
nu_appandA0are qualitative metrics subject to finite-chain-size effects, and are most rigorously interpreted for genuine homopolymers. For heteropolymers considerget_polymer_scaled_distance_map()which avoids the homopolymer assumption.- Parameters:
inter_residue_min (int, optional) – Minimum sequence separation
|i-j|used in the fit. Helps avoid the short-distance regime where local steric effects dominate. Default is 15.end_effect (int, optional) – Exclude pairs in which one residue is within
end_effectresidues of either chain end. Default is 5.subdivision_batch_size (int, optional) – Target chunk size when bootstrapping fit-quality errors. Default is 20.
mode ({'COM', 'CA'}, optional) – Distance type.
'COM'(default) is preferred — CA-CA tends to inflate the apparent profile.num_fitting_points (int, optional) – Maximum number of evenly-spaced loglog points used in the linear fit. Default is 40.
fraction_of_points (float, optional) – When
fraction_overrideis True, or when the chain is too short to providenum_fitting_pointspoints, this fraction of the valid sequence separations is used instead. Default is 0.5.fraction_override (bool, optional) – If True, always use
fraction_of_pointsand ignorenum_fitting_points. Default is False.etol (float, optional) – Tolerance for the weights-sum-to-one check. Default
1e-7.stride (int, optional) – Use every
stride-th frame. Default is 1.weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging. Default
False.verbose (bool, optional) – If True (default), print status updates during the internal-scaling pass.
- Returns:
(best_nu, best_A0, min_nu, max_nu, min_A0, max_A0, redchi_fit_region, redchi_all_points, fit_region_data, all_points_data)where:best_nu,best_A0- point estimates from the full fit.min_nu..``max_A0`` - bootstrap min/max bounds.redchi_fit_region- reduced chi^2 on the points used in the linear fit (loglog-uniform).redchi_all_points- reduced chi^2 across every observed \(|i-j|\).fit_region_data- shape(2, K); col 0 is sequence separation, col 1 is RMS distance for the fit-region points.all_points_data- shape(3, M); col 0 is sequence separation, col 1 is observed RMS distance, col 2 is the best-fit prediction.
- Return type:
tuple of length 10
- Raises:
SSException – If
fraction_of_points > 1.0or if too few points remain to fit a line (< 3).
Example
>>> import numpy as np >>> np.random.seed(42) # error bootstrap consumes the RNG >>> result = protein.get_scaling_exponent(verbose=False) >>> nu, A0 = result[0], result[1]
- get_local_heterogeneity(fragment_size=10, bins=None, stride=20, verbose=True, weights=False, etol=1e-07)[source]
Sliding-window heterogeneity: per-position distribution of intra-window RMSDs.
At each starting residue
i(from 0 ton_residues - fragment_size) the method computes the RMSD of the local fragment[i, i+fragment_size]across every (strided) frame, then summarises that distribution by its mean, standard deviation, and histogram. The Phi parameter of Lyle, Das & Pappu (2013) is built from this same family of D-values.Computational cost scales with
n_residues * (n_frames / stride); the defaultstride=20keeps things tractable on long trajectories.Note
weights/etolare accepted for API uniformity with the other ensemble-average methods, but supplyingweightsraises anSSException: this is a pairwise frame-vs-frame measure for which a single per-frame probability vector is not well-defined.- Parameters:
fragment_size (int, optional) – Size of the sliding window (residues). Must be
>= 2and<= n_residues. Default is 10.bins (np.ndarray or list, optional) – Histogram bin edges. If None (default), uses
np.arange(0, 10, 0.01). Must contain at least two evenly- spaced floats.stride (int, optional) – Use every
stride-th frame in the inner RMSD pass. Default 20.verbose (bool, optional) – If True (default), print one status line per starting residue.
- Returns:
(mean, std, histo, bins)where:mean(list of float, lengthn_residues - fragment_size): mean intra-window RMSD per starting residue.std(list of float, same length): standard deviation of the same RMSD distribution.histo(list of np.ndarray, same length): histogram counts per starting residue.bins(np.ndarray): the bin edges used (echoed back).
- Return type:
tuple of (list, list, list, np.ndarray)
- Raises:
SSException – If
fragment_sizeis out of range, orbinsis not a valid evenly-spaced 1D vector.
Example
>>> mean, std, hist, bins = protein.get_local_heterogeneity(fragment_size=8)
References
Lyle N, Das RK, Pappu RV. A quantitative measure for protein conformational heterogeneity. J Chem Phys. 2013;139(12):121907.
- get_D_vector(stride=20, verbose=True)[source]
Pairwise frame-vs-frame conformational dissimilarity vector \(D\).
For every pair of (strided) frames \((A, B)\) the method computes a single dissimilarity scalar \(D_{AB} = 1 - V_A \cdot V_B / (||V_A|| ||V_B||)\) where \(V\) is the flattened upper-triangular CA-CA distance vector for that frame. This is the input to the Phi parameter of Lyle, Das, Pappu (2013).
Using CA-CA distances (rather than full inter-atomic distances) makes the measure backbone-specific and much faster than the original formulation.
- Parameters:
stride (int, optional) – Use every
stride-th frame. Default is 20.verbose (bool, optional) – If True (default), print one status line per outer-loop frame.
- Returns:
1D array of length \(\binom{n_{frames}/\text{stride}}{2}\) containing the pairwise dissimilarity values.
- Return type:
np.ndarray
Example
>>> D = protein.get_D_vector(stride=20) >>> D.mean() # mean pairwise dissimilarity 0.27
References
Lyle N, Das RK, Pappu RV. A quantitative measure for protein conformational heterogeneity. J Chem Phys. 2013;139(12):121907.
- get_RMSD(frame1, frame2=-1, region=None, backbone=True, stride=1)[source]
Aligned root-mean-square deviation (RMSD) of one frame vs. another (or all).
Both modes superpose the target onto the reference before computing the RMSD, so the result is rotation/translation invariant. Units are Angstroms.
- Parameters:
frame1 (int) – Index of the reference frame.
frame2 (int, optional) – Index of the comparison frame.
-1(default) means “every frame” — i.e. compute an RMSD trace offrame1against the entire trajectory (subsampled bystride).region (list or tuple of length 2, optional) –
[first_resid, last_resid](inclusive) restricting the atoms used to compute and align the RMSD.None(default) uses the full chain.backbone (bool, optional) – If True (default), use only the backbone heavy atoms (N, CA, C, O). If False, use every atom in the selection.
stride (int, optional) – Used only when
frame2 == -1. Compareframe1against everystride-th frame of the trajectory. Default is 1.
- Returns:
1-element array if
frame2is a specific frame index.Length-
ceil(n_frames/stride)array ifframe2 == -1.
Values are in Angstroms.
- Return type:
np.ndarray
Example
>>> rmsd_traj = protein.get_RMSD(frame1=0) # frame 0 vs all >>> rmsd_pair = protein.get_RMSD(frame1=0, frame2=42) # frame 0 vs 42
- get_Q(protein_average=True, region=None, beta_const=50.0, lambda_const=1.8, native_contact_threshold=4.5, stride=1, native_state_reference_frame=0, weights=False)[source]
Fraction-of-native-contacts order parameter \(Q\) (Best et al.).
Native contacts are defined from the reference frame (frame 0) as heavy-atom pairs separated by at least 4 residues in sequence and less than
native_contact_thresholdAngstroms in space. Each frame’s \(Q\) is then a sigmoid-weighted fraction of those contacts that remain “native-like” in that frame, per the Best, Hummer, Eaton formula (Best et al. 2013).The reference frame is hard-coded to frame 0 because preserving a variable reference frame became unwieldy with re-weighted ensembles.
- Parameters:
protein_average (bool, optional) –
If True (default), return one Q value per frame (the protein average).
If False, return per-native-contact and per-residue breakdowns in a 5-tuple — useful for visualising which contacts are most persistent.
region (list or tuple of length 2, optional) –
[first_resid, last_resid](inclusive) restricting the residues used to identify native contacts.None(default) uses the full protein.beta_const (float, optional) – Steepness of the sigmoid weighting (1/nm). Default 50. Do not change without strong justification.
lambda_const (float, optional) – Sigmoid width factor (all-atom default 1.8). Do not change without strong justification.
native_contact_threshold (float, optional) – Distance cutoff (Angstroms) that defines a contact in the reference frame. Default is 4.5.
stride (int, optional) – Use every
stride-th frame. Default is 1. May be combined withweights(the weight vector is subsampled with the same stride and renormalised internally).native_state_reference_frame (int, optional) – Currently ignored — the reference frame is always frame 0. Kept in the signature for backward compatibility.
weights (list or np.ndarray, optional) – Per-frame weights for re-weighted analysis, one entry per trajectory frame (
len(weights) == n_frames), each in[0, 1]and summing to 1. Validated bysoursop.ssutils.validate_weights(). Only used whenprotein_average=False(the per-contact / per-residue breakdown); forprotein_average=Truesee Raises. When astrideis given the vector is strided and renormalised, and the native reference frame (frame 0) is excluded and the remainder renormalised so the weighted average is a proper expectation over the non-native (strided) frames. DefaultFalse.
- Returns:
If
protein_average=True(default): 1D array of lengthceil(n_frames/stride)with the per-frame Q value.If
protein_average=False: a 5-tupleper_contact_fraction- 1D array of lengthN_NATIVE_CONTACTS; fraction of time each native contact was native-like over the trajectory.native_contact_pairs-(N_NATIVE_CONTACTS, 2)int array; the atom-pair definitions of native contacts.per_residue_contacts- dict keyed by"RESNAME-RESID"; each value is a list of fractional-native-contact scores for atoms in that residue. Take the mean per key to get the per-residue Q.ordered_residue_keys- the keys of (2) in residue order.res_res_q_matrix-(n_residues, n_residues)array of inter-residue Q values (symmetric).
- Return type:
np.ndarray OR tuple of length 5
- Raises:
SSException – If
weightsis given withprotein_average=True(frame-averaged Q must be reweighted outside SOURSOP); if the weight vector fails validation (wrong length, out of[0, 1], non-finite, or not summing to 1); if excluding the native reference frame leaves a zero-sum weight vector; or if any constant cannot be coerced to float.
Example
>>> q = protein.get_Q() # per-frame Q >>> q.mean() 0.85 >>> per_c, pairs, per_res, keys, res_res = protein.get_Q(protein_average=False)
References
Best RB, Hummer G, Eaton WA. Native contacts determine protein folding mechanisms in atomistic simulations. PNAS 2013. doi:10.1073/pnas.1311599110
- get_contact_map(distance_thresh=5.0, mode='closest-heavy', stride=1, weights=False)[source]
Inter-residue contact map and per-residue contact-order vector.
For every pair of residues this returns the fraction of frames in which the chosen inter-residue distance is below
distance_thresh. The companion contact-order vector reduces the 2D map to a per-residue summary by averaging over the appropriate neighbours-excluded denominator.i to i,i to i+1, andi to i+2pairs are always excluded. ACE / NME caps are excluded from the residue set.- Parameters:
distance_thresh (float, optional) – Contact threshold in Angstroms. A pair is “in contact” in a given frame if its distance under the chosen
modeis below this value. Default is 5.0.mode ({'closest-heavy', 'ca', 'closest', 'sidechain', 'sidechain-heavy'}, optional) –
How the inter-residue distance is defined (the schemes from
mdtraj.compute_contacts):'closest-heavy'(default) - closest pair of heavy atoms.'ca'- alpha-carbon distance.'closest'- closest pair of any atoms (incl. hydrogens).'sidechain'- closest pair of sidechain atoms (mdtraj >= 1.8).'sidechain-heavy'- closest pair of sidechain heavy atoms (mdtraj >= 1.8). Raises if any residue is glycine.
stride (int, optional) – Use every
stride-th frame. Default is 1. May be combined withweights(the weight vector is subsampled with the same stride and renormalised internally so it aligns with the strided frames).weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging, one entry per trajectory frame (
len(weights) == n_frames), each in[0, 1]and summing to 1. Validated bysoursop.ssutils.validate_weights(). When supplied the contact map is the deterministic weighted mean of the per-frame contact maps over the (strided) frames. DefaultFalse.
- Returns:
(contact_map, contact_order)where:contact_mapis an(N, N)matrix of contact fractions in[0, 1].contact_orderis a 1D vector of lengthNgiving the mean fractional contact count per residue, normalised by the number of valid partners for each residue.
- Return type:
tuple of (np.ndarray, np.ndarray)
- Raises:
SSException – For
mode='sidechain-heavy'when any glycine is present, or if the weight vector fails validation (wrong length, out of[0, 1], non-finite, or not summing to 1).
Example
>>> cmap, corder = protein.get_contact_map() >>> cmap.shape (92, 92)
- get_clusters(region=None, n_clusters=10, backbone=True, stride=20)[source]
Ward hierarchical clustering of conformations by RMSD.
Builds a strided all-vs-all RMSD matrix and groups frames using Ward’s agglomerative clustering. The number of clusters (
n_clusters) must be specified up front. The chosen “centroid” of each cluster is the member whose summed exponential RMSD weight is largest (i.e. the most internally-connected frame).- Parameters:
region (list of length 2, optional) –
[first_resid, last_resid](inclusive) restricting the atoms used for the RMSD calculation.None(default) uses the full chain.n_clusters (int, optional) – Target number of clusters. Default is 10. Note that the actual number of clusters returned may be smaller if some labels are never assigned.
backbone (bool, optional) – If True (default), use only backbone heavy atoms for the RMSD (much faster). If False, use every atom in the selection.
stride (int, optional) – Use every
stride-th frame. Default is 20.stride=1does an exact all-vs-all but can be very slow.
- Returns:
(cluster_members, cluster_trajs, cluster_distance_matrices, cluster_centroids, cluster_frames)where:cluster_members(list of int): number of frames in each cluster.cluster_trajs(list ofmdtraj.Trajectory): one sub-trajectory per cluster containing the assigned frames.cluster_distance_matrices(list of np.ndarray): per-cluster all-vs-all RMSD submatrix (Angstroms).cluster_centroids(list of int): for each cluster, the index within that cluster of the representative frame.cluster_frames(list of np.ndarray): frame indices in the original trajectory that were assigned to each cluster.
- Return type:
tuple of length 5
Example
>>> members, trajs, dmats, centroids, frames = protein.get_clusters(n_clusters=5) >>> members [120, 95, 80, 60, 45]
- get_all_SASA(probe_radius=1.4, mode='residue', stride=20, weights=False, etol=1e-07)[source]
Solvent-accessible surface area (SASA) per residue or per atom.
Internally uses mdtraj’s Shrake-Rupley implementation (Golden-Spiral algorithm). Results are memoised per
(stride, mode, probe_radius)triple so repeated calls with the same parameters are free.SASA is returned in Angstroms^2;
probe_radiusis in Angstroms. Because the calculation is expensive the defaultstrideis 20.- Parameters:
probe_radius (float, optional) – Solvent probe radius in Angstroms. Default 1.4 (water).
mode ({'residue', 'atom', 'sidechain', 'backbone', 'all'}, optional) –
Granularity:
'residue'(default) - per-residue SASA, shape(n_frames_strided, n_residues).'atom'- per-atom SASA, shape(n_frames_strided, n_atoms).'sidechain'- per-residue SASA summed over sidechain atoms (excluding backbone H, HA, HA2, HA3 which mdtraj’s ‘sidechain’ selector erroneously includes).'backbone'- per-residue SASA summed over backbone atoms (including the backbone hydrogens that mdtraj’s ‘backbone’ selector erroneously excludes).'all'- returns a 3-tuple of (residue, sidechain, backbone) arrays.
stride (int, optional) – Use every
stride-th frame. Default is 20.weights (array_like or False, optional) – Per-frame re-weighting vector (validated against the strided frame axis).
False(default) returns the per-frame array (and uses the memo cache). If supplied, the strided frame axis is collapsed and the deterministic weighted-mean SASA is returned (not cached).etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
See
modedescription. All arrays are in Angstroms^2. Ifweightsis supplied the frame axis is collapsed (e.g.(n_residues,)formode='residue'; a 3-tuple of such arrays formode='all').- Return type:
np.ndarray or tuple of np.ndarray
Example
>>> sasa = protein.get_all_SASA(mode='residue', stride=10) >>> sasa.mean(axis=0) # mean per-residue SASA across frames
- get_regional_SASA(R1, R2, probe_radius=1.4, stride=20, weights=False, etol=1e-07)[source]
Mean total SASA summed over a contiguous residue range.
Calls
get_all_SASA()(whose result is memoised) and sums the time-averaged per-residue SASA values fromR1toR2-1. Note the SASA must be computed on the full protein first because atoms outside the region can occlude region atoms.SASA is returned in Angstroms^2.
- Parameters:
R1 (int) – First residue of the region.
R2 (int) – Last residue of the region (exclusive in the internal sum).
probe_radius (float, optional) – Solvent probe radius in Angstroms. Default 1.4 (water).
stride (int, optional) – Use every
stride-th frame. Default is 20.weights (array_like or False, optional) – Per-frame re-weighting vector (validated against the strided frame axis).
False(default) is byte-identical to before; if supplied each residue’s time-average is the deterministic weighted mean before summing.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
Sum over residues
[R1, R2)of the time-averaged per-residue SASA in Angstroms^2.- Return type:
float
Example
>>> protein.get_regional_SASA(5, 25) 2150.6
- get_site_accessibility(input_list, probe_radius=1.4, mode='residue_type', stride=20, weights=False, etol=1e-07)[source]
Mean & std SASA for selected residue types or specific residue indices.
Under
mode='residue_type',input_listis interpreted as canonical three-letter residue names and every matching residue in the chain is summarised. Undermode='resid'the input is treated as a list of explicit residue indices.Useful for comparing accessibility of chemically-equivalent residues at different positions in the chain.
- Parameters:
input_list (list) –
Either:
three-letter residue names, e.g.
['TRP', 'TYR', 'GLN'](whenmode='residue_type').residue indices, e.g.
[1, 2, 3, 4](whenmode='resid').
probe_radius (float, optional) – Solvent probe radius in Angstroms. Default 1.4 (water).
mode ({'residue_type', 'resid'}, optional) – How to interpret
input_list. Default'residue_type'.stride (int, optional) – Use every
stride-th frame. Default is 20.weights (array_like or False, optional) – Per-frame re-weighting vector (validated against the strided frame axis).
False(default) gives the ordinary mean/std (byte-identical to before); if supplied the per-residue[mean, std]is the deterministic weighted mean and weighted (population) std.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
Keys are
"RESNAME-RESID"strings; each value is[mean_SASA, std_SASA]in Angstroms^2.- Return type:
dict
- Raises:
SSException – If
input_listcontains a residue name not in the canonical 20 + caps set whenmode='residue_type'.
Example
>>> protein.get_site_accessibility(['LEU', 'VAL']) {'LEU-5': [42.1, 6.0], 'LEU-12': [55.4, 8.3], 'VAL-21': [38.0, 5.5], ...}
- get_local_collapse(window_size=10, bins=None, verbose=True, weights=False, etol=1e-07)[source]
Sliding-window local radius of gyration profile.
At every starting residue
ithe local Rg is computed over the window[i, i + window_size], producing a per-position profile of compaction along the chain. Useful for identifying locally collapsed regions in otherwise extended IDPs (and vice versa).- Parameters:
window_size (int, optional) – Size of the sliding window in residues. Must be
<= n_residues. Default 10.bins (np.ndarray or list, optional) – Histogram bin edges (evenly spaced). If None (default),
np.arange(0, 10, 0.01)is used.verbose (bool, optional) – If True (default), print one status line per starting residue.
weights (array_like or False, optional) – Per-frame re-weighting vector.
False(default) is byte-identical to before. If supplied, each window’s mean/std is the deterministic weighted mean / weighted (population) std and the histogram counts are frame-weighted.etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
(mean, std, histo, bins)where:mean(list of float, lengthn_residues - window_size + 1): mean local Rg per starting residue.std(same length): standard deviation of local Rg per starting residue.histo(list of np.ndarray, same length): histogram counts.bins(np.ndarray): the bin edges actually used.
- Return type:
tuple of (list, list, list, np.ndarray)
- Raises:
SSException – If
window_size > n_residuesorbinsis malformed.
Example
>>> mean, std, hist, bins = protein.get_local_collapse(window_size=8) >>> # plot mean vs residue index to see where the chain collapses
Returns (legacy enumeration)
- [0] - list of floats (length = n)
Each float reports on the mean Rg at a specific position along the sequence.
- [1] - list of floats (length = n)
Each float reports on the standard deviation of the Rg distribution at a specific position along the sequence.
- [2] - list of np.ndarrays (length = n)
Histogram counts associated with the local Rg at a given position along the sequence. Basically, this is the emprical distribution that the standard devaitions and mean report on
- [3] - np.ndarray (length = n)
Bin values for histogram counts returned in [2]
- get_sidechain_alignment_angle(R1, R2, sidechain_atom_1='default', sidechain_atom_2='default')[source]
Per-frame angle between two residues’ sidechain orientation vectors.
For each residue a “sidechain vector” is the CA-to-tip vector, where the tip atom is the canonical sidechain terminus (e.g. CB for Ala, CZ for Phe/Tyr/Arg). The returned angle is between those two vectors. Useful for asking whether two sidechains tend to point towards each other (small angle) or away (~180 degrees).
Default tip atoms (from the OPLS-AA force field):
ALA: CB CYS: SG ASP: CG GLU: CD PHE: CZ HIS: NE2 ILE: CD1 LYS: NZ LEU: CG MET: CE ASN: CG PRO: CG GLN: CD ARG: CZ SER: OG THR: CB VAL: CB TRP: CG TYR: CZ
GLY, ACE, NME have no sidechain and raise.
- Parameters:
R1 (int) – Resid of the first residue. Must not be GLY/ACE/NME.
R2 (int) – Resid of the second residue. Must not be GLY/ACE/NME.
sidechain_atom_1 (str, optional) – Override the tip atom of R1.
'default'(default) uses the canonical OPLS-AA atom from the table above.sidechain_atom_2 (str, optional) – Override the tip atom of R2. Same convention as
sidechain_atom_1.
- Returns:
1D array of length
n_frameswith the per-frame alignment angle in degrees.- Return type:
np.ndarray
- Raises:
SSException – If either residue is glycine or a cap, if a residue index is out of range, or if a custom sidechain atom name cannot be found.
Example
>>> theta = protein.get_sidechain_alignment_angle(5, 45) >>> theta.mean() 82.4
- get_dihedral_mutual_information(angle_name='psi', bwidth=0.6283185307179586, stride=1, weights=False, normalize=False)[source]
Mutual-information matrix between every pair of a chosen dihedral.
For each pair of dihedrals \((\phi_i, \phi_j)\) of the same type, this computes the mutual information
\[I(\phi_i; \phi_j) = H(\phi_i) + H(\phi_j) - H(\phi_i, \phi_j)\]where the Shannon entropies \(H\) come from binned histograms. The natural baseline for interpretation is the equivalent matrix computed on a polymer-model reference ensemble (e.g. an excluded-volume simulation), since the absolute values are bin-width-dependent. Setting
normalize=Truedivides each entry by the joint entropy, which can help with comparisons across ensembles of different sizes.- Parameters:
angle_name ({'phi', 'psi', 'omega', 'chi1'}, optional) – Dihedral to use. Default
'psi'. (chi2-chi5are not supported by this method.)bwidth (float, optional) – Histogram bin width on the
[-pi, pi]interval. Smaller widths give a finer-grained estimate but need more data. Defaultpi/5.stride (int, optional) – Use every
stride-th frame. Default 1.weights (array_like, optional) – Per-frame weights for re-weighted MI estimation. Must satisfy
len(weights) == n_frames // stride. DefaultFalse.normalize (bool, optional) – If True, divide each MI value by the joint entropy. Default False.
- Returns:
Symmetric
(n_dihedrals, n_dihedrals)mutual-information matrix. The diagonal is the self-information of each dihedral.- Return type:
np.ndarray
- Raises:
SSException – If
bwidthis not in(0, 2*pi]orangle_nameis not one of the four allowed strings.
Example
>>> MI = protein.get_dihedral_mutual_information(angle_name='psi') >>> MI.shape (92, 92)
- get_local_to_global_correlation(mode='COM', n_cycles=100, max_num_pairs=10, stride=20, weights=False, verbose=True)[source]
How well finite subsets of inter-residue distances predict global Rg.
For each
kin[1, max_num_pairs), this randomly pickskinter-residue distance pairsn_cyclestimes and computes the correlation between the chain’s \(R_g^2\) and the mean-squared of thosekdistances. Averaging the correlations across cycles gives a per-kmeasure of how informative a finite local measurement is about the global chain dimensions.This is intended to inform experimental design (e.g. how many smFRET pairs are “enough” to pin down \(R_g\)).
- Parameters:
mode ({'COM', 'CA'}, optional) – Distance type. Default
'COM'.n_cycles (int, optional) – Number of random pair-resamples per
k. Default 100. Largern_cyclesreduces sampling noise; values below ~50 are not recommended.max_num_pairs (int, optional) – Upper bound on the number of pairs sampled. Default 10. Beyond 10-20 the average correlation typically saturates close to 1.
stride (int, optional) – Use every
stride-th frame. Default 20.weights (array_like or False, optional) – Per-frame weights for the covariance computation. Default
False(uniform weighting).verbose (bool, optional) – If True (default), print one status line per
k.
- Returns:
(raw, n_pairs, mean_corr, std_corr)where:raw(np.ndarray of shape(n_cycles * (max_num_pairs - 1), 2)): one row per resample; col 0 iskand col 1 is the \(R_g^2\) vs. local-mean-square correlation for that resample.n_pairs(np.ndarray): thekvalues used,[1, 2, ..., max_num_pairs - 1].mean_corr(np.ndarray): mean correlation perk.std_corr(np.ndarray): standard deviation of correlation perk(assumes Gaussian distribution, may be approximate).
- Return type:
tuple of length 4
- Raises:
SSException – If
modeis invalid, the strided Rg length disagrees with the inter-residue distance length, or the installed numpy version cannot acceptaweightsinnumpy.cov.
Example
>>> import numpy as np >>> np.random.seed(42) # the pair-resampling consumes the RNG >>> raw, ks, mean_c, std_c = protein.get_local_to_global_correlation()
- get_overlap_concentration()[source]
Overlap concentration \(c^*\) of the chain, in Molar.
Defined as the polymer concentration at which independent chains begin to occupy the same volume in trans, computed by treating each chain as a sphere of radius equal to its mean \(R_g\). Useful as a coarse limit on dilute-regime behaviour.
- Returns:
Overlap concentration \(c^*\) in Molar.
- Return type:
float
Example
>>> protein.get_overlap_concentration() 0.0042
- get_angle_decay(atom1='C', atom2='N', return_all_pairs=False, weights=False, etol=1e-07)[source]
Bond-vector autocorrelation along the chain (persistence-length proxy).
For each residue we form a single intra-residue vector \(\vec{v}_i\) (default
C -> N, the peptide bond direction). For every pair \((i, j)\) we compute \(\langle \hat{v}_i \cdot \hat{v}_j \rangle\) across frames; averaging over all pairs with the same sequence separation \(|i-j|\) yields the decay profile. A faster initial decay implies a less rigid chain.Both
atom1andatom2must exist in every residue, so the peptide C->N vector is essentially the only reasonable choice.- Parameters:
atom1 (str, optional) – Origin atom name. Default
'C'.atom2 (str, optional) – Tip atom name. Default
'N'.return_all_pairs (bool, optional) – If True, additionally return a dict of every individual
"i-j"residue-pair correlation. Default is False.weights (array_like or False, optional) – Per-frame re-weighting vector.
False(default) is byte-identical to before. If supplied, each residue-pair’s mean cos(theta) is the deterministic frame-weighted mean (the over-pairs averaging per separation is unchanged).etol (float, optional) – Tolerance on
|sum(weights) - 1|. Default1e-7.
- Returns:
If
return_all_pairs=False(default): array of shape(n_separations, 3)with columns[separation, mean_corr, std_corr]. The first row is[0, 1.0, 0.0](self correlation).If
return_all_pairs=True:(array, pair_dict)wherepair_dict["i-j"]is the mean correlation for that specific residue pair.
- Return type:
np.ndarray OR tuple of (np.ndarray, dict)
Example
>>> decay = protein.get_angle_decay() >>> # decay[:, 0] is separation, decay[:, 1] is mean correlation