ssprotein

Overview

SSProtein is the core analysis class in SOURSOP and the primary interface for characterizing the conformational behaviour of a single protein chain across a simulation trajectory.

SSProtein objects are not created directly. Instead, they are extracted from an SSTrajectory object after reading a trajectory from disk. Each protein chain in the system is represented by its own SSProtein instance, accessible via SSTrajectory.proteinTrajectoryList.

Analyses fall into several broad categories:

  • Residue utilities — sequence retrieval, atom/CA index lookup, residue center-of-mass positions and masses.

  • Inter-residue distances — pairwise CA or COM distance matrices, distance maps, and polymer-scaled distance maps.

  • Global size and shape — radius of gyration, hydrodynamic radius, end-to-end distance, asphericity, gyration tensor, and the \(t\)-parameter.

  • Secondary structure — per-frame DSSP assignments and BBSEG backbone-torsion-based classification.

  • Polymer scaling — internal scaling profiles (\(\langle r^2 \rangle\) vs sequence separation), the scaling exponent \(\nu\), and local heterogeneity in scaling behaviour.

  • Contact and RMSD analysis — contact maps (with configurable threshold and mode), RMSD to a reference structure, and fraction of native contacts \(Q\).

  • Solvent accessibility — per-residue and region-level SASA via get_all_SASA, get_regional_SASA, and get_site_accessibility.

  • Local dynamics — local collapse profiles, sidechain alignment angles, dihedral mutual information, local-to-global correlation, and angle decay.

  • Clustering and overlap — conformational clustering via get_clusters, overlap concentration \(c^*\).

Most functions return NumPy arrays; per-frame results have shape (n_frames,) or (n_frames, ...) so standard NumPy operations (np.mean, np.std, etc.) apply directly.

By way of an example:

from soursop.sstrajectory import SSTrajectory
import numpy as np

# read in a trajectory
TrajOb = SSTrajectory('traj.xtc', 'start.pdb')

# get the first protein chain (this is an SSProtein object)
ProtObj = TrajOb.proteinTrajectoryList[0]

# print the mean end-to-end distance
mean_e2e = np.mean(ProtObj.get_end_to_end_distance())
print(mean_e2e)

SSProtein Properties

SSProtein objects have a set of object variables associated with them.

class soursop.ssprotein.SSProtein(traj, debug=False, check_one_bead_per_residue=True)[source]
resid_with_CA

Resids of residues that have a CA (alpha-carbon) atom.

ACE and NME capping residues, water, ions, and any non-standard residue without a CA are excluded. Use this whenever you need to iterate over “real” residues and not the cap pseudo-residues.

Returns:

Resids (0-indexed) of residues with a CA atom, in ascending order.

Return type:

list of int

Example

>>> protein.resid_with_CA
[1, 2, 3, ..., 92]   # ctl9_AA has ACE at 0 and NME at 93
ncap

True if an N-terminal capping residue (e.g. ACE) is present.

Inferred from whether resid 0 has a CA atom: if it does not, a cap is assumed to occupy resid 0.

Returns:

True if an N-terminal cap is present, False otherwise.

Return type:

bool

Example

>>> protein.ncap
True
ccap

True if a C-terminal capping residue (e.g. NME) is present.

Inferred from whether the last resid has a CA atom: if it does not, a cap is assumed to occupy that resid.

Returns:

True if a C-terminal cap is present, False otherwise.

Return type:

bool

Example

>>> protein.ccap
True
n_frames

Number of frames in the underlying mdtraj trajectory.

Returns:

Total frame count of the simulation trajectory.

Return type:

int

Example

>>> protein.n_frames
1000
n_residues

Number of residues in the protein, including any cap residues.

Returns:

Total residue count, counting ACE / NME caps if present.

Return type:

int

Example

>>> protein.n_residues
94
residue_index_list

All resids associated with this protein, extracted from the topology.

As of v0.1.3 the list always starts at 0 and increments by one up to n_residues - 1. It is extracted live from the underlying mdtraj.topology (and then cached) so it is useful as a ground-truth reference when debugging residue-index issues.

Returns:

Resids in ascending order; length equals n_residues.

Return type:

list of int

Example

>>> protein.residue_index_list[:5]
[0, 1, 2, 3, 4]
length = <function SSProtein.length>[source]

SSProtein Functions

class soursop.ssprotein.SSProtein(traj, debug=False, check_one_bead_per_residue=True)[source]
reset_cache()[source]

Clear every value the SSProtein has memoised so far.

Many SSProtein methods cache their first-call result (sequences, atom index tables, per-residue centres of mass, SASA results, dihedral angles). This method discards those caches and re-initialises the resid_with_CA / cap flags from the underlying topology. It is rarely necessary in normal use; reach for it if you have mutated the trajectory in-place and need the SSProtein to re-derive its tables, or if you are profiling cold-vs-warm performance.

Return type:

None

Example

>>> protein.get_radius_of_gyration()    # first call: populates caches
>>> protein.reset_cache()               # discard all cached values
>>> protein.get_radius_of_gyration()    # cold-cache call again
print_residues(verbose=True)[source]

Map each zero-indexed residue position to its PDB resname-resid label.

Useful when debugging cap/non-cap offsets or aligning SOURSOP residue indices against a published PDB numbering.

Parameters:

verbose (bool, optional) – If True (default), also prints each "<idx> --> <resname-resid>" line to stdout. If False, the list is returned silently.

Returns:

One [index, resname-resid] pair per residue, in topology order.

Return type:

list of [int, str]

Example

>>> mapping = protein.print_residues(verbose=False)
>>> mapping[:3]
[[0, 'ACE-0'], [1, 'MET-1'], [2, 'ALA-2']]
get_amino_acid_sequence(oneletter=False, numbered=True)[source]

Return the amino-acid sequence of the protein in several formats.

Caps (ACE / NME) and other non-standard residues are included in the sequence using their three-letter or one-letter codes as recorded by the topology.

Parameters:
  • oneletter (bool, optional) – If True, use single-letter amino-acid codes. If False (default), use three-letter codes.

  • numbered (bool, optional) – If True (default), include the resid in each entry as a "RESNAME-RESID" string. If False, just the resname.

Returns:

  • oneletter=False, numbered=True -> ['MET-0', 'ALA-1', ...]

  • oneletter=False, numbered=False -> ['MET', 'ALA', ...]

  • oneletter=True, numbered=True -> the full one-letter string, e.g. "MA..."

  • oneletter=True, numbered=False -> ['M', 'A', ...]

Return type:

list of str OR str

Example

>>> protein.get_amino_acid_sequence(oneletter=True, numbered=True)
'MAAEELANAK...'
>>> protein.get_amino_acid_sequence()[:3]
['MET-0', 'ALA-1', 'ALA-2']
get_residue_atom_indices(resid, atom_name=None)[source]

Look up the atom indices that belong to a specific residue.

Results are memoised on first lookup so repeated calls for the same (resid, atom_name) pair are essentially free. The previous name of this method was residue_atom_com().

Parameters:
  • resid (int) – Resid of the residue to look up.

  • atom_name (str, optional) – If given, restrict the result to atoms whose name matches this string (e.g. "CA", "CB", "N"). If None (default), all atoms of the residue are returned.

Returns:

Atom indices into the underlying mdtraj topology that satisfy the resid / atom_name filter.

Return type:

list of int

Example

>>> ca_atoms = protein.get_residue_atom_indices(5, atom_name='CA')
>>> all_atoms = protein.get_residue_atom_indices(5)
get_CA_index(resid)[source]

Return the atom index of the alpha-carbon (CA) for a residue.

Results are memoised on first lookup so repeated calls are essentially free. Raises SSException if the residue does not have exactly one CA atom (e.g. cap residues like ACE / NME have none).

Parameters:

resid (int) – Resid whose CA atom should be returned.

Returns:

The mdtraj atom index of the CA atom of residue resid.

Return type:

int

Raises:

SSException – If the residue has no CA atom (caps, non-standard residues, or coarse-grained beads named differently).

Example

>>> protein.get_CA_index(5)
37
get_multiple_CA_index(resID_list=None)[source]

Return CA atom indices for many residues at once.

Convenience wrapper around get_CA_index() that handles three input forms: a single integer, a list of integers, or None (meaning “every residue with a CA”). Residues without a CA are silently skipped rather than raising.

Parameters:

resID_list (int, list of int, or None, optional) –

  • None (default) -> use resid_with_CA (every residue that has a CA).

  • A single int -> wrap in a list, return one-element list.

  • A list[int] -> look up each in turn, skipping any that raise.

Returns:

CA atom indices, in the same order as the input residues.

Return type:

list of int

Example

>>> protein.get_multiple_CA_index([5, 10, 15])
[37, 86, 134]
>>> all_ca = protein.get_multiple_CA_index()   # every CA
get_residue_COM(resid, atom_name=None)[source]

Per-frame centre of mass (COM) of a residue or a specific atom of it.

The COM is computed once and cached, so repeated calls for the same (resid, atom_name) pair are essentially free. All positions are returned in Angstroms.

Parameters:
  • resid (int) – Resid of the residue to extract.

  • atom_name (str, optional) – If given, only the named atom contributes to the COM (a single-atom COM is just the atom’s position). If None (default), every atom of the residue contributes.

Returns:

Array of shape (n_frames, 3) giving the COM (x, y, z) of the residue in each frame.

Return type:

np.ndarray

Example

>>> com = protein.get_residue_COM(5)              # full-residue COM
>>> ca  = protein.get_residue_COM(5, 'CA')        # CA atom position
get_residue_mass(R1)[source]

Total mass of every atom belonging to a residue.

The mass is summed from each atom’s element.mass as recorded in the underlying mdtraj topology, in atomic mass units (g/mol).

Parameters:

R1 (int) – Resid whose mass should be summed.

Returns:

Total mass of residue R1 in atomic mass units.

Return type:

float

Example

>>> protein.get_residue_mass(5)   # e.g. leucine
113.16
calculate_all_CA_distances(residueIndex, mode='CA', only_C_terminal_residues=True, stride=1)[source]

Distances between a reference residue and every other CA residue.

Computes either CA-CA or COM-COM distances from residueIndex to all other residues in the chain that have a CA atom. By default only residues C-terminal of the reference are considered, which is the right choice when building an all-vs-all upper-triangular matrix.

Distance is returned in Angstroms.

Parameters:
  • residueIndex (int) – Reference residue. Must have a CA atom; if not, the function returns -1 rather than raising.

  • mode ({'CA', 'COM'}, optional) –

    • 'CA' (default) - distance between alpha-carbon atoms.

    • 'COM' - distance between residue centres of mass.

  • only_C_terminal_residues (bool, optional) – If True (default) only consider residues with index strictly greater than residueIndex. If False, also include residues N-terminal of the reference.

  • stride (int, optional) – Use every stride-th frame. Default is 1.

Returns:

  • On success: array of shape (n_frames_after_stride, M) where M is the number of other residues considered.

  • If residueIndex has no CA: the integer -1.

Return type:

np.ndarray or int

Raises:

SSException – If mode is not one of 'CA' or 'COM'.

Example

>>> d = protein.calculate_all_CA_distances(5)
>>> d.shape
(1000, 86)
get_inter_residue_COM_distance(R1, R2, stride=1, weights=False, etol=1e-07)[source]

Per-frame distance between two residues’ centres of mass.

Distances are in Angstroms.

Parameters:
  • R1 (int) – Resid of the first residue.

  • R2 (int) – Resid of the second residue.

  • stride (int, optional) – Use every stride-th frame. Default is 1.

  • weights (array_like or False, optional) – Per-frame re-weighting vector, one entry per trajectory frame (len == n_frames), validated by soursop.ssutils.validate_weights() (stride-subsampled and renormalised to align with the strided frames). False (default) returns the per-frame array; if supplied the frame axis is collapsed and the scalar deterministic weighted-mean distance is returned.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

If weights is False: 1D array of length ceil(n_frames / stride) with the inter-residue COM distance for each (strided) frame. If weights is supplied: the scalar weighted-mean distance. Angstroms.

Return type:

np.ndarray or float

Example

>>> d = protein.get_inter_residue_COM_distance(5, 45)
>>> d.mean(), d.std()
(12.3, 1.1)
get_inter_residue_COM_vector(R1, R2)[source]

Per-frame displacement vector COM(R1) - COM(R2) between residue COMs.

Unlike get_inter_residue_COM_distance(), this returns the full (dx, dy, dz) separation rather than its magnitude. The convention is COM(R1) - COM(R2), i.e. the vector that points from residue R2’s COM to residue R1’s COM. Units are Angstroms.

Note

Units changed in v0.2.0; prior to that release the vector was returned in nanometres.

Parameters:
  • R1 (int) – Resid of the first residue (the “head” of the vector).

  • R2 (int) – Resid of the second residue (the “tail” of the vector).

Returns:

Array of shape (n_frames, 3) giving the displacement COM(R1) - COM(R2) in each frame, in Angstroms.

Return type:

np.ndarray

Example

>>> v = protein.get_inter_residue_COM_vector(5, 45)
>>> v.shape
(1000, 3)
get_inter_residue_atomic_distance(R1, R2, A1='CA', A2='CA', mode='atom', stride=1, weights=False, etol=1e-07)[source]

Per-frame distance between two residues, with mode-dependent semantics.

For mode='atom' the distance is between the named atoms A1 of residue R1 and A2 of residue R2. For all other modes the A1/A2 arguments are ignored and the distance is the one returned by mdtraj.compute_contacts under the chosen scheme.

SOURSOP does no sanity checking on atom names — if a name is invalid, an SSException is raised with a hint about likely mistypes.

Distance is returned in Angstroms.

Parameters:
  • R1 (int) – Resids of the two residues to measure between.

  • R2 (int) – Resids of the two residues to measure between.

  • A1 (str, optional) – For mode='atom', the atom names within R1 and R2. Defaults are 'CA'. Ignored for all other modes.

  • A2 (str, optional) – For mode='atom', the atom names within R1 and R2. Defaults are 'CA'. Ignored for all other modes.

  • mode ({'atom', 'ca', 'closest', 'closest-heavy', 'sidechain', 'sidechain-heavy'}, optional) –

    • 'atom' (default) - distance between atoms A1 and A2.

    • 'ca' - equivalent to mode='atom', A1=A2='CA'.

    • 'closest' - closest approach between any atom of R1 and any atom of R2 in each frame.

    • 'closest-heavy' - same as 'closest' but ignoring hydrogens.

    • 'sidechain' - closest sidechain atom of R1 to closest sidechain atom of R2 (requires mdtraj >= 1.8.0).

    • 'sidechain-heavy' - sidechain heavy-atom version (requires mdtraj >= 1.8.0). Raises for glycine residues.

  • stride (int, optional) – Use every stride-th frame. Default is 1.

  • weights (array_like or False, optional) – Per-frame re-weighting vector, one entry per trajectory frame (len == n_frames), validated by soursop.ssutils.validate_weights() and stride-aligned. False (default) returns the per-frame array; if supplied the frame axis is collapsed to the scalar weighted mean.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

If weights is False: 1D array of length ceil(n_frames / stride) with the inter-residue distance for each (strided) frame. If weights is supplied: the scalar deterministic weighted-mean distance. Angstroms.

Return type:

np.ndarray or float

Raises:

SSException – If mode is invalid, an atom name cannot be found in the corresponding residue, or the weight vector fails validation.

Example

>>> d = protein.get_inter_residue_atomic_distance(5, 45)               # CA-CA by default
>>> d_min = protein.get_inter_residue_atomic_distance(5, 45, mode='closest-heavy')
get_distance_map(mode='CA', RMS=False, stride=1, return_instantaneous_maps=False, weights=False, verbose=True)[source]

Inter-residue distance map (and its standard deviation).

Builds the full N x N matrix of inter-residue distances where N is the number of residues with a CA atom (ACE / NME caps are therefore excluded). Distances are returned in Angstroms. The matrix is upper triangular (entries below the diagonal are zero).

Parameters:
  • mode ({'CA', 'COM'}, optional) –

    • 'CA' (default) - distances between alpha-carbon atoms.

    • 'COM' - distances between residue centres of mass.

  • RMS (bool, optional) – If True, report root-mean-square distances \(\sqrt{\langle r_{ij}^2 \rangle}\), the polymer-physics order parameter. If False (default), report the ensemble mean \(\langle r_{ij} \rangle\).

  • stride (int, optional) – Use every stride-th frame. Default is 1 (every frame).

  • return_instantaneous_maps (bool, optional) – If True, the first element of the returned tuple is a 3D array (n_frames, N, N) of per-frame distance maps rather than the 2D ensemble-average map. Default is False.

  • weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging (e.g. T-WHAM output). Default False means uniform weighting; in that case the std-map is also returned, otherwise it is None.

  • verbose (bool, optional) – If True (default), print one status line per row of the matrix. This calculation can be slow on long trajectories.

Returns:

(distance_map, std_distance_map) where:

  • distance_map is (N, N) ensemble-mean distances (or (n_frames, N, N) per-frame distances if return_instantaneous_maps=True).

  • std_distance_map is (N, N) per-pair standard deviations across frames; None if weights was given.

Return type:

tuple of (np.ndarray, np.ndarray)

Example

>>> mean_map, std_map = protein.get_distance_map(verbose=False)
>>> mean_map.shape
(92, 92)
>>> rms_map, _ = protein.get_distance_map(mode='COM', RMS=True, verbose=False)
get_polymer_scaled_distance_map(nu=None, A0=None, min_separation=10, mode='fractional-change', stride=1, weights=False, etol=1e-07, verbose=True)[source]

Quantify how each inter-residue distance deviates from a homopolymer model.

For a standard polymer the equilibrium distance scales as \(\langle r_{ij} \rangle = A_0 |i-j|^{\nu}\). This method builds an N x N matrix where each entry quantifies how far the observed mean distance is from that homopolymer prediction. The deviation can be expressed in several ways (see mode).

Note that \(A_0\) here is the prefactor of the inter-residue distance scaling and is NOT the same numerical value as the \(R_0\) prefactor that defines \(R_g = R_0 N^{\nu}\). The scaling exponent \(\nu\) should typically lie in [0.33, 0.598] for real polymers.

If nu and A0 are not supplied, the homopolymer model is fit automatically by get_scaling_exponent() (with default settings) and that fit is used as the reference model.

Parameters:
  • nu (float, optional) – Scaling exponent used as the homopolymer reference. If supplied, A0 must also be supplied. If None (default), both are obtained by calling get_scaling_exponent().

  • A0 (float, optional) – Scaling prefactor for the homopolymer reference; paired with nu.

  • min_separation (int, optional) – Minimum sequence separation |i-j| for which a deviation is computed. Entries with separations below this threshold are filled with the mode-dependent default value (0 for the change modes, 1 for 'scaled'). Default is 10.

  • mode ({'fractional-change', 'signed-fractional-change', 'signed-absolute-change', 'scaled'}, optional) –

    How to quantify the deviation between the observed mean distance \(r_{ij}\) and the homopolymer prediction \(p_{ij}\):

    • 'fractional-change' (default): \(|r_{ij} - p_{ij}| / p_{ij}\). Unsigned magnitude.

    • 'signed-fractional-change': \((r_{ij} - p_{ij}) / p_{ij}\). Positive => expansion vs. the homopolymer, negative => contraction.

    • 'signed-absolute-change': \(r_{ij} - p_{ij}\) (Angstroms).

    • 'scaled': \(r_{ij} / p_{ij}\) (default value 1 on near-diagonal).

  • stride (int, optional) – Use every stride-th frame for the distance map. Default is 1.

  • weights (list or np.ndarray, optional) – Per-frame weights for re-weighted analysis (e.g. T-WHAM output). Default False means uniform weighting.

  • etol (float, optional) – Tolerance for the weights-sum-to-one check. Default 1e-7.

  • verbose (bool, optional) – If True (default), print status updates during the distance map and (when needed) the scaling-exponent fit.

Returns:

(map, nu, A0, redchi) where:

  • map is an (N, N) numpy matrix of deviations (units depend on mode).

  • nu is the homopolymer scaling exponent used.

  • A0 is the homopolymer scaling prefactor used.

  • redchi is the reduced chi-squared of the homopolymer fit (-1 if nu / A0 were supplied rather than fit).

Return type:

tuple of (np.ndarray, float, float, float)

Raises:

SSException – If mode is not one of the four allowed values, if min_separation < 1, if exactly one of nu/A0 is given, or if the supplied nu/A0 are out of physical range.

Example

>>> m, nu, A0, chi2 = protein.get_polymer_scaled_distance_map(verbose=False)
>>> m.shape, round(nu, 2)
((92, 92), 0.6)
get_radius_of_gyration(R1=None, R2=None, weights=False, etol=1e-07)[source]

Per-frame radius of gyration of the chain (or a sub-region).

\(R_g = \sqrt{(1/N) \sum_i (r_i - R)^2}\) is the canonical polymer-physics measure of overall chain size. It is computed by mdtraj.compute_rg on every heavy/light atom in the selection and returned in Angstroms.

Parameters:
  • R1 (int, optional) – First residue of the region. None (default) means the first residue including caps.

  • R2 (int, optional) – Last residue of the region. None (default) means the last residue including caps.

  • weights (array_like or False, optional) – Per-frame re-weighting vector (validated by soursop.ssutils.validate_weights()). False (default) returns the unmodified per-frame array. If supplied, the frame axis is collapsed and the single deterministic weighted-mean \(R_g\) is returned instead.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

If weights is False: 1D array of length n_frames with the per-frame \(R_g\) in Angstroms. If weights is supplied: the scalar weighted-mean \(R_g\).

Return type:

np.ndarray or float

Example

>>> rg = protein.get_radius_of_gyration()
>>> rg.mean()
33.4
get_hydrodynamic_radius(R1=None, R2=None, mode='nygaard', alpha1=0.216, alpha2=4.06, alpha3=0.821, distance_mode='CA', weights=False, etol=1e-07)[source]

Per-frame apparent hydrodynamic radius \(R_h\) (in Angstroms).

Two estimators are available:

  • mode='nygaard' - empirical Rg-to-Rh conversion of Nygaard et al. (2017), using the three parameters alpha1, alpha2, alpha3.

  • mode='kr' - Kirkwood-Riseman equation (Kirkwood & Riseman 1948), recommended for comparison with PFG-NMR-derived \(R_h\) values (Pesce et al. 2022).

Both approximations hold for fully flexible disordered proteins and become less accurate for larger / more folded domains.

Parameters:
  • R1 (int, optional) – First residue of the region. None (default) means the first residue including caps.

  • R2 (int, optional) – Last residue of the region. None (default) means the last residue including caps.

  • mode ({'nygaard', 'kr'}, optional) – Which estimator to use. Default is 'nygaard'.

  • alpha1 (float, optional) – Parameters of equation (7) of Nygaard et al. Defaults reproduce the published values (0.216, 4.06, 0.821). Ignored for mode='kr'.

  • alpha2 (float, optional) – Parameters of equation (7) of Nygaard et al. Defaults reproduce the published values (0.216, 4.06, 0.821). Ignored for mode='kr'.

  • alpha3 (float, optional) – Parameters of equation (7) of Nygaard et al. Defaults reproduce the published values (0.216, 4.06, 0.821). Ignored for mode='kr'.

  • distance_mode ({'CA', 'COM'}, optional) – For mode='kr', which inter-residue distance to feed into the Kirkwood-Riseman sum. Default is 'CA'. Ignored for mode='nygaard'.

  • weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) returns the per-frame array; if supplied the scalar deterministic weighted-mean \(R_h\) is returned.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame \(R_h\) (length n_frames), or the scalar weighted mean if weights is supplied. Angstroms.

Return type:

np.ndarray or float

Example

>>> rh_nyg = protein.get_hydrodynamic_radius()
>>> rh_kr  = protein.get_hydrodynamic_radius(mode='kr')
>>> rh_nyg.mean(), rh_kr.mean()
(24.1, 23.8)

References

Nygaard M, Kragelund BB, Papaleo E, Lindorff-Larsen K. An Efficient Method for Estimating the Hydrodynamic Radius of Disordered Protein Conformations. Biophys J. 2017;113:550-557.

Kirkwood JG, Riseman J. The Intrinsic Viscosities and Diffusion Constants of Flexible Macromolecules in Solution. J Chem Phys. 1948;16(6):565-573.

Pesce F et al. Assessment of models for calculating the hydrodynamic radius of intrinsically disordered proteins. Biophysical Journal. 2022. doi:10.1016/j.bpj.2022.12.013

get_end_to_end_distance(mode='COM', weights=False, etol=1e-07)[source]

Per-frame end-to-end distance between the first and last CA residues.

Caps (ACE / NME) are excluded — the “ends” are the first and last residues with a CA atom, so the two modes operate on the same pair of residues.

End-to-end distance is returned in Angstroms.

Parameters:
  • mode ({'CA', 'COM'}, optional) –

    • 'COM' (default) - distance between residue centres of mass.

    • 'CA' - distance between alpha-carbon atoms.

  • weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) returns the per-frame array; if supplied the scalar deterministic weighted-mean end-to-end distance is returned.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame end-to-end distance (length n_frames), or the scalar weighted mean if weights is supplied. Angstroms.

Return type:

np.ndarray or float

Example

>>> ee = protein.get_end_to_end_distance(mode='CA')
>>> ee.mean(), ee.std()
(30.1, 4.4)
get_end_to_end_vs_rg_correlation(mode='COM', weights=False, etol=1e-07)[source]

Pearson correlation between per-frame \(R_e^2\) and \(R_g^2\).

For a Gaussian chain Debye’s result gives \(R_e^2 = 6 R_g^2\), but in practice the correlation is a more useful diagnostic of how fractal-like (or otherwise) the chain is. This method returns the Pearson correlation across frames between the end-to-end distance squared and the radius of gyration squared.

Parameters:

mode ({'COM', 'CA'}, optional) – Type of distance to use for the end-to-end distance. Default 'COM'.

Returns:

Pearson correlation coefficient in [-1, 1].

Return type:

float

Example

>>> protein.get_end_to_end_vs_rg_correlation()
0.92
get_asphericity(R1=None, R2=None, verbose=True, weights=False, etol=1e-07)[source]

Per-frame asphericity of the chain (or a sub-region).

Asphericity is a dimensionless shape descriptor computed from the eigenvalues of the gyration tensor: it is 0 for a perfectly spherical distribution of mass and approaches 1 as the chain becomes increasingly extended/rod-like.

See p. 65 of Andreas Vitalis’ thesis (Probing the Early Stages of Polyglutamine Aggregation with Computational Methods, 2009, WashU) for a canonical derivation.

Parameters:
  • R1 (int, optional) – First residue of the region to consider. None (default) means the first residue in the sequence (including caps).

  • R2 (int, optional) – Last residue of the region to consider. None (default) means the last residue in the sequence (including caps).

  • verbose (bool, optional) – If True (default), print one status line every 500 frames during the gyration-tensor computation.

  • weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) returns the per-frame array; if supplied the frame axis is collapsed and the scalar deterministic weighted-mean asphericity is returned.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame asphericity (length n_frames), or the scalar weighted mean if weights is supplied.

Return type:

np.ndarray or float

Example

>>> a = protein.get_asphericity(verbose=False)
>>> a.mean()
0.18
get_t(R1=None, R2=None, weights=False, etol=1e-07)[source]

Per-frame dimensionless size parameter \(\langle t \rangle\).

\(\langle t \rangle\) is the Vitalis-thesis size parameter (Vitalis 2009) that scales the radius of gyration against the contour length \(L = 3.6 N\) (Angstroms). It is dimensionless and useful for comparing chains of different lengths.

Parameters:
  • R1 (int, optional) – First residue of the region. None (default) means the first residue.

  • R2 (int, optional) – Last residue of the region. None (default) means the last residue.

  • weights (array_like or False, optional) – Per-frame re-weighting vector (len == n_frames), validated by soursop.ssutils.validate_weights(). False (default) returns the per-frame array; if supplied the frame axis is collapsed and the scalar deterministic weighted-mean \(\langle t \rangle\) is returned.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame \(\langle t \rangle\) (length n_frames), or the scalar weighted mean if weights is supplied.

Return type:

np.ndarray or float

Example

>>> t = protein.get_t()
>>> t.mean()
2.4

References

Vitalis, A. (2009). Probing the Early Stages of Polyglutamine Aggregation with Computational Methods (R. Pappu, ed.) [Ph.D. thesis from Washington University in St. Louis].

get_gyration_tensor(R1=None, R2=None, verbose=True, weights=False, etol=1e-07)[source]

Per-frame 3 x 3 gyration tensor of the chain (or a sub-region).

The gyration tensor \(T_{ab} = (1/N) \sum_i (r_{i,a} - R_a)(r_{i,b} - R_b)\) is the second moment of mass about the centre of mass. Its eigenvalues are the principal moments of inertia divided by mass and underlie derived quantities like get_radius_of_gyration() and get_asphericity(). Units are Angstroms^2.

Parameters:
  • R1 (int, optional) – First residue of the region to consider. None (default) means the first residue (including caps).

  • R2 (int, optional) – Last residue of the region to consider. None (default) means the last residue (including caps).

  • verbose (bool, optional) – If True (default), print one status line every 500 frames.

  • weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) returns the per-frame (n_frames, 3, 3) array; if supplied the frame axis is collapsed and a single deterministic weighted-mean (3, 3) tensor is returned.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

(n_frames, 3, 3) per-frame tensors, or a single (3, 3) weighted-mean tensor if weights is supplied.

Return type:

np.ndarray

Example

>>> T = protein.get_gyration_tensor(verbose=False)
>>> T.shape
(1000, 3, 3)
get_angles(angle_name)[source]

Per-frame backbone or sidechain dihedral angles for every applicable residue.

Computes the named dihedral using the corresponding mdtraj helper (compute_phi, compute_psi, …, compute_chi5) and converts the result to degrees. The first call for a given angle_name is memoised so repeated calls are essentially free.

Parameters:

angle_name ({'phi', 'psi', 'omega', 'chi1', 'chi2', 'chi3', 'chi4', 'chi5'}) – Dihedral to compute. chi3-chi5 only exist for residues with long sidechains (Arg, Lys, Met, etc.), so the result will contain fewer entries than n_residues.

Returns:

A two-element list:

  • [0] is a list-of-lists; each sub-list holds the four mdtraj.Atom objects that define one dihedral.

  • [1] is an array of shape (n_dihedrals, n_frames) of the per-frame angles in degrees.

Return type:

list of [list_of_mdtraj_atoms, np.ndarray]

Raises:

SSException – If angle_name is not one of the eight allowed strings.

Example

For a protein with two arginine residues:

>>> chi5 = protein.get_angles('chi5')
>>> chi5[0][1]                  # atoms defining 2nd arginine's chi5
>>> chi5[1][1]                  # per-frame chi5 angles for that residue
get_secondary_structure_DSSP(R1=None, R2=None, return_per_frame=False, weights=False, etol=1e-07)[source]

Per-residue secondary structure (helix / extended / coil) via DSSP.

Calls mdtraj.compute_dssp on the chosen residue range and summarises the output into per-residue fractional occupancies (the default) or per-frame binary masks (with return_per_frame=True).

The three buckets are H (helix), E (extended/beta), C (coil). At every residue the three values sum to 1 (default mode).

Parameters:
  • R1 (int, optional) – First residue of the region. None (default) means the first CA-bearing residue.

  • R2 (int, optional) – Last residue of the region. None (default) means the last CA-bearing residue.

  • return_per_frame (bool, optional) –

    • False (default): return per-residue fractional occupancies.

    • True: return per-frame binary 1/0 masks.

  • weights (array_like or False, optional) – Per-frame re-weighting vector (len == n_frames), validated by soursop.ssutils.validate_weights(). Only valid with return_per_frame=False: the per-residue fractional occupancies become the deterministic frame-weighted fractions. Supplying weights with return_per_frame=True raises an SSException (a per-frame mask is not an ensemble average). Default False (unweighted).

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

(resid_list, helix, extended, coil) where:

  • resid_list (list of int): residue indices covered.

  • If return_per_frame=False: helix, extended, coil are 1D arrays of length len(resid_list) with per-residue fractional time in each bucket.

  • If return_per_frame=True: helix, extended, coil are 2D arrays of shape (n_frames, len(resid_list)) with 1 / 0 entries indicating presence of that bucket in that frame.

Return type:

tuple of length 4

Example

>>> resids, h, e, c = protein.get_secondary_structure_DSSP()
>>> h.mean()    # mean helicity across the chain
0.32
get_secondary_structure_BBSEG(R1=None, R2=None, return_per_frame=False, weights=False, etol=1e-07)[source]

Per-residue secondary structure via the BBSEG2 phi/psi classification.

Classifies each (phi, psi) pair into one of 9 BBSEG2 bins using a lookup table distributed with CAMPARI. Implementation is in pure Python, so the result is version-stable across mdtraj releases.

BBSEG2 classes:

  • 0 - unclassified

  • 1 - beta (turn/sheet)

  • 2 - PII (polyproline type II helix)

  • 3 - unusual region

  • 4 - right-handed alpha helix

  • 5 - inverse C7-equatorial (gamma-prime turn)

  • 6 - classic C7-equatorial (gamma turn)

  • 7 - 7-residue-per-turn helix

  • 8 - left-handed alpha helix

Parameters:
  • R1 (int, optional) – First residue of the region. None (default) means the first residue with phi/psi defined.

  • R2 (int, optional) – Last residue of the region. None (default) means the last residue with phi/psi defined.

  • return_per_frame (bool, optional) –

    • False (default): return per-residue fractional occupancies.

    • True: return per-frame binary masks.

  • weights (array_like or False, optional) – Per-frame re-weighting vector (len == n_frames), validated by soursop.ssutils.validate_weights(). Only valid with return_per_frame=False: the per-class per-residue fractional occupancies become the deterministic frame-weighted fractions. Supplying weights with return_per_frame=True raises an SSException. Default False (unweighted).

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

(resid_list, per_class) where:

  • resid_list (list of int): residue indices covered.

  • per_class (dict): keys 0..8 (int), values are either 1D fractional-occupancy arrays (length len(resid_list)) or 2D (n_frames, len(resid_list)) binary masks depending on return_per_frame.

Return type:

tuple of (list, dict)

Example

>>> resids, classes = protein.get_secondary_structure_BBSEG()
>>> classes[4].mean()    # mean right-handed-alpha occupancy across chain
get_internal_scaling(R1=None, R2=None, mode='COM', mean_vals=False, stride=1, weights=False, etol=1e-07, verbose=True)[source]

Internal scaling profile: sequence separation vs. inter-residue distance.

For every sequence separation \(|i-j|\) from 0 to R2-R1, this method collects the inter-residue distances of every pair at that separation across every frame. The result is the raw material for an internal-scaling plot — a foundational polymer-physics observable for IDPs and IDRs (Mao et al. 2013; Pappu et al. 2008).

ACE / NME peptide caps are excluded. (Note: this differs from CAMPARI, which includes them.)

Distances are in Angstroms.

Parameters:
  • R1 (int, optional) – First residue of the region. None (default) means the first CA-bearing residue.

  • R2 (int, optional) – Last residue of the region. None (default) means the last CA-bearing residue.

  • mode ({'COM', 'CA'}, optional) –

    • 'COM' (default) - distances between residue centres of mass.

    • 'CA' - distances between alpha-carbon atoms.

  • mean_vals (bool, optional) – If True, the per-separation distances are reduced to their mean before being returned. If False (default), every individual inter-residue distance is kept (useful for plotting the full distribution).

  • stride (int, optional) – Use every stride-th frame. Default is 1. Larger values are recommended for long trajectories of large proteins.

  • weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging. Default False means uniform weighting.

  • etol (float, optional) – Tolerance for the weights-sum-to-one check. Default 1e-7.

  • verbose (bool, optional) – If True (default), print one status line per sequence separation.

Returns:

(seq_sep, distances) where:

  • seq_sep is a list of integer sequence separations [0, 1, ..., R2-R1].

  • If mean_vals=True, distances is a list of per-separation mean distances (one float per separation).

  • If mean_vals=False (default), distances is a list of 1D arrays — one array per separation containing every individual inter-residue distance at that separation.

Return type:

tuple of (list, list)

Example

>>> sep, mean_d = protein.get_internal_scaling(mean_vals=True, verbose=False)
>>> # plot sep vs mean_d for an internal-scaling profile

References

Mao AH, Lyle N, Pappu RV. Describing sequence-ensemble relationships for intrinsically disordered proteins. Biochem J. 2013;449:307-318.

Pappu RV, Wang X, Vitalis A, Crick SL. A polymer-physics perspective on driving forces and mechanisms for protein aggregation. Arch Biochem Biophys. 2008;469:132-141.

Notes

Computational cost scales rapidly with chain length; consider increasing stride for trajectories of long proteins.

get_internal_scaling_RMS(R1=None, R2=None, mode='COM', stride=1, weights=False, etol=1e-07, verbose=True)[source]

Root-mean-square internal scaling profile: separation vs. RMS distance.

Like get_internal_scaling() but reports \(\sqrt{\langle r_{ij}^2 \rangle}\) for each sequence separation — the formally correct order parameter for polymer-physics analyses (Mao et al. 2013; Pappu et al. 2008). ACE / NME caps are excluded. Distances in Angstroms.

Parameters:
  • R1 (int, optional) – First residue of the region. None (default) means the first CA-bearing residue.

  • R2 (int, optional) – Last residue of the region. None (default) means the last CA-bearing residue.

  • mode ({'COM', 'CA'}, optional) –

    • 'COM' (default) - distances between residue centres of mass.

    • 'CA' - distances between alpha-carbon atoms.

  • stride (int, optional) – Use every stride-th frame. Default is 1.

  • weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging. Default False.

  • etol (float, optional) – Tolerance for the weights-sum-to-one check. Default 1e-7.

  • verbose (bool, optional) – If True (default), print one status line per sequence separation.

Returns:

(seq_sep, rms_distance) where seq_sep is the list of integer separations and rms_distance is the corresponding list of RMS inter-residue distances in Angstroms.

Return type:

tuple of (list, list)

Example

>>> sep, rms = protein.get_internal_scaling_RMS(verbose=False)
>>> import numpy as np
>>> # log-log fit of rms vs sep recovers the apparent scaling exponent
>>> nu, _ = np.polyfit(np.log(sep[1:]), np.log(rms[1:]), 1)

References

Mao AH, Lyle N, Pappu RV. Describing sequence-ensemble relationships for intrinsically disordered proteins. Biochem J. 2013;449:307-318.

Pappu RV, Wang X, Vitalis A, Crick SL. A polymer-physics perspective on driving forces and mechanisms for protein aggregation. Arch Biochem Biophys. 2008;469:132-141.

get_scaling_exponent(inter_residue_min=15, end_effect=5, subdivision_batch_size=20, mode='COM', num_fitting_points=40, fraction_of_points=0.5, fraction_override=False, etol=1e-07, stride=1, weights=False, verbose=True)[source]

Fit the apparent polymer scaling exponent and prefactor by loglog regression.

Fits the homopolymer scaling relationship \(\sqrt{\langle r_{ij}^2 \rangle} = A_0 |i-j|^{\nu_{app}}\) to the internal-scaling profile of the chain. The exponent \(\nu_{app}\) (“nu-app”) reports on solvent quality and the prefactor \(A_0\) captures chain stiffness / segment volume.

Bootstrap error estimates for nu and A0 come from randomly subdividing the trajectory into subdivision_batch_size chunks, fitting each, and taking the min/max across the chunks.

Note

Despite the precision of the fit, nu_app and A0 are qualitative metrics subject to finite-chain-size effects, and are most rigorously interpreted for genuine homopolymers. For heteropolymers consider get_polymer_scaled_distance_map() which avoids the homopolymer assumption.

Parameters:
  • inter_residue_min (int, optional) – Minimum sequence separation |i-j| used in the fit. Helps avoid the short-distance regime where local steric effects dominate. Default is 15.

  • end_effect (int, optional) – Exclude pairs in which one residue is within end_effect residues of either chain end. Default is 5.

  • subdivision_batch_size (int, optional) – Target chunk size when bootstrapping fit-quality errors. Default is 20.

  • mode ({'COM', 'CA'}, optional) – Distance type. 'COM' (default) is preferred — CA-CA tends to inflate the apparent profile.

  • num_fitting_points (int, optional) – Maximum number of evenly-spaced loglog points used in the linear fit. Default is 40.

  • fraction_of_points (float, optional) – When fraction_override is True, or when the chain is too short to provide num_fitting_points points, this fraction of the valid sequence separations is used instead. Default is 0.5.

  • fraction_override (bool, optional) – If True, always use fraction_of_points and ignore num_fitting_points. Default is False.

  • etol (float, optional) – Tolerance for the weights-sum-to-one check. Default 1e-7.

  • stride (int, optional) – Use every stride-th frame. Default is 1.

  • weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging. Default False.

  • verbose (bool, optional) – If True (default), print status updates during the internal-scaling pass.

Returns:

(best_nu, best_A0, min_nu, max_nu, min_A0, max_A0, redchi_fit_region, redchi_all_points, fit_region_data, all_points_data) where:

  • best_nu, best_A0 - point estimates from the full fit.

  • min_nu..``max_A0`` - bootstrap min/max bounds.

  • redchi_fit_region - reduced chi^2 on the points used in the linear fit (loglog-uniform).

  • redchi_all_points - reduced chi^2 across every observed \(|i-j|\).

  • fit_region_data - shape (2, K); col 0 is sequence separation, col 1 is RMS distance for the fit-region points.

  • all_points_data - shape (3, M); col 0 is sequence separation, col 1 is observed RMS distance, col 2 is the best-fit prediction.

Return type:

tuple of length 10

Raises:

SSException – If fraction_of_points > 1.0 or if too few points remain to fit a line (< 3).

Example

>>> import numpy as np
>>> np.random.seed(42)   # error bootstrap consumes the RNG
>>> result = protein.get_scaling_exponent(verbose=False)
>>> nu, A0 = result[0], result[1]
get_local_heterogeneity(fragment_size=10, bins=None, stride=20, verbose=True, weights=False, etol=1e-07)[source]

Sliding-window heterogeneity: per-position distribution of intra-window RMSDs.

At each starting residue i (from 0 to n_residues - fragment_size) the method computes the RMSD of the local fragment [i, i+fragment_size] across every (strided) frame, then summarises that distribution by its mean, standard deviation, and histogram. The Phi parameter of Lyle, Das & Pappu (2013) is built from this same family of D-values.

Computational cost scales with n_residues * (n_frames / stride); the default stride=20 keeps things tractable on long trajectories.

Note

weights / etol are accepted for API uniformity with the other ensemble-average methods, but supplying weights raises an SSException: this is a pairwise frame-vs-frame measure for which a single per-frame probability vector is not well-defined.

Parameters:
  • fragment_size (int, optional) – Size of the sliding window (residues). Must be >= 2 and <= n_residues. Default is 10.

  • bins (np.ndarray or list, optional) – Histogram bin edges. If None (default), uses np.arange(0, 10, 0.01). Must contain at least two evenly- spaced floats.

  • stride (int, optional) – Use every stride-th frame in the inner RMSD pass. Default 20.

  • verbose (bool, optional) – If True (default), print one status line per starting residue.

Returns:

(mean, std, histo, bins) where:

  • mean (list of float, length n_residues - fragment_size): mean intra-window RMSD per starting residue.

  • std (list of float, same length): standard deviation of the same RMSD distribution.

  • histo (list of np.ndarray, same length): histogram counts per starting residue.

  • bins (np.ndarray): the bin edges used (echoed back).

Return type:

tuple of (list, list, list, np.ndarray)

Raises:

SSException – If fragment_size is out of range, or bins is not a valid evenly-spaced 1D vector.

Example

>>> mean, std, hist, bins = protein.get_local_heterogeneity(fragment_size=8)

References

Lyle N, Das RK, Pappu RV. A quantitative measure for protein conformational heterogeneity. J Chem Phys. 2013;139(12):121907.

get_D_vector(stride=20, verbose=True)[source]

Pairwise frame-vs-frame conformational dissimilarity vector \(D\).

For every pair of (strided) frames \((A, B)\) the method computes a single dissimilarity scalar \(D_{AB} = 1 - V_A \cdot V_B / (||V_A|| ||V_B||)\) where \(V\) is the flattened upper-triangular CA-CA distance vector for that frame. This is the input to the Phi parameter of Lyle, Das, Pappu (2013).

Using CA-CA distances (rather than full inter-atomic distances) makes the measure backbone-specific and much faster than the original formulation.

Parameters:
  • stride (int, optional) – Use every stride-th frame. Default is 20.

  • verbose (bool, optional) – If True (default), print one status line per outer-loop frame.

Returns:

1D array of length \(\binom{n_{frames}/\text{stride}}{2}\) containing the pairwise dissimilarity values.

Return type:

np.ndarray

Example

>>> D = protein.get_D_vector(stride=20)
>>> D.mean()   # mean pairwise dissimilarity
0.27

References

Lyle N, Das RK, Pappu RV. A quantitative measure for protein conformational heterogeneity. J Chem Phys. 2013;139(12):121907.

get_RMSD(frame1, frame2=-1, region=None, backbone=True, stride=1)[source]

Aligned root-mean-square deviation (RMSD) of one frame vs. another (or all).

Both modes superpose the target onto the reference before computing the RMSD, so the result is rotation/translation invariant. Units are Angstroms.

Parameters:
  • frame1 (int) – Index of the reference frame.

  • frame2 (int, optional) – Index of the comparison frame. -1 (default) means “every frame” — i.e. compute an RMSD trace of frame1 against the entire trajectory (subsampled by stride).

  • region (list or tuple of length 2, optional) – [first_resid, last_resid] (inclusive) restricting the atoms used to compute and align the RMSD. None (default) uses the full chain.

  • backbone (bool, optional) – If True (default), use only the backbone heavy atoms (N, CA, C, O). If False, use every atom in the selection.

  • stride (int, optional) – Used only when frame2 == -1. Compare frame1 against every stride-th frame of the trajectory. Default is 1.

Returns:

  • 1-element array if frame2 is a specific frame index.

  • Length-ceil(n_frames/stride) array if frame2 == -1.

Values are in Angstroms.

Return type:

np.ndarray

Example

>>> rmsd_traj = protein.get_RMSD(frame1=0)              # frame 0 vs all
>>> rmsd_pair = protein.get_RMSD(frame1=0, frame2=42)   # frame 0 vs 42
get_Q(protein_average=True, region=None, beta_const=50.0, lambda_const=1.8, native_contact_threshold=4.5, stride=1, native_state_reference_frame=0, weights=False)[source]

Fraction-of-native-contacts order parameter \(Q\) (Best et al.).

Native contacts are defined from the reference frame (frame 0) as heavy-atom pairs separated by at least 4 residues in sequence and less than native_contact_threshold Angstroms in space. Each frame’s \(Q\) is then a sigmoid-weighted fraction of those contacts that remain “native-like” in that frame, per the Best, Hummer, Eaton formula (Best et al. 2013).

The reference frame is hard-coded to frame 0 because preserving a variable reference frame became unwieldy with re-weighted ensembles.

Parameters:
  • protein_average (bool, optional) –

    • If True (default), return one Q value per frame (the protein average).

    • If False, return per-native-contact and per-residue breakdowns in a 5-tuple — useful for visualising which contacts are most persistent.

  • region (list or tuple of length 2, optional) – [first_resid, last_resid] (inclusive) restricting the residues used to identify native contacts. None (default) uses the full protein.

  • beta_const (float, optional) – Steepness of the sigmoid weighting (1/nm). Default 50. Do not change without strong justification.

  • lambda_const (float, optional) – Sigmoid width factor (all-atom default 1.8). Do not change without strong justification.

  • native_contact_threshold (float, optional) – Distance cutoff (Angstroms) that defines a contact in the reference frame. Default is 4.5.

  • stride (int, optional) – Use every stride-th frame. Default is 1. May be combined with weights (the weight vector is subsampled with the same stride and renormalised internally).

  • native_state_reference_frame (int, optional) – Currently ignored — the reference frame is always frame 0. Kept in the signature for backward compatibility.

  • weights (list or np.ndarray, optional) – Per-frame weights for re-weighted analysis, one entry per trajectory frame (len(weights) == n_frames), each in [0, 1] and summing to 1. Validated by soursop.ssutils.validate_weights(). Only used when protein_average=False (the per-contact / per-residue breakdown); for protein_average=True see Raises. When a stride is given the vector is strided and renormalised, and the native reference frame (frame 0) is excluded and the remainder renormalised so the weighted average is a proper expectation over the non-native (strided) frames. Default False.

Returns:

  • If protein_average=True (default): 1D array of length ceil(n_frames/stride) with the per-frame Q value.

  • If protein_average=False: a 5-tuple

    1. per_contact_fraction - 1D array of length N_NATIVE_CONTACTS; fraction of time each native contact was native-like over the trajectory.

    2. native_contact_pairs - (N_NATIVE_CONTACTS, 2) int array; the atom-pair definitions of native contacts.

    3. per_residue_contacts - dict keyed by "RESNAME-RESID"; each value is a list of fractional-native-contact scores for atoms in that residue. Take the mean per key to get the per-residue Q.

    4. ordered_residue_keys - the keys of (2) in residue order.

    5. res_res_q_matrix - (n_residues, n_residues) array of inter-residue Q values (symmetric).

Return type:

np.ndarray OR tuple of length 5

Raises:

SSException – If weights is given with protein_average=True (frame-averaged Q must be reweighted outside SOURSOP); if the weight vector fails validation (wrong length, out of [0, 1], non-finite, or not summing to 1); if excluding the native reference frame leaves a zero-sum weight vector; or if any constant cannot be coerced to float.

Example

>>> q = protein.get_Q()                          # per-frame Q
>>> q.mean()
0.85
>>> per_c, pairs, per_res, keys, res_res = protein.get_Q(protein_average=False)

References

Best RB, Hummer G, Eaton WA. Native contacts determine protein folding mechanisms in atomistic simulations. PNAS 2013. doi:10.1073/pnas.1311599110

get_contact_map(distance_thresh=5.0, mode='closest-heavy', stride=1, weights=False)[source]

Inter-residue contact map and per-residue contact-order vector.

For every pair of residues this returns the fraction of frames in which the chosen inter-residue distance is below distance_thresh. The companion contact-order vector reduces the 2D map to a per-residue summary by averaging over the appropriate neighbours-excluded denominator.

i to i, i to i+1, and i to i+2 pairs are always excluded. ACE / NME caps are excluded from the residue set.

Parameters:
  • distance_thresh (float, optional) – Contact threshold in Angstroms. A pair is “in contact” in a given frame if its distance under the chosen mode is below this value. Default is 5.0.

  • mode ({'closest-heavy', 'ca', 'closest', 'sidechain', 'sidechain-heavy'}, optional) –

    How the inter-residue distance is defined (the schemes from mdtraj.compute_contacts):

    • 'closest-heavy' (default) - closest pair of heavy atoms.

    • 'ca' - alpha-carbon distance.

    • 'closest' - closest pair of any atoms (incl. hydrogens).

    • 'sidechain' - closest pair of sidechain atoms (mdtraj >= 1.8).

    • 'sidechain-heavy' - closest pair of sidechain heavy atoms (mdtraj >= 1.8). Raises if any residue is glycine.

  • stride (int, optional) – Use every stride-th frame. Default is 1. May be combined with weights (the weight vector is subsampled with the same stride and renormalised internally so it aligns with the strided frames).

  • weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging, one entry per trajectory frame (len(weights) == n_frames), each in [0, 1] and summing to 1. Validated by soursop.ssutils.validate_weights(). When supplied the contact map is the deterministic weighted mean of the per-frame contact maps over the (strided) frames. Default False.

Returns:

(contact_map, contact_order) where:

  • contact_map is an (N, N) matrix of contact fractions in [0, 1].

  • contact_order is a 1D vector of length N giving the mean fractional contact count per residue, normalised by the number of valid partners for each residue.

Return type:

tuple of (np.ndarray, np.ndarray)

Raises:

SSException – For mode='sidechain-heavy' when any glycine is present, or if the weight vector fails validation (wrong length, out of [0, 1], non-finite, or not summing to 1).

Example

>>> cmap, corder = protein.get_contact_map()
>>> cmap.shape
(92, 92)
get_clusters(region=None, n_clusters=10, backbone=True, stride=20)[source]

Ward hierarchical clustering of conformations by RMSD.

Builds a strided all-vs-all RMSD matrix and groups frames using Ward’s agglomerative clustering. The number of clusters (n_clusters) must be specified up front. The chosen “centroid” of each cluster is the member whose summed exponential RMSD weight is largest (i.e. the most internally-connected frame).

Parameters:
  • region (list of length 2, optional) – [first_resid, last_resid] (inclusive) restricting the atoms used for the RMSD calculation. None (default) uses the full chain.

  • n_clusters (int, optional) – Target number of clusters. Default is 10. Note that the actual number of clusters returned may be smaller if some labels are never assigned.

  • backbone (bool, optional) – If True (default), use only backbone heavy atoms for the RMSD (much faster). If False, use every atom in the selection.

  • stride (int, optional) – Use every stride-th frame. Default is 20. stride=1 does an exact all-vs-all but can be very slow.

Returns:

(cluster_members, cluster_trajs, cluster_distance_matrices, cluster_centroids, cluster_frames) where:

  • cluster_members (list of int): number of frames in each cluster.

  • cluster_trajs (list of mdtraj.Trajectory): one sub-trajectory per cluster containing the assigned frames.

  • cluster_distance_matrices (list of np.ndarray): per-cluster all-vs-all RMSD submatrix (Angstroms).

  • cluster_centroids (list of int): for each cluster, the index within that cluster of the representative frame.

  • cluster_frames (list of np.ndarray): frame indices in the original trajectory that were assigned to each cluster.

Return type:

tuple of length 5

Example

>>> members, trajs, dmats, centroids, frames = protein.get_clusters(n_clusters=5)
>>> members
[120, 95, 80, 60, 45]
get_all_SASA(probe_radius=1.4, mode='residue', stride=20, weights=False, etol=1e-07)[source]

Solvent-accessible surface area (SASA) per residue or per atom.

Internally uses mdtraj’s Shrake-Rupley implementation (Golden-Spiral algorithm). Results are memoised per (stride, mode, probe_radius) triple so repeated calls with the same parameters are free.

SASA is returned in Angstroms^2; probe_radius is in Angstroms. Because the calculation is expensive the default stride is 20.

Parameters:
  • probe_radius (float, optional) – Solvent probe radius in Angstroms. Default 1.4 (water).

  • mode ({'residue', 'atom', 'sidechain', 'backbone', 'all'}, optional) –

    Granularity:

    • 'residue' (default) - per-residue SASA, shape (n_frames_strided, n_residues).

    • 'atom' - per-atom SASA, shape (n_frames_strided, n_atoms).

    • 'sidechain' - per-residue SASA summed over sidechain atoms (excluding backbone H, HA, HA2, HA3 which mdtraj’s ‘sidechain’ selector erroneously includes).

    • 'backbone' - per-residue SASA summed over backbone atoms (including the backbone hydrogens that mdtraj’s ‘backbone’ selector erroneously excludes).

    • 'all' - returns a 3-tuple of (residue, sidechain, backbone) arrays.

  • stride (int, optional) – Use every stride-th frame. Default is 20.

  • weights (array_like or False, optional) – Per-frame re-weighting vector (validated against the strided frame axis). False (default) returns the per-frame array (and uses the memo cache). If supplied, the strided frame axis is collapsed and the deterministic weighted-mean SASA is returned (not cached).

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

See mode description. All arrays are in Angstroms^2. If weights is supplied the frame axis is collapsed (e.g. (n_residues,) for mode='residue'; a 3-tuple of such arrays for mode='all').

Return type:

np.ndarray or tuple of np.ndarray

Example

>>> sasa = protein.get_all_SASA(mode='residue', stride=10)
>>> sasa.mean(axis=0)   # mean per-residue SASA across frames
get_regional_SASA(R1, R2, probe_radius=1.4, stride=20, weights=False, etol=1e-07)[source]

Mean total SASA summed over a contiguous residue range.

Calls get_all_SASA() (whose result is memoised) and sums the time-averaged per-residue SASA values from R1 to R2-1. Note the SASA must be computed on the full protein first because atoms outside the region can occlude region atoms.

SASA is returned in Angstroms^2.

Parameters:
  • R1 (int) – First residue of the region.

  • R2 (int) – Last residue of the region (exclusive in the internal sum).

  • probe_radius (float, optional) – Solvent probe radius in Angstroms. Default 1.4 (water).

  • stride (int, optional) – Use every stride-th frame. Default is 20.

  • weights (array_like or False, optional) – Per-frame re-weighting vector (validated against the strided frame axis). False (default) is byte-identical to before; if supplied each residue’s time-average is the deterministic weighted mean before summing.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Sum over residues [R1, R2) of the time-averaged per-residue SASA in Angstroms^2.

Return type:

float

Example

>>> protein.get_regional_SASA(5, 25)
2150.6
get_site_accessibility(input_list, probe_radius=1.4, mode='residue_type', stride=20, weights=False, etol=1e-07)[source]

Mean & std SASA for selected residue types or specific residue indices.

Under mode='residue_type', input_list is interpreted as canonical three-letter residue names and every matching residue in the chain is summarised. Under mode='resid' the input is treated as a list of explicit residue indices.

Useful for comparing accessibility of chemically-equivalent residues at different positions in the chain.

Parameters:
  • input_list (list) –

    Either:

    • three-letter residue names, e.g. ['TRP', 'TYR', 'GLN'] (when mode='residue_type').

    • residue indices, e.g. [1, 2, 3, 4] (when mode='resid').

  • probe_radius (float, optional) – Solvent probe radius in Angstroms. Default 1.4 (water).

  • mode ({'residue_type', 'resid'}, optional) – How to interpret input_list. Default 'residue_type'.

  • stride (int, optional) – Use every stride-th frame. Default is 20.

  • weights (array_like or False, optional) – Per-frame re-weighting vector (validated against the strided frame axis). False (default) gives the ordinary mean/std (byte-identical to before); if supplied the per-residue [mean, std] is the deterministic weighted mean and weighted (population) std.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Keys are "RESNAME-RESID" strings; each value is [mean_SASA, std_SASA] in Angstroms^2.

Return type:

dict

Raises:

SSException – If input_list contains a residue name not in the canonical 20 + caps set when mode='residue_type'.

Example

>>> protein.get_site_accessibility(['LEU', 'VAL'])
{'LEU-5': [42.1, 6.0], 'LEU-12': [55.4, 8.3], 'VAL-21': [38.0, 5.5], ...}
get_local_collapse(window_size=10, bins=None, verbose=True, weights=False, etol=1e-07)[source]

Sliding-window local radius of gyration profile.

At every starting residue i the local Rg is computed over the window [i, i + window_size], producing a per-position profile of compaction along the chain. Useful for identifying locally collapsed regions in otherwise extended IDPs (and vice versa).

Parameters:
  • window_size (int, optional) – Size of the sliding window in residues. Must be <= n_residues. Default 10.

  • bins (np.ndarray or list, optional) – Histogram bin edges (evenly spaced). If None (default), np.arange(0, 10, 0.01) is used.

  • verbose (bool, optional) – If True (default), print one status line per starting residue.

  • weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) is byte-identical to before. If supplied, each window’s mean/std is the deterministic weighted mean / weighted (population) std and the histogram counts are frame-weighted.

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

(mean, std, histo, bins) where:

  • mean (list of float, length n_residues - window_size + 1): mean local Rg per starting residue.

  • std (same length): standard deviation of local Rg per starting residue.

  • histo (list of np.ndarray, same length): histogram counts.

  • bins (np.ndarray): the bin edges actually used.

Return type:

tuple of (list, list, list, np.ndarray)

Raises:

SSException – If window_size > n_residues or bins is malformed.

Example

>>> mean, std, hist, bins = protein.get_local_collapse(window_size=8)
>>> # plot mean vs residue index to see where the chain collapses

Returns (legacy enumeration)

  • [0] - list of floats (length = n)

    Each float reports on the mean Rg at a specific position along the sequence.

  • [1] - list of floats (length = n)

    Each float reports on the standard deviation of the Rg distribution at a specific position along the sequence.

  • [2] - list of np.ndarrays (length = n)

    Histogram counts associated with the local Rg at a given position along the sequence. Basically, this is the emprical distribution that the standard devaitions and mean report on

  • [3] - np.ndarray (length = n)

    Bin values for histogram counts returned in [2]

get_sidechain_alignment_angle(R1, R2, sidechain_atom_1='default', sidechain_atom_2='default')[source]

Per-frame angle between two residues’ sidechain orientation vectors.

For each residue a “sidechain vector” is the CA-to-tip vector, where the tip atom is the canonical sidechain terminus (e.g. CB for Ala, CZ for Phe/Tyr/Arg). The returned angle is between those two vectors. Useful for asking whether two sidechains tend to point towards each other (small angle) or away (~180 degrees).

Default tip atoms (from the OPLS-AA force field):

ALA: CB    CYS: SG    ASP: CG    GLU: CD    PHE: CZ
HIS: NE2   ILE: CD1   LYS: NZ    LEU: CG    MET: CE
ASN: CG    PRO: CG    GLN: CD    ARG: CZ    SER: OG
THR: CB    VAL: CB    TRP: CG    TYR: CZ

GLY, ACE, NME have no sidechain and raise.

Parameters:
  • R1 (int) – Resid of the first residue. Must not be GLY/ACE/NME.

  • R2 (int) – Resid of the second residue. Must not be GLY/ACE/NME.

  • sidechain_atom_1 (str, optional) – Override the tip atom of R1. 'default' (default) uses the canonical OPLS-AA atom from the table above.

  • sidechain_atom_2 (str, optional) – Override the tip atom of R2. Same convention as sidechain_atom_1.

Returns:

1D array of length n_frames with the per-frame alignment angle in degrees.

Return type:

np.ndarray

Raises:

SSException – If either residue is glycine or a cap, if a residue index is out of range, or if a custom sidechain atom name cannot be found.

Example

>>> theta = protein.get_sidechain_alignment_angle(5, 45)
>>> theta.mean()
82.4
get_dihedral_mutual_information(angle_name='psi', bwidth=0.6283185307179586, stride=1, weights=False, normalize=False)[source]

Mutual-information matrix between every pair of a chosen dihedral.

For each pair of dihedrals \((\phi_i, \phi_j)\) of the same type, this computes the mutual information

\[I(\phi_i; \phi_j) = H(\phi_i) + H(\phi_j) - H(\phi_i, \phi_j)\]

where the Shannon entropies \(H\) come from binned histograms. The natural baseline for interpretation is the equivalent matrix computed on a polymer-model reference ensemble (e.g. an excluded-volume simulation), since the absolute values are bin-width-dependent. Setting normalize=True divides each entry by the joint entropy, which can help with comparisons across ensembles of different sizes.

Parameters:
  • angle_name ({'phi', 'psi', 'omega', 'chi1'}, optional) – Dihedral to use. Default 'psi'. (chi2-chi5 are not supported by this method.)

  • bwidth (float, optional) – Histogram bin width on the [-pi, pi] interval. Smaller widths give a finer-grained estimate but need more data. Default pi/5.

  • stride (int, optional) – Use every stride-th frame. Default 1.

  • weights (array_like, optional) – Per-frame weights for re-weighted MI estimation. Must satisfy len(weights) == n_frames // stride. Default False.

  • normalize (bool, optional) – If True, divide each MI value by the joint entropy. Default False.

Returns:

Symmetric (n_dihedrals, n_dihedrals) mutual-information matrix. The diagonal is the self-information of each dihedral.

Return type:

np.ndarray

Raises:

SSException – If bwidth is not in (0, 2*pi] or angle_name is not one of the four allowed strings.

Example

>>> MI = protein.get_dihedral_mutual_information(angle_name='psi')
>>> MI.shape
(92, 92)
get_local_to_global_correlation(mode='COM', n_cycles=100, max_num_pairs=10, stride=20, weights=False, verbose=True)[source]

How well finite subsets of inter-residue distances predict global Rg.

For each k in [1, max_num_pairs), this randomly picks k inter-residue distance pairs n_cycles times and computes the correlation between the chain’s \(R_g^2\) and the mean-squared of those k distances. Averaging the correlations across cycles gives a per-k measure of how informative a finite local measurement is about the global chain dimensions.

This is intended to inform experimental design (e.g. how many smFRET pairs are “enough” to pin down \(R_g\)).

Parameters:
  • mode ({'COM', 'CA'}, optional) – Distance type. Default 'COM'.

  • n_cycles (int, optional) – Number of random pair-resamples per k. Default 100. Larger n_cycles reduces sampling noise; values below ~50 are not recommended.

  • max_num_pairs (int, optional) – Upper bound on the number of pairs sampled. Default 10. Beyond 10-20 the average correlation typically saturates close to 1.

  • stride (int, optional) – Use every stride-th frame. Default 20.

  • weights (array_like or False, optional) – Per-frame weights for the covariance computation. Default False (uniform weighting).

  • verbose (bool, optional) – If True (default), print one status line per k.

Returns:

(raw, n_pairs, mean_corr, std_corr) where:

  • raw (np.ndarray of shape (n_cycles * (max_num_pairs - 1), 2)): one row per resample; col 0 is k and col 1 is the \(R_g^2\) vs. local-mean-square correlation for that resample.

  • n_pairs (np.ndarray): the k values used, [1, 2, ..., max_num_pairs - 1].

  • mean_corr (np.ndarray): mean correlation per k.

  • std_corr (np.ndarray): standard deviation of correlation per k (assumes Gaussian distribution, may be approximate).

Return type:

tuple of length 4

Raises:

SSException – If mode is invalid, the strided Rg length disagrees with the inter-residue distance length, or the installed numpy version cannot accept aweights in numpy.cov.

Example

>>> import numpy as np
>>> np.random.seed(42)   # the pair-resampling consumes the RNG
>>> raw, ks, mean_c, std_c = protein.get_local_to_global_correlation()
get_overlap_concentration()[source]

Overlap concentration \(c^*\) of the chain, in Molar.

Defined as the polymer concentration at which independent chains begin to occupy the same volume in trans, computed by treating each chain as a sphere of radius equal to its mean \(R_g\). Useful as a coarse limit on dilute-regime behaviour.

Returns:

Overlap concentration \(c^*\) in Molar.

Return type:

float

Example

>>> protein.get_overlap_concentration()
0.0042
get_angle_decay(atom1='C', atom2='N', return_all_pairs=False, weights=False, etol=1e-07)[source]

Bond-vector autocorrelation along the chain (persistence-length proxy).

For each residue we form a single intra-residue vector \(\vec{v}_i\) (default C -> N, the peptide bond direction). For every pair \((i, j)\) we compute \(\langle \hat{v}_i \cdot \hat{v}_j \rangle\) across frames; averaging over all pairs with the same sequence separation \(|i-j|\) yields the decay profile. A faster initial decay implies a less rigid chain.

Both atom1 and atom2 must exist in every residue, so the peptide C->N vector is essentially the only reasonable choice.

Parameters:
  • atom1 (str, optional) – Origin atom name. Default 'C'.

  • atom2 (str, optional) – Tip atom name. Default 'N'.

  • return_all_pairs (bool, optional) – If True, additionally return a dict of every individual "i-j" residue-pair correlation. Default is False.

  • weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) is byte-identical to before. If supplied, each residue-pair’s mean cos(theta) is the deterministic frame-weighted mean (the over-pairs averaging per separation is unchanged).

  • etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

  • If return_all_pairs=False (default): array of shape (n_separations, 3) with columns [separation, mean_corr, std_corr]. The first row is [0, 1.0, 0.0] (self correlation).

  • If return_all_pairs=True: (array, pair_dict) where pair_dict["i-j"] is the mean correlation for that specific residue pair.

Return type:

np.ndarray OR tuple of (np.ndarray, dict)

Example

>>> decay = protein.get_angle_decay()
>>> # decay[:, 0] is separation, decay[:, 1] is mean correlation