ssprotein

Overview

SSProtein is the core analysis class in SOURSOP and the primary interface for characterizing the conformational behaviour of a single protein chain across a simulation trajectory.

SSProtein objects are not created directly. Instead, they are extracted from an SSTrajectory object after reading a trajectory from disk. Each protein chain in the system is represented by its own SSProtein instance, accessible via SSTrajectory.proteinTrajectoryList.

Analyses fall into several broad categories:

Residue utilities — sequence retrieval, atom/CA index lookup, residue center-of-mass positions and masses.
Inter-residue distances — pairwise CA or COM distance matrices, distance maps, and polymer-scaled distance maps.
Global size and shape — radius of gyration, hydrodynamic radius, end-to-end distance, asphericity, acylindricity, prolateness, gyration tensor, and the \(t\)-parameter.
Secondary structure — per-frame DSSP assignments and BBSEG backbone-torsion-based classification.
Polymer scaling — internal scaling profiles (\(\langle r^2 \rangle\) vs sequence separation), the scaling exponent \(\nu\), and local heterogeneity in scaling behaviour.
Contact and RMSD analysis — contact maps (with configurable threshold and mode), RMSD to a reference structure, and fraction of native contacts \(Q\).
Solvent accessibility — per-residue and region-level SASA via get_all_SASA, get_regional_SASA, and get_site_accessibility.
Local dynamics — local collapse profiles, sidechain alignment angles, dihedral mutual information, local-to-global correlation, and angle decay.
Clustering and overlap — conformational clustering via get_clusters, overlap concentration \(c^*\).

Most functions return NumPy arrays; per-frame results have shape (n_frames,) or (n_frames, ...) so standard NumPy operations (np.mean, np.std, etc.) apply directly.

By way of an example:

from soursop.sstrajectory import SSTrajectory
import numpy as np

# read in a trajectory
TrajOb = SSTrajectory('traj.xtc', 'start.pdb')

# get the first protein chain (this is an SSProtein object)
ProtObj = TrajOb.proteinTrajectoryList[0]

# print the mean end-to-end distance
mean_e2e = np.mean(ProtObj.get_end_to_end_distance())
print(mean_e2e)

SSProtein Properties

SSProtein objects have a set of object variables associated with them.

class soursop.ssprotein.SSProtein(traj, debug=False, check_one_bead_per_residue=True, swan=False)[source]

resid_with_CA

Resids of residues that have a CA (alpha-carbon) atom.

ACE and NME capping residues, water, ions, and any non-standard residue without a CA are excluded. Use this whenever you need to iterate over “real” residues and not the cap pseudo-residues.

Returns:: Resids (0-indexed) of residues with a CA atom, in ascending order.
Return type:: list of int

Example

>>> protein.resid_with_CA
[1, 2, 3, ..., 92]   # ctl9_AA has ACE at 0 and NME at 93

ncap

True if an N-terminal capping residue (e.g. ACE) is present.

Inferred from whether resid 0 has a CA atom: if it does not, a cap is assumed to occupy resid 0.

Returns:: True if an N-terminal cap is present, False otherwise.
Return type:: bool

Example

>>> protein.ncap
True

ccap

True if a C-terminal capping residue (e.g. NME) is present.

Inferred from whether the last resid has a CA atom: if it does not, a cap is assumed to occupy that resid.

Returns:: True if a C-terminal cap is present, False otherwise.
Return type:: bool

Example

>>> protein.ccap
True

n_frames

Number of frames in the underlying mdtraj trajectory.

Returns:: Total frame count of the simulation trajectory.
Return type:: int

Example

>>> protein.n_frames
1000

n_residues

Number of residues in the protein, including any cap residues.

Returns:: Total residue count, counting ACE / NME caps if present.
Return type:: int

Example

>>> protein.n_residues
94

residue_index_list

All resids associated with this protein, extracted from the topology.

As of v0.1.3 the list always starts at 0 and increments by one up to n_residues - 1. It is extracted live from the underlying mdtraj.topology (and then cached) so it is useful as a ground-truth reference when debugging residue-index issues.

Returns:: Resids in ascending order; length equals n_residues.
Return type:: list of int

Example

>>> protein.residue_index_list[:5]
[0, 1, 2, 3, 4]

length = <function SSProtein.length>[source]

SSProtein Functions

class soursop.ssprotein.SSProtein(traj, debug=False, check_one_bead_per_residue=True, swan=False)[source]

reset_cache()[source]

Clear every value the SSProtein has memoised so far.

Many SSProtein methods cache their first-call result (sequences, atom index tables, per-residue centres of mass, SASA results, dihedral angles). This method discards those caches and re-initialises the resid_with_CA / cap flags from the underlying topology. It is rarely necessary in normal use; reach for it if you have mutated the trajectory in-place and need the SSProtein to re-derive its tables, or if you are profiling cold-vs-warm performance.

Return type:: None

Example

>>> protein.get_radius_of_gyration()    # first call: populates caches
>>> protein.reset_cache()               # discard all cached values
>>> protein.get_radius_of_gyration()    # cold-cache call again

print_residues(verbose=True)[source]

Map each zero-indexed residue position to its PDB resname-resid label.

Useful when debugging cap/non-cap offsets or aligning SOURSOP residue indices against a published PDB numbering.

Parameters:: verbose (bool, optional) – If True (default), also prints each "<idx> --> <resname-resid>" line to stdout. If False, the list is returned silently.
Returns:: One [index, resname-resid] pair per residue, in topology order.
Return type:: list of [int, str]

Example

>>> mapping = protein.print_residues(verbose=False)
>>> mapping[:3]
[[0, 'ACE-0'], [1, 'MET-1'], [2, 'ALA-2']]

get_amino_acid_sequence(oneletter=False, numbered=True)[source]

Return the amino-acid sequence of the protein in several formats.

Caps (ACE / NME) and other non-standard residues are included in the sequence using their three-letter or one-letter codes as recorded by the topology.

Parameters:

oneletter (bool, optional) – If True, use single-letter amino-acid codes. If False (default), use three-letter codes.
numbered (bool, optional) – If True (default), include the resid in each entry as a "RESNAME-RESID" string. If False, just the resname.

Returns:

oneletter=False, numbered=True -> ['MET-0', 'ALA-1', ...]
oneletter=False, numbered=False -> ['MET', 'ALA', ...]
oneletter=True, numbered=True -> the full one-letter string, e.g. "MA..."
oneletter=True, numbered=False -> ['M', 'A', ...]

Return type:

list of str OR str

Example

>>> protein.get_amino_acid_sequence(oneletter=True, numbered=True)
'MAAEELANAK...'
>>> protein.get_amino_acid_sequence()[:3]
['MET-0', 'ALA-1', 'ALA-2']

get_residue_atom_indices(resid, atom_name=None)[source]

Look up the atom indices that belong to a specific residue.

Results are memoised on first lookup so repeated calls for the same (resid, atom_name) pair are essentially free. The previous name of this method was residue_atom_com().

Parameters:

resid (int) – Resid of the residue to look up.
atom_name (str, optional) – If given, restrict the result to atoms whose name matches this string (e.g. "CA", "CB", "N"). If None (default), all atoms of the residue are returned.

Returns:

Atom indices into the underlying mdtraj topology that satisfy the resid / atom_name filter.

Return type:

list of int

Example

>>> ca_atoms = protein.get_residue_atom_indices(5, atom_name='CA')
>>> all_atoms = protein.get_residue_atom_indices(5)

get_CA_index(resid)[source]

Return the atom index of the alpha-carbon (CA) for a residue.

Results are memoised on first lookup so repeated calls are essentially free. Raises SSException if the residue does not have exactly one CA atom (e.g. cap residues like ACE / NME have none).

Parameters:: resid (int) – Resid whose CA atom should be returned.
Returns:: The mdtraj atom index of the CA atom of residue resid.
Return type:: int
Raises:: SSException – If the residue has no CA atom (caps, non-standard residues, or coarse-grained beads named differently).

Example

>>> protein.get_CA_index(5)
37

get_multiple_CA_index(resID_list=None)[source]

Return CA atom indices for many residues at once.

Convenience wrapper around get_CA_index() that handles three input forms: a single integer, a list of integers, or None (meaning “every residue with a CA”). Residues without a CA are silently skipped rather than raising.

Parameters:

resID_list (int, list of int, or None, optional) –

None (default) -> use resid_with_CA (every residue that has a CA).
A single int -> wrap in a list, return one-element list.
A list[int] -> look up each in turn, skipping any that raise.

Returns:

CA atom indices, in the same order as the input residues.

Return type:

list of int

Example

>>> protein.get_multiple_CA_index([5, 10, 15])
[37, 86, 134]
>>> all_ca = protein.get_multiple_CA_index()   # every CA

get_residue_COM(resid, atom_name=None)[source]

Per-frame centre of mass (COM) of a residue or a specific atom of it.

The COM is computed once and cached, so repeated calls for the same (resid, atom_name) pair are essentially free. All positions are returned in Angstroms.

Parameters:

resid (int) – Resid of the residue to extract.
atom_name (str, optional) – If given, only the named atom contributes to the COM (a single-atom COM is just the atom’s position). If None (default), every atom of the residue contributes.

Returns:

Array of shape (n_frames, 3) giving the COM (x, y, z) of the residue in each frame.

Return type:

np.ndarray

Example

>>> com = protein.get_residue_COM(5)              # full-residue COM
>>> ca  = protein.get_residue_COM(5, 'CA')        # CA atom position

get_residue_mass(R1)[source]

Total mass of every atom belonging to a residue.

The mass is summed from each atom’s element.mass as recorded in the underlying mdtraj topology, in atomic mass units (g/mol).

Parameters:: R1 (int) – Resid whose mass should be summed.
Returns:: Total mass of residue R1 in atomic mass units.
Return type:: float

Example

>>> protein.get_residue_mass(5)   # e.g. leucine
113.16

calculate_all_CA_distances(residueIndex, mode='CA', only_C_terminal_residues=True, stride=1)[source]

Distances between a reference residue and every other CA residue.

Computes either CA-CA or COM-COM distances from residueIndex to all other residues in the chain that have a CA atom. By default only residues C-terminal of the reference are considered, which is the right choice when building an all-vs-all upper-triangular matrix.

Distance is returned in Angstroms.

Parameters:

residueIndex (int) – Reference residue. Must have a CA atom; if not, the function returns -1 rather than raising.
mode ({'CA', 'COM'}, optional) –
- 'CA' (default) - distance between alpha-carbon atoms.
- 'COM' - distance between residue centres of mass.
only_C_terminal_residues (bool, optional) – If True (default) only consider residues with index strictly greater than residueIndex. If False, also include residues N-terminal of the reference.
stride (int, optional) – Use every stride-th frame. Default is 1.

Returns:

On success: array of shape (n_frames_after_stride, M) where M is the number of other residues considered.
If residueIndex has no CA: the integer -1.

Return type:

np.ndarray or int

Raises:

SSException – If mode is not one of 'CA' or 'COM'.

Example

>>> d = protein.calculate_all_CA_distances(5)
>>> d.shape
(1000, 86)

get_inter_residue_COM_distance(R1, R2, stride=1, weights=False, etol=1e-07)[source]

Per-frame distance between two residues’ centres of mass.

Distances are in Angstroms.

Parameters:

R1 (int) – Resid of the first residue.
R2 (int) – Resid of the second residue.
stride (int, optional) – Use every stride-th frame. Default is 1.
weights (array_like or False, optional) – Per-frame re-weighting vector, one entry per trajectory frame (len == n_frames), validated by soursop.ssutils.validate_weights() (stride-subsampled and renormalised to align with the strided frames). False (default) returns the per-frame array; if supplied the frame axis is collapsed and the scalar deterministic weighted-mean distance is returned.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

If weights is False: 1D array of length ceil(n_frames / stride) with the inter-residue COM distance for each (strided) frame. If weights is supplied: the scalar weighted-mean distance. Angstroms.

Return type:

np.ndarray or float

Example

>>> d = protein.get_inter_residue_COM_distance(5, 45)
>>> d.mean(), d.std()
(12.3, 1.1)

get_inter_residue_COM_vector(R1, R2)[source]

Per-frame displacement vector COM(R1) - COM(R2) between residue COMs.

Unlike get_inter_residue_COM_distance(), this returns the full (dx, dy, dz) separation rather than its magnitude. The convention is COM(R1) - COM(R2), i.e. the vector that points from residue R2’s COM to residue R1’s COM. Units are Angstroms.

Note

Units changed in v0.2.0; prior to that release the vector was returned in nanometres.

Parameters:

R1 (int) – Resid of the first residue (the “head” of the vector).
R2 (int) – Resid of the second residue (the “tail” of the vector).

Returns:

Array of shape (n_frames, 3) giving the displacement COM(R1) - COM(R2) in each frame, in Angstroms.

Return type:

np.ndarray

Example

>>> v = protein.get_inter_residue_COM_vector(5, 45)
>>> v.shape
(1000, 3)

get_inter_residue_atomic_distance(R1, R2, A1='CA', A2='CA', mode='atom', stride=1, weights=False, etol=1e-07)[source]

Per-frame distance between two residues, with mode-dependent semantics.

For mode='atom' the distance is between the named atoms A1 of residue R1 and A2 of residue R2. For all other modes the A1/A2 arguments are ignored and the distance is the one returned by mdtraj.compute_contacts under the chosen scheme.

SOURSOP does no sanity checking on atom names — if a name is invalid, an SSException is raised with a hint about likely mistypes.

Distance is returned in Angstroms.

Parameters:

R1 (int) – Resids of the two residues to measure between.
R2 (int) – Resids of the two residues to measure between.
A1 (str, optional) – For mode='atom', the atom names within R1 and R2. Defaults are 'CA'. Ignored for all other modes.
A2 (str, optional) – For mode='atom', the atom names within R1 and R2. Defaults are 'CA'. Ignored for all other modes.
mode ({'atom', 'ca', 'closest', 'closest-heavy', 'sidechain', 'sidechain-heavy'}, optional) –
- 'atom' (default) - distance between atoms A1 and A2.
- 'ca' - equivalent to mode='atom', A1=A2='CA'.
- 'closest' - closest approach between any atom of R1 and any atom of R2 in each frame.
- 'closest-heavy' - same as 'closest' but ignoring hydrogens.
- 'sidechain' - closest sidechain atom of R1 to closest sidechain atom of R2 (requires mdtraj >= 1.8.0).
- 'sidechain-heavy' - sidechain heavy-atom version (requires mdtraj >= 1.8.0). Raises for glycine residues.
stride (int, optional) – Use every stride-th frame. Default is 1.
weights (array_like or False, optional) – Per-frame re-weighting vector, one entry per trajectory frame (len == n_frames), validated by soursop.ssutils.validate_weights() and stride-aligned. False (default) returns the per-frame array; if supplied the frame axis is collapsed to the scalar weighted mean.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

If weights is False: 1D array of length ceil(n_frames / stride) with the inter-residue distance for each (strided) frame. If weights is supplied: the scalar deterministic weighted-mean distance. Angstroms.

Return type:

np.ndarray or float

Raises:

SSException – If mode is invalid, an atom name cannot be found in the corresponding residue, or the weight vector fails validation.

Example

>>> d = protein.get_inter_residue_atomic_distance(5, 45)               # CA-CA by default
>>> d_min = protein.get_inter_residue_atomic_distance(5, 45, mode='closest-heavy')

get_distance_map(mode='CA', RMS=False, stride=1, return_instantaneous_maps=False, weights=False, verbose=True)[source]

Inter-residue distance map (and its standard deviation).

Builds the full N x N matrix of inter-residue distances where N is the number of residues with a CA atom (ACE / NME caps are therefore excluded). Distances are returned in Angstroms. The matrix is upper triangular (entries below the diagonal are zero).

Parameters:

mode ({'CA', 'COM'}, optional) –
- 'CA' (default) - distances between alpha-carbon atoms.
- 'COM' - distances between residue centres of mass.
RMS (bool, optional) – If True, report root-mean-square distances \(\sqrt{\langle r_{ij}^2 \rangle}\), the polymer-physics order parameter. If False (default), report the ensemble mean \(\langle r_{ij} \rangle\).
stride (int, optional) – Use every stride-th frame. Default is 1 (every frame).
return_instantaneous_maps (bool, optional) – If True, the first element of the returned tuple is a 3D array (n_frames, N, N) of per-frame distance maps rather than the 2D ensemble-average map. Default is False.
weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging (e.g. T-WHAM output). Default False means uniform weighting; in that case the std-map is also returned, otherwise it is None.
verbose (bool, optional) – If True (default), print one status line per row of the matrix. This calculation can be slow on long trajectories.

Returns:

(distance_map, std_distance_map) where:

distance_map is (N, N) ensemble-mean distances (or (n_frames, N, N) per-frame distances if return_instantaneous_maps=True).
std_distance_map is (N, N) per-pair standard deviations across frames; None if weights was given.

Return type:

tuple of (np.ndarray, np.ndarray)

Example

>>> mean_map, std_map = protein.get_distance_map(verbose=False)
>>> mean_map.shape
(92, 92)
>>> rms_map, _ = protein.get_distance_map(mode='COM', RMS=True, verbose=False)

get_polymer_scaled_distance_map(nu=None, A0=None, min_separation=10, mode='fractional-change', stride=1, weights=False, etol=1e-07, verbose=True)[source]

Quantify how each inter-residue distance deviates from a homopolymer model.

For a standard polymer the equilibrium distance scales as \(\langle r_{ij} \rangle = A_0 |i-j|^{\nu}\). This method builds an N x N matrix where each entry quantifies how far the observed mean distance is from that homopolymer prediction. The deviation can be expressed in several ways (see mode).

Note that \(A_0\) here is the prefactor of the inter-residue distance scaling and is NOT the same numerical value as the \(R_0\) prefactor that defines \(R_g = R_0 N^{\nu}\). The scaling exponent \(\nu\) should typically lie in [0.33, 0.598] for real polymers.

If nu and A0 are not supplied, the homopolymer model is fit automatically by get_scaling_exponent() (with default settings) and that fit is used as the reference model.

Parameters:

nu (float, optional) – Scaling exponent used as the homopolymer reference. If supplied, A0 must also be supplied. If None (default), both are obtained by calling get_scaling_exponent().
A0 (float, optional) – Scaling prefactor for the homopolymer reference; paired with nu.
min_separation (int, optional) – Minimum sequence separation |i-j| for which a deviation is computed. Entries with separations below this threshold are filled with the mode-dependent default value (0 for the change modes, 1 for 'scaled'). Default is 10.
mode ({'fractional-change', 'signed-fractional-change', 'signed-absolute-change', 'scaled'}, optional) –
How to quantify the deviation between the observed mean distance \(r_{ij}\) and the homopolymer prediction \(p_{ij}\):
- 'fractional-change' (default): \(|r_{ij} - p_{ij}| / p_{ij}\). Unsigned magnitude.
- 'signed-fractional-change': \((r_{ij} - p_{ij}) / p_{ij}\). Positive => expansion vs. the homopolymer, negative => contraction.
- 'signed-absolute-change': \(r_{ij} - p_{ij}\) (Angstroms).
- 'scaled': \(r_{ij} / p_{ij}\) (default value 1 on near-diagonal).
stride (int, optional) – Use every stride-th frame for the distance map. Default is 1.
weights (list or np.ndarray, optional) – Per-frame weights for re-weighted analysis (e.g. T-WHAM output). Default False means uniform weighting.
etol (float, optional) – Tolerance for the weights-sum-to-one check. Default 1e-7.
verbose (bool, optional) – If True (default), print status updates during the distance map and (when needed) the scaling-exponent fit.

Returns:

(map, nu, A0, redchi) where:

map is an (N, N) numpy matrix of deviations (units depend on mode).
nu is the homopolymer scaling exponent used.
A0 is the homopolymer scaling prefactor used.
redchi is the reduced chi-squared of the homopolymer fit (-1 if nu / A0 were supplied rather than fit).

Return type:

tuple of (np.ndarray, float, float, float)

Raises:

SSException – If mode is not one of the four allowed values, if min_separation < 1, if exactly one of nu/A0 is given, or if the supplied nu/A0 are out of physical range.

Example

>>> m, nu, A0, chi2 = protein.get_polymer_scaled_distance_map(verbose=False)
>>> m.shape, round(nu, 2)
((92, 92), 0.6)

get_radius_of_gyration(R1=None, R2=None, weights=False, etol=1e-07)[source]

Per-frame radius of gyration of the chain (or a sub-region).

\(R_g = \sqrt{(1/N) \sum_i (r_i - R)^2}\) is the canonical polymer-physics measure of overall chain size. It is computed by mdtraj.compute_rg on every heavy/light atom in the selection and returned in Angstroms.

Parameters:

R1 (int, optional) – First residue of the region. None (default) means the first residue including caps.
R2 (int, optional) – Last residue of the region. None (default) means the last residue including caps.
weights (array_like or False, optional) – Per-frame re-weighting vector (validated by soursop.ssutils.validate_weights()). False (default) returns the unmodified per-frame array. If supplied, the frame axis is collapsed and the single deterministic weighted-mean \(R_g\) is returned instead.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

If weights is False: 1D array of length n_frames with the per-frame \(R_g\) in Angstroms. If weights is supplied: the scalar weighted-mean \(R_g\).

Return type:

np.ndarray or float

Example

>>> rg = protein.get_radius_of_gyration()
>>> rg.mean()
33.4

get_hydrodynamic_radius(R1=None, R2=None, mode='nygaard', alpha1=0.216, alpha2=4.06, alpha3=0.821, distance_mode='CA', weights=False, etol=1e-07)[source]

Per-frame apparent hydrodynamic radius \(R_h\) (in Angstroms).

Two estimators are available:

mode='nygaard' - empirical Rg-to-Rh conversion of Nygaard et al. (2017), using the three parameters alpha1, alpha2, alpha3.
mode='kr' - Kirkwood-Riseman equation (Kirkwood & Riseman 1948), recommended for comparison with PFG-NMR-derived \(R_h\) values (Pesce et al. 2022).

Both approximations hold for fully flexible disordered proteins and become less accurate for larger / more folded domains.

Parameters:

R1 (int, optional) – First residue of the region. None (default) means the first residue including caps.
R2 (int, optional) – Last residue of the region. None (default) means the last residue including caps.
mode ({'nygaard', 'kr'}, optional) – Which estimator to use. Default is 'nygaard'.
alpha1 (float, optional) – Parameters of equation (7) of Nygaard et al. Defaults reproduce the published values (0.216, 4.06, 0.821). Ignored for mode='kr'.
alpha2 (float, optional) – Parameters of equation (7) of Nygaard et al. Defaults reproduce the published values (0.216, 4.06, 0.821). Ignored for mode='kr'.
alpha3 (float, optional) – Parameters of equation (7) of Nygaard et al. Defaults reproduce the published values (0.216, 4.06, 0.821). Ignored for mode='kr'.
distance_mode ({'CA', 'COM'}, optional) – For mode='kr', which inter-residue distance to feed into the Kirkwood-Riseman sum. Default is 'CA'. Ignored for mode='nygaard'.
weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) returns the per-frame array; if supplied the scalar deterministic weighted-mean \(R_h\) is returned.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame \(R_h\) (length n_frames), or the scalar weighted mean if weights is supplied. Angstroms.

Return type:

np.ndarray or float

Example

>>> rh_nyg = protein.get_hydrodynamic_radius()
>>> rh_kr  = protein.get_hydrodynamic_radius(mode='kr')
>>> rh_nyg.mean(), rh_kr.mean()
(24.1, 23.8)

References

Nygaard M, Kragelund BB, Papaleo E, Lindorff-Larsen K. An Efficient Method for Estimating the Hydrodynamic Radius of Disordered Protein Conformations. Biophys J. 2017;113:550-557.

Kirkwood JG, Riseman J. The Intrinsic Viscosities and Diffusion Constants of Flexible Macromolecules in Solution. J Chem Phys. 1948;16(6):565-573.

Pesce F et al. Assessment of models for calculating the hydrodynamic radius of intrinsically disordered proteins. Biophysical Journal. 2022. doi:10.1016/j.bpj.2022.12.013

get_end_to_end_distance(mode='COM', weights=False, etol=1e-07)[source]

Per-frame end-to-end distance between the first and last CA residues.

Caps (ACE / NME) are excluded — the “ends” are the first and last residues with a CA atom, so the two modes operate on the same pair of residues.

End-to-end distance is returned in Angstroms.

Parameters:

mode ({'CA', 'COM'}, optional) –
- 'COM' (default) - distance between residue centres of mass.
- 'CA' - distance between alpha-carbon atoms.
weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) returns the per-frame array; if supplied the scalar deterministic weighted-mean end-to-end distance is returned.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame end-to-end distance (length n_frames), or the scalar weighted mean if weights is supplied. Angstroms.

Return type:

np.ndarray or float

Example

>>> ee = protein.get_end_to_end_distance(mode='CA')
>>> ee.mean(), ee.std()
(30.1, 4.4)

get_end_to_end_vs_rg_correlation(mode='COM', weights=False, etol=1e-07)[source]

Pearson correlation between per-frame \(R_e^2\) and \(R_g^2\).

For a Gaussian chain Debye’s result gives \(R_e^2 = 6 R_g^2\), but in practice the correlation is a more useful diagnostic of how fractal-like (or otherwise) the chain is. This method returns the Pearson correlation across frames between the end-to-end distance squared and the radius of gyration squared.

Parameters:: mode ({'COM', 'CA'}, optional) – Type of distance to use for the end-to-end distance. Default 'COM'.
Returns:: Pearson correlation coefficient in [-1, 1].
Return type:: float

Example

>>> protein.get_end_to_end_vs_rg_correlation()
0.92

get_asphericity(R1=None, R2=None, verbose=True, weights=False, etol=1e-07)[source]

Per-frame asphericity of the chain (or a sub-region).

Asphericity is a dimensionless shape descriptor computed from the eigenvalues of the gyration tensor: it is 0 for a perfectly spherical distribution of mass and approaches 1 as the chain becomes increasingly extended/rod-like.

See p. 65 of Andreas Vitalis’ thesis (Probing the Early Stages of Polyglutamine Aggregation with Computational Methods, 2009, WashU) for a canonical derivation.

Parameters:

R1 (int, optional) – First residue of the region to consider. None (default) means the first residue in the sequence (including caps).
R2 (int, optional) – Last residue of the region to consider. None (default) means the last residue in the sequence (including caps).
verbose (bool, optional) – If True (default), print one status line every 500 frames during the gyration-tensor computation.
weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) returns the per-frame array; if supplied the frame axis is collapsed and the scalar deterministic weighted-mean asphericity is returned.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame asphericity (length n_frames), or the scalar weighted mean if weights is supplied.

Return type:

np.ndarray or float

Example

>>> a = protein.get_asphericity(verbose=False)
>>> a.mean()
0.18

get_acylindricity(R1=None, R2=None, normalized=True, verbose=True, weights=False, etol=1e-07)[source]

Per-frame acylindricity of the chain (or a sub-region).

Acylindricity measures the departure of the mass distribution from cylindrical symmetry about its principal (longest) axis. It is computed from the gyration-tensor eigenvalues \(\lambda_1 \le \lambda_2 \le \lambda_3\) as \(c = \lambda_2 - \lambda_1\) (Theodorou & Suter, 1985). It is zero for any cylindrically-symmetric distribution (a sphere, or a rod/prolate ellipsoid of revolution) and grows as the two minor axes become unequal (e.g. for an oblate “disc” or a generic triaxial shape). It complements get_asphericity(), which reports how anisotropic a conformation is but not in which way.

Parameters:

R1 (int, optional) – First residue of the region to consider. None (default) is the first residue in the sequence (including caps).
R2 (int, optional) – Last residue of the region to consider. None (default) is the last residue in the sequence (including caps).
normalized (bool, optional) – If True (default), return the dimensionless relative acylindricity \(c / R_g^2 = (\lambda_2-\lambda_1)/(\lambda_1+\lambda_2+ \lambda_3)\). If False, return \(c\) in Angstrom².
verbose (bool, optional) – If True (default), print a status line during the gyration-tensor computation.
weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) returns the per-frame array; if supplied the frame axis is collapsed and the scalar deterministic weighted mean is returned.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame acylindricity (length n_frames), or the scalar weighted mean if weights is supplied.

Return type:

np.ndarray or float

Example

>>> c = protein.get_acylindricity(verbose=False)
>>> c.mean()
0.07

get_prolateness(R1=None, R2=None, verbose=True, weights=False, etol=1e-07)[source]

Per-frame prolateness (gyration-tensor shape parameter S).

The shape parameter distinguishes prolate (rod- or cigar-like) from oblate (disc- or pancake-like) conformations - a distinction the asphericity cannot make, since both extremes are highly aspherical. It is the normalized third invariant of the gyration tensor,

\[S = \frac{27\,(\lambda_1-\bar\lambda)(\lambda_2-\bar\lambda) (\lambda_3-\bar\lambda)}{(\lambda_1+\lambda_2+\lambda_3)^3}, \qquad \bar\lambda = \tfrac{1}{3}(\lambda_1+\lambda_2+\lambda_3),\]

where \(\lambda_i\) are the gyration-tensor eigenvalues. S ranges over [-0.25, 2]: S = 0 for a spherically-symmetric distribution, S > 0 for prolate shapes (S = 2 for an ideal rod), and S < 0 for oblate shapes (S = -0.25 for an ideal disc). Pairing S with get_asphericity() gives a two-parameter (magnitude, nature) description of conformational shape.

Parameters:

R1 (int, optional) – First residue of the region to consider. None (default) is the first residue in the sequence (including caps).
R2 (int, optional) – Last residue of the region to consider. None (default) is the last residue in the sequence (including caps).
verbose (bool, optional) – If True (default), print a status line during the gyration-tensor computation.
weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) returns the per-frame array; if supplied the frame axis is collapsed and the scalar deterministic weighted mean is returned.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame prolateness (length n_frames), or the scalar weighted mean if weights is supplied.

Return type:

np.ndarray or float

Example

>>> S = protein.get_prolateness(verbose=False)
>>> S.mean()
0.31

get_t(R1=None, R2=None, weights=False, etol=1e-07)[source]

Per-frame dimensionless size parameter \(\langle t \rangle\).

\(\langle t \rangle\) is the Vitalis-thesis size parameter (Vitalis 2009) that scales the radius of gyration against the contour length \(L = 3.6 N\) (Angstroms). It is dimensionless and useful for comparing chains of different lengths.

Parameters:

R1 (int, optional) – First residue of the region. None (default) means the first residue.
R2 (int, optional) – Last residue of the region. None (default) means the last residue.
weights (array_like or False, optional) – Per-frame re-weighting vector (len == n_frames), validated by soursop.ssutils.validate_weights(). False (default) returns the per-frame array; if supplied the frame axis is collapsed and the scalar deterministic weighted-mean \(\langle t \rangle\) is returned.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Per-frame \(\langle t \rangle\) (length n_frames), or the scalar weighted mean if weights is supplied.

Return type:

np.ndarray or float

Example

>>> t = protein.get_t()
>>> t.mean()
2.4

References

Vitalis, A. (2009). Probing the Early Stages of Polyglutamine Aggregation with Computational Methods (R. Pappu, ed.) [Ph.D. thesis from Washington University in St. Louis].

get_gyration_tensor(R1=None, R2=None, verbose=True, weights=False, etol=1e-07)[source]

Per-frame 3 x 3 gyration tensor of the chain (or a sub-region).

The gyration tensor \(T_{ab} = (1/N) \sum_i (r_{i,a} - R_a)(r_{i,b} - R_b)\) is the second moment of mass about the centre of mass. Its eigenvalues are the principal moments of inertia divided by mass and underlie derived quantities like get_radius_of_gyration() and get_asphericity(). Units are Angstroms^2.

Parameters:

R1 (int, optional) – First residue of the region to consider. None (default) means the first residue (including caps).
R2 (int, optional) – Last residue of the region to consider. None (default) means the last residue (including caps).
verbose (bool, optional) – If True (default), print one status line every 500 frames.
weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) returns the per-frame (n_frames, 3, 3) array; if supplied the frame axis is collapsed and a single deterministic weighted-mean (3, 3) tensor is returned.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

(n_frames, 3, 3) per-frame tensors, or a single (3, 3) weighted-mean tensor if weights is supplied.

Return type:

np.ndarray

Example

>>> T = protein.get_gyration_tensor(verbose=False)
>>> T.shape
(1000, 3, 3)

get_angles(angle_name)[source]

Per-frame backbone or sidechain dihedral angles for every applicable residue.

Computes the named dihedral using the corresponding mdtraj helper (compute_phi, compute_psi, …, compute_chi5) and converts the result to degrees. The first call for a given angle_name is memoised so repeated calls are essentially free.

Parameters:

angle_name ({'phi', 'psi', 'omega', 'chi1', 'chi2', 'chi3', 'chi4', 'chi5'}) – Dihedral to compute. chi3-chi5 only exist for residues with long sidechains (Arg, Lys, Met, etc.), so the result will contain fewer entries than n_residues.

Returns:

A two-element list:

[0] is a list-of-lists; each sub-list holds the four mdtraj.Atom objects that define one dihedral.
[1] is an array of shape (n_dihedrals, n_frames) of the per-frame angles in degrees.

Return type:

list of [list_of_mdtraj_atoms, np.ndarray]

Raises:

SSException – If angle_name is not one of the eight allowed strings.

Example

For a protein with two arginine residues:

>>> chi5 = protein.get_angles('chi5')
>>> chi5[0][1]                  # atoms defining 2nd arginine's chi5
>>> chi5[1][1]                  # per-frame chi5 angles for that residue

get_secondary_structure_DSSP(R1=None, R2=None, return_per_frame=False, weights=False, etol=1e-07, helix_window=10, helix_rmsd_thresh=0.5, beta_window=5, beta_rmsd_thresh=0.6)[source]

Per-residue secondary structure (helix / extended / coil) via DSSP.

Calls mdtraj.compute_dssp on the chosen residue range and summarises the output into per-residue fractional occupancies (the default) or per-frame binary masks (with return_per_frame=True).

The three buckets are H (helix), E (extended/beta), C (coil). At every residue the three values sum to 1 (default mode).

two-bead models: for a two-bead (CA/CB) protein DSSP cannot be used (there is no N/C/O backbone), so secondary structure is assigned from the CA trace instead. A residue is helical in a frame if it falls inside any helix_window-length CA fragment whose RMSD to an idealized alpha-helix is below helix_rmsd_thresh (Angstroms), and extended (beta) if it falls inside any beta_window-length fragment matching an idealized extended strand within beta_rmsd_thresh (Angstroms). The helix_window / helix_rmsd_thresh / beta_window / beta_rmsd_thresh arguments are ignored for all-atom and one-bead proteins.

Parameters:

R1 (int, optional) – First residue of the region. None (default) means the first CA-bearing residue.
R2 (int, optional) – Last residue of the region. None (default) means the last CA-bearing residue.
return_per_frame (bool, optional) –
- False (default): return per-residue fractional occupancies.
- True: return per-frame binary 1/0 masks.
weights (array_like or False, optional) – Per-frame re-weighting vector (len == n_frames), validated by soursop.ssutils.validate_weights(). Only valid with return_per_frame=False: the per-residue fractional occupancies become the deterministic frame-weighted fractions. Supplying weights with return_per_frame=True raises an SSException (a per-frame mask is not an ensemble average). Default False (unweighted).
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.
helix_window (int, optional) – (two-bead models only) Number of consecutive CA beads in the idealized-helix comparison fragment. Default 10. Ignored for all-atom and one-bead proteins.
helix_rmsd_thresh (float, optional) – (two-bead models only) RMSD cutoff (Angstroms) below which a CA fragment is classed helical. Default 0.5. Ignored for all-atom and one-bead proteins.
beta_window (int, optional) – (two-bead models only) Number of consecutive CA beads in the idealized-extended comparison fragment. Default 5. Ignored for all-atom and one-bead proteins.
beta_rmsd_thresh (float, optional) – (two-bead models only) RMSD cutoff (Angstroms) below which a CA fragment is classed extended/beta. Default 0.6. Ignored for all-atom and one-bead proteins.

Returns:

(resid_list, helix, extended, coil) where:

resid_list (list of int): residue indices covered.
If return_per_frame=False: helix, extended, coil are 1D arrays of length len(resid_list) with per-residue fractional time in each bucket.
If return_per_frame=True: helix, extended, coil are 2D arrays of shape (n_frames, len(resid_list)) with 1 / 0 entries indicating presence of that bucket in that frame.

Return type:

tuple of length 4

Example

>>> resids, h, e, c = protein.get_secondary_structure_DSSP()
>>> h.mean()    # mean helicity across the chain
0.32

get_secondary_structure_BBSEG(R1=None, R2=None, return_per_frame=False, weights=False, etol=1e-07)[source]

Per-residue secondary structure via the BBSEG2 phi/psi classification.

Classifies each (phi, psi) pair into one of 9 BBSEG2 bins using a lookup table distributed with CAMPARI. Implementation is in pure Python, so the result is version-stable across mdtraj releases.

BBSEG2 classes:

0 - unclassified
1 - beta (turn/sheet)
2 - PII (polyproline type II helix)
3 - unusual region
4 - right-handed alpha helix
5 - inverse C7-equatorial (gamma-prime turn)
6 - classic C7-equatorial (gamma turn)
7 - 7-residue-per-turn helix
8 - left-handed alpha helix

Parameters:

R1 (int, optional) – First residue of the region. None (default) means the first residue with phi/psi defined.
R2 (int, optional) – Last residue of the region. None (default) means the last residue with phi/psi defined.
return_per_frame (bool, optional) –
- False (default): return per-residue fractional occupancies.
- True: return per-frame binary masks.
weights (array_like or False, optional) – Per-frame re-weighting vector (len == n_frames), validated by soursop.ssutils.validate_weights(). Only valid with return_per_frame=False: the per-class per-residue fractional occupancies become the deterministic frame-weighted fractions. Supplying weights with return_per_frame=True raises an SSException. Default False (unweighted).
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

(resid_list, per_class) where:

resid_list (list of int): residue indices covered.
per_class (dict): keys 0..8 (int), values are either 1D fractional-occupancy arrays (length len(resid_list)) or 2D (n_frames, len(resid_list)) binary masks depending on return_per_frame.

Return type:

tuple of (list, dict)

Example

>>> resids, classes = protein.get_secondary_structure_BBSEG()
>>> classes[4].mean()    # mean right-handed-alpha occupancy across chain

get_internal_scaling(R1=None, R2=None, mode='COM', mean_vals=False, stride=1, weights=False, etol=1e-07, verbose=True)[source]

Internal scaling profile: sequence separation vs. inter-residue distance.

For every sequence separation \(|i-j|\) from 0 to R2-R1, this method collects the inter-residue distances of every pair at that separation across every frame. The result is the raw material for an internal-scaling plot — a foundational polymer-physics observable for IDPs and IDRs (Mao et al. 2013; Pappu et al. 2008).

ACE / NME peptide caps are excluded. (Note: this differs from CAMPARI, which includes them.)

Distances are in Angstroms.

Parameters:

R1 (int, optional) – First residue of the region. None (default) means the first CA-bearing residue.
R2 (int, optional) – Last residue of the region. None (default) means the last CA-bearing residue.
mode ({'COM', 'CA'}, optional) –
- 'COM' (default) - distances between residue centres of mass.
- 'CA' - distances between alpha-carbon atoms.
mean_vals (bool, optional) – If True, the per-separation distances are reduced to their mean before being returned. If False (default), every individual inter-residue distance is kept (useful for plotting the full distribution).
stride (int, optional) – Use every stride-th frame. Default is 1. Larger values are recommended for long trajectories of large proteins.
weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging. Default False means uniform weighting.
etol (float, optional) – Tolerance for the weights-sum-to-one check. Default 1e-7.
verbose (bool, optional) – If True (default), print one status line per sequence separation.

Returns:

(seq_sep, distances) where:

seq_sep is a list of integer sequence separations [0, 1, ..., R2-R1].
If mean_vals=True, distances is a list of per-separation mean distances (one float per separation).
If mean_vals=False (default), distances is a list of 1D arrays — one array per separation containing every individual inter-residue distance at that separation.

Return type:

tuple of (list, list)

Example

>>> sep, mean_d = protein.get_internal_scaling(mean_vals=True, verbose=False)
>>> # plot sep vs mean_d for an internal-scaling profile

References

Mao AH, Lyle N, Pappu RV. Describing sequence-ensemble relationships for intrinsically disordered proteins. Biochem J. 2013;449:307-318.

Pappu RV, Wang X, Vitalis A, Crick SL. A polymer-physics perspective on driving forces and mechanisms for protein aggregation. Arch Biochem Biophys. 2008;469:132-141.

Notes

Computational cost scales rapidly with chain length; consider increasing stride for trajectories of long proteins.

get_internal_scaling_RMS(R1=None, R2=None, mode='COM', stride=1, weights=False, etol=1e-07, verbose=True)[source]

Root-mean-square internal scaling profile: separation vs. RMS distance.

Like get_internal_scaling() but reports \(\sqrt{\langle r_{ij}^2 \rangle}\) for each sequence separation — the formally correct order parameter for polymer-physics analyses (Mao et al. 2013; Pappu et al. 2008). ACE / NME caps are excluded. Distances in Angstroms.

Parameters:

R1 (int, optional) – First residue of the region. None (default) means the first CA-bearing residue.
R2 (int, optional) – Last residue of the region. None (default) means the last CA-bearing residue.
mode ({'COM', 'CA'}, optional) –
- 'COM' (default) - distances between residue centres of mass.
- 'CA' - distances between alpha-carbon atoms.
stride (int, optional) – Use every stride-th frame. Default is 1.
weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging. Default False.
etol (float, optional) – Tolerance for the weights-sum-to-one check. Default 1e-7.
verbose (bool, optional) – If True (default), print one status line per sequence separation.

Returns:

(seq_sep, rms_distance) where seq_sep is the list of integer separations and rms_distance is the corresponding list of RMS inter-residue distances in Angstroms.

Return type:

tuple of (list, list)

Example

>>> sep, rms = protein.get_internal_scaling_RMS(verbose=False)
>>> import numpy as np
>>> # log-log fit of rms vs sep recovers the apparent scaling exponent
>>> nu, _ = np.polyfit(np.log(sep[1:]), np.log(rms[1:]), 1)

References

Mao AH, Lyle N, Pappu RV. Describing sequence-ensemble relationships for intrinsically disordered proteins. Biochem J. 2013;449:307-318.

Pappu RV, Wang X, Vitalis A, Crick SL. A polymer-physics perspective on driving forces and mechanisms for protein aggregation. Arch Biochem Biophys. 2008;469:132-141.

get_scaling_exponent(inter_residue_min=15, end_effect=5, n_bootstrap=200, confidence_interval=95.0, mode='COM', num_fitting_points=40, fraction_of_points=0.5, fraction_override=False, etol=1e-07, stride=1, weights=False, verbose=True)[source]

Fit the apparent polymer scaling exponent and prefactor by loglog regression.

Fits the homopolymer scaling relationship \(\sqrt{\langle r_{ij}^2 \rangle} = A_0 |i-j|^{\nu_{app}}\) to the internal-scaling profile of the chain. The exponent \(\nu_{app}\) (“nu-app”) reports on solvent quality and the prefactor \(A_0\) captures chain stiffness / segment volume.

Uncertainties are estimated by a frame-level bootstrap. The frames are resampled with replacement n_bootstrap times (the same frame resample applied across every sequence separation, so the chain-wide correlation structure is preserved and the frame - not the individual residue pair - is treated as the independent sampling unit). Each resample is re-fit to give a distribution of nu / A0 (reported as a percentile confidence interval), and the per-separation spread of the resampled \(\log\) RMS values gives the standard error used as the denominator of a genuine reduced chi-squared.

Note

The frame bootstrap treats frames as independent draws. If the trajectory frames are temporally correlated (i.e. sampled faster than the conformational autocorrelation time) the standard errors are under-estimated and the reduced chi-squared is correspondingly inflated; pre-stride the trajectory (or pass stride) to roughly decorrelate frames for a more honest error.

Note

Despite the precision of the fit, nu_app and A0 are qualitative metrics subject to finite-chain-size effects, and are most rigorously interpreted for genuine homopolymers. For heteropolymers consider get_polymer_scaled_distance_map() which avoids the homopolymer assumption.

Parameters:

inter_residue_min (int, optional) – Minimum sequence separation |i-j| used in the fit. Helps avoid the short-distance regime where local steric effects dominate. Default is 15.
end_effect (int, optional) – Exclude pairs in which one residue is within end_effect residues of either chain end. Default is 5.
n_bootstrap (int, optional) – Number of frame-level bootstrap resamples used to estimate the nu / A0 confidence interval and the per-separation standard errors. Default is 200. If there are fewer than two frames (after striding) no bootstrap is possible and the confidence bounds / reduced chi-squared are returned as nan.
confidence_interval (float, optional) – Width (in percent) of the bootstrap confidence interval reported for nu and A0. Default is 95.0 (i.e. the 2.5th and 97.5th percentiles of the bootstrap distribution).
mode ({'COM', 'CA'}, optional) – Distance type. 'COM' (default) is preferred — CA-CA tends to inflate the apparent profile.
num_fitting_points (int, optional) – Maximum number of evenly-spaced loglog points used in the linear fit. Default is 40.
fraction_of_points (float, optional) – When fraction_override is True, or when the chain is too short to provide num_fitting_points points, this fraction of the valid sequence separations is used instead. Default is 0.5.
fraction_override (bool, optional) – If True, always use fraction_of_points and ignore num_fitting_points. Default is False.
etol (float, optional) – Tolerance for the weights-sum-to-one check. Default 1e-7.
stride (int, optional) – Use every stride-th frame. Default is 1.
weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging. Default False.
verbose (bool, optional) – If True (default), print status updates during the internal-scaling pass.

Returns:

(best_nu, best_A0, nu_ci_low, nu_ci_high, A0_ci_low, A0_ci_high, redchi_fit_region, redchi_all_points, fit_region_data, all_points_data) where:

best_nu, best_A0 - point estimates from the full fit.
nu_ci_low, nu_ci_high - lower/upper bounds of the confidence_interval``% bootstrap confidence interval on ``nu (nan if the bootstrap could not be run). Likewise A0_ci_low / A0_ci_high for A0.
redchi_fit_region - reduced chi^2 on the points used in the linear fit (loglog-uniform), using the bootstrap standard error of each \(\log\) RMS point as the per-point uncertainty (nan if the bootstrap could not be run).
redchi_all_points - reduced chi^2 across every observed \(|i-j|\).
fit_region_data - shape (2, K); col 0 is sequence separation, col 1 is RMS distance for the fit-region points.
all_points_data - shape (3, M); col 0 is sequence separation, col 1 is observed RMS distance, col 2 is the best-fit prediction.

Return type:

tuple of length 10

Raises:

SSException – If fraction_of_points > 1.0, if confidence_interval is not in (0, 100), or if too few points remain to fit a line (< 3).

Example

>>> import numpy as np
>>> np.random.seed(42)   # error bootstrap consumes the RNG
>>> result = protein.get_scaling_exponent(verbose=False)
>>> nu, A0 = result[0], result[1]

get_local_heterogeneity(fragment_size=10, bins=None, stride=20, verbose=True, weights=False, etol=1e-07)[source]

Sliding-window heterogeneity: per-position distribution of intra-window RMSDs.

At each starting residue i (from 0 to n_residues - fragment_size) the method computes the RMSD of the local fragment [i, i+fragment_size] across every (strided) frame, then summarises that distribution by its mean, standard deviation, and histogram. The Phi parameter of Lyle, Das & Pappu (2013) is built from this same family of D-values.

Computational cost scales with n_residues * (n_frames / stride); the default stride=20 keeps things tractable on long trajectories.

Note

weights / etol are accepted for API uniformity with the other ensemble-average methods, but supplying weights raises an SSException: this is a pairwise frame-vs-frame measure for which a single per-frame probability vector is not well-defined.

Parameters:

fragment_size (int, optional) – Size of the sliding window (residues). Must be >= 2 and <= n_residues. Default is 10.
bins (np.ndarray or list, optional) – Histogram bin edges. If None (default), uses np.arange(0, 10, 0.01). Must contain at least two evenly- spaced floats.
stride (int, optional) – Use every stride-th frame in the inner RMSD pass. Default 20.
verbose (bool, optional) – If True (default), print one status line per starting residue.

Returns:

(mean, std, histo, bins) where:

mean (list of float, length n_residues - fragment_size): mean intra-window RMSD per starting residue.
std (list of float, same length): standard deviation of the same RMSD distribution.
histo (list of np.ndarray, same length): histogram counts per starting residue.
bins (np.ndarray): the bin edges used (echoed back).

Return type:

tuple of (list, list, list, np.ndarray)

Raises:

SSException – If fragment_size is out of range, or bins is not a valid evenly-spaced 1D vector.

Example

>>> mean, std, hist, bins = protein.get_local_heterogeneity(fragment_size=8)

References

Lyle N, Das RK, Pappu RV. A quantitative measure for protein conformational heterogeneity. J Chem Phys. 2013;139(12):121907.

get_D_vector(stride=20, verbose=True)[source]

Pairwise frame-vs-frame conformational dissimilarity vector \(D\).

For every pair of (strided) frames \((A, B)\) the method computes a single dissimilarity scalar \(D_{AB} = 1 - V_A \cdot V_B / (||V_A|| ||V_B||)\) where \(V\) is the flattened upper-triangular CA-CA distance vector for that frame. This is the input to the Phi parameter of Lyle, Das, Pappu (2013).

Using CA-CA distances (rather than full inter-atomic distances) makes the measure backbone-specific and much faster than the original formulation.

Parameters:

stride (int, optional) – Use every stride-th frame. Default is 20.
verbose (bool, optional) – If True (default), print one status line per outer-loop frame.

Returns:

1D array of length \(\binom{n_{frames}/\text{stride}}{2}\) containing the pairwise dissimilarity values.

Return type:

np.ndarray

Example

>>> D = protein.get_D_vector(stride=20)
>>> D.mean()   # mean pairwise dissimilarity
0.27

References

Lyle N, Das RK, Pappu RV. A quantitative measure for protein conformational heterogeneity. J Chem Phys. 2013;139(12):121907.

get_RMSD(frame1, frame2=-1, region=None, backbone=True, stride=1)[source]

Aligned root-mean-square deviation (RMSD) of one frame vs. another (or all).

Both modes superpose the target onto the reference before computing the RMSD, so the result is rotation/translation invariant. Units are Angstroms.

Parameters:

frame1 (int) – Index of the reference frame.
frame2 (int, optional) – Index of the comparison frame. -1 (default) means “every frame” — i.e. compute an RMSD trace of frame1 against the entire trajectory (subsampled by stride).
region (list or tuple of length 2, optional) – [first_resid, last_resid] (inclusive) restricting the atoms used to compute and align the RMSD. None (default) uses the full chain.
backbone (bool, optional) – If True (default), use only the backbone heavy atoms (N, CA, C, O). If False, use every atom in the selection.
stride (int, optional) – Used only when frame2 == -1. Compare frame1 against every stride-th frame of the trajectory. Default is 1.

Returns:

1-element array if frame2 is a specific frame index.
Length-ceil(n_frames/stride) array if frame2 == -1.

Values are in Angstroms.

Return type:

np.ndarray

Example

>>> rmsd_traj = protein.get_RMSD(frame1=0)              # frame 0 vs all
>>> rmsd_pair = protein.get_RMSD(frame1=0, frame2=42)   # frame 0 vs 42

get_Q(protein_average=True, region=None, beta_const=50.0, lambda_const=1.8, native_contact_threshold=4.5, stride=1, native_state_reference_frame=0, weights=False)[source]

Fraction-of-native-contacts order parameter \(Q\) (Best et al.).

Native contacts are defined from the reference frame (frame 0) as heavy-atom pairs separated by at least 4 residues in sequence and less than native_contact_threshold Angstroms in space. Each frame’s \(Q\) is then a sigmoid-weighted fraction of those contacts that remain “native-like” in that frame, per the Best, Hummer, Eaton formula (Best et al. 2013).

The reference frame is hard-coded to frame 0 because preserving a variable reference frame became unwieldy with re-weighted ensembles.

Parameters:

protein_average (bool, optional) –
- If True (default), return one Q value per frame (the protein average).
- If False, return per-native-contact and per-residue breakdowns in a 5-tuple — useful for visualising which contacts are most persistent.
region (list or tuple of length 2, optional) – [first_resid, last_resid] (inclusive) restricting the residues used to identify native contacts. None (default) uses the full protein.
beta_const (float, optional) – Steepness of the sigmoid weighting (1/nm). Default 50. Do not change without strong justification.
lambda_const (float, optional) – Sigmoid width factor (all-atom default 1.8). Do not change without strong justification.
native_contact_threshold (float, optional) – Distance cutoff (Angstroms) that defines a contact in the reference frame. Default is 4.5.
stride (int, optional) – Use every stride-th frame. Default is 1. May be combined with weights (the weight vector is subsampled with the same stride and renormalised internally).
native_state_reference_frame (int, optional) – Currently ignored — the reference frame is always frame 0. Kept in the signature for backward compatibility.
weights (list or np.ndarray, optional) – Per-frame weights for re-weighted analysis, one entry per trajectory frame (len(weights) == n_frames), each in [0, 1] and summing to 1. Validated by soursop.ssutils.validate_weights(). Only used when protein_average=False (the per-contact / per-residue breakdown); for protein_average=True see Raises. When a stride is given the vector is strided and renormalised, and the native reference frame (frame 0) is excluded and the remainder renormalised so the weighted average is a proper expectation over the non-native (strided) frames. Default False.

Returns:

If protein_average=True (default): 1D array of length ceil(n_frames/stride) with the per-frame Q value.
If protein_average=False: a 5-tuple
1. per_contact_fraction - 1D array of length N_NATIVE_CONTACTS; fraction of time each native contact was native-like over the trajectory.
2. native_contact_pairs - (N_NATIVE_CONTACTS, 2) int array; the atom-pair definitions of native contacts.
3. per_residue_contacts - dict keyed by "RESNAME-RESID"; each value is a list of fractional-native-contact scores for atoms in that residue. Take the mean per key to get the per-residue Q.
4. ordered_residue_keys - the keys of (2) in residue order.
5. res_res_q_matrix - (n_residues, n_residues) array of inter-residue Q values (symmetric).

Return type:

np.ndarray OR tuple of length 5

Raises:

SSException – If weights is given with protein_average=True (frame-averaged Q must be reweighted outside SOURSOP); if the weight vector fails validation (wrong length, out of [0, 1], non-finite, or not summing to 1); if excluding the native reference frame leaves a zero-sum weight vector; or if any constant cannot be coerced to float.

Example

>>> q = protein.get_Q()                          # per-frame Q
>>> q.mean()
0.85
>>> per_c, pairs, per_res, keys, res_res = protein.get_Q(protein_average=False)

References

Best RB, Hummer G, Eaton WA. Native contacts determine protein folding mechanisms in atomistic simulations. PNAS 2013. doi:10.1073/pnas.1311599110

get_contact_map(distance_thresh=5.0, mode='closest-heavy', stride=1, weights=False)[source]

Inter-residue contact map and per-residue contact-order vector.

For every pair of residues this returns the fraction of frames in which the chosen inter-residue distance is below distance_thresh. The companion contact-order vector reduces the 2D map to a per-residue summary by averaging over the appropriate neighbours-excluded denominator.

i to i, i to i+1, and i to i+2 pairs are always excluded. ACE / NME caps are excluded from the residue set.

Parameters:

distance_thresh (float, optional) – Contact threshold in Angstroms. A pair is “in contact” in a given frame if its distance under the chosen mode is below this value. Default is 5.0.
mode ({'closest-heavy', 'ca', 'closest', 'sidechain', 'sidechain-heavy'}, optional) –
How the inter-residue distance is defined (the schemes from mdtraj.compute_contacts):
- 'closest-heavy' (default) - closest pair of heavy atoms.
- 'ca' - alpha-carbon distance.
- 'closest' - closest pair of any atoms (incl. hydrogens).
- 'sidechain' - closest pair of sidechain atoms (mdtraj >= 1.8).
- 'sidechain-heavy' - closest pair of sidechain heavy atoms (mdtraj >= 1.8). Raises if any residue is glycine.
stride (int, optional) – Use every stride-th frame. Default is 1. May be combined with weights (the weight vector is subsampled with the same stride and renormalised internally so it aligns with the strided frames).
weights (list or np.ndarray, optional) – Per-frame weights for re-weighted averaging, one entry per trajectory frame (len(weights) == n_frames), each in [0, 1] and summing to 1. Validated by soursop.ssutils.validate_weights(). When supplied the contact map is the deterministic weighted mean of the per-frame contact maps over the (strided) frames. Default False.

Returns:

(contact_map, contact_order) where:

contact_map is an (N, N) matrix of contact fractions in [0, 1].
contact_order is a 1D vector of length N giving the mean fractional contact count per residue, normalised by the number of valid partners for each residue.

Return type:

tuple of (np.ndarray, np.ndarray)

Raises:

SSException – For mode='sidechain-heavy' when any glycine is present, or if the weight vector fails validation (wrong length, out of [0, 1], non-finite, or not summing to 1).

Example

>>> cmap, corder = protein.get_contact_map()
>>> cmap.shape
(92, 92)

get_clusters(region=None, n_clusters=10, backbone=True, stride=20)[source]

Ward hierarchical clustering of conformations by RMSD.

Builds a strided all-vs-all RMSD matrix and groups frames using Ward’s agglomerative clustering. The number of clusters (n_clusters) must be specified up front. The chosen “centroid” of each cluster is the member whose summed exponential RMSD weight is largest (i.e. the most internally-connected frame).

Parameters:

region (list of length 2, optional) – [first_resid, last_resid] (inclusive) restricting the atoms used for the RMSD calculation. None (default) uses the full chain.
n_clusters (int, optional) – Target number of clusters. Default is 10. Note that the actual number of clusters returned may be smaller if some labels are never assigned.
backbone (bool, optional) – If True (default), use only backbone heavy atoms for the RMSD (much faster). If False, use every atom in the selection.
stride (int, optional) – Use every stride-th frame. Default is 20. stride=1 does an exact all-vs-all but can be very slow.

Returns:

(cluster_members, cluster_trajs, cluster_distance_matrices, cluster_centroids, cluster_frames) where:

cluster_members (list of int): number of frames in each cluster.
cluster_trajs (list of mdtraj.Trajectory): one sub-trajectory per cluster containing the assigned frames.
cluster_distance_matrices (list of np.ndarray): per-cluster all-vs-all RMSD submatrix (Angstroms).
cluster_centroids (list of int): for each cluster, the index within that cluster of the representative frame.
cluster_frames (list of np.ndarray): frame indices in the original trajectory that were assigned to each cluster.

Return type:

tuple of length 5

Example

>>> members, trajs, dmats, centroids, frames = protein.get_clusters(n_clusters=5)
>>> members
[120, 95, 80, 60, 45]

get_all_SASA(probe_radius=1.4, mode='residue', stride=20, weights=False, etol=1e-07)[source]

Solvent-accessible surface area (SASA) per residue or per atom.

Internally uses mdtraj’s Shrake-Rupley implementation (Golden-Spiral algorithm). Results are memoised per (stride, mode, probe_radius) triple so repeated calls with the same parameters are free.

SASA is returned in Angstroms^2; probe_radius is in Angstroms. Because the calculation is expensive the default stride is 20.

Parameters:

probe_radius (float, optional) – Solvent probe radius in Angstroms. Default 1.4 (water).
mode ({'residue', 'atom', 'sidechain', 'backbone', 'all'}, optional) –
Granularity:
- 'residue' (default) - per-residue SASA, shape (n_frames_strided, n_residues).
- 'atom' - per-atom SASA, shape (n_frames_strided, n_atoms).
- 'sidechain' - per-residue SASA summed over sidechain atoms (excluding backbone H, HA, HA2, HA3 which mdtraj’s ‘sidechain’ selector erroneously includes).
- 'backbone' - per-residue SASA summed over backbone atoms (including the backbone hydrogens that mdtraj’s ‘backbone’ selector erroneously excludes).
- 'all' - returns a 3-tuple of (residue, sidechain, backbone) arrays.
stride (int, optional) – Use every stride-th frame. Default is 20.
weights (array_like or False, optional) – Per-frame re-weighting vector (validated against the strided frame axis). False (default) returns the per-frame array (and uses the memo cache). If supplied, the strided frame axis is collapsed and the deterministic weighted-mean SASA is returned (not cached).
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

See mode description. All arrays are in Angstroms^2. If weights is supplied the frame axis is collapsed (e.g. (n_residues,) for mode='residue'; a 3-tuple of such arrays for mode='all').

Return type:

np.ndarray or tuple of np.ndarray

Example

>>> sasa = protein.get_all_SASA(mode='residue', stride=10)
>>> sasa.mean(axis=0)   # mean per-residue SASA across frames

get_regional_SASA(R1, R2, probe_radius=1.4, stride=20, weights=False, etol=1e-07)[source]

Mean total SASA summed over a contiguous residue range.

Calls get_all_SASA() (whose result is memoised) and sums the time-averaged per-residue SASA values from R1 to R2-1. Note the SASA must be computed on the full protein first because atoms outside the region can occlude region atoms.

SASA is returned in Angstroms^2.

Parameters:

R1 (int) – First residue of the region.
R2 (int) – Last residue of the region (exclusive in the internal sum).
probe_radius (float, optional) – Solvent probe radius in Angstroms. Default 1.4 (water).
stride (int, optional) – Use every stride-th frame. Default is 20.
weights (array_like or False, optional) – Per-frame re-weighting vector (validated against the strided frame axis). False (default) is byte-identical to before; if supplied each residue’s time-average is the deterministic weighted mean before summing.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Sum over residues [R1, R2) of the time-averaged per-residue SASA in Angstroms^2.

Return type:

float

Example

>>> protein.get_regional_SASA(5, 25)
2150.6

get_site_accessibility(input_list, probe_radius=1.4, mode='residue_type', stride=20, weights=False, etol=1e-07)[source]

Mean & std SASA for selected residue types or specific residue indices.

Under mode='residue_type', input_list is interpreted as canonical three-letter residue names and every matching residue in the chain is summarised. Under mode='resid' the input is treated as a list of explicit residue indices.

Useful for comparing accessibility of chemically-equivalent residues at different positions in the chain.

Parameters:

input_list (list) –
Either:
- three-letter residue names, e.g. ['TRP', 'TYR', 'GLN'] (when mode='residue_type').
- residue indices, e.g. [1, 2, 3, 4] (when mode='resid').
probe_radius (float, optional) – Solvent probe radius in Angstroms. Default 1.4 (water).
mode ({'residue_type', 'resid'}, optional) – How to interpret input_list. Default 'residue_type'.
stride (int, optional) – Use every stride-th frame. Default is 20.
weights (array_like or False, optional) – Per-frame re-weighting vector (validated against the strided frame axis). False (default) gives the ordinary mean/std (byte-identical to before); if supplied the per-residue [mean, std] is the deterministic weighted mean and weighted (population) std.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

Keys are "RESNAME-RESID" strings; each value is [mean_SASA, std_SASA] in Angstroms^2.

Return type:

dict

Raises:

SSException – If input_list contains a residue name not in the canonical 20 + caps set when mode='residue_type'.

Example

>>> protein.get_site_accessibility(['LEU', 'VAL'])
{'LEU-5': [42.1, 6.0], 'LEU-12': [55.4, 8.3], 'VAL-21': [38.0, 5.5], ...}

get_local_collapse(window_size=10, bins=None, verbose=True, weights=False, etol=1e-07)[source]

Sliding-window local radius of gyration profile.

At every starting residue i the local Rg is computed over the window [i, i + window_size], producing a per-position profile of compaction along the chain. Useful for identifying locally collapsed regions in otherwise extended IDPs (and vice versa).

Parameters:

window_size (int, optional) – Size of the sliding window in residues. Must be <= n_residues. Default 10.
bins (np.ndarray or list, optional) – Histogram bin edges (evenly spaced). If None (default), np.arange(0, 10, 0.01) is used.
verbose (bool, optional) – If True (default), print one status line per starting residue.
weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) is byte-identical to before. If supplied, each window’s mean/std is the deterministic weighted mean / weighted (population) std and the histogram counts are frame-weighted.
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

(mean, std, histo, bins) where:

mean (list of float, length n_residues - window_size + 1): mean local Rg per starting residue.
std (same length): standard deviation of local Rg per starting residue.
histo (list of np.ndarray, same length): histogram counts.
bins (np.ndarray): the bin edges actually used.

Return type:

tuple of (list, list, list, np.ndarray)

Raises:

SSException – If window_size > n_residues or bins is malformed.

Example

>>> mean, std, hist, bins = protein.get_local_collapse(window_size=8)
>>> # plot mean vs residue index to see where the chain collapses

Returns (legacy enumeration)

[0] - list of floats (length = n)
Each float reports on the mean Rg at a specific position along the sequence.

[1] - list of floats (length = n)
Each float reports on the standard deviation of the Rg distribution at a specific position along the sequence.

[2] - list of np.ndarrays (length = n)
Histogram counts associated with the local Rg at a given position along the sequence. Basically, this is the emprical distribution that the standard devaitions and mean report on

[3] - np.ndarray (length = n)
Bin values for histogram counts returned in [2]

get_sidechain_alignment_angle(R1, R2, sidechain_atom_1='default', sidechain_atom_2='default')[source]

Per-frame angle between two residues’ sidechain orientation vectors.

For each residue a “sidechain vector” is the CA-to-tip vector, where the tip atom is the canonical sidechain terminus (e.g. CB for Ala, CZ for Phe/Tyr/Arg). The returned angle is between those two vectors. Useful for asking whether two sidechains tend to point towards each other (small angle) or away (~180 degrees).

Default tip atoms (from the OPLS-AA force field):

ALA: CB    CYS: SG    ASP: CG    GLU: CD    PHE: CZ
HIS: NE2   ILE: CD1   LYS: NZ    LEU: CG    MET: CE
ASN: CG    PRO: CG    GLN: CD    ARG: CZ    SER: OG
THR: CB    VAL: CB    TRP: CG    TYR: CZ

GLY, ACE, NME have no sidechain and raise.

Parameters:

R1 (int) – Resid of the first residue. Must not be GLY/ACE/NME.
R2 (int) – Resid of the second residue. Must not be GLY/ACE/NME.
sidechain_atom_1 (str, optional) – Override the tip atom of R1. 'default' (default) uses the canonical OPLS-AA atom from the table above.
sidechain_atom_2 (str, optional) – Override the tip atom of R2. Same convention as sidechain_atom_1.

Returns:

1D array of length n_frames with the per-frame alignment angle in degrees.

Return type:

np.ndarray

Raises:

SSException – If either residue is glycine or a cap, if a residue index is out of range, or if a custom sidechain atom name cannot be found.

Example

>>> theta = protein.get_sidechain_alignment_angle(5, 45)
>>> theta.mean()
82.4

get_dihedral_mutual_information(angle_name='psi', bwidth=0.6283185307179586, stride=1, weights=False, normalize=False)[source]

Mutual-information matrix between every pair of a chosen dihedral.

For each pair of dihedrals \((\phi_i, \phi_j)\) of the same type, this computes the mutual information

\[I(\phi_i; \phi_j) = H(\phi_i) + H(\phi_j) - H(\phi_i, \phi_j)\]

where the Shannon entropies \(H\) come from binned histograms. The natural baseline for interpretation is the equivalent matrix computed on a polymer-model reference ensemble (e.g. an excluded-volume simulation), since the absolute values are bin-width-dependent. Setting normalize=True divides each entry by the joint entropy, which can help with comparisons across ensembles of different sizes.

Parameters:

angle_name ({'phi', 'psi', 'omega', 'chi1'}, optional) – Dihedral to use. Default 'psi'. (chi2-chi5 are not supported by this method.)
bwidth (float, optional) – Histogram bin width on the [-pi, pi] interval. Smaller widths give a finer-grained estimate but need more data. Default pi/5.
stride (int, optional) – Use every stride-th frame. Default 1.
weights (array_like, optional) – Per-frame weights for re-weighted MI estimation. Must satisfy len(weights) == n_frames // stride. Default False.
normalize (bool, optional) – If True, divide each MI value by the joint entropy. Default False.

Returns:

Symmetric (n_dihedrals, n_dihedrals) mutual-information matrix. The diagonal is the self-information of each dihedral.

Return type:

np.ndarray

Raises:

SSException – If bwidth is not in (0, 2*pi] or angle_name is not one of the four allowed strings.

Example

>>> MI = protein.get_dihedral_mutual_information(angle_name='psi')
>>> MI.shape
(92, 92)

get_local_to_global_correlation(mode='COM', n_cycles=100, max_num_pairs=10, stride=20, weights=False, verbose=True)[source]

How well finite subsets of inter-residue distances predict global Rg.

For each k in [1, max_num_pairs), this randomly picks k inter-residue distance pairs n_cycles times and computes the correlation between the chain’s \(R_g^2\) and the mean-squared of those k distances. Averaging the correlations across cycles gives a per-k measure of how informative a finite local measurement is about the global chain dimensions.

This is intended to inform experimental design (e.g. how many smFRET pairs are “enough” to pin down \(R_g\)).

Parameters:

mode ({'COM', 'CA'}, optional) – Distance type. Default 'COM'.
n_cycles (int, optional) – Number of random pair-resamples per k. Default 100. Larger n_cycles reduces sampling noise; values below ~50 are not recommended.
max_num_pairs (int, optional) – Upper bound on the number of pairs sampled. Default 10. Beyond 10-20 the average correlation typically saturates close to 1.
stride (int, optional) – Use every stride-th frame. Default 20.
weights (array_like or False, optional) – Per-frame weights for the covariance computation. Default False (uniform weighting).
verbose (bool, optional) – If True (default), print one status line per k.

Returns:

(raw, n_pairs, mean_corr, std_corr) where:

raw (np.ndarray of shape (n_cycles * (max_num_pairs - 1), 2)): one row per resample; col 0 is k and col 1 is the \(R_g^2\) vs. local-mean-square correlation for that resample.
n_pairs (np.ndarray): the k values used, [1, 2, ..., max_num_pairs - 1].
mean_corr (np.ndarray): mean correlation per k.
std_corr (np.ndarray): standard deviation of correlation per k (assumes Gaussian distribution, may be approximate).

Return type:

tuple of length 4

Raises:

SSException – If mode is invalid, the strided Rg length disagrees with the inter-residue distance length, or the installed numpy version cannot accept aweights in numpy.cov.

Example

>>> import numpy as np
>>> np.random.seed(42)   # the pair-resampling consumes the RNG
>>> raw, ks, mean_c, std_c = protein.get_local_to_global_correlation()

get_overlap_concentration()[source]

Overlap concentration \(c^*\) of the chain, in Molar.

Defined as the polymer concentration at which independent chains begin to occupy the same volume in trans, computed by treating each chain as a sphere of radius equal to its mean \(R_g\). Useful as a coarse limit on dilute-regime behaviour.

Returns:: Overlap concentration \(c^*\) in Molar.
Return type:: float

Example

>>> protein.get_overlap_concentration()
0.0042

get_angle_decay(atom1='C', atom2='N', return_all_pairs=False, weights=False, etol=1e-07)[source]

Bond-vector autocorrelation along the chain (persistence-length proxy).

For each residue we form a single intra-residue vector \(\vec{v}_i\) (default C -> N, the peptide bond direction). For every pair \((i, j)\) we compute \(\langle \hat{v}_i \cdot \hat{v}_j \rangle\) across frames; averaging over all pairs with the same sequence separation \(|i-j|\) yields the decay profile. A faster initial decay implies a less rigid chain.

Both atom1 and atom2 must exist in every residue, so the peptide C->N vector is essentially the only reasonable choice.

Parameters:

atom1 (str, optional) – Origin atom name. Default 'C'.
atom2 (str, optional) – Tip atom name. Default 'N'.
return_all_pairs (bool, optional) – If True, additionally return a dict of every individual "i-j" residue-pair correlation. Default is False.
weights (array_like or False, optional) – Per-frame re-weighting vector. False (default) is byte-identical to before. If supplied, each residue-pair’s mean cos(theta) is the deterministic frame-weighted mean (the over-pairs averaging per separation is unchanged).
etol (float, optional) – Tolerance on |sum(weights) - 1|. Default 1e-7.

Returns:

If return_all_pairs=False (default): array of shape (n_separations, 3) with columns [separation, mean_corr, std_corr]. The first row is [0, 1.0, 0.0] (self correlation).
If return_all_pairs=True: (array, pair_dict) where pair_dict["i-j"] is the mean correlation for that specific residue pair.

Return type:

np.ndarray OR tuple of (np.ndarray, dict)

Example

>>> decay = protein.get_angle_decay()
>>> # decay[:, 0] is separation, decay[:, 1] is mean correlation