ssprotein

The ssprotein module holds the SSProtein class. This is the main class used for working with simulations of individual proteins.

Once a trajectory has been read in and an SSProtein object extracted from the SSTrajectory object, a wide range of analyses are available.

By way of an example:

from soursop.sstrajectory import SSTrajectory
import numpy as np

# read in a trajectory
TrajOb = SSTrajectory('traj.xtc', 'start.pdb')

# get the first protein chain (this is an SSProtein object)
ProtObj = TrajOb.proteinTrajectoryList[0]

# print the mean radius of gyration
mean_e2e = np.mean(ProtObj.get_end_to_end_distance())
print(mean_e2e)

SSProtein Properties

SSProtein objects have a set of object variables associated with them.

class soursop.ssprotein.SSProtein(traj, debug=False)[source]
resid_with_CA()

Return a list of resids that have CA (alpha-carbon) atoms. The values associated with this property are built using the internal function __get_resid_with_CA().

Returns:

Returns a list of integers where each int is the resid of a residue that has a CA atom

Return type:

list

See also

idx_with_CA

ncap()

Flag that returns if an N-terminal capping residue is present (or not).

Returns:

True if N-teminal cap is present, False if not

Return type:

bool

ccap()

Flag that returns if a C-terminal capping residue is present (or not).

Returns:

True if C-teminal cap is present, False if not

Return type:

bool

n_frames()

Returns the number of frames in the trajectory.

Returns:

Returns the number of frames in the simulation trajectory

Return type:

int

n_residues()

Returns the number of residues in the protein (including caps)

Returns:

Returns the number of frames in the protein

Return type:

int

residue_index_list()

Returns the list of resids associated with this protein. These will always start from 0 and increment (as of version 0.1.3) but these are extracted here EXPLICITLY from the underlying mdtraj.topology object, which can be useful for debugging.

Returns:

Returns a list of resid values which should be a uniformly incrementing list of numbers starting at 0 and incrementing to self.n_residues-1.

Return type:

list

length()[source]

SSProtein Functions

class soursop.ssprotein.SSProtein(traj, debug=False)[source]
reset_cache()[source]

Function that enables the user to manually delete data that the SSProtein object has precomputed. In general this shouldn’t be necessary, but there could be some edge cases where resetting the stored data cache is helpful.

No input or return type

Parameters:

None

Return type:

None

print_residues(verbose=True)[source]

Function to help determine the mapping of residue ID to PDB residue value. Prints the mapping between resid and PDB residue, and returns this information in a list.

Returns a list of lists, where each list element is itself a list of two elements, index position and the resname-resid from the PDB file.

Parameters:

verbose (bool {True}) – If set to True, print_residues() will print out to screen and also return a list. If set to False, means nothing is printed to the screen.

Returns:

List containing a mapping of the zero-indexed residues and their names.

Return type:

return_list

get_amino_acid_sequence(oneletter=False, numbered=True)[source]

Returns the protein’s amino acid sequence.

Parameters:
  • oneletter (bool {False}) – If True returns a single sequence of one letter amino acid codes. If False get a list of 3 letter codes with residue number separated by a ‘-’ character.

  • numbered (bool {True}) – If True the return value is a list of RESNAME-RESID strings, if False return value is a list of RESNAME in the correct order.

Returns:

A list comprised of the 1-letter or 3-letter names of the amino acid sequence.

Return type:

list

get_residue_atom_indices(resid, atom_name=None)[source]

Function that looks up the atomic indices of a specific residue’s atom. This is a memoization function; i.e., when called it populates a table so the computational cost here is only incurred once.

Note:

This function used to be called residue_atom_com().

Parameters:
  • resid (int) – The residue index to lookup. If the residue has not been cached, it will be added to the lookup table for later reuse.

  • atom_name (str or None) – The name of the atom to lookup which will return the corresponding residue ID. Like the previous parameter, if that residue does not exist in the lookup table it will be added for later reuse. Default = None

Returns:

A list containing all the atoms corresponding to a given residue ID that match the input residue id (resid) or, the residue corresponding to the atom name (atom_name).

Return type:

list

get_CA_index(resid)[source]
Get the CA atom index for the residue defined by residueIndex.

Again does this.

via memoization - i.e. the first time a specific residue is requested the function looks up the information and then stores it locally in case its needed again.

Defensivly checks for errors.

Parameters:

residueIndex (int) – Defines the resid to select the CA from.

Returns:

A list of size 1 containing the CA atom index for the residue index, residueIndex.

Return type:

list

Raises:

SSException – When the number of CA atoms do not equal 1.

get_multiple_CA_index(resID_list=None)[source]

Returns the atom indices associated with the C-alpha (CA) atom for the residues defined in the resID_list OR for all residues, if no list is provided.

Parameters:

resID_list (list of int or None) – Defines a list of residues for which the C-alpha atom index will be retrieved. If no list is provided we simply select the list of residues with C-alphas, whose indices have been corrected. Default is None.

Returns:

Returns a list C-Alpha indices of the input list of residue IDs.

Return type:

list

get_residue_COM(resid, atom_name=None)[source]

Property that returns the 3 x n np.ndarray with the COM of the residue in question for every frame in the simulation. The absolute positions are returned in Angstroms.

Computes it once and then looks up this data.

Parameters:
  • resid (int) – Resid for residue of interest

  • atom_name (str) – Lets you pass a specific atom names associated with the residue of interest

Returns:

Returns an [x,y,z] x n np.ndarray of x/y/z positions for this residue COM for EVERY frame.

Return type:

np.ndarray

get_residue_mass(R1)[source]

Returns the mass associated with a specific residue.

Parameters:

R1 (int) – Resid to be examined

Returns:

Returns the mass of the residue

Return type:

float

calculate_all_CA_distances(residueIndex, mode='CA', only_C_terminal_residues=True, periodic=False, stride=1)[source]

Calculate the full set of distances between C-alpha atoms. Specifically, from the residueIndex, this function will calculate distances between that residue and all other residues that have CA- atoms, either as CA-CA distance, or as COM-COM distance (as defined by the mode keyword).

Note that by default this explicitly works in a way to avoid computing redundancy where we ONLY compute distances between residue i and residues greater than i up to the final residue. This behaviour is defined by the only_C_terminal_residues flag.

Distance is returned in Angstroms.

Parameters:
  • residueIndex (int) – Defines the residue index to select the CA from.

  • mode (str) –

    String, must be one of either CA or COM

    • CA - alpha carbon.

    • COM - center of mass (associated withe the residue).

    Default = ‘CA’.

  • only_C_terminal_residues (bool) – This variable means that only residues C-terminal of the residueIndex value will be considered. This is useful when performing an ALL vs. ALL matrix as it ensures that only the upper triangle is calculated if we iterate over all residues, but may not be deseriable in other contexts. Default = True.

  • periodic (bool) – Flag which - if set to true - means we use minimum image convention for computing distances. Default = False.

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Default = 1.

Returns:

Array containing the end-to-end distance measures based on the input mode.

Return type:

numpy.array

Raises:

SSException – If the input mode is nether ‘CA’ or ‘COM’.

get_inter_residue_COM_distance(R1, R2, stride=1)[source]

Function which calculates the complete set of distances between two residues’ centers of mass (COM) and returns a vector of the distances.

Distance is returned in Angstroms. Note no periodic option is currently available here.

Parameters:
  • R1 (int) – Resid of the first residue

  • R2 (int) – Resid of the second residue

  • stride (int) – Defines the spacing betwen frames to compare with - i.e. take every $stride-th frame. Default = 1.

Returns:

Returns an array of inter-residue distances in angstroms

Return type:

np.ndarray

get_inter_residue_COM_vector(R1, R2)[source]

Function which calculates the complete set of distances between two residues’ centers of mass (COM) and returns the inter-residue distance vector.

NOTE: This gives a VECTOR and not the distance between the two centers of mass (which is calculated by get_inter_residue_COM_distance). Note units for the relative positions in the vector are in Angstroms.

WARNING: This changed in 0.2.0, where in 0.1.9x the return units was in nm).

Parameters:
  • R1 (int) – Residue index of first residue

  • R2 (int) – Residue index of second residue

Returns:

Returns an array of inter-residue distances in angstroms

Return type:

np.ndarray

get_inter_residue_atomic_distance(R1, R2, A1='CA', A2='CA', mode='atom', periodic=False, stride=1)[source]

Function which returns the distance between two specific atoms on two residues. The atoms selected are based on the ‘name’ field from the topology selection language. This defines a specific atom as defined by the PDB file. By default A1 and A2 are CA (C-alpha) but one can define any residue of interest.

We do not perform any sanity checking on the atom name - this gets really hard - so have an explicit try/except block which will warn you that you’ve probably selected an illegal atom name from the residues.

Distance is returned in Angstroms.

Parameters:
  • R1 (int) – Residue index of first residue

  • R2 (int) – Residue index of second residue

  • A1 (str) – Atom name of the atom in R1 we’re looking at. Default = CA.

  • A2 (str) – Atom name of the atom in R2 we’re looking at. Default = CA.

  • mode (str) –

    Mode allows the user to define differnet modes for computing atomic distance. The default is ‘atom’ whereby a pair of atoms (A1 and A2) are provided. Other options are detailed below and are identical to those offered by mdtraj in compute_contacts

    • ca - same as setting ‘atom’ and A1=’CA’ and A2=’CA’, this uses the C-alpha atoms

    • closest - closest atom associated with each of the residues, i.e. the is the point of closest approach between the two residues

    • closest-heavy - same as closest, except only non-hydrogen atoms are considered

    • sidechain - closest atom where that atom is in the sidechain. Note this requires mdtraj version 1.8.0 or higher.

    • sidechain-heavy - closest atom where that atom is in the sidechain and is heavy. Note this requires mdtraj version 1.8.0 or higher.

  • periodic (bool) – Flag which - if set to true - means we use minimum image convention for computing distances. Default = False.

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Note this operation may scale poorly as protein length increases at which point increasing the stride may become necessary. Default = 1.

Returns:

Returns an array of inter-residue atomic distances

Return type:

np.ndarray

get_distance_map(mode='CA', RMS=False, periodic=False, stride=1, weights=False, verbose=True)[source]

Function to calculate the CA defined distance map for a protein of interest. Note this function doesn’t take any arguments and instead will just calculate the complete distance map.

NOTE that distance maps are calculated between all CA-CA distances and NOT center of mass positions. This also means ACE/NME caps are EXCLUDED from this anlysis.

Distance is described in Angstroms.

Parameters:
  • mode (str) – String, must be one of either ‘CA’ or ‘COM’. - ‘CA’ = alpha carbon. - ‘COM’ = center of mass (associated withe the residue). Default = ‘CA’.

  • RMS (bool) – If set to False, scaling map reports ensemble average distances (this is the standard and default behaviour). If True, then the distance reported is the root mean squared (RMS) = i.e. SQRT(<r_ij^2>), which is the formal order parameter that should be used for polymeric distance properties. Default = False.

  • periodic (bool) – Flag which - if set to true - means we use minimum image convention for computing distances. Default = False.

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Default = 1.

  • weights (list or array of floats) – Defines the frame-specific weights if re-weighted analysis is required. This can be useful if an ensemble has been re-weighted to better match experimental data, or in the case of analysing replica exchange data that is re-combined using T-WHAM.

  • verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!

Returns:

A 2-tuple containing * [0] : The distance map derived from the measurements between CA atoms. * [1] : The standard deviation corresponding to the distance map.

Return type:

tuple

get_polymer_scaled_distance_map(nu=None, A0=None, min_separation=10, mode='fractional-change', stride=1, weights=False, etol=1e-07, verbose=True)[source]

Function that allows for a global assesment of how well all i-j distances conform to standard polymer scaling behaviour (i.e. \(r_{(i,j)} = A_0|i-j|^{\nu}\)).

Essentially, this generates a distance map (2D matrix of i vs. j distances) where that distance is either normalized by the expected distance for a provided homopolymer model, or quantifies the fractional deviation from a homopolymer model fit to the data. These two modes are explained in more detail below.

In this standard scaling relationship:

  • \(\langle r_{i,j} \rangle\) : Average inter-residue distance of residue i and j

  • \(A_0\): Scaling prefactor. Note this is NOT the same numerical value as the math:R_0 prefactor that defines the relationship \(R_g = R_0 N^{\nu}\).

  • \(|i-j|\) : Sequence separation between residues i and j

  • \({\nu}\) : The intrinsic polymer scaling exponent (in principle should be between 0.33 and 0.598).

This is the scaling behaviour expected for a standard homopolymer. This function then assess how well this relationship holds for ALL possible inter-residue distances.

This function returns a four position tuple.

Position one is an n x n numpy matrix (where n = sequence length), where the element is either the default value OR quantifes the deviation from a polymer model in one of two ways.

Positions two and three are the nu and A0 values used, respectively.

Finally, position 4 will be the reduced chi-square fitting to the polymer model for the internal scaling profile (i.e. how A0 and nu are originally calculated).

NB: The reduced chi-squared value will be -1 if nu and A0 are provided in the function call.

If no options are provided, the function calculates the best fit to a homopolymer mode using the default parameters associated with the get_scaling_exponent() function, and then uses this model to determine pairwise deviations.

Parameters:
  • nu (float) – Scaling exponent used (if provided). Note for a provided nu to be used, both nu and A0 must be provided. Default is None

  • A0 (float) – Scaling prefactor used (if provided). Note for a provided A0 to be used, both A0 and nu must be provided. Default is None

  • min_separation (int) – Minimum distance for which deviations are calculated. At close distances, we expect local steric effects to cause deviations from a polymer model, so this value defines the threshold minimum distance to be used. Default is 10.

  • mode (str) –

    Defines the mode in which deviation from a homopolymer model is calculated. Options are: fractional-change, signed-fractional-change, signed-absolute-change, scaled.

    • fractional-change: Each inter-residue deviation is calculated as \(d_{(i,j)} = (|r_{i,j} - p_{i,j}|) / p_{i,j}\), where \(r_{i,j}\) is the mean distance from the simulation for residues i and j, while \(p_{i,j}\) is the expected distance between any two i,j residues in the polymer model.

    • signed-fractional-change: Each inter-residue deviation is calculated as \(d_{(i,j)} = (r_{i,j} - p_{i,j}) / p_{i,j}\), i.e. the same as the fractional-change mode except that a sign is now included. Positive values mean there is expansion with respect to the homopolymer behaviour, while negative values mean there is contraction with respect to the homopolymer model.

    • signed-absolute-change: Each inter-residue deviation is calculated as \(d_{(i,j)} = (r_{i,j} - p_{i,j})\) i.e. the same as the signed-fractional-change, except now it is no longer fraction but in absolute distance units. This can be useful for getting a sense of by how-much the real behaviour deviates from the model in terms of Angstroms.

    • scaled: Each inter-residue deviation is calculated as \(d_{(i,j)} = (r_{i,j}/p_{i,j})\)

  • weights (list) – Defines the frame-specific weights if re-weighted analysis is required. This can be useful if an ensemble has been re-weighted to better match experimental data, or in the case of analysing replica exchange data that is re-combined using T-WHAM. Can be a list of any slicable element vector. Default = None.

  • etol ({0.0000001} float) – Defines the error tollerance for the weight sums - i.e. if abs(np.sum(weights) - 1) > etol an exception is raised.

  • verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting! Default = True.

Returns:

  • [0] : n x n numpy matrix (where n = sequence length), where the element is either the default value OR quantifes the deviation from a polymer model in one of two ways.

  • [1] : float defining the scaling exponent nu

  • [2] : float defining the A0 prefactor

  • [3] : reduced chi-squared fitting to the polymer model (goodness of fit)

Return type:

tuple (len = 4)

Raises:

soursop.ssexceptions.SSException – If inappropriate inputs are passed this function raises an SSException

get_radius_of_gyration(R1=None, R2=None)[source]

Returns the radius of gyration associated with the region defined by the intervening stretch of residues between R1 and R2. When residues are not provided the full protein’s radius of gyration (INCLUDING the caps, if present) is calculated.

Radius of gyration is returned in Angstroms.

Parameters:
  • R1 (int) – Index value for first residue in the region of interest. If not provided (None) then first residue is used. Default = None

  • R2 (int) – Index value for last residue in the region of interest. If not provided (False) then last residue is used. Default = None

Returns:

Returns a numpy array with per-frame instantaneous radius of gyration

Return type:

np.ndarray

get_hydrodynamic_radius(R1=None, R2=None, mode='nygaard', alpha1=0.216, alpha2=4.06, alpha3=0.821, distance_mode='CA')[source]

Returns the apparent hydrodynamic radius as calculated based on either the approximation derived by Nygaard et al. [1], or the Kirkwood-Riseman equation [2]. Which mode is used depends on the ‘mode’ keyword.

The Kirkwood-Riseman equation may be more accurate when computing the Rh for comparison with NMR-derived Rh values, as reported in [3].

For ‘nygaard’ mode, the arguments (alpha1/2/3) are used, and should not be altered to recapitulate behaviour defined by Nygaard et al. Default values here are alpha1=0.216, alpha2=4.06 and alpha3=0.821. Note that distance_mode parameter is ignored, as this is only used with the Kirkwood-Riseman implementation.

For ‘kr’ (Kirkwood-Riseman mode), the Kirkwood-Riseman equation uses the alpha carbon positions if distance_mode is set to CA, or the center of mass dpositions if distance_mode is set to COM. If either of these are not provided as options for distance_mode the function will fail. Default is CA. Note that the alpha1/2/3 arguments are ignored, as these are only used with the Nygaard mode.

NOTE that in both cases, these approximations hold for fully flexible disordered proteins, but will likely become increasingly unreasonable in the context of larger and larger folded domains.

The hydrodyamic radii are returned in Angstroms.

Parameters:
  • R1 (int) – Index value for first residue in the region of interest. If not provided (None) then first residue is used. Default = None

  • R2 (int) – Index value for last residue in the region of interest. If not provided (None) then last residue is used. Default = None

  • mode (str) – Selector that lets you choose the mode being used. The default is an empirical conversion from Rg proposed by Nygaard et al (mode=’nygaard’) but the alternative and possibly more accurate approach when comparing against Rh values derived from pulse-field gradient NMR experiments isthe Kirkwood-Riseman (mode = ‘kr’) model. Must be one of these two selectors

  • alpha1 (float) – First parameter in equation (7) from Nygaard et al. Default = 0.216

  • alpha2 (float) – Second parameter in equation (7) from Nygaard et al. Default = 4.06

  • alpha3 (float) – Third parameter in equation (7) from Nygaard et al. Default = 0.821

Returns:

Returns a numpy array with per-frame instantaneous hydrodynamic radii

Return type:

np.ndarray

References

[1] Nygaard M, Kragelund BB, Papaleo E, Lindorff-Larsen K. An Efficient Method for Estimating the Hydrodynamic Radius of Disordered Protein Conformations. Biophys J. 2017;113: 550–557.

[2] Kirkwood, J. G., & Riseman, J. (1948). The Intrinsic Viscosities and Diffusion Constants of Flexible Macromolecules in Solution. The Journal of Chemical Physics, 16(6), 565–573.

[3] Pesce, F., Newcombe, E. A., Seiffert, P., Tranchant, E. E., Olsen, J. G., Grace, C. R., Kragelund, B. B., & Lindorff-Larsen, K. (2022). Assessment of models for calculating the hydrodynamic radius of intrinsically disordered proteins. Biophysical Journal. https://doi.org/10.1016/j.bpj.2022.12.013

get_end_to_end_distance(mode='COM')[source]

Returns an array of end-to-end distances for the conformations in the simulation. Note that this is calculated between residues that have CA atoms (i.e. not caps), which means the if mode COM or mode=CA will be computing distances between the same residues.

End-to-end distance is returned in Angstroms

Parameters:

mode (['CA' or 'COM']) – Defines the mode used to define the residue position, either the residue center or mass or the residue CA atom. The provided mode must be one of these two options. Default = ‘COM’.

Returns:

Returns an array of floats with the end-to-end distance of the chain in Angstroms

Return type:

np.ndarray

get_end_to_end_vs_rg_correlation(mode='COM')[source]

Computes the correlation between Rg^2 and end-to-end^2.

Parameters:

mode (str {'CA'}) – String, must be one of either ‘CA’ or ‘COM’. - ‘CA’ = alpha carbon. - ‘COM’ = center of mass (associated withe the residue).

Returns:

A single float describing the correlation (as calculated by np.corrcoef).

Return type:

float

get_asphericity(R1=None, R2=None, verbose=True)[source]

Returns the asphericity associated with the region defined by the intervening stretch of residues between R1 and R2. This can be a somewhat slow operation, so a status message is printed for the impatient biophysicisit.

Asphericity is defined in many places - for my personal favourite explanation and definition see Page 65 of Andreas Vitalis’ thesis (Probing the Early Stages of Polyglutamine Aggregation with Computational Methods, 2009, Washington University in St. Louis).

Parameters:
  • R1 (int) – Index value for first residue in the region of interest. If not provided (None) then the first residue in the sequence is used (including caps). Default = None.

  • R2 (int) – Index value for second residue in the region of interest. If not provided (None) then the first residue in the sequence is used (including caps). Default = None.

  • verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting! Default = True.

Returns:

Returns a numpy array with the asphericity for each frame of the protein.

Return type:

np.ndarray

get_t(R1=None, R2=None)[source]

Returns the parameter <t>, a dimensionless parameter which describes the size of the ensemble.

<t> is defined in [1]

Parameters:
  • R1 (int {None}) – Index value for first residue in the region of interest. If not provided (False) then first residue is used.

  • R2 (int {None}) – Index value for last residue in the region of interest. If not provided (False) then last residue is used.

Returns:

Returns a numpy array with per-frame instantaneous t-values

Return type:

np.ndarray

References

1. Vitalis, A. (2009). Probing the Early Stages of Polyglutamine Aggregation with Computational Methods (R. Pappu (ed.)) [Ph.D. thesis from Washington University in St. Louis].

get_gyration_tensor(R1=None, R2=None, verbose=True)[source]

Returns the instantaneous gyration tensor associated with each frame.

Parameters:
  • R1 (int) – Index value for first residue in the region of interest. If not provided (None) then first residue is used. Default = None

  • R2 (int) – Index value for last residue in the region of interest. If not provided (False) then last residue is used. Default = None

  • verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting! Default = True.

Returns:

Returns a numpy array where each position is the frame-specific gyration tensor value

Return type:

np.ndarray

get_angles(angle_name)[source]

Returns all angle values for the specified angle name. Valid angle names are phi, psi, omega, chi1, chi2, chi3, chi4, chi5. Returns a tuple with both angle values for each frame and also also for each indexed angle the name of the atoms used to calculate that atom (based on the intrinsic underlying protein topology).

Parameters:

angle_name (str) – Name of angle being investigated. Must be one of phi, psi, omega, chi1, chi2, chi3, chi4, or chi5.

Returns:

Returns a tuple where element 0 is the list of lists, where each sub-list defines the atom names that are used to calculate each angle, while element 2 is a list of arrays, where each element in the array maps to the atoms defined in the corresponding element in tuple element 0, and each array defines the list of angles associated with each frame.

Return type:

tuple

Example

For example, in a protein with two arginine residues:

chi5 = ssprotein_object._angles('chi5')

# will print the atom names (chi5[0]) associated with the 2nd ([1]) arginine
# in the sequence
print(chi5[0][1])

# will print the actual chi5 angles associated with the 2nd arginine
print(chi5[1][1])
get_secondary_structure_DSSP(R1=None, R2=None, return_per_frame=False)[source]

Returns the a 4 by n numpy array inwhich column 1 gives residue number, column 2 is local helicity, column 3 is local ‘extended’ (beta strand/sheet) and column 4 is local coil on a per-residue basis.

Parameter R1 and R2 define a local region if only a sub-region is required.

The function return depends on if return_per_frame is set to True or False.

If False (default), a tuple is returned where the 4 internal vectors are

  • [0] - residue ID list - integers counting the resid for each element in the other lists

  • [1] - per residue factional helicity content

  • [2] - per residue factional extended content

  • [3] - per residue fractional coil content

If True, returns a tuple where the 4 elements are

  • [0] - residue ID list - integers counting the resid for each element in the other lists

  • [1] - frame vs. residue matrix where 1 = helix and 0 = not helix

  • [2] - frame vs. residue matrix where 1 = extended and 0 = not extended

  • [3] - frame vs. residue matrix where 1 = coil and 0 = not coil

Parameters:
  • R1 (int) – Default value is None. Defines the value for first residue in the region of interest. If not provided (False) then first residue is used. Default is None.

  • R2 (int) – Default value is None. Defines the value for last residue in the region of interest. If not provided (False) then last residue is used. Default is None

  • return_per_frame (bool) – Default value is False. Defines if the function should, instead of returning average data, return three N x F (N = number of residues, F number of frames) where 1 = the associated secondary structure is seen and 0 not seen. Default is False

Returns:

Returns a 4x tuple, where the meaning of the elements depend on if the return_per_frame is set to to True or False. See function description for further explanation.

Return type:

tuple

get_secondary_structure_BBSEG(R1=None, R2=None, return_per_frame=False)[source]

Returns a dictionary where eack key-value pair is keyed by a BBSEG classification type (0-9) and each value is a vector showing the fraction of time each residue is in that particular BBSEG type.

BBSEG classification types are listed below:

  • 0 - unclassified

  • 1 - beta (turn/sheet)

  • 2 - PII (polyproline type II helix)

  • 3 - Unusual region

  • 4 - Right handed alpha helix

  • 5 - Inverse C7 Equatorial (gamma-prime turn)

  • 6 - Classic C7 Equatorial (gamma turn)

  • 7 - Helix with 7 Residues per Turn

  • 8 - Left handed alpha helix

Parameters R1 and R2 are optional and allow a local sub-region to be defined.

The function return depends on if return_per_frame is set to True or False.

If False (default), a tuple is returned with the following elements:

  • [0] - residue ID list - integers counting the resid for each element in the other lists

  • [1] - Dictionary where keys are the BBSEG types and values are lists with fractional per-residue content for the associated BBSEG type.

If True, a tupe is returned with the following elements:

  • [0] - residue ID list - integers counting the resid for each element in the other lists

  • [1] - Dictionary where keys are BBSEG types and values are 2d np.arrays that map frame and residue with a 1 if the given BBSEG structure is found and 0 if not.

Parameters:
  • R1 (int) – Default value is False. Defines the value for first residue in the region of interest. If not provided (False) then first residue is used.

  • R2 (int) – Default value is False. Defines the value for last residue in the region of interest. If not provided (False) then last residue is used.

  • return_per_frame (bool {False}) – Default value is False. Defines if the function should, instead of returning average data, return three N x F (N = number of residues, F number of frames) where 1 = the associated secondary structure is seen and 0 not seen

Returns:

return_bbseg – Dictionary of 9 key-value pairs where keys are integers 0-8 and values are numpy arrays showing the fractional occupancy of each of the distinct types of defined secondary structure. Note the three classifications will sum to 1 (within numerical precision).

Return type:

dict

get_internal_scaling(R1=None, R2=None, mode='COM', mean_vals=False, stride=1, weights=False, etol=1e-07, verbose=True)[source]

Calculates the raw internal scaling info for the protein in the simulation. R1 and R2 define a sub-region to operate over if sub- regional analysis is required. When residues are not provided the full protein’s internal scaling (EXCLUDING the caps, if present) is calculated.

Distance is measured in Angstroms.

Returns two lists of the same length

  1. List of arrays, where each array is the simulation average set of inter-residue distances for the primary sequence separation defined by the equivalent position in the second array. Each array in this list will be a different length as there are many more i to i+1 pairs of residues than i to i+10 (e.g. in a 15 residue sequence).

  2. The sequence separation associated with each set of measurements (i.e. a single list which will normally be 0,1,2,3,…n where n = number of residues in sequence.

The internal scaling profile is a plot of sequence separation vs. mean through-space distance for all pairs of residues at a given sequence separation. What this means is that if we had a 6 residue peptide the internal scaling profile would be calculated as follows:

sequence separation = 0
average distance(average distance of 1-to-1, 2-to-2, 3-to-3, etc.)

sequence separation = 1
average distance(average distance of 1-to-2, 2-to-3, 3-to-4, etc.)

sequence separation = 2
average distance(average distance of 1-to-3, 2-to-4, 3-to-5, etc.)

sequence separation = 3
average distance(average distance of 1-to-4, 2-to-5, 3-to-6.)

sequence separation = 4
average distance(average distance of 1-to-5, 2-to-6)

sequence separation = 5
average distance(average distance of 1-to-6)

The residues considered for internal scaling analysis DO NOT include the ACE/NME peptide caps if present. This differs from CAMPARI, which DOES include the peptide caps. As an aisde, exclusion of the caps is different to how this is done in CAMPARI, which includes the caps, if present.

For more information on the ideas associated with internal scaling, see [1,2].

Parameters:
  • R1 (int) – Index value for first residue in the region of interest. If not provided (None) then first residue is used. Default = None

  • R2 (int) – Index value for last residue in the region of interest. If not provided (None) then last residue is used. Default = None

  • mode (['CA','COM']) – String, must be one of either ‘CA’ or ‘COM’. - ‘CA’ = alpha carbon. - ‘COM’ = center of mass (associated withe the residue). Default = ‘CA’.

  • mean_vals (bool) – This is False by default, but if True the mean IS is returned instead of the explicit values. In reality the non-default behaviour is probably preferable. Default = False

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame 1 to a trajectory we’d compare frame 1 and every stride-th frame. Note this operation may scale poorly as protein length increases at which point increasing the stride may become necessary. Default = 1

  • weights (list/np.ndarray {False}) – If provided this defines the frame-specific weights if re-weighted analysis is required. This can be useful if an ensemble has been re-weighted to better match experimental data, or in the case of analysing replica exchange data that is re-combined using T-WHAM. Default = False.

  • etol ({0.0000001} float) – Defines the error tollerance for the weight sums - i.e. if abs(np.sum(weights) - 1) > etol an exception is raised.

  • verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting! Default = True

Returns:

  1. Element 1 is a list of seequence separation values starting at 1 and monotonically incrementing by 1 until we’re at length of sequence - 1.

  2. Element 2 depends on if mean_vals is set to true or not. If it’s true then element 2 is a single value that reports the average inter-residue distance between ALL pairs of residues in all frames that are at some sequence separation i.e plotting element 1 and element 2 gives you an internal scaling profile.

    If mean_vals is False (default) then element 2 is a list of arrays where each array is EVERY individual inter-residue distance at the corresponding sequence separation. This means you can plot the distribution on the internal scaling profile should you so choose, rather than being limited to just the mean.

Return type:

Tuple

Returns:

The function returns a tuple with elements. Each sub-element is the same length are match 1-1 with one another.

  1. Element 1 is a list of seequence separation values starting at 1 and monotonically incrementing by 1 until we’re at length of sequence - 1.

  2. Element 2 depends on if mean_vals is set to true or not. If it’s true then element 2 is a single value that reports the average inter-residue distance between ALL pairs of residues in all frames that are at some sequence separation i.e plotting element 1 and element 2 gives you an internal scaling profile.

    If mean_vals is False (default) then element 2 is a list of arrays where each array is EVERY individual inter-residue distance at the corresponding sequence separation. This means you can plot the distribution on the internal scaling profile should you so choose, rather than being limited to just the mean.

Return type:

Tuple

References

[1] Mao, A.H., Lyle, N., and Pappu, R.V. (2013). Describing sequence-ensemble relationships for intrinsically disordered proteins. Biochem. J 449, 307-318.

[2] Pappu, R.V., Wang, X., Vitalis, A., and Crick, S.L. (2008). A polymer physics perspective on driving forces and mechanisms for protein aggregation - Highlight Issue: Protein Folding. Arch. Biochem. Biophys. 469, 132-141.

get_internal_scaling_RMS(R1=None, R2=None, mode='COM', stride=1, weights=False, etol=1e-07, verbose=True)[source]

Calculates the averaged internal scaling info for the protein in the simulation in terms of.

root mean square (i.e. sqrt(<Rij^2>) vs | i - j |.

R1 and R2 define a sub-region to operate over if sub-regional analysis is required. When residues are not provided the full protein’s internal scaling (EXCLUDING* the caps, if present) is calculated.

Distance is measured in Angstroms.

Returns two lists of the same length:

  1. sequence separation (|i - j|)

  2. mean sqrt(<Rij^2>)

The internal scaling profile is a plot of sequence separation vs. mean through-space distance for all pairs of residues at a given sequence separation. What this means is that if we had a 6 residue peptide the internal scaling profile would be calculated as follows.

sequence separation = 0 average distance(average distance of 1-to-1, 2-to-2, 3-to-3, etc.)

sequence separation = 1 average distance(average distance of 1-to-2, 2-to-3, 3-to-4, etc.)

sequence separation = 2 average distance(average distance of 1-to-3, 2-to-4, 3-to-5, etc.)

sequence separation = 3 average distance(average distance of 1-to-4, 2-to-5, 3-to-6.)

sequence separation = 4 average distance(average distance of 1-to-5, 2-to-6)

sequence separation = 5 average distance(average distance of 1-to-6)

The residues considered for internal scaling analysis DO NOT include the ACE/NME peptide caps if present. This differs from CAMPARI, which DOES include the peptide caps.

For more information on this and other ideas for how polymer-physics can be a useful way of thinking about proteins, take a look at

Mao, A.H., Lyle, N., and Pappu, R.V. (2013). Describing sequence-ensemble relationships for intrinsically disordered proteins. Biochem. J 449, 307-318.

and

Pappu, R.V., Wang, X., Vitalis, A., and Crick, S.L. (2008). A polymer physics perspective on driving forces and mechanisms for protein aggregation - Highlight Issue: Protein Folding. Arch. Biochem. Biophys. 469, 132-141.

Parameters:
  • R1 (int) – Index value for first residue in the region of interest. If not provided (None) then first residue is used. Default = None

  • R2 (int) – Index value for last residue in the region of interest. If not provided (None) then last residue is used. Default = None

  • mode (['CA','COM']) – String, must be one of either ‘CA’ or ‘COM’. - ‘CA’ = alpha carbon. - ‘COM’ = center of mass (associated withe the residue). Default = ‘CA’.

  • mean_vals (bool) – This is False by default, but if True the mean IS is returned instead of the explicit values. In reality the non-default behaviour is probably preferable. Default = False

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Note this operation may scale poorly as protein length increases at which point increasing the stride may become necessary. Default = 1

  • weights (list/np.ndarray {False}) – If provided this defines the frame-specific weights if re-weighted analysis is required. This can be useful if an ensemble has been re-weighted to better match experimental data, or in the case of analysing replica exchange data that is re-combined using T-WHAM. Default = False.

  • etol ({0.0000001} float) – Defines the error tollerance for the weight sums - i.e. if abs(np.sum(weights) - 1) > etol an exception is raised.

  • verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting! Default = True

Returns:

The function returns a tuple with elements. Each sub-element is the same length are match 1-1 with one another.

  1. Element 1 is a list of seequence separation values starting at 1 and monotonically incrementing by 1 until we’re at length of sequence - 1.

  2. Element 2 is a list of root mean-squared distance for each set of inter-residue distance that matches the sequence separation in element 1.

Return type:

Tuple

get_scaling_exponent(inter_residue_min=15, end_effect=5, subdivision_batch_size=20, mode='COM', num_fitting_points=40, fraction_of_points=0.5, fraction_override=False, stride=1, weights=False, etol=1e-07, verbose=True)[source]

Estimation for the A0 and nu-app exponents for the standard polymer relationship \(\langle r_{i,j}^2 \rangle^{1/2} = A_0 |i-j|^{\nu_{app}}\)

Here, \({\nu}^{app}\) (nu-app) reports on the solvent quality, while the prefactor (\(A_0\)) reports on the average chain persistence length and volume, which itself also depends on the solvent quality - see [1]. For polymers with an apparently scaling exponent above a 0.5 this works, but below this internal scaling starts to increasingly deviate from fractal behavior; as such, formally, this relationship becomes increasingly inaccurate as the apparent nu-app becomes smaller working. In practice, the best possible fit line does still track with relative compactness, although we urge that the meaning of nu-app calculated from a single-chain polymer when that chain is compact does not necessarily track with the nu-app calculated from multiple chains of varying length where the radius of gyration is fitted to the number of residues in the chain.

NOTE:

Despite their precision nu-app and A0 should be treated as

qualitative metrics, and are subject to finite chain effects. The idea of a polymer scaling behaviour is only necessarily useful in the case of a homopolymer, whereas heterpolymers engender massive averaging that can mask underlying conformational complexity. We strongly caution against over interpretation of the scaling exponent. For a better assement of how your chain actually deviates from homopolymer behaviour, see the function get_polymer_scaled_distance_map()

Parameters:
  • inter_residue_min (int) – Minimum distances used when selecting pairs of residues. This 25 threshold was determined previously, and essentially avoids scenarios where the two residues (i and j) are close to each other. The goal of this limit is to avoid finite chain size strongly influencing the scaling exponent limit. Default = 10.

  • end_effect (int) – Avoid pairs where one of the residues is $end_effect residues from the end. Helps mitigate end-effects. The default value of 5 was chosen as it’s around above the blob-length in a polypeptide. Note that for homopolymers this is much less of an issue. Default = 5

  • mode (str) – Defines the mode in which the internal scaling profile is calculated, can use either COM (center of mass) of each residue or the CA carbon of each residue. COM is more appropriate as CA will inherently give a larger profile. Default = COM

  • num_fitting_points (int) – Number of evenly spaced points to used to fit the scaling exponent in loglog space. 40 seems to be a decent number that scales well. Default = 40

  • fraction_of_points (float) – This is only used if fraction_override is set to True OR the sequence has less than the num_of_fitting_points residues. Means that instead of using a an absolute number of points (e.g. 40) to fit the loglog data, we use this fraction of residues. i.e. if the protein had 20 residues and fraction_of_points = 0.5 we’d use 10 points. Default = 0.5

  • fraction_override (bool) – If set to False then fraction_of_points ONLY used if the length of the sequence is less than the num_fitting points. If true then we explicitly use fraction_of_points and ignore num_fitting_points. Default is True.

  • stride (int {1}) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Note this operation may scale poorly as protein length increases at which point increasing the stride may become necessary.

  • weights (list or array of floats or bool) – Defines the frame-specific weights if re-weighted analysis is required. This can be useful if an ensemble has been re-weighted to better match experimental data, or in the case of analysing replica exchange data that is re-combined using T-WHAM. Default is False.

  • etol ({0.0000001} float) – Defines the error tollerance for the weight sums - i.e. if abs(np.sum(weights) - 1) > etol an exception is raised.

  • verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!

Returns:

Returns a 10 position tuple with the following associated values:

  • [0] - best nu

  • [1] - best A0

  • [2] - minimum nu identified in bootstrap fitting

  • [3] - maximum nu identified in bootstrap fitting

  • [4] - minimum A0 identified in bootstrap fitting

  • [5] - maximum A0 identified in bootstrap fitting

  • [6] - reduced chi^2 for the fit region

  • [7] - reduced chi^2 for ALL points

  • [8] - 2-column array, where col 1 is the sequence separation and col 2 is the real spatila separation for the ACTUAL data used to fit to the polymer model (i.e. these points are uniformly spaced from one another on a log-log plot). Reduced chi^2 for the fit region is calculated using this dataset.

  • [9] - 3-column array, where col 1 is the sequence separation, col 2 is the real spatial separation observed and col 3 is the best fit curve, for ALL i-j distances. Reduced chi^2 for all points is calculated using this dataset.

Return type:

tuple

get_local_heterogeneity(fragment_size=10, bins=None, stride=20, verbose=True)[source]

Function to calculate the vector of D values used to calculate the Phi parameter from Lyle et al[1]. The stride defines the spacing between frames which are analyzed. This is just for practical purposes.

The Phi calulation computes a D value for each frame vs. frame comparison - for a 2000 frame simulation this would be 4 million D values if every value was calculated which is a bit much, so the stride lets you define how many frames you should skip.

For a 2000 frame trajectory of a 80 residue protein with a stride=20 allows the calculation to take about 5 seconds. However, as protein size increases the computational cost of this process grows rapidly.

Parameters:
  • fragment_size (int) – Size of local region that is considered to be a single unit over which structural heterogeneity is examined. Should be between 2 and the length of the sequence. Default = 10

  • bins (np.ndarray) – Bins used to capture the heterogeneity at each position. If not defined, the default is a set of bins from 0 to 1 with an interval of 0.01 is used.

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Default = 20

  • verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting! Default = True.

Returns:

  • [0] : list of floats of len n, where each float reports on the mean RMSD-deviation at a specific position along the sequence as defined by the fragment_size

  • [1] : list of floats of len n, where each float reports on the standard deviation of the RMSD-deviation at a specific position along the sequence as defined by the fragment_size

  • [2] : List of np.ndarrays of len n, where each sub-array reports on the histogram values associated with the full RMSD distribution at a given position along the sequence

  • [3] : np.ndarray which corresponds to bin values for each of the histograms in return[2]

Return type:

tuple (len = 4)

Raises:

soursop.ssexceptions.SSException – If inappropriate inputs are passed this function raises an SSException

get_D_vector(stride=20, verbose=True)[source]

Function to calculate the vector of D values used to calculate the Phi parameter from Lyle et al[1]. The stride parameter defines the spacing between frames which are analyzed. This is just for practical purposes.

The Phi calulation computes a D value for each frame vs. frame comparison - for a 2000 frame simulation this would be 4 million D values if every value was calculated which is a bit much, so the stride let’s you define how many frames you should skip.

Importantly, the DVector calculated here measures dij (see part III A of the paper) as the CA-CA distance and NOT the average inter-atomic distance. This has two effects:

  1. Heterogeneity is now, formally, a measure over backbone heterogeneity and not full protein heterogeneity - this may be desirable (arguably it’s a more interpratable measure of conformational change) but if the interatomic version is required this could be implemented.

  2. It is much more efficient than the original version.

For a 2000 frame trajectory of a 80 residue protein with a stride=20 allows the calculation to take about 5 seconds. However, as protein size increases the computational cost of this process grows rapidly.

Parameters:
  • stride (int {20}) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame

  • verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!

Returns:

Returns a numpy array of D values (i.e. the D_vector)

Return type:

np.ndarray

References

[1] Lyle, N., Das, R. K., & Pappu, R. V. (2013). A quantitative measure for protein conformational heterogeneity. The Journal of Chemical Physics, 139(12), 121907.

get_RMSD(frame1, frame2=- 1, region=None, backbone=True, stride=1)[source]

Function which will calculate the aligned RMSD between two frames, or between one frame and all frames in the trajectory. This can be done over the entire protein but we can also specificy a local region to perform this analysis over.

Units are Angstroms.

Parameters:
  • frame1 (int) – Defines the frame to be used as a reference

  • frame2 (int) – Defines the frame to be used as a comparison, OR if left blank or set to -1 means the entire trajectory. Default is -1

  • region (list/tuple of length 2) – Defines the first and last residue (INCLUSIVE) for a region to be examined. By default is set to None which means the entire protein is used.

  • backbone (bool) – Boolean flag for using either the full chain or just backbone. Generally backbone alone is fine so default to True.

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Default is 1.

Returns:

Returns a numpy array of either 1 (if two frames are passed) or nframes length which corresponds to the RMSD difference either between two frames or between one frame and ALL other frames

Return type:

np.ndarray

get_Q(protein_average=True, region=None, beta_const=50.0, lambda_const=1.8, native_contact_threshold=4.5, stride=1, native_state_reference_frame=0, weights=False)[source]

Function which will calculate the fraction of native contacts in each frame of the trajectory.

The “native” state is defined as a specific frame (1st frame by default - note this means the native state frame = 0 as we index from 0!). In earlier versions the ‘native state frame’ was a variable, but this ends up being extremely messy when weights are considered, so assume the native state frame is always frame 0.

Native contacts are defined using the definition from Best et al [1]. The implementation is according to the code helpfully provided at http://mdtraj.org/latest/examples/native-contact.html

Parameters:
  • protein_average (bool {True}) – This is the default mode, and means that the return vector is the AVERAGE fraction of native contacts over the protein surface for each frame (e.g. each value refers to a single frame). If set to false the simulation-average value at native-contact resolution is returned, instead returning $NATIVE_CONTACT number of values and an additional list of native contact pairs. Default = True.

  • region (list/tuple of length 2) – Defines the first and last residue (INCLUSIVE) for a region to be examined. By default is set to None which means the entire protein is used. Default is None.

  • beta_const (float) – Constant used for computing Q in reciprocal nanometers. Default is 50 and probably should not be changed without good reason.

  • lambda_const (float) – Constant value is 1.8 for all-atom simulations. Probably should not be changed without good reason.

  • native_contact_threshold (float) – Threshold in Angstroms typically used for all-atom simulations and again probably should not be changed without good reason. Default = 4.5.

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Default = 1.

  • native_state_reference_frame (int) –

  • weights (list or array of floats) – Defines the frame-specific weights if re-weighted analysis is required. This can be useful if an ensemble has been re-weighted to better match experimental data, or in the case of analysing replica exchange data that is re-combined using T-WHAM. Default is False

Returns:

If protein_average = True a single vector is returned with the overall protein average fraction of native contacts associated with each frame for each residue. If protein_average is set to False a 5-position tuple is returned, where each of the four positions has the following identity.

  • 0 - The fraction of the time a contact is native for all each native contact (vector of length $N_NATIVE CONTACTS).

  • 1 - The native contact definition (same length as 1) where each element is a pair of atoms which are considered native.

  • 2 - The residue-by-residue dictionary-native contacts dictionary. Keys are residue name-number and each key-associated value is the fractional native contacts for atoms associated with that residue. To get the residue-specific fraction of native contacts take the mean of the element values.

  • 3 - The ordered list of keys from 2 for easy plotting in a residue-residue manner

  • 4 - A nres x nres array showing a 2D contact map defining inter-residue specific Q values

Return type:

vector or tuple

References

[1] Best, Hummer, and Eaton, Native contacts determine protein folding mechanisms in atomistic simulations PNAS (2013) 10.1073/pnas.1311599110

get_contact_map(distance_thresh=5.0, mode='closest-heavy', stride=1, weights=False)[source]

get_contact_map() returns 2-position tuple with the contact map (N x N matrix) and a contact order.

vector (N x 1) that describe the contacts (heavy atom - heavy atom interactions) made by each of the residues over the simulation.

Each element is normalized such that it is between 0 and 1. i+1 and i+2 elements are excluded, and the minimum distances is the distance between the closest heavy atoms on each residue (backbone or sidechain).

Parameters:
  • distance_thresh (float) – Distance threshold used to define a ‘contact’ in Angstroms. Contacts are taken as frames in which the atoms defined by the scheme are within $distance_thresh angstroms of one another. Default is 5.0.

  • mode (string) –

    Mode allows the user to define differnet modes for computing contacts. Possible options are detailed below and are identical to those offered by mdtraj in compute_contacts. Default is closes-heavy.

    • ca - same as setting ‘atom’ and A1=’CA’ and A2=’CA’, this uses the C-alpha atoms

    • closest - closest atom associated with each of the residues, i.e. the is the point of closest approach between the two residues

    • closest-heavy - same as closest, except only non-hydrogen atoms are considered

    • sidechain - closest atom where that atom is in the sidechain. Note this requires mdtraj version 1.8.0 or higher.

    • sidechain-heavy - closest atom where that atom is in the sidechain and is heavy. Note this requires mdtraj version 1.8.0 or higher.

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Note this operation may scale poorly as protein length increases at which point increasing the stride may become necessary. Defaultis 1.

  • floats] (weights [list or array of) – Defines the frame-specific weights if re-weighted analysis is required. This can be useful if an ensemble has been re-weighted to better match experimental data, or in the case of analysing replica exchange data that is re-combined using T-WHAM. Default is None.

Returns:

Returns a tuple where: 0 - contact map 1 - contact order

Return type:

tuple of size 2

get_clusters(region=None, n_clusters=10, backbone=True, stride=20)[source]

Function to determine the structural clusters associated with a trajectory. This can be useful for identifying the most populated clusters. This approach uses Ward’s hiearchical clustering, which means we must define the number of clusters we want a-priori.

Clustering is done using RMSD - BUT the approach taken here would be easy to re-implement in another function where you ‘simiarity’ metric was something else.

Returns a 4-place tuple with the following sub-elements:

[0] - cluster_members: A list of length n_clusters where each element corresponds to the number of frames in each of the 1-n_clusters cluster. i.e. if I had defined n_clusters=3 this would be a list of length 3

[1] - cluster_trajectories: A list of n_cluster mdtraj trajectory objects of the conformations in the cluster. This is particularly useful because it means you can perform any arbitrary analaysis on the cluster members.

[2] - cluster distance_matrices: A list of n_clusters where each member is a square matrix that defines the structural distance between each of the members of each cluster. In other words, this quantifies how similar (in terms of RMSD, in units Angstroms) the the members of a given cluster are to one another. Useful for computing cluster tightness.

[3] - cluster_centroids A list of n_clusters where each element is the index associated with each cluster trajectory that defines the cluster centroid (i.e. a ‘representative’ conformation). As an example - if we had 3 clusters this might look like [4,1,6], which means the 4th, 1st, and 6th frame from each of the respective mdtraj trajectories in the cluster_trajectories list would correspond to the centroid.

[4] - cluster frames: List of lists, where each sublist contains the frame indices associated with that cluster. Useful if clustering on a single chain and want to use that information over an entire trajectory.

Parameters:
  • region (list of length 2) – Allows the user to defines the first and last residue (INCLUSIVE) for a region to be assessed by the cluster analysis. Default is None (i.e. no region selected).

  • n_clusters (int) – Number of clusters to be returned through Ward’s clustering algorithm. Default = 10

  • backbone (bool) – Flag to determine if backbone atoms or full chain should be used. By default the backbone is used mainly because this makes things a lot computationally cheaper. Default is True

  • stride (int) – Defines the spacing betwen frames to compare with - i.e. take every $stride-th frame. Setting stride=1 would mean every frame is used, which would mean you’re doing an all vs. all comparions, which would be ideal BUT may be slow. Default = 20.

Returns:

See description in main function signature.

Return type:

tuple of length 5

get_all_SASA(probe_radius=1.4, mode='residue', stride=20)[source]

Returns the Solvent Accessible Surface Area (SASA) for each residue from every stride-th frame. SASA is determined using the Golden-Spiral algorithm.

SASA is returned in Angstroms squared, probe radius is also in Angstroms.

Note that this is a quiet a computationally expensive calculation, so by default the stride is set at 20

Parameters:
  • probe_radius (float) – Radius of the solvent probe used in Angstroms. Uses the the Golden-Spiral algorithm, where a radius of 1.4 A is pretty standard. Default = 1.4

  • mode (string) – Defines the mode used to compute the SASA. Must be one of [‘residue’, ‘atom’, ‘sidechain’,’backbone’, ‘all’]. For atom mode, extracted areas are resolved per-atom. For ‘residue’, this is computed instead on the per residue basis.

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Note this operation may scale poorly as protein length increases at which point increasing the stride may become necessary. NOTE default is NOT 1 because this is quite an expensive analysis. Default = 20.

Returns:

The return type depends on what options are passed.

  • If no mode is passed or mode =’residue’ is passed, the function defaults to using the ‘residue’ mode with a stride of 20. This returns a np.ndarray of shape (frames, nres) where frames is the number of individual frames over which the per-residue SASA was calculated, and nres is the number of residues INCLUDING caps if present.

  • If mode = ‘atom’ This returns a np.ndarray of shape (frames, natoms) where frames is the number of individual frames over which the per-residue SASA was The function defaults to using the ‘atom’ mode with the requested stride (default =20).

  • If mode =’sidechain’ is passed, this returns a np.ndarray of shape (frames, nres) where frames is the number of individual frames over which the per-residue SASA was calculated and nres is the number of residues where the sidechain SASA was calculated. Sidechain selection here is done using mdtraj ‘sidechain’ topology selection language with the exception that backbone hydrogen atoms are included with the sidechain (H, HA, HA2, HA3) so we excluded these.

  • If mode =’backbone’ is passed this returns a np.ndarray of shape (frames, nres) where frames is the number of individual frames over which the per-residue SASA was calculated and nres is the number of residues where the The function defaults to using the ‘sidechain’ mode with the requested stride (default =20). Backbone selection here is done using mdtraj ‘backbone’ topology selection language with the exception that backbone hydrogen atoms are excluded from the backbone (H, HA, HA2, HA3) so we include these.

  • If mode =’all’ is passed, the function returns a 3x tuple where the first element is the ‘residues’ return array, the second is the ‘sidechain’ return array and the third is the ‘backbone’ return array

Return type:

np.ndarray or tuple

get_regional_SASA(R1, R2, probe_radius=1.4, stride=20)[source]

Returns the Solvent Accessible Surface Area (SASA) for a local region in every stride-th frame. SASA is determined using Golden-Spiral algorithm.

SASA is returned in Angstroms squared, probe radius is in Angstroms.

A note on performance - the algorithm has to calculate SASA for EVERY atom for each frame before a subregion can be evaluated. However, the total SASA data is saved and memoized for future use in the session. As such, if you scan over a protein to compute local SASA in a sliding window the first window will take an INORDINATE amount of time but the subsequent ones are just look ups so instantaneous. This speed up is lost if either the stride or the probe-size is altered, however.

Parameters:
  • R1 (int) – Index value for the first residue in the region

  • R2 (int) – Index value for the last residue in the region

  • probe_radius (float) – Radius of the solvent probe used in Angstroms. Uses the the Golden-Spiral algorithm, where a radius of 1.4 A is pretty standard. Default = 1.4

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Note this operation may scale poorly as protein length increases at which point increasing the stride may become necessary. NOTE default is NOT 1 because this is quite an expensive analysis. Default = 20.

Returns:

Returns a single float which defines the sum-total SASA of the region defined between residue R1 and R2. Note that SASA values are returned in squared angstroms.

Return type:

float

get_site_accessibility(input_list, probe_radius=1.4, mode='residue_type', stride=20)[source]

Function to compute site/residue type accessibility.

This can be done using one of two modes. Under ‘residue_type’ mode, the input_list should be a list of canonical 3-letter amino acid names. Under ‘resid’ mode, the input list should be a list of residue id positions (recall that resid ALWAYS starts from 0 and will include the ACE cap if present).

Returns a dictionary with the key = residue “type-number” string and the value equal the average and standard devaition associated with the SASA. Useful for looking at comparative accessibility of the same residue accross the sequence.

Parameters:
  • input_list (list) – List of either residue names (e.g. [‘TRP’,’TYR’,’GLN’] or resid values ([1,2,3,4]) which will be taken and the SASA calculated for.

  • probe_radius (float) – Radius of the solvent probe used in Angstroms. Uses the Golden-Spiral algorithm. 1.4 A is pretty standard for water probe size. Default = 1.4

  • mode (string) – Mode used to examine sites. MUST be one of ‘residue_type’ or ‘resid’. Default = ‘residue_type’

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Note this operation may scale poorly as protein length increases at which point increasing the stride may become necessary. NOTE default is NOT 1 because this is quite an expensive analysis. Default = 20.

Returns:

Returns a dictionary where values are residue positions (with residue type and index position) and values are SASA of that specific resdue position. Note that SASA values are returned in squared angstroms.

Return type:

dict

get_local_collapse(window_size=10, bins=None, verbose=True)[source]

This function calculates a vectorial representation of the radius of gyration along a polypeptide chain, using a sliding window to calculate the local Rg. This makes it very easy to determine where you see local collapse vs. local expansion.

Local collapse is calculated using a stepsize of 1 and a window size of window_size. As such, the resulting output will be of length n, where n = (number of residues - window_size)+1.

Parameters:
  • window_size (int, default=10) – Size of the window along the chain in which conformations are examined.

  • bins (np.arange or list) – A range of values (np.arange or list) spanning histogram bins. Default is np.arange(0, 10, 0.1).

verbosebool

Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!

Returns:

  • [0] - list of floats (length = n)

    Each float reports on the mean Rg at a specific position along the sequence.

  • [1] - list of floats (length = n)

    Each float reports on the standard deviation of the Rg distribution at a specific position along the sequence.

  • [2] - list of np.ndarrays (length = n)

    Histogram counts associated with the local Rg at a given position along the sequence. Basically, this is the emprical distribution that the standard devaitions and mean report on

  • [3] - np.ndarray (length = n)

    Bin values for histogram counts returned in [2]

Return type:

tuple (len = 4)

get_sidechain_alignment_angle(R1, R2, sidechain_atom_1='default', sidechain_atom_2='default')[source]

Function that computes the angle alignment between two residue sidechains. Sidechain vectors are defined as the unit vector between the CA of the residue and a designated ‘sidechain’ atom on the sidechain. The default sidechain atoms are listed below, but custom atom names can also be provided using the sidechain_atom_1/2 variables.

Residue : atom name default pairs (defaults from OPLS-AA used):

'ALA' : 'CB'
'CYS' : 'SG'
'ASP' : 'CG'
'GLU' : 'CD'
'PHE' : 'CZ'
'GLY' : 'ERROR'
'HIS' : 'NE2'
'ILE' : 'CD1'
'LYS' : 'NZ'
'LEU' : 'CG'
'MET' : 'CE'
'ASN' : 'CG'
'PRO' : 'CG'
'GLN' : 'CD'
'ARG' : 'CZ'
'SER' : 'OG'
'THR' : 'CB'
'VAL' : 'CB'
'TRP' : 'CG'
'TYR' : 'CZ'
'ACE' : 'ERROR'
'NME' : 'ERROR'
Parameters:
  • R1 (int) – Residue index of first residue

  • R2 (int) – Residue index of second residue

  • sidechain_atom_1 (str {CA}) – Atom name of the sidechain atom in residue 1 we’re going to use. Default = ‘default’.

  • sidechain_atom_2 (str {CA}) – Atom name of the sidechain atom in residue 2 we’re going to use. Default = ‘default’.

Returns:

Returns a numpy array which defines the per-frame angle alignment between the sidechains

Return type:

np.ndarray

get_dihedral_mutual_information(angle_name='psi', bwidth=0.6283185307179586, stride=1, weights=False, normalize=False)[source]

Generate the full mutual information matrix for a specific dihedral type. The resulting matrix describes the mutual information between each dihedral angle as defined by the variable angle_name. A weights parameter can be passed if frames are to be re-weighted, but this requires that the (number of frames) / stride = the number of weights.

The mutual information for a pair of angles is determined by generating a histogram of each dihedral individually (p(phi1), p(phi2)) and the joint probability histogram (p(phi1,phi2)), and then computing the Shannon entropy associated with the single and joint probability histograms (H_phi1, H_phi2, H_phi1_phi2). The mutual information is then returned as:

H_phi1 + H_phi2 - (H_phi1_H_phi2)

The easiest way to interpret these results is to normalize the inferred matrix using an equivalent matrix generated using a limiting polymer model (e.g., an excluded volume simulaion).

An optional flag is included to normalize the mutual information by the joint Shannon entropy. This may also be useful when looking for differences in MI matrices between a simulation and a limiting polymer model.

Return: Mutual information matrix (n x n) where n is the number of that type of bonds in the protein.

Parameters:
  • angle_name (str) – String, must be one of the following options ‘chi1’,’phi, ‘psi’, ‘omega’. Default = ‘psi’.

  • bwidth (np.array) – The width of the bins that will stretch from -pi to pi. np.pi/5.0 is probablty the smallest binsize you want - even np.pi/2 should work well. You may want to experiment with this parameter… Default = np.pi/5.0

  • stride (int) – Defines the spacing between frames to compare - i.e. if comparing frame1 to a trajectory we’d compare frame 1 and every stride-th frame. Note this operation may scale poorly as protein length increases at which point increasing the stride may become necessary. Default = 1

  • weights (array_like) – An numpy.array object that corresponds to the number of frames within an input trajectory. Default = False.

  • normalize (boolean) – Boolean flag to determine whether the mutual information matrix should be normalized by the joint shannon entropy (Default = False).

Returns:

Returns a 2D (square) matrix where the upper-left triangle is populated with the mutual information between the angles and pairs of residues in each chain.

Return type:

np.ndarray (2d)

get_local_to_global_correlation(mode='COM', n_cycles=100, max_num_pairs=10, stride=20, weights=False, verbose=True)[source]

Method to analyze how well ensemble average distances taken from a finite number of inter-residue distances correlate with global dimensions as measured by the radius of gyration. This is a new analysis, that is best explained through a formal write up.

Parameters:
  • mode (str {'COM'}) –

    Must be one of either ‘CA’ or ‘COM’.

    • ’CA’ = alpha carbon.

    • ’COM’ = center of mass (associated withe the residue).

  • n_cycles (int {100}) – Number of times, for each number of pairs, we re-select a different set of paris to use. This depends (weakly) on the number of residues, but we do not recommend a value < 50. For larger proteins this number should be increased. Again, it’s worth examining how your results chain as a function of n_cycles to determine the optimal tradeoff between speed and precision.

  • max_num_pairs (int {10}) – The maximum number of pairs to be consider for correlation analysis. In general we’ve found above 10-20 the average correlation tends to plateau towards 1 (note it will NEVER be 1) so 1-20 is a reasonable set to use.

  • stride (int {20}) – Defines the spacing between frames for calculating the ensemble average. As stride gets larger this analysis gets slower. It is worth experimenting with to see how the results change as a function of stride, but in theory the accuracy should remain fixed but precision improved as stride is reduced.

  • weights (bool {False}) – Flag that indicates if frame weights should be used or not.

  • verbose (bool) – Flag that by default is True determines if the function prints status updates. This is relevant because this function can be computationally expensive, so having some report on status can be comforting!

Returns:

Returns a four-place tuple.

  • [0] = This is an 2 by (n_cycles*max_num_pairs) array, where the first column is the number of pairs and the second column is the Rg-lambda correlation for a specific set of pairs (the pairs in question are not included). This can be thought of as the ‘raw’ data, and may only be useful if distributions of the correlation are of interest (e.g. for generating 2D histograms).

  • [1] = Array with the number of pairs used (e.g. if max_num_pairs = 10 then this would be [1,2,3,4,5,6,7,8,9]).

  • [2] = Array with the mean correlation associated with the numbers of pairs in position 2 and the radius of gyration.

  • [3] = Array with the standard deviation of the correlation associated with the number of pairs in position 2 and the radius of gyration. Note the inclusion of the standard deviation makes the assumption that the distribution is Gaussian which may or may not be true.

Return type:

tuple

Raises:

SSException – Raised when the mode is not ‘CA’ or ‘COM’; or, when the lengths of the Rg-stride derived calculations did not match the lengths of the internal distances; or, when the installed numpy version doesn’t support the aweights keyword for the numpy.cov function.

get_overlap_concentration()[source]

Returns the overlap concentration for the chain in Moles.

The overlap concentration reflects the concentration at which a flexible polymer begins to ‘collide’ in trans with other polymers - i.e. the concentration at which the chains begin to overlap. We calculate this by first computing the mean radius if gyration, and then computing the concentration of chain within the radius of gyration.

Returns:

Molar concentration for the overlap concentration.

Return type:

float

get_angle_decay(atom1='C', atom2='N', return_all_pairs=False)[source]

Function that returns the correlation getween C->N bond vectors along the chain as a function of sequence separation. This decay can be used to estimate the persisence length.

Conceptually, the way to think about this is that if every amino acid has a C->N vector WITHIN itself, we can ask how that C->N vector decays as you move further apart in sequence separation. For residue i and i+1 these vectors are quite well-correlated because, they’re next to each other, but as i and i+x get further away these two vectors become decorrelated. This gives a measure of the intramolecular correlation, which effectively reports on how stiff the chain is.

Note that atom1 and atom2 must be found within the same residue and must be present for all residues, so the C->N bond is really the only reasonable option.

Parameters:
  • atom1 (str) – The first atom to use when calculating the angle decay. Default = ‘C’.

  • atom2 (str) – The second atom to use when calculating the angle decay. Default = ‘N’.

  • return_all_pairs (bool {False}) – Whether or not to return a dictionary with all inter-residue distances or not. If True, the second element in the return tuple is a dictionary of res1-res2 pairs that reports on the average decay for that specific pair. This can be useful for assessing if there are specific regions of the chain that are more or less stiff.

Returns:

The return type depends on if return_all_pairs is set to True or not.

If return_all_pairs is False (default), then the return type is a numpy array of shape (nres,3), where the first column is the inter-residue separation, the second column is the average correlation and the third column is the standard devaition of the distribution of correlations obtained from all inter-residue correlations associated with that specific sequence separation.

If return_all_pairs is set to True then, the 2-position tuple is returned where element 1 is the same matrix described above, and the second element is a dictionary of res1-res2 pairs that reports on the average decay for that specific pair. This can be useful for assessing if there are specific regions of the chain that are more or less stiff.

Return type:

list or tuple