Examples
This page walks through common IDP analysis workflows using SOURSOP, organized from basic trajectory loading through to advanced sampling-quality assessment. All examples assume you have a trajectory file (e.g. traj.xtc) and a topology file (e.g. start.pdb).
A larger collection of worked examples and Jupyter notebooks, including the analyses from the SOURSOP publication, can be found in the supporting data repository.
1. Reading a trajectory
All SOURSOP analyses begin by reading a trajectory with SSTrajectory. Upon loading, SOURSOP automatically identifies every protein chain and exposes each as an SSProtein object:
from soursop.sstrajectory import SSTrajectory
import numpy as np
# read in a trajectory and topology
traj = SSTrajectory('traj.xtc', 'start.pdb')
# retrieve the first (and often only) protein chain
protein = traj.proteinTrajectoryList[0]
# basic properties
print(f"Residues : {protein.n_residues}")
print(f"Frames : {protein.n_frames}")
print(f"Sequence : {protein.get_amino_acid_sequence(oneletter=True)}")
SOURSOP also accepts pre-loaded mdtraj trajectory objects via the TRJ keyword, which is useful for scripting pipelines that already have a trajectory in memory:
import mdtraj as md
mdtraj_traj = md.load('traj.xtc', top='start.pdb')
traj = SSTrajectory(TRJ=mdtraj_traj)
protein = traj.proteinTrajectoryList[0]
2. Global dimensions
The most common IDP observables describe the overall size and shape of the chain. All functions return per-frame NumPy arrays, so standard statistics apply directly.
Radius of gyration \(R_g\), hydrodynamic radius \(R_h\), and end-to-end distance \(r_{ee}\):
rg = protein.get_radius_of_gyration() # Angstroms, shape (n_frames,)
rh = protein.get_hydrodynamic_radius() # Angstroms, shape (n_frames,)
e2e = protein.get_end_to_end_distance() # Angstroms, shape (n_frames,)
print(f"Mean Rg = {np.mean(rg):.2f} ± {np.std(rg):.2f} Å")
print(f"Mean Rh = {np.mean(rh):.2f} ± {np.std(rh):.2f} Å")
print(f"Mean e2e = {np.mean(e2e):.2f} ± {np.std(e2e):.2f} Å")
Asphericity describes how far the chain deviates from a sphere (0 = perfectly spherical, 1 = rod-like):
asph = protein.get_asphericity()
print(f"Mean asphericity = {np.mean(asph):.3f}")
Correlation between \(r_{ee}\) and \(R_g\) — a useful diagnostic for whether the two global size metrics are capturing consistent information:
pearson_r, pval = protein.get_end_to_end_vs_rg_correlation()
print(f"Pearson r(e2e, Rg) = {pearson_r:.3f} (p = {pval:.2e})")
Overlap concentration \(c^*\) estimates the concentration above which chains begin to crowd one another:
c_star = protein.get_overlap_concentration()
print(f"c* = {c_star:.4f} mg/mL")
3. Polymer scaling and internal structure
IDR conformational behaviour is often interpreted through the lens of polymer physics. The internal scaling profile \(\langle r^2(i,j) \rangle\) reports the mean-square inter-residue distance as a function of sequence separation \(|i - j|\).
Internal scaling (mean across the ensemble):
import matplotlib.pyplot as plt
separation, mean_r2 = protein.get_internal_scaling(mode='CA', mean_vals=True)
plt.loglog(separation, mean_r2)
plt.xlabel("Sequence separation |i - j|")
plt.ylabel(r"$\langle r^2 \rangle$ (Ų)")
plt.title("Internal scaling profile")
plt.show()
Scaling exponent \(\nu\) — the Flory exponent extracted by fitting \(\langle r^2 \rangle \sim |i-j|^{2\nu}\):
nu, prefactor = protein.get_scaling_exponent(mode='CA', show_fig=False)
print(f"Flory exponent ν = {nu:.3f}")
# ν ≈ 0.5 → Gaussian / theta-solvent behaviour
# ν ≈ 0.6 → self-avoiding random coil (good solvent)
# ν < 0.5 → compact / collapsed chain
Polymer-scaled distance map normalises the mean inter-residue distance matrix by the expected excluded-volume scaling, highlighting regions that are more compact or more expanded than a reference random coil:
mean_map, std_map = protein.get_polymer_scaled_distance_map(mode='CA')
plt.imshow(mean_map, origin='lower', cmap='RdBu_r')
plt.colorbar(label='Normalized distance')
plt.xlabel('Residue index')
plt.ylabel('Residue index')
plt.title('Polymer-scaled distance map')
plt.show()
Local heterogeneity in the scaling behaviour measures how the local Flory exponent varies along the chain, revealing compact or expanded subregions:
local_het = protein.get_local_heterogeneity(window_size=10)
4. Secondary structure and backbone angles
DSSP secondary structure assigns a secondary structure label to each residue in every frame:
dssp = protein.get_secondary_structure_DSSP()
# returns an (n_frames, n_residues) array of single-character labels
# mean helicity per residue
helicity = np.mean(dssp == 'H', axis=0)
plt.bar(range(protein.n_residues), helicity)
plt.xlabel('Residue index')
plt.ylabel('Fractional helicity')
plt.title('Per-residue α-helix propensity')
plt.show()
BBSEG backbone-torsion classification provides an 8-state assignment based on φ/ψ backbone dihedral regions, which is particularly useful for IDRs where the DSSP labels can be sparse or noisy:
bbseg = protein.get_secondary_structure_BBSEG()
# returns an (n_frames, n_residues) array of integer labels (0–7)
# 0=unassigned, 1=α-helix, 2=PPII, 3=β-strand, 4=turn, ...
Backbone dihedral angles:
angles = protein.get_angles()
# returns a dict with keys 'phi' and 'psi', each (n_frames, n_residues)
phi = angles['phi']
psi = angles['psi']
# Ramachandran plot for residue 10
residue_idx = 10
plt.scatter(np.degrees(phi[:, residue_idx]),
np.degrees(psi[:, residue_idx]),
alpha=0.3, s=2)
plt.xlabel('φ (°)')
plt.ylabel('ψ (°)')
plt.title(f'Ramachandran plot — residue {residue_idx}')
plt.show()
5. Distance maps and contacts
Mean inter-residue distance map:
mean_dist, std_dist = protein.get_distance_map(mode='CA')
plt.imshow(mean_dist, origin='lower', cmap='viridis_r')
plt.colorbar(label='Mean CA–CA distance (Å)')
plt.xlabel('Residue j')
plt.ylabel('Residue i')
plt.title('Mean inter-residue distance map')
plt.show()
Contact map — fraction of frames in which each residue pair is within a cutoff distance (default 8 Å):
contact_map = protein.get_contact_map(distance_thresh=8.0, mode='CA')
plt.imshow(contact_map, origin='lower', cmap='hot_r', vmin=0, vmax=1)
plt.colorbar(label='Contact frequency')
plt.xlabel('Residue j')
plt.ylabel('Residue i')
plt.title('Contact map (CA, 8 Å cutoff)')
plt.show()
RMSD to a reference frame (here, frame 0):
rmsd = protein.get_RMSD(frame_index=0, region=[0, protein.n_residues])
print(f"Mean RMSD from frame 0: {np.mean(rmsd):.2f} Å")
6. Solvent accessibility
Per-residue SASA averaged across the trajectory:
mean_sasa, std_sasa = protein.get_all_SASA()
# mean_sasa: shape (n_residues,), units Ų
plt.bar(range(protein.n_residues), mean_sasa, yerr=std_sasa, capsize=2)
plt.xlabel('Residue index')
plt.ylabel('SASA (Ų)')
plt.title('Per-residue solvent accessibility')
plt.show()
Regional SASA for a specific stretch of residues — useful for assessing the accessibility of a functional linear motif:
# SASA for residues 10 to 20 (inclusive)
mean_region, std_region = protein.get_regional_SASA(R1=10, R2=20)
print(f"Mean SASA (residues 10–20) = {mean_region:.1f} ± {std_region:.1f} Ų")
Site accessibility returns the fraction of frames in which a residue’s SASA exceeds a threshold, giving a per-residue accessibility score:
accessibility = protein.get_site_accessibility()
7. Multi-chain systems
For systems with more than one protein chain, system-level analyses are performed on the SSTrajectory object rather than on individual SSProtein objects:
traj = SSTrajectory('traj.xtc', 'start.pdb')
# overall Rg combining all chains
rg_total = traj.get_overall_radius_of_gyration()
print(f"System Rg: {np.mean(rg_total):.2f} Å")
# inter-chain distance map between chain 0 and chain 1
mean_dist, std_dist = traj.get_interchain_distance_map(
proteinID1=0, proteinID2=1, mode='CA'
)
# inter-chain contact map
contact_map = traj.get_interchain_contact_map(
proteinID1=0, proteinID2=1, distance_thresh=8.0
)
8. NMR comparison
Random coil chemical shifts (via ssnmr) predict the expected NMR chemical shifts for a sequence in the disordered limit, corrected for temperature, pH, and nearest-neighbour effects. These are a useful reference baseline when interpreting experimental IDP spectra:
from soursop.ssnmr import compute_random_coil_chemical_shifts
sequence = protein.get_amino_acid_sequence(oneletter=True)
shifts = compute_random_coil_chemical_shifts(
sequence, temperature=25, pH=7.4
)
# extract CA shifts
ca_shifts = [res['CA'] for res in shifts]
print("Predicted CA chemical shifts:", ca_shifts)
Paramagnetic relaxation enhancement (PRE) profiles compare the ensemble to an experiment in which a nitroxide spin label at a chosen position relaxes neighbouring amide protons. The intensity ratio I_para/I_dia decays toward 0 for residues near the label and stays near 1 for distant residues:
from soursop.sspre import SSPRE
# 600 MHz magnet; tau_c = 5 ns, t_delay = 16 ms, R_2D = 10 Hz
pre = SSPRE(protein, tau_c=5, t_delay=16, R_2D=10, W_H=600000000)
# spin label at residue 20 (CB atom)
intensity_ratio, gamma2 = pre.generate_PRE_profile(label_position=20)
plt.plot(intensity_ratio)
plt.axhline(0.5, linestyle='--', color='gray', label='0.5 threshold')
plt.xlabel('Residue index')
plt.ylabel(r'$I_\mathrm{para} / I_\mathrm{dia}$')
plt.title('Simulated PRE profile — label at residue 20')
plt.legend()
plt.show()
9. Assessing sampling quality with PENGUIN
For disordered proteins it is important to verify that independent replicate simulations have adequately explored conformational space and have not become trapped in local energetic minima. PENGUIN (Lotthammer & Holehouse, J. Chem. Inf. Model. 2025) addresses this by comparing per-residue backbone dihedral distributions to the excluded-volume (EV) polymer limit and to one another across replicates.
See the sssampling page for a full description of the methodology.
Quickstart — using the precomputed EV reference:
from soursop.sssampling import SamplingQuality
# list of replicate trajectory files (all share the same topology)
replicates = ['rep0.xtc', 'rep1.xtc', 'rep2.xtc', 'rep3.xtc']
sq = SamplingQuality(
traj_list=replicates,
top_file='topology.pdb',
)
# per-residue Hellinger distances: FH vs. precomputed EV reference
fh_vs_ev = sq.compute_dihedral_hellingers()
# shape: (n_replicates, n_residues)
# near 0 → EV-like sampling; near 1 → restricted dihedrals
# all-to-all inter-replica Hellinger distances
fh_vs_fh = sq.get_all_to_all_2d_trj_comparison()
# near 0 → replicates agree; large → replicates are in distinct trapped states
# four-panel PENGUIN summary figure (saves or displays)
sq.quality_plot()
Interpreting the output:
FH vs. EV distance |
FH vs. FH distance |
Interpretation |
|---|---|---|
Near 0 |
Near 0 |
Well-sampled, disordered |
Elevated, consistent |
Near 0 |
Locally folded / transient structure |
Elevated, variable |
Large |
Energetically trapped — poor sampling |
Using your own EV trajectories — if you have run dedicated excluded-volume simulations (e.g. with CAMPARI), supply them explicitly:
sq = SamplingQuality(
traj_list=['fh_rep0.xtc', 'fh_rep1.xtc'],
reference_list=['ev_rep0.xtc', 'ev_rep1.xtc'],
top_file='fh_topology.pdb',
ref_top='ev_topology.pdb',
)
sq.quality_plot()