encodermap.trajinfo package#
Submodules#
encodermap.trajinfo.info_all module#
Classes to work with ensembles of trajectories.
The statistics of a protein can be better described by an ensemble of proteins, rather than a single long trajectory. Treating a protein in such a way opens great possibilities and changes the way one can treat molecular dynamics data. Trajectory ensembles allow:
Faster convergence via adaptive sampling.
Better grasp of equilibrium and off-equilibrium dynamics.
This subpackage contains two classes which are containers of trajectory data. The SingleTraj trajectory contains information about a single trajectory. The TrajEnsemble class contains information about multiple trajectories. This adds a new dimension to MD data. The time and atom dimension are already established. Two frames can be appended along the time axis to get a trajectory with multiple frames. If they are appended along the atom axis, the new frame contains the atoms of these two. The trajectory works in a similar fashion. Adding two trajectories along the trajectory axis returns a trajectory ensemble, represented as a TrajEnsemble class in this package.
- class TrajEnsemble(trajs, tops=None, backend='no_load', common_str=None, basename_fn=None, traj_nums=None, custom_top=None)[source]#
Bases:
object
A fancy list of single trajectories. Topologies can be different across trajs.
Check out http://statisticalbiophysicsblog.org/?p=92 for why trajectory ensembles are awesome.
This class is a fancy list of
encodermap.trajinfo.info_single.SingleTraj`
. Trajectories can have different topologies and will be grouped by thecommon_str
argument. Each trajectory has its own uniquetraj_num
, which identifies it in the ensemble - even when the ensemble is sliced or subsampled.Examples
>>> import encodermap as em >>> traj1 = em.SingleTraj.from_pdb_id("1YUG") >>> traj2 = em.SingleTraj.from_pdb_id("1YUF")
Addition of two
encodermap.trajinfo.info_single.SingleTraj
also creates an ensemble.>>> trajs = traj1 + traj2 >>> trajs <encodermap.TrajEnsemble object. Current backend is no_load. Containing 2 trajectories. Common str is ['1YUG', '1YUF']. Not containing any CVs...>
Indexing a
TrajEnsemble
returns aencodermap.trajinfo.info_single.SingleTraj
based on its 0-based index. Think of theTrajEnsmeble
as a list ofencodermap.trajinfo.info_single.SingleTraj
. But trajectories can also havetraj_nums
, which do not have to adhere to[0, 1, 2, ...]
. This is similar to how apandas.DataFrame
offers indexing via.loc[]
and.iloc[]
(https://pandas.pydata.org/docs/user_guide/indexing.html#different-choices-for-indexing). For indexing trajs based on theirtraj_num
, you can use the.tsel[]
accessor of theTrajEnsmeble
Examples
>>> import encodermap as em >>> traj1 = em.SingleTraj.from_pdb_id("1YUG") >>> traj2 = em.SingleTraj.from_pdb_id("1YUF")
Addition of two SingleTraj also creates an ensemble.
>>> trajs = traj1 + traj2 >>> trajs.traj_nums [0, 1]
Change the
traj_num
oftraj2
>>> trajs[1].traj_num = 4 >>> trajs.traj_nums [0, 4] >>> trajs[1] <encodermap.SingleTraj object. Currently not in memory. Basename is '1YUF'. Not containing any CVs. Common string is '1YUF'. Object at ...> >>> trajs.tsel[4] <encodermap.SingleTraj object. Currently not in memory. Basename is '1YUF'. Not containing any CVs. Common string is '1YUF'. Object at ...>
TrajEnsemble
supports fancy indexing. You can slice to your liking (trajs[::5]
returns aTrajEnsemble
object that only consideres every fifth frame). Besides indexing by slices and integers, you can pass a 2-dimensionalnumpy.ndarray
.np.array([[0, 5], [1, 10], [5, 20]])
will return aTrajEnsemble
object with frame 5 of trajectory 0, frame 10 of trajectory 1 and frame 20 of trajectory 5.Examples
>>> import encodermap as em >>> traj1 = em.SingleTraj.from_pdb_id("1YUG") >>> traj2 = em.SingleTraj.from_pdb_id("1YUF") >>> trajs = traj1 + traj2 >>> sel = trajs[[[0, 0], [0, 1], [0, 2], [1, 10]]] >>> sel <encodermap.TrajEnsemble object. Current backend is no_load. Containing 4 frames and 2 trajectories. Common str is...>
The
TrajEnsemble
class also is an iterator to iterate over trajectores. Besides plain iteration, theTrajEnsmeble
also offers alternate iterators. Theitertrajs()
iterator returns a two-tuple oftraj_num
andtraj
. Theiterframes()
iterator returns a three-tuple oftraj_num
,frame_num
, andtraj
.Examples
>>> import encodermap as em >>> traj1 = em.SingleTraj.from_pdb_id("1YUG") >>> traj2 = em.SingleTraj.from_pdb_id("1YUF") >>> trajs = traj1 + traj2 >>> trajs[1].traj_num = 4 >>> for traj_num, traj in trajs.itertrajs(): ... print(traj_num, traj.n_frames) 0 15 4 16 >>> for traj_num, frame_num ,traj in trajs.subsample(10).iterframes(): ... print(traj_num, frame_num, traj.n_frames) 0 0 1 0 10 1 4 0 1 4 10 1
The
TrajEnsemble
has multiple alternative constructors. Thewith_overwrite_trajnums
constructor fixes inhomogeneous sequences ofencodermap.trajinfo.info_single.SingleTraj
andTrajEnsemble
.Examples
>>> import encodermap as em >>> traj1 = em.SingleTraj.from_pdb_id("1YUG", traj_num=0) >>> traj2 = em.SingleTraj.from_pdb_id("1YUF", traj_num=0) >>> trajs = em.TrajEnsemble([traj1, traj2]) Traceback (most recent call last): ... Exception: The `traj_num` attributes of the provided 2 `SingleTraj`s is not unique, the `traj_num` 0 occurs 2 times. This can happen, if you use `SingleTraj`s, that are already part of a `TrajEnsemble`. To create copies of the `SingleTraj`s and overwrite their `traj_num`s, use the `with_overwrite_trajnums()` constructor. >>> trajs = em.TrajEnsemble.with_overwrite_trajnums(traj1, traj2) >>> trajs <encodermap.TrajEnsemble...>
The
from_dataset
constructor can be used to load an ensemble from an.h5
fileExamples
>>> import encodermap as em >>> from tempfile import TemporaryDirectory >>> traj1 = em.SingleTraj.from_pdb_id("1YUG") >>> traj2 = em.SingleTraj.from_pdb_id("1YUF") >>> trajs = em.TrajEnsemble([traj1, traj2]) >>> with TemporaryDirectory() as td: ... trajs.save(td + "/trajs.h5") ... new = em.TrajEnsemble.from_dataset(td + "/trajs.h5") ... print(new) encodermap.TrajEnsemble object. Current backend is no_load. Containing 2 trajectories. Common str is...Not containing any CVs.
- Parameters:
trajs (Union[Sequence[str], Sequence[Path], Sequence[md.Trajectory], Sequence[SingleTraj]])
tops (Union[None, Sequence[str], Sequence[Path]])
backend (Literal['mdtraj', 'no_load'])
common_str (Optional[Sequence[str]])
traj_nums (Optional[Sequence[int]])
custom_top (Optional[CustomAAsDict])
- CVs#
The collective variables of the
SingleTraj
classes. Only CVs with matching names in allSingleTraj
classes are returned. The data is stacked along a hypothetical time axis along the trajs.
- _CVs#
The same data as in CVs but with labels. Additionally, the xarray is not stacked along the time axis. It contains an extra dimension for trajectories.
- Type:
- basenames#
A list with the names of the trajecotries. The leading path and the file extension is omitted.
- name_arr#
An array with
len(name_arr) == n_frames
. This array keeps track of each frame in this object by identifying each frame with a filename. This can be useful, when frames are mixed inside aTrajEnsemble
class.- Type:
np.ndarray
- property CVs: dict[str, ndarray]#
Returns dict of CVs in SingleTraj classes. Only CVs with the same names in all SingleTraj classes are loaded.
- Type:
- property CVs_in_file: bool#
Is true, if CVs can be loaded from file. Can be used to build a data generator from.
- Type:
- property _CVs: Dataset#
Returns x-array Dataset of matching CVs. stacked along the trajectory-axis.
- Type:
- batch_iterator(batch_size: int, replace: bool = False, CV_names: tuple[str] = ('',), deterministic: bool = True, yield_index: bool = True, start: int = 1) Iterator[tuple[ndarray, ndarray]] [source]#
- batch_iterator(batch_size: int, replace: bool = False, CV_names: tuple[str] = ('',), deterministic: bool = True, yield_index: bool = False, start: int = 1) Iterator[ndarray]
- batch_iterator(batch_size: int, replace: bool = False, CV_names: Sequence[str] | None = None, deterministic: bool = True, yield_index: bool = True, start: int = 1) Iterator[tuple[ndarray, tuple[ndarray, ndarray, ndarray, ndarray, ndarray]]]
- batch_iterator(batch_size: int, replace: bool = False, CV_names: Sequence[str] | None = None, deterministic: bool = True, yield_index: bool = False, start: int = 1) Iterator[tuple[ndarray, ndarray, ndarray, ndarray, ndarray]]
Lazy batched iterator of CV data.
This iterator extracts batches of CV data from the ensemble. If the ensemble is a large HDF5 datset, this provides the ability to use all data without loading it all into memory.
Examples
Import EncoderMap and load some example trajectories.
>>> import encodermap as em >>> trajs = em.TrajEnsemble( ... [ ... 'https://files.rcsb.org/view/1YUG.pdb', ... 'https://files.rcsb.org/view/1YUF.pdb' ... ] ... )
This iterator will yield new samples forever. The batch is a tuple of
numpy.ndarray
.>>> for batch in trajs.batch_iterator(batch_size=2): ... print([b.shape for b in batch]) ... break [(2, 148), (2, 147), (2, 150, 3), (2, 149), (2, 82)]
Use it with Python’s builtin
next()
function. Thedeterministic
flag returns deterministic batches. Theyield_index
flag also provides the index of the extracted batch. In this example, both batches are extracted from the 1YUG trajectory (traj_num==0
).>>> iterator = trajs.batch_iterator(deterministic=True, batch_size=2, yield_index=True) >>> index, batch = next(iterator) >>> index [[0 5] [0 8]] >>> index, batch = next(iterator) >>> index [[ 0 3] [ 0 10]]
If a single string is requested for
CV_names
, the batch, will be a sinlgenumpy.ndarray
, rather than a tuple thereof.>>> iterator = trajs.batch_iterator(batch_size=2, CV_names=["central_dihedrals"])
>>> batch = next(iterator) >>> batch.shape (2, 147)
- Parameters:
batch_size (int) – The size of the batch.
replace (bool) – Whether inside a single batch a sample can occur more than once. Set to False (default) to only allow unique samples in a batch.
CV_names (Sequence[str]) – The names of the CVs to be used in the iterator. If a list/tuple with a single string is provided, the batch will be a
numpy.ndarray
, rather than a tuple thereof.deterministic (bbol) – Whether the samples should be deterministic.
yield_index (bool) – Whether to also yield the index of the extracted samples.
start (int) – A start ineteger, which can be used together with
deterministic=True
to get different deterministic datasets.
- Returns:
Different iterators based on chosen arguments.
- Return type:
Iterator[Any]
- cluster(cluster_id, col='cluster_membership', memberships=None, n_points=-1, overwrite=True)[source]#
Clusters this
TrajEnsemble
based on the providedcluster_id
andcol
.With ‘clustering’ we mean to extract a subset given a certain membership. Take two trajectories with 3 frames each as an ensemble. Let’s say we calculate the end-to-end distance of the trajectories and use it as a collective variable of the system. The values are
[0.8, 1.3, 1.2, 1.9, 0.2, 1.3]
. Based on these values, we define a boolean CV (using 0 as False and 1 as True) which says whether the end-to-end distance is smaller or grather than 1.0. We give this CV the name'end_to_end_binary'
and the values are[0, 1, 1, 1, 0, 1]
. We can use this CV to ‘cluster’ theTrajEnsemble
via:cluster = trajs.cluster(cluster_id=0, col='end_to_end_binary')
:This gives a
TrajEnsemble
with 2 frames.
cluster = trajs.cluster(cluster_id=0, col='end_to_end_binary')
:This gives a
TrajEnsemble
with 4 frames.
Sometimes, you want to save this a cluster in a format that can be rendered by graphical programs (
.xtc, .pdb
), you can use either thejoin
orstack
method of the resulting :obj:``TrajEnsemble` to get a mdtraj.Trajectory, which is either stacked along the atom axis or joined along the time axis.Note
If the resulting
TrajEnsemble
has inhomogeneous topologies, thejoin
method will return a dict[md.Topology, md.Trajectory] instead. This dict can be used to save multiple (.xtc, .pdb
) files and visualize your cluster in external programs.The
col
parameter takes any CV name, that is per-frame and integer.- Parameters:
cluster_id (int) – The cluster id to use. Needs to be an integer, that is present in the
col
parameter.col (str) – Which ‘column’ of the collective variables to use. Needs to be a key, that can be found in
trajs.CVs.keys()
.memberships (Optional[np.ndarray]) – If a
numpy.ndarray
is provided here, the memberships from this array will be used. In this case, thecol
argument will be unused.n_points (int) – How many points the resulting cluster should contain. Subsamples the points in
col == cluster_id
evenly and without repeat. If set to -1, all points will be used.overwrite (bool) – When the
memberships
argument is used, but theTrajEnsemble
already has a CV under the name specified bycol
, you can set this to True to overwrite this column. Can be helpful, when you iteratively conduct multiple clusterings.
- Return type:
Examples
Import EncoderMap and NumPy.
>>> import encodermap as em >>> import numpy as np
Load an example project.
>>> trajs = em.load_project("pASP_pGLU", load_autoencoder=False)
Create an array full of
-1
’s. These are the ‘outliers’.>>> cluster_membership = np.ones(shape=(trajs.n_frames, )) * -1
Select the first 5 frames of every traj to be in cluster 0.
>>> cluster_membership[trajs.id[:, 1] < 5] = 0
Select all frames between 50 and 55 to be cluster 1.
>>> cluster_membership[(50 <= trajs.id[:, 1]) & (trajs.id[:, 1] <= 55)] = 1 >>> np.unique(cluster_membership) array([-1., 0., 1.])
Load this array as a CV called
'clu_mem'
.>>> trajs.load_CVs(cluster_membership, attr_name='clu_mem')
Extract all of cluster 0 with
n_points=-1
.>>> clu0 = trajs.cluster(0, "clu_mem") >>> clu0.n_frames 35
Extract an evenly spaced subset of cluster 1 with 10 total points.
>>> clu1 = trajs.cluster(1, "clu_mem", n_points=10) >>> clu1.n_frames 10
Cclusters with inhomogeneous topologies can be stacked along the atom axis.
>>> [t.n_atoms for t in trajs] [69, 83, 103, 91, 80, 63, 73] >>> stacked = clu1.stack() >>> stacked.n_atoms 795
But joining the trajectories returns a
dict[top, traj]
if the topologies are inhomogeneous.>>> joined = clu1.join() >>> type(joined) <class 'dict'>
- dash_summary()[source]#
A
pandas.DataFrame
that summarizes this ensemble.- Returns:
The DataFrame.
- Return type:
pd.DataFrame
- property featurizer#
- classmethod from_textfile(fname, basename_fn=None)[source]#
Creates a
TrajEnsemble
object from a textfile.- The textfile needs to be space-separated with two or three columns:
- Column 1:
The trajectory file.
- Column 2:
The corresponding topology file (If you are using
.h5
trajs, column 1 and 2 will be identical, but column 2 needs to be there nonetheless).
- Column 3:
The common string of the trajectory. This column can be left out, which will result in an
TrajEnsemble
without common strings.
- Parameters:
fname (Union[str, Path]) – File to be read.
basename_fn (Union[None, Callable[[str], str]], optional) – A function to apply to the
traj_file
string to return the basename of the trajectory. If None is provided, the filename without extension will be used. When all files are named the same and the folder they’re in defines the name of the trajectory, you can supplylambda x: split('/')[-2]
as this argument. Defaults to None.
- Returns:
A
TrajEnsemble
instance.- Return type:
- get_single_frame(key)[source]#
Returns a single frame from all loaded trajectories.
Consider a
TrajEnsemble
class with two trajectories. One has 10 frames, the other 5 (trajs.n_frames
is 15). Callingtrajs.get_single_frame(12)
is equal to callingtrajs[1][1]
. Callingtrajs.get_single_frame(16)
will error, andtrajs.get_single_frame(1)
is the same astrajs[0][1]
.- Parameters:
key (int) – The frame to return.
- Returns:
The frame.
- Return type:
- property index_arr: ndarray#
Returns np.ndarray with ndim = 2. Clearly assigning every loaded frame an identifier of traj_num (self.index_arr[:,0]) and frame_num (self.index_arr[:,1]). Can be used to create an unspecified subset of frames and can be useful when used with clustering.
- Type:
np.ndarray
- iterframes()[source]#
Generator over the frames in this instance.
- Yields:
tuple –
- A tuple containing the following:
int: The traj_num
int: The frame_num
encodermap.SingleTraj: An SingleTraj object.
- Return type:
Iterator[tuple[int, int, SingleTraj]]
Examples
Import EncoderMap and load an example
TrajEnsemble
.>>> import encodermap as em >>> trajs = em.TrajEnsemble( ... [ ... 'https://files.rcsb.org/view/1YUG.pdb', ... 'https://files.rcsb.org/view/1YUF.pdb', ... ], ... ) >>> print(trajs.n_frames) 31
Subsample every tenth frame.
>>> trajs = trajs.subsample(10) >>> trajs.n_frames 4
Call the
iterframes
method.>>> for traj_num, frame_num, frame in trajs.iterframes(): ... print(traj_num, frame_num, frame.n_frames) 0 0 1 0 10 1 1 0 1 1 10 1
- itertrajs()[source]#
Generator over the SingleTraj classes.
- Yields:
tuple –
- A tuple containing the following:
int: A loop-counter integer. Is identical with traj.traj_num.
encodermap.SingleTraj: An SingleTraj object.
- Return type:
Examples
>>> import encodermap as em >>> trajs = em.TrajEnsemble( ... [ ... 'https://files.rcsb.org/view/1YUG.pdb', ... 'https://files.rcsb.org/view/1YUF.pdb' ... ] ... ) >>> for i, traj in trajs.itertrajs(): ... print(traj.basename) 1YUG 1YUF
- join(align_string='name CA', superpose=True, ref_align_string='name CA', base_traj=None, progbar=None, dict_keys='top')[source]#
- load_CVs(data=None, attr_name=None, cols=None, deg=None, periodic=True, labels=None, directory=None, ensemble=False, override=False, custom_aas=None, alignment=None)[source]#
Loads CVs in various ways. The easiest way is to provide a single
numpy.ndarray
and a name for that array.Besides np.ndarray, files (
.txt and .npy
) can be loaded. Features or Featurizers can be provided. Axarray.Dataset
can be provided. A str can be provided which either is the name of one of EncoderMap’s features (encodermap.features) or the string can be ‘all’, which loads all features required for EncoderMap’sencodermap.autoencoder.autoencoder`AngleDihedralCartesianEncoderMap
.- Parameters:
data (Optional[TrajEnsembleFeatureType]) – The CV to load. When a
numpy.ndarray
is provided, it needs to have a shape matchingn_frames
and the data will be distributed to the trajs, When a list of files is provided,len(data)
(the files) needs to matchn_trajs
. The first file will be loaded by the first traj (based on the traj’straj_num
) and so on. If a list ofnumpy.ndarray
is provided, the first array will be assigned to the first traj (based on the traj’straj_num
). If None is provided, the argumentdirectory
will be used to construct a str using this expressionfname = directory + traj.basename + '_' + attr_name
. If there are.txt
or.npy
files matching that string in thedirectory
, the CVs will be loaded from these files to the corresponding trajs. Defaults to None.attr_name (Optional[str]) – The name under which the CV should be found in the class. Choose whatever you like.
'highd'
,'lowd'
,'dists'
, etc. The CV can then be accessed via dot-notation:trajs.attr_name
. Defaults to None, in which case, the argumentdata
should point to existing files. Theattr_name
will be extracted from these files.A list of integers indexing the columns of the data to be loaded. This is useful if a file contains columns which are not features (i.e. an indexer or the error of the features. eg:
id f1 f2 f1_err f2_err 0 1.0 2.0 0.1 0.1 1 2.5 1.2 0.11 0.52
In that case, you would want to supply
cols=[1, 2]
to thecols
argument. If None is provided all columns are loaded. Defaults to None.deg (Optional[bool]) – Whether to return angular CVs using degrees. If None or False, CVs will be in radian. Defaults to None.
periodic (bool) – Whether to use the minimum image convention to calculate distances/angles/dihedrals. This is generally recommended, when you don’t clean up your trajectories and the proteins break over the periodic boundary conditions. However, when the protein is large, the distance between one site and another might be shorter through the periodic boundary. This can lead to wrong results in your distance calculations.
labels (list[str]) – A list containing the labels for the dimensions of the data. If you provide a
numpy.ndarray
with shape(n_trajs, n_frames, n_feat)
, this list needs to be oflen(n_feat)
. An exception will be raised otherwise. If None is privided, the labels will be automatically generated. Defaults to None.directory (Optional[str]) – If this argument is provided, the directory will be searched for
.txt
or.npy
files which have the same names as the trajectories have basenames. The CVs will then be loaded from these files.ensemble (bool) – Whether the trajs in this class belong to an ensemble. This implies that they contain either the same topology or are very similar (think wt, and mutant). Setting this option True will try to match the CVs of the trajs onto the same dataset. If a VAL residue has been replaced by LYS in the mutant, the number of sidechain dihedrals will increase. The CVs of the trajs with VAL will thus contain some NaN values. Defaults to False.
override (bool) – Whether to override CVs with the same name as
attr_name
.custom_aas (Optional[CustomAAsDict]) – You can provide non-standard residue definitions in this argument. See
encodermap.trajinfo.trajinfo_utils.CustomTopology
for information how to use the custom_aas argument. If set to None (default), only standard residue names are assumed.alignment (Optional[str]) – If your proteins have similar but different sequences, you can provide a CLUSTAL W alignment as this argument and the featurization will align the features accordingly.
- Raises:
TypeError – When wrong Type has been provided for data.
- Return type:
None
- load_custom_topology(custom_top=None)[source]#
Loads a custom_topology from a CustomTopology class or a dict.
See also
CustomTopology
- Parameters:
custom_top (CustomTopology | dict[str | tuple[str, str], None | tuple[str, None] | tuple[str, dict[Literal['bonds', 'optional_bonds', 'delete_bonds', 'optional_delete_bonds', 'PHI', 'PSI', 'OMEGA', 'not_PHI', 'not_PSI', 'not_OMEGA', 'CHI1', 'CHI2', 'CHI3', 'CHI4', 'CHI5'], list[str] | list[tuple[str | int, str | int]]]]] | None) – Optional[Union[CustomTopology, CustomAAsDict]]: An instance of the CustomTopology class or a dictionary that can be made into such.
- Return type:
None
- property locations: list[str]#
Duplication of self.traj_files but using the trajs own traj_file attribute. Ensures that traj files are always returned independent of the current load state.
- Type:
- property name_arr: ndarray#
Trajectory names with the same length as self.n_frames.
- Type:
np.ndarray
- parse_clustal_w_alignment(aln)[source]#
Parse an alignment in ClustalW format and add the info to the trajectories.
- Parameters:
aln (str) – The alignment in ClustalW format.
- Return type:
None
- save(fname, CVs='all', overwrite=False, only_top=False)[source]#
Saves this TrajEnsemble into a single
.h5
file.- Parameters:
fname (Union[str, Path]) – Where to save the file.
CVs (Union[Literal["all"], list[str], Literal[False]]) – Which CVs to alos store in the file. If set to
'all'
, all CVs will be saved. Otherwise, a list[str] can be provided to only save specific CVs. Can also be set to False, no CVs are stored in the file.overwrite (bool) – If the file exists, it is overwritten.
only_top (bool) – Only writes the trajectorie’s topologies into the file.
- Raises:
IOError – If file already exists and overwrite is not True.
- Return type:
None
- sidechain_info()[source]#
Indices used for the AngleDihedralCartesianEncoderMap class to allow training with multiple different sidechains.
- Returns:
The indices. The key ‘-1’ is used for the hypothetical convex hull of all feature spaces (the output of the tensorflow model). The other keys match the common_str of the trajs.
- Return type:
- Raises:
Exception – When the common_strings and topologies are not aligned. An exception is raised. Aligned means that all trajs with the same common_str should possess the same topology.
- split_into_frames(inplace=False)[source]#
Splits self into separate frames.
- Parameters:
inplace (bool) – Whether to do the split inplace or not. Defaults to False and thus, returns a new TrajEnsemble class.
- Return type:
None
- stack(align_string='name CA', superpose=True, ref_align_string='name CA', base_traj=None, progbar=None)[source]#
- subsample(stride=None, total=None)[source]#
Returns a subset of this
TrajEnsemble
given the provided stride or total.This is a faster alternative than using the
trajs[trajs.index_arr[::1000]]
when HDF5 trajs are used, because the slicing information is saved in the respectiveencodermap.trajinfo.info_single.SingleTraj
and loading of single frames is faster in HDF5 formatted trajs.
- Parameters:
- Returns:
A trajectory ensemble.
- Return type:
Note
The result from
subsample(1000)` `is different from ``trajs[trajs.index_arr[::1000]]
. With subsample every trajectory is sub-sampled independently. Consider aTrajEnsemble
with twoencodermap.trajinfo.info_single.SingleTraj
trajectories with 18 frames each.subsampled = trajs.subsample(5)
would return aTrajEnsemble
with two trajs with 3 frames each (subsampled.n_frames == 6
). Whereas,subsampled = trajs[trajs.index_arr[::5]]
would return aTrajEnsemble
with 7 SingleTrajs with 1 frame each (subsampled.n_frames == 7
). Because the time and frame numbers are saved all the time, this should not be too much of a problem.
- tf_dataset(batch_size, replace=False, sidechains=False, reconstruct_sidechains=False, CV_names=None, deterministic=False, prefetch=True, start=1)[source]#
- to_alignment_query()[source]#
A string, that cen be put into sequence alignment software.
- Return type:
- property top: list[Topology]#
Returns a minimal set of mdtraj.Topologies.
If all trajectories share the same topology a list with len 1 will be returned.
- Type:
- property top_files: list[str]#
Returns minimal set of topology files.
If yoy want a list of top files with the same length as self.trajs use self._top_files and self._traj_files.
- Type:
- property traj_files: list[str]#
A list of the traj_files of the individual SingleTraj classes.
- Type:
- property traj_joined: Trajectory#
Returns a mdtraj Trajectory with every frame of this class appended along the time axis.
Can also work if different topologies (with the same number of atoms) are loaded. In that case, the first frame in self will be used as topology parent and the remaining frames’ xyz coordinates are used to position the parents’ atoms accordingly.
Examples
>>> import encodermap as em >>> trajs = em.load_project("pASP_pGLU") >>> subsample = trajs[0][:20] + trajs[1][:20] >>> subsample.split_into_frames().traj_joined <mdtraj.Trajectory with 40 frames, 69 atoms, 6 residues, and unitcells at ...>
- Type:
mdtraj.Trajectory
- property trajs_by_common_str: dict[None | str, TrajEnsemble]#
Returns the trajs in self ordered by top.
If all trajectories share the same common_str, a dict with one key will be returned. As the common_str can be None, None can also occur as a key in this dict.
- Type:
- property trajs_by_top: dict[Topology, TrajEnsemble]#
Returns the trajs in self ordered by top.
If all trajectories share the same topology, a dict with one key will be returned.
- Type:
dict[md.Topology, TrajEnsemble]
- property trajs_by_traj_num: dict[int, SingleTraj]#
- property tsel#
- classmethod with_overwrite_trajnums(*trajs)[source]#
Creates a
TrajEnsemble
by copying the providedencodermap.trajinfo.info_single.SingleTraj
instances and changing theirtraj_num
attribute to adhere to[0, 1, 2, ...]
.- Parameters:
trajs (Sequence[SingleTraj]) – The sequence of trajs.
- Returns:
A
TrajEnsemble
instance.- Return type:
encodermap.trajinfo.info_single module#
Classes to work with ensembles of trajectories.
The statistics of a protein can be better described by an ensemble of proteins, rather than a single long trajectory. Treating a protein in such a way opens great possibilities and changes the way one can treat molecular dynamics data. Trajectory ensembles allow:
Faster convergence via adaptive sampling.
Better anomaly detection of unique structural states.
This subpackage contains two classes which are containers of trajectory data. The SingleTraj trajectory contains information about a single trajectory. The TrajEnsemble class contains information about multiple trajectories. This adds a new dimension to MD data. The time and atom dimension are already established. Two frames can be appended along the time axis to get a trajectory with multiple frames. If they are appended along the atom axis, the new frame contains the atoms of these two. The trajectory works in a similar fashion. Adding two trajectories along the trajectory axis returns a trajectory ensemble, represented as a TrajEnsemble class in this package.
- class SingleTraj(traj, top=None, common_str='', backend='no_load', index=None, traj_num=None, basename_fn=None, custom_top=None)[source]#
Bases:
object
This class contains the info about a single trajectory.
This class contains many of the attributes and methods of mdtraj.Trajectory. It is meant to be used as a standalone single trajetcory or in an ensemble defined in the
encodermap.trajinfo.info_all.TrajEnsemble
class. Other than the standard mdtraj.Trajectory this class loads the MD data only when needed. The location of the file(s) and other attributes like indices (int, list[int], numpy.ndarray, slice) are stored until the traj is accessed via theSingleTraj.traj
attribute. The returned traj is a mdtraj.Trajectory with the correct number of frames in the correct sequence.Besides MD data, this class keeps track of your collective variables. Oftentimes the raw xyz data of a trajectory is not needed for understanding the conformation and suitable CVs are selected to represent a protein via internal coordinates (torsions, pairwise distances, etc.). This class keeps tack of your CVs. Whether you call them
'highd'
or'torsions'
, this class keeps track of everything and returns the values when you need them.SingleTraj
supports fancy indexing, so you can extract one or more frames from a Trajectory as a separate trajectory. For example, to form a trajectory with every other frame, you can slice withtraj[::2]
.Note
SingleTraj uses the nanometer, degree & picosecond unit system.
- Parameters:
- backend#
Current state of loading. If
backend == 'no_load'
xyz data will be loaded from disk, if accessed. Ifbackend == 'mdtraj'
, the data is already in RAM.- Type:
- common_str#
Substring of
traj_file
andtop_file
. Used to group multiple trajectory and topology files. Iftraj files=['protein1_traj1.xtc', 'protein1_traj2.xtc']
have the same topolgy stored in a file called'protein1.pdb'
, you can load them withcommon_str='protein1'
together with more.xtc
and.pdb
files and these two.xtc
files will use the correct.pdb
file.- Type:
- index#
A sequence of fancy slices of the trajectory. When file is loaded from disk, the fancy indexes will be applied one after the other.
- Type:
Sequence[Union[None, int, list, numpy.ndarray, slice]]
- top_file#
Topology file used to create this class. If a .h5 trajectory was used traj_file and top_file are identical. If a
mdtraj.Trajectory
was used to create SingleTraj, these strings are empty.- Type:
Examples
Load a pdb file with 14 frames from rcsb.org
>>> import encodermap as em >>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb") >>> traj <encodermap.SingleTraj object... >>> traj.n_frames 14
Providing common_str sets this attribute.
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb", common_str="1GHC") >>> traj.common_str '1GHC'
Indexing using integers returns a SingleTraj with only one frame.
>>> frame = traj[5] >>> frame.n_frames 1
Indexing can also use lists of integers.
>>> subset = traj[[0, 1, 5]] >>> subset.n_frames 3
Further indexing this
subset
uses the current trajectory ‘as is’. Although frame 0, 1, and 5 have been extracted fromtraj
, we get frame 5 fromsubset
by indexing with 2.>>> frame = subset[2] >>> frame.id array([5])
Indexing using the original frame indices from the file is done using the
fsel[]
accessor.>>> frame = subset.fsel[5] >>> frame.id array([5])
Advanced slicing
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")[-1:7:-2] >>> [frame.id[0] for frame in traj] [13, 11, 9]
The
traj_num
argument is mainly used inencodermap.TrajEnsemble
, but can be provided manually.>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb", traj_num=2) >>> traj.traj_num 2
The argument
basename_fn
should be a callable, that takes a string and returns a string.>>> from pathlib import Path >>> def my_basename_fn(filename): ... stem = str(Path(filename).stem) ... return "custom_" + stem >>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb", basename_fn=my_basename_fn) >>> traj.basename 'custom_1GHC'
Build a trajectory ensemble from multiple
SingleTraj
objects.>>> traj1 = em.SingleTraj("https://files.rcsb.org/view/1YUG.pdb") >>> traj2 = em.SingleTraj("https://files.rcsb.org/view/1YUF.pdb") >>> trajs = traj1 + traj2 >>> print(trajs.n_trajs, trajs.n_frames, [traj.n_frames for traj in trajs]) 2 31 [15, 16]
- property CVs: dict[str, ndarray]#
Returns a simple dict from the more complicated self._CVs xarray Dataset.
If self._CVs is empty and self.traj_file is a HDF5 (.h5) file, the contents of the HDF5 will be checked, whether CVs have been stored there. If not and empty dict will be returned.
- Type:
- _mdtraj_attr = ['n_frames', 'n_atoms', 'n_chains', 'n_residues', 'openmm_boxes', 'openmm_positions', 'time', 'timestep', 'xyz', 'unitcell_vectors', 'unitcell_lengths', 'unitcell_angles', '_check_valid_unitcell', '_distance_unit', '_have_unitcell', '_rmsd_traces', '_savers', '_string_summary_basic', '_time', '_time_default_to_arange', '_topology', '_unitcell_angles', '_unitcell_lengths', '_xyz']#
- property _n_frames_base_h5_file: int#
Can be used to get n_frames without loading an HDF5 into memory.
- Type:
- property _original_frame_indices: ndarray#
If trajectory has not been loaded, it is loaded and the frames of the trajectory file on disk are put into a np.arange(). If the trajectory is sliced in weird ways, this array tracks the original frames.
- Type:
- property _traj#
Needs to be here to complete setter. Not returning anything, because setter is also not returning anything.
- atom_slice(atom_indices, invert=False)[source]#
Deletes atoms from this
SingleTraj
instance.- Parameters:
atom_indices (Union[list, numpy.ndarray]) – The indices of the atoms to keep.
invert (bool) – If False, it is assumed, that the atoms in
atom_indices
are the ones to be kept. If True, the atoms inatom_indices
are the ones to be removed.
- Return type:
None
- property basename: str#
Basename is the filename without path and without extension. If basename_fn is not None, it will be applied to traj_file.
- Type:
- dash_summary()[source]#
Returns a
pandas.DataFrame
with useful information about this instance.- Returns:
The dataframe.
- Return type:
pd.DataFrame
- del_CVs()[source]#
Resets the
_CVs
attribute to an emptyxarray.Dataset
.- Return type:
None
- property extension: str#
Extension is the file extension of the trajectory file (self.traj_file).
- Type:
- property featurizer#
- classmethod from_pdb_id(pdb_id, traj_num=None)[source]#
Alternate constructor for the TrajEnsemble class.
Builds an SingleTraj class from a pdb-id.
- Parameters:
- Returns:
An SingleTraj class.
- Return type:
- property fsel#
- get_single_frame(key)[source]#
Returns a single frame from the trajectory.
- Parameters:
key (Union[int, np.int]) – Index of the frame.
- Return type:
Examples
Import EncoderMap and load
SingleTraj
.>>> import encodermap as em >>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb") >>> traj.n_frames 14
Load the same traj and give it a
traj_num
for recognition in a set of multiple trajectories.>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb", traj_num=5) >>> frame = traj.get_single_frame(2) >>> frame.id array([[5, 2]])
- property id: ndarray#
id is an array of unique identifiers which identify the frames in this SingleTraj object when multiple Trajectories are considered.
If the traj was initialized from an TrajEnsemble class, the traj gets a unique identifier (traj_num) which will also be put into the id array, so that id can have two shapes ((n_frames, ), (n_frames, 2)) This corresponds to self.id.ndim = 1 and self.id.ndim = 2. In the latter case self.id[:,1] are the frames and self.id[:,0] is an array full of traj_num.
- Type:
- property indices_chi1: ndarray#
A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.
- Type:
- property indices_chi2: ndarray#
A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.
- Type:
- property indices_chi3: ndarray#
A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.
- Type:
- property indices_chi4: ndarray#
A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.
- Type:
- property indices_chi5: ndarray#
A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.
- Type:
- property indices_omega: ndarray#
A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.
- Type:
- property indices_phi: ndarray#
A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.
- Type:
- property indices_psi: ndarray#
A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.
- Type:
- iterframes(with_traj_num: bool = False) Iterable[tuple[int, SingleTraj]] [source]#
- iterframes(with_traj_num: bool = True) Iterable[tuple[int, int, SingleTraj]]
Iterator over the frames in this class.
- Parameters:
with_traj_num (bool) – Whether to return a three-tuple of traj_num, frame_num, frame (True) or just traj_num, frame (False).
- Yields:
tuple –
- A tuple containing the following:
int: The traj_num.
int: The frame_num.
encodermap.SingleTraj: An SingleTraj object.
Examples
Import EncoderMap and create
SingleTraj
instance.>>> import encodermap as em >>> traj = em.SingleTraj('https://files.rcsb.org/view/1YUG.pdb') >>> traj.n_frames 15
Slicing the trajectory every 5th frame
>>> traj = traj[::5] >>> traj.n_frames 3
Using the
iterframes()
iterator.>>> for frame_num, frame in traj.iterframes(): ... print(frame_num, frame.n_frames) 0 1 5 1 10 1
- join(other)[source]#
Join two trajectories together along the time/frame axis.
Note
Returns a
mdtraj.Trajectory
and thus loses CVs, filenames, etc.- Parameters:
other (SingleTraj | Trajectory)
- Return type:
Trajectory
- load_CV(data, attr_name=None, cols=None, deg=None, periodic=True, labels=None, override=False)[source]#
Load CVs into traj. Many options are possible. Provide xarray, numpy array, em.loading.feature, em.featurizer, and even string!
This method loads CVs into the SingleTraj instance. Many ways of doing so are available:
numpy.ndarray
: The easiest way. Provide a np array and a name forthe array, and the data will be saved as an instance variable, accesible via SingleTraj.name.
xarray.DataArray
: You can load a multidimensional xarray asdata into the class. Please refer to xarrays own documentation if you want to create one yourself.
xarray.Dataset
: You can add another dataset to the existing _CVs.encodermap.loading.features.Feature
: If you provide one of thefeatures from
encodermap.loading.features
the resulting features will be loaded and also be placed under the set name.
encodermap.loading.featurizer.Featurizer
: If you provide afull featurizer, the data will be generated and be accessible as an attribute.
- str: If a string is provided, the data will be loaded from a
.txt
,.npy
, or NetCDF / HDF5.nc
file.
- Parameters:
(Union[str (data) – em.loading.features.Feature, em.loading.featurizer.Featurizer]): The CV to load. Either as
numpy.ndarray
,xarray.DataArray
, EncoderMap feature, or EncoderMap Featurizer.numpy.ndarray – em.loading.features.Feature, em.loading.featurizer.Featurizer]): The CV to load. Either as
numpy.ndarray
,xarray.DataArray
, EncoderMap feature, or EncoderMap Featurizer.xr.DataArray – em.loading.features.Feature, em.loading.featurizer.Featurizer]): The CV to load. Either as
numpy.ndarray
,xarray.DataArray
, EncoderMap feature, or EncoderMap Featurizer.data (SingleTrajFeatureType)
attr_name (Optional[str])
deg (Optional[bool])
periodic (bool)
override (bool)
- Return type:
None
- :paramem.loading.features.Feature, em.loading.featurizer.Featurizer]):
The CV to load. Either as
numpy.ndarray
,xarray.DataArray
, EncoderMap feature, or EncoderMap Featurizer.
- Parameters:
attr_name (Optional[str]) – The name under which the CV should be found in the class. Is needed, if a raw numpy array is passed, otherwise the name will be generated from the filename (if
data == str
), the DataArray.name (ifdata == xarray.DataArray
), or the feature name.cols (Optional[list]) – A list specifying the columns to use it for the high-dimensional data. If your highD data contains (x,y,z,…)-errors or has an enumeration column at
col=0
this can be used to remove this unwanted data.deg (Optional[bool]) – Whether the provided data is in radians (False) or degree (True). It can also be None for non-angular data.
labels (Optional[Union[list, str]]) – If you want to label the data you provided, pass a list of str. If set to None, the features in this dimension will be labeled as
[f"{attr_name.upper()} FEATURE {i}" for i in range(self.n_frames)]
. If a str is provided, the features will be labeled as[f"{attr_name.upper()} {label.upper()} {i}" for i in range(self.n_frames)]
. If a list of str is provided, it needs to have the same length as the traj has frames. Defaults to None.override (bool) – Whether to overwrite existing CVs. The method will also print a message which CVs have been overwritten.
data (SingleTrajFeatureType)
periodic (bool)
- Return type:
None
Examples
Import EncoderMap and load an example Trajectory.
>>> import encodermap as em >>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")
Load the central dihedrals using ``data=’central_dihedrals’` as shortcut.
>>> traj.load_CV("central_dihedrals") >>> traj.central_dihedrals.shape (14, 222)
>>> traj._CVs['central_dihedrals']['CENTRAL_DIHEDRALS'].values[:2] ['CENTERDIH PSI RESID MET: 1 CHAIN 0' 'CENTERDIH OMEGA RESID MET: 1 CHAIN 0']
Slicing the
SingleTraj
keeps all CVs in order.>>> import numpy as np >>> from pathlib import Path >>> traj1 = em.SingleTraj( ... Path(em.__file__).parent.parent / "tests/data/1am7_corrected.xtc", ... Path(em.__file__).parent.parent / "tests/data/1am7_protein.pdb", ... ) >>> traj1.load_CV(traj1.xyz[..., -1], 'z_coordinate') ... >>> for i, frame in enumerate(traj1): ... print(np.array_equal(frame.z_coordinate[0], frame.xyz[0, :, -1])) ... if i == 3: ... break True True True True
- Raises:
FileNotFoundError – When the file given by data does not exist.
IOError – When the provided filename does not have .txt, .npy or .nc extension.
TypeError – When data does not match the specified input types.
Exception – When a numpy array has been passed as data and no attr_name has been provided.
Exception – When the provided attr_name is str, but cannot be a python identifier.
- Parameters:
- Return type:
None
- load_custom_topology(custom_top=None)[source]#
Loads a custom_topology from a CustomTopology class or a dict.
- Parameters:
custom_top (Optional[Union[CustomTopology, CustomAAsDict]]) – Optional[Union[CustomTopology, CustomAAsDict]]: An instance of
encodermap.trajinfo.trajinfo_utils.CustomTopology
or a dictionary that can be made into such.- Return type:
None
- load_traj(new_backend='mdtraj')[source]#
Loads the trajectory, with a new specified backend.
After this is called the instance variable self.trajectory will contain a mdtraj Trajectory object.
- Parameters:
new_backend (str, optional) –
- Can either be:
'mdtraj'
to load the trajectory using mdtraj.'no_load'
to not load the traj (unload).
Defaults to
'mdtraj'
.- Return type:
None
- property n_atoms: int#
Number of atoms in traj.
Loads the traj into memory if not in HDF5 file format. Be aware.
- Type:
- property n_frames: int#
Number of frames in traj.
Loads the traj into memory if not in HDF5 file format. Be aware.
- Type:
- save(fname, CVs='all', overwrite=False)[source]#
Save the trajectory as HDF5 file format to disk.
- Parameters:
- Raises:
IOError – When the file already exists and overwrite is False.
- Return type:
None
- save_CV_as_numpy(attr_name, fname=None, overwrite=False)[source]#
Saves a specified collective variable of this traj as a
.npy
file.This got its own method for parallelization purposes.
- select(sel_str='all')[source]#
Execute a selection against the topology.
Examples
>>> import encodermap as em >>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb") >>> select = traj.top.select("name CA and resSeq 1") >>> select array([1])
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb") >>> select = traj.top.select("name CA and resSeq 1") >>> traj.top.atom(select[0]) MET1-CA
- show_traj(gui=True)[source]#
Returns an nglview view object.
- Returns:
The nglview widget object.
- Return type:
view (nglview.widget)
- Parameters:
gui (bool)
- sidechain_info()[source]#
Indices used for the AngleDihedralCartesianEncoderMap class to allow training with multiple different sidechains.
- Returns:
The indices. The key ‘-1’ is used for the hypothetical convex hull of all feature spaces (the output of the tensorflow model). The other keys match the common_str of the trajs.
- Return type:
- Raises:
Exception – When the common_strings and topologies are not aligned. An exception is raised. Aligned means that all trajs with the same common_str should possess the same topology.
- stack(other)[source]#
Stack two trajectories along the atom axis
Note
Returns a m``dtraj.Trajectory`` and thus loses CVs, filenames, etc.
- Parameters:
other (SingleTraj)
- Return type:
Trajectory
- superpose(reference, frame=0, atom_indices=None, ref_atom_indices=None, parallel=True, inherit_CVs=False)[source]#
Superpose each conformation in this trajectory upon a reference
- Parameters:
reference (Union[mdtraj.Trajectory, SingleTraj]) – The reference frame to align to. If the reference has multiple frames and you want to use a specific frame as reference, use the
frame
argument also.frame (int, optional) – Align to this frame in reference. Default is 0.
atom_indices (Union[np.array, None], optional) – Indices in self, used to calculate RMS values. Defaults to None which means all atoms will be used.
ref_atom_indices (Union[np.array, None], optional) – Indices in reference, used to calculate RMS values. Defaults to None which means all atoms will be used.
parallel (bool, optional) – Use OpenMP to run the superposition in parallel over multiple cores.
inherit_CVs (bool, optional) – Whether to also inherit the CVs. This feature is currently not implemented. It would require additional code in all Feature classes discerning intrinsic (distance, angle, cluster_membership, etc.) or an extrinsic feature (absolute coordinate, COG position, etc.). Then this extrinsic/intrinsic boolean flag also needs to accompany the xarray Datasets, so that the intrinsic CVs can be inherited, and the extrinsic can be dropped with a corresponding message.
- Returns:
A new trajectory with atoms aligned.
- Return type:
- property top: Topology#
The structure of a Topology object is similar to that of a PDB file.
It consists. of a set of Chains (often but not always corresponding to polymer chains). Each Chain contains a set of Residues, and each Residue contains a set of Atoms. In addition, the Topology stores a list of which atom pairs are bonded to each other. Atom and residue names should follow the PDB 3.0 nomenclature for all molecules for which one exists
- chains#
Iterate over chains.
- Type:
generator
- residues#
Iterate over residues.
- Type:
generator
- atoms#
Iterate over atoms.
- Type:
generator
- bonds#
Iterate over bonds.
- Type:
generator
- Type:
mdtraj.Topology
- property traj: Trajectory#
This attribute always returns an mdtraj.Trajectory. if backend is ‘no_load’, the trajectory will be loaded into memory and returned.
- Type:
mdtraj.Trajectory
encodermap.trajinfo.load_traj module#
Util functions for the TrajEnsemble and SingleTraj classes.
encodermap.trajinfo.trajinfo_utils module#
Util functions for the TrajEnsemble and SingleTraj classes.
- class CustomTopology(*new_residues, traj=None)[source]#
Bases:
object
Adds custom topology elements to a topology parsed by MDTraj.
Postpones parsing the custom AAs until requested.
- The custom_aminoacids dictionary follows these styleguides:
The keys can be str or tuple[str, str]
If a key is str, it needs to be a 3-letter code (MET, ALA, GLY, …)
- If a key is a tuple[str, str], the first str of the tuple is a common_str
(see the docstring for encodermap.TrajEnsemble to learn about common_str. This common_str can be used to apply custom topologies to an ensemble based on their common_str. For example:
{("CSR_mutant", "CSR"): ...}
- A key can also affect only a single residue (not all resides called “CSR”).
For that, the 3-letter code of the residue needs to be postponed with a dash and the 1-based indexed resSeq of the residue:
{"CSR-2": ...}
- The value to a key can be None, which means this residue will not be
used for building a topology. Because EncoderMap raises Exceptions, when it encounters unknown residues (to make sure, you don’t forget to featurize some important residues), it will also raise Exceptions when the topology contains unknown solvents/solutes. If you run a simulation in water/methanol mixtures with the residue names SOL and MOH, EncoderMap will raise an Exception upon encountering MOH, so your custom topology should contain 1{“MOH”: None}` to include MOH.
- The value of a key can also be a tuple[str, Union[dict, None]]. In this
case, the first string should be the one-letter code of the residue or the residue most closely representing this residue. If you use phosphotyrosine (PTR) in your simulations and want to use it as a standard tyrosine residue, the custom topology should contain {“PTR”: (“Y”, None)}
- If your residue is completely novel you need to define all possible
bonds, backbone and sidechain dihedrals yourself. For that, you want to provide a tuple[str, dict[str, Union[listr[str], list[int]]] type. This second level dict allows for the following keys: * bonds: For bonds between atoms. This key can contain a list[tuple[str, str]],
which defines bonds in this residue. This dict defines a bond between N and CA in phosphothreonine. {“PTR”: (“Y”, {
- “bonds”: [
(“N”, “CA”),
],
}} These strings can cotain + and - signs to denote bonds to previous or following residues. To connect the residues MET1 to TPO2 to ALA3, you want to have this dict: {“TPO”: (“T”, {
- “bonds”: [
(“-C”, “N”), # bond to MET1-C (“N”, “CA”), … (“C”, “+N”), # bond to ALA2-N
],
}} For exotic bonds, one of the strings can also be int to connect to any 0-based indexed atom in your topology. You can connect the residues CYS2 and CYS20 wit a sulfide bride like so: {“CYS-2”: (“C”, {
- “bonds”: [
(“S”, 321), # connect to CYS20, the 321 is a placeholder
], },
- “CYS-20”: (“C”, {
- “bonds”: [
(20, “S”), # connect to CYS2
], },
}
- optional_bonds: This key accepts the same list[tuple] as ‘bonds’.
However, bonds will raise an Exception if a bond already exists. The above example with a disulfide bridge between CYS2 and CYS20 will thus raise an exception. A better example is: {“CYS-2”: (“C”, {
- “optional_bonds”: [
(“S”, 321), # connect to CYS20, the 321 is a placeholder
], },
- “CYS-20”: (“C”, {
- “optional_bonds”: [
(20, “S”), # connect to CYS2
], },
}
- delete_bonds: This key accepts the same list[tuple] as ‘bonds’,
but will remove bonds. If a bond was marked for deletion, but does not exist in your topology, an Exception will be raised. To delete bonds, without raising an Exception, use:
- optional_delete_bonds: This will delete bonds, if they are present
and won’t raise an Exception if no bond is present.
- PHI, PSI, OMEGA: These keys define the backbone torsions of this
residue. You can just provide a list[str] for these keys. But the str can contain + and - to use atoms in previous or following residues. Example: {
- “CYS-2”: (
“C”, {
“PHI”: [“-C”, “N”, “CA”, “C”], “PSI”: [“N”, “CA”, “C”, “+N”], “OMEGA”: [“CA”, “C”, “+N”, “+CA”],
},
),
}
- not-PSI, not_OMEGA, not_PHI: Same as ‘PHI’, ‘PSI”, ‘OMEGA’, but
will remove these dihedrals from consideration. The vales of these keys do not matter. Example: {
- “CYS-2”: (
“C”, {
“PHI”: [“-C”, “N”, “CA”, “C”], “not_PSI”: [], # value for not_* keys does not matter “not_OMEGA”: [], # it just makes EncoderMap skip these dihedrals.
},
),
}
- CHI1, …, CHI5: Finally, these keys define the atoms considered for
the sidechain angles. If you want to add extra sidechain dihedrals for phosphothreonine, you can do: {
- “TPO”: (
“T”, {
“CHI2”: [“CA”, “CB”, “OG1”, “P”], # include phosphorus in sidechain angles “CHI3”: [“CB”, “OG1”, “P”, “OXT”], # include the terminal axygen in sidechain angles
},
)
}
Examples
>>> # Aminoacids taken from https://www.swisssidechain.ch/ >>> # The provided .pdb file has only strange and unnatural aminoacids. >>> # Its sequence is: >>> # TPO - PTR - ORN - OAS - 2AG - CSR >>> # TPO: phosphothreonine >>> # PTR: phosphotyrosine >>> # ORN: ornithine >>> # OAS: o-acetylserine >>> # 2AG: 2-allyl-glycine >>> # CSR: selenocysteine >>> # However, someone mis-named the 2AG residue to ALL >>> # Let's fix that with EncoderMap's CustomTopology >>> import encodermap as em >>> from pathlib import Path ... >>> traj = em.load(Path(em.__file__).resolve().parent.parent / "tests/data/unnatural_aminoacids.pdb") ... >>> custom_aas = { ... "ALL": ("A", None), # makes EncoderMap treat 2-allyl-glycine as alanine ... "OAS": ( ... "S", # OAS is 2-acetylserine ... { ... "CHI2": ["CA", "CB", "OG", "CD"], # this is a non-standard chi2 angle ... "CHI3": ["CB", "OG", "CD", "CE"], # this is a non-standard chi3 angle ... }, ... ), ... "CSR": ( # CSR is selenocysteine ... "S", ... { ... "bonds": [ # we can manually define bonds for selenocysteine like so: ... ("-C", "N"), # bond between previous carbon and nitrogen CSR ... ("N", "CA"), ... ("N", "H1"), ... ("CA", "C"), ... ("CA", "HA"), # this topology includes hydrogens ... ("C", "O"), ... ("C", "OXT"), # As the C-terminal residue, we don't need to put ("C", "+N") here ... ("CA", "CB"), ... ("CB", "HB1"), ... ("CB", "HB2"), ... ("CB", "SE"), ... ("SE", "HE"), ... ], ... "CHI1": ["N", "CA", "CB", "SE"], # this is a non-standard chi1 angle ... }, ... ), ... "TPO": ( # TPO is phosphothreonine ... "T", ... { ... "CHI2": ["CA", "CB", "OG1", "P"], # a non-standard chi2 angle ... "CHI3": ["CB", "OG1", "P", "OXT"], # a non-standard chi3 angle ... }, ... ), ... } ... >>> # loading this will raise an Exception, because the bonds in CSR already exist >>> traj.load_custom_topology(custom_aas) Traceback (most recent call last): ... Exception: Bond between ALL5-C and CSR6-N already exists. Consider using the key 'optional_bonds' to not raise an Exception on already existing bonds. >>> # If we rename the "bonds" section in "CSR" to "optional_bonds" it will work >>> custom_aas["CSR"][1]["optional_bonds"] = custom_aas["CSR"][1].pop("bonds") >>> traj.load_custom_topology(custom_aas) >>> sidechains = em.features.SideChainDihedrals(traj).describe() >>> "SIDECHDIH CHI2 RESID OAS: 4 CHAIN 0" in sidechains True >>> "SIDECHDIH CHI3 RESID OAS: 4 CHAIN 0" in sidechains True >>> "SIDECHDIH CHI1 RESID CSR: 6 CHAIN 0" in sidechains True >>> "SIDECHDIH CHI2 RESID TPO: 1 CHAIN 0" in sidechains True >>> "SIDECHDIH CHI3 RESID TPO: 1 CHAIN 0" in sidechains True
- Parameters:
new_residues (NewResidue)
traj (Optional[SingleTraj])
- add_bonds()[source]#
Adds and deletes bonds specified in the custom topology.
- Returns:
The new topology.
- Return type:
md.Topology
- add_new_residue(new_residue)[source]#
Adds an instance of NewResidue to the reisdues of this CustomTopology.
- Parameters:
new_residue (NewResidue) – An instance of NewResidue.
- Return type:
None
- atom_sequence(type)[source]#
Returns either backbone or sidechain indices in a useful order.
- Parameters:
type (Literal["OMEGA", "PHI", "PSI", "CHI1", "CHI2", "CHI3", "CHI4", "CHI5"]) – The angle, that is looked for.
- Returns:
A tuple containing two numpy arrays:
- Return type:
tuple[np.ndarray, np.ndarray]
- classmethod from_dict(custom_aas, traj=None)[source]#
Instantiate the class from a dictionary.
- Parameters:
custom_aas (CustomAAsDict) –
Custom AAs defined by a dict with the following properties: The keys are the residue names encountered in this traj. The values to the keys can be one of three types:
- None: if a key: None pair is supplied, this just adds the
residue to the recognized residues. Nothing will be done with it.
- str: If a key: str pair is supplied, it is expected that the
string matches one of the one-letter amino-acid codes. If your new residue is based on Lysine and you named it LYQ, you need to supply: {“LYQ”: “K”}
- tuple[str, dict]: If your residue has nonstandard side-chain
angles (i.e. due to phosphorylation), you can supply a tuple of the one-letter amino-acid code and a dict which defines the sidechain angles like so: {“THR”: (“T”, {“CHI2”: [“CA”, “CB”, “CG”, “P”]})} In this example, the standard amino acid threonine was phosphorylated. The chi2 angle was added. If you want to add custom bonds you can add the “bond” key to the dict and give it either atom names or atom indices of other atoms like so: {“LYQ”: (“K”, {“bonds”: [(“N”, “CA”), (“N”, “H”), …], “CHI1”: [“N”, “CA”, “CB”, “CG”]}).
- tuple[str, str, dict]: In this case, the first string should
be the name of the amino-acid, the second string should be a common_str, that is in self.common_str. That way, the different topologies in this TrajEnsemble can dynamically use different custom_aas.
traj (SingleTraj | None)
- classmethod from_hdf5_file(fname, traj=None)[source]#
- Parameters:
traj (SingleTraj | None)
- classmethod from_json(json_str, traj=None)[source]#
The same as from_dict, but using a json str.
- Parameters:
json_str (str)
traj (SingleTraj | None)
- classmethod from_yaml(path, traj=None)[source]#
- Parameters:
traj (SingleTraj | None)
- get_single_residue_atom_ids(atom_names, r, key_error_ok=False)[source]#
Gives the 0-based atom ids of a single residue.
- Parameters:
atom_names (list[str]) – The names of the atoms. ie. [‘N’ ,’CA’, ‘C’, ‘+N’]
r (NewResidue) – An instance of NewResidue.
key_error_ok (bool) – Whether a key error when querying self._atom_dict raises an error or returns an empty np.ndarray.
- Returns:
An integer array with the ids of the requested atoms.
- Return type:
np.ndarray
- indices_chi1()[source]#
Returns the requested indices as a (n_dihedrals, 4)-shaped numpy array.
- Return type:
- indices_chi2()[source]#
Returns the requested indices as a (n_dihedrals, 4)-shaped numpy array.
- Return type:
- indices_chi3()[source]#
Returns the requested indices as a (n_dihedrals, 4)-shaped numpy array.
- Return type:
- indices_chi4()[source]#
Returns the requested indices as a (n_dihedrals, 4)-shaped numpy array.
- Return type:
- indices_chi5()[source]#
Returns the requested indices as a (n_dihedrals, 4)-shaped numpy array.
- Return type:
- indices_omega()[source]#
Returns the requested indices as a (n_dihedrals, 4)-shaped numpy array.
- Return type:
- indices_phi()[source]#
Returns the requested indices as a (n_dihedrals, 4)-shaped numpy array.
- Return type:
- indices_psi()[source]#
Returns the requested indices as a (n_dihedrals, 4)-shaped numpy array.
- Return type:
- property new_residues: list[NewResidue]#
A list of all new residues.
- Type:
- sidechain_sequence(atom_names, type, top=None)[source]#
Searches for a sequence along the sidechains.
- Parameters:
atom_names (list[str]) – The names of the atoms. Can use +/- to mark atoms in previous or following residue.
(Literal["CHI1" (type) – The type of the dihedral sequence.
"CHI2" – The type of the dihedral sequence.
"CHI3" – The type of the dihedral sequence.
"CHI4" – The type of the dihedral sequence.
"CHI5"] – The type of the dihedral sequence.
top (Optional[md.Topology]) – Can be used to overwrite the toplogy in self.traj.
type (Literal['CHI1', 'CHI2', 'CHI3', 'CHI4', 'CHI5'])
- Returns:
The integer indices of the requested atoms.
- Return type:
np.ndarray
- to_dict()[source]#
- Return type:
dict[str | tuple[str, str], None | tuple[str, None] | tuple[str, dict[Literal[‘bonds’, ‘optional_bonds’, ‘delete_bonds’, ‘optional_delete_bonds’, ‘PHI’, ‘PSI’, ‘OMEGA’, ‘not_PHI’, ‘not_PSI’, ‘not_OMEGA’, ‘CHI1’, ‘CHI2’, ‘CHI3’, ‘CHI4’, ‘CHI5’], list[str] | list[tuple[str | int, str | int]]]]]
- property top: Topology#
The fixed topology.
- Type:
md.Topology
- load_CVs_ensembletraj(trajs, data, attr_name=None, cols=None, deg=None, periodic=True, labels=None, directory=None, ensemble=False, override=False)[source]#
Loads CVs for a trajectory ensemble.
- CVs can be loaded from a multitude of sources. The argument data can be:
np.ndarray: Use a numpy array as a feature.
- str | Path: You can point to .txt or .npy files and load the features
from these files. In this case, the cols argument can be used to only use a subset of columns in these files. You can also point to a single directory in which case the basename of the trajectories will be used to look for .npy and .txt files.
- str: Some strings like “central_dihedrals” are recognized out-of-the-box.
You can also provide “all” to load all dihedrals used in an encodermap.AngleDihedralCartesianEncoderMap.
- Feature: You can provide an encodermap.loading.features Feature. The
CVs will be loaded by creating a featurizer, adding this feature, and obtaining the output.
- Featurizer: You can also directly provide a featurizer with multiple
features.
- xr.DataArray: You can also provide a xarray.DataArray, which will be
appended to the existing CVs.
- xr.Dataset: If you provide a xarray.Dataset, you will overwrite all
currently loaded CVs.
- Parameters:
trajs (TrajEnsemble) – The trajectory ensemble to load the data for.
data (Union[str, list, np.ndarray, 'all', xr.Dataset]) – The CV to load. When a numpy array is provided, it needs to have a shape matching n_frames. The data is distributed to the trajs. When a list of files is provided, len(data) needs to match n_trajs. The first file will be loaded by the first traj (based on the traj’s traj_num) and so on. If a list of np.ndarray is provided, the first array will be assigned to the first traj (based on the traj’s traj_num). If None is provided, the argument directory will be used to construct a str like: fname = directory + traj.basename + ‘_’ + attr_name. If there are .txt or .npy files matching that string in the directory, the CVs will be loaded from these files to the corresponding trajs. Defaults to None.
attr_name (Optional[str]) – The name under which the CV should be found in the class. Choose whatever you like. highd, lowd, dists, etc. The CV can then be accessed via dot-notation: trajs.attr_name. Defaults to None, in which case, the argument data should point to existing files and the attr_name will be extracted from these files.
A list of integers indexing the columns of the data to be loaded. This is useful if a file contains columns which are not features (i.e. an indexer or the error of the features. eg:
id f1 f2 f1_err f2_err 0 1.0 2.0 0.1 0.1 1 2.5 1.2 0.11 0.52
In that case, you would want to supply cols=[1, 2] to the cols argument. If None all columns are loaded. Defaults to None.
deg (Optional[bool]) – Whether to return angular CVs using degrees. If None or False, CVs will be in radian. Defaults to None.
labels (list) – A list containing the labels for the dimensions of the data. If you provide a np.ndarray with shape (n_trajs, n_frames, n_feat), this list needs to be of len(n_feat) Defaults to None.
directory (Optional[str]) – If this argument is provided, the directory will be searched for
.txt
or.npy
files which have the same names as the trajectories have basenames. The CVs will then be loaded from these files.ensemble (bool) – Whether the trajs in this class belong to an ensemble. This implies that they contain either the same topology or are very similar (think wt, and mutant). Setting this option True will try to match the CVs of the trajs onto the same dataset. If a VAL residue has been replaced by LYS in the mutant, the number of sidechain dihedrals will increase. The CVs of the trajs with VAL will thus contain some NaN values. Defaults to False.
override (bool) – Whether to override CVs with the same name as attr_name.
periodic (bool)
- Return type:
None
- load_CVs_singletraj(data, traj, attr_name=None, cols=None, deg=None, periodic=True, labels=None)[source]#
- Parameters:
data (str | Path | ndarray | Feature | Dataset | DataArray | SingleTrajFeaturizer | DaskFeaturizer | Literal['all'] | ~typing.Literal['full'] | ~encodermap.loading.features.AllCartesians | ~encodermap.loading.features.AllBondDistances | ~encodermap.loading.features.CentralCartesians | ~encodermap.loading.features.CentralBondDistances | ~encodermap.loading.features.CentralAngles | ~encodermap.loading.features.CentralDihedrals | ~encodermap.loading.features.SideChainCartesians | ~encodermap.loading.features.SideChainBondDistances | ~encodermap.loading.features.SideChainAngles | ~encodermap.loading.features.SideChainDihedrals | ~encodermap.loading.features.CustomFeature | ~encodermap.loading.features.SelectionFeature | ~encodermap.loading.features.AngleFeature | ~encodermap.loading.features.DihedralFeature | ~encodermap.loading.features.DistanceFeature | ~encodermap.loading.features.AlignFeature | ~encodermap.loading.features.InverseDistanceFeature | ~encodermap.loading.features.ContactFeature | ~encodermap.loading.features.BackboneTorsionFeature | ~encodermap.loading.features.ResidueMinDistanceFeature | ~encodermap.loading.features.GroupCOMFeature | ~encodermap.loading.features.ResidueCOMFeature | ~encodermap.loading.features.SideChainTorsions | ~encodermap.loading.features.MinRmsdFeature | None)
traj (SingleTraj)
attr_name (str | None)
deg (bool | None)
periodic (bool)
- Return type: