encodermap.trajinfo package#

Submodules#

encodermap.trajinfo.info_all module#

Classes to work with ensembles of trajectories.

The statistics of a protein can be better described by an ensemble of proteins, rather than a single long trajectory. Treating a protein in such a way opens great possibilities and changes the way one can treat molecular dynamics data. Trajectory ensembles allow:

Faster convergence via adaptive sampling.

Better grasp of equilibrium and off-equilibrium dynamics.

This subpackage contains two classes which are containers of trajectory data. The SingleTraj trajectory contains information about a single trajectory. The TrajEnsemble class contains information about multiple trajectories. This adds a new dimension to MD data. The time and atom dimension are already established. Two frames can be appended along the time axis to get a trajectory with multiple frames. If they are appended along the atom axis, the new frame contains the atoms of these two. The trajectory works in a similar fashion. Adding two trajectories along the trajectory axis returns a trajectory ensemble, represented as a TrajEnsemble class in this package.

class TrajEnsemble(trajs, tops=None, backend='no_load', common_str=None, basename_fn=None, traj_nums=None, custom_top=None)[source]#

Bases: object

A fancy list of single trajectories. Topologies can be different across trajs.

Check out http://statisticalbiophysicsblog.org/?p=92 for why trajectory ensembles are awesome.

This class is a fancy list of encodermap.trajinfo.info_single.SingleTraj`. Trajectories can have different topologies and will be grouped by the common_str argument. Each trajectory has its own unique traj_num, which identifies it in the ensemble - even when the ensemble is sliced or subsampled.

Examples

>>> import encodermap as em
>>> traj1 = em.SingleTraj.from_pdb_id("1YUG")
>>> traj2 = em.SingleTraj.from_pdb_id("1YUF")

Addition of two encodermap.trajinfo.info_single.SingleTraj also creates an ensemble.

>>> trajs = traj1 + traj2
>>> trajs  
<encodermap.TrajEnsemble object. Current backend is no_load. Containing 2 trajectories. Common str is ['1YUG', '1YUF']. Not containing any CVs...>

Indexing a TrajEnsemble returns a encodermap.trajinfo.info_single.SingleTraj based on its 0-based index. Think of the TrajEnsmeble as a list of encodermap.trajinfo.info_single.SingleTraj. But trajectories can also have traj_nums, which do not have to adhere to [0, 1, 2, ...]. This is similar to how a pandas.DataFrame offers indexing via .loc[] and .iloc[] (https://pandas.pydata.org/docs/user_guide/indexing.html#different-choices-for-indexing). For indexing trajs based on their traj_num, you can use the .tsel[] accessor of the TrajEnsmeble

Examples

>>> import encodermap as em
>>> traj1 = em.SingleTraj.from_pdb_id("1YUG")
>>> traj2 = em.SingleTraj.from_pdb_id("1YUF")

Addition of two SingleTraj also creates an ensemble.

>>> trajs = traj1 + traj2
>>> trajs.traj_nums
[0, 1]

Change the traj_num of traj2

>>> trajs[1].traj_num = 4
>>> trajs.traj_nums
[0, 4]
>>> trajs[1]  
<encodermap.SingleTraj object. Currently not in memory. Basename is '1YUF'. Not containing any CVs. Common string is '1YUF'. Object at ...>
>>> trajs.tsel[4]  
<encodermap.SingleTraj object. Currently not in memory. Basename is '1YUF'. Not containing any CVs. Common string is '1YUF'. Object at ...>

TrajEnsemble supports fancy indexing. You can slice to your liking (trajs[::5] returns a TrajEnsemble object that only consideres every fifth frame). Besides indexing by slices and integers, you can pass a 2-dimensional numpy.ndarray. np.array([[0, 5], [1, 10], [5, 20]]) will return a TrajEnsemble object with frame 5 of trajectory 0, frame 10 of trajectory 1 and frame 20 of trajectory 5.

Examples

>>> import encodermap as em
>>> traj1 = em.SingleTraj.from_pdb_id("1YUG")
>>> traj2 = em.SingleTraj.from_pdb_id("1YUF")
>>> trajs = traj1 + traj2
>>> sel = trajs[[[0, 0], [0, 1], [0, 2], [1, 10]]]
>>> sel  
<encodermap.TrajEnsemble object. Current backend is no_load. Containing 4 frames and 2 trajectories. Common str is...>

The TrajEnsemble class also is an iterator to iterate over trajectores. Besides plain iteration, the TrajEnsmeble also offers alternate iterators. The itertrajs() iterator returns a two-tuple of traj_num and traj. The iterframes() iterator returns a three-tuple of traj_num, frame_num, and traj.

Examples

>>> import encodermap as em
>>> traj1 = em.SingleTraj.from_pdb_id("1YUG")
>>> traj2 = em.SingleTraj.from_pdb_id("1YUF")
>>> trajs = traj1 + traj2
>>> trajs[1].traj_num = 4
>>> for traj_num, traj in trajs.itertrajs():
...     print(traj_num, traj.n_frames)
0 15
4 16
>>> for traj_num, frame_num ,traj in trajs.subsample(10).iterframes():
...     print(traj_num, frame_num, traj.n_frames)
0 0 1
0 10 1
4 0 1
4 10 1

The TrajEnsemble has multiple alternative constructors. The with_overwrite_trajnums constructor fixes inhomogeneous sequences of encodermap.trajinfo.info_single.SingleTraj and TrajEnsemble.

Examples

>>> import encodermap as em
>>> traj1 = em.SingleTraj.from_pdb_id("1YUG", traj_num=0)
>>> traj2 = em.SingleTraj.from_pdb_id("1YUF", traj_num=0)
>>> trajs = em.TrajEnsemble([traj1, traj2])  
Traceback (most recent call last):
    ...
Exception: The `traj_num` attributes of the provided 2 `SingleTraj`s is not unique, the `traj_num` 0 occurs 2 times. This can happen, if you use `SingleTraj`s, that are already part of a `TrajEnsemble`. To create copies of the `SingleTraj`s and overwrite their `traj_num`s, use the `with_overwrite_trajnums()` constructor.
>>> trajs = em.TrajEnsemble.with_overwrite_trajnums(traj1, traj2)
>>> trajs  
<encodermap.TrajEnsemble...>

The from_dataset constructor can be used to load an ensemble from an .h5 file

Examples

>>> import encodermap as em
>>> from tempfile import TemporaryDirectory
>>> traj1 = em.SingleTraj.from_pdb_id("1YUG")
>>> traj2 = em.SingleTraj.from_pdb_id("1YUF")
>>> trajs = em.TrajEnsemble([traj1, traj2])
>>> with TemporaryDirectory() as td:
...     trajs.save(td + "/trajs.h5")
...     new = em.TrajEnsemble.from_dataset(td + "/trajs.h5")
...     print(new)  
encodermap.TrajEnsemble object. Current backend is no_load. Containing 2 trajectories. Common str is...Not containing any CVs.

Parameters:

trajs (Union[Sequence[str], Sequence[Path], Sequence[md.Trajectory], Sequence[SingleTraj]])
tops (Union[None, Sequence[str], Sequence[Path]])
backend (Literal['mdtraj', 'no_load'])
common_str (Optional[Sequence[str]])
basename_fn (Optional[Callable[[str], str]])
traj_nums (Optional[Sequence[int]])
custom_top (Optional[CustomAAsDict])

CVs#

The collective variables of the SingleTraj classes. Only CVs with matching names in all SingleTraj classes are returned. The data is stacked along a hypothetical time axis along the trajs.

Type:: dict[str, np.ndarray]

_CVs#

The same data as in CVs but with labels. Additionally, the xarray is not stacked along the time axis. It contains an extra dimension for trajectories.

Type:: xarray.Dataset

n_trajs#

Number of individual trajectories in this class.

Type:: int

n_frames#

Number of frames, sum over all trajectories.

Type:: int

locations#

A list with the locations of the trajectories.

Type:: list[str]

top#

A list with the reference pdb for each trajecotry.

Type:: list[mdtraj.Topology]

basenames#

A list with the names of the trajecotries. The leading path and the file extension is omitted.

Type:: list[str]

name_arr#

An array with len(name_arr) == n_frames. This array keeps track of each frame in this object by identifying each frame with a filename. This can be useful, when frames are mixed inside a TrajEnsemble class.

Type:: np.ndarray

property CVs: dict[str, ndarray]#

Returns dict of CVs in SingleTraj classes. Only CVs with the same names in all SingleTraj classes are loaded.

Type:: dict

property CVs_in_file: bool#

Is true, if CVs can be loaded from file. Can be used to build a data generator from.

Type:: bool

property _CVs: Dataset#

Returns x-array Dataset of matching CVs. stacked along the trajectory-axis.

Type:: xarray.Dataset

property basenames: list[str]#

List of the basenames in the Info single classes.

Type:: list

batch_iterator(batch_size: int, replace: bool = False, CV_names: tuple[str] = ('',), deterministic: bool = True, yield_index: bool = True, start: int = 1) → Iterator[tuple[ndarray, ndarray]][source]#

batch_iterator(batch_size: int, replace: bool = False, CV_names: tuple[str] = ('',), deterministic: bool = True, yield_index: bool = False, start: int = 1) → Iterator[ndarray]

batch_iterator(batch_size: int, replace: bool = False, CV_names: Sequence[str] | None = None, deterministic: bool = True, yield_index: bool = True, start: int = 1) → Iterator[tuple[ndarray, tuple[ndarray, ndarray, ndarray, ndarray, ndarray]]]

batch_iterator(batch_size: int, replace: bool = False, CV_names: Sequence[str] | None = None, deterministic: bool = True, yield_index: bool = False, start: int = 1) → Iterator[tuple[ndarray, ndarray, ndarray, ndarray, ndarray]]

Lazy batched iterator of CV data.

This iterator extracts batches of CV data from the ensemble. If the ensemble is a large HDF5 datset, this provides the ability to use all data without loading it all into memory.

Examples

Import EncoderMap and load some example trajectories.

>>> import encodermap as em
>>> trajs = em.TrajEnsemble(
...     [
...         'https://files.rcsb.org/view/1YUG.pdb',
...         'https://files.rcsb.org/view/1YUF.pdb'
...     ]
... )

This iterator will yield new samples forever. The batch is a tuple of numpy.ndarray.

>>> for batch in trajs.batch_iterator(batch_size=2):
...     print([b.shape for b in batch])
...     break
[(2, 148), (2, 147), (2, 150, 3), (2, 149), (2, 82)]

Use it with Python’s builtin next() function. The deterministic flag returns deterministic batches. The yield_index flag also provides the index of the extracted batch. In this example, both batches are extracted from the 1YUG trajectory (traj_num==0).

>>> iterator = trajs.batch_iterator(deterministic=True, batch_size=2, yield_index=True)
>>> index, batch = next(iterator)
>>> index
[[0 5]
 [0 8]]
>>> index, batch = next(iterator)
>>> index
[[ 0  3]
 [ 0 10]]

If a single string is requested for CV_names, the batch, will be a sinlge numpy.ndarray, rather than a tuple thereof.
>>> iterator = trajs.batch_iterator(batch_size=2, CV_names=["central_dihedrals"])

>>> batch = next(iterator)
>>> batch.shape
(2, 147)

Parameters:

batch_size (int) – The size of the batch.
replace (bool) – Whether inside a single batch a sample can occur more than once. Set to False (default) to only allow unique samples in a batch.
CV_names (Sequence[str]) – The names of the CVs to be used in the iterator. If a list/tuple with a single string is provided, the batch will be a numpy.ndarray, rather than a tuple thereof.
deterministic (bbol) – Whether the samples should be deterministic.
yield_index (bool) – Whether to also yield the index of the extracted samples.
start (int) – A start ineteger, which can be used together with deterministic=True to get different deterministic datasets.

Returns:

Different iterators based on chosen arguments.

Return type:

Iterator[Any]

cluster(cluster_id, col='cluster_membership', memberships=None, n_points=-1, overwrite=True)[source]#

Clusters this TrajEnsemble based on the provided cluster_id and col.

With ‘clustering’ we mean to extract a subset given a certain membership. Take two trajectories with 3 frames each as an ensemble. Let’s say we calculate the end-to-end distance of the trajectories and use it as a collective variable of the system. The values are [0.8, 1.3, 1.2, 1.9, 0.2, 1.3]. Based on these values, we define a boolean CV (using 0 as False and 1 as True) which says whether the end-to-end distance is smaller or grather than 1.0. We give this CV the name 'end_to_end_binary' and the values are [0, 1, 1, 1, 0, 1]. We can use this CV to ‘cluster’ the TrajEnsemble via:

cluster = trajs.cluster(cluster_id=0, col='end_to_end_binary'):
This gives a TrajEnsemble with 2 frames.

cluster = trajs.cluster(cluster_id=0, col='end_to_end_binary'):
This gives a TrajEnsemble with 4 frames.

Sometimes, you want to save this a cluster in a format that can be rendered by graphical programs (.xtc, .pdb), you can use either the join or stack method of the resulting :obj:``TrajEnsemble` to get a mdtraj.Trajectory, which is either stacked along the atom axis or joined along the time axis.

Note

If the resulting TrajEnsemble has inhomogeneous topologies, the join method will return a dict[md.Topology, md.Trajectory] instead. This dict can be used to save multiple (.xtc, .pdb) files and visualize your cluster in external programs.

The col parameter takes any CV name, that is per-frame and integer.

Parameters:

cluster_id (int) – The cluster id to use. Needs to be an integer, that is present in the col parameter.
col (str) – Which ‘column’ of the collective variables to use. Needs to be a key, that can be found in trajs.CVs.keys().
memberships (Optional[np.ndarray]) – If a numpy.ndarray is provided here, the memberships from this array will be used. In this case, the col argument will be unused.
n_points (int) – How many points the resulting cluster should contain. Subsamples the points in col == cluster_id evenly and without repeat. If set to -1, all points will be used.
overwrite (bool) – When the memberships argument is used, but the TrajEnsemble already has a CV under the name specified by col, you can set this to True to overwrite this column. Can be helpful, when you iteratively conduct multiple clusterings.

Return type:

TrajEnsemble

Examples

Import EncoderMap and NumPy.

>>> import encodermap as em
>>> import numpy as np

Load an example project.

>>> trajs = em.load_project("pASP_pGLU", load_autoencoder=False)

Create an array full of -1’s. These are the ‘outliers’.

>>> cluster_membership = np.ones(shape=(trajs.n_frames, )) * -1

Select the first 5 frames of every traj to be in cluster 0.

>>> cluster_membership[trajs.id[:, 1] < 5] = 0

Select all frames between 50 and 55 to be cluster 1.

>>> cluster_membership[(50 <= trajs.id[:, 1]) & (trajs.id[:, 1] <= 55)] = 1
>>> np.unique(cluster_membership)
array([-1.,  0.,  1.])

Load this array as a CV called 'clu_mem'.

>>> trajs.load_CVs(cluster_membership, attr_name='clu_mem')

Extract all of cluster 0 with n_points=-1.

>>> clu0 = trajs.cluster(0, "clu_mem")
>>> clu0.n_frames
35

Extract an evenly spaced subset of cluster 1 with 10 total points.

>>> clu1 = trajs.cluster(1, "clu_mem", n_points=10)
>>> clu1.n_frames
10

Cclusters with inhomogeneous topologies can be stacked along the atom axis.

>>> [t.n_atoms for t in trajs]
[69, 83, 103, 91, 80, 63, 73]
>>> stacked = clu1.stack()
>>> stacked.n_atoms
795

But joining the trajectories returns a dict[top, traj] if the topologies are inhomogeneous.

>>> joined = clu1.join()
>>> type(joined)
<class 'dict'>

copy()[source]#

dash_summary()[source]#

A pandas.DataFrame that summarizes this ensemble.

Returns:: The DataFrame.
Return type:: pd.DataFrame

del_CVs(CVs=None)[source]#

Deletes all CVs in all trajs. Does not affect the files.

Parameters:: CVs (Sequence[str] | None)
Return type:: None

del_featurizer()[source]#

Deletes the current instance of self.featurizer.

Return type:: None

property featurizer#

property frames: list[int]#

Frames of individual trajectories.

Type:: list

classmethod from_dataset(fname, basename_fn=None)[source]#

Parameters:

fname (str | Path)
basename_fn (Callable[[str], str] | None)

Return type:

TrajEnsemble

classmethod from_textfile(fname, basename_fn=None)[source]#

Creates a TrajEnsemble object from a textfile.

The textfile needs to be space-separated with two or three columns:

Column 1:
The trajectory file.
Column 2:
The corresponding topology file (If you are using .h5 trajs, column 1 and 2 will be identical, but column 2 needs to be there nonetheless).
Column 3:
The common string of the trajectory. This column can be left out, which will result in an TrajEnsemble without common strings.

Parameters:

fname (Union[str, Path]) – File to be read.
basename_fn (Union[None, Callable[[str], str]], optional) – A function to apply to the traj_file string to return the basename of the trajectory. If None is provided, the filename without extension will be used. When all files are named the same and the folder they’re in defines the name of the trajectory, you can supply lambda x: split('/')[-2] as this argument. Defaults to None.

Returns:

A TrajEnsemble instance.

Return type:

TrajEnsemble

get_single_frame(key)[source]#

Returns a single frame from all loaded trajectories.

Consider a TrajEnsemble class with two trajectories. One has 10 frames, the other 5 (trajs.n_frames is 15). Calling trajs.get_single_frame(12) is equal to calling trajs[1][1]. Calling trajs.get_single_frame(16) will error, and trajs.get_single_frame(1) is the same as trajs[0][1].

Parameters:: key (int) – The frame to return.
Returns:: The frame.
Return type:: encodermap.trajinfo.info_single.SingleTraj

property id: ndarray#

Duplication of self.index_arr

Type:: np.ndarray

property index_arr: ndarray#

Returns np.ndarray with ndim = 2. Clearly assigning every loaded frame an identifier of traj_num (self.index_arr[:,0]) and frame_num (self.index_arr[:,1]). Can be used to create an unspecified subset of frames and can be useful when used with clustering.

Type:: np.ndarray

iterframes()[source]#

Generator over the frames in this instance.

Yields:

tuple –

A tuple containing the following:

int: The traj_num
int: The frame_num
encodermap.SingleTraj: An SingleTraj object.

Return type:

Iterator[tuple[int, int, SingleTraj]]

Examples

Import EncoderMap and load an example TrajEnsemble.

>>> import encodermap as em
>>> trajs = em.TrajEnsemble(
...     [
...         'https://files.rcsb.org/view/1YUG.pdb',
...         'https://files.rcsb.org/view/1YUF.pdb',
...     ],
... )
>>> print(trajs.n_frames)
31

Subsample every tenth frame.

>>> trajs = trajs.subsample(10)
>>> trajs.n_frames
4

Call the iterframes method.

>>> for traj_num, frame_num, frame in trajs.iterframes():
...     print(traj_num, frame_num, frame.n_frames)
0 0 1
0 10 1
1 0 1
1 10 1

itertrajs()[source]#

Generator over the SingleTraj classes.

Yields:

tuple –

A tuple containing the following:

int: A loop-counter integer. Is identical with traj.traj_num.
encodermap.SingleTraj: An SingleTraj object.

Return type:

Iterator[tuple[int, SingleTraj]]

Examples

>>> import encodermap as em
>>> trajs = em.TrajEnsemble(
...     [
...         'https://files.rcsb.org/view/1YUG.pdb',
...         'https://files.rcsb.org/view/1YUF.pdb'
...     ]
... )
>>> for i, traj in trajs.itertrajs():
...     print(traj.basename)
1YUG
1YUF

join(align_string='name CA', superpose=True, ref_align_string='name CA', base_traj=None, progbar=None, dict_keys='top')[source]#

Parameters:

align_string (str)
superpose (bool)
ref_align_string (str)
base_traj (Trajectory | None)
progbar (Any | None)
dict_keys (Literal['top', 'cs'])

Return type:

dict[Topology | str, Trajectory]

load_CVs(data=None, attr_name=None, cols=None, deg=None, periodic=True, labels=None, directory=None, ensemble=False, override=False, custom_aas=None, alignment=None)[source]#

Loads CVs in various ways. The easiest way is to provide a single numpy.ndarray and a name for that array.

Besides np.ndarray, files (.txt and .npy) can be loaded. Features or Featurizers can be provided. A xarray.Dataset can be provided. A str can be provided which either is the name of one of EncoderMap’s features (encodermap.features) or the string can be ‘all’, which loads all features required for EncoderMap’s encodermap.autoencoder.autoencoder`AngleDihedralCartesianEncoderMap.

Parameters:

data (Optional[TrajEnsembleFeatureType]) – The CV to load. When a numpy.ndarray is provided, it needs to have a shape matching n_frames and the data will be distributed to the trajs, When a list of files is provided, len(data) (the files) needs to match n_trajs. The first file will be loaded by the first traj (based on the traj’s traj_num) and so on. If a list of numpy.ndarray is provided, the first array will be assigned to the first traj (based on the traj’s traj_num). If None is provided, the argument directory will be used to construct a str using this expression fname = directory + traj.basename + '_' + attr_name. If there are .txt or .npy files matching that string in the directory, the CVs will be loaded from these files to the corresponding trajs. Defaults to None.
attr_name (Optional[str]) – The name under which the CV should be found in the class. Choose whatever you like. 'highd', 'lowd', 'dists', etc. The CV can then be accessed via dot-notation: trajs.attr_name. Defaults to None, in which case, the argument data should point to existing files. The attr_name will be extracted from these files.
cols (Optional[list[int]]) –
A list of integers indexing the columns of the data to be loaded. This is useful if a file contains columns which are not features (i.e. an indexer or the error of the features. eg:
```
id   f1    f2    f1_err    f2_err
0    1.0   2.0   0.1       0.1
1    2.5   1.2   0.11      0.52
```
In that case, you would want to supply cols=[1, 2] to the cols argument. If None is provided all columns are loaded. Defaults to None.
deg (Optional[bool]) – Whether to return angular CVs using degrees. If None or False, CVs will be in radian. Defaults to None.
periodic (bool) – Whether to use the minimum image convention to calculate distances/angles/dihedrals. This is generally recommended, when you don’t clean up your trajectories and the proteins break over the periodic boundary conditions. However, when the protein is large, the distance between one site and another might be shorter through the periodic boundary. This can lead to wrong results in your distance calculations.
labels (list[str]) – A list containing the labels for the dimensions of the data. If you provide a numpy.ndarray with shape (n_trajs, n_frames, n_feat), this list needs to be of len(n_feat). An exception will be raised otherwise. If None is privided, the labels will be automatically generated. Defaults to None.
directory (Optional[str]) – If this argument is provided, the directory will be searched for .txt or .npy files which have the same names as the trajectories have basenames. The CVs will then be loaded from these files.
ensemble (bool) – Whether the trajs in this class belong to an ensemble. This implies that they contain either the same topology or are very similar (think wt, and mutant). Setting this option True will try to match the CVs of the trajs onto the same dataset. If a VAL residue has been replaced by LYS in the mutant, the number of sidechain dihedrals will increase. The CVs of the trajs with VAL will thus contain some NaN values. Defaults to False.
override (bool) – Whether to override CVs with the same name as attr_name.
custom_aas (Optional[CustomAAsDict]) – You can provide non-standard residue definitions in this argument. See encodermap.trajinfo.trajinfo_utils.CustomTopology for information how to use the custom_aas argument. If set to None (default), only standard residue names are assumed.
alignment (Optional[str]) – If your proteins have similar but different sequences, you can provide a CLUSTAL W alignment as this argument and the featurization will align the features accordingly.

Raises:

TypeError – When wrong Type has been provided for data.

Return type:

None

load_custom_topology(custom_top=None)[source]#

Loads a custom_topology from a CustomTopology class or a dict.

See also

CustomTopology

Parameters:: custom_top (CustomTopology | dict[str | tuple[str, str], None | tuple[str, None] | tuple[str, dict[Literal['bonds', 'optional_bonds', 'delete_bonds', 'optional_delete_bonds', 'PHI', 'PSI', 'OMEGA', 'not_PHI', 'not_PSI', 'not_OMEGA', 'CHI1', 'CHI2', 'CHI3', 'CHI4', 'CHI5'], list[str] | list[tuple[str | int, str | int]]]]] | None) – Optional[Union[CustomTopology, CustomAAsDict]]: An instance of the CustomTopology class or a dictionary that can be made into such.
Return type:: None

load_trajs()[source]#

Loads all trajs in self.

Return type:: None

property locations: list[str]#

Duplication of self.traj_files but using the trajs own traj_file attribute. Ensures that traj files are always returned independent of the current load state.

Type:: list

property n_frames: int#

Sum of the loaded frames.

Type:: int

property n_residues: int#

List of number of residues of the SingleTraj classes

Type:: list

property n_trajs: int#

Number of trajectories in this ensemble.

Type:: int

property name_arr: ndarray#

Trajectory names with the same length as self.n_frames.

Type:: np.ndarray

parse_clustal_w_alignment(aln)[source]#

Parse an alignment in ClustalW format and add the info to the trajectories.

Parameters:: aln (str) – The alignment in ClustalW format.
Return type:: None

save(fname, CVs='all', overwrite=False, only_top=False)[source]#

Saves this TrajEnsemble into a single .h5 file.

Parameters:

fname (Union[str, Path]) – Where to save the file.
CVs (Union[Literal["all"], list[str], Literal[False]]) – Which CVs to alos store in the file. If set to 'all', all CVs will be saved. Otherwise, a list[str] can be provided to only save specific CVs. Can also be set to False, no CVs are stored in the file.
overwrite (bool) – If the file exists, it is overwritten.
only_top (bool) – Only writes the trajectorie’s topologies into the file.

Raises:

IOError – If file already exists and overwrite is not True.

Return type:

None

save_CVs(path)[source]#

Saves the CVs to a NETCDF file using xarray.

Parameters:: path (str | Path)
Return type:: None

sidechain_info()[source]#

Indices used for the AngleDihedralCartesianEncoderMap class to allow training with multiple different sidechains.

Returns:: The indices. The key ‘-1’ is used for the hypothetical convex hull of all feature spaces (the output of the tensorflow model). The other keys match the common_str of the trajs.
Return type:: dict[str, Sequence[int]]
Raises:: Exception – When the common_strings and topologies are not aligned. An exception is raised. Aligned means that all trajs with the same common_str should possess the same topology.

split_into_frames(inplace=False)[source]#

Splits self into separate frames.

Parameters:: inplace (bool) – Whether to do the split inplace or not. Defaults to False and thus, returns a new TrajEnsemble class.
Return type:: None

stack(align_string='name CA', superpose=True, ref_align_string='name CA', base_traj=None, progbar=None)[source]#

Parameters:

align_string (str)
superpose (bool)
ref_align_string (str)
base_traj (Trajectory | None)
progbar (Any | None)

Return type:

Trajectory

subsample(stride=None, total=None)[source]#

Returns a subset of this TrajEnsemble given the provided stride or total.

This is a faster alternative than using the trajs[trajs.index_arr[::1000]] when HDF5 trajs are used, because the slicing information is saved in the respective encodermap.trajinfo.info_single.SingleTraj

and loading of single frames is faster in HDF5 formatted trajs.

Parameters:

stride (Optional[int]) – Return a frame ever stride frames.
total (Optional[int]) – Return a total of evenly sampled frames.

Returns:

A trajectory ensemble.

Return type:

TrajEnsemble

Note

The result from subsample(1000)` `is different from ``trajs[trajs.index_arr[::1000]]. With subsample every trajectory is sub-sampled independently. Consider a TrajEnsemble with two encodermap.trajinfo.info_single.SingleTraj trajectories with 18 frames each. subsampled = trajs.subsample(5) would return a TrajEnsemble with two trajs with 3 frames each (subsampled.n_frames == 6). Whereas, subsampled = trajs[trajs.index_arr[::5]] would return a TrajEnsemble with 7 SingleTrajs with 1 frame each (subsampled.n_frames == 7). Because the time and frame numbers are saved all the time, this should not be too much of a problem.

tf_dataset(batch_size, replace=False, sidechains=False, reconstruct_sidechains=False, CV_names=None, deterministic=False, prefetch=True, start=1)[source]#

Parameters:

batch_size (int)
replace (bool)
sidechains (bool)
reconstruct_sidechains (bool)
CV_names (Optional[list[str]])
deterministic (bool)
prefetch (bool)
start (int)

Return type:

tf.data.Dataset

to_alignment_query()[source]#

A string, that cen be put into sequence alignment software.

Return type:: str

to_dataframe(CV)[source]#

Parameters:: CV (str | Sequence[str])
Return type:: DataFrame

property top: list[Topology]#

Returns a minimal set of mdtraj.Topologies.

If all trajectories share the same topology a list with len 1 will be returned.

Type:: list

property top_files: list[str]#

Returns minimal set of topology files.

If yoy want a list of top files with the same length as self.trajs use self._top_files and self._traj_files.

Type:: list

property traj_files: list[str]#

A list of the traj_files of the individual SingleTraj classes.

Type:: list

property traj_joined: Trajectory#

Returns a mdtraj Trajectory with every frame of this class appended along the time axis.

Can also work if different topologies (with the same number of atoms) are loaded. In that case, the first frame in self will be used as topology parent and the remaining frames’ xyz coordinates are used to position the parents’ atoms accordingly.

Examples

>>> import encodermap as em
>>> trajs = em.load_project("pASP_pGLU")
>>> subsample = trajs[0][:20] + trajs[1][:20]
>>> subsample.split_into_frames().traj_joined  
<mdtraj.Trajectory with 40 frames, 69 atoms, 6 residues, and unitcells at ...>

Type:: mdtraj.Trajectory

property traj_nums: list[int]#

Number of info single classes in self.

Type:: list

property trajs_by_common_str: dict[None | str, TrajEnsemble]#

Returns the trajs in self ordered by top.

If all trajectories share the same common_str, a dict with one key will be returned. As the common_str can be None, None can also occur as a key in this dict.

Type:: dict[str, TrajEnsemble]

property trajs_by_top: dict[Topology, TrajEnsemble]#

Returns the trajs in self ordered by top.

If all trajectories share the same topology, a dict with one key will be returned.

Type:: dict[md.Topology, TrajEnsemble]

property trajs_by_traj_num: dict[int, SingleTraj]#

property tsel#

unload()[source]#

Unloads all trajs in self.

Return type:: None

classmethod with_overwrite_trajnums(*trajs)[source]#

Creates a TrajEnsemble by copying the provided encodermap.trajinfo.info_single.SingleTraj instances and changing their traj_num attribute to adhere to [0, 1, 2, ...].

Parameters:: trajs (Sequence[SingleTraj]) – The sequence of trajs.
Returns:: A TrajEnsemble instance.
Return type:: TrajEnsemble

property xyz: ndarray#

xyz coordinates of all atoms stacked along the traj-time axis.

Only works if all trajs share the same topology.

Type:: np.ndarray

encodermap.trajinfo.info_single module#

Classes to work with ensembles of trajectories.

The statistics of a protein can be better described by an ensemble of proteins, rather than a single long trajectory. Treating a protein in such a way opens great possibilities and changes the way one can treat molecular dynamics data. Trajectory ensembles allow:

Faster convergence via adaptive sampling.

Better anomaly detection of unique structural states.

This subpackage contains two classes which are containers of trajectory data. The SingleTraj trajectory contains information about a single trajectory. The TrajEnsemble class contains information about multiple trajectories. This adds a new dimension to MD data. The time and atom dimension are already established. Two frames can be appended along the time axis to get a trajectory with multiple frames. If they are appended along the atom axis, the new frame contains the atoms of these two. The trajectory works in a similar fashion. Adding two trajectories along the trajectory axis returns a trajectory ensemble, represented as a TrajEnsemble class in this package.

class SingleTraj(traj, top=None, common_str='', backend='no_load', index=None, traj_num=None, basename_fn=None, custom_top=None)[source]#

Bases: object

This class contains the info about a single trajectory.

This class contains many of the attributes and methods of mdtraj.Trajectory. It is meant to be used as a standalone single trajetcory or in an ensemble defined in the encodermap.trajinfo.info_all.TrajEnsemble class. Other than the standard mdtraj.Trajectory this class loads the MD data only when needed. The location of the file(s) and other attributes like indices (int, list[int], numpy.ndarray, slice) are stored until the traj is accessed via the SingleTraj.traj attribute. The returned traj is a mdtraj.Trajectory with the correct number of frames in the correct sequence.

Besides MD data, this class keeps track of your collective variables. Oftentimes the raw xyz data of a trajectory is not needed for understanding the conformation and suitable CVs are selected to represent a protein via internal coordinates (torsions, pairwise distances, etc.). This class keeps tack of your CVs. Whether you call them 'highd' or 'torsions', this class keeps track of everything and returns the values when you need them.

SingleTraj supports fancy indexing, so you can extract one or more frames from a Trajectory as a separate trajectory. For example, to form a trajectory with every other frame, you can slice with traj[::2].

Note

SingleTraj uses the nanometer, degree & picosecond unit system.

Parameters:

traj (Union[str, Path, md.Trajectory])
top (Optional[Union[str, Path]])
common_str (str)
backend (Literal['no_load', 'mdtraj'])
index (Optional[Union[int, list[int], np.ndarray, slice]])
traj_num (Optional[int])
basename_fn (Optional[Callable[[str], str]])
custom_top (Optional[CustomAAsDict])

backend#

Current state of loading. If backend == 'no_load' xyz data will be loaded from disk, if accessed. If backend == 'mdtraj', the data is already in RAM.

Type:: str

common_str#

Substring of traj_file and top_file. Used to group multiple trajectory and topology files. If traj files=['protein1_traj1.xtc', 'protein1_traj2.xtc'] have the same topolgy stored in a file called 'protein1.pdb', you can load them with common_str='protein1' together with more .xtc and .pdb files and these two .xtc files will use the correct .pdb file.

Type:: str

index#

A sequence of fancy slices of the trajectory. When file is loaded from disk, the fancy indexes will be applied one after the other.

Type:: Sequence[Union[None, int, list, numpy.ndarray, slice]]

traj_num#

Integer to identify a SingleTraj class in a TrajEnsemble class.

Type:: int

traj_file#

Trajectory file used to create this class.

Type:: str

top_file#

Topology file used to create this class. If a .h5 trajectory was used traj_file and top_file are identical. If a mdtraj.Trajectory was used to create SingleTraj, these strings are empty.

Type:: str

Examples

Load a pdb file with 14 frames from rcsb.org

>>> import encodermap as em
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")
>>> traj  
<encodermap.SingleTraj object...
>>> traj.n_frames
14

Providing common_str sets this attribute.

>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb", common_str="1GHC")
>>> traj.common_str
'1GHC'

Indexing using integers returns a SingleTraj with only one frame.

>>> frame = traj[5]
>>> frame.n_frames
1

Indexing can also use lists of integers.

>>> subset = traj[[0, 1, 5]]
>>> subset.n_frames
3

Further indexing this subset uses the current trajectory ‘as is’. Although frame 0, 1, and 5 have been extracted from traj, we get frame 5 from subset by indexing with 2.

>>> frame = subset[2]
>>> frame.id
array([5])

Indexing using the original frame indices from the file is done using the fsel[] accessor.

>>> frame = subset.fsel[5]
>>> frame.id
array([5])

Advanced slicing

>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")[-1:7:-2]
>>> [frame.id[0] for frame in traj]
[13, 11, 9]

The traj_num argument is mainly used in encodermap.TrajEnsemble, but can be provided manually.

>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb", traj_num=2)
>>> traj.traj_num
2

The argument basename_fn should be a callable, that takes a string and returns a string.

>>> from pathlib import Path
>>> def my_basename_fn(filename):
...     stem = str(Path(filename).stem)
...     return "custom_" + stem
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb", basename_fn=my_basename_fn)
>>> traj.basename
'custom_1GHC'

Build a trajectory ensemble from multiple SingleTraj objects.

>>> traj1 = em.SingleTraj("https://files.rcsb.org/view/1YUG.pdb")
>>> traj2 = em.SingleTraj("https://files.rcsb.org/view/1YUF.pdb")
>>> trajs = traj1 + traj2
>>> print(trajs.n_trajs, trajs.n_frames, [traj.n_frames for traj in trajs])
2 31 [15, 16]

property CVs: dict[str, ndarray]#

Returns a simple dict from the more complicated self._CVs xarray Dataset.

If self._CVs is empty and self.traj_file is a HDF5 (.h5) file, the contents of the HDF5 will be checked, whether CVs have been stored there. If not and empty dict will be returned.

Type:: dict[str, numpy.ndarray]

property CVs_in_file: bool#

Is True, if traj_file has exyension .h5 and contains CVs.

Type:: bool

property _frames: ndarray#

Applies self.index over self._orig_frames.

Type:: numpy.ndarray

_mdtraj_attr = ['n_frames', 'n_atoms', 'n_chains', 'n_residues', 'openmm_boxes', 'openmm_positions', 'time', 'timestep', 'xyz', 'unitcell_vectors', 'unitcell_lengths', 'unitcell_angles', '_check_valid_unitcell', '_distance_unit', '_have_unitcell', '_rmsd_traces', '_savers', '_string_summary_basic', '_time', '_time_default_to_arange', '_topology', '_unitcell_angles', '_unitcell_lengths', '_xyz']#

property _n_frames_base_h5_file: int#

Can be used to get n_frames without loading an HDF5 into memory.

Type:: int

property _original_frame_indices: ndarray#

If trajectory has not been loaded, it is loaded and the frames of the trajectory file on disk are put into a np.arange(). If the trajectory is sliced in weird ways, this array tracks the original frames.

Type:: numpy.ndarray

property _traj#: Needs to be here to complete setter. Not returning anything, because setter is also not returning anything.

atom_slice(atom_indices, invert=False)[source]#

Deletes atoms from this SingleTraj instance.

Parameters:

atom_indices (Union[list, numpy.ndarray]) – The indices of the atoms to keep.
invert (bool) – If False, it is assumed, that the atoms in atom_indices are the ones to be kept. If True, the atoms in atom_indices are the ones to be removed.

Return type:

None

property basename: str#

Basename is the filename without path and without extension. If basename_fn is not None, it will be applied to traj_file.

Type:: str

copy()[source]#

Returns a copy of self.

Return type:: SingleTraj

dash_summary()[source]#

Returns a pandas.DataFrame with useful information about this instance.

Returns:: The dataframe.
Return type:: pd.DataFrame

del_CVs()[source]#

Resets the _CVs attribute to an empty xarray.Dataset.

Return type:: None

property extension: str#

Extension is the file extension of the trajectory file (self.traj_file).

Type:: str

property featurizer#

classmethod from_pdb_id(pdb_id, traj_num=None)[source]#

Alternate constructor for the TrajEnsemble class.

Builds an SingleTraj class from a pdb-id.

Parameters:

pdb_id (str) – The 4-letter pdb id.
traj_num (int | None)

Returns:

An SingleTraj class.

Return type:

SingleTraj

property fsel#

get_single_frame(key)[source]#

Returns a single frame from the trajectory.

Parameters:: key (Union[int, np.int]) – Index of the frame.
Return type:: SingleTraj

Examples

Import EncoderMap and load SingleTraj.

>>> import encodermap as em
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")
>>> traj.n_frames
14

Load the same traj and give it a traj_num for recognition in a set of multiple trajectories.

>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb", traj_num=5)
>>> frame = traj.get_single_frame(2)
>>> frame.id
array([[5, 2]])

property id: ndarray#

id is an array of unique identifiers which identify the frames in this SingleTraj object when multiple Trajectories are considered.

If the traj was initialized from an TrajEnsemble class, the traj gets a unique identifier (traj_num) which will also be put into the id array, so that id can have two shapes ((n_frames, ), (n_frames, 2)) This corresponds to self.id.ndim = 1 and self.id.ndim = 2. In the latter case self.id[:,1] are the frames and self.id[:,0] is an array full of traj_num.

Type:: numpy.ndarray

property indices_chi1: ndarray#

A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.

Type:: numpy.ndarray

property indices_chi2: ndarray#

A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.

Type:: numpy.ndarray

property indices_chi3: ndarray#

A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.

Type:: numpy.ndarray

property indices_chi4: ndarray#

A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.

Type:: numpy.ndarray

property indices_chi5: ndarray#

A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.

Type:: numpy.ndarray

property indices_omega: ndarray#

A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.

Type:: numpy.ndarray

property indices_phi: ndarray#

A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.

Type:: numpy.ndarray

property indices_psi: ndarray#

A numpy array with shape (n_dihedrals, 4) indexing the atoms that take part in this dihedral angle. This index is 0-based.

Type:: numpy.ndarray

iterframes(with_traj_num: bool = False) → Iterable[tuple[int, SingleTraj]][source]#

iterframes(with_traj_num: bool = True) → Iterable[tuple[int, int, SingleTraj]]

Iterator over the frames in this class.

Parameters:

with_traj_num (bool) – Whether to return a three-tuple of traj_num, frame_num, frame (True) or just traj_num, frame (False).

Yields:

tuple –

A tuple containing the following:

int: The traj_num.
int: The frame_num.
encodermap.SingleTraj: An SingleTraj object.

Examples

Import EncoderMap and create SingleTraj instance.

>>> import encodermap as em
>>> traj = em.SingleTraj('https://files.rcsb.org/view/1YUG.pdb')
>>> traj.n_frames
15

Slicing the trajectory every 5th frame

>>> traj = traj[::5]
>>> traj.n_frames
3

Using the iterframes() iterator.

>>> for frame_num, frame in traj.iterframes():
...     print(frame_num, frame.n_frames)
0 1
5 1
10 1

join(other)[source]#

Join two trajectories together along the time/frame axis.

Note

Returns a mdtraj.Trajectory and thus loses CVs, filenames, etc.

Parameters:: other (SingleTraj | Trajectory)
Return type:: Trajectory

load_CV(data, attr_name=None, cols=None, deg=None, periodic=True, labels=None, override=False)[source]#

Load CVs into traj. Many options are possible. Provide xarray, numpy array, em.loading.feature, em.featurizer, and even string!

This method loads CVs into the SingleTraj instance. Many ways of doing so are available:

numpy.ndarray: The easiest way. Provide a np array and a name for
the array, and the data will be saved as an instance variable, accesible via SingleTraj.name.

xarray.DataArray: You can load a multidimensional xarray as
data into the class. Please refer to xarrays own documentation if you want to create one yourself.

xarray.Dataset: You can add another dataset to the existing _CVs.

encodermap.loading.features.Feature: If you provide one of the
features from encodermap.loading.features the resulting features will be loaded and also be placed under the set name.

encodermap.loading.featurizer.Featurizer: If you provide a
full featurizer, the data will be generated and be accessible as an attribute.

str: If a string is provided, the data will be loaded from a
.txt, .npy, or NetCDF / HDF5 .nc file.

Parameters:

(Union[str (data) – em.loading.features.Feature, em.loading.featurizer.Featurizer]): The CV to load. Either as numpy.ndarray, xarray.DataArray, EncoderMap feature, or EncoderMap Featurizer.
numpy.ndarray – em.loading.features.Feature, em.loading.featurizer.Featurizer]): The CV to load. Either as numpy.ndarray, xarray.DataArray, EncoderMap feature, or EncoderMap Featurizer.
xr.DataArray – em.loading.features.Feature, em.loading.featurizer.Featurizer]): The CV to load. Either as numpy.ndarray, xarray.DataArray, EncoderMap feature, or EncoderMap Featurizer.
data (SingleTrajFeatureType)
attr_name (Optional[str])
cols (Optional[list[int]])
deg (Optional[bool])
periodic (bool)
labels (Optional[list[str]])
override (bool)

Return type:

None

:paramem.loading.features.Feature, em.loading.featurizer.Featurizer]):: The CV to load. Either as numpy.ndarray, xarray.DataArray, EncoderMap feature, or EncoderMap Featurizer.

Parameters:

attr_name (Optional[str]) – The name under which the CV should be found in the class. Is needed, if a raw numpy array is passed, otherwise the name will be generated from the filename (if data == str), the DataArray.name (if data == xarray.DataArray), or the feature name.
cols (Optional[list]) – A list specifying the columns to use it for the high-dimensional data. If your highD data contains (x,y,z,…)-errors or has an enumeration column at col=0 this can be used to remove this unwanted data.
deg (Optional[bool]) – Whether the provided data is in radians (False) or degree (True). It can also be None for non-angular data.
labels (Optional[Union[list, str]]) – If you want to label the data you provided, pass a list of str. If set to None, the features in this dimension will be labeled as [f"{attr_name.upper()} FEATURE {i}" for i in range(self.n_frames)]. If a str is provided, the features will be labeled as [f"{attr_name.upper()} {label.upper()} {i}" for i in range(self.n_frames)]. If a list of str is provided, it needs to have the same length as the traj has frames. Defaults to None.
override (bool) – Whether to overwrite existing CVs. The method will also print a message which CVs have been overwritten.
data (SingleTrajFeatureType)
periodic (bool)

Return type:

None

Examples

Import EncoderMap and load an example Trajectory.

>>> import encodermap as em
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")

Load the central dihedrals using ``data=’central_dihedrals’` as shortcut.

>>> traj.load_CV("central_dihedrals")
>>> traj.central_dihedrals.shape
(14, 222)

>>> traj._CVs['central_dihedrals']['CENTRAL_DIHEDRALS'].values[:2]
['CENTERDIH PSI   RESID  MET:   1 CHAIN 0'
 'CENTERDIH OMEGA RESID  MET:   1 CHAIN 0']

Slicing the SingleTraj keeps all CVs in order.

>>> import numpy as np
>>> from pathlib import Path
>>> traj1 = em.SingleTraj(
...     Path(em.__file__).parent.parent / "tests/data/1am7_corrected.xtc",
...     Path(em.__file__).parent.parent / "tests/data/1am7_protein.pdb",
... )
>>> traj1.load_CV(traj1.xyz[..., -1], 'z_coordinate')
...
>>> for i, frame in enumerate(traj1):
...     print(np.array_equal(frame.z_coordinate[0], frame.xyz[0, :, -1]))
...     if i == 3:
...         break
True
True
True
True

Raises:

FileNotFoundError – When the file given by data does not exist.
IOError – When the provided filename does not have .txt, .npy or .nc extension.
TypeError – When data does not match the specified input types.
Exception – When a numpy array has been passed as data and no attr_name has been provided.
Exception – When the provided attr_name is str, but cannot be a python identifier.

Parameters:

data (SingleTrajFeatureType)
attr_name (Optional[str])
cols (Optional[list[int]])
deg (Optional[bool])
periodic (bool)
labels (Optional[list[str]])
override (bool)

Return type:

None

load_custom_topology(custom_top=None)[source]#

Loads a custom_topology from a CustomTopology class or a dict.

Parameters:: custom_top (Optional[Union[CustomTopology, CustomAAsDict]]) – Optional[Union[CustomTopology, CustomAAsDict]]: An instance of encodermap.trajinfo.trajinfo_utils.CustomTopology or a dictionary that can be made into such.
Return type:: None

load_traj(new_backend='mdtraj')[source]#

Loads the trajectory, with a new specified backend.

After this is called the instance variable self.trajectory will contain a mdtraj Trajectory object.

Parameters:

new_backend (str, optional) –

Can either be:

'mdtraj' to load the trajectory using mdtraj.
'no_load' to not load the traj (unload).

Defaults to 'mdtraj'.

Return type:

None

property n_atoms: int#

Number of atoms in traj.

Loads the traj into memory if not in HDF5 file format. Be aware.

Type:: int

property n_chains: int#

Number of chains in traj.

Type:: int

property n_frames: int#

Number of frames in traj.

Loads the traj into memory if not in HDF5 file format. Be aware.

Type:: int

property n_residues: int#

Number of residues in traj.

Type:: int

save(fname, CVs='all', overwrite=False)[source]#

Save the trajectory as HDF5 file format to disk.

Parameters:

fname (str) – The filename.
CVs (Union[List, 'all'], optional) – Either provide a list of strings of the CVs you would like to save to disk, or set to ‘all’ to save all CVs. Defaults to [].
overwrite (bool, optional) – Whether force overwrite an existing file. Defaults to False.

Raises:

IOError – When the file already exists and overwrite is False.

Return type:

None

save_CV_as_numpy(attr_name, fname=None, overwrite=False)[source]#

Saves a specified collective variable of this traj as a .npy file.

This got its own method for parallelization purposes.

Parameters:

attr_name (str) – Name of the CV to save.
fname (str, optional) – Can be either
overwrite (bool, opt) – Whether to overwrite the file. Defaults to False.

Raises:

IOError – When the file already exists and overwrite is set to False.

Return type:

None

select(sel_str='all')[source]#

Execute a selection against the topology.

Parameters:: sel_str (str, optional) – What to select. Defaults to ‘all’.
Return type:: ndarray

Examples

>>> import encodermap as em
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")
>>> select = traj.top.select("name CA and resSeq 1")
>>> select
array([1])

>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")
>>> select = traj.top.select("name CA and resSeq 1")
>>> traj.top.atom(select[0])
MET1-CA

show_traj(gui=True)[source]#

Returns an nglview view object.

See also

https://nglviewer.org/nglview/latest/

Returns:: The nglview widget object.
Return type:: view (nglview.widget)
Parameters:: gui (bool)

sidechain_info()[source]#

Indices used for the AngleDihedralCartesianEncoderMap class to allow training with multiple different sidechains.

Returns:: The indices. The key ‘-1’ is used for the hypothetical convex hull of all feature spaces (the output of the tensorflow model). The other keys match the common_str of the trajs.
Return type:: dict[str, Sequence[int]]
Raises:: Exception – When the common_strings and topologies are not aligned. An exception is raised. Aligned means that all trajs with the same common_str should possess the same topology.

stack(other)[source]#

Stack two trajectories along the atom axis

Note

Returns a m``dtraj.Trajectory`` and thus loses CVs, filenames, etc.

Parameters:: other (SingleTraj)
Return type:: Trajectory

superpose(reference, frame=0, atom_indices=None, ref_atom_indices=None, parallel=True, inherit_CVs=False)[source]#

Superpose each conformation in this trajectory upon a reference

Parameters:

reference (Union[mdtraj.Trajectory, SingleTraj]) – The reference frame to align to. If the reference has multiple frames and you want to use a specific frame as reference, use the frame argument also.
frame (int, optional) – Align to this frame in reference. Default is 0.
atom_indices (Union[np.array, None], optional) – Indices in self, used to calculate RMS values. Defaults to None which means all atoms will be used.
ref_atom_indices (Union[np.array, None], optional) – Indices in reference, used to calculate RMS values. Defaults to None which means all atoms will be used.
parallel (bool, optional) – Use OpenMP to run the superposition in parallel over multiple cores.
inherit_CVs (bool, optional) – Whether to also inherit the CVs. This feature is currently not implemented. It would require additional code in all Feature classes discerning intrinsic (distance, angle, cluster_membership, etc.) or an extrinsic feature (absolute coordinate, COG position, etc.). Then this extrinsic/intrinsic boolean flag also needs to accompany the xarray Datasets, so that the intrinsic CVs can be inherited, and the extrinsic can be dropped with a corresponding message.

Returns:

A new trajectory with atoms aligned.

Return type:

SingleTraj

property top: Topology#

The structure of a Topology object is similar to that of a PDB file.

It consists. of a set of Chains (often but not always corresponding to polymer chains). Each Chain contains a set of Residues, and each Residue contains a set of Atoms. In addition, the Topology stores a list of which atom pairs are bonded to each other. Atom and residue names should follow the PDB 3.0 nomenclature for all molecules for which one exists

chains#

Iterate over chains.

Type:: generator

residues#

Iterate over residues.

Type:: generator

atoms#

Iterate over atoms.

Type:: generator

bonds#

Iterate over bonds.

Type:: generator

Type:: mdtraj.Topology

property top_file: str#

The topology file as a string (rather than a pathlib.Path).

Type:: str

property traj: Trajectory#

This attribute always returns an mdtraj.Trajectory. if backend is ‘no_load’, the trajectory will be loaded into memory and returned.

Type:: mdtraj.Trajectory

property traj_file: str#

The traj file as a string (rather than a pathlib.Path).

Type:: str

unload(CVs=False)[source]#

Clears up RAM by deleting the trajectory info and the CV data.

If CVs is set to True the loaded CVs will also be deleted.

Parameters:: CVs (bool, optional) – Whether to also delete CVs, defaults to False.
Return type:: None

encodermap.trajinfo.load_traj module#

Util functions for the TrajEnsemble and SingleTraj classes.

encodermap.trajinfo.trajinfo_utils module#

Util functions for the TrajEnsemble and SingleTraj classes.

class CustomTopology(*new_residues, traj=None)[source]#

Bases: object

Adds custom topology elements to a topology parsed by MDTraj.

Postpones parsing the custom AAs until requested.

The custom_aminoacids dictionary follows these styleguides:

The keys can be str or tuple[str, str]
If a key is str, it needs to be a 3-letter code (MET, ALA, GLY, …)
If a key is a tuple[str, str], the first str of the tuple is a common_str
(see the docstring for encodermap.TrajEnsemble to learn about common_str. This common_str can be used to apply custom topologies to an ensemble based on their common_str. For example:
{("CSR_mutant", "CSR"): ...}
A key can also affect only a single residue (not all resides called “CSR”).
For that, the 3-letter code of the residue needs to be postponed with a dash and the 1-based indexed resSeq of the residue:
{"CSR-2": ...}
The value to a key can be None, which means this residue will not be
used for building a topology. Because EncoderMap raises Exceptions, when it encounters unknown residues (to make sure, you don’t forget to featurize some important residues), it will also raise Exceptions when the topology contains unknown solvents/solutes. If you run a simulation in water/methanol mixtures with the residue names SOL and MOH, EncoderMap will raise an Exception upon encountering MOH, so your custom topology should contain 1{“MOH”: None}` to include MOH.
The value of a key can also be a tuple[str, Union[dict, None]]. In this
case, the first string should be the one-letter code of the residue or the residue most closely representing this residue. If you use phosphotyrosine (PTR) in your simulations and want to use it as a standard tyrosine residue, the custom topology should contain {“PTR”: (“Y”, None)}
If your residue is completely novel you need to define all possible
bonds, backbone and sidechain dihedrals yourself. For that, you want to provide a tuple[str, dict[str, Union[listr[str], list[int]]] type. This second level dict allows for the following keys: * bonds: For bonds between atoms. This key can contain a list[tuple[str, str]],

which defines bonds in this residue. This dict defines a bond between N and CA in phosphothreonine. {“PTR”: (“Y”, {

“bonds”: [
(“N”, “CA”),

],

}} These strings can cotain + and - signs to denote bonds to previous or following residues. To connect the residues MET1 to TPO2 to ALA3, you want to have this dict: {“TPO”: (“T”, {

“bonds”: [
(“-C”, “N”), # bond to MET1-C (“N”, “CA”), … (“C”, “+N”), # bond to ALA2-N

],

}} For exotic bonds, one of the strings can also be int to connect to any 0-based indexed atom in your topology. You can connect the residues CYS2 and CYS20 wit a sulfide bride like so: {“CYS-2”: (“C”, {

“bonds”: [
(“S”, 321), # connect to CYS20, the 321 is a placeholder

], },

“CYS-20”: (“C”, {

“bonds”: [
(20, “S”), # connect to CYS2

], },

}
- optional_bonds: This key accepts the same list[tuple] as ‘bonds’.
  However, bonds will raise an Exception if a bond already exists. The above example with a disulfide bridge between CYS2 and CYS20 will thus raise an exception. A better example is: {“CYS-2”: (“C”, {
  
  “optional_bonds”: [
  (“S”, 321), # connect to CYS20, the 321 is a placeholder
  
  ], },
  
  “CYS-20”: (“C”, {
  
  “optional_bonds”: [
  (20, “S”), # connect to CYS2
  
  ], },
  
  }
- delete_bonds: This key accepts the same list[tuple] as ‘bonds’,
  but will remove bonds. If a bond was marked for deletion, but does not exist in your topology, an Exception will be raised. To delete bonds, without raising an Exception, use:
- optional_delete_bonds: This will delete bonds, if they are present
  and won’t raise an Exception if no bond is present.
- PHI, PSI, OMEGA: These keys define the backbone torsions of this
  residue. You can just provide a list[str] for these keys. But the str can contain + and - to use atoms in previous or following residues. Example: {
  
  “CYS-2”: (
  “C”, {
  
  “PHI”: [“-C”, “N”, “CA”, “C”], “PSI”: [“N”, “CA”, “C”, “+N”], “OMEGA”: [“CA”, “C”, “+N”, “+CA”],
  
  },
  
  ),
  
  }
- not-PSI, not_OMEGA, not_PHI: Same as ‘PHI’, ‘PSI”, ‘OMEGA’, but
  will remove these dihedrals from consideration. The vales of these keys do not matter. Example: {
  
  “CYS-2”: (
  “C”, {
  
  “PHI”: [“-C”, “N”, “CA”, “C”], “not_PSI”: [], # value for not_* keys does not matter “not_OMEGA”: [], # it just makes EncoderMap skip these dihedrals.
  
  },
  
  ),
  
  }
- CHI1, …, CHI5: Finally, these keys define the atoms considered for
  the sidechain angles. If you want to add extra sidechain dihedrals for phosphothreonine, you can do: {
  
  “TPO”: (
  “T”, {
  
  “CHI2”: [“CA”, “CB”, “OG1”, “P”], # include phosphorus in sidechain angles “CHI3”: [“CB”, “OG1”, “P”, “OXT”], # include the terminal axygen in sidechain angles
  
  },
  
  )
  
  }

Examples

>>> # Aminoacids taken from https://www.swisssidechain.ch/
>>> # The provided .pdb file has only strange and unnatural aminoacids.
>>> # Its sequence is:
>>> # TPO - PTR - ORN - OAS - 2AG - CSR
>>> # TPO: phosphothreonine
>>> # PTR: phosphotyrosine
>>> # ORN: ornithine
>>> # OAS: o-acetylserine
>>> # 2AG: 2-allyl-glycine
>>> # CSR: selenocysteine
>>> # However, someone mis-named the 2AG residue to ALL
>>> # Let's fix that with EncoderMap's CustomTopology
>>> import encodermap as em
>>> from pathlib import Path
...
>>> traj = em.load(Path(em.__file__).resolve().parent.parent / "tests/data/unnatural_aminoacids.pdb")
...
>>> custom_aas = {
...     "ALL": ("A", None),  # makes EncoderMap treat 2-allyl-glycine as alanine
...     "OAS": (
...         "S",  # OAS is 2-acetylserine
...         {
...             "CHI2": ["CA", "CB", "OG", "CD"],  # this is a non-standard chi2 angle
...             "CHI3": ["CB", "OG", "CD", "CE"],  # this is a non-standard chi3 angle
...         },
...     ),
...     "CSR": (  # CSR is selenocysteine
...         "S",
...         {
...             "bonds": [   # we can manually define bonds for selenocysteine like so:
...                 ("-C", "N"),      # bond between previous carbon and nitrogen CSR
...                 ("N", "CA"),
...                 ("N", "H1"),
...                 ("CA", "C"),
...                 ("CA", "HA"),     # this topology includes hydrogens
...                 ("C", "O"),
...                 ("C", "OXT"),     # As the C-terminal residue, we don't need to put ("C", "+N") here
...                 ("CA", "CB"),
...                 ("CB", "HB1"),
...                 ("CB", "HB2"),
...                 ("CB", "SE"),
...                 ("SE", "HE"),
...             ],
...             "CHI1": ["N", "CA", "CB", "SE"],  # this is a non-standard chi1 angle
...         },
...     ),
...     "TPO": (  # TPO is phosphothreonine
...         "T",
...         {
...             "CHI2": ["CA", "CB", "OG1", "P"],  # a non-standard chi2 angle
...             "CHI3": ["CB", "OG1", "P", "OXT"],  # a non-standard chi3 angle
...         },
...     ),
... }
...
>>> # loading this will raise an Exception, because the bonds in CSR already exist
>>> traj.load_custom_topology(custom_aas)  
Traceback (most recent call last):
    ...
Exception: Bond between ALL5-C and CSR6-N already exists. Consider using the key 'optional_bonds' to not raise an Exception on already existing bonds.
>>> # If we rename the "bonds" section in "CSR" to "optional_bonds" it will work
>>> custom_aas["CSR"][1]["optional_bonds"] = custom_aas["CSR"][1].pop("bonds")
>>> traj.load_custom_topology(custom_aas)
>>> sidechains = em.features.SideChainDihedrals(traj).describe()
>>> "SIDECHDIH CHI2  RESID  OAS:   4 CHAIN 0" in sidechains
True
>>> "SIDECHDIH CHI3  RESID  OAS:   4 CHAIN 0" in sidechains
True
>>> "SIDECHDIH CHI1  RESID  CSR:   6 CHAIN 0" in sidechains
True
>>> "SIDECHDIH CHI2  RESID  TPO:   1 CHAIN 0" in sidechains
True
>>> "SIDECHDIH CHI3  RESID  TPO:   1 CHAIN 0" in sidechains
True

Parameters:

new_residues (NewResidue)
traj (Optional[SingleTraj])

add_amino_acid_codes()[source]#

Return type:: None

add_bonds()[source]#

Adds and deletes bonds specified in the custom topology.

Returns:: The new topology.
Return type:: md.Topology

add_new_residue(new_residue)[source]#

Adds an instance of NewResidue to the reisdues of this CustomTopology.

Parameters:: new_residue (NewResidue) – An instance of NewResidue.
Return type:: None

atom_sequence(type)[source]#

Returns either backbone or sidechain indices in a useful order.

Parameters:: type (Literal["OMEGA", "PHI", "PSI", "CHI1", "CHI2", "CHI3", "CHI4", "CHI5"]) – The angle, that is looked for.
Returns:: A tuple containing two numpy arrays:
Return type:: tuple[np.ndarray, np.ndarray]

backbone_sequence(atom_names, type)[source]#

Searches for a sequence along the backbone.

Parameters:

atom_names (list[str]) – The names of the atoms. Can use +/- to mark atoms in previous or following residue.
type (Literal["PHI", "PSI", "OMEGA"]) – The type of the dihedral sequence.

Returns:

The integer indices of the requested atoms.

Return type:

np.ndarray

combine_chains(chain_id1, chain_id2)[source]#

Function to combine two chains into one.

Parameters:

chain_id1 (int) – The 0-based index of chain 1.
chain_id2 (int) – The 0-based index of chain 2.

Return type:

None

classmethod from_dict(custom_aas, traj=None)[source]#

Instantiate the class from a dictionary.

Parameters:

custom_aas (CustomAAsDict) –
Custom AAs defined by a dict with the following properties: The keys are the residue names encountered in this traj. The values to the keys can be one of three types:
- None: if a key: None pair is supplied, this just adds the
  residue to the recognized residues. Nothing will be done with it.
- str: If a key: str pair is supplied, it is expected that the
  string matches one of the one-letter amino-acid codes. If your new residue is based on Lysine and you named it LYQ, you need to supply: {“LYQ”: “K”}
- tuple[str, dict]: If your residue has nonstandard side-chain
  angles (i.e. due to phosphorylation), you can supply a tuple of the one-letter amino-acid code and a dict which defines the sidechain angles like so: {“THR”: (“T”, {“CHI2”: [“CA”, “CB”, “CG”, “P”]})} In this example, the standard amino acid threonine was phosphorylated. The chi2 angle was added. If you want to add custom bonds you can add the “bond” key to the dict and give it either atom names or atom indices of other atoms like so: {“LYQ”: (“K”, {“bonds”: [(“N”, “CA”), (“N”, “H”), …], “CHI1”: [“N”, “CA”, “CB”, “CG”]}).
- tuple[str, str, dict]: In this case, the first string should
  be the name of the amino-acid, the second string should be a common_str, that is in self.common_str. That way, the different topologies in this TrajEnsemble can dynamically use different custom_aas.
traj (SingleTraj | None)

classmethod from_hdf5_file(fname, traj=None)[source]#

Parameters:

fname (Path | str)
traj (SingleTraj | None)

classmethod from_json(json_str, traj=None)[source]#

The same as from_dict, but using a json str.

Parameters:

json_str (str)
traj (SingleTraj | None)

classmethod from_yaml(path, traj=None)[source]#

Parameters:

path (str | Path)
traj (SingleTraj | None)

get_single_residue_atom_ids(atom_names, r, key_error_ok=False)[source]#

Gives the 0-based atom ids of a single residue.

Parameters:

atom_names (list[str]) – The names of the atoms. ie. [‘N’ ,’CA’, ‘C’, ‘+N’]
r (NewResidue) – An instance of NewResidue.
key_error_ok (bool) – Whether a key error when querying self._atom_dict raises an error or returns an empty np.ndarray.

Returns:

An integer array with the ids of the requested atoms.

Return type:

np.ndarray

indices_chi1()[source]#