encodermap.trajinfo package#

Submodules#

encodermap.trajinfo.hash_files module#

encodermap.trajinfo.hash_files.hash_files(files)[source]#

Returns a dict of file hashes

Parameters:: files (Union[str, list]) – A file or a list of files to hash.
Returns:: A nested dict, indexed by filenames and sha1 and md5 hashes.
Return type:: dict
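A nested dict of this kind can be sketched with the standard library alone. The function below, `hash_files_sketch`, is a hypothetical stand-in and not encodermap's implementation. Using the file's base name as the outer key and `"sha1"` / `"md5"` as the inner keys is an assumption about the layout.

```python
import hashlib
import os
import tempfile


def hash_files_sketch(files):
    """Return {basename: {"sha1": ..., "md5": ...}} for one path or a list of paths."""
    if isinstance(files, str):
        files = [files]
    out = {}
    for path in files:
        with open(path, "rb") as fh:
            content = fh.read()
        # Assumed layout: outer key is the file name, inner keys are the hash names.
        out[os.path.basename(path)] = {
            "sha1": hashlib.sha1(content).hexdigest(),
            "md5": hashlib.md5(content).hexdigest(),
        }
    return out


# Usage: hash a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    example = os.path.join(tmp, "example.txt")
    with open(example, "wb") as fh:
        fh.write(b"hello")
    hashes = hash_files_sketch(example)

print(hashes["example.txt"]["md5"])
```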

encodermap.trajinfo.info_all module#

Classes to work with ensembles of trajectories.

The statistics of a protein can be better described by an ensemble of proteins, rather than a single long trajectory. Treating a protein in such a way opens great possibilities and changes the way one can treat molecular dynamics data. Trajectory ensembles allow:

Faster convergence via adaptive sampling.

This subpackage contains two classes which are containers of trajectory data. The SingleTraj class contains information about a single trajectory. The TrajEnsemble class contains information about multiple trajectories. This adds a new dimension to MD data. The time and atom dimensions are already established. Two frames can be appended along the time axis to get a trajectory with multiple frames. If they are appended along the atom axis, the new frame contains the atoms of both. The trajectory axis works in a similar fashion. Adding two trajectories along the trajectory axis returns a trajectory ensemble, represented as a TrajEnsemble class in this package.

class encodermap.trajinfo.info_all.TrajEnsemble(trajs: Union[list[str], list[md.Trajectory], list[SingleTraj], list[Path]], tops: Optional[list[str]] = None, backend: Literal['mdtraj', 'no_load'] = 'no_load', common_str: Optional[list[str]] = None, basename_fn: Optional[Callable] = None)[source]#

Bases: object

This class contains the info about many trajectories. Topologies can be mismatching.

This class is a fancy list of encodermap.trajinfo.SingleTraj objects. Trajectories can have different topologies and will be grouped by the common_str argument.

TrajEnsemble supports fancy indexing. You can slice to your liking: trajs[::5] returns a TrajEnsemble object that only considers every fifth frame. Besides indexing by slices and integers, you can pass a 2-dimensional np.array. np.array([[0, 5], [1, 10], [5, 20]]) will return a TrajEnsemble object with frame 5 of trajectory 0, frame 10 of trajectory 1 and frame 20 of trajectory 5. Simply passing an integer as index returns the corresponding SingleTraj object.
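The selection by (trajectory, frame) pairs can be illustrated without encodermap. In this sketch the ensemble is a plain list of lists of frame labels and `select_frames` is a hypothetical helper. It only mirrors the indexing logic described above, not the library's internals.

```python
# Hypothetical stand-in: each trajectory is a list of frame labels.
ensemble = [[f"traj{t}_frame{f}" for f in range(25)] for t in range(6)]


def select_frames(ensemble, pairs):
    """Pick one frame per (traj_num, frame_num) pair, in the order given."""
    return [ensemble[traj_num][frame_num] for traj_num, frame_num in pairs]


pairs = [[0, 5], [1, 10], [5, 20]]
selected = select_frames(ensemble, pairs)
print(selected)
# ['traj0_frame5', 'traj1_frame10', 'traj5_frame20']
```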

The TrajEnsemble class also contains an iterator to iterate over trajectories. You could do:

>>> for traj in trajs:
...     for frame in traj:
...         print(frame)

CVs#

The collective variables of the SingleTraj classes. Only CVs with matching names in all SingleTraj classes are returned. The data is stacked along a hypothetical time axis along the trajs.

Type:: dict

_CVs#

The same data as in CVs but with labels. Additionally, the xarray is not stacked along the time axis. It contains an extra dimension for trajectories.

Type:: xarray.Dataset

n_trajs#

Number of individual trajectories in this class.

Type:: int

n_frames#

Number of frames, sum over all trajectories.

Type:: int

locations#

A list with the locations of the trajectories.

Type:: list of str

top#

A list with the reference pdb for each trajectory.

Type:: list of mdtraj.Topology

basenames#

A list with the names of the trajectories. The leading path and the file extension are omitted.

Type:: list of str

name_arr#

An array with len(name_arr) == n_frames. This array keeps track of each frame in this object by identifying each frame with a filename. This can be useful when frames are mixed inside a TrajEnsemble class.

Type:: np.ndarray of str

index_arr#

index_arr.shape = (n_frames, 2). This array keeps track of each frame with two ints. One giving the number of the trajectory, the other the frame.

Type:: np.ndarray of int

Examples

>>> # Create a trajectory ensemble from a list of files
>>> import encodermap as em
>>> import numpy as np
>>> trajs = em.TrajEnsemble(['https://files.rcsb.org/view/1YUG.pdb', 'https://files.rcsb.org/view/1YUF.pdb'])
>>> # trajs are internally numbered
>>> print([traj.traj_num for traj in trajs])
[0, 1]
>>> # Build a new traj from random frames
>>> # Let's say frame 2 of traj 0, frame 5 of traj 1 and again frame 2 of traj 0
>>> # Doing this, every frame will now be its own trajectory for easier bookkeeping
>>> arr = np.array([[0, 2], [1, 5], [0, 2]])
>>> new_trajs = trajs[arr]
>>> print(new_trajs.n_trajs)
3
>>> # trace back a single frame
>>> frame_num = 28
>>> index = trajs.index_arr[frame_num]
>>> print('Frame {}, originates from trajectory {}, frame {}.'.format(frame_num, trajs.basenames[index[0]], index[1]))
Frame 28, originates from trajectory 1YUF, frame 13.

property CVs: dict[str, numpy.ndarray]#

Returns dict of CVs in SingleTraj classes. Only CVs with the same names in all SingleTraj classes are loaded.

Type:: dict

property CVs_in_file: bool#

True if CVs can be loaded from file. Can be used to build a data generator.

Type:: bool

property _CVs: Dataset#

Returns an xarray Dataset of matching CVs, stacked along the trajectory axis.

Type:: xarray.Dataset

__add__(y)[source]#: Addition of two TrajEnsemble objects returns new TrajEnsemble with trajectories joined along the traj axis.

__init__(trajs: Union[list[str], list[md.Trajectory], list[SingleTraj], list[Path]], tops: Optional[list[str]] = None, backend: Literal['mdtraj', 'no_load'] = 'no_load', common_str: Optional[list[str]] = None, basename_fn: Optional[Callable] = None) → None[source]#

Initialize the Info class with two lists of files.

Parameters:

trajs (Union[list[str], list[md.Trajectory], list[SingleTraj], list[Path]]) – List of strings with paths to trajectories.
tops (Optional[list[str]]) – List of strings with paths to reference pdbs.
backend (str, optional) – Chooses the backend to load trajectories. ‘mdtraj’ uses mdtraj, which loads all trajectories into RAM. ‘no_load’ creates an empty trajectory object. Defaults to ‘no_load’.
common_str (list of str, optional) – Use this if you want to include trajectories with different topologies. The common string is used to pair traj files (.xtc, .dcd, .lammpstrj) with their topology (.pdb, .gro, …). The common string should be a substring of matching trajs and topologies.
basename_fn (Union[None, function], optional) – A function to apply to the traj_file string to return the basename of the trajectory. If None is provided, the filename without extension will be used. When all files are named the same and the folder they’re in defines the name of the trajectory, you can supply lambda x: x.split('/')[-2] as this argument. Defaults to None.

Raises:

TypeError – If some of your inputs are mismatched. If your input lists contain other types than str or mdtraj.Trajectory.
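The two naming strategies for basename_fn can be compared on a single path. This is a minimal sketch under stated assumptions: the path is hypothetical, and `default_basename` only imitates the documented default (file name without path and extension). It is not taken from encodermap.

```python
import posixpath


def default_basename(traj_file):
    """File name without leading path and without extension."""
    return posixpath.splitext(posixpath.basename(traj_file))[0]


def parent_dir_basename(traj_file):
    """Name of the directory that contains the file."""
    return traj_file.split("/")[-2]


path = "/data/protein1/run_03/traj.xtc"  # hypothetical path
print(default_basename(path))     # traj
print(parent_dir_basename(path))  # run_03
```

A function like `parent_dir_basename` is the kind of callable that could be passed as basename_fn when every trajectory file is called traj.xtc and only the folder distinguishes the runs.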

_pyemma_indexing(key: ndarray) → TrajEnsemble[source]#: Returns a new TrajEnsemble by giving the indices of traj and frame

_return_trajs_by_index(index: list[int]) → TrajEnsemble[source]#: Creates a TrajEnsemble object with the trajs specified by index.

_string_summary() → str[source]#

property basenames: list[str]#

List of the basenames in the Info single classes.

Type:: list

property frames: list[int]#

Frames of individual trajectories.

Type:: list

classmethod from_textfile(fname, basename_fn=None) → TrajEnsemble[source]#

Creates a TrajEnsemble object from a textfile.

The textfile needs to be space-separated with two or three columns.

Column 1: The trajectory file.
Column 2: The corresponding topology file (if you are using .h5 trajs, columns 1 and 2 will be identical).
Column 3: The common string of the trajectory. This column can be left out, which will result in a TrajEnsemble without common strings.

Parameters:

fname (str) – File to be read.
basename_fn (Union[None, function], optional) – A function to apply to the traj_file string to return the basename of the trajectory. If None is provided, the filename without extension will be used. When all files are named the same and the folder they’re in defines the name of the trajectory, you can supply lambda x: x.split('/')[-2] as this argument. Defaults to None.

Returns:

An instantiated TrajEnsemble class.

Return type:

TrajEnsemble
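The expected file layout can be made concrete with a small parser. The function `parse_traj_textfile` below is a hypothetical helper, not part of encodermap. Skipping blank lines and raising a ValueError for other column counts are assumptions of this sketch.

```python
def parse_traj_textfile(text):
    """Split each non-empty line into (traj_file, top_file, common_str or None)."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) == 2:
            traj_file, top_file = fields
            common_str = None
        elif len(fields) == 3:
            traj_file, top_file, common_str = fields
        else:
            raise ValueError(
                f"line {line_no}: expected 2 or 3 columns, got {len(fields)}"
            )
        rows.append((traj_file, top_file, common_str))
    return rows


# Hypothetical file content; the .h5 line repeats the file as its own topology.
example = """\
protein1_traj1.xtc protein1.pdb protein1
protein1_traj2.xtc protein1.pdb protein1
protein2_traj.h5 protein2_traj.h5 protein2
"""
rows = parse_traj_textfile(example)
print(rows[2])
# ('protein2_traj.h5', 'protein2_traj.h5', 'protein2')
```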

classmethod from_xarray(fnames, basename_fn=None) → TrajEnsemble[source]#

get_single_frame(key: int) → SingleTraj[source]#

Returns a single frame from all loaded trajectories.

Consider a TrajEnsemble class with two SingleTraj classes. One has 10 frames, the other 5 (trajs.n_frames is 15). Frames are counted from zero, so calling trajs.get_single_frame(12) is equal to calling trajs[1][2].

Parameters:: key (int) – The frame to return.
Returns:: The frame.
Return type:: encodermap.SingleTraj
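The lookup from a global frame number to a trajectory and a local frame can be sketched with a cumulative sum. The helper `locate_frame` is hypothetical and assumes zero-based counting for both indices, as in the index_arr example of this class, where frame 28 of an ensemble with 15 and 16 frames is frame 13 of the second trajectory.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_frame(frame_counts, key):
    """Map a zero-based global frame index to (traj_num, local_frame)."""
    total = sum(frame_counts)
    if not 0 <= key < total:
        raise IndexError(f"frame {key} is out of range for {total} frames")
    ends = list(accumulate(frame_counts))  # exclusive end of every trajectory
    traj_num = bisect_right(ends, key)
    start = ends[traj_num - 1] if traj_num > 0 else 0
    return traj_num, key - start


print(locate_frame([15, 16], 28))  # (1, 13)
print(locate_frame([15, 16], 14))  # (0, 14)
print(locate_frame([15, 16], 15))  # (1, 0)
```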

property id: ndarray#

Duplication of self.index_arr

Type:: np.ndarray

property index_arr: ndarray#

Returns an np.ndarray with ndim = 2 that assigns every loaded frame an identifier consisting of traj_num (self.index_arr[:,0]) and frame_num (self.index_arr[:,1]). Can be used to create an arbitrary subset of frames and can be useful when used with clustering.

Type:: np.ndarray

iterframes() → Iterator[tuple[int, SingleTraj]][source]#

Generator over the frames in this class.

Yields:

tuple –

A tuple containing the following:: int: A loop-counter integer. encodermap.SingleTraj: A SingleTraj object.

Examples

>>> import encodermap as em
>>> trajs = em.TrajEnsemble(['https://files.rcsb.org/view/1YUG.pdb', 'https://files.rcsb.org/view/1YUF.pdb'])
>>> for i, frame in trajs.iterframes():
...     print(frame.basename)
...     print(frame.n_frames)
...     break
1YUG
1

itertrajs() → Iterator[tuple[int, SingleTraj]][source]#

Generator over the SingleTraj classes.

Yields:

tuple –

A tuple containing the following:: int: A loop-counter integer. Is identical with traj.traj_num. encodermap.SingleTraj: A SingleTraj object.

Examples

>>> import encodermap as em
>>> trajs = em.TrajEnsemble(['https://files.rcsb.org/view/1YUG.pdb', 'https://files.rcsb.org/view/1YUF.pdb'])
>>> for i, traj in trajs.itertrajs():
...     print(traj.basename)
1YUG
1YUF

load_CVs(data: TrajEnsembleFeatureType, attr_name: Optional[str] = None, cols: Optional[list[int]] = None, labels: Optional[list[str]] = None, directory: Optional[Union[str, Path]] = None, ensemble: bool = False) → None[source]#

Loads CVs in various ways. Easiest way is to provide a single numpy array and a name for that array.

Besides np.ndarrays, files (.txt and .npy) can be loaded. Features or Featurizers can be provided. An xarray.Dataset can be provided. A str can be provided that either is the name of one of encodermap’s features (encodermap.loading.features) or the string can be ‘all’, which loads all features required for encodermap’s AngleDihedralCartesianEncoderMap class.

Parameters:

data (Union[str, list, np.ndarray, 'all', xr.Dataset]) – The CV to load. When a numpy array is provided, it needs to have a shape matching n_frames. The data is distributed to the trajs. When a list of files is provided, len(data) needs to match n_trajs. The first file will be loaded by the first traj and so on. If a list of np.arrays is provided, the first array will be assigned to the first traj. If None is provided, the arg directory will be used to construct the filenames as fname = directory + traj.basename + ‘_’ + attr_name. These files will then be loaded and put into the trajs. Defaults to None.
attr_name (Optional[str]) – The name under which the CV should be found in the class. Choose whatever you like. highd, lowd, dists, etc…
cols (Optional[list[int]]) – A list of integers indexing the columns of the data to be loaded. This is useful if a file contains feature1, feature2, …, feature1_err, feature2_err formatted data. This option will only be used when loading multiple .txt files. If None is provided, all columns will be loaded. Defaults to None.
labels (list) – A list containing the labels for the dimensions of the data. Defaults to None.
directory (Optional[str]) – The directory to save the data at, if data is an instance of em.Featurizer and this featurizer has in_memory set to False. Defaults to ‘’.
ensemble (bool) – Whether the trajs in this class belong to an ensemble. This implies that they contain either the same topology or are very similar (think wt, and mutant). Setting this option True will try to match the CVs of the trajs onto a same dataset. If a VAL residue has been replaced by LYS in the mutant, the number of sidechain dihedrals will increase. The CVs of the trajs with VAL will thus contain some NaN values. Defaults to False.

Raises:

TypeError – When wrong Type has been provided for data.
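How a single array with n_frames rows could be distributed to the individual trajectories is sketched below. `split_by_frames` is a hypothetical helper that works on any sliceable sequence. The consecutive, in-order split is an assumption based on the description of the data parameter, not a copy of encodermap's code.

```python
def split_by_frames(data, frame_counts):
    """Split the rows of `data` into consecutive chunks, one per trajectory."""
    if len(data) != sum(frame_counts):
        raise ValueError(
            f"data has {len(data)} rows, but the trajectories have "
            f"{sum(frame_counts)} frames in total"
        )
    chunks, start = [], 0
    for n in frame_counts:
        chunks.append(data[start:start + n])
        start += n
    return chunks


# Hypothetical two-dimensional CV for an ensemble with 3 and 2 frames.
highd = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
per_traj = split_by_frames(highd, [3, 2])
print([len(chunk) for chunk in per_traj])  # [3, 2]
```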

load_trajs() → None[source]#: Loads all trajs in self.

property locations: list[str]#

Duplication of self.traj_files but using the trajs own traj_file attribute. Ensures that traj files are always returned independent from current load state.

Type:: list

property n_frames: int#

Sum of the loaded frames.

Type:: int

property n_residues: int#

List of number of residues of the SingleTraj classes

Type:: list

property n_trajs: int#

Number of trajectories in this ensemble.

Type:: int

property name_arr: ndarray#

Trajectory names with the same length as self.n_frames.

Type:: np.ndarray

save()[source]#

save_CVs(path: Union[str, Path]) → None[source]#: Saves the CVs to a NETCDF file using xarray.

split_into_frames(inplace: bool = False) → None[source]#

Splits self into separate frames.

Parameters:: inplace (bool, optional) – Whether to do the split in place or not. Defaults to False and thus returns a new TrajEnsemble class.

subsample(stride: int, inplace: bool = False) → Union[None, TrajEnsemble][source]#

Returns a subset of this TrajEnsemble class given the provided stride.

This is a faster alternative to using trajs[trajs.index_arr[::1000]] when HDF5 trajs are used, because the slicing information is saved in the respective SingleTraj classes and loading of single frames is faster in HDF5 formatted trajs.

Note

The result from subsample() is different from trajs[trajs.index_arr[::1000]]. With subsample every trajectory is subsampled independently. Consider a TrajEnsemble with two SingleTraj trajectories with 18 frames each. subsampled = trajs.subsample(5) would return a TrajEnsemble with two trajs with 3 frames each (subsampled.n_frames is 6). Whereas subsampled = trajs[trajs.index_arr[::5]] would return a TrajEnsemble with 7 SingleTrajs with 1 frame each (subsampled.n_frames is 7). Because the times and frame numbers are saved all the time this should not be too much of a problem.
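The structural difference between the two approaches can be reproduced with plain Python slicing. The frame counts below are hypothetical and chosen only for illustration. The sketch assumes that the stride behaves like an ordinary slice step, which may not match every detail of subsample().

```python
frame_counts = [12, 7]  # hypothetical ensemble with two trajectories
stride = 5

# Per-trajectory subsampling: every trajectory is sliced on its own
# and the result still consists of two trajectories.
per_traj = [list(range(n))[::stride] for n in frame_counts]
print(per_traj)  # [[0, 5, 10], [0, 5]]

# Global subsampling: build (traj_num, frame_num) rows, then slice the rows.
# Every selected row becomes a single-frame trajectory of its own.
index_rows = [(t, f) for t, n in enumerate(frame_counts) for f in range(n)]
global_rows = index_rows[::stride]
print(global_rows)  # [(0, 0), (0, 5), (0, 10), (1, 3)]
```

In the global variant the stride runs across trajectory boundaries, so the first frame picked from the second trajectory is frame 3 rather than frame 0.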

property top: list[mdtraj.core.topology.Topology]#

Returns a minimal set of mdtraj.Topologies.

If all trajectories share the same topology a list with len 1 will be returned.

Type:: list

property top_files: list[str]#

Returns minimal set of topology files.

If you want a list of top files with the same length as self.trajs use self._top_files and self._traj_files.

Type:: list

property traj_files: list[str]#

A list of the traj_files of the individual SingleTraj classes.

Type:: list

property traj_joined: Trajectory#

Returns a mdtraj Trajectory with every frame of this class appended along the time axis.

Can also work if different topologies (with the same number of atoms) are loaded. In that case, the first frame in self will be used as topology parent and the remaining frames’ xyz coordinates are used to position the parents’ atoms accordingly.

Examples

>>> import encodermap as em
>>> single_mdtraj = trajs.split_into_frames().traj_joined
>>> print(single_mdtraj)
<mdtraj.Trajectory with 31 frames, 720 atoms, 50 residues, without unitcells>

Type:: mdtraj.Trajectory

property traj_nums: list[int]#

Number of info single classes in self.

Type:: list

unload() → None[source]#: Unloads all trajs in self.

property xyz: ndarray#

xyz coordinates of all atoms stacked along the traj-time axis. Only works if all trajs share the same topology.

Type:: np.ndarray

encodermap.trajinfo.info_all._can_be_feature(inp)[source]#

Function to decide whether the input can be interpreted by the Featurizer class.

Outputs True, if inp == ‘all’ or inp is a list of strings contained in FEATURE_NAMES.

Parameters:: inp (Any) – The input.
Returns:: True, if inp can be interpreted by featurizer.
Return type:: bool

Example

>>> from encodermap.misc.misc import _can_be_feature
>>> _can_be_feature('all')
True
>>> _can_be_feature('no')
False
>>> _can_be_feature(['AllCartesians', 'central_dihedrals'])
True

encodermap.trajinfo.info_all._datetime_windows_and_linux_compatible()[source]#

Portable way to get the current time as either a Linux or Windows compatible string.

For Linux systems, strings in this manner will be returned:: 2022-07-13T16:04:04+02:00
For Windows systems, strings in this manner will be returned:: 2022-07-13_16-04-46
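Both string formats can be produced with the standard datetime module. `timestamp_string` is a hypothetical function written for this sketch, not the library's code. The Windows variant presumably exists because colons are not allowed in Windows file names.

```python
from datetime import datetime, timedelta, timezone


def timestamp_string(now, windows):
    """Format `now` as an ISO 8601 string or as a variant without colons."""
    now = now.replace(microsecond=0)
    if windows:
        # Colons are not allowed in Windows file names.
        return now.strftime("%Y-%m-%d_%H-%M-%S")
    return now.isoformat()


# A fixed, timezone-aware datetime keeps the output reproducible.
fixed = datetime(2022, 7, 13, 16, 4, 4, tzinfo=timezone(timedelta(hours=2)))
print(timestamp_string(fixed, windows=False))  # 2022-07-13T16:04:04+02:00
print(timestamp_string(fixed, windows=True))   # 2022-07-13_16-04-04
```

For the current local time, `datetime.now().astimezone()` could be passed as `now`.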

encodermap.trajinfo.info_single module#

Classes to work with ensembles of trajectories.

Faster convergence via adaptive sampling.

Better anomaly detection of unique structural states.

class encodermap.trajinfo.info_single.SingleTraj(traj: Union[str, Path, md.Trajectory], top: Optional[str, Path] = None, common_str: str = '', backend: Literal['no_load', 'mdtraj'] = 'no_load', index: Optional[Union[int, list[int], np.ndarray, slice]] = None, traj_num: Optional[int] = None, basename_fn: Optional[Callable] = None)[source]#

Bases: object

This class contains the info about a single trajectory.

This class contains many of the attributes and methods of mdtraj’s Trajectory. It is meant to be used as a single trajectory in an ensemble defined in the TrajEnsemble class. Unlike the standard mdtraj Trajectory, this class loads the MD data only when needed. The location of the file and other attributes like a single integer index (single frame of trajectory) or a list of integers (multiple frames of the same traj) are stored until the traj is accessed via the SingleTraj.traj attribute. The returned traj is an mdtraj Trajectory with the correct number of frames in the correct sequence.

Furthermore, this class keeps track of your collective variables. Oftentimes the raw xyz data of a trajectory is not needed and suitable CVs are selected to represent a protein via internal coordinates (torsions, pairwise distances, etc.). Whether you call them highd or torsions, this class keeps track of everything and returns the values when you need them.

SingleTraj supports fancy indexing, so you can extract one or more frames from a Trajectory as a separate trajectory. For example, to form a trajectory with every other frame, you can slice with traj[::2].
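The idea of storing an index and reading data only on access can be sketched with a small class. `LazyFrames` is a hypothetical stand-in, not SingleTraj itself. It keeps only frame ids, and its load method merely simulates reading frames from disk.

```python
class LazyFrames:
    """Hypothetical stand-in for a trajectory that is read only on demand."""

    def __init__(self, n_frames, frame_ids=None):
        self.n_frames_on_disk = n_frames
        self.frame_ids = list(range(n_frames)) if frame_ids is None else list(frame_ids)
        self.loads = 0  # counts how often the "file" was read

    def __getitem__(self, key):
        # Indexing only updates the stored frame ids; nothing is read here.
        if isinstance(key, int):
            ids = [self.frame_ids[key]]
        elif isinstance(key, slice):
            ids = self.frame_ids[key]
        else:
            ids = [self.frame_ids[i] for i in key]
        return LazyFrames(self.n_frames_on_disk, ids)

    def load(self):
        # Stand-in for reading the selected frames from disk.
        self.loads += 1
        return [f"frame {i}" for i in self.frame_ids]


traj = LazyFrames(14)
sub = traj[-1:7:-2]
print(sub.frame_ids)     # [13, 11, 9]
print(sub.loads)         # 0
print(sub[::-1].load())  # ['frame 9', 'frame 11', 'frame 13']
```

Because indexing returns a new object that carries the original frame ids, slices can be chained before any data is read.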

SingleTraj uses the nanometer, degree & picosecond unit system.

backend#

Current state of loading. If backend == ‘no_load’ xyz data will be loaded from disk, if accessed. If backend == ‘mdtraj’, the data is already in RAM.

Type:: str

common_str#

Substring of traj_file. Used to group multiple trajectories together based on common topology files. If traj files protein1_traj1.xtc and protein1_traj2.xtc share the same protein1.pdb, common_str can be set to group them together.

Type:: str

index#

Fancy slices of the trajectory. When file is loaded from disk, the fancy indexes will be applied.

Type:: Union[int, list, np.array, slice]

traj_num#

Integer to identify a SingleTraj class in a TrajEnsemble class.

Type:: int

traj_file#

Trajectory file used to create this class.

Type:: str

top_file#

Topology file used to create this class. If a .h5 trajectory was used traj_file and top_file are identical. If a mdtraj.Trajectory was used to create SingleTraj, these strings are empty.

Type:: str

Examples

>>> # load a pdb file with 14 frames from rcsb.org
>>> import encodermap as em
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")
>>> print(traj)
encodermap.SingleTraj object. Current backend is no_load. Basename is 1GHC. Not containing any CVs.
>>> traj.n_frames
14

>>> # advanced slicing
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")[-1:7:-2]
>>> print([frame.id for frame in traj])
[13, 11, 9]

>>> # Build a trajectory ensemble from multiple trajs
>>> traj1 = em.SingleTraj("https://files.rcsb.org/view/1YUG.pdb")
>>> traj2 = em.SingleTraj("https://files.rcsb.org/view/1YUF.pdb")
>>> trajs = traj1 + traj2
>>> print(trajs.n_trajs, trajs.n_frames, [traj.n_frames for traj in trajs])
2 31 [15, 16]

property CVs: dict[str, numpy.ndarray]#

Returns a simple dict from the more complicated self._CVs xarray Dataset.

If self._CVs is empty and self.traj_file is an HDF5 (.h5) file, the contents of the HDF5 file will be checked for stored CVs. If none are found, an empty dict will be returned.

Type:: dict

property CVs_in_file: bool#

Is True, if traj_file has extension .h5 and contains CVs.

Type:: bool

__add__(y: SingleTraj) → TrajEnsemble[source]#

Addition of two SingleTraj classes yields TrajEnsemble class. A trajectory ensemble.

Parameters:: y (encodermap.SingleTraj) – The other traj, that will be added.
Returns:: The new trajs.
Return type:: encodermap.TrajEnsemble

__enter__()[source]#: Enters context manager. Inside context manager, the traj stays loaded.

__eq__(other: SingleTraj) → bool[source]#: Two SingleTraj objects are the same when the trajectories are the same, the files are the same and the loaded CVs are the same.

__exit__(type, value, traceback)[source]#: Exits the context manager and deletes unwanted variables.

__getattr__(attr)[source]#

What to do when attributes can not be obtained in a normal way?

This method allows access to the self.CVs dictionary’s values as instance variables. Furthermore, if an mdtraj attribute is accessed, the traj is loaded and the correct attribute is returned.

__getitem__(key)[source]#

This method returns another trajectory as a SingleTraj class.

Parameters:: key (Union[int, list[int], np.ndarray, slice]) – Indexing the trajectory can be done by int (returns a traj with 1 frame), lists of int or np.ndarray (returns a new traj with len(traj) == len(key)), or slice ([::3]), which returns a new traj with the correct number of frames.
Returns:: A SingleTraj object with the selected frames in it.
Return type:: SingleTraj

__init__(traj: Union[str, Path, md.Trajectory], top: Optional[str, Path] = None, common_str: str = '', backend: Literal['no_load', 'mdtraj'] = 'no_load', index: Optional[Union[int, list[int], np.ndarray, slice]] = None, traj_num: Optional[int] = None, basename_fn: Optional[Callable] = None) → None[source]#

Initialize the SingleTraj object with location and reference pdb file.

Parameters:

traj (Union[str, mdtraj.Trajectory]) – The trajectory. Can either be the filename of a trajectory file (.xtc, .dcd, .h5, .trr) or an mdtraj.Trajectory.
top (Union[str, mdtraj.Topology], optional) – The path to the reference pdb file. Defaults to ‘’. If an mdtraj.Trajectory or a .h5 traj filename is provided this option is not needed.
common_str (str, optional) – A string to group trajs of similar topology. If multiple trajs are loaded (TrajEnsemble) this common_str is used to group them together. Defaults to ‘’ and won’t be matched to other trajs. If traj files protein1_traj1.xtc and protein1_traj2.xtc share the same protein1.pdb and protein2_traj.xtc uses protein2.pdb as its topology, this argument can be [‘protein1’, ‘protein2’].
backend (Literal['no_load', 'mdtraj'], optional) – Chooses the backend to load trajectories. ‘mdtraj’ uses mdtraj, which loads all trajectories into RAM. ‘no_load’ creates an empty trajectory object. Defaults to ‘no_load’.
index (Optional[Union[int, list[int], np.ndarray, slice]]) – An integer or an array giving the indices. If an integer is provided only the frame at this position will be loaded once the internal mdtraj.Trajectory is accessed. If an array or list is provided the corresponding frames will be used. These indices can have duplicates: [0, 1, 1, 2, 0, 1]. A slice object can also be provided. Supports fancy slicing like traj[1:50:3]. If None is provided the trajectory is simply loaded as is. Defaults to None.
traj_num (Union[int, None], optional) – If working with multiple trajs this is the easiest unique identifier. If multiple SingleTrajs are instantiated by TrajEnsemble the traj_num is used as unique identifier per traj. Defaults to None.
basename_fn (Optional[Callable]) – A function to apply to traj_file to give it a unique identifier. If all your trajs are called traj.xtc and only the directory they’re in gives them a unique identifier you can provide a function into this argument to split the path. If None is provided the basename is extracted like so: lambda x: x.split('/')[-1].split('.')[0]. Defaults to None.

__iter__()[source]#: Iterate over frames in this class. Returns the correct CVs along with the frame of the trajectory.

__reversed__() → SingleTraj[source]#: Reverses the frame order of the traj. Same as traj[::-1]

_add_along_traj(y: SingleTraj) → TrajEnsemble[source]#

Puts self and y into a TrajEnsemble object.

This way the trajectories are not appended along the time axis but rather along the trajectory axis.

Parameters:: y (SingleTraj) – The other SingleTraj trajectory.

_gen_ensemble() → TrajEnsemble[source]#

Creates a TrajEnsemble class with this traj in it.

This method is needed to add two SingleTraj objects along the trajectory axis with the method add_new_traj. This method is also called by the __getitem__ method of the TrajEnsemble class.

_mdtraj_attr = ['n_frames', 'n_atoms', 'n_chains', 'n_residues', 'openmm_boxes', 'openmm_positions', 'time', 'timestep', 'xyz', 'unitcell_vectors', 'unitcell_lengths', 'unitcell_angles', '_check_valid_unitcell', '_distance_unit', '_have_unitcell', '_rmsd_traces', '_savers', '_string_summary_basic', '_time', '_time_default_to_arange', '_topology', '_unitcell_angles', '_unitcell_lengths', '_xyz']#

property _n_frames_base_h5_file: int#

Can be used to get n_frames without loading an HDF5 into memory.

Type:: int

property _original_frame_indices#

_string_summary() → str[source]#

Returns a summary about the current instance.

Number of frames, index, loaded CVs.

property _traj#: Needs to be here to complete setter. Not returning anything, because setter is also not returning anything.

_validate_uri(uri: str) → bool[source]#: Checks whether uri is a valid uri.

atom_slice(atom_indices: ndarray, inplace: bool = False) → Union[None, SingleTraj][source]#

Create a new trajectory from a subset of atoms.

Parameters:

atom_indices (Union[list, np.array]) – The indices of the atoms to keep.
inplace (bool, optional) – Whether to overwrite the current instance, or return a new instance. Defaults to False.

property basename: str#

Basename is the filename without path and without extension. If basename_fn is not None, it will be applied to traj_file.

Type:: str

property extension: str#

Extension is the file extension of the trajectory file (self.traj_file).

Type:: str

classmethod from_pdb_id(pdb_id: str) → SingleTraj[source]#

Alternate constructor for the SingleTraj class.

Builds a SingleTraj class from a pdb-id.

Parameters:: pdb_id (str) – The 4-letter pdb id.
Returns:: A SingleTraj class.
Return type:: SingleTraj

get_single_frame(key: int) → SingleTraj[source]#

Returns a single frame from the trajectory.

Parameters:: key (Union[int, np.int]) – Index of the frame.

Examples

>>> # Load traj from pdb
>>> import encodermap as em
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")
>>> traj.n_frames
14

>>> # Load the same traj and give it a number for recognition in a set of multiple trajs
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb", traj_num=5)
>>> frame = traj.get_single_frame(2)
>>> frame.id
array([[5, 2]])

property id: ndarray#

id is an array of unique identifiers which identify the frames in this SingleTraj object when multiple trajectories are considered.

If the traj was initialized from a TrajEnsemble class, the traj gets a unique identifier (traj_num) which will also be put into the id array, so that id can have two shapes ((n_frames, ), (n_frames, 2)). This corresponds to self.id.ndim = 1 and self.id.ndim = 2. In the latter case self.id[:,1] are the frames and self.id[:,0] is an array full of traj_num.

Type:: np.ndarray

join(other: SingleTraj) → Trajectory[source]#

Join two trajectories together along the time/frame axis.

Returns a mdtraj.Trajectory and thus loses CVs, filenames, etc.

load_CV(data: SingleTrajFeatureType, attr_name: Optional[str] = None, cols: Optional[list[int]] = None, labels: Optional[list[str]] = None, override: bool = False) → None[source]#

Load CVs into traj. Many options are possible. Provide xarray, numpy array, em.loading.feature, em.featurizer, and even string!

This method loads CVs into the SingleTraj class. Many ways of doing so are available:

np.ndarray: The easiest way. Provide a np array and a name for the array and the data will be saved as an instance variable, accessible via instance.name.
xarray.DataArray: You can load a multidimensional xarray as data into the class. Please refer to xarray’s own documentation if you want to create one yourself.
xarray.Dataset: You can add another dataset to the existing _CVs.
em.loading.feature: If you provide one of the features from em.loading.features the resulting features will be loaded and also placed under the provided name.
em.Featurizer: If you provide a full featurizer, the data will be generated and put as an instance variable under the provided name.
str: If a string is provided, the data will be loaded from a .txt, .npy, or NetCDF / HDF5 .nc file.

Parameters:

data (Union[str, np.ndarray, xr.DataArray, em.loading.feature, em.Featurizer]) – The CV to load. Either as numpy array, xarray DataArray, encodermap or pyemma feature, or full encodermap Featurizer.
attr_name (Union[None, str], optional) – The name under which the CV should be found in the class. Is needed, if a raw numpy array is passed, otherwise the name will be generated from the filename (if data == str), the DataArray.name (if data == xarray.DataArray), or the feature name.
cols (Union[list, None], optional) – A list specifying the columns to use for the highD data. If your highD data contains (x,y,z,…)-errors or has an enumeration column at col=0 this can be used to remove this unwanted data.
labels (Union[list, str, None], optional) – If you want to label the data you provided pass a list of str. If set to None, the features in this dimension will be labelled as [f”{attr_name.upper()} FEATURE {i}” for i in range(self.n_frames)]. If a str is provided, the features will be labelled as [f”{attr_name.upper()} {label.upper()} {i}” for i in range(self.n_frames)]. If a list of str is provided it needs to have the same length as the traj has frames. Defaults to None.
override (bool) – Whether to overwrite existing CVs. The method will also print a message which CVs have been overwritten.

Examples

>>> # Load the backbone torsions from a time-resolved NMR ensemble from the pdb
>>> import encodermap as em
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")
>>> central_dihedrals = em.loading.features.CentralDihedrals(traj.top)
>>> traj.load_CV(central_dihedrals)
>>> traj.central_dihedrals.shape
(1, 14, 222)
>>> # The values are stored in an xarray Dataset to track every possible datafield
>>> traj = em.SingleTraj("https://files.rcsb.org/view/1GHC.pdb")
>>> traj.load_CV(em.loading.features.CentralDihedrals(traj.top))
>>> print(traj._CVs['central_dihedrals']['CENTRALDIHEDRALS'].values[:2])
['CENTERDIH PSI   RESID  MET:   1 CHAIN 0'
 'CENTERDIH OMEGA RESID  MET:   1 CHAIN 0']

Raises:

FileNotFoundError – When the file given by data does not exist.
IOError – When the provided filename does not have .txt, .npy or .nc extension.
TypeError – When data does not match the specified input types.
Exception – When a numpy array has been passed as data and no attr_name has been provided.
BadError – When the provided attr_name is str, but can not be a python identifier.

load_traj(new_backend: Literal['no_load', 'mdtraj'] = 'mdtraj') → None[source]#

Loads the trajectory, with a new specified backend.

After this is called the instance variable self.trajectory will contain an mdtraj Trajectory object.

Parameters:: new_backend (str, optional) – Can be either mdtraj, to load the trajectory using mdtraj, or no_load, to not load the traj (unload). Defaults to mdtraj.

property n_atoms: int#

Number of atoms in traj.

Be aware that this loads the traj into memory if it is not in HDF5 file format.

Type:: int

property n_chains: int#

Number of chains in traj.

Type:: int

property n_frames: int#

Number of frames in traj.

Be aware that this loads the traj into memory if it is not in HDF5 file format.

Type:: int

property n_residues: int#

Number of residues in traj.

Type:: int

save(fname: str, CVs: Union[str, list[str]] = 'all', overwrite: bool = False) → None[source]#

Saves the trajectory to disk in the HDF5 file format.

Parameters:

fname (str) – The filename.
CVs (Union[List, 'all'], optional) – Either provide a list of strings of the CVs you would like to save to disk, or set to ‘all’ to save all CVs. Defaults to ‘all’.
overwrite (bool, optional) – Whether to force overwrite an existing file. Defaults to False.

Raises:

IOError – When the file already exists and overwrite is False.

save_CV_as_numpy(attr_name: str, fname: Optional[str] = None, overwrite: bool = False) → None[source]#

Saves the highD data of this traj.

This has its own method for parallelization purposes.

Parameters:

attr_name (str) – Name of the CV to save.
fname (str, optional) – Can be either
overwrite (bool, optional) – Whether to overwrite the file. Defaults to False.

Raises:

IOError – When the file already exists and overwrite is set to False.
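
The overwrite behaviour documented above (raise IOError if the file exists and overwrite is False) can be illustrated with a small NumPy sketch. The helper save_array_guarded is hypothetical and is not the method's actual code; it only shows the guard pattern.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_array_guarded(values, fname, overwrite=False):
    """Write values to fname with np.save, refusing to replace existing files.

    Hypothetical helper. fname is expected to end in .npy, because
    np.save appends that suffix to names that lack it.
    """
    path = Path(fname)
    if path.exists() and not overwrite:
        raise IOError(f"{path} already exists. Set overwrite=True to replace it.")
    np.save(path, values)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "my_cv.npy"
    save_array_guarded(np.zeros((4, 2)), target)
    try:
        # The file exists now, so a second write without overwrite fails.
        save_array_guarded(np.ones((4, 2)), target)
        second_write_refused = False
    except IOError:
        second_write_refused = True
    save_array_guarded(np.ones((4, 2)), target, overwrite=True)
    reloaded = np.load(target)
```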

select(sel_str: str = 'all') → ndarray[source]#

Execute a selection against the topology

Parameters:: sel_str (str, optional) – What to select. Defaults to ‘all’.

encodermap.trajinfo.load_traj module#

Util functions for the TrajEnsemble and SingleTraj classes.

encodermap.trajinfo.load_traj._load_traj(*index: Unpack(Ts), traj_file: Union[str, Path], top_file: Union[str, Path]) → tuple[md.Trajectory, np.ndarray][source]#

Loads a trajectory from disk and applies the indices from *index.

Parameters:

*index (Unpack[Ts]) – Variable length indices of which all need to be one of these datatypes: None, int, np.int, list[int], slice, np.ndarray. These indices are applied to the traj in order. So for a traj with 100 frames, the indices (slice(None, None, 5), [0, 2, 4, 6]) would yield the frames 0, 10, 20, 30. A None will not slice the traj at all.
traj_file (Union[str, Path]) – The pathlib.Path to the traj_file. A string can also be supplied. A URL can be passed as well, e.g. https://files.rcsb.org/view/1GHC.pdb.
top_file (Union[str, Path]) – The pathlib.Path to the top_file. Can also be str.

Returns:

The trajectory and a numpy array, which is the result of applying the indices to np.arange() of the unadulterated trajectory. Can be useful for continued slicing and indexing to keep track of everything.

Return type:

tuple[md.Trajectory, np.ndarray]
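
The way successive indices act on the frame numbers can be checked with NumPy alone. The sketch below uses a hypothetical helper, apply_indices, that mimics the bookkeeping array described above on np.arange(n_frames); it does not load any trajectory and is not the function's actual code.

```python
import numpy as np


def apply_indices(n_frames, *index):
    """Apply each index in order to the original frame numbers.

    Hypothetical helper: None leaves the frames untouched, a single int
    is wrapped in a list so the result stays one-dimensional.
    """
    frames = np.arange(n_frames)
    for ind in index:
        if ind is None:
            continue
        if isinstance(ind, (int, np.integer)):
            ind = [ind]
        frames = frames[ind]
    return frames


# Every fifth frame first, then entries 0, 2, 4 and 6 of that selection.
kept = apply_indices(100, slice(None, None, 5), [0, 2, 4, 6])
print(kept.tolist())  # [0, 10, 20, 30]
```

Because the second index is applied to the already strided selection, position 2 refers to original frame 10, position 4 to frame 20, and so on.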

encodermap.trajinfo.load_traj._load_traj_and_top(traj_file: Path, top_file: Path, index: Optional[Union[int, list[int], ndarray, slice]] = None) → Trajectory[source]#

Loads a traj and top file and raises FileNotFoundError, if they do not exist.

Parameters:

traj_file (Path) – The pathlib.Path to the traj_file.
top_file (Path) – The pathlib.Path to the top_file.
index (Optional[Union[int, list[int], np.ndarray, slice]]) – The index to load the traj at. If ints are provided, the load_frame method is used.

Returns:

The trajectory.

Return type:

md.Trajectory

Raises:

FileNotFoundError – If any of the files do not exist.

encodermap.trajinfo.load_traj._validate_uri(str_)[source]#: Checks whether the str_ is a valid uri.
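
How such a check might look is sketched below with the standard library's urllib.parse. This is an assumption about the general approach, not the implementation of _validate_uri; the function name looks_like_uri is hypothetical.

```python
from urllib.parse import urlparse


def looks_like_uri(str_):
    """Return True if str_ has both a scheme and a network location.

    Hypothetical stand-in, not EncoderMap's _validate_uri.
    """
    try:
        parsed = urlparse(str_)
    except ValueError:
        return False
    return bool(parsed.scheme and parsed.netloc)


remote_ok = looks_like_uri("https://files.rcsb.org/view/1GHC.pdb")
relative_ok = looks_like_uri("traj.xtc")
absolute_ok = looks_like_uri("/home/user/traj.xtc")
```

With this rule the PDB URL is accepted, while the relative and absolute local paths are rejected.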

encodermap.trajinfo.repository module#

Python endpoint to download files from a webserver on the fly.

The idea comes from Christoph Wehmeyer's markovmodel/mdshare. I liked his idea of distributing MD data via a simple Python backend, but wanted to make it smaller. A simple fetch() should suffice. I also liked the yaml syntax and wanted to use it.

References

Wehmeyer, C., Scherer, M. K., Hempel, T., Husic, B. E., Olsson, S., and Noé, F. (2018). Introduction to Markov state modeling with the PyEMMA software [Article v1.0]. Living Journal of Computational Molecular Science, 1(1), 5965.

class encodermap.trajinfo.repository.Repository(repo_source='data/repository.yaml', checksum_file='data/repository.md5', ignore_checksums=False, debug=True)[source]#

Bases: object

Main Class to work with Repositories of MD data and download the data.

This class handles the download of files from a repository source. All data are obtained from a .yaml file (default at data/repository.yaml), which contains trajectory files and topology files organized in a readable manner. With this class the repository.yaml file can be queried using unix-like file patterns. Files can be downloaded on-the-fly (if they already exist, they won’t be downloaded again). Besides individual files, full projects can be downloaded and rebuilt.

current_path#

Path of the .py file containing this class. If no working directory is given (None), all files will be downloaded to a directory named ‘data’ (will be created) which will be placed in the directory of this .py file.

Type:: str

url#

The url to the current repo source.

Type:: str

maintainer#

The maintainer of the current repo source.

Type:: str

files_dict#

A dictionary summarizing the files in this repo. dict keys are built from ‘project_name’ + ‘filetype’. So for a project called ‘protein_sim’, possible keys are ‘protein_sim_trajectory’, ‘protein_sim_topology’, ‘protein_sim_log’. The values of these keys are all str and they give the actual filename of the files. If ‘protein_sim’ was conducted with GROMACS, these files would be ‘traj_comp.xtc’, ‘confout.gro’ and ‘md.log’.

Type:: dict

files#

Just a list of str of all downloadable files.

Type:: list

data#

The main organization of the repository. This is the complete .yaml file as it was read and returned by pyyaml.

Type:: dict

Examples

>>> import encodermap as em
>>> repo = em.Repository()
>>> print(repo.search('*PFFP_sing*')) 
{'PFFP_single_trajectory': 'PFFP_single.xtc', 'PFFP_single_topology': 'PFFP_single.gro', 'PFFP_single_input': 'PFFP.mdp', 'PFFP_single_log': 'PFFP.log'}
>>> print(repo.url)
http://134.34.112.158

__init__(repo_source='data/repository.yaml', checksum_file='data/repository.md5', ignore_checksums=False, debug=True)[source]#

Initializes the repository.

Parameters:

repo_source (str) – The source .yaml file to build the repository from. Defaults to ‘data/repository.yaml’.
checksum_file (str) – A file containing the md5 hash of the repository file. This ensures no one tampers with the repository.yaml file and injects malicious code. Defaults to ‘data/repository.md5’.
ignore_checksums (bool) – If you want to ignore the checksum check of the repo_source file, set this to True. Can be useful for developing, when the repository.yaml file undergoes a lot of changes. Defaults to False.
debug (bool, optional) – Whether to print debug info. Defaults to True.
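
The idea behind checksum_file can be illustrated with hashlib: hash the repository file and compare the result with the stored md5 value. The helper names md5_of_file and checksum_matches are hypothetical, and the file contents below are made up for the demonstration; how Repository performs the check internally is not shown in this documentation.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(repo_file, expected_md5):
    """Compare the file's md5 digest with the expected hex string."""
    return md5_of_file(repo_file) == expected_md5.strip().lower()


with tempfile.TemporaryDirectory() as tmp:
    repo_file = Path(tmp) / "repository.yaml"
    content = b"url: http://example.org\n"  # made-up content
    repo_file.write_bytes(content)
    expected = hashlib.md5(content).hexdigest()
    untouched_ok = checksum_matches(repo_file, expected)
    # Any modification of the file changes the digest.
    repo_file.write_bytes(content + b"extra: line\n")
    tampered_ok = checksum_matches(repo_file, expected)
```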

_get_connection()[source]#: Also provides compatibility with mdshare.

static _split_proj_filetype(proj_filetype)[source]#: Splits the strings that index the self.datasets dictionary.
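
Given the key scheme described for files_dict (‘project_name’ + ‘filetype’), one plausible way to split such a key is at its last underscore. The sketch below assumes the filetype never contains an underscore; the function is a hypothetical stand-in and not the library's code.

```python
def split_proj_filetype(proj_filetype):
    """Split 'project_name_filetype' at the last underscore.

    Assumes the filetype part itself contains no underscore.
    """
    project, filetype = proj_filetype.rsplit("_", 1)
    return project, filetype


sim_parts = split_proj_filetype("protein_sim_trajectory")
pffp_parts = split_proj_filetype("PFFP_single_log")
print(sim_parts)   # ('protein_sim', 'trajectory')
print(pffp_parts)  # ('PFFP_single', 'log')
```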

property catalogue#

Returns the underlying catalogue data.

Type:: dict

property datasets#

A set of datasets in this repository. A dataset can either be characterized by a set of trajectory-, topology-, log- and input-file or a dataset is a .tar.gz container, which contains all necessary files.

Type:: set

fetch(remote_filenames, working_directory=None, overwrite=False, max_attempts=3, makdedir=False, progress_bar=True)[source]#

Fetches a single file from self.files.

Also displays a progress bar with the name of the file. Uses requests.

Parameters:

remote_filenames (str) – The name of the remote file. Check self.files for more info.
working_directory (Union[str, None], optional) – A directory to save the files to. Can also be None. In that case self.current_path + ‘/data’ is used, where current_path is retrieved via inspect.getfile(inspect.currentframe()). If the files are already there and overwrite is False, the file path is simply returned. Defaults to None.
overwrite (bool, optional) – Whether to overwrite local files. Defaults to False.
max_attempts (int, optional) – Number of download attempts. Defaults to 3.
makdedir (bool, optional) – Whether to create working_directory, if it is not already existing. Defaults to False.
progress_bar (bool, optional) – Uses the package progress-reporter to display a progress bar.

Returns:

A tuple containing the following: a list of the files that have just been downloaded (list) and a string giving the directory the files have been downloaded to (str).

Return type:

tuple
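
The interplay of overwrite and max_attempts can be sketched without any network access. In the example below the download step is a stand-in callable (the real method uses requests, as stated above), and fetch_with_retries, its return value and the flaky downloader are all hypothetical.

```python
import tempfile
from pathlib import Path


def fetch_with_retries(download, filename, directory, overwrite=False, max_attempts=3):
    """Skip existing files unless overwrite is set, else retry the download.

    Returns the target path and whether a download actually happened.
    """
    target = Path(directory) / filename
    if target.exists() and not overwrite:
        return target, False
    last_error = None
    for _attempt in range(max_attempts):
        try:
            target.write_bytes(download(filename))
            return target, True
        except OSError as err:
            last_error = err
    raise OSError(f"Could not fetch {filename} after {max_attempts} attempts.") from last_error


calls = []


def flaky_download(filename):
    """Stand-in downloader that fails twice before succeeding."""
    calls.append(filename)
    if len(calls) < 3:
        raise OSError("simulated network error")
    return b"frame data"


with tempfile.TemporaryDirectory() as tmp:
    path, downloaded = fetch_with_retries(flaky_download, "traj.xtc", tmp)
    first_result = (path.name, downloaded, len(calls))
    # The file exists now, so the second call returns without downloading.
    _, downloaded_again = fetch_with_retries(flaky_download, "traj.xtc", tmp)
```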

get_sizes(pattern)[source]#

Returns a list of file-sizes of a given pattern.

Parameters:: pattern (Union[str, list]) – A unix-like pattern (‘traj*.xtc’) or a list of files ([‘traj_1.xtc’, ‘traj_2.xtc’]).
Returns:: A list of filesizes in bytes.
Return type:: list

load_project(project, working_directory=None, overwrite=False, max_attempts=3, makdedir=False, progress_bar=True)[source]#

This will return TrajEnsemble / SingleTraj objects that are correctly formatted.

This method allows one to directly rebuild projects from the repo source, using encodermap’s own SingleTraj and TrajEnsemble classes.

Parameters:

project (str) – The name of the project to be loaded. See Repository.projects.keys() for a list of projects.
working_directory (Union[str, None], optional) – A directory to save the files to. Can also be None. In that case self.current_path + ‘/data’ is used, where current_path is retrieved via inspect.getfile(inspect.currentframe()). If the files are already there and overwrite is False, the file path is simply returned. Defaults to None.
overwrite (bool, optional) – Whether to overwrite local files. Defaults to False.
max_attempts (int, optional) – Number of download attempts. Defaults to 3.
makdedir (bool, optional) – Whether to create working_directory, if it is not already existing. Defaults to False.
progress_bar (bool, optional) – Uses the package progress-reporter to display a progress bar.

Returns:

The project, already loaded into encodermap’s SingleTraj or TrajEnsemble classes.

Return type:

Union[encodermap.SingleTraj, encodermap.TrajEnsemble]

Examples

>>> import encodermap as em
>>> repo = em.Repository()
>>> trajs = repo.load_project('Tetrapeptides_Single')
>>> print(trajs)
encodermap.TrajEnsemble object. Current backend is no_load. Containing 2 trajs. Common str is ['PFFP', 'FPPF']. Not containing any CVs.
>>> print(trajs.n_trajs)
2

lookup(file)[source]#

Piece of code to allow some compatibility with mdshare.

The complete self.data dictionary will be traversed to find file and its location in the self.data dictionary. This will be used to get the filesize and its md5 hash. The returned tuple also tells whether the file is a .tar.gz container or not. In the case of a container, the container needs to be extracted using tarfile.

Parameters:

file (str) – The file to search for.

Returns:

A tuple containing the following: a string (str) that is either ‘container’ or ‘index’ (for normal files) and a dict (dict) with dict(file=filename, hash=filehash, size=filesize).

Return type:

tuple

print_catalogue()[source]#: Prints the catalogue nicely formatted.

property projects#

A dictionary containing project names and their associated files. Projects are a larger collection of individual sims, that belong together. The project names are the dictionary’s keys, the files are given as lists in the dict’s values.

Type:: dict

search(pattern)[source]#
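
No description is given for search here. Judging from the Examples of the Repository class, it appears to match a unix-like pattern against the keys of files_dict; that reading is an assumption. A minimal sketch of such matching with the standard library's fnmatch, where search_keys and the last dictionary entry are hypothetical:

```python
from fnmatch import fnmatchcase

files_dict = {
    "PFFP_single_trajectory": "PFFP_single.xtc",
    "PFFP_single_topology": "PFFP_single.gro",
    "PFFP_single_input": "PFFP.mdp",
    "PFFP_single_log": "PFFP.log",
    "other_project_trajectory": "traj_comp.xtc",  # hypothetical extra entry
}


def search_keys(files, pattern):
    """Return the entries whose key matches the unix-like pattern."""
    return {key: name for key, name in files.items() if fnmatchcase(key, pattern)}


hits = search_keys(files_dict, "*PFFP_sing*")
```

Here hits contains the four PFFP_single entries and leaves out the unrelated key.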

stack(pattern)[source]#

Creates a stack to prepare for downloads.

Parameters:

pattern (Union[str, list]) – A unix-like pattern (‘traj*.xtc’) or a list of files ([‘traj_1.xtc’, ‘traj_2.xtc’]).

Returns:

A list of dicts. Each dict contains filename, size and a boolean value telling whether the downloaded file needs to be extracted after downloading.

Return type:

list

encodermap.trajinfo.trajinfo_deprecated module#

encodermap.trajinfo.trajinfo_utils module#

Util functions for the TrajEnsemble and SingleTraj classes.

encodermap.trajinfo.trajinfo_utils.load_CVs_ensembletraj(trajs: TrajEnsemble, data: TrajEnsembleFeatureType, attr_name: Optional[list[str]] = None, cols: Optional[list[int]] = None, labels: Optional[list[str]] = None, directory: Optional[Union[Path, str]] = None, ensemble: bool = False) → None[source]#

encodermap.trajinfo.trajinfo_utils.load_CVs_singletraj(data: SingleTrajFeatureType, traj: SingleTraj, attr_name: Optional[str] = None, cols: Optional[list[int]] = None, labels: Optional[list[str]] = None) → xr.Dataset[source]#

Module contents#