encodermap package#

Subpackages#

Submodules#

encodermap._typing module#

Typing for the encodermap package

encodermap._version module#

Encodermap’s versioning follows semantic versioning guidelines. Read more about them here: https://semver.org/

tldr: Given a version number MAJOR.MINOR.PATCH, increment the:

  • MAJOR version when you make incompatible API changes,

  • MINOR version when you add functionality in a backwards compatible manner, and

  • PATCH version when you make backwards compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

Current example: Currently I am writing this documentation. Writing this will not break an API, nor does it add functionality, nor does it fix bugs. Thus, the version stays at 3.0.0.
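The increment rules above can be sketched in a few lines of Python. This is only an illustration of the semantic versioning rules; bump_version is a hypothetical helper and not part of encodermap.

```python
def bump_version(version: str, part: str) -> str:
    """Return `version` with the requested part incremented.

    Hypothetical helper that only handles plain MAJOR.MINOR.PATCH strings.
    """
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":
        # Incompatible API change: reset MINOR and PATCH.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backwards compatible functionality: reset PATCH.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backwards compatible bug fix.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown part: {part!r}")


print(bump_version("3.0.0", "minor"))  # 3.1.0
```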

get_config()[source]#

Create, populate and return the VersioneerConfig() object.

Return type:

VersioneerConfig

get_keywords()[source]#

Get the keywords needed to look up the version information.

Return type:

dict[str, str]

get_versions()[source]#

Get version information or return default if unable to do so.

Return type:

dict[str, Any]

git_get_keywords(versionfile_abs)[source]#

Extract version information from the given file.

Parameters:

versionfile_abs (str)

Return type:

dict[str, str]

git_pieces_from_vcs(tag_prefix, root, verbose, runner=<function run_command>)[source]#

Get version from ‘git describe’ in the root of the source tree.

This only gets called if the git-archive ‘subst’ keywords were not expanded, and _version.py hasn’t already been rewritten with a short version string, meaning we’re inside a checked out source tree.

Parameters:
Return type:

dict[str, Any]

git_versions_from_keywords(keywords, tag_prefix, verbose)[source]#

Get version information from git keywords.

Parameters:
Return type:

dict[str, Any]

pep440_split_post(ver)[source]#

Split pep440 version string at the post-release segment.

Returns the release segments before the post-release and the post-release version number (or None if no post-release segment is present).

Parameters:

ver (str)

Return type:

tuple[str, int | None]
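A minimal sketch of such a split, assuming that nothing follows the post-release number and that a missing post-release segment is reported as None (in line with the documented return type). The function split_post is a hypothetical stand-in and not the function from encodermap._version.

```python
from typing import Optional, Tuple


def split_post(ver: str) -> Tuple[str, Optional[int]]:
    """Split a PEP 440 version string at its ".post" segment (sketch)."""
    head, sep, tail = ver.partition(".post")
    if not sep:
        # No post-release segment present.
        return head, None
    # A bare ".post" without a number is treated as post-release 0.
    return head, int(tail or 0)


print(split_post("3.0.0.post4"))  # ('3.0.0', 4)
print(split_post("3.0.0"))        # ('3.0.0', None)
```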

plus_or_dot(pieces)[source]#

Return a + if we don’t already have one, else return a .

Parameters:

pieces (dict[str, Any])

Return type:

str

register_vcs_handler(vcs, method)[source]#

Create decorator to mark a method as the handler of a VCS.

Parameters:
Return type:

Callable

render(pieces, style)[source]#

Render the given version pieces into the requested style.

Parameters:
Return type:

dict[str, Any]

render_git_describe(pieces)[source]#

TAG[-DISTANCE-gHEX][-dirty].

Like ‘git describe --tags --dirty --always’.

Exceptions: 1: no tags. HEX[-dirty] (note: no ‘g’ prefix)

Parameters:

pieces (dict[str, Any])

Return type:

str

render_git_describe_long(pieces)[source]#

TAG-DISTANCE-gHEX[-dirty].

Like ‘git describe --tags --dirty --always --long’. The distance/hash is unconditional.

Exceptions: 1: no tags. HEX[-dirty] (note: no ‘g’ prefix)

Parameters:

pieces (dict[str, Any])

Return type:

str

render_pep440(pieces)[source]#

Build up version string, with post-release “local version identifier”.

Our goal: TAG[+DISTANCE.gHEX[.dirty]] . Note that if you get a tagged build and then dirty it, you’ll get TAG+0.gHEX.dirty

Exceptions: 1: no tags. git_describe was just HEX. 0+untagged.DISTANCE.gHEX[.dirty]

Parameters:

pieces (dict[str, Any])

Return type:

str
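The format TAG[+DISTANCE.gHEX[.dirty]] can be illustrated with a short sketch. The key names of the pieces dict used here ("closest-tag", "distance", "short", "dirty") are assumptions made for this example, and render_pep440_sketch is a hypothetical re-implementation, not the function documented above.

```python
def render_pep440_sketch(pieces: dict) -> str:
    """Render TAG[+DISTANCE.gHEX[.dirty]] from a pieces dict (sketch)."""
    tag = pieces["closest-tag"]
    # Local version identifier: DISTANCE.gHEX, optionally followed by .dirty
    local = f"{pieces['distance']}.g{pieces['short']}"
    if pieces["dirty"]:
        local += ".dirty"
    if tag is None:
        # Exception 1: no tags.
        return f"0+untagged.{local}"
    if pieces["distance"] == 0 and not pieces["dirty"]:
        # A clean, tagged build is just the tag.
        return tag
    # Use "." if the tag already contains a "+", else "+".
    sep = "." if "+" in tag else "+"
    return f"{tag}{sep}{local}"


pieces = {"closest-tag": "3.0.0", "distance": 0, "short": "abc1234", "dirty": True}
print(render_pep440_sketch(pieces))  # 3.0.0+0.gabc1234.dirty
```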

render_pep440_branch(pieces)[source]#

TAG[[.dev0]+DISTANCE.gHEX[.dirty]] .

The “.dev0” means not master branch. Note that .dev0 sorts backwards (a feature branch will appear “older” than the master branch).

Exceptions: 1: no tags. 0[.dev0]+untagged.DISTANCE.gHEX[.dirty]

Parameters:

pieces (dict[str, Any])

Return type:

str

render_pep440_old(pieces)[source]#

TAG[.postDISTANCE[.dev0]] .

The “.dev0” means dirty.

Exceptions: 1: no tags. 0.postDISTANCE[.dev0]

Parameters:

pieces (dict[str, Any])

Return type:

str

render_pep440_post(pieces)[source]#

TAG[.postDISTANCE[.dev0]+gHEX] .

The “.dev0” means dirty. Note that .dev0 sorts backwards (a dirty tree will appear “older” than the corresponding clean one), but you shouldn’t be releasing software with -dirty anyways.

Exceptions: 1: no tags. 0.postDISTANCE[.dev0]

Parameters:

pieces (dict[str, Any])

Return type:

str

render_pep440_post_branch(pieces)[source]#

TAG[.postDISTANCE[.dev0]+gHEX[.dirty]] .

The “.dev0” means not master branch.

Exceptions: 1: no tags. 0.postDISTANCE[.dev0]+gHEX[.dirty]

Parameters:

pieces (dict[str, Any])

Return type:

str

render_pep440_pre(pieces)[source]#

TAG[.postN.devDISTANCE] – No -dirty.

Exceptions: 1: no tags. 0.post0.devDISTANCE

Parameters:

pieces (dict[str, Any])

Return type:

str

run_command(commands, args, cwd=None, verbose=False, hide_stderr=False, env=None)[source]#

Call the given command(s).

Parameters:
Return type:

tuple[str | None, int | None]

versions_from_parentdir(parentdir_prefix, root, verbose)[source]#

Try to determine the version from the parent directory name.

Source tarballs conventionally unpack into a directory that includes both the project name and a version string. We will also support searching up two directory levels for an appropriately named parent directory.

Parameters:
  • parentdir_prefix (str)

  • root (str)

  • verbose (bool)

Return type:

dict[str, Any]
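The directory search described above might look roughly like the following sketch. It assumes that the version is whatever follows parentdir_prefix in the directory name and returns only that string instead of a full version dict. The name version_from_parentdir is hypothetical.

```python
import os
from typing import Optional


def version_from_parentdir(prefix: str, root: str, levels: int = 3) -> Optional[str]:
    """Look for a directory named <prefix><version> at `root` or up to two levels above."""
    for _ in range(levels):
        dirname = os.path.basename(root)
        if dirname.startswith(prefix):
            # Everything after the prefix is taken as the version string.
            return dirname[len(prefix):]
        # Not found at this level, go up one directory.
        root = os.path.dirname(root)
    return None


print(version_from_parentdir("encodermap-", "/tmp/encodermap-3.0.0/src/pkg"))  # 3.0.0
```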

encodermap.kondata module#

Functions for interfacing with the University of Konstanz’s repository service KonDATA.

get_from_kondata(dataset_name, output=None, force_overwrite=False, mk_parentdir=False, silence_overwrite_message=False, tqdm_class=None, download_extra_data=False, download_checkpoints=False, download_h5=True)[source]#

Get dataset from the University of Konstanz’s data repository KonDATA.

Parameters:
  • dataset_name (str) – The name of the dataset. Refer to DATASET_URL_MAPPING to get a list of the available datasets.

  • output (Union[str, Path]) – The output directory.

  • force_overwrite (bool) – Whether to overwrite existing files. Defaults to False.

  • mk_parentdir (bool) – Whether to create the output directory if it does not already exist. Defaults to False.

  • silence_overwrite_message (bool) – Whether to silence the ‘file already exists’ warning. Can be useful in scripts. Defaults to False.

  • tqdm_class (Optional[Any]) – A class that is similar to tqdm.tqdm. This is mainly useful if this function is used inside a rich.status.Status context manager, as the normal tqdm does not work inside this context. If None is provided, the default tqdm will be used.

  • download_extra_data (bool) – Whether to download extra data. It is only used if the dataset is not available on KonDATA. Defaults to False.

  • download_checkpoints (bool) – Whether to download pretrained checkpoints. It is only used if the dataset is not available on KonDATA. Defaults to False.

  • download_h5 (bool) – Whether to also download an h5 file of the ensemble. Defaults to True.

Returns:

The output directory.

Return type:

str

Module contents#

EncoderMap: Dimensionality reduction for molecular dynamics.

EncoderMap provides a framework for using molecular dynamics data with the tensorflow library. It started as the implementation of a neural network autoencoder to do dimensionality reduction and also create new high-dimensional data from the low-dimensional embedding. The user was still required to create their own dataset and provide the numpy arrays. In the second iteration of EncoderMap, the possibility to provide molecular dynamics data with the MolData class was added. A new neural network architecture was implemented to try and rebuild cartesian coordinates from the low-dimensional embedding.

This iteration of EncoderMap continues this endeavour by porting the old code to the newer tensorflow version (2.x). However, more has been added which should aid computational chemists and also structural biologists:

  • New trajectory classes with lazy loading of coordinates to accelerate analysis.

  • Featurization which can be parallelized using the distributed computing library dask.

  • Interactive plotly plots for clustering and structure creation.

  • Neural network building blocks that allow users to easily build new neural networks.

  • Sparse networks allow comparison of proteins with different topologies.

Todo

  • [ ] Rework all notebooks.
    • [x] 01 Basic cube

    • [x] 02 asp7

    • [x] 03 your data

    • [ ] customization

    • [ ] Ensembles and ensemble classes

    • [ ] Ub mutants

    • [ ] sidechain reconstruction (if possible)

    • [ ] FAT10 (if possible)

  • [ ] Rewrite the install encodermap script in a github gist and add that to the notebooks.

  • [ ] Record videos.

  • [~] Fix FAT10 NaNs
    • [ ] NaNs are fixed, but training still bad.

    • [x] Check whether sigmoid values are good for FAT10
      • [x] Test [40, 10, 5, 1, 2, 5] (from linear dimers) and compare.

    • [ ] Test (20, 10, 5, 1, 2, 5)

  • [~] Fix sidechain reconstruction NaNs
    • [ ] Try out LSTM layers

    • [ ] Try out gradient clipping

    • [~] Try out a higher regularization cost (increase l2 reg constant from 0.001 to 0.1)

  • [ ] Remove OTU11 from tests

  • [ ] Image for FAT10 decoding, if NaN error is fixed.

  • [ ] Delete commented stuff (i.e. all occurrences of more than 3 # signs in lines)

  • [ ] Fix the deterministic training for M1diUb

  • [ ] Add FAT10 to the deterministic training.

class ADCParameters(**kwargs)[source]#

Bases: ParametersFramework

This is the parameter object for the AngleDihedralCartesianEncoder. It holds all the parameters that the Parameters object includes, plus the following attributes:

Parameters:

kwargs (ParametersData)

track_clashes#

Whether to track the number of clashes during training. The average number of clashes is the average number of pairwise distances in the reconstructed cartesian coordinates that are smaller than 1 (nm). Defaults to False.

Type:

bool
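As a rough illustration of what counting clashes means, the sketch below counts all atom pairs of a single structure whose distance falls below a threshold. It uses plain Python instead of tensorflow, and count_clashes is a hypothetical helper, not the metric implemented in encodermap.

```python
import itertools
import math


def count_clashes(coords, threshold=1.0):
    """Count atom pairs that are closer to each other than `threshold`."""
    return sum(
        1
        for a, b in itertools.combinations(coords, 2)
        if math.dist(a, b) < threshold
    )


# Three atoms on a line; only the first two are closer than the threshold.
coords = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(count_clashes(coords))  # 1
```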

track_RMSD#

Whether to track the RMSD of the input and reconstructed cartesians during training. The RMSDs are computed along the batch by minimizing over the rotation R and the translation t:

\text{RMSD}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \min_{\mathsf{R}, \mathbf{t}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[ (\mathsf{R}\cdot\mathbf{x}_{i} + \mathbf{t}) - \mathbf{x}_{i}^{\text{ref}} \right]^{2}}

This results in n RMSD values, where n is the size of the batch. A mean RMSD of this batch and the values for this batch will be logged to tensorboard.

Type:

bool

cartesian_pwd_start#

Index of the first atom to use for the pairwise distance calculation.

Type:

int

cartesian_pwd_stop#

Index of the last atom to use for the pairwise distance calculation.

Type:

int

cartesian_pwd_step#

Step for the calculation of pairwise distances. E.g. for a chain of atoms N-C_a-C-N-C_a-C… cartesian_pwd_start=1 and cartesian_pwd_step=3 will result in using all C-alpha atoms for the pairwise distance calculation.

Type:

int
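The effect of the three cartesian_pwd_* attributes can be pictured as a slice over the backbone atoms. The sketch below assumes Python slice semantics (start, stop, step) and uses a made-up list of atom names.

```python
# Backbone of three residues: N-C_a-C-N-C_a-C-N-C_a-C
backbone = ["N", "CA", "C"] * 3

# Values corresponding to cartesian_pwd_start=1 and cartesian_pwd_step=3.
cartesian_pwd_start, cartesian_pwd_stop, cartesian_pwd_step = 1, None, 3

# Select the atoms that enter the pairwise distance calculation.
selected = backbone[slice(cartesian_pwd_start, cartesian_pwd_stop, cartesian_pwd_step)]
print(selected)  # ['CA', 'CA', 'CA']
```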

use_backbone_angles#

Defines whether backbone bond angles should be learned (True) or whether mean values should be used instead to generate conformations (False).

Type:

bool

use_sidechains#

Whether sidechain dihedrals should be passed through the autoencoder.

Type:

bool

angle_cost_scale#

Adjusts how much the angle cost is weighted in the cost function.

Type:

int

angle_cost_variant#

Defines how the angle cost is calculated. Must be one of:

  • “mean_square”

  • “mean_abs”

  • “mean_norm”.

Type:

str
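The three variants can be sketched in plain Python as follows. This assumes that the differences between input and output angles are taken as periodic distances and that “mean_norm” is the mean over the samples of the Euclidean norm of each row. The function names are hypothetical, and the actual implementation works on tensors.

```python
import math


def periodic_distance(a, b, periodicity=2 * math.pi):
    """Shortest distance between two values in a periodic space."""
    d = abs(b - a) % periodicity
    return min(d, periodicity - d)


def angle_cost(inputs, outputs, variant="mean_abs", periodicity=2 * math.pi):
    """Cost between rows of input and output angles (one row per sample)."""
    diffs = [
        [periodic_distance(x, y, periodicity) for x, y in zip(row_in, row_out)]
        for row_in, row_out in zip(inputs, outputs)
    ]
    flat = [d for row in diffs for d in row]
    if variant == "mean_square":
        return sum(d * d for d in flat) / len(flat)
    if variant == "mean_abs":
        # Periodic distances are already non-negative.
        return sum(flat) / len(flat)
    if variant == "mean_norm":
        # Euclidean norm per sample, averaged over all samples.
        return sum(math.sqrt(sum(d * d for d in row)) for row in diffs) / len(diffs)
    raise ValueError(f"Unknown variant: {variant!r}")


inputs, outputs = [[0.0, 0.0]], [[3.0, 4.0]]
for variant in ("mean_square", "mean_abs", "mean_norm"):
    print(variant, angle_cost(inputs, outputs, variant, periodicity=100.0))
```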

angle_cost_reference#

Can be used to normalize the angle cost with the cost of a reference model (dummy).

Type:

int

dihedral_cost_scale#

Adjusts how much the dihedral cost is weighted in the cost function.

Type:

int

dihedral_cost_variant#

Defines how the dihedral cost is calculated. Must be one of:

  • “mean_square”

  • “mean_abs”

  • “mean_norm”.

Type:

str

dihedral_cost_reference#

Can be used to normalize the dihedral cost with the cost of a reference model (dummy).

Type:

int

side_dihedral_cost_scale#

Adjusts how much the side dihedral cost is weighted in the cost function.

Type:

int

side_dihedral_cost_variant#

Defines how the side dihedral cost is calculated. Must be one of:

  • “mean_square”

  • “mean_abs”

  • “mean_norm”.

Type:

str

side_dihedral_cost_reference#

Can be used to normalize the side dihedral cost with the cost of a reference model (dummy).

Type:

int

cartesian_cost_scale#

Adjusts how much the cartesian cost is weighted in the cost function.

Type:

int

cartesian_cost_scale_soft_start#

Allows the cartesian cost to be turned on slowly. Must be a tuple of (start, end) or (None, None). If start and end are given, cartesian_cost_scale will be increased linearly in the given range.

Type:

tuple
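A minimal sketch of such a soft start, assuming that the scale is zero before the start step, reaches the full cartesian_cost_scale at the end step, and is interpolated linearly in between. The function soft_start_scale is hypothetical.

```python
def soft_start_scale(step, cartesian_cost_scale, soft_start):
    """Return the effective cartesian cost scale at a given training step."""
    start, end = soft_start
    if start is None or end is None:
        # No soft start requested, use the full scale from the beginning.
        return cartesian_cost_scale
    if step < start:
        return 0.0
    if step >= end:
        return cartesian_cost_scale
    # Linear increase between start and end.
    return cartesian_cost_scale * (step - start) / (end - start)


for step in (3, 6, 9, 12):
    print(step, soft_start_scale(step, 1.0, (6, 12)))
```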

cartesian_cost_variant#

Defines how the cartesian cost is calculated. Must be one of:

  • “mean_square”

  • “mean_abs”

  • “mean_norm”.

Type:

str

cartesian_cost_reference#

Can be used to normalize the cartesian cost with the cost of a reference model (dummy).

Type:

int

cartesian_dist_sig_parameters#

Parameters for the sigmoid functions applied to the high- and low-dimensional distances in the following order (sig_h, a_h, b_h, sig_l, a_l, b_l).

Type:

tuple of floats
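To give an idea of how the six values are used, the sketch below evaluates a sketch-map style sigmoid for the high-dimensional and the low-dimensional triple of the default values. The functional form is an assumption here, as this section does not spell out the formula, and the function sigmoid is a hypothetical stand-in.

```python
def sigmoid(r, sig, a, b):
    """Sketch-map style sigmoid that equals 0.5 at r == sig (assumed form)."""
    return 1 - (1 + (2 ** (a / b) - 1) * (r / sig) ** a) ** (-b / a)


# Default values in the order (sig_h, a_h, b_h, sig_l, a_l, b_l).
sig_h, a_h, b_h, sig_l, a_l, b_l = (4.5, 12, 6, 1, 2, 6)

# High-dimensional distances use the first triple, low-dimensional the second.
for r in (0.0, 2.0, 4.5, 8.0):
    print(r, round(sigmoid(r, sig_h, a_h, b_h), 4), round(sigmoid(r, sig_l, a_l, b_l), 4))
```

With this form, sig sets the distance at which the sigmoid reaches 0.5, while a and b control how steeply it rises below and above that distance.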

cartesian_distance_cost_scale#

Adjusts how much the cartesian distance cost is weighted in the cost function.

Type:

int

multimer_training#

Experimental feature.

Type:

Any

multimer_topology_classes#

Experimental feature.

Type:

Any

multimer_connection_bridges#

Experimental feature.

Type:

Any

multimer_lengths#

Experimental feature.

Type:

Any

reconstruct_sidechains#

Whether to also reconstruct sidechains.

Type:

bool

Examples

>>> import encodermap as em
>>> import tempfile
>>> from pathlib import Path
...
>>> with tempfile.TemporaryDirectory() as td:
...     td = Path(td)
...     p = em.Parameters()
...     print(p.auto_cost_variant)
...     savepath = p.save(td / "parameters.json")
...     print(savepath)
...     new_params = em.Parameters.from_file(td / "parameters.json")
...     print(new_params.main_path)  
mean_abs
/tmp...parameters.json
seems like the parameter file was moved to another directory. Parameter file is updated ...
/home...
_defaults = {'activation_functions': ['', 'tanh', 'tanh', ''], 'analysis_path': '', 'angle_cost_reference': 1, 'angle_cost_scale': 0, 'angle_cost_variant': 'mean_abs', 'auto_cost_scale': None, 'auto_cost_variant': 'mean_abs', 'batch_size': 256, 'batched': True, 'cartesian_cost_reference': 1, 'cartesian_cost_scale': 1, 'cartesian_cost_scale_soft_start': (None, None), 'cartesian_cost_variant': 'mean_abs', 'cartesian_dist_sig_parameters': (4.5, 12, 6, 1, 2, 6), 'cartesian_distance_cost_scale': 1, 'cartesian_pwd_start': None, 'cartesian_pwd_step': None, 'cartesian_pwd_stop': None, 'center_cost_scale': 0.0001, 'checkpoint_step': 5000, 'current_training_step': 0, 'dihedral_cost_reference': 1, 'dihedral_cost_scale': 1, 'dihedral_cost_variant': 'mean_abs', 'dist_sig_parameters': (4.5, 12, 6, 1, 2, 6), 'distance_cost_scale': None, 'gpu_memory_fraction': 0, 'id': '', 'l2_reg_constant': 0.001, 'learning_rate': 0.001, 'loss': 'emap_cost', 'model_api': 'functional', 'multimer_connection_bridges': None, 'multimer_lengths': None, 'multimer_topology_classes': None, 'multimer_training': None, 'n_neurons': [128, 128, 2], 'n_steps': 1000, 'periodicity': 6.283185307179586, 'reconstruct_sidechains': False, 'seed': None, 'side_dihedral_cost_reference': 1, 'side_dihedral_cost_scale': 0.5, 'side_dihedral_cost_variant': 'mean_abs', 'summary_step': 10, 'tensorboard': False, 'track_RMSD': False, 'track_clashes': False, 'trainable_dense_to_sparse': False, 'training': 'auto', 'use_backbone_angles': False, 'use_sidechains': False, 'using_hypercube': False, 'write_summary': False}#
classmethod defaults_description()[source]#

str: A string that contains tabulated default parameter values.

Return type:

str

class AngleDihedralCartesianEncoderMap(trajs=None, parameters=None, model=None, read_only=False, dataset=None, ensemble=False, use_dataset_when_possible=True, deterministic=False)[source]#

Bases: object

Has a different __init__ method than the Autoencoder class. Uses callbacks to tune in the cartesian cost.

Overwritten methods: _set_up_callbacks and generate.

Examples

>>> import encodermap as em
>>> from pathlib import Path
>>> # Load two trajectories
>>> test_data = Path(em.__file__).parent.parent / "tests/data"
>>> test_data.is_dir()
True
>>> xtcs = [test_data / "1am7_corrected_part1.xtc", test_data / "1am7_corrected_part2.xtc"]
>>> tops = [test_data / "1am7_protein.pdb", test_data  /"1am7_protein.pdb"]
>>> trajs = em.load(xtcs, tops)
>>> print(trajs)
encodermap.TrajEnsemble object. Current backend is no_load. Containing 2 trajectories. Not containing any CVs.
>>> # load CVs
>>> # This step can be omitted. The AngleDihedralCartesianEncoderMap class automatically loads CVs
>>> trajs.load_CVs('all')
>>> print(trajs.CVs['central_cartesians'].shape)
(51, 474, 3)
>>> print(trajs.CVs['central_dihedrals'].shape)
(51, 471)
>>> # create some parameters
>>> p = em.ADCParameters(periodicity=360, use_backbone_angles=True, use_sidechains=True,
...                      cartesian_cost_scale_soft_start=(6, 12))
>>> # Standard is functional model, as it offers more flexibility
>>> print(p.model_api)
functional
>>> print(p.distance_cost_scale)
None
>>> # Instantiate the class
>>> e_map = em.AngleDihedralCartesianEncoderMap(trajs, p, read_only=True)  
Model...
>>> # dataset contains these inputs:
>>> # central_angles, central_dihedrals, central_cartesians, central_distances, sidechain_dihedrals
>>> print(e_map.dataset)  
<BatchDataset element_spec=(TensorSpec(shape=(None, 472), dtype=tf.float32, name=None), TensorSpec(shape=(None, 471), dtype=tf.float32, name=None), TensorSpec(shape=(None, 474, 3), dtype=tf.float32, name=None), TensorSpec(shape=(None, 473), dtype=tf.float32, name=None), TensorSpec(shape=(None, 316), dtype=tf.float32, name=None))>
>>> # output from the model contains the following data:
>>> # out_angles, out_dihedrals, back_cartesians, pairwise_distances of inp cartesians, pairwise of back-mapped cartesians, out_side_dihedrals
>>> for data in e_map.dataset.take(1):
...     pass
>>> out = e_map.model(data)
>>> print([i.shape for i in out])  
[TensorShape([256, 472]), TensorShape([256, 471]), TensorShape([256, 474, 3]), TensorShape([256, 112101]), TensorShape([256, 112101]), TensorShape([256, 316])]
>>> # get output of latent space by providing central_angles, central_dihedrals, sidechain_dihedrals
>>> latent = e_map.encoder([data[0], data[1], data[-1]])
>>> print(latent.shape)
(256, 2)
>>> # Rebuild central_angles, central_dihedrals and sidechain_dihedrals from latent
>>> ang, dih, side_dih = e_map.decode(latent)
>>> print(ang.shape, dih.shape, side_dih.shape)
(256, 472) (256, 471) (256, 316)
Parameters:
  • trajs (Optional[TrajEnsemble])

  • parameters (Optional[ADCParameters])

  • model (Optional[tf.keras.Model])

  • read_only (bool)

  • dataset (Optional[tf.data.Dataset])

  • ensemble (bool)

  • use_dataset_when_possible (bool)

  • deterministic (bool)

add_callback(callback)[source]#

Adds a new callback to the existing callbacks.

add_images_to_tensorboard(*args, **kwargs)[source]#

Adds images of the latent space to tensorboard.

Parameters:
  • data (Optional[Union[np.ndarray, Sequence[np.ndarray]]) – The input-data will be passed through the encoder part of the autoencoder. If None is provided, a set of 10_000 points from self.train_data will be taken. A list[np.ndarray] is needed for the functional API of the AngleDihedralCartesianEncoderMap, that takes a list of [angles, dihedrals, side_dihedrals]. Defaults to None.

  • image_step (Optional[int]) – The interval in which to plot images to tensorboard. If None is provided, the image_step will be the same as Parameters.summary_step. Defaults to None.

  • max_size (int) – The maximum size of the high-dimensional data, that is projected. Prevents excessively large-datasets from being projected at every image_step. Defaults to 10_000.

  • scatter_kws (Optional[dict[str, Any]]) – A dict with items that plotly.express.scatter() will accept. If None is provided, a dict with size 20 will be passed to px.scatter(**{‘size_max’: 10, ‘opacity’: 0.2}), which sets an appropriate size of scatter points for the size of datasets encodermap is usually used for.

  • hist_kws (Optional[dict[str, Any]]) – A dict with items that encodermap.plot.plotting._plot_free_energy() will accept. If None is provided a dict with bins 50 will be passed to encodermap.plot.plotting._plot_free_energy(**{‘bins’: 50}). You can choose a colormap here by providing {‘bins’: 50, ‘cmap’: ‘plasma’} for this argument.

  • additional_fns (Optional[Sequence[Callable]]) – A list of functions that will accept the low-dimensional output of the Autoencoder latent/bottleneck layer and return a tf.Tensor that can be logged by tf.summary.image(). See the notebook ‘writing_custom_images_to_tensorboard.ipynb’ in tutorials/notebooks_customization for more info. If None is provided, no additional functions will be used to plot to tensorboard. Defaults to None.

  • when (Literal["epoch", "batch"]) – When to log the images. Can be either ‘batch’, in which case the images are logged after every step during training, or ‘epoch’, in which case the images are only written after every image_step epoch. Defaults to ‘epoch’.

  • save_to_disk (bool) – Whether to also write the images to disk.

  • args (Any)

  • kwargs (Any)

Return type:

None

add_loss(loss)[source]#

Adds a new loss to the existing losses.

add_metric(metric)[source]#

Adds a new metric to the existing metrics.

close()[source]#

Clears the current keras backend and frees up resources.

Return type:

None

decode(data)[source]#

Calls the decoder part of the model.

AngleDihedralCartesianAutoencoder will, like the other two classes, output a list of np.ndarray.

Parameters:

data (np.ndarray) – The data to be passed to the decoder part of the model. Make sure that the shape of the data matches the number of neurons in the latent space.

Returns:

Outputs from the decoder part.

For AngleDihedralCartesianEncoderMap, this will be a list of np.ndarray.

Return type:

Union[list[np.ndarray], np.ndarray]

property decoder: Model#

The decoder Model.

Type:

tf.keras.Model

encode(data=None)[source]#

Runs the central_angles, central_dihedrals, (side_dihedrals) through the autoencoder. Make sure that data has the correct shape.

Parameters:

data (Sequence[np.ndarray]) – Provide a sequence of central_angles and central_dihedrals. If you used sidechain_dihedrals during training, append these to the end of the sequence.

Returns:

The latent space representation of the provided data.

Return type:

np.ndarray

property encoder: Model#

The encoder Model.

Type:

tf.keras.Model

classmethod from_checkpoint(trajs, checkpoint_path, dataset=None, use_previous_model=False, compat=False)[source]#

Reconstructs the model from a checkpoint.

Although the model can be loaded from disk without any form of data and still yield the correct input and output shapes, it is required to provide either trajs or dataset to double-check that the correct model will be reloaded.

This is also why the sparse argument is not needed, as sparsity of the input data is a property of the TrajEnsemble provided.

Parameters:
  • trajs (Union[None, TrajEnsemble]) – Either None (in which case, the argument dataset is required), or an instance of TrajEnsemble, which was used to instantiate the AngleDihedralCartesianEncoderMap, before it was saved to disk.

  • checkpoint_path (Union[Path, str]) – The path to the checkpoint. Can either be the path to a .keras file or to a directory containing .keras files, in which case the most recently created .keras file will be used.

  • dataset (Optional[tf.data.Dataset]) – If trajs is not provided, a dataset is required to make sure the input shapes match the model, that is stored on the disk.

  • use_previous_model (bool) – Set this flag to True, if you load a model from an in-between checkpoint step (e.g., to continue training with different parameters). If you have the files saved_model_0.keras, saved_model_500.keras and saved_model_1000.keras, setting this to True and loading the saved_model_500.keras will back up the saved_model_1000.keras.

  • compat (bool) – Whether to use compatibility mode when missing or wrong parameter files are present. In this special case, some assumptions about the network architecture are made from the model and the parameters in parameters.json overwritten accordingly (a backup will also be made).

Returns:

An instance of AngleDihedralCartesianEncoderMap.

Return type:

AngleDihedralCartesianEncoderMapType
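How a checkpoint_path that points to a directory could be resolved to a single file is sketched below. The sketch uses the modification time as a stand-in for the creation time, and latest_keras_file is a hypothetical helper, not part of encodermap.

```python
import os
import tempfile
from pathlib import Path


def latest_keras_file(checkpoint_path):
    """Return the path itself if it is a file, else the newest .keras file inside it."""
    path = Path(checkpoint_path)
    if path.is_file():
        return path
    # Sort by modification time (assumed proxy for "most recently created").
    candidates = sorted(path.glob("*.keras"), key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f"No .keras files found in {path}")
    return candidates[-1]


with tempfile.TemporaryDirectory() as td:
    for i, step in enumerate((0, 500, 1000)):
        f = Path(td) / f"saved_model_{step}.keras"
        f.touch()
        # Give every file a distinct, increasing timestamp.
        os.utime(f, (1_000_000 + i, 1_000_000 + i))
    newest_name = latest_keras_file(td).name

print(newest_name)  # saved_model_1000.keras
```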

generate(points: ndarray, top: str | int | Topology | None, backend: Literal['mdtraj'], progbar: Any | None) Trajectory[source]#
generate(points: ndarray, top: str | int | Topology | None, backend: Literal['mdanalysis'], progbar: Any | None) Universe

Overrides the parent class’ generate method and builds a trajectory.

Instead of just providing data to decode using the decoder part of the network, this method also takes a molecular topology as its top argument. This topology is then used to rebuild a time-resolved trajectory.

Parameters:
  • points (np.ndarray) – The low-dimensional points from which the trajectory should be rebuilt.

  • top (Optional[str, int, mdtraj.Topology]) – The topology to be used for rebuilding the trajectory. This should be a string pointing towards a <*.pdb, *.gro, *.h5> file. Alternatively, None can be provided; in which case, the internal topology (self.top) of this class is used. Defaults to None.

  • backend (str) –

    Defines which MD Python package is used to build the trajectory and which type this method returns. Needs to be one of the following:

    • “mdtraj”

    • “mdanalysis”

Returns:

The trajectory after applying the decoded structural information. The type of this depends on the chosen backend parameter.

Return type:

Union[mdtraj.Trajectory, MDAnalysis.Universe]

static get_train_data_from_trajs(trajs, p, attr='CVs', max_size=-1)[source]#

Builds train data from a TrajEnsemble.

Parameters:
  • trajs (TrajEnsemble) – A TrajEnsemble instance.

  • p (encodermap.parameters.ADCParameters) – An instance of encodermap.parameters.ADCParameters.

  • attr (str) – Which attribute to get from TrajEnsemble. This defaults to ‘CVs’, because ‘CVs’ is usually a dict containing the CV data. However, you can build the train data from any dict in the TrajEnsemble.

  • max_size (int) – When you only want a subset of the CV data. Set this to the desired size.

Returns:

A tuple containing the following:
  • bool: A bool that shows whether some ‘CV’ values are np.nan (True), which will be used to decide whether the sparse training will be used.

  • list[np.ndarray]: An array of features fed into the autoencoder, concatenated along the feature axis. The order of the features is: central_angles, central_dihedrals, (side_dihedrals if p.use_sidechains is True).

  • dict[str, np.ndarray]: The training data as a dict, containing all values in trajs.CVs.

Return type:

tuple

plot_network()[source]#

Tries to plot the network using pydot, pydotplus and graphviz. Doesn’t raise an exception if plotting is not possible.

Return type:

None

save(step=None)[source]#

Saves the model to the current path defined in parameters.main_path.

Parameters:

step (Optional[int]) – Does not save the model at the given training step, but rather changes the string used for saving the model from a datetime format to another.

Returns:

When the model has been saved, the Path will be returned. If the model could not be saved, None will be returned.

Return type:

Union[None, Path]

set_train_data(data)[source]#

Resets the train data for reloaded models.

Parameters:

data (TrajEnsemble)

Return type:

None

train()[source]#

Overwrites the parent class’ train() method to implement references.

Return type:

dict[str, Any] | None

train_for_references(subsample=100, maxiter=500)[source]#

Calculates the angle, dihedral, and cartesian costs to so-called references, which can be used to bring these costs to a similar magnitude.

Parameters:
  • subsample (int)

  • maxiter (int)

Return type:

None

class Autoencoder(parameters=None, train_data=None, model=None, read_only=False, sparse=False)[source]#

Bases: object

Main Autoencoder class. Presents all high-level functions.

This is the main class for neural networks inside EncoderMap. The class prepares the data (batching and shuffling), creates a tf.keras.Model of layers specified by the attributes of the encodermap.Parameters class. Depending on what Parent/Child-Class is instantiated, a combination of various cost functions is set up. Callbacks to Tensorboard are also set up.

Parameters:
  • train_data (Optional[Union[np.ndarray, tf.data.Dataset]])

  • model (Optional[tf.keras.Model])

  • read_only (bool)

  • sparse (bool)

train_data#

The numpy array of the train data passed at init.

Type:

np.ndarray

p#

An encodermap.Parameters class containing all info needed to set up the network.

Type:

AnyParameters

dataset#

The dataset that is actually used in training the keras model. The dataset is a batched, shuffled, infinitely-repeating dataset.

Type:

tensorflow.data.Dataset
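What “batched, shuffled, infinitely-repeating” means can be illustrated with a plain Python generator. This is only a conceptual sketch; the actual dataset is built with tensorflow, and the order of the shuffle, batch and repeat operations there may differ. The name batched_shuffled_repeating is hypothetical.

```python
import itertools
import random


def batched_shuffled_repeating(data, batch_size, seed=None):
    """Yield shuffled batches of `data` forever, reshuffling after every pass."""
    rng = random.Random(seed)
    while True:
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in range(0, len(order), batch_size):
            yield [data[j] for j in order[i:i + batch_size]]


data = list(range(6))
# The generator never ends, so only the first five batches are taken.
batches = list(itertools.islice(batched_shuffled_repeating(data, batch_size=4, seed=0), 5))
print([len(b) for b in batches])  # [4, 2, 4, 2, 4]
```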

read_only#

Variable telling the class whether it is allowed to write to disk (False) or not (True).

Type:

bool

metrics#

A list of metrics passed to the model when it is compiled.

Type:

list[Any]

callbacks#

A list of tf.keras.callbacks.Callback subclasses changing the behavior of the model during training. Some standard callbacks are always present like:

  • encodermap.callbacks.callbacks.ProgressBar:

    A progress bar callback using tqdm giving the current progress of training and the current loss.

  • CheckPointSaver:

    A callback that saves the model every parameters.checkpoint_step steps into the main directory. This callback will only be used, when read_only is False.

  • TensorboardWriteBool:

    A callback that contains a boolean Tensor that will be True or False, depending on the current training step and the summary_step in the parameters class. The loss functions use this callback to decide whether they should write to Tensorboard. This callback will only be present when read_only is False and parameters.tensorboard is True.

You can append your own callbacks to this list before executing self.train().

Type:

list[Any]

encoder#

The encoder submodel of self.model.

Type:

tf.keras.Model

decoder#

The decoder submodel of self.model.

Type:

tf.keras.Model

loss#

A list of loss functions passed to the model when it is compiled. When the main Autoencoder class is used and parameters.loss is ‘emap_cost’, this list comprises center_cost, regularization_cost, auto_cost. When the EncoderMap sub-class is used and parameters.loss is ‘emap_cost’, distance_cost is added to the list. When parameters.loss is not ‘emap_cost’, the loss can be either a string (‘mse’) or a function, both of which are acceptable arguments for loss when a keras model is compiled.

Type:

Sequence[Callable]

from_checkpoint()[source]#

Rebuild the model from a checkpoint.

Parameters:
Return type:

AutoencoderType

add_images_to_tensorboard()[source]#

Make tensorboard plot images.

Parameters:
Return type:

None

train()[source]#

Starts the training of the tf.keras.models.Model.

Return type:

dict[str, Any] | None

plot_network()[source]#

Tries to plot the network. For this method to work, graphviz, pydot and pydotplus need to be installed.

Return type:

None

encode()[source]#

Takes high-dimensional data and sends it through the encoder.

Parameters:

data (Sequence[ndarray] | None)

Return type:

ndarray

decode()[source]#

Takes low-dimensional data and sends it through the decoder.

Parameters:

data (ndarray)

Return type:

Sequence[ndarray]

generate()[source]#

Same as decode. For AngleDihedralCartesianAutoencoder classes, this will build a protein structure.

Parameters:

data (ndarray)

Return type:

ndarray

Note

Performance of tensorflow depends not only on your system’s hardware and on how the data is presented to the network (for this check out https://www.tensorflow.org/guide/data_performance), but also on how you compiled tensorflow. Normal tensorflow (pip install tensorflow) is built without CPU extensions so that it works on many CPUs. However, tensorflow can benefit greatly from CPU instructions like AVX2 and AVX512, which bring a speed-up of 300% in linear algebra computations. By building tensorflow from source, you can activate these extensions. However, the speed-up of using tensorflow with a GPU dwarfs the CPU speed-up. To check whether a GPU is available, run: print(len(tf.config.list_physical_devices('GPU'))). Refer to these pages to install tensorflow for the best performance: https://www.tensorflow.org/install/pip and https://www.tensorflow.org/install/gpu

Examples

>>> import encodermap as em
>>> # without providing any data, default parameters and a 4D
>>> # hypercube as input data will be used.
>>> e_map = em.EncoderMap(read_only=True)
>>> print(e_map.train_data.shape)
(16000, 4)
>>> print(e_map.dataset)  
<BatchDataset element_spec=(TensorSpec(shape=(None, 4), dtype=tf.float32, name=None), TensorSpec(shape=(None, 4), dtype=tf.float32, name=None))>
>>> print(e_map.encode(e_map.train_data).shape)
(16000, 2)
add_callback(callback)[source]#

Adds a new callback to the existing callbacks.

add_images_to_tensorboard(*args, **kwargs)[source]#

Adds images of the latent space to tensorboard.

Parameters:
  • data (Optional[Union[np.ndarray, Sequence[np.ndarray]]]) – The input data will be passed through the encoder part of the autoencoder. If None is provided, a set of 10_000 points from self.train_data will be taken. A list[np.ndarray] is needed for the functional API of the AngleDihedralCartesianEncoderMap, which takes a list of [angles, dihedrals, side_dihedrals]. Defaults to None.

  • image_step (Optional[int]) – The interval in which to plot images to tensorboard. If None is provided, the image_step will be the same as Parameters.summary_step. Defaults to None.

  • max_size (int) – The maximum size of the high-dimensional data, that is projected. Prevents excessively large-datasets from being projected at every image_step. Defaults to 10_000.

  • scatter_kws (Optional[dict[str, Any]]) – A dict with items that plotly.express.scatter() will accept. If None is provided, the dict {‘size_max’: 10, ‘opacity’: 0.2} will be passed to px.scatter(), which sets an appropriate size of scatter points for the size of datasets encodermap is usually used for.

  • hist_kws (Optional[dict[str, Any]]) – A dict with items that encodermap.plot.plotting._plot_free_energy() will accept. If None is provided a dict with bins 50 will be passed to encodermap.plot.plotting._plot_free_energy(**{‘bins’: 50}). You can choose a colormap here by providing {‘bins’: 50, ‘cmap’: ‘plasma’} for this argument.

  • additional_fns (Optional[Sequence[Callable]]) – A list of functions that will accept the low-dimensional output of the Autoencoder latent/bottleneck layer and return a tf.Tensor that can be logged by tf.summary.image(). See the notebook ‘writing_custom_images_to_tensorboard.ipynb’ in tutorials/notebooks_customization for more info. If None is provided, no additional functions will be used to plot to tensorboard. Defaults to None.

  • when (Literal["epoch", "batch"]) – When to log the images. Can be either ‘batch’, in which case the images are logged after every step during training, or ‘epoch’, in which case the images are written only after every image_step epochs. Defaults to ‘epoch’.

  • save_to_disk (bool) – Whether to also write the images to disk.

  • args (Any)

  • kwargs (Any)

Return type:

None

add_loss(loss)[source]#

Adds a new loss to the existing losses.

add_metric(metric)[source]#

Adds a new metric to the existing metrics.

close()[source]#

Clears the current keras backend and frees up resources.

Return type:

None

decode(data)[source]#

Calls the decoder part of the model.

AngleDihedralCartesianAutoencoder will, unlike the other two classes, output a list of np.ndarray.

Parameters:

data (np.ndarray) – The data to be passed to the decoder part of the model. Make sure that the shape of the data matches the number of neurons in the latent space.

Returns:

Outputs from the decoder part.

For AngleDihedralCartesianEncoderMap, this will be a list of np.ndarray.

Return type:

Union[list[np.ndarray], np.ndarray]

property decoder: Model#

Decoder part of the model.

Type:

tf.keras.Model

encode(data=None)[source]#

Calls encoder part of self.model.

Parameters:

data (Optional[np.ndarray]) – The data to be passed to the encoder part. It can be either a numpy ndarray or None. If None is provided, a set of 10000 points from the provided train data will be taken. Defaults to None.

Returns:

The output from the bottleneck/latent layer.

Return type:

np.ndarray

property encoder: Model#

Encoder part of the model.

Type:

tf.keras.Model

classmethod from_checkpoint(checkpoint_path, train_data=None, sparse=False, use_previous_model=False, compat=False)[source]#

Reconstructs the class from a checkpoint.

Parameters:
  • checkpoint_path (Union[str, Path]) – The path to the checkpoint. Can be either a directory, in which case the most recently saved model will be loaded. Or a direct .keras file, in which case, this specific model will be loaded.

  • train_data (Optional[np.ndarray]) – The train data can be provided here.

  • sparse (bool) – Whether the reloaded model should be sparse.

  • use_previous_model (bool) – Set this flag to True, if you load a model from an in-between checkpoint step (e.g., to continue training with different parameters). If you have the files saved_model_0.keras, saved_model_500.keras and saved_model_1000.keras, setting this to True and loading the saved_model_500.keras will back up the saved_model_1000.keras.

  • compat (bool) – Whether to use compatibility mode when parameter files are missing or wrong. In this special case, some assumptions about the network architecture are made from the model, and the parameters in parameters.json are overwritten accordingly (a backup will also be made).

Returns:

Encodermap Autoencoder class.

Return type:

Autoencoder
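
The checkpoint selection described for checkpoint_path and use_previous_model can be illustrated with a small sketch. It is not EncoderMap's implementation: the helper names are hypothetical, and it assumes that "most recently saved" corresponds to the highest step number in the saved_model_<step>.keras file names and that only checkpoints with a higher step than the chosen one would be backed up.

```python
import re

# Assumption: checkpoints follow the saved_model_<step>.keras naming scheme.
_PATTERN = re.compile(r"saved_model_(\d+)\.keras$")


def checkpoint_step(filename):
    """Extract the training step from a checkpoint file name."""
    match = _PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Not a checkpoint file: {filename}")
    return int(match.group(1))


def latest_checkpoint(filenames):
    """Pick the checkpoint with the highest step (hypothetical helper)."""
    return max(filenames, key=checkpoint_step)


def later_checkpoints(filenames, chosen):
    """List checkpoints saved after `chosen`, i.e. candidates for a backup."""
    chosen_step = checkpoint_step(chosen)
    later = [f for f in filenames if checkpoint_step(f) > chosen_step]
    return sorted(later, key=checkpoint_step)


files = ["saved_model_0.keras", "saved_model_1000.keras", "saved_model_500.keras"]
newest = latest_checkpoint(files)
to_back_up = later_checkpoints(files, "saved_model_500.keras")
```

With the three files from the description, continuing from saved_model_500.keras marks only saved_model_1000.keras as a backup candidate.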

generate(data)[source]#

Duplication of self.decode.

In Autoencoder and EncoderMap this method is equivalent to decode(). In AngleDihedralCartesianEncoderMap this method will be overwritten to produce output molecular conformations.

Parameters:

data (np.ndarray) – The data to be passed to the decoder part of the model. Make sure that the shape of the data matches the number of neurons in the latent space.

Returns:

Outputs from the decoder part. For AngleDihedralCartesianEncoderMap, this will either be a mdtraj.Trajectory or MDAnalysis.Universe.

Return type:

np.ndarray

plot_network()[source]#

Tries to plot the network using pydot, pydotplus and graphviz. Doesn’t raise an exception if plotting is not possible.

Return type:

None

save(step=None)[source]#

Saves the model to the current path defined in parameters.main_path.

Parameters:

step (Optional[int]) – Does not save the model at the given training step, but rather changes the string used for saving the model from a datetime format to another.

Returns:

When the model has been saved, the Path will be returned. If the model could not be saved, None will be returned.

Return type:

Union[None, Path]

set_train_data(data)[source]#

Resets the train data for reloaded models.

Parameters:

data (ndarray | DatasetV2)

Return type:

None

train()[source]#

Starts the training of the model.

Returns:

If training succeeds, an instance of tf.keras.callbacks.History is returned. If not, None is returned.

Return type:

Union[tf.keras.callbacks.History, None]

class EncoderMap(parameters=None, train_data=None, model=None, read_only=False, sparse=False)[source]#

Bases: Autoencoder

A complete copy of the Autoencoder class that additionally uses a distance cost scaled by the SketchMap sigmoid parameters.

Parameters:
  • train_data (Optional[Union[np.ndarray, tf.data.Dataset]])

  • model (Optional[tf.keras.Model])

  • read_only (bool)

  • sparse (bool)

classmethod from_checkpoint(checkpoint_path, train_data=None, sparse=False, use_previous_model=False, compat=False)[source]#

Reconstructs the class from a checkpoint.

Parameters:
  • checkpoint_path (Union[str, Path]) – The path to the checkpoint. Can be either a directory, in which case the most recently saved model will be loaded. Or a direct .keras file, in which case, this specific model will be loaded.

  • train_data (Optional[np.ndarray]) – The train data can be provided here.

  • sparse (bool) – Whether the reloaded model should be sparse.

  • use_previous_model (bool) – Set this flag to True, if you load a model from an in-between checkpoint step (e.g., to continue training with different parameters). If you have the files saved_model_0.keras, saved_model_500.keras and saved_model_1000.keras, setting this to True and loading the saved_model_500.keras will back up the saved_model_1000.keras.

  • compat (bool) – Whether to use compatibility mode when parameter files are missing or wrong. In this special case, some assumptions about the network architecture are made from the model, and the parameters in parameters.json are overwritten accordingly (a backup will also be made).

Returns:

An instance of the EncoderMap class.

Return type:

EncoderMap

class EncoderMapBaseCallback(parameters=None)[source]#

Bases: Callback

Base class for callbacks in EncoderMap.

The Parameters class in EncoderMap has a summary_step variable that dictates when variables and other tensors are logged to TensorBoard. No matter what property is logged, there will always be a code section executing an if train_step % summary_step == 0 check. This is handled centrally in this class. This class is instantiated inside the user-facing AutoEncoderClass classes and is provided with the appropriate parameters (Parameters for EncoderMap and ADCParameters for AngleDihedralCartesianEncoderMap). Thus, subclasses of this class do not need to implement a new __init__ method. Sub-classes only need to implement the on_summary_step and on_checkpoint_step methods with the code that should run when these events occur.

Examples:

In this example, the on_summary_step method causes an exception.

>>> from typing import Optional
>>> import encodermap as em
...
>>> class MyCallback(em.callbacks.EncoderMapBaseCallback):
...     def on_summary_step(self, step: int, logs: Optional[dict] = None) -> None:
...         raise Exception(f"Summary step {self.steps_counter} has been reached.")
...
>>> emap = em.EncoderMap()  
Output...
>>> emap.add_callback(MyCallback)
>>> emap.train()  
Traceback (most recent call last):
    ...
Exception: Summary step 10 has been reached.
Parameters:

parameters (Optional['AnyParameters'])

steps_counter#

The current step counter. Increases every on_train_batch_end.

Type:

int

p#

The parameters for this callback. Based on the summary_step and checkpoint_step of the encodermap.parameters.Parameters class different class-methods are called.

Type:

Union[encodermap.parameters.Parameters, encodermap.parameters.ADCParameters]

on_checkpoint_step(step, logs=None)[source]#

Executed, when the currently finished batch matches encodermap.Parameters.checkpoint_step

Parameters:
  • step (int) – The number of the current step.

  • logs (Optional[dict]) – logs is a dict containing the metrics results.

Return type:

None

on_summary_step(step, logs=None)[source]#

Executed, when the currently finished batch matches encodermap.Parameters.summary_step

Parameters:
  • step (int) – The number of the current step.

  • logs (Optional[dict]) – logs is a dict containing the metrics results.

Return type:

None

on_train_batch_end(batch, logs=None)[source]#

Called after a batch ends. The number of batch is provided by keras.

This method is the backbone of all of EncoderMap’s callbacks. After every batch, this method is called by keras. When the number of that batch matches either encodermap.Parameters.summary_step or encodermap.Parameters.checkpoint_step, the code in self.on_summary_step or self.on_checkpoint_step is executed. These methods should be overwritten by child classes.

Parameters:
  • batch (int) – The number of the current batch. Provided by keras.

  • logs (Optional[dict]) – logs is a dict containing the metrics results.

Return type:

None
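
The step dispatch described for this callback can be sketched without keras. StepDispatcher below is a hypothetical stand-in, not the EncoderMap class: it assumes that a counter is incremented after every finished batch and that the hooks fire whenever that counter is a multiple of summary_step or checkpoint_step.

```python
class StepDispatcher:
    """Hypothetical stand-in that mimics the step dispatch, without keras."""

    def __init__(self, summary_step, checkpoint_step):
        self.summary_step = summary_step
        self.checkpoint_step = checkpoint_step
        self.steps_counter = 0
        self.events = []  # records which hooks fired, for illustration only

    def on_train_batch_end(self, batch, logs=None):
        # Assumption: the counter increases once per finished batch.
        self.steps_counter += 1
        if self.steps_counter % self.summary_step == 0:
            self.on_summary_step(self.steps_counter, logs=logs)
        if self.steps_counter % self.checkpoint_step == 0:
            self.on_checkpoint_step(self.steps_counter, logs=logs)

    def on_summary_step(self, step, logs=None):
        # A subclass would place its logging code here.
        self.events.append(("summary", step))

    def on_checkpoint_step(self, step, logs=None):
        # A subclass would place its checkpointing code here.
        self.events.append(("checkpoint", step))


dispatcher = StepDispatcher(summary_step=10, checkpoint_step=25)
for batch in range(50):
    dispatcher.on_train_batch_end(batch)
```

Keeping the modulo checks in one place means a subclass only overrides the two hooks and never repeats the step arithmetic.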

class Featurizer(traj)[source]#

Bases: object

EncoderMap’s featurization has drawn much inspiration from PyEMMA (markovmodel/PyEMMA).

EncoderMap’s Featurizer collects and computes collective variables (CVs). CVs are data that are aligned with MD trajectories on the frame/time axis. Trajectory data contains (besides the topology) an axis for atoms and an axis for cartesian coordinates (x, y, z), so that a trajectory can be understood as an array with shape (n_frames, n_atoms, 3). A CV is an array that is aligned with the frame/time axis and has its own feature axis. If the trajectory in our example has 3 residues (MET, ALA, GLY), we can define 6 dihedral angles along the backbone of this peptide. These angles are:

  • PSI1: Between MET1-N - MET1-CA - MET1-C - ALA2-N

  • OMEGA1: Between MET1-CA - MET1-C - ALA2-N - ALA2-CA

  • PHI1: Between MET1-C - ALA2-N - ALA2-CA - ALA2-C

  • PSI2: Between ALA2-N - ALA2-CA - ALA2-C - GLY3-N

  • OMEGA2: Between ALA2-CA - ALA2-C - GLY3-N - GLY3-CA

  • PHI2: Between ALA2-C - GLY3-N - GLY3-CA - GLY3-C

Thus, the collective variable ‘backbone-dihedrals’ provides an array of shape (n_frames, 6) and is aligned with the frame/time axis of the trajectory.
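
The enumeration above can be reproduced with a short sketch. The function name backbone_dihedrals is hypothetical and not part of the Featurizer API; the code only builds the atom quadruples by name, following the labeling used in the list above, and does not compute any angles.

```python
def backbone_dihedrals(residues):
    """Return (label, atom quadruple) pairs for the backbone of a peptide."""
    dihedrals = []
    for i in range(len(residues) - 1):
        a = f"{residues[i]}{i + 1}"      # current residue, e.g. MET1
        b = f"{residues[i + 1]}{i + 2}"  # next residue, e.g. ALA2
        dihedrals.append((f"PSI{i + 1}", (f"{a}-N", f"{a}-CA", f"{a}-C", f"{b}-N")))
        dihedrals.append((f"OMEGA{i + 1}", (f"{a}-CA", f"{a}-C", f"{b}-N", f"{b}-CA")))
        dihedrals.append((f"PHI{i + 1}", (f"{a}-C", f"{b}-N", f"{b}-CA", f"{b}-C")))
    return dihedrals


dihedrals = backbone_dihedrals(["MET", "ALA", "GLY"])
n_features = len(dihedrals)  # size of the feature axis of the CV
```

For the three-residue example, n_features is 6, which matches the (n_frames, 6) shape given above.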

Parameters:

traj (Union[SingleTraj, TrajEnsemble])

class InteractivePlotting(autoencoder=None, trajs=None, lowd_data=None, highd_data=None, align_string='name CA', top=None, ball_and_stick=False, histogram_type='free_energy', superpose=True, ref_align_string='name CA', base_traj=None)[source]#

Bases: object

EncoderMap’s interactive plotting for jupyter notebooks.

Instantiating this class will display an interactive display in your notebook. The display will look like this:

┌─────────────────────┐ ┌───────────┐
│Display              │ │Top        │
└─────────────────────┘ └───────────┘
┌─────────────┐ ┌───┐ ┌─────────────┐
│             │ │   │ │             │
│             │ │ T │ │             │
│  Main       │ │ R │ │  Molecular  │
│  Plotting   │ │ A │ │  Conform.   │
│  Area       │ │ C │ │  Area       │
│             │ │ E │ │             │
│             │ │   │ │             │
└─────────────┘ └───┘ └─────────────┘
┌───┐ ┌─────────────────────────────┐
│   │ │Progress Bar                 │
└───┘ └─────────────────────────────┘
┌─┐ ┌─┐ ┌─┐ ┌─┐ ┌───────────────────┐
│C│ │G│ │S│ │D│ │Slider             │
└─┘ └─┘ └─┘ └─┘ └───────────────────┘
┌────────────────┐  ┌───────────────┐
│                │  │               │
│ Data           │  │               │
│ Overview       │  │               │
│                │  │               │
│                │  │               │
└────────────────┘  └───────────────┘
The components do the following:
  • Display:

    This part will display debug information.

  • Top (Top selector):

    Select which topology to use when creating new molecular conformations from the autoencoder network.

  • Main plotting area:

    In this area, a scatter plot will be displayed. The coordinates of the scatter plot will be taken from the low-dimensional projection of the trajectories. The data for this plotting area can be taken from different sources. See the _lowd_parser docstring for information on how the lowd data is selected. Clicking on a point in the scatter plot displays the conformation of that point.

  • TRACE:

    Displays the high-dimensional data of selected points or clusters.

  • Molecular conformation area:

    Displays molecular conformations.

  • Progress Bar:

    Displays progress.

  • C (Cluster button):

    After selecting points in the main plotting area with the lasso tool, hit this button to display the molecular conformations of the selected cluster.

  • G (Generate Button):

    Switch to density using the density button. Then, you can draw a freeform path into the Main plotting area. Pressing the generate button will generate the appropriate molecular conformations. If your data has multiple conformations, you can choose which conformation to use for decoding with the top selector.

  • S (Save button):

    Writes either a cluster or generated path to your disk. Uses the main_path of the autoencoder (the same directory as the training data will be stored).

  • D (Density button):

    Switch the main plotting area to Density.

  • Slider:

    In scatter mode this slider defines how many structures to select from a cluster for representation in the molecular conformations window. In density mode, this slider defines how many points along the user-drawn path should be sampled.
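
How a slider value could translate into points along a user-drawn path is sketched below. This is an illustration only: sample_path is a hypothetical helper, and even spacing by arc length along a 2D polyline is an assumption, not a statement about how InteractivePlotting samples the path.

```python
import math


def sample_path(points, n):
    """Return n points spaced evenly by arc length along a 2D polyline."""
    if n < 2 or len(points) < 2:
        raise ValueError("Need at least two path points and two samples.")
    # Cumulative arc length at every vertex of the path.
    cumulative = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cumulative[-1]
    samples = []
    segment = 0
    for k in range(n):
        target = total * k / (n - 1)
        # Advance to the segment that contains the target arc length.
        while segment < len(points) - 2 and cumulative[segment + 1] < target:
            segment += 1
        length = cumulative[segment + 1] - cumulative[segment]
        t = 0.0 if length == 0 else (target - cumulative[segment]) / length
        (x0, y0), (x1, y1) = points[segment], points[segment + 1]
        samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return samples


path = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
samples = sample_path(path, 5)
```

Each sampled low-dimensional point would then be a candidate input for the decoder.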

Parameters:
  • autoencoder (Optional[AutoencoderClass])

  • trajs (Optional[Union[str, list[str], TrajEnsemble, SingleTraj]])

  • lowd_data (Optional[np.ndarray])

  • highd_data (Optional[np.ndarray])

  • align_string (str)

  • top (Optional[Union[str, list[str], Topology]])

  • ball_and_stick (bool)

  • histogram_type (Union[None, Literal['free_energy', 'density']])

  • superpose (bool)

  • ref_align_string (str)

  • base_traj (Optional[Trajectory])

_cluster_col: str = '_user_selected_points'#
_cluster_method: Literal['stack', 'join'] = 'join'#
_help_url: str = 'https://github.com/AG-Peter/encodermap'#
_max_filepath_len: int = 50#
_max_slider_len: int = 200#
_nbins: int = 50#
advance_path(n)[source]#
cluster(b)[source]#
property density: Any#
classmethod from_project(project_name)[source]#
Parameters:

project_name (Literal['linear_dimers'])

generate(b)[source]#
help(n)[source]#
on_canvas_mouse_down(x, y)[source]#
on_canvas_mouse_move(x, y)[source]#
on_canvas_mouse_up(x, y)[source]#
on_select(trace, points, selector)[source]#
save(b)[source]#
property scatter: Any#

The scatter plot using the low-dimensional data.

Type:

go.Scattergl

scatter_on_click(trace, points, selector)[source]#
stride: int = 10#
switch_between_density_and_scatter(b)[source]#
MolData#

alias of NewMolData

class Parameters(**kwargs)[source]#

Bases: ParametersFramework

Class to hold Parameters for the Autoencoder

Parameters can be set via keyword args while instantiating the class, set as instance attributes or read from disk. This class can write parameters to disk in .yaml or .json format.

Parameters:

kwargs (ParametersData)

defaults#

Classvariable dict that holds the defaults even when the current values might have changed.

Type:

dict
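
The relation between class-level defaults, keyword arguments, instance attributes, and files on disk can be sketched with a simplified stand-in. MiniParameters is hypothetical and covers only two of the parameters; it is not the Parameters class and only assumes JSON as the storage format.

```python
import json
import tempfile
from pathlib import Path


class MiniParameters:
    """Hypothetical, simplified stand-in for a parameter container."""

    # Class-level defaults stay untouched when an instance changes its values.
    defaults = {"n_steps": 1000, "learning_rate": 0.001}

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(self.defaults)
        if unknown:
            raise KeyError(f"Unknown parameters: {sorted(unknown)}")
        for key, value in {**self.defaults, **kwargs}.items():
            setattr(self, key, value)

    def save(self, path):
        """Write the current values to a JSON file and return its path."""
        current = {key: getattr(self, key) for key in self.defaults}
        Path(path).write_text(json.dumps(current))
        return Path(path)

    @classmethod
    def from_file(cls, path):
        """Rebuild an instance from a JSON file written by save()."""
        return cls(**json.loads(Path(path).read_text()))


with tempfile.TemporaryDirectory() as td:
    p = MiniParameters(n_steps=50)  # set via keyword argument
    p.learning_rate = 0.01          # set as instance attribute
    reloaded = MiniParameters.from_file(p.save(Path(td) / "parameters.json"))
```

After the round trip, reloaded carries the changed values while MiniParameters.defaults still holds the original ones.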

main_path#

Defines a main path where the parameters and other things might be stored.

Type:

str

n_neurons#

List containing the number of neurons for each layer up to the bottleneck layer. For example, [128, 128, 2] stands for an autoencoder with the architecture {i, 128, 128, 2, 128, 128, i}, where i is the number of dimensions of the input data. The layers of size i are input/output layers that are not trained.

Type:

list of int
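
The mirroring of n_neurons around the bottleneck can be written out as a short sketch. The helper name full_architecture is hypothetical; it only reproduces the layer sizes from the description and builds no model.

```python
def full_architecture(n_neurons, input_dim):
    """Expand n_neurons into the layer sizes of the whole autoencoder."""
    encoder = list(n_neurons)
    # The decoder mirrors the encoder, without repeating the bottleneck.
    decoder = encoder[-2::-1]
    return [input_dim] + encoder + decoder + [input_dim]


layers = full_architecture([128, 128, 2], input_dim=4)
```

For 4-dimensional input data this gives the sizes 4, 128, 128, 2, 128, 128, 4, with the bottleneck in the middle.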

activation_functions#

List of activation function names as implemented in TensorFlow. For example: “relu”, “tanh”, “sigmoid” or “” to use no activation function. The encoder part of the network takes the activation functions from the list starting with the second element. The decoder part of the network takes the activation functions in reversed order starting with the second element from the back. For example, [“”, “relu”, “tanh”, “”] would result in an autoencoder with {“relu”, “tanh”, “”, “tanh”, “relu”, “”} as the sequence of activation functions.

Type:

list of str
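
The ordering rule for the activation functions can be checked with a small sketch. The helper name expand_activations is hypothetical; it only applies the two slicing rules from the description to a plain list of strings.

```python
def expand_activations(activation_functions):
    """Return the per-layer activations for encoder followed by decoder."""
    # Encoder: every entry starting with the second element.
    encoder = activation_functions[1:]
    # Decoder: reversed order, starting with the second element from the back.
    decoder = activation_functions[-2::-1]
    return encoder + decoder


sequence = expand_activations(["", "relu", "tanh", ""])
```

The result has one entry per trained layer, so its length equals twice the length of n_neurons.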

periodicity#

Defines the distance between periodic walls for the inputs. For example 2pi for angular values in radians. All periodic data processed by EncoderMap must be wrapped to one periodic window. E.g. data with 2pi periodicity may contain values from -pi to pi or from 0 to 2pi. Set the periodicity to float(“inf”) for non-periodic inputs.

Type:

float
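
What wrapping to one periodic window means, and why the periodicity matters for distances, can be sketched in plain Python. Both helpers are hypothetical and not EncoderMap functions; the window [-periodicity / 2, periodicity / 2) is one of the two conventions mentioned above.

```python
import math


def wrap_to_window(value, periodicity=2 * math.pi):
    """Wrap a value into the window [-periodicity / 2, periodicity / 2)."""
    if math.isinf(periodicity):
        return value  # non-periodic input, nothing to wrap
    half = periodicity / 2
    return (value + half) % periodicity - half


def periodic_distance(a, b, periodicity=2 * math.pi):
    """Shortest distance between a and b across the periodic walls."""
    if math.isinf(periodicity):
        return abs(a - b)
    d = abs(a - b) % periodicity
    return min(d, periodicity - d)


wrapped = wrap_to_window(3 * math.pi / 2)          # lies outside [-pi, pi)
near = periodic_distance(-3.0, 3.0)                # close across the wall
far = periodic_distance(-3.0, 3.0, float("inf"))   # plain difference
```

With 2pi periodicity the angles -3.0 and 3.0 are about 0.28 apart, whereas the non-periodic distance is 6.0.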

learning_rate#

Learning rate used by the optimizer.

Type:

float

n_steps#

Number of training steps.

Type:

int

batch_size#

Number of training points used in each training step

Type:

int

summary_step#

A summary for TensorBoard is written every summary_step steps.

Type:

int

checkpoint_step#

A checkpoint is written every checkpoint_step steps.

Type:

int

dist_sig_parameters#

Parameters for the sigmoid functions applied to the high- and low-dimensional distances in the following order (sig_h, a_h, b_h, sig_l, a_l, b_l)

Type:

tuple of floats
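
The role of the six values can be illustrated with the sigmoid in the form commonly used by sketch-map. That this exact functional form is what EncoderMap applies is an assumption here, and sketchmap_sigmoid is a hypothetical name; the parameter order follows the description above and the numbers are the documented defaults.

```python
def sketchmap_sigmoid(r, sig, a, b):
    """Sigmoid of the sketch-map form; equals 0.5 at r == sig."""
    return 1.0 - (1.0 + (2.0 ** (a / b) - 1.0) * (r / sig) ** a) ** (-b / a)


# Order taken from the description: (sig_h, a_h, b_h, sig_l, a_l, b_l).
dist_sig_parameters = (4.5, 12, 6, 1, 2, 6)
sig_h, a_h, b_h, sig_l, a_l, b_l = dist_sig_parameters

high_d = sketchmap_sigmoid(4.5, sig_h, a_h, b_h)  # distance equal to sig_h
low_d = sketchmap_sigmoid(1.0, sig_l, a_l, b_l)   # distance equal to sig_l
zero = sketchmap_sigmoid(0.0, sig_h, a_h, b_h)    # identical points
```

In this form sig sets the distance at which the sigmoid crosses 0.5, while a and b control how steeply it rises before and after that point.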

distance_cost_scale#

Adjusts how much the distance based metric is weighted in the cost function.

Type:

int

auto_cost_scale#

Adjusts how much the autoencoding cost is weighted in the cost function.

Type:

int

auto_cost_variant#

Defines how the auto cost is calculated. Must be one of:

  • mean_square

  • mean_abs

  • mean_norm

Type:

str

center_cost_scale#

Adjusts how much the centering cost is weighted in the cost function.

Type:

float

l2_reg_constant#

Adjusts how much the L2 regularisation is weighted in the cost function.

Type:

float

gpu_memory_fraction#

Specifies the fraction of gpu memory blocked. If set to 0, memory is allocated as needed.

Type:

float

analysis_path#

A path that can be used to store analysis

Type:

str

id#

Can be any name for the run. Might be useful for example for specific analysis for different data sets.

Type:

str

model_api#

A string defining the API to be used to build the keras model. Defaults to sequential. Possible strings are:

  • functional will use keras’ functional API.

  • sequential will define a keras Model, containing two other models with the Sequential API. These two models are encoder and decoder.

  • custom will create a custom Model where even the layers are custom.

Type:

str

loss#

A string defining the loss function. Defaults to emap_cost. Possible losses are:

  • reconstruction_loss will try to train output == input.

  • mse returns a mean squared error loss.

  • emap_cost is the EncoderMap loss function. Depending on the class (Autoencoder, EncoderMap, ADCAutoencoder), different contributions are used for a combined loss. Autoencoder uses auto_cost, reg_cost, center_cost. The EncoderMap class adds sigmoid_loss.

Type:

str

batched#

Whether the dataset is batched or not.

Type:

bool

training#

A string defining what kind of training is performed when autoencoder.train() is called.

  • auto does a regular model.compile() and model.fit() procedure.

  • custom uses gradient tape and calculates losses and gradients manually.

Type:

str

tensorboard#

Whether to print tensorboard information. Defaults to False.

Type:

bool

seed#

Fixes the state of all operations using random numbers. Defaults to None.

Type:

Union[int, None]

current_training_step#

The current training step. Aids in reloading of models.

Type:

int

write_summary#

If True, writes a summary.txt of the models into main_path. If tensorboard is True, summaries will also be written.

Type:

bool

trainable_dense_to_sparse#

When using different topologies to train the AngleDihedralCartesianEncoderMap, some inputs might be sparse, which means, they have missing values. Creating a dense input is done by first passing these sparse tensors through tf.keras.layers.Dense layers. These layers have trainable weights, and if this parameter is True, these weights will be changed by the optimizer.

Type:

bool

using_hypercube#

This parameter is not meant to be set by the user. It allows us to print better error messages when re-loading and re-training a model. It contains a boolean indicating whether a model has been trained on the hypercube example data. If your data is 4-dimensional and you reload a model and forget to provide your data, the model will happily train with the hypercube (and not your) data. This variable implements a check.

Type:

bool

Examples

>>> import encodermap as em
>>> import tempfile
>>> from pathlib import Path
...
>>> with tempfile.TemporaryDirectory() as td:
...     td = Path(td)
...     p = em.Parameters()
...     print(p.auto_cost_variant)
...     savepath = p.save(td / "parameters.json")
...     print(savepath)
...     new_params = em.Parameters.from_file(td / "parameters.json")
...     print(new_params.main_path)  
mean_abs
/tmp...parameters.json
seems like the parameter file was moved to another directory. Parameter file is updated ...
/home...
_defaults = {'activation_functions': ['', 'tanh', 'tanh', ''], 'analysis_path': '', 'auto_cost_scale': 1, 'auto_cost_variant': 'mean_abs', 'batch_size': 256, 'batched': True, 'center_cost_scale': 0.0001, 'checkpoint_step': 5000, 'current_training_step': 0, 'dist_sig_parameters': (4.5, 12, 6, 1, 2, 6), 'distance_cost_scale': 500, 'gpu_memory_fraction': 0, 'id': '', 'l2_reg_constant': 0.001, 'learning_rate': 0.001, 'loss': 'emap_cost', 'model_api': 'sequential', 'n_neurons': [128, 128, 2], 'n_steps': 1000, 'periodicity': 6.283185307179586, 'seed': None, 'summary_step': 10, 'tensorboard': False, 'trainable_dense_to_sparse': False, 'training': 'auto', 'using_hypercube': False, 'write_summary': False}#
classmethod defaults_description()[source]#

str: A string that contains tabulated default parameter values.

Return type:

str

function(debug=False)[source]#

Encodermap’s implementation of tf.function.

Parameters:

debug (bool) – If True, the decorated function will not be compiled. Defaults to False.

Return type:

Any

load(trajs, tops=None, common_str=None, backend='no_load', index=None, traj_num=None, basename_fn=None, custom_top=None)[source]#

Load MD data.

Based on what’s provided for trajs, you either get a SingleTraj object that collects information about a single traj, or a TrajEnsemble object that contains information about multiple trajectories (even with different topologies).

Parameters:
  • trajs (Union[str, md.Trajectory, Sequence[str], Sequence[md.Trajectory], Sequence[SingleTraj]]) – Here, you can provide a single string pointing to a trajectory on your computer (/path/to/traj_file.xtc) or (/path/to/protein.pdb) or a list of such strings. In the former case, you will get a SingleTraj object, which is EncoderMap’s way of storing data (positions, CVs, times) of a single trajectory. In the latter case, you will get a TrajEnsemble object, which is EncoderMap’s way of working with multiple SingleTrajs.

  • tops (Optional[Union[str, md.Topology, Sequence[str], Sequence[md.Topology]]]) – For this argument, you can provide the topology(ies) of the corresponding traj(s). Trajectory file formats like .xtc and .dcd only store atomic positions and not weights, elements, or bonds. That’s what the tops argument is for. There are some trajectory file formats out there (MDTraj HDF5, AMBER netCDF4) that store both trajectory and topology in a single file. A .pdb file can also be used in this way. If you provide such files for trajs, you can leave tops as None. If you provide multiple files for trajs, you can still provide a single tops file, if the trajs in trajs share the same topology. If that is not the case, you can either provide a list of topologies, matched to the trajs in trajs, or use the common_str argument to match them. Defaults to None.

  • common_str (Optional[str, list[str]]) –

    If you provided a different number of trajs and tops, this argument is used to match them. Let’s say, you have 5 trajectories of a wild type protein and 5 trajectories of a mutant. If the path to these files is somewhat consistent (e.g:

    • /path/to/wt/traj1.xtc

    • /different/path/to/wt/traj_no_water.xtc

    • /data/path/to/mutant/traj0.xtc

    • /data/path/to/mutant/traj0.xtc

    ), you can provide [‘wt’, ‘mutant’] for the common_str argument and the files are grouped based on the occurrence of ‘wt’ and ‘mutant’ in their filepaths. Defaults to None.

  • backend (Literal["no_load", "mdtraj"]) – Normally, encodermap postpones the actual loading of the atomic positions until you really need them. This accelerates the handling of large trajectory ensembles. Choosing ‘mdtraj’ as the backend, all atomic positions are always loaded, taking up space on your system memory, but accessing positions in a non-sequential fashion is faster. Defaults to ‘no_load’.

  • index (Optional[Union[int, np.ndarray, list[int], slice]]) –

    Only used, if argument trajs is a single trajectory. This argument can be used to index the trajectory data. If you want to exclude the first 100 frames of your trajectory, because the protein relaxes from its crystal structure, you can load it like so:

    em.load(traj_file, top_file, index=slice(100, None))

    As encodermap lazily evaluates positional data, the slice(100, None) argument is stored until the data is accessed, at which point the first 100 frames are not accessible, just as if you had deleted them. Besides a slice, you can also provide an int (which returns a single frame at the requested index) or a list of int (which returns the frames at the locations indexed by the ints in the list). If None is provided, the trajectory data is not sliced/subsampled. Defaults to None.

  • traj_num (Optional[int]) –

    Only used, if argument trajs is a single trajectory. This argument is meant to organize the SingleTraj trajectories in a TrajEnsemble class. Of course you can build your own TrajEnsemble from a list of SingleTrajs and provide this list as the trajs argument to em.load(). In this case you need to set the traj_nums of the SingleTrajs yourself. Defaults to None.

  • basename_fn (Optional[Callable[[str], str]]) – A function to apply to the traj_file string to return the basename of the trajectory. If None is provided, the filename without extension will be used. When all files are named the same and the folder they’re in defines the name of the trajectory, you can supply lambda x: x.split('/')[-2] as this argument. Defaults to None.

  • custom_top (Optional['CustomAAsDict'])

Return type:

Union[SingleTraj, TrajEnsemble]
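
The grouping by common_str can be pictured with a plain-Python sketch. The helper group_by_common_str and the file paths are hypothetical; the sketch assumes that a file belongs to a group when the group's string occurs in its path, and it does not call em.load().

```python
def group_by_common_str(paths, common_str):
    """Group file paths by the one common string that occurs in each path."""
    groups = {key: [] for key in common_str}
    for path in paths:
        matches = [key for key in common_str if key in path]
        if len(matches) != 1:
            raise ValueError(f"{path} matches {len(matches)} common strings.")
        groups[matches[0]].append(path)
    return groups


# Illustrative paths only, loosely following the layout described above.
paths = [
    "/path/to/wt/traj1.xtc",
    "/different/path/to/wt/traj_no_water.xtc",
    "/data/path/to/mutant/traj0.xtc",
    "/data/path/to/mutant/traj1.xtc",
]
groups = group_by_common_str(paths, ["wt", "mutant"])
```

Each group could then be paired with one topology, which is why the number of trajs and tops may differ.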

Examples

>>> # load a pdb file with 14 frames from rcsb.org
>>> import encodermap as em
>>> traj = em.load("https://files.rcsb.org/view/1GHC.pdb")
>>> print(traj)
encodermap.SingleTraj object. Current backend is no_load. Basename is 1GHC. At indices (None,). Not containing any CVs.
>>> traj.n_frames
14
>>> # load multiple trajs
>>> trajs = em.load([
...     'https://files.rcsb.org/view/1YUG.pdb',
...     'https://files.rcsb.org/view/1YUF.pdb'
... ])
>>> # trajs are internally numbered
>>> print([traj.traj_num for traj in trajs])
[0, 1]
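
The index and basename_fn arguments follow ordinary Python idioms, which the sketch below shows on stand-in data. It assumes that index is applied with standard Python indexing semantics; the list frames and the path are illustrative and no trajectory is loaded.

```python
frames = list(range(500))  # stand-in for the frames of a trajectory

# A slice with a start and no stop drops the leading frames.
without_equilibration = frames[slice(100, None)]
# A slice with only a stop keeps the leading frames instead.
first_hundred = frames[slice(100)]
# An int selects one frame, a list of ints selects several.
single = frames[5]
picked = [frames[i] for i in [0, 10, 20]]


def basename_fn(path):
    """Use the name of the parent folder as the trajectory's basename."""
    return path.split("/")[-2]


basename = basename_fn("/data/run_01/traj.xtc")
```

Here basename is the folder name run_01, which is useful when every trajectory file in a project is simply called traj.xtc.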