encodermap.autoencoder package#

Submodules#

encodermap.autoencoder.autoencoder module#

Forward facing Autoencoder classes. Contains four classes:

  • Autoencoder: Simple dense, fully connected autoencoder architecture. Uses regularization loss, auto loss and center loss.

  • EncoderMap: Uses the same architecture as Autoencoder, but adds another loss function.

  • DihedralEncoderMap: Basically the same as EncoderMap, but rewrites the generate method to use an atomistic topology to rebuild a trajectory.

  • AngleDihedralCartesianEncoderMap: Uses more loss functions and tries to learn a full all-atom conformation.

class encodermap.autoencoder.autoencoder.AngleDihedralCartesianEncoderMap(trajs: encodermap.TrajEnsemble, parameters: Optional[encodermap.ADCParameters] = None, model: Optional[tensorflow.keras.Model] = None, read_only: bool = False, cartesian_loss_step: int = 0, top: Optional[mdtraj.Topology] = None)[source]#

Bases: Autoencoder

Uses a different __init__ method than the Autoencoder class. Uses callbacks to tune in the cartesian cost.

Overwritten methods: _setup_callbacks and generate.

Examples

>>> import encodermap as em
>>> # Load two trajectories
>>> xtcs = ["tests/data/1am7_corrected_part1.xtc", "tests/data/1am7_corrected_part2.xtc"]
>>> tops = ["tests/data/1am7_protein.pdb", "tests/data/1am7_protein.pdb"]
>>> trajs = em.load(xtcs, tops)
>>> print(trajs)
encodermap.TrajEnsemble object. Current backend is no_load. Containing 2 trajs. Not containing any CVs.
>>> # load CVs
>>> # This step can be omitted. The AngleDihedralCartesianEncoderMap class automatically loads CVs
>>> trajs.load_CVs('all')
>>> print(trajs.CVs['central_cartesians'].shape)
(51, 474, 3)
>>> print(trajs.CVs['central_dihedrals'].shape)
(51, 471)
>>> # create some parameters
>>> p = em.ADCParameters(periodicity=360, use_backbone_angles=True, use_sidechains=True,
...                      cartesian_cost_scale_soft_start=(6, 12))
>>> # Standard is functional model, as it offers more flexibility
>>> print(p.model_api)
functional
>>> print(p.distance_cost_scale)
None
>>> # Instantiate the class
>>> e_map = em.AngleDihedralCartesianEncoderMap(trajs, p, read_only=True)
>>> # dataset contains these inputs:
>>> # central_angles, central_dihedrals, central_cartesians, central_distances, sidechain_dihedrals
>>> print(e_map.dataset)
<BatchDataset element_spec=(TensorSpec(shape=(None, 472), dtype=tf.float32, name=None), TensorSpec(shape=(None, 471), dtype=tf.float32, name=None), TensorSpec(shape=(None, 474, 3), dtype=tf.float32, name=None), TensorSpec(shape=(None, 473), dtype=tf.float32, name=None), TensorSpec(shape=(None, 316), dtype=tf.float32, name=None))>
>>> # output from the model contains the following data:
>>> # out_angles, out_dihedrals, back_cartesians, pairwise_distances of inp cartesians, pairwise of back-mapped cartesians, out_side_dihedrals
>>> for data in e_map.dataset.take(1):
...     pass
>>> out = e_map.model(data)
>>> print([i.shape for i in out])
[TensorShape([256, 472]), TensorShape([256, 471]), TensorShape([256, 474, 3]), TensorShape([256, 112101]), TensorShape([256, 112101]), TensorShape([256, 316])]
>>> # get output of latent space by providing central_angles, central_dihedrals, sidechain_dihedrals
>>> latent = e_map.encoder([data[0], data[1], data[-1]])
>>> print(latent.shape)
(256, 2)
>>> # Rebuild central_angles, central_dihedrals and sidechain_dihedrals from latent
>>> ang, dih, side_dih = e_map.decode(latent)
>>> print(ang.shape, dih.shape, side_dih.shape)
(256, 472) (256, 471) (256, 316)
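
The two pairwise-distance outputs in the example above have 112101 columns. A plausible reading, assumed here rather than taken from the EncoderMap source, is that each row holds the unique pairwise distances between the 474 central cartesian atoms, of which there are n(n-1)/2. The helper name n_unique_pairs is made up for this sketch:

```python
from itertools import combinations


def n_unique_pairs(n_atoms: int) -> int:
    """Number of unordered atom pairs (i, j) with i < j."""
    return n_atoms * (n_atoms - 1) // 2


# 474 central cartesian atoms, as in the shapes printed above.
n_pairs = n_unique_pairs(474)
print(n_pairs)  # 112101

# Cross-check the closed form against explicit enumeration for a small case.
assert n_unique_pairs(5) == len(list(combinations(range(5), 2)))
```
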
__init__(trajs: encodermap.TrajEnsemble, parameters: Optional[encodermap.ADCParameters] = None, model: Optional[tensorflow.keras.Model] = None, read_only: bool = False, cartesian_loss_step: int = 0, top: Optional[mdtraj.Topology] = None) None[source]#

Instantiate the AngleDihedralCartesianEncoderMap class.

Parameters:
  • trajs (em.TrajEnsemble) – The trajectories to be used as input. If trajs contain no CVs, correct CVs will be loaded.

  • parameters (Optional[em.ADCParameters]) – The parameters for the current run. Can be set to None and the default parameters will be used. Defaults to None.

  • model (Optional[tf.keras.models.Model]) – The keras model to use. You can provide your own model with this argument. If set to None, the model will be built to the specifications of parameters using either the functional or sequential API. Defaults to None.

  • read_only (bool) – Whether to write anything to disk (False) or not (True). Defaults to False.

  • cartesian_loss_step (int, optional) – For loading and re-training the model. The cartesian_distance_loss is tuned in step-wise. For this the start step of the training needs to be accounted for. If the scale of the cartesian loss should increase from epoch 6 to epoch 12 and the model is saved at epoch 9, this argument should also be set to 9, to continue training with the correct scaling factor. Defaults to 0.
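
The resume logic described for cartesian_loss_step can be pictured with a small sketch. It assumes a linear ramp between the two epochs given in cartesian_cost_scale_soft_start; the function name soft_start_scale and the linear shape are illustrative assumptions, not EncoderMap's actual implementation:

```python
def soft_start_scale(epoch: int, start: int, end: int) -> float:
    """Fraction of the full cartesian cost applied at a given epoch.

    Assumed linear ramp: 0 up to `start`, 1 from `end` onwards.
    """
    if epoch <= start:
        return 0.0
    if epoch >= end:
        return 1.0
    return (epoch - start) / (end - start)


# Ramp from epoch 6 to epoch 12; the model was saved at epoch 9.
scale_at_restart = soft_start_scale(9, start=6, end=12)
print(scale_at_restart)  # 0.5

# Under this assumption, restarting from step 0 would begin at scale 0 again.
print(soft_start_scale(0, start=6, end=12))  # 0.0
```

Under this assumption, passing cartesian_loss_step=9 lets the re-trained model continue at half the full scale instead of starting the ramp over.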

_setup_callbacks() None[source]#

Overwrites the parent class’ _setup_callbacks method.

Due to the ‘soft start’ of the cartesian cost, the cartesian_increase_callback needs to be added to the list of callbacks.

encode(data=None)[source]#

Calls encoder part of model.

Parameters:

data (Union[np.ndarray, None], optional) – The data to be passed to the encoder part. Can be either numpy ndarray or None. If None is provided a set of 10000 points from the provided train data will be taken. Defaults to None.

Returns:

The output from the bottleneck/latent layer.

Return type:

np.ndarray

classmethod from_checkpoint(trajs, checkpoint_path, read_only=True, overwrite_tensorboard_bool=False)[source]#

Reconstructs the model from a checkpoint.

generate(points: np.ndarray, top: Optional[str, int, mdtraj.Topology] = None, backend: Literal['mdtraj', 'mdanalysis'] = 'mdtraj') Union[MDAnalysis.Universe, mdtraj.Trajectory][source]#

Overrides the parent class’ generate method and builds a trajectory.

Instead of just providing data to decode using the decoder part of the network, this method also takes a molecular topology as its top argument. This topology is then used to rebuild a time-resolved trajectory.

Parameters:
  • points (np.ndarray) – The low-dimensional points from which the trajectory should be rebuilt.

  • top (Optional[str, int, mdtraj.Topology]) – The topology to be used for rebuilding the trajectory. This should be a string pointing towards a <*.pdb, *.gro, *.h5> file. Alternatively, None can be provided, in which case, the internal topology (self.top) of this class is used. Defaults to None.

  • backend (str) – Defines which MD Python package is used to build the trajectory and which type this method returns. Needs to be one of the following:

    • “mdtraj”

    • “mdanalysis”

Returns:

The trajectory after applying the decoded structural information. The type of this depends on the chosen backend parameter.

Return type:

Union[mdtraj.Trajectory, MDAnalysis.Universe]

static get_train_data_from_trajs(trajs, p, attr='CVs')[source]#

property loss#

A list of loss functions passed to the model when it is compiled. When the main Autoencoder class is used and parameters.loss is ‘emap_cost’, this list comprises center_cost, regularization_cost and auto_cost. When the EncoderMap sub-class is used and parameters.loss is ‘emap_cost’, distance_cost is added to the list. When parameters.loss is not ‘emap_cost’, the loss can be either a string (‘mse’) or a function, both of which are acceptable loss arguments when a keras model is compiled.

Type:

(Union[list, string, function])

save(step: Optional[int] = None) None[source]#

Saves the model to the current path defined in parameters.main_path.

Parameters:

step (Optional[int]) – Does not actually save the model at the given training step, but rather changes the string used for saving the model from a datetime format to another format.

train() None[source]#

Overrides the parent class’ train method.

After the training is finished, an additional file is written to disk, which saves the current epoch. In the event that training will continue, the current state of the soft-start cartesian cost is read from that file.

class encodermap.autoencoder.autoencoder.Autoencoder(parameters=None, train_data: Optional[Union[np.ndarray, tf.Dataset]] = None, model=None, read_only=False, sparse=False)[source]#

Bases: object

Main Autoencoder class preparing data, setting up the neural network and implementing training.

This is the main class for neural networks inside EncoderMap. The class prepares the data (batching and shuffling) and creates a tf.keras.Model of layers specified by the attributes of the encodermap.Parameters class. Depending on which parent/child class is instantiated, a combination of cost functions is set up. Callbacks to Tensorboard are also set up.

train_data#

The numpy array of the train data passed at init.

Type:

np.ndarray

p#

An encodermap.Parameters() class containing all info needed to set up the network.

Type:

encodermap.Parameters

dataset#

The dataset that is actually used in training the keras model. The dataset is a batched, shuffled, infinitely-repeating dataset.

Type:

tensorflow.data.Dataset
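
The repeat, shuffle, batch order described for the dataset can be imitated in plain Python. The generator below is a stand-in for illustration, not the tf.data pipeline EncoderMap builds; in particular it reshuffles once per pass over the data, whereas tf.data shuffles through a buffer. The name repeat_shuffle_batch is hypothetical:

```python
import itertools
import random


def repeat_shuffle_batch(samples, batch_size, seed=0):
    """Yield fixed-size batches forever, reshuffling on every pass."""
    rng = random.Random(seed)

    def stream():
        # Infinite repetition: start a freshly shuffled pass whenever one ends.
        while True:
            order = list(samples)
            rng.shuffle(order)
            yield from order

    it = stream()
    while True:
        # Batches may span the boundary between two passes.
        yield list(itertools.islice(it, batch_size))


batches = repeat_shuffle_batch(range(10), batch_size=4)
first_five = list(itertools.islice(batches, 5))
print([len(b) for b in first_five])  # [4, 4, 4, 4, 4]
```

Because the stream never ends, the number of training steps has to be limited by the training loop rather than by the dataset.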

read_only#

Variable telling the class whether it is allowed to write to disk (False) or not (True).

Type:

bool

optimizer#

Instance of the Adam optimizer with learning rate specified by the Parameters class.

Type:

tf.keras.optimizers.Adam

metrics#

A list of metrics passed to the model when it is compiled.

Type:

list

callbacks#

A list of tf.keras.callbacks.Callback Sub-classes changing the behavior of the model during training. Some standard callbacks are always present like:

  • encodermap.callbacks.callbacks.ProgressBar:

    A progress bar callback using tqdm giving the current progress of training and the current loss.

  • CheckPointSaver:

    A callback that saves the model every parameters.checkpoint_step steps into the main directory. This callback will only be used, when read_only is False.

  • TensorboardWriteBool:

    A callback that contains a boolean Tensor that will be True or False, depending on the current training step and the summary_step in the parameters class. The loss functions use this callback to decide whether they should write to Tensorboard. This callback will only be present, when read_only is False and parameters.tensorboard is True.

You can append your own callbacks to this list before executing Autoencoder.train().

Type:

list
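
How a step-dependent write flag like the one in TensorboardWriteBool could behave is sketched below in plain Python. The modulo rule and the function name should_write_summary are assumptions made for illustration; the real callback stores the flag in a boolean tensor:

```python
def should_write_summary(step: int, summary_step: int) -> bool:
    """Assumed rule: write on every `summary_step`-th training step."""
    return step % summary_step == 0


# With summary_step=5, only steps 0, 5 and 10 of the first 12 would log.
writing_steps = [s for s in range(12) if should_write_summary(s, 5)]
print(writing_steps)  # [0, 5, 10]
```
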

encoder#

The encoder (sub)model of model.

Type:

tf.keras.models.Model

decoder#

The decoder (sub)model of model.

Type:

tf.keras.models.Model

from_checkpoint()[source]#

Rebuild the model from a checkpoint.

add_images_to_tensorboard()[source]#

Make tensorboard plot images.

train()[source]#

Starts the training of the tf.keras.models.Model.

plot_network()[source]#

Tries to plot the network. For this method to work graphviz, pydot and pydotplus need to be installed.

encode()[source]#

Takes high-dimensional data and sends it through the encoder.

decode()[source]#

Takes low-dimensional data and sends it through the decoder.

generate()[source]#

Same as decode. For AngleDihedralCartesianEncoderMap classes this will build a protein structure.

Note

Performance of tensorflow is not only dependent on your system’s hardware and how the data is presented to the network (for this check out https://www.tensorflow.org/guide/data_performance), but also on how you compiled tensorflow. Normal tensorflow (pip install tensorflow) is built without CPU extensions so that it works on many CPUs. However, tensorflow can greatly benefit from using CPU instructions like AVX2 and AVX512, which bring a speed-up in linear algebra computations of 300%. By building tensorflow from source you can activate these extensions. However, the CPU speed-up is dwarfed by the speed-up when you allow tensorflow to run on your GPU (graphics card). To check whether a GPU is available run: print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU'))). Refer to these pages to install tensorflow for best performance: https://www.tensorflow.org/install/pip, https://www.tensorflow.org/install/gpu
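
The GPU check from the note can be wrapped in a small helper that also works on machines without tensorflow. The function name count_gpus is made up for this sketch; tf.config.list_physical_devices is the tensorflow call named in the note:

```python
def count_gpus() -> int:
    """Return the number of GPUs visible to tensorflow, or 0 without tensorflow."""
    try:
        import tensorflow as tf
    except ImportError:
        # tensorflow is not installed, so no GPU can be used by it.
        return 0
    return len(tf.config.list_physical_devices("GPU"))


num_gpus = count_gpus()
print("Num GPUs Available:", num_gpus)
```
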

Examples

>>> import encodermap as em
>>> # without providing any data, default parameters and a 4D hypercube as input data will be used.
>>> e_map = em.EncoderMap(read_only=True)
>>> print(e_map.train_data.shape)
(16000, 4)
>>> print(e_map.dataset)
<BatchDataset element_spec=(TensorSpec(shape=(None, 4), dtype=tf.float32, name=None), TensorSpec(shape=(None, 4), dtype=tf.float32, name=None))>
>>> print(e_map.encode(e_map.train_data).shape)
(16000, 2)
__init__(parameters=None, train_data: Optional[Union[np.ndarray, tf.Dataset]] = None, model=None, read_only=False, sparse=False)[source]#

Instantiate the Autoencoder class.

Parameters:
  • parameters (Union[encodermap.Parameters, None], optional) – The parameters to be used. If None is provided default values (check them with print(em.Parameters.defaults_description())) are used. Defaults to None.

  • train_data (Union[np.ndarray, tf.data.Dataset, None], optional) –

    The train data. Can be one of the following:

    • None: If None is provided, points on the edges of a 4-dimensional hypercube will be used as train data.

    • np.ndarray: If a numpy array is provided, it will be transformed into a batched tf.data.Dataset by first making it an infinitely repeating dataset, shuffling it and then batching it with a batch size specified by parameters.batch_size.

    • tf.data.Dataset: If a dataset is provided, it will be used without making any adjustments. Make sure that the dataset uses float32 as its type.

    Defaults to None.

  • model (Union[tf.keras.models.Model, None], optional) – Providing a keras model to this argument will make the Autoencoder/EncoderMap class use this model instead of the predefined ones. Make sure the model can accept EncoderMap’s loss functions. If None is provided the model will be built using the specifications in parameters. Defaults to None.

  • read_only (bool, optional) – Whether the class is allowed to write to disk (False) or not (True). Defaults to False and will allow the class to write to disk.

Raises:

BadError – When read_only is True and parameters.tensorboard is True, this Exception will be raised, because they are mutually exclusive.
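
The mutual exclusivity of read_only and parameters.tensorboard amounts to a simple guard. The sketch below uses a stand-in BadError class and a hypothetical check_write_settings helper so that it runs without EncoderMap installed; it is not the library's own code:

```python
class BadError(Exception):
    """Stand-in for EncoderMap's exception of the same name."""


def check_write_settings(read_only: bool, tensorboard: bool) -> None:
    """Raise if tensorboard logging is requested in read-only mode."""
    if read_only and tensorboard:
        raise BadError(
            "tensorboard=True needs to write to disk, which read_only=True forbids."
        )


check_write_settings(read_only=True, tensorboard=False)  # passes
check_write_settings(read_only=False, tensorboard=True)  # passes

try:
    check_write_settings(read_only=True, tensorboard=True)
    conflict_detected = False
except BadError:
    conflict_detected = True
print(conflict_detected)  # True
```
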

_setup_callbacks()[source]#

Sets up a list with callbacks to be passed to self.model.fit()

add_images_to_tensorboard(data=None, image_step=None, scatter_kws={'s': 20}, hist_kws={'bins': 50}, additional_fns=None, when='epoch')[source]#

Adds images to Tensorboard using the data in data and the ids in ids.

Parameters:
  • data (Union[np.ndarray, list, None], optional) – The input-data will be passed through the encoder part of the autoencoder. If None is provided a set of 10000 points from the provided train data will be taken. A list is needed for the functional API of the ADCAutoencoder, that takes a list of [angles, dihedrals, side_dihedrals]. Defaults to None.

  • image_step (Union[int, None], optional) – The interval in which to plot images to tensorboard. If None is provided, the update step will be the same as parameters.summary_step. Defaults to None.

  • scatter_kws (dict, optional) – A dict with items that matplotlib.pyplot.scatter() will accept. Defaults to {‘s’: 20}, which sets an appropriate size of scatter points for the size of datasets encodermap is usually used for.

  • hist_kws (dict, optional) – A dict with items that matplotlib.pyplot.scatter() will accept. You can choose a colorbar here. Defaults to {‘bins’: 50} which sets an appropriate bin count for the size of datasets encodermap is usually used for.

  • additional_fns (Union[list, None], optional) – A list of functions that will accept the low-dimensional output of the autoencoder’s latent/bottleneck layer and return a tf.Tensor that can be logged by tf.summary.image(). See the notebook ‘writing_custom_images_to_tensorboard.ipynb’ in tutorials/notebooks_customization for more info. If None is provided no additional functions will be used to plot to tensorboard. Defaults to None.

  • when (str, optional) – When to log the images can be either ‘batch’, then the images will be logged after every step during training, or ‘epoch’, then only after every image_step epoch the images will be written. Defaults to ‘epoch’.

close()[source]#

Clears the current keras backend and frees up resources.

decode(data)[source]#

Calls the decoder part of the model.

Unlike the other two classes, AngleDihedralCartesianEncoderMap will output a tuple of data.

Parameters:

data (np.ndarray) – The data to be passed to the decoder part of the model. Make sure that the shape of the data matches the number of neurons in the latent space.

Returns:

The output from the decoder part.

Return type:

np.ndarray

property decoder#

Decoder part of the model.

Type:

tf.keras.models.Model

encode(data=None)[source]#

Calls encoder part of model.

Parameters:

data (Union[np.ndarray, None], optional) – The data to be passed to the encoder part. Can be either numpy ndarray or None. If None is provided a set of 10000 points from the provided train data will be taken. Defaults to None.

Returns:

The output from the bottleneck/latent layer.

Return type:

np.ndarray

property encoder#

Encoder part of the model.

Type:

tf.keras.models.Model

classmethod from_checkpoint(checkpoint_path, read_only=True, overwrite_tensorboard_bool=False, sparse=False)[source]#

Reconstructs the class from a checkpoint.

Parameters:
  • checkpoint_path (str) – The path to the checkpoint. Most models are saved in parts (encoder, decoder) and thus the provided path often needs a wildcard (*). The save() method of this class prints a string with which the model can be reloaded.

  • read_only (bool, optional) – Whether to reload the model in read_only mode (True) or allow the Autoencoder class to write to disk (False). This option might collide with the tensorboard Parameter in the respective parameters.json file in the main_path. Defaults to True.

  • overwrite_tensorboard_bool (bool, optional) – Whether to overwrite the tensorboard Parameter while reloading the class. This can be set to True to set the tensorboard parameter False and allow read_only. Defaults to False.

Raises:

BadError – When read_only is True, overwrite_tensorboard_bool is False and the reloaded parameters have tensorboard set to True.

Returns:

Encodermap Autoencoder class.

Return type:

Autoencoder
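
Why a wildcard is often needed for the checkpoint path can be seen with the standard glob module. The file names below are invented for the sketch and do not reflect EncoderMap's actual naming scheme, and glob-style expansion of the wildcard is itself an assumption; for real checkpoints use the string printed by save():

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as main_path:
    # Hypothetical checkpoint saved in two parts.
    for part in ("encoder", "decoder"):
        open(os.path.join(main_path, f"saved_model_100_{part}.model"), "w").close()

    # One wildcard pattern addresses both parts at once.
    pattern = os.path.join(main_path, "saved_model_100_*")
    matches = sorted(os.path.basename(f) for f in glob.glob(pattern))

print(matches)  # ['saved_model_100_decoder.model', 'saved_model_100_encoder.model']
```
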

generate(data)[source]#

Duplication of decode.

In Autoencoder and EncoderMap this method is equivalent to decode(). In AngleDihedralCartesianEncoderMap this method will be overwritten to produce output molecular conformations.

Parameters:

data (np.ndarray) – The data to be passed to the decoder part of the model. Make sure that the shape of the data matches the number of neurons in the latent space.

Returns:

The output from the decoder part.

Return type:

np.ndarray

property loss#

A list of loss functions passed to the model when it is compiled. When the main Autoencoder class is used and parameters.loss is ‘emap_cost’, this list comprises center_cost, regularization_cost and auto_cost. When the EncoderMap sub-class is used and parameters.loss is ‘emap_cost’, distance_cost is added to the list. When parameters.loss is not ‘emap_cost’, the loss can be either a string (‘mse’) or a function, both of which are acceptable loss arguments when a keras model is compiled.

Type:

(Union[list, string, function])
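
The selection rules for the loss property can be summarized in a few lines. This is a sketch that uses strings in place of the real cost functions; the helper name select_loss is made up for illustration:

```python
from typing import Callable, List, Union


def select_loss(
    class_name: str, loss: Union[str, Callable]
) -> Union[List[str], str, Callable]:
    """Mirror the documented rules with names standing in for cost functions."""
    if loss != "emap_cost":
        # A keras-compatible string such as 'mse', or a callable, is passed through.
        return loss
    losses = ["center_cost", "regularization_cost", "auto_cost"]
    if class_name == "EncoderMap":
        losses.append("distance_cost")
    return losses


print(select_loss("Autoencoder", "emap_cost"))
# ['center_cost', 'regularization_cost', 'auto_cost']
print(select_loss("EncoderMap", "emap_cost")[-1])  # distance_cost
print(select_loss("Autoencoder", "mse"))  # mse
```
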

property model#

The tf.keras.Model model used for training.

Type:

tf.keras.models.Model

plot_network()[source]#

Tries to plot the network using pydot, pydotplus and graphviz. Doesn’t raise an exception if plotting is not possible.

save(step=None)[source]#

Saves the model to the current path defined in parameters.main_path.

Parameters:

step (Union[int, None], optional) – Does not actually save the model at the given training step, but rather changes the string used for saving the model from a datetime format to another format.

train()[source]#

Starts the training of the model.

class encodermap.autoencoder.autoencoder.DihedralEncoderMap(parameters=None, train_data: Optional[Union[np.ndarray, tf.Dataset]] = None, model=None, read_only=False, sparse=False)[source]#

Bases: EncoderMap

Similar to the EncoderMap class, but overwrites the generate method.

When this class is used instead of the EncoderMap class, the generate method needs an additional argument, top, which should be a topology file. This topology will be used as a base on which the dihedrals of the decode method are applied.

generate(data: np.ndarray, top: str) MDAnalysis.Universe[source]#

Overwrites EncoderMap’s generate method and actually does backmapping if a list of dihedrals is provided.

Parameters:
  • data (np.ndarray) – The low-dimensional/latent/bottleneck data. A ndim==2 numpy array with xy coordinates of points in latent space.

  • top (str) – Topology file for this run of EncoderMap (can be .pdb, .gro, etc.).

Returns:

The topology with the provided backbone torsions.

Return type:

MDAnalysis.Universe

Examples

>>> # get some time-resolved pdb files
>>> import requests
>>> import numpy as np
>>> pdb_link = 'https://files.rcsb.org/view/1YUF.pdb'
>>> contents = requests.get(pdb_link).text
>>> print(contents.splitlines()[0]) 
HEADER    GROWTH FACTOR                           01-APR-96   1YUF
>>> # fake a file with stringio
>>> from io import StringIO
>>> import MDAnalysis as mda
>>> file = StringIO(contents)
>>> # pass it to MDAnalysis
>>> u = mda.Universe(file, format='PDB')
>>> print(u)
<Universe with 720 atoms>
>>> # select the atomgroups
>>> ags = [*[res.psi_selection() for res in u.residues],
...        *[res.omega_selection() for res in u.residues],
...        *[res.phi_selection() for res in u.residues]
...        ]
>>> # filter Nones
>>> ags = [ag for ag in ags if ag is not None]
>>> print(ags[0][0]) 
<Atom 3: C of type C of resname VAL, resid 1 and segid A and altLoc >
>>> # Run dihedral Angles
>>> from MDAnalysis.analysis.dihedrals import Dihedral
>>> R = np.deg2rad(Dihedral(ags).run().results.angles)
>>> print(R.shape)
(16, 147)
>>> # import EncoderMap and define parameters
>>> from encodermap.autoencoder import DihedralEncoderMap
>>> import encodermap as em
>>> parameters = em.Parameters(
... dist_sig_parameters = (4.5, 12, 6, 1, 2, 6),
... periodicity = 2*np.pi,
... l2_reg_constant = 10.0,
... summary_step = 5,
... tensorboard = False,
... )
>>> e_map = DihedralEncoderMap(parameters, R, read_only=True)
>>> print(e_map.__class__.__name__)
DihedralEncoderMap
>>> # get some low-dimensional data
>>> lowd = np.random.random((100, 2))
>>> # use the generate method to get a new MDAnalysis universe
>>> # but first remove the time resolution
>>> file = StringIO(contents.split('MODEL        2')[0])
>>> new = e_map.generate(lowd, file)
>>> print(new.trajectory.coordinate_array.shape)
(100, 720, 3)
>>> # check whether frame 0 of u and new are different
>>> for ts in u.trajectory:
...     a1 = ts.positions
...     break
>>> print(np.array_equal(a1, new.trajectory.coordinate_array[0]))
False
class encodermap.autoencoder.autoencoder.EncoderMap(parameters=None, train_data: Optional[Union[np.ndarray, tf.Dataset]] = None, model=None, read_only=False, sparse=False)[source]#

Bases: Autoencoder

Complete copy of the Autoencoder class, but uses an additional distance cost scaled by the SketchMap sigmoid parameters.
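
For orientation, the sigmoid used by Sketch-map has the form 1 - (1 + (2^(a/b) - 1)(r/sigma)^a)^(-b/a). The sketch below assumes that dist_sig_parameters is ordered as (sigma, a, b) for the high-dimensional space followed by (sigma, a, b) for the low-dimensional space; this ordering and the function name sketchmap_sigmoid are assumptions made for illustration and should be checked against the EncoderMap source:

```python
def sketchmap_sigmoid(r: float, sigma: float, a: float, b: float) -> float:
    """Map a distance r >= 0 to [0, 1); equals 0.5 at r == sigma."""
    return 1.0 - (1.0 + (2.0 ** (a / b) - 1.0) * (r / sigma) ** a) ** (-b / a)


# Values taken from the DihedralEncoderMap example: (4.5, 12, 6, 1, 2, 6).
sig_h, a_h, b_h, sig_l, a_l, b_l = (4.5, 12, 6, 1, 2, 6)

print(round(sketchmap_sigmoid(4.5, sig_h, a_h, b_h), 6))  # 0.5
print(round(sketchmap_sigmoid(1.0, sig_l, a_l, b_l), 6))  # 0.5
print(sketchmap_sigmoid(0.0, sig_h, a_h, b_h))  # 0.0
```

The sigma value marks the distance at which the sigmoid crosses 0.5, so distances far below it are mapped close to 0 and distances far above it close to 1.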

classmethod from_checkpoint(checkpoint_path, read_only=True, overwrite_tensorboard_bool=False, sparse=False)[source]#

Reconstructs the model from a checkpoint.

property loss#

A list of loss functions passed to the model when it is compiled. When the main Autoencoder class is used and parameters.loss is ‘emap_cost’, this list comprises center_cost, regularization_cost and auto_cost. When the EncoderMap sub-class is used and parameters.loss is ‘emap_cost’, distance_cost is added to the list. When parameters.loss is not ‘emap_cost’, the loss can be either a string (‘mse’) or a function, both of which are acceptable loss arguments when a keras model is compiled.

Type:

(Union[list, string, function])

Module contents#

Front-facing autoencoder classes.