Asp7 Example - Advanced Usage#

Run this notebook on Google Colab:

Find the documentation of EncoderMap:

https://ag-peter.github.io/encodermap

For Google colab only:#

If you’re on Google colab, please uncomment these lines and install EncoderMap.

[1]:

# !wget https://raw.githubusercontent.com/AG-Peter/encodermap/main/tutorials/install_encodermap_google_colab.sh
# !sudo bash install_encodermap_google_colab.sh

If you’re on Google colab, you also want to download the Data we will use in this notebook.

[2]:

# !wget https://raw.githubusercontent.com/AG-Peter/encodermap/main/tutorials/notebooks_starter/asp7.csv
# !wget https://raw.githubusercontent.com/AG-Peter/encodermap/main/tutorials/notebooks_starter/asp7.pdb
# !wget https://raw.githubusercontent.com/AG-Peter/encodermap/main/tutorials/notebooks_starter/asp7.xtc

Primer#

In this tutorial we will use example data from a molecular dynamics simulation and learn more about advanced usage of EncoderMap. Encoder map can create low-dimensional maps of the vast conformational spaces of molecules. This allows easy identification of the most common molecular conformations and helps to understand the relations between these conformations. In this example, we will use data from a simulation of a simple peptide: hepta-aspartic-acid.

First we need to import some libraries:

[3]:

import encodermap as em
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from math import pi
%config Completer.use_jedi=False
%load_ext autoreload
%autoreload 2

2023-02-07 11:06:57.757562: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-07 11:06:57.913863: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.16/x64/lib
2023-02-07 11:06:57.913889: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-02-07 11:06:58.724477: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.16/x64/lib
2023-02-07 11:06:58.724568: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.16/x64/lib
2023-02-07 11:06:58.724578: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

Fix the random state of tensorflow for reproducibility.

[4]:

import tensorflow as tf
tf.random.set_seed(3)

Next, we need to load the input data. Different kinds of variables can be used to describe molecular conformations: e.g. Cartesian coordinates, distances, angles, dihedrals… In principle EncoderMap can deal with any of these inputs, however, some are better suited than others. The molecular conformation does not change when the molecule is translated or rotated. The chosen input variables should reflect that and be translationally and rotationally invariant.

In this example we use the backbone dihedral angles phi and psi as input as they are translationally and rotationally invariant and describe the backbone of a protein/peptide very well.

The “asp7.csv” file contains one column for each dihedral and one row for each frame of the trajectory. Additionally, the last column contains a cluster_id from a gromos clustering which we can later use for comparison. We can load this data using numpy.loadtxt:

[5]:

csv_path = "asp7.csv"
data = np.loadtxt(csv_path, skiprows=1, delimiter=",")
dihedrals = data[:, :-1]
cluster_ids = data[:, -1]

[6]:

import nglview as nv
import mdtraj as md
traj = md.load('asp7.xtc', top='asp7.pdb')
view = nv.show_mdtraj(traj)
view.add_representation('hyperball')
view

Similarly to the previous example, we need to set some parameters. In contrast to the Cube example we now have periodic input data. The dihedral angles are in radians with a 2pi periodicity. We also set some further parameters but don’t bother for now.

[7]:

parameters = em.Parameters()
parameters.main_path = em.misc.run_path("runs/asp7")
parameters.n_steps = 100
parameters.dist_sig_parameters = (4.5, 12, 6, 1, 2, 6)
parameters.periodicity = 2*pi
parameters.l2_reg_constant = 10.0
parameters.summary_step = 1
parameters.tensorboard

%matplotlib inline
em.plot.distance_histogram(dihedrals[::10],
                           parameters.periodicity,
                           parameters.dist_sig_parameters,
                           bins=50)

[7]:

(<AxesSubplot: title={'center': 'high-d'}, xlabel='distance'>,
 <AxesSubplot: >,
 <AxesSubplot: title={'center': 'low-d'}, xlabel='distance'>)

../_images/user_guide_and_examples_02_advanced_13_1.png

Next we can run the dimensionality reduction:

[8]:

e_map = em.EncoderMap(parameters, dihedrals)

Output files are saved to runs/asp7/run0 as defined in 'main_path' in the parameters.

2023-02-07 11:07:03.625789: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.16/x64/lib
2023-02-07 11:07:03.625818: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)
2023-02-07 11:07:03.625840: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (fv-az267-630): /proc/driver/nvidia/version does not exist
2023-02-07 11:07:03.626100: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

The new tensorflow 2 version of EncoderMap allows you to also view the output of the latent space during the training. Switch that feature on with e_map.add_images_to_tensorboard().

[9]:

e_map.add_images_to_tensorboard()

Nothing is written to Tensorboard for this Model. Please change parameters.tensorboard to True.

[10]:

e_map.train()

100%|██████████| 100/100 [00:04<00:00, 21.68it/s, Loss after step 100=2.02e+3]
WARNING:absl:Function `_wrapped_model` contains input name(s) Encoder_0_input with unsupported characters which will be renamed to encoder_0_input in the SavedModel.
WARNING:absl:Function `_wrapped_model` contains input name(s) Decoder_0_input with unsupported characters which will be renamed to decoder_0_input in the SavedModel.

project all dihedrals to the low-dimensional space…

[11]:

low_d_projection = e_map.encode(dihedrals)

and plot the result:

[12]:

%matplotlib inline

# rgba color from cluster_id. All 1.0s are grey RGBA=(.5, .5, .5, .1)
# the rest are colored with maptlotlib C0, C1, C2, ...
def colors_from_cluster_ids(cluster_ids, max_clusters=10):
    colors = np.full(shape=(len(cluster_ids), 4), fill_value=(.5, .5, .5, .1))
    for i in range(2, max_clusters + 2):
        where = np.where(cluster_ids == i)
        color = (*mpl.colors.to_rgb(f'C{i - 2}'), 0.3)
        colors[where] = color
    return colors

# define max clusters
max_clusters = 5

# create figure
fig, ax = plt.subplots()
scatter = ax.scatter(*low_d_projection.T, s=20, c=colors_from_cluster_ids(cluster_ids, max_clusters))

# fake a legend, because using scatter with RGBA values will not produce a legend
recs = []
for i in range(max_clusters):
    recs.append(mpl.patches.Rectangle((0, 0), 1, 1, fc=f"C{i}"))
ax.legend(recs, range(max_clusters), loc=4)

plt.show()

../_images/user_guide_and_examples_02_advanced_22_0.png

In the above map points from different clusters (different colors) should be well separated. However, if you didn’t change the parameters, they are probably not. Some of our parameter settings appear to be unsuitable. Let’s see how we can find out what goes wrong.

Visualize Learning with TensorBoard#

Running tensorboard on Google colab#

To use tensorboard in google colabs notebooks, you neet to first load the tensorboard extension

%load_ext tensorboard

And then activate it with:

%tensorboard --logdir .

The next code cell contains these commands. Uncomment them and then continue.

Running tensorboard locally#

TensorBoard is a visualization tool from the machine learning library TensorFlow which is used by the EncoderMap package. During the dimensionality reduction step, when the neural network autoencoder is trained, several readings are saved in a TensorBoard format. All output files are saved to the path defined in parameters.main_path. Navigate to this location in a shell and start TensorBoard. Change the paramter Tensorboard to True to make Encodermap log to Tensorboard.

In case you run this tutorial in the provided Docker container you can open a new console inside the container by typing the following command in a new system shell.

docker exec -it emap bash

Navigate to the location where all the runs are saved. e.g.:

cd notebooks_easy/runs/asp7/

Start TensorBoard in this directory with:

tensorboard --logdir .

You should now be able to open TensorBoard in your webbrowser on port 6006.

0.0.0.0:6006 or 127.0.0.1:6006

In the SCALARS tab of TensorBoard you should see among other values the overall cost and different contributions to the cost. The two most important contributions are auto_cost and distance_cost. auto_cost indicates differences between the inputs and outputs of the autoencoder. distance_cost is the part of the cost function which compares pairwise distances in the input space and the low-dimensional (latent) space.

Fixing Reloading issues Using Tensorboard we often encountered some issues while training multiple models and writing mutliple runs to Tensorboard’s logdir. Reloading the data and event refreshing the web page did not display the data of the current run. We needed to kill tensorboard and restart it in order to see the new data. This issue was fixed by setting reload_multifile True.

tensorboard --logdir . --reload_multifile True

In your case, probably the overall cost as well as the auto_cost and the distance_cost are still decreasing after all training iterations. This tells us that we can simply improve the result by increasing the number of training steps. The following cell contains the same code as above. Set a larger number of straining steps to improve the result (e.g. 3000).

For Goole colab, you can load the Tensorboard extension with:

[13]:

# %load_ext tensorboard
# %tensorboard --logdir .

[14]:

parameters = em.Parameters(
    main_path=em.misc.run_path("runs/asp7"),
    n_steps=100,
    dist_sig_parameters=(4.5, 12, 6, 1, 2, 6),
    periodicity=2*pi,
    l2_reg_constant=10,
    summary_step=1,
    tensorboard=True
)

e_map = em.EncoderMap(parameters, dihedrals)

# Logging images to Tensorboard can greatly reduce performance.
# So they need to be specifically turned on
e_map.add_images_to_tensorboard(dihedrals, 2,
                                scatter_kws={'s': 50, 'c': colors_from_cluster_ids(cluster_ids, 5)}
                               )

e_map.train()

Output files are saved to runs/asp7/run1 as defined in 'main_path' in the parameters.
You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.
Saved a text-summary of the model and an image in runs/asp7/run1, as specified in 'main_path' in the parameters.
Logging images with (10001, 12)-shaped data every 2 epochs to Tensorboard at runs/asp7/run1

100%|██████████| 100/100 [00:51<00:00,  1.95it/s, Loss after step 100=2.17e+3]
WARNING:absl:Function `_wrapped_model` contains input name(s) Encoder_0_input with unsupported characters which will be renamed to encoder_0_input in the SavedModel.
WARNING:absl:Function `_wrapped_model` contains input name(s) Decoder_0_input with unsupported characters which will be renamed to decoder_0_input in the SavedModel.

The molecule conformations form different clusters (different colors) should be separated a bit better now. In TensorBoard you should see the cost curves for this new run. When the cost curve becomes more or less flat towards the end, longer training does not make sense.

The resulting low-dimensional projection is probably still not very detailed and clusters are probably not well separated. Currently we use a regularization constant parameters.l2_reg_constant = 10.0. The regularization constant influences the complexity of the network and the map. A high regularization constant will result in a smooth map with little details. A small regularization constant will result in a rougher more detailed map.

Go back to the previous cell and decrease the regularization constant (e.g. parameters.l2_reg_constant = 0.001). Play with different settings to improve the separation of the clusters in the map. Have a look at TensorBoard to see how the cost changes for different parameters.

[15]:

%matplotlib inline
plt.close('all')
plt.scatter(*e_map.encode(dihedrals).T)

[15]:

<matplotlib.collections.PathCollection at 0x7f8024e06580>

../_images/user_guide_and_examples_02_advanced_29_1.png

Here is what you can see in Tensorboard:

9004876c8c304571903b843a31c72fd0

5431f91e3ef647edb58bb4ce68425c81

b5d8f0ad21144a7da8a5bc83dc62578c

0ad744020c1f407695c543bcbcef81c2

Save and Load#

Once you are satisfied with your EncoderMap, you might want to save the result. The good news is: Encoder map automatically saves checkpoints during the training process in parameters.main_path. The frequency of writing checkpoints can be defined with patameters.checkpoint_step. Also, your selected parameters are saved in a file called parameters.json. Navigate to the driectory of your last run and open this parameters.json file in some text editor. You should find all the parameters that we have set so far. You also find some parameters which were not set by us specifically and where EncoderMap used its default values.

Let’s start by looking at the parameters from the last run and printing them in a nicely formatted table with the .parameters attribute.

[16]:

loaded_parameters = em.Parameters.from_file('runs/asp7/run0/parameters.json')
print(loaded_parameters.parameters)

    Parameter            | Value                    | Description
    ---------------------+--------------------------+---------------------------------------------------
    n_neurons            | [128, 128, 2]            | List containing number of neurons for each layer
                         |                          | up to the bottleneck layer. For example [128, 128,
                         |                          | 2] stands for an autoencoder with the following
                         |                          | architecture {i, 128, 128, 2, 128, 128, i} where i
                         |                          | is the number of dimensions of the input data.
                         |                          | These are Input/Output Layers that are not
                         |                          | trained.
    ---------------------+--------------------------+---------------------------------------------------
    activation_functions | ['', 'tanh', 'tanh', ''] | List of activation function names as implemented
                         |                          | in TensorFlow. For example: "relu", "tanh",
                         |                          | "sigmoid" or "" to use no activation function. The
                         |                          | encoder part of the network takes the activation
                         |                          | functions from the list starting with the second
                         |                          | element. The decoder part of the network takes the
                         |                          | activation functions in reversed order starting
                         |                          | with the second element form the back. For example
                         |                          | ["", "relu", "tanh", ""] would result in a
                         |                          | autoencoder with {"relu", "tanh", "", "tanh",
                         |                          | "relu", ""} as sequence of activation functions.
    ---------------------+--------------------------+---------------------------------------------------
    periodicity          | 6.283185307179586        | Defines the distance between periodic walls for
                         |                          | the inputs. For example 2pi for angular values in
                         |                          | radians. All periodic data processed by EncoderMap
                         |                          | must be wrapped to one periodic window. E.g. data
                         |                          | with 2pi periodicity may contain values from -pi
                         |                          | to pi or from 0 to 2pi. Set the periodicity to
                         |                          | float("inf") for non-periodic inputs.
    ---------------------+--------------------------+---------------------------------------------------
    learning_rate        | 0.001                    | Learning rate used by the optimizer.
    ---------------------+--------------------------+---------------------------------------------------
    n_steps              | 100                      | Number of training steps.
    ---------------------+--------------------------+---------------------------------------------------
    batch_size           | 256                      | Number of training points used in each training
                         |                          | step
    ---------------------+--------------------------+---------------------------------------------------
    summary_step         | 1                        | A summary for TensorBoard is writen every
                         |                          | summary_step steps.
    ---------------------+--------------------------+---------------------------------------------------
    checkpoint_step      | 5000                     | A checkpoint is writen every checkpoint_step
                         |                          | steps.
    ---------------------+--------------------------+---------------------------------------------------
    dist_sig_parameters  | [4.5, 12, 6, 1, 2, 6]    | Parameters for the sigmoid functions applied to
                         |                          | the high- and low-dimensional distances in the
                         |                          | following order (sig_h, a_h, b_h, sig_l, a_l, b_l)
    ---------------------+--------------------------+---------------------------------------------------
    distance_cost_scale  | 500                      | Adjusts how much the distance based metric is
                         |                          | weighted in the cost function.
    ---------------------+--------------------------+---------------------------------------------------
    auto_cost_scale      | 1                        | Adjusts how much the autoencoding cost is weighted
                         |                          | in the cost function.
    ---------------------+--------------------------+---------------------------------------------------
    auto_cost_variant    | mean_abs                 | defines how the auto cost is calculated. Must be
                         |                          | one of: * `mean_square` * `mean_abs` * `mean_norm`
    ---------------------+--------------------------+---------------------------------------------------
    center_cost_scale    | 0.0001                   | Adjusts how much the centering cost is weighted in
                         |                          | the cost function.
    ---------------------+--------------------------+---------------------------------------------------
    l2_reg_constant      | 10.0                     | Adjusts how much the L2 regularisation is weighted
                         |                          | in the cost function.
    ---------------------+--------------------------+---------------------------------------------------
    gpu_memory_fraction  |                          | Specifies the fraction of gpu memory blocked. If
                         |                          | set to 0, memory is allocated as needed.
    ---------------------+--------------------------+---------------------------------------------------
    analysis_path        |                          | A path that can be used to store analysis
    ---------------------+--------------------------+---------------------------------------------------
    id                   |                          | Can be any name for the run. Might be useful for
                         |                          | example for specific analysis for different data
                         |                          | sets.
    ---------------------+--------------------------+---------------------------------------------------
    model_api            | sequential               | A string defining the API to be used to build the
                         |                          | keras model. Defaults to `sequntial`. Possible
                         |                          | strings are: * `functional` will use keras'
                         |                          | functional API. * `sequential` will define a keras
                         |                          | Model, containing two other models with the
                         |                          | Sequential API. These two models are encoder and
                         |                          | decoder. * `custom` will create a custom Model
                         |                          | where even the layers are custom.
    ---------------------+--------------------------+---------------------------------------------------
    loss                 | emap_cost                | A string defining the loss function. Defaults to
                         |                          | `emap_cost`. Possible losses are: *
                         |                          | `reconstruction_loss` will try to train output ==
                         |                          | input * `mse`: Returns a mean squared error loss.
                         |                          | * `emap_cost` is the EncoderMap loss function.
                         |                          | Depending on the class `Autoencoder`, `Encodermap,
                         |                          | `ACDAutoencoder`, different contributions are used
                         |                          | for a combined loss. Autoencoder uses atuo_cost,
                         |                          | reg_cost, center_cost. EncoderMap class adds
                         |                          | sigmoid_loss.
    ---------------------+--------------------------+---------------------------------------------------
    training             | auto                     | A string defining what kind of training is
                         |                          | performed when autoencoder.train() is callsed. *
                         |                          | `auto` does a regular model.compile() and
                         |                          | model.fit() procedure. * `custom` uses gradient
                         |                          | tape and calculates losses and gradients manually.
    ---------------------+--------------------------+---------------------------------------------------
    batched              | True                     | Whether the dataset is batched or not.
    ---------------------+--------------------------+---------------------------------------------------
    tensorboard          |                          | Whether to print tensorboard information. Defaults
                         |                          | to False.
    ---------------------+--------------------------+---------------------------------------------------
    seed                 |                          | Fixes the state of all operations using random

Before we can reload our trained network we need to save it manually, because the checkpoint step was set to 5000 and we did only write a checkpoint at 0 (random initial weights). We call e_map.save() to do so.

[17]:

e_map.save()

WARNING:absl:Function `_wrapped_model` contains input name(s) Encoder_0_input with unsupported characters which will be renamed to encoder_0_input in the SavedModel.
WARNING:absl:Function `_wrapped_model` contains input name(s) Decoder_0_input with unsupported characters which will be renamed to decoder_0_input in the SavedModel.

Saved current state of model.
Use em.EncoderMap.from_checkpoint('runs/asp7/run1/saved_model_2023-02-07T11:08:03+00:00.model*') to reload the current state of the two submodels.

And now we reload it.

[18]:

# get the most recent file
import os
from pathlib import Path
latest_checkpoint_file = str(list(sorted(Path("runs/asp7").rglob("*model*"), key=os.path.getmtime, reverse=True))[0]).rstrip("_decoder").rstrip("_encoder")
print(latest_checkpoint_file)
loaded_e_map = em.EncoderMap.from_checkpoint(latest_checkpoint_file, overwrite_tensorboard_bool=True)

runs/asp7/run1/saved_model_2023-02-07T11:08:03+00:00.model
rebuilding Model with input_dim = 12 and periodicity = 6.283185307179586

Now we are finished with loading and we can for example use the loaded EncoderMap object to project data to the low_dimensional space and plot the result:

[19]:

low_d_projection = e_map.encode(dihedrals)

# Plotting:
%matplotlib inline
fig, ax = plt.subplots()
ax.plot(low_d_projection[:, 0], low_d_projection[:, 1], linestyle="", marker=".",
         markersize=5, color="0.7", alpha=0.1)
for i in range(9):
    mask = cluster_ids == i + 1
    ax.plot(low_d_projection[:, 0][mask], low_d_projection[:, 1][mask], label=str(i),
             linestyle="", marker=".", markersize=5, alpha=0.3)

legend = ax.legend()
for lh in legend.legendHandles:
    if hasattr(lh, "legmarker"):
        lh.legmarker.set_alpha(1)

../_images/user_guide_and_examples_02_advanced_42_0.png

Generate Molecular Conformations#

Already in the cube example, you have seen that with EncoderMap it is not only possible to project points to the low-dimensional space. Also, a projection of low-dimensional points into the high-dimensional space is possible.

Here, we will use a tool form the EncoderMap library to interactively select a path in the low-dimensional map. We will project points along this path into the high-dimensional dihedral space, and use these dihedrals to reconstruct molecular conformations. This can be very useful to explore the landscape an to see what changes in the molecular conformation going from one cluster to another.

The next cell instantiates a class with which you can interact with the low-dimensional projection of the Autoencoder. You can select clusters with the Polygon, Ellipse,, Rectangle and Lasso tools. The clusters will be selected fron the input conformations in asp7.xtc.

More interesting is the Bezier and Path tool. With these you can generate molecular conformations from a path in the latent space.

Click Set Points and then Write/Generate to make your own clusters/paths. You can have a look at what you selected using the sess.view attribute of the InteractivePlotting class. For this you need to have nglview installed.

Give the InteractivePlotting a try. We would like to hear your feedback at GitHub.

Note: This notebook uses a non-interactive matploltib backend to render for web pages. Switch to an interactive backend by using either:

%matploltib qt5

%matploltib notebook

[20]:

#%matploltib notebook
sess = em.InteractivePlotting(e_map, "asp7.xtc", data=low_d_projection,
                              top='asp7.pdb', scatter_kws={'s': 2})

../_images/user_guide_and_examples_02_advanced_44_0.png

[21]:

sess.view

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[21], line 1
----> 1 sess.view

AttributeError: 'InteractivePlotting' object has no attribute 'view'

As backbone dihedrals contain no information about the side-chains, only the backbone of the molecule can be reconstructed. In case the generated conformations change very abruptly it might be sensible to increase the regularization constant to obtain a smoother representation. If the generated conformations along a path are not changing at all, the regularization is probably to strong and prevents the network form generating different conformations.

Conclusion#

In this tutorial we applied EncoderMap to a molecular system. You have learned how to monitor the EncoderMap training procedure with TensorBoard, how to restore previously saved EncoderMaps and how to generate Molecular conformations using the path selection tool.