Intermediate EncoderMap: Different Topologies#

Welcome

Welcome to the intermediate part of the EncoderMap tutorials. The notebooks in this section contain more in-depth explanations of the concepts in EncoderMap.

What are different topologies#

With the word ‘topology’ we mean the connectivity of a protein. Think of it as a graph. The nodes are the atoms and the connections are the bonds. Two topologies are identical if their graphs are identical. Two proteins are different if you exchange or remove a single atom.

For some proteins, we observe and describe so-called protein families. In such families, proteins with similar (but not identical) sequences perform very similar tasks in an organism. As their sequences are different proteins in such families exhibit different topologies. Using standard features (distances, angles, dihedrals) that depend on the sequence can’t be used classically to describe and characterize such protein families. A pre-selection of features has to be performed on these topologically different proteins to achieve feature spaces with the same number of dimensions for all topologies.

EncoderMap III introduces a way to forego the pre-selection by allowing the input to the ``AngleDihedralCartesianEncoderMap`` to be sparse.

Authors: Kevin Sawade (kevin.sawade@uni-konstanz.de), Tobias Lemke, Christine Peter (christine.peter@uni-konstanz.de)

Find the documentation of EncoderMap:

https://ag-peter.github.io/encodermap

Goals

In this tutorial, you will learn:

How to

[6]:

# !pip install "git+https://github.com/AG-Peter/encodermap.git@main"
# !pip install -r pip install -r https://raw.githubusercontent.com/AG-Peter/encodermap/main/tests/test_requirements.md

Imports#

[33]:

import encodermap as em
import numpy as np
import networkx as nx
import plotly.graph_objects as go
import plotly.express as px
import warnings

%load_ext autoreload
%autoreload 2
%config Completer.use_jedi = False

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Fix tensorflow seed for reproducibility

[8]:

import tensorflow as tf
tf.random.set_seed(1)

Load the trajectories#

We use EncoderMap’s TrajEnsemble class to load the trajectories and do the feature alignment.

[20]:

traj_files = ["glu7.xtc", "asp7.xtc"]
top_files = ["glu7.pdb", "asp7.pdb"]

trajs = em.load(traj_files, top_files, common_str=["glu7", "asp7"])

There’s an issue with the proteins in these trajectories. They are missing a connection between the C-terminal hydroxy oxygen and hydrogen. The GLU7-HO atom is extra.

[21]:

em.plot.plot_ball_and_stick(trajs[0], highlight="bonds")

[22]:

G = trajs[0].top.to_bondgraph()

# Generate positions for the nodes
pos = nx.spring_layout(G)

# Create a Plotly figure
fig = go.Figure()

# Add edges to the figure
for u, v, data in G.edges(data=True):
    x0, y0 = pos[u]
    x1, y1 = pos[v]
    fig.add_trace(go.Scatter(x=[x0, x1], y=[y0, y1], mode='lines', line=dict(width=5, color='gray')))

# Add nodes to the figure
for node in G.nodes():
    x, y = pos[node]
    fig.add_trace(go.Scatter(x=[x], y=[y], mode='markers', marker=dict(size=10), hovertemplate="%{customdata}", customdata=[str(node)]))

# Show the figure
fig.update_layout({"width": 800, "height": 800, "showlegend": False})
fig.show()

This can be fixed with EncoderMap, by defining custom amino acids:

[23]:

custom_aas = {
    "ASP": (
        "A",
        {
            "optional_bonds": [
                ("N", "H1"),
                ("N", "H2"),
                ("N", "H"),
                ("N", "CA"),
                ("CA", "CB"),
                ("CB", "CG"),
                ("CG", "OD1"),
                ("CG", "OD2"),
                ("OD2", "HD2"),
                ("CA", "C"),
                ("C", "O"),
                ("C", "OT"),
                ("O", "HO"),
                ("C", "+N"),
            ],
        },
    ),
    "GLU": (
        "E",
        {
            "optional_bonds": [
                ("N", "H1"),
                ("N", "H2"),
                ("N", "H"),
                ("N", "CA"),
                ("CA", "CB"),
                ("CB", "CG"),
                ("CG", "CD"),
                ("CD", "OE1"),
                ("CD", "OE2"),
                ("OE2", "HE2"),
                ("CA", "C"),
                ("C", "O"),
                ("C", "OT"),
                ("O", "HO"),
                ("C", "+N"),
            ],
        },
    ),
}


trajs.load_custom_topology(custom_aas)

[24]:

G = trajs[0].top.to_bondgraph()

# Generate positions for the nodes
pos = nx.spring_layout(G)

# Create a Plotly figure
fig = go.Figure()

# Add edges to the figure
for u, v, data in G.edges(data=True):
    x0, y0 = pos[u]
    x1, y1 = pos[v]
    fig.add_trace(go.Scatter(x=[x0, x1], y=[y0, y1], mode='lines', line=dict(width=5, color='gray')))

# Add nodes to the figure
for node in G.nodes():
    x, y = pos[node]
    fig.add_trace(go.Scatter(x=[x], y=[y], mode='markers', marker=dict(size=10), hovertemplate="%{customdata}", customdata=[str(node)]))

# Show the figure
fig.update_layout({"width": 800, "height": 800, "showlegend": False})
fig.show()

[25]:

em.plot.plot_ball_and_stick(trajs[0], highlight="atoms")

Load the CVs with the ensemble=True options. This option will tell EncoderMap to assume the trajectories to be part of the same ensemble. To highlight the key differences let use first look at a copy of the ASP7 trajectory and run a full featurization on that.

[47]:

asp7_copy = trajs.trajs_by_common_str["asp7"][0].copy()
asp7_copy.load_CV("all")

/home/kevin/git/encoder_map_private/encodermap/trajinfo/trajinfo_utils.py:1299: UserWarning:

Your custom topology for residue name=ASP resSeq=None index=None does not define atoms for the dihedral PSI. If this dihedral consists of standard atom names, it will be considered for dihedral calculations. If this dihedral should not be present in your custom topology you need to explicitly delete it by adding 'PSI_ATOMS': 'delete' to your custom_topology. If you want this dihedral to be present in your topology, you can ignore this warning.

/home/kevin/git/encoder_map_private/encodermap/trajinfo/trajinfo_utils.py:1299: UserWarning:

Your custom topology for residue name=GLU resSeq=None index=None does not define atoms for the dihedral PSI. If this dihedral consists of standard atom names, it will be considered for dihedral calculations. If this dihedral should not be present in your custom topology you need to explicitly delete it by adding 'PSI_ATOMS': 'delete' to your custom_topology. If you want this dihedral to be present in your topology, you can ignore this warning.

/home/kevin/git/encoder_map_private/encodermap/trajinfo/trajinfo_utils.py:1299: UserWarning:

Your custom topology for residue name=ASP resSeq=None index=None does not define atoms for the dihedral OMEGA. If this dihedral consists of standard atom names, it will be considered for dihedral calculations. If this dihedral should not be present in your custom topology you need to explicitly delete it by adding 'OMEGA_ATOMS': 'delete' to your custom_topology. If you want this dihedral to be present in your topology, you can ignore this warning.

/home/kevin/git/encoder_map_private/encodermap/trajinfo/trajinfo_utils.py:1299: UserWarning:

Your custom topology for residue name=GLU resSeq=None index=None does not define atoms for the dihedral OMEGA. If this dihedral consists of standard atom names, it will be considered for dihedral calculations. If this dihedral should not be present in your custom topology you need to explicitly delete it by adding 'OMEGA_ATOMS': 'delete' to your custom_topology. If you want this dihedral to be present in your topology, you can ignore this warning.

/home/kevin/git/encoder_map_private/encodermap/trajinfo/trajinfo_utils.py:1299: UserWarning:

Your custom topology for residue name=ASP resSeq=None index=None does not define atoms for the dihedral PHI. If this dihedral consists of standard atom names, it will be considered for dihedral calculations. If this dihedral should not be present in your custom topology you need to explicitly delete it by adding 'PHI_ATOMS': 'delete' to your custom_topology. If you want this dihedral to be present in your topology, you can ignore this warning.

/home/kevin/git/encoder_map_private/encodermap/trajinfo/trajinfo_utils.py:1299: UserWarning:

Your custom topology for residue name=GLU resSeq=None index=None does not define atoms for the dihedral PHI. If this dihedral consists of standard atom names, it will be considered for dihedral calculations. If this dihedral should not be present in your custom topology you need to explicitly delete it by adding 'PHI_ATOMS': 'delete' to your custom_topology. If you want this dihedral to be present in your topology, you can ignore this warning.

Notice how the labels of the sidechain diherdal features contain the name of the residue (ASP).

[31]:

asp7_copy._CVs

Now if we run a full featurization with ensemble=True.

Hint: Here we’re using the with warnings.catch_warnings() to catch the same warnings informing us about the backbone angles.

[48]:

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    trajs.load_CVs("all", ensemble=True)

Notice, how the features are now labelled with more generic labels. These generic labels allow EncoderMap to combine feature spaces from different topologies.

[49]:

trajs._CVs

Let us take a look at the SIDECHDIH CHI3 1 label, specifically. Notice, how for glu7 the angle is defined, whereas asp7 exhibits np.nan at these positions.

[53]:

trajs._CVs["side_dihedrals"].sel({"SIDE_DIHEDRALS": "SIDECHDIH CHI3   1", "traj_num": 0})

[54]:

trajs._CVs["side_dihedrals"].sel({"SIDE_DIHEDRALS": "SIDECHDIH CHI3   1", "traj_num": 1})

Create the AngleDihedralCartesianEncoderMap#

The AngleDihedralCartesianEncoderMap tries to learn all of the geometric features of a protein. The angles (backbone angles, backbone dihedrals, sidechain dihedrals) are passed through a neuronal network autoencoder, while the distances between the backbone atoms are used to create cartesian coordinates from the learned angles. The generated cartesians and the input (true) cartesians are used to construct pairwise C\(_\alpha\) distances, which are then also weighted using sketchmap’s sigmoid function. The cartesian_cost_scale_soft_start gradually increases the contribution of this cost function to the overall model loss.

[55]:

p = em.ADCParameters(use_backbone_angles=True,
                     distance_cost_scale=1,
                     auto_cost_scale=0.1,
                     cartesian_cost_scale_soft_start=(50, 80),
                     n_neurons = [500, 250, 125, 2],
                     activation_functions = ['', 'tanh', 'tanh', 'tanh', ''],
                     use_sidechains=True,
                     summary_step=1,
                     tensorboard=True,
                     periodicity=2*np.pi,
                     n_steps=100,
                     checkpoint_step=1000,
                     dist_sig_parameters = (4.5, 12, 6, 1, 2, 6),
                     main_path=em.misc.run_path('runs/asp7_glu7_asp8'),
                     model_api='functional',
                    )
emap = em.AngleDihedralCartesianEncoderMap(trajs, p)

Output files are saved to runs/asp7_glu7_asp8/run2 as defined in 'main_path' in the parameters.
input shapes are:
{'central_cartesians': (20002, 21, 3), 'central_distances': (20002, 20), 'side_dihedrals': TensorShape([20002, 21]), 'central_dihedrals': (20002, 18), 'central_angles': (20002, 19)}
Saved a text-summary of the model and an image in runs/asp7_glu7_asp8/run2, as specified in 'main_path' in the parameters.

train

[56]:

history = emap.train()

Calculating references: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 78/78 [00:00<00:00, 235.50it/s]

After 77 steps setting cost references: {'dihedral_cost': 1.7226728, 'angle_cost': 0.29572558, 'cartesian_cost': 0.19267249} to parameters.

100%|███████████████████████████████████████████████████████████████████████| 100/100 [00:15<00:00,  6.62it/s, Loss after step 100=3.02, Cartesian cost scale=1]

Saving the model to runs/asp7_glu7_asp8/run2/saved_model_2025-06-19T20:37:01+02:00.keras. Use `em.AngleDihedralCartesianEncoderMap.from_checkpoint('runs/asp7_glu7_asp8/run2')` to load the most recent model, or `em.AngleDihedralCartesianEncoderMap.from_checkpoint('runs/asp7_glu7_asp8/run2/saved_model_2025-06-19T20:37:01+02:00.keras')` to load the model with specific weights..
This model has a subclassed encoder, which can be loaded independently. Use `tf.keras.load_model('runs/asp7_glu7_asp8/run2/saved_model_2025-06-19T20:37:01+02:00_encoder.keras')` to load only this model.
This model has a subclassed decoder, which can be loaded independently. Use `tf.keras.load_model('runs/asp7_glu7_asp8/run2/saved_model_2025-06-19T20:37:01+02:00_decoder.keras')` to load only this model.

Plot the result#

In the result (longer training would be beneficial here), the projection area of asp7 and glu7 are separated.

[57]:

ids = (trajs.name_arr == "asp7").astype(int)
glu7_lowd = emap.encode()[ids == 0]
asp7_lowd = emap.encode()[ids == 1]

fig = go.Figure(
    data=[
        go.Scatter(x=asp7_lowd[:, 0], y=asp7_lowd[:, 1], name="Asp7", mode="markers"),
        go.Scatter(x=glu7_lowd[:, 0], y=glu7_lowd[:, 1], name="Glu7", mode="markers"),
    ],
)
fig.update_layout({"height": 800, "width": 800})
fig.show()

Due to the size of the provided data 190019, I need to chunk it, which takes longer. Sit back, grab a coffee...
Due to the size of the provided data 190019, I need to chunk it, which takes longer. Sit back, grab a coffee...
Due to the size of the provided data 190019, I need to chunk it, which takes longer. Sit back, grab a coffee...
Due to the size of the provided data 190019, I need to chunk it, which takes longer. Sit back, grab a coffee...

Create a new trajectory#

Using the InteractivePlotting class, we can easily generate new molecular conformations by using the decoder part of the neural network. If you’re running an interactive notebook, you can use the notebook or qt5 backend and play around with the InteractivePlotting.

[96]:

sess = em.InteractivePlotting(emap, ball_and_stick=False)

For static notebooks, we load the points along the path and generate new molecular conformations from them.

[71]:

from encodermap.plot.plotting import _plot_free_energy

path = np.load("path.npy")

fig = go.Figure(
    data=[
        _plot_free_energy(*trajs.lowd.T, bins=50),
        go.Scatter(
            x=path[:, 0],
            y=path[:, 1],
            mode="lines",
            name="",
            line_width=5,
            customdata=[i for i in range(len(path))],
            hovertemplate="%{customdata}",
            showlegend=False,
        ),
    ]
)
layout = {
    "width": 800,
    "height": 800,
    "template": "none",
    "xaxis": {
        "showgrid": False,
        "showline": False,
        "zeroline": False,
        "showticklabels": False,
        "ticks": "",
    },
    "yaxis": {
        "showgrid": False,
        "showline": False,
        "zeroline": False,
        "showticklabels": False,
        "ticks": "",
    },
    "plot_bgcolor": "rgba(0, 0, 0, 0)",
    "paper_bgcolor": 'rgba(0, 0, 0, 0)',
}
fig.update_layout(**layout)
fig.show()

And generate molecular conformations from the path. For the emap.generate() method, we can choose which topology to use.

[82]:

generated_asp7 = emap.generate(path, top="asp7", backend="mdtraj")
generated_glu7 = emap.generate(path, top="glu7", backend="mdtraj")