Getting started: Basic Cube#

Welcome

Welcome to your first EncoderMap tutorial. All EncoderMap tutorials are provided as Jupyter notebooks that you can run locally, on BinderHub, or even on Google Colab.

Run this notebook on Google Colab:

Open in Colab

Find the documentation of EncoderMap:

https://ag-peter.github.io/encodermap

Goals:

In this tutorial you will learn:

  • How to set EncoderMap's parameters.

  • How to instantiate an EncoderMap class with these parameters.

  • How to run the dimensionality reduction.

  • How to project points from the high-dimensional space to the low-dimensional space and vice versa.

For Google Colab only:

If you’re on Google Colab, please uncomment and run these lines to install EncoderMap.

[1]:
# !wget https://raw.githubusercontent.com/AG-Peter/encodermap/main/tutorials/install_encodermap_google_colab.sh
# !sudo bash install_encodermap_google_colab.sh

Import Libraries#

Before we can get started using EncoderMap, we first need to import the EncoderMap library:

[2]:
import encodermap as em
/home/kevin/git/encoder_map_private/encodermap/__init__.py:194: GPUsAreDisabledWarning: EncoderMap disables the GPU per default because most tensorflow code runs with a higher compatibility when the GPU is disabled. If you want to enable GPUs manually, set the environment variable 'ENCODERMAP_ENABLE_GPU' to 'True' before importing EncoderMap. To do this in python you can run:

import os; os.environ['ENCODERMAP_ENABLE_GPU'] = 'True'

before importing encodermap.
  _warnings.warn(

We will also need some additional imports for plotting. The google.colab import pulls in some nice features for Google Colab, which render pandas DataFrames very nicely.

[3]:
import plotly
import plotly.graph_objs as go
import plotly.express as px
import plotly.io as pio
import pandas as pd
import numpy as np
try:
    from google.colab import data_table, output
    data_table.enable_dataframe_formatter()
    output.enable_custom_widget_manager()
    renderer = "colab"
except ModuleNotFoundError:
    renderer = "plotly_mimetype+notebook"
pio.renderers.default = renderer

To ensure that this notebook yields reproducible output, we fix TensorFlow’s random seed.

[4]:
import tensorflow as tf
tf.random.set_seed(3)

Load Data#

Next, we need to load our data. EncoderMap expects the input data to be a 2D array: each row contains one data point, and the number of columns is the dimensionality of the data set. Here, you could load data from any source. In this tutorial, however, we will use a function to generate a toy data set. The function em.misc.create_n_cube distributes a given number of points randomly on the edges of a cube. We can also add some Gaussian noise by specifying a sigma value.

[5]:
high_d_data, ids = em.misc.create_n_cube()
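
If you want to use your own data instead of this toy set, any 2D NumPy array with one row per data point works. A minimal sketch, assuming a whitespace-separated text file my_data.txt (a hypothetical file name):

import numpy as np

# Hypothetical example: load your own data from a text file.
# EncoderMap expects a 2D array with one row per data point
# and one column per input dimension.
own_data = np.loadtxt("my_data.txt")
assert own_data.ndim == 2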

Let’s look at the data we have just created:

[6]:
df = pd.DataFrame(
    np.vstack([ids, high_d_data.T]).T,
    columns=["id", "x", "y", "z"],
    index=[f"Point {i}" for i in range(len(high_d_data))]
).astype({"id": "int"})
df
[6]:
            id         x         y         z
Point 0      1 -0.005684 -0.003473  0.050832
Point 1      1 -0.023902 -0.001094 -0.069661
Point 2      1  0.011999  0.012342  0.018002
Point 3      1  0.000477 -0.066432  0.054068
Point 4      1 -0.025671 -0.013453  0.045888
...        ...       ...       ...       ...
Point 5995  11  0.972860  1.003848  0.985437
Point 5996  11  1.044156  0.974762  0.987097
Point 5997  11  0.918292  1.016864  1.010103
Point 5998  11  1.040325  0.969910  1.035233
Point 5999  11  1.006821  1.024599  1.044481

6000 rows × 4 columns

We can now plot the data like so:

[7]:
fig = px.scatter_3d(df, x='x', y='y', z='z', color='id', color_continuous_scale=plotly.colors.sequential.Viridis)
fig.show()

As you can see, we have a fuzzy cube. The edges of the cube are described by points in 3d space, and the colors of the points correspond to the id column of our DataFrame. Note how some colors appear on two edges. Try to keep track of these special edges throughout this notebook.

Select Parameters#

Now that we have loaded our data, we need to select parameters for EncoderMap. Parameters are stored in an instance of the Parameters class. A list of the available parameters can be found here. Most of the default parameters are fine for our example; only a few need adjustment:

  • periodicity

    • This parameter defines the periodicity of the space of your input data. This is important if your data consists of angles, in which case the default periodicity of 2π is appropriate. In our case, the data lies in a non-periodic Euclidean space, so we set the periodicity to float("inf"). (A short sketch of wrapping periodic data into a single window follows this list.)

  • n_steps

    • This is the number of training steps. For our small example, 200 steps are enough.
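
As a side note: all periodic data must be wrapped to a single periodic window before it is passed to EncoderMap (see the periodicity description in the parameter table below). A minimal sketch, assuming hypothetical angle data in radians with 2π periodicity:

import numpy as np

# Hypothetical example: wrap angles (in radians) into the
# window [-pi, pi) so all values lie in one periodic image.
angles = np.array([3.5, -4.0, 1.2])
wrapped = np.mod(angles + np.pi, 2 * np.pi) - np.pi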

[8]:
parameters = em.Parameters(
    periodicity=float("inf"),
    n_steps=200,
)

Furthermore, we should adjust the sigmoid functions applied to the high-dimensional and low-dimensional pairwise distances in the distance-based part of the cost function. There are three parameters for each sigmoid, which should be given in the following order:

(sig_h, a_h, b_h, sig_l, a_l, b_l)

To select these parameters, it is helpful to plot the sigmoid functions together with a histogram of the pairwise distances in the data set. In the next cell, you can experiment with these parameters. If you don’t feel like playing around, the initial_guess parameter is a good guess for this system.

[9]:
em.plot.distance_histogram_interactive(
    high_d_data,
    parameters.periodicity,
    bins=50,
    initial_guess=(0.3, 6, 6, 1, 4, 6),
)
[9]:

The upper plot shows a histogram of the pairwise distances. The pairwise distances between \(n\) points in a space of any dimension can be represented as a matrix \(D\):

\begin{equation} D = \begin{bmatrix} d_{11} & d_{12} & \dots & d_{1n} \\ d_{21} & d_{22} & \dots & d_{2n} \\ \vdots & & \ddots & \vdots \\ d_{n1} & d_{n2} & \dots & d_{nn} \end{bmatrix} \end{equation}

where \(d_{ij}\), the distance between points \(r_i\) and \(r_j\), is given by:

\begin{equation} d_{ij} = \lVert r_i - r_j \rVert \end{equation}

or:

\begin{equation} d_{ij} = \begin{cases} \lVert r_i - r_j \rVert, & \text{if $d_{ij} \leq p$} \\ \lVert r_i - r_j \rVert - p, & \text{if $d_{ij} > p$} \end{cases} \end{equation}

for periodic systems obeying the minimum image convention in a space with periodicity \(p\). In the same plot, the high-d sigmoid function and its derivative are shown. This derivative shows the sensitive range of the distance-based part of the cost function. As it is not possible to preserve all pairwise distances in the low-d representation, we want to tune this sensitive range to the distances that are most important to us. Usually, very short distances are not important for the structure of a data set, as they stem from points inside the same local region. Long distances might be interesting but can hardly be reproduced in a lower-dimensional representation. Somewhere in between lie the most important distances, which carry the information about how local regions in the data are connected to neighboring regions.
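
To make the matrix \(D\) concrete, here is a minimal NumPy sketch computing the non-periodic pairwise distances for a subset of the cube data (the full 6000 × 6000 matrix would be rather large):

import numpy as np

# Pairwise Euclidean distances d_ij = ||r_i - r_j|| for a
# random subset of the cube points, via broadcasting.
idx = np.random.default_rng(3).choice(len(high_d_data), 500, replace=False)
subset = high_d_data[idx]
diff = subset[:, None, :] - subset[None, :, :]  # shape (500, 500, 3)
D = np.linalg.norm(diff, axis=-1)               # shape (500, 500)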

The lower plot shows the low-d sigmoid function. The black lines connecting the plots of the high-d sigmoid and the low-d sigmoid indicate to which low-dimensional distances high-dimensional distances will ideally be mapped with your choice of sigmoid parameters.

The sigmoid parameters for the low-d space can be selected according to the following rules:

sig_l = 1 (irrelevant, as it only scales the low-dimensional map)
a_l = a_h * n_dimensions_l / n_dimensions_h
b_l = b_h

Further information about the selection of these sigmoid parameters can be found in the Sketch-map literature.
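
Applied to our example, these rules reproduce the suggested parameters. A minimal sketch of the arithmetic for our 3-dimensional input mapped to 2 dimensions:

# Deriving the low-d sigmoid parameters from the high-d ones
# for a 3d -> 2d projection, following the rules above.
sig_h, a_h, b_h = 0.3, 6, 6
n_dimensions_h, n_dimensions_l = 3, 2

sig_l = 1                                    # only scales the low-d map
a_l = a_h * n_dimensions_l / n_dimensions_h  # 6 * 2 / 3 = 4.0
b_l = b_h                                    # 6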

Feel free to play with different sigmoid parameters in the previous cell and see how the sigmoid function changes. I recommend continuing the tutorial with (0.3, 6, 6, 1, 4, 6) for a start, but you can come back later and change these parameters.

In the next cell, you can set the sigmoid parameters and save them in the parameters instance.

[10]:
# @title Setting the parameters { run: "auto", vertical-output: true }

sig_h = 0.3 # @param {type:"number"}
a_h = 6 # @param {type:"number"}
b_h = 6 # @param {type:"number"}
sig_l = 1 # @param {type:"number"}
a_l = 4 # @param {type:"number"}
b_l = 6 # @param {type:"number"}
parameters.dist_sig_parameters = (sig_h, a_h, b_h, sig_l, a_l, b_l)

Get more info about parameters#

To get more information about your parameters, use the .parameters attribute.

[11]:
print(parameters.parameters)
    Parameter                 | Value                    | Description
    --------------------------+--------------------------+---------------------------------------------------
    n_neurons                 | [128, 128, 2]            | List containing number of neurons for each layer
                              |                          | up to the bottleneck layer. For example [128, 128,
                              |                          | 2] stands for an autoencoder with the following
                              |                          | architecture {i, 128, 128, 2, 128, 128, i} where i
                              |                          | is the number of dimensions of the input data.
                              |                          | These are Input/Output Layers that are not
                              |                          | trained.
    --------------------------+--------------------------+---------------------------------------------------
    activation_functions      | ['', 'tanh', 'tanh', ''] | List of activation function names as implemented
                              |                          | in TensorFlow. For example: "relu", "tanh",
                              |                          | "sigmoid" or "" to use no activation function. The
                              |                          | encoder part of the network takes the activation
                              |                          | functions from the list starting with the second
                              |                          | element. The decoder part of the network takes the
                              |                          | activation functions in reversed order starting
                              |                          | with the second element form the back. For example
                              |                          | ["", "relu", "tanh", ""] would result in a
                              |                          | autoencoder with {"relu", "tanh", "", "tanh",
                              |                          | "relu", ""} as sequence of activation functions.
    --------------------------+--------------------------+---------------------------------------------------
    periodicity               | inf                      | Defines the distance between periodic walls for
                              |                          | the inputs. For example 2pi for angular values in
                              |                          | radians. All periodic data processed by EncoderMap
                              |                          | must be wrapped to one periodic window. E.g. data
                              |                          | with 2pi periodicity may contain values from -pi
                              |                          | to pi or from 0 to 2pi. Set the periodicity to
                              |                          | float("inf") for non-periodic inputs.
    --------------------------+--------------------------+---------------------------------------------------
    learning_rate             | 0.001                    | Learning rate used by the optimizer.
    --------------------------+--------------------------+---------------------------------------------------
    n_steps                   | 200                      | Number of training steps.
    --------------------------+--------------------------+---------------------------------------------------
    batch_size                | 256                      | Number of training points used in each training
                              |                          | step
    --------------------------+--------------------------+---------------------------------------------------
    summary_step              | 10                       | A summary for TensorBoard is writen every
                              |                          | summary_step steps.
    --------------------------+--------------------------+---------------------------------------------------
    checkpoint_step           | 5000                     | A checkpoint is writen every checkpoint_step
                              |                          | steps.
    --------------------------+--------------------------+---------------------------------------------------
    dist_sig_parameters       | (0.3, 6, 6, 1, 4, 6)     | Parameters for the sigmoid functions applied to
                              |                          | the high- and low-dimensional distances in the
                              |                          | following order (sig_h, a_h, b_h, sig_l, a_l, b_l)
    --------------------------+--------------------------+---------------------------------------------------
    distance_cost_scale       | 500                      | Adjusts how much the distance based metric is
                              |                          | weighted in the cost function.
    --------------------------+--------------------------+---------------------------------------------------
    auto_cost_scale           | 1                        | Adjusts how much the autoencoding cost is weighted
                              |                          | in the cost function.
    --------------------------+--------------------------+---------------------------------------------------
    auto_cost_variant         | mean_abs                 | defines how the auto cost is calculated. Must be
                              |                          | one of: * `mean_square` * `mean_abs` * `mean_norm`
    --------------------------+--------------------------+---------------------------------------------------
    center_cost_scale         | 0.0001                   | Adjusts how much the centering cost is weighted in
                              |                          | the cost function.
    --------------------------+--------------------------+---------------------------------------------------
    l2_reg_constant           | 0.001                    | Adjusts how much the L2 regularisation is weighted
                              |                          | in the cost function.
    --------------------------+--------------------------+---------------------------------------------------
    gpu_memory_fraction       |                          | Specifies the fraction of gpu memory blocked. If
                              |                          | set to 0, memory is allocated as needed.
    --------------------------+--------------------------+---------------------------------------------------
    analysis_path             |                          | A path that can be used to store analysis
    --------------------------+--------------------------+---------------------------------------------------
    id                        |                          | Can be any name for the run. Might be useful for
                              |                          | example for specific analysis for different data
                              |                          | sets.
    --------------------------+--------------------------+---------------------------------------------------
    model_api                 | sequential               | A string defining the API to be used to build the
                              |                          | keras model. Defaults to `sequntial`. Possible
                              |                          | strings are: * `functional` will use keras'
                              |                          | functional API. * `sequential` will define a keras
                              |                          | Model, containing two other models with the
                              |                          | Sequential API. These two models are encoder and
                              |                          | decoder. * `custom` will create a custom Model
                              |                          | where even the layers are custom.
    --------------------------+--------------------------+---------------------------------------------------
    loss                      | emap_cost                | A string defining the loss function. Defaults to
                              |                          | `emap_cost`. Possible losses are: *
                              |                          | `reconstruction_loss` will try to train output ==
                              |                          | input * `mse`: Returns a mean squared error loss.
                              |                          | * `emap_cost` is the EncoderMap loss function.
                              |                          | Depending on the class `Autoencoder`, `Encodermap,
                              |                          | `ADCAutoencoder`, different contributions are used
                              |                          | for a combined loss. Autoencoder uses atuo_cost,
                              |                          | reg_cost, center_cost. EncoderMap class adds
                              |                          | sigmoid_loss.
    --------------------------+--------------------------+---------------------------------------------------
    training                  | auto                     | A string defining what kind of training is
                              |                          | performed when autoencoder.train() is callsed. *
                              |                          | `auto` does a regular model.compile() and
                              |                          | model.fit() procedure. * `custom` uses gradient
                              |                          | tape and calculates losses and gradients manually.
    --------------------------+--------------------------+---------------------------------------------------
    batched                   | True                     | Whether the dataset is batched or not.
    --------------------------+--------------------------+---------------------------------------------------
    tensorboard               |                          | Whether to print tensorboard information. Defaults
                              |                          | to False.
    --------------------------+--------------------------+---------------------------------------------------
    seed                      |                          | Fixes the state of all operations using random
                              |                          | numbers. Defaults to None.
    --------------------------+--------------------------+---------------------------------------------------
    current_training_step     |                          | The current training step. Aids in reloading of
                              |                          | models.
    --------------------------+--------------------------+---------------------------------------------------
    write_summary             |                          | If True writes a summar.txt of the models into
                              |                          | main_path if `tensorboard` is True, summaries will
                              |                          | also be written.
    --------------------------+--------------------------+---------------------------------------------------
    trainable_dense_to_sparse |                          | When using different topologies to train the
                              |                          | AngleDihedralCartesianEncoderMap, some inputs
                              |                          | might be sparse, which means, they have missing
                              |                          | values. Creating a dense input is done by first
                              |                          | passing these sparse tensors through
                              |                          | `tf.keras.layers.Dense` layers. These layers have
                              |                          | trainable weights, and if this parameter is True,
                              |                          | these weights will be changed by the optimizer.
    --------------------------+--------------------------+---------------------------------------------------
    using_hypercube           |                          | This parameter is not meant to be set by the user.
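
As cell [10] already showed for dist_sig_parameters, each entry of this table is also accessible as an attribute of the Parameters instance:

print(parameters.n_steps)              # 200
print(parameters.periodicity)          # inf
print(parameters.dist_sig_parameters)  # (0.3, 6, 6, 1, 4, 6)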

Perform Dimensionality Reduction#

Now that we have set up the parameters and loaded the data, it is very simple to perform the dimensionality reduction. All we need to do is create an EncoderMap object and call its train method. The EncoderMap object takes care of setting up the neural network autoencoder, and once you call the train method, this network is trained to minimize the cost function specified in the parameters.

[12]:
e_map = em.EncoderMap(parameters, high_d_data)
Output files are saved to /home/kevin/git/encoder_map_private/docs/source/notebooks/starter_nb as defined in 'main_path' in the parameters.
[13]:
history = e_map.train()
100%|███████████████████████████| 200/200 [00:01<00:00, 112.14it/s, Loss after step 200=6.93]
Saving the model to /home/kevin/git/encoder_map_private/docs/source/notebooks/starter_nb/saved_model_2024-12-30T10:52:26+01:00.keras. Use `em.EncoderMap.from_checkpoint('/home/kevin/git/encoder_map_private/docs/source/notebooks/starter_nb')` to load the most recent model, or `em.EncoderMap.from_checkpoint('/home/kevin/git/encoder_map_private/docs/source/notebooks/starter_nb/saved_model_2024-12-30T10:52:26+01:00.keras')` to load the model with specific weights..
This model has a subclassed encoder, which can be loaded independently. Use `tf.keras.load_model('/home/kevin/git/encoder_map_private/docs/source/notebooks/starter_nb/saved_model_2024-12-30T10:52:26+01:00_encoder.keras')` to load only this model.
This model has a subclassed decoder, which can be loaded independently. Use `tf.keras.load_model('/home/kevin/git/encoder_map_private/docs/source/notebooks/starter_nb/saved_model_2024-12-30T10:52:26+01:00_decoder.keras')` to load only this model.
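
If you want to inspect how the loss developed during training, the return value of train() can be plotted. This is a minimal sketch that assumes a Keras-style History object with a history dictionary; check the return type in your EncoderMap version:

# Hypothetical: plot the recorded loss curve, assuming train()
# returned a Keras-style History object.
fig = px.line(y=history.history["loss"])
fig.update_layout(xaxis_title="training step", yaxis_title="loss")
fig.show()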

Once the network is trained, we can feed high-dimensional data into the encoder part of the network and read the values from the bottleneck layer. That is how we project data to the low-dimensional space. The following line projects all our high-dimensional data to the low-dimensional space:

[14]:
low_d_projection = e_map.encode(high_d_data)

Let’s have a look at the result and plot the data:

[15]:
fig = px.scatter(x=low_d_projection[:, 0], y=low_d_projection[:, 1], color=df["id"].values, color_continuous_scale=plotly.colors.sequential.Viridis)
fig.show()

Generate High-Dimensional Data#

We can not only use the encoder part of the network to project points to the low-dimensional space; the inverse procedure is also possible using the decoder part of the network. This allows us to project any point from the low-dimensional space to the high-dimensional space.

In the following, we feed all low-dimensional points into the decoder part of the network to generate high-dimensional points:

[16]:
generated = e_map.generate(low_d_projection)
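
Because the decoder accepts arbitrary low-dimensional points, not just projections of the training data, we can also decode points we pick ourselves. A minimal sketch that decodes a regular 10 × 10 grid spanning the low-dimensional map:

import numpy as np

# Decode a regular grid of low-dimensional points back into
# the 3d input space using the trained decoder.
xs = np.linspace(low_d_projection[:, 0].min(), low_d_projection[:, 0].max(), 10)
ys = np.linspace(low_d_projection[:, 1].min(), low_d_projection[:, 1].max(), 10)
grid = np.array([[x, y] for x in xs for y in ys])  # shape (100, 2)
grid_generated = e_map.generate(grid)              # shape (100, 3)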

Let’s have a look at the points generated from our projected data:

[17]:
fig = px.scatter_3d(x=generated[:, 0], y=generated[:, 1], z=generated[:, 2], color=df["id"].values, color_continuous_scale=plotly.colors.sequential.Viridis)
fig.show()

You probably see a cube-like structure again. The reconstruction, however, will not be perfect, as information is lost when the data is projected to a lower-dimensional space.
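
We can quantify this directly, for example with the mean absolute deviation between the original points and their reconstructions (this mirrors the mean_abs auto cost variant listed in the parameter table):

import numpy as np

# Mean absolute reconstruction error between the original
# high-dimensional points and the decoder's output.
reconstruction_error = np.mean(np.abs(high_d_data - generated))
print(f"mean abs reconstruction error: {reconstruction_error:.4f}")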

Conclusion#

In this tutorial you have learned:

  • How to set EncoderMap's parameters.

  • How to instantiate an EncoderMap class with these parameters.

  • How to run the dimensionality reduction.

  • How to project points from the high-dimensional space to the low-dimensional space and vice versa.