Enhancing TimeGAN with Language Model Architectures: Autoregressive Transformers and Positional Encoding
The generation of synthetic yet realistic time-series data is a critical task in various domains, including finance, healthcare, and environmental monitoring. Time-series Generative Adversarial Networks (TimeGAN) have emerged as a powerful tool for this purpose, combining the strengths of Generative Adversarial Networks (GANs) and Recurrent Neural Networks (RNNs) to model sequential dependencies. However, the inherent limitations of RNNs, such as difficulties in capturing long-range dependencies, can impact the quality and coherence of generated time series, especially those with complex temporal patterns. In this article, we explore the integration of transformer architectures and positional encoding into TimeGAN, drawing inspiration from recent advances in natural language processing (NLP). We will delve into the architectural modifications and their benefits, and provide code snippets to illustrate the implementation.
The code implementation presented in this article is inspired by and builds upon the foundational work of the TimeGAN model as described in the paper “Time-series Generative Adversarial Networks” by Yoon et al. (NeurIPS 2019).
The core concepts and structure of the TimeGAN model are retained, while the code incorporates modifications and enhancements, particularly the integration of transformer architectures and positional encoding, to improve the model’s capabilities.
The baseline TimeGAN implementation on which this code builds can be found in the GitHub repository: https://github.com/flaviagiammarino/time-gan-tensorflow/tree/main.
The code presented here serves as an example of how to build upon this foundation and explore the potential of transformer-based architectures for time-series generation.
The Foundation: Understanding the Original TimeGAN
The original TimeGAN architecture comprises several key components that work in concert to generate synthetic time-series data:
- Embedding and Recovery Functions: The embedding function, often an RNN, maps the input time series into a latent space, capturing its temporal dynamics in a lower-dimensional representation. The recovery function, typically a feedforward network, decodes these latent representations back to the original feature space, ensuring the preservation of essential information.
- Sequence Generator and Discriminator: The generator, also an RNN, takes random noise as input and generates synthetic time series in the latent space. The discriminator, often a bidirectional RNN, distinguishes between real and synthetic time series in both the input and latent spaces, providing feedback to the generator.
- Supervised Loss for Temporal Coherence: TimeGAN introduces a supervised loss that compares the generator’s output in the latent space to the actual next-step latent vector from the real data. This encourages the generator to learn the step-wise conditional distributions, enhancing the temporal coherence of the generated sequences.
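A minimal sketch of this supervised loss (the tensor names h_real and h_pred are illustrative, not taken from the implementation): the latent vector predicted at step t is compared against the real latent vector at step t + 1.

import tensorflow as tf

# h_real: latent vectors of the real sequence, shape (batch, timesteps, hidden_dim)
# h_pred: latent vectors predicted by the generator from the real embeddings, same shape
def supervised_loss(h_real, h_pred):
    # shift by one step so the prediction at step t is matched with the real latent at step t + 1
    return tf.reduce_mean(tf.square(h_real[:, 1:, :] - h_pred[:, :-1, :]))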
The Transformer Advantage: Addressing RNN Limitations
While RNNs are capable of modeling sequential data, their limitations can hinder the generation of high-quality, long-term coherent time series. Transformers, with their self-attention mechanism, offer a compelling solution:
- Capturing Long-Range Dependencies: Self-attention allows transformers to weigh the importance of any position in the input sequence when generating the output, enabling them to capture dependencies across long time horizons. This is crucial for modeling complex temporal patterns that may span multiple time steps.
- Parallel Processing: Transformers process all positions of a sequence in parallel during training, which typically makes training considerably faster than with RNNs, whose computations must proceed step by step.
- Handling Variable-Length Sequences: Transformers can naturally handle sequences of varying lengths without requiring architectural modifications, providing flexibility when working with time-series data of different durations.
Enhancing TimeGAN: Architectural Modifications
We propose two key enhancements to the TimeGAN architecture:
1. Autoregressive Transformer in the Generator
We replace the RNN-based generator with an autoregressive transformer. This modification leverages the transformer’s ability to capture long-range dependencies and parallelize computations, leading to improved temporal coherence and efficiency.
def transformer_block(hidden_dim, num_heads, ff_dim, dropout_rate):
    inputs = tf.keras.layers.Input(shape=(None, hidden_dim))
    attention_output = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=hidden_dim)(inputs, inputs)
    attention_output = tf.keras.layers.Dropout(dropout_rate)(attention_output)
    attention_output = tf.keras.layers.LayerNormalization(epsilon=1e-6)(inputs + attention_output)
    ff_output = tf.keras.layers.Dense(ff_dim, activation="relu")(attention_output)
    ff_output = tf.keras.layers.Dense(hidden_dim)(ff_output)
    ff_output = tf.keras.layers.Dropout(dropout_rate)(ff_output)
    outputs = tf.keras.layers.LayerNormalization(epsilon=1e-6)(attention_output + ff_output)
    return tf.keras.models.Model(inputs, outputs)
def generator(timesteps, hidden_dim, num_layers):
    '''
    Generator, takes as input the synthetic embeddings and returns the synthetic latent vector.
    '''
    e = tf.keras.layers.Input(shape=(timesteps, hidden_dim))
    h = e
    for _ in range(num_layers):
        h = transformer_block(hidden_dim=hidden_dim, num_heads=4, ff_dim=hidden_dim * 4, dropout_rate=0.1)(h)
    h = tf.keras.layers.Dense(units=hidden_dim)(h)
    return tf.keras.models.Model(e, h, name='generator')
In this code snippet, the generator function constructs the transformer-based generator. It takes synthetic embeddings as input (e), processes them through multiple transformer blocks (transformer_block), and finally produces the synthetic latent vector (h).
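One caveat: the transformer_block above uses full (unmasked) self-attention, so every position can attend to every other position, including future time steps. For a strictly autoregressive generator, a causal mask would be applied in the attention layer. A minimal sketch, assuming TensorFlow 2.10 or later where MultiHeadAttention accepts the use_causal_mask argument (this is a possible extension, not part of the code above):

import tensorflow as tf

# with use_causal_mask=True, position t can only attend to positions <= t
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32)
x = tf.random.normal((2, 20, 32))        # (batch, timesteps, hidden_dim)
y = mha(x, x, use_causal_mask=True)      # causally masked self-attention
print(y.shape)                           # (2, 20, 32)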
2. Positional Encoding in the Embedder
We incorporate positional encoding into the embedder to provide explicit temporal information to the transformer-based generator.
This is crucial because transformers are inherently permutation-invariant and do not consider the order of elements in the input sequence.
class PositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(PositionalEncoding, self).__init__(**kwargs)

    def get_angles(self, position, i, hidden_dim):
        angles = 1 / tf.pow(10000., (2 * (i // 2)) / tf.cast(hidden_dim, tf.float32))
        return position * angles

    def get_positional_encoding(self, timesteps, hidden_dim):
        angle_rads = self.get_angles(
            position=tf.range(timesteps, dtype=tf.float32)[:, tf.newaxis],
            i=tf.range(hidden_dim, dtype=tf.float32)[tf.newaxis, :],
            hidden_dim=hidden_dim
        )
        # apply sin to the even indices in the array
        sines = tf.sin(angle_rads[:, 0::2])
        # apply cos to the odd indices in the array
        cosines = tf.cos(angle_rads[:, 1::2])
        pos_encoding = tf.concat([sines, cosines], axis=-1)
        pos_encoding = pos_encoding[tf.newaxis, ...]
        return tf.cast(pos_encoding, tf.float32)

    def call(self, inputs):
        timesteps = tf.shape(inputs)[1]
        hidden_dim = tf.shape(inputs)[2]
        pos_encoding = self.get_positional_encoding(timesteps, hidden_dim)
        return inputs + pos_encoding

    def compute_output_shape(self, input_shape):
        return input_shape
def encoder_embedder(timesteps, features, hidden_dim, num_layers):
    '''
    Encoder embedder, takes as input the actual sequences and returns the actual embeddings.
    '''
    x = tf.keras.layers.Input(shape=(timesteps, features))
    e = tf.keras.layers.Dense(hidden_dim)(x)
    e = PositionalEncoding()(e)  # apply positional encoding
    for _ in range(num_layers):
        e = transformer_block(hidden_dim=hidden_dim, num_heads=4, ff_dim=hidden_dim * 4, dropout_rate=0.1)(e)
    return tf.keras.models.Model(x, e, name='encoder_embedder')
In this code, the encoder_embedder function applies positional encoding (PositionalEncoding()) to the input embeddings before feeding them into the transformer blocks. The PositionalEncoding class itself computes sinusoidal positional encodings from the position and dimension of each element in the sequence. Note that it concatenates the sine and cosine components along the feature axis rather than interleaving them, a common variant of the original Transformer formulation.
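A quick sanity check of the layer, applied to a zero tensor so that the output equals the positional encoding itself (the shapes below are illustrative):

import tensorflow as tf

# instantiate the layer defined above and apply it to a dummy batch
pe = PositionalEncoding()
encoding = pe(tf.zeros((1, 20, 32)))   # (batch, timesteps, hidden_dim)
print(encoding.shape)                  # (1, 20, 32)
print(encoding[0, 0].numpy())          # at position 0 the sine half is 0 and the cosine half is 1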
The Code: Bringing the Enhanced TimeGAN to Life
The provided code serves as the backbone for implementing and training the enhanced TimeGAN model. Let’s break down its key components and the training procedure.
def generator_embedder(timesteps, features, hidden_dim, num_layers):
    '''
    Generator embedder, takes as input the synthetic sequences and returns the synthetic embeddings.
    '''
    z = tf.keras.layers.Input(shape=(timesteps, features))
    x = PositionalEncoding()(z)  # add positional encoding before feeding into the transformer
    for _ in range(num_layers):
        e = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=hidden_dim)(x if _ == 0 else e, x if _ == 0 else e)
        e = tf.keras.layers.LayerNormalization()(e)
        e = tf.keras.layers.Dense(units=hidden_dim, activation='relu')(e)
    return tf.keras.models.Model(z, e, name='generator_embedder')
def encoder(timesteps, hidden_dim, num_layers):
    '''
    Encoder, takes as input the actual embeddings and returns the actual latent vector.
    '''
    e = tf.keras.layers.Input(shape=(timesteps, hidden_dim))
    h = e
    for _ in range(num_layers):
        h = transformer_block(hidden_dim=hidden_dim, num_heads=4, ff_dim=hidden_dim * 4, dropout_rate=0.1)(h)
    h = tf.keras.layers.Dense(units=hidden_dim)(h)
    return tf.keras.models.Model(e, h, name='encoder')

def generator(timesteps, hidden_dim, num_layers):
    '''
    Generator, takes as input the synthetic embeddings and returns the synthetic latent vector.
    '''
    e = tf.keras.layers.Input(shape=(timesteps, hidden_dim))
    h = e
    for _ in range(num_layers):
        h = transformer_block(hidden_dim=hidden_dim, num_heads=4, ff_dim=hidden_dim * 4, dropout_rate=0.1)(h)
    h = tf.keras.layers.Dense(units=hidden_dim)(h)
    return tf.keras.models.Model(e, h, name='generator')
def decoder(timesteps, features, hidden_dim, num_layers):
    '''
    Decoder, takes as input the actual or synthetic latent vector and returns the reconstructed or synthetic sequences.
    '''
    h = tf.keras.layers.Input(shape=(timesteps, hidden_dim))
    for _ in range(num_layers):
        y = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(units=hidden_dim, activation='relu'))(h if _ == 0 else y)
    y = tf.keras.layers.Dense(units=features)(y)
    return tf.keras.models.Model(h, y, name='decoder')

def discriminator(timesteps, hidden_dim, num_layers):
    '''
    Discriminator, takes as input the actual or synthetic embedding or latent vector and returns the log-odds.
    '''
    h = tf.keras.layers.Input(shape=(timesteps, hidden_dim))
    for _ in range(num_layers):
        p = transformer_block(hidden_dim=hidden_dim, num_heads=4, ff_dim=hidden_dim * 4, dropout_rate=0.1)(h if _ == 0 else p)
    p = tf.keras.layers.Dense(units=1)(p)
    return tf.keras.models.Model(h, p, name='discriminator')
def simulator(samples, timesteps, features):
    '''
    Simulator, generates synthetic sequences from a Wiener process.
    '''
    z = tf.random.normal(mean=0, stddev=1, shape=(samples * timesteps, features), dtype=tf.float32)
    z = tf.cumsum(z, axis=0) / tf.sqrt(tf.cast(samples * timesteps, dtype=tf.float32))
    z = (z - tf.reduce_mean(z, axis=0)) / tf.math.reduce_std(z, axis=0)
    z = tf.reshape(z, (samples, timesteps, features))
    return z
@tf.function
def mean_squared_error(y_true, y_pred):
    '''
    Mean squared error, used for calculating the supervised loss and the reconstruction loss.
    '''
    loss = tf.keras.losses.MSE(y_true=tf.expand_dims(y_true, axis=-1), y_pred=tf.expand_dims(y_pred, axis=-1))
    return tf.reduce_mean(tf.reduce_sum(loss, axis=-1))

@tf.function
def binary_crossentropy(y_true, y_pred):
    '''
    Binary cross-entropy, used for calculating the unsupervised loss.
    '''
    loss = tf.keras.losses.binary_crossentropy(y_true=y_true, y_pred=y_pred, from_logits=True)
    return tf.reduce_mean(loss)
def time_series_to_sequences(time_series, timesteps):
    '''
    Reshape the time series as sequences.
    '''
    sequences = np.array([time_series[t - timesteps: t] for t in range(timesteps, len(time_series) + timesteps, timesteps)])
    return sequences

def sequences_to_time_series(sequences):
    '''
    Reshape the sequences as time series.
    '''
    time_series = np.concatenate([sequence for sequence in sequences], axis=0)
    return time_series
Core Functions and Classes
- PositionalEncoding: This class implements the positional encoding mechanism, crucial for providing temporal information to the transformer. It calculates sinusoidal positional encodings based on the position and dimension of each element in the sequence.
- transformer_block: This function defines the fundamental building block of the transformer architecture. It consists of a multi-head self-attention layer, a feedforward network, layer normalization, and dropout for regularization.
- encoder_embedder and generator_embedder: These functions create the embedders for the encoder and generator, respectively. They take the input sequences and apply dense layers and positional encoding to generate the embeddings.
- encoder and generator: These functions define the encoder and generator networks, respectively. They utilize transformer blocks to process the embeddings and produce latent representations.
- decoder: This function constructs the decoder network, which takes latent representations as input and generates the output sequences (either reconstructed or synthetic).
- discriminator: This function builds the discriminator network, which evaluates the authenticity of sequences in both the embedding and latent spaces.
- simulator: This function generates synthetic sequences from a Wiener process, serving as input to the generator during training.
- Loss Functions: The code defines mean squared error (mean_squared_error) for the supervised and reconstruction losses, and binary cross-entropy (binary_crossentropy) for the unsupervised loss.
- time_series_to_sequences and sequences_to_time_series: These helper functions reshape time series data into sequences and vice versa, facilitating batch processing during training (see the short example after this list).
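As a small illustration of the reshaping helpers, consider a toy series of 6 time steps and 2 features split into non-overlapping windows of 3 steps (the array below is made up for illustration):

import numpy as np

# toy series: 6 time steps, 2 features
series = np.arange(12).reshape(6, 2)
sequences = time_series_to_sequences(series, timesteps=3)
print(sequences.shape)                        # (2, 3, 2): two sequences of three steps each
print(sequences_to_time_series(sequences))    # recovers the original 6 x 2 series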
The TimeGAN Class
import numpy as np
import tensorflow as tf

class TimeGAN():
    def __init__(self,
                 x,
                 timesteps,
                 hidden_dim,
                 num_layers,
                 lambda_param,
                 eta_param,
                 learning_rate,
                 batch_size):
        '''
        Implementation of the synthetic time series generation model introduced in Yoon, J., Jarrett, D. and van der Schaar, M., 2019.
        Time-series generative adversarial networks. Advances in Neural Information Processing Systems, 32.
        '''
        # the input x is expected to be already shaped as (samples, timesteps, features)
        # extract the number of sequences
        samples = x.shape[0]
        # extract the number of features
        features = x.shape[2]
        # scale the time series
        mu = np.mean(x, axis=0)
        sigma = np.std(x, axis=0)
        x = (x - mu) / sigma
        # create the dataset
        dataset = tf.data.Dataset.from_tensor_slices(x)
        dataset = dataset.cache().shuffle(samples).batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
        # build the models
        autoencoder_model = tf.keras.models.Sequential([
            encoder_embedder(timesteps=timesteps, features=features, hidden_dim=hidden_dim, num_layers=1),
            encoder(timesteps=timesteps, hidden_dim=hidden_dim, num_layers=num_layers - 1),
            decoder(timesteps=timesteps, features=features, hidden_dim=hidden_dim, num_layers=num_layers)
        ])
        generator_model = tf.keras.models.Sequential([
            generator_embedder(timesteps=timesteps, features=features, hidden_dim=hidden_dim, num_layers=1),
            generator(timesteps=timesteps, hidden_dim=hidden_dim, num_layers=num_layers - 1),
        ])
        discriminator_model = discriminator(timesteps=timesteps, hidden_dim=hidden_dim, num_layers=num_layers)
        # instantiate the optimizers
        autoencoder_optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
        generator_optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
        discriminator_optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
        # save the objects
        self.mu = mu
        self.sigma = sigma
        self.samples = samples
        self.timesteps = timesteps
        self.features = features
        self.lambda_param = lambda_param
        self.eta_param = eta_param
        self.dataset = dataset
        self.autoencoder_model = autoencoder_model
        self.generator_model = generator_model
        self.discriminator_model = discriminator_model
        self.autoencoder_optimizer = autoencoder_optimizer
        self.generator_optimizer = generator_optimizer
        self.discriminator_optimizer = discriminator_optimizer
    def fit(self, epochs, verbose=True):
        '''
        Train the model.
        '''
        # define the training loop
        @tf.function
        def train_step(data):
            with tf.GradientTape() as autoencoder_tape, tf.GradientTape() as generator_tape, tf.GradientTape() as discriminator_tape:
                # get the actual sequences
                x = tf.cast(data, dtype=tf.float32)
                # generate the synthetic sequences
                z = simulator(samples=x.shape[0], timesteps=self.timesteps, features=self.features)
                # get the encoder outputs
                ex = self.autoencoder_model.get_layer('encoder_embedder')(x)  # actual embedding
                hx = self.autoencoder_model.get_layer('encoder')(ex)          # actual latent vector
                # get the generator outputs
                ez = self.generator_model.get_layer('generator_embedder')(z)  # synthetic embedding
                hz = self.generator_model.get_layer('generator')(ez)          # synthetic latent vector
                hx_hat = self.generator_model.get_layer('generator')(ex)      # conditional synthetic latent vector (i.e. given the actual embedding)
                # get the decoder outputs
                x_hat = self.autoencoder_model.get_layer('decoder')(hx)       # reconstructed sequences
                # get the discriminator outputs
                p_ex = self.discriminator_model(ex)  # log-odds of actual embedding
                p_ez = self.discriminator_model(ez)  # log-odds of synthetic embedding
                p_hx = self.discriminator_model(hx)  # log-odds of actual latent vector
                p_hz = self.discriminator_model(hz)  # log-odds of synthetic latent vector
                # calculate the supervised loss
                supervised_loss = mean_squared_error(hx[:, 1:, :], hx_hat[:, :-1, :])
                # calculate the autoencoder loss
                autoencoder_loss = mean_squared_error(x, x_hat) + \
                                   self.lambda_param * supervised_loss
                # calculate the generator loss
                generator_loss = binary_crossentropy(tf.ones_like(p_hz), p_hz) + \
                                 binary_crossentropy(tf.ones_like(p_ez), p_ez) + \
                                 self.eta_param * supervised_loss
                # calculate the discriminator loss
                discriminator_loss = binary_crossentropy(tf.zeros_like(p_hz), p_hz) + \
                                     binary_crossentropy(tf.zeros_like(p_ez), p_ez) + \
                                     binary_crossentropy(tf.ones_like(p_hx), p_hx) + \
                                     binary_crossentropy(tf.ones_like(p_ex), p_ex)
            # calculate the gradients
            autoencoder_gradient = autoencoder_tape.gradient(autoencoder_loss, self.autoencoder_model.trainable_variables)
            generator_gradient = generator_tape.gradient(generator_loss, self.generator_model.trainable_variables)
            discriminator_gradient = discriminator_tape.gradient(discriminator_loss, self.discriminator_model.trainable_variables)
            # update the weights
            self.autoencoder_optimizer.apply_gradients(zip(autoencoder_gradient, self.autoencoder_model.trainable_variables))
            self.generator_optimizer.apply_gradients(zip(generator_gradient, self.generator_model.trainable_variables))
            self.discriminator_optimizer.apply_gradients(zip(discriminator_gradient, self.discriminator_model.trainable_variables))
            return autoencoder_loss, generator_loss, discriminator_loss

        # train the model
        for epoch in range(epochs):
            for data in self.dataset:
                autoencoder_loss, generator_loss, discriminator_loss = train_step(data)
            if verbose:
                print(
                    f'epoch: {1 + epoch} '
                    f'autoencoder_loss: {format(autoencoder_loss.numpy(), ".6f")} '
                    f'generator_loss: {format(generator_loss.numpy(), ".6f")} '
                    f'discriminator_loss: {format(discriminator_loss.numpy(), ".6f")}'
                )
    def reconstruct(self, x):
        '''
        Reconstruct the time series.
        '''
        # scale the sequences, which are expected to be shaped as (samples, timesteps, features)
        x = (x - self.mu) / self.sigma
        # get the reconstructed sequences
        x_hat = self.autoencoder_model(x)
        # transform the reconstructed sequences back to the original scale
        x_hat = self.mu + self.sigma * x_hat
        return x_hat

    def simulate(self, samples):
        '''
        Simulate the time series.
        '''
        # generate the synthetic sequences
        z = simulator(samples=samples // self.timesteps, timesteps=self.timesteps, features=self.features)
        # get the simulated sequences
        x_sim = self.autoencoder_model.get_layer('decoder')(self.generator_model(z))
        # transform the simulated sequences back to the original scale
        x_sim = self.mu + self.sigma * x_sim
        return x_sim
The heart of the implementation lies in the TimeGAN class. Let's dissect its key methods:
- __init__: The constructor initializes the model's components, including the autoencoder, generator, and discriminator. It also sets up the optimizers and hyperparameters.
- fit: This method orchestrates the training process. It iterates over the dataset, performing the following steps in each training step:
- Forward Pass: The autoencoder processes real data, and the generator processes random noise to produce latent representations. The discriminator evaluates both real and synthetic sequences.
- Loss Calculation: The supervised loss, autoencoder loss, generator loss, and discriminator loss are computed based on the model outputs and the ground truth.
- Backpropagation and Optimization: Gradients are calculated for each loss, and the optimizers update the model parameters accordingly.
- reconstruct: This method takes real time-series data as input and uses the trained autoencoder to reconstruct it.
- simulate: This method generates synthetic time-series data using the trained generator.
As an example, we train the model for a few epochs on a synthetic dataset, generated and reshaped as follows:
# Generate synthetic time series data
N = 20 # number of features
L = 1000 # length of each time series
t = np.linspace(0, 1, L).reshape(-1, 1)
c = np.cos(2 * np.pi * (50 * t - 0.5))
s = np.sin(2 * np.pi * (100 * t - 0.5))
x = 5 + 10 * c + 10 * s + 5 * np.random.normal(size=(L, N))
# Split the data into training and testing sets
train_ratio = 0.8
L_train = int(L * train_ratio)
L_test = L - L_train
x_train, x_test = x[:L_train], x[L_train:]
# Define the number of timesteps per sequence
timesteps = 20
# Reshape the data into (samples, timesteps, features)
x_train = x_train[:L_train - L_train % timesteps].reshape(-1, timesteps, N)
x_test = x_test[:L_test - L_test % timesteps].reshape(-1, timesteps, N)
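With the data reshaped into sequences of shape (samples, timesteps, features), the model can be instantiated and trained. The hyperparameter values below are illustrative assumptions, not tuned settings from the original article:

# instantiate and train the enhanced TimeGAN (hyperparameters are illustrative)
model = TimeGAN(
    x=x_train,            # (samples, timesteps, features)
    timesteps=timesteps,
    hidden_dim=32,
    num_layers=3,
    lambda_param=10,
    eta_param=10,
    learning_rate=1e-3,
    batch_size=16
)
model.fit(epochs=50, verbose=True)

# reconstruct the held-out sequences and simulate new ones
x_rec = model.reconstruct(x_test)        # same shape as x_test
x_sim = model.simulate(samples=L_test)   # L_test // timesteps synthetic sequences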
Final thoughts
The integration of transformer architectures and positional encoding into the TimeGAN framework presents a promising avenue for enhancing the generation of synthetic time-series data. By addressing the limitations of RNNs in capturing long-range dependencies and enabling parallel processing, this enhanced architecture has the potential to significantly improve the quality, coherence, and efficiency of time-series generation. The ability of transformers to model complex temporal patterns and handle variable-length sequences makes them a natural fit for the dynamic nature of time-series data.
The next logical step would be to rigorously evaluate the performance of this enhanced TimeGAN model against various benchmarks and alternative architectures. This would involve:
- Quantitative Evaluation: Conducting experiments on diverse time-series datasets and comparing the performance of the enhanced TimeGAN against the original TimeGAN, as well as other state-of-the-art time-series generation models. The evaluation metrics could include measures of similarity between real and synthetic data, predictive accuracy on downstream tasks, and computational efficiency.
- Qualitative Assessment: Visualizing and analyzing the generated time series to assess their realism, diversity, and ability to capture complex temporal patterns. This could involve techniques like t-SNE plots (a minimal sketch follows after this list), anomaly detection, and expert evaluation.
- Ablation Studies: Conducting ablation studies to understand the individual contributions of the transformer-based generator and positional encoding to the overall performance improvement. This would help identify the key factors driving the enhanced capabilities of the model.
- Exploration of Hyperparameters and Architectures: Investigating the impact of different hyperparameters (e.g., number of transformer layers, attention heads, feedforward network dimensions) and architectural choices (e.g., different positional encoding schemes, alternative transformer variants) on the model’s performance.
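As a starting point for the qualitative assessment, here is a minimal sketch of a t-SNE comparison between real and synthetic sequences. It assumes scikit-learn and matplotlib are available and reuses the x_train, model, and timesteps objects from the example above; it is illustrative rather than a full evaluation protocol.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# flatten each sequence into a single vector and embed real and synthetic data together
x_real = np.asarray(x_train).reshape(len(x_train), -1)
x_fake = np.asarray(model.simulate(samples=len(x_train) * timesteps)).reshape(-1, x_real.shape[1])
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(
    np.concatenate([x_real, x_fake], axis=0)
)
n = len(x_real)
plt.scatter(embedding[:n, 0], embedding[:n, 1], label='real', alpha=0.6)
plt.scatter(embedding[n:, 0], embedding[n:, 1], label='synthetic', alpha=0.6)
plt.legend()
plt.title('t-SNE of real vs. synthetic sequences')
plt.show()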
By systematically evaluating and refining this enhanced TimeGAN architecture, we can gain deeper insights into its capabilities and limitations, paving the way for its broader adoption and application in various time-series generation tasks. The fusion of transformer architectures with the TimeGAN framework represents a significant step towards more powerful and versatile time-series generation models, opening doors to new possibilities in data augmentation, anomaly detection, and privacy-preserving data analysis.