EventStream.transformer.transformer module

The internal transformer module code.

TODO(mmcdermott): Can use transformers.apply_chunking_to_forward to save memory.

Based on https://raw.githubusercontent.com/huggingface/transformers/e3cc4487fe66e03ec85970ea2db8e5fb34c455f4/src/transformers/models/gpt_neo/modeling_gpt_neo.py

class EventStream.transformer.transformer.ConditionallyIndependentPointProcessInputLayer(config: StructuredTransformerConfig)

Bases: Module

Processes input batch and produces event embeddings.

This layer accepts a batch from an event-stream PyTorch dataset and returns input embeddings from it. This is designed for conditionally independent models, as it does not split the input embeddings into different components corresponding to different dependency graph positions. Combines time and data embeddings.

Parameters:
config: StructuredTransformerConfig

Configuration parameters for the structured transformer.

forward(batch: PytorchBatch) → Tensor

Returns input event embeddings for the provided batch.

Parameters:
batch: PytorchBatch

A PytorchBatch instance containing input data.

class EventStream.transformer.transformer.ConditionallyIndependentPointProcessTransformer(config: StructuredTransformerConfig)

Bases: StructuredTransformerPreTrainedModel

A transformer model specifically for conditionally independent point processes.

This model uses an input layer to generate embeddings from an event-stream PyTorch dataset, and an InnerBlock layer for non-structured processing. As a conditionally independent model, all event covariates are predicted simultaneously from the history embedding.

Parameters:
config: StructuredTransformerConfig

Configuration parameters for the structured transformer.

Raises:

ValueError – If the provided configuration indicates a nested attention model.

forward(batch: PytorchBatch | None = None, input_embeds: Tensor | None = None, past: tuple[FloatTensor] | None = None, seq_attention_mask: Tensor | None = None, head_mask: Tensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) → tuple[Tensor] | TransformerOutputWithPast

Performs a forward pass on the transformer model.

Parameters:
batch: PytorchBatch | None = None

A PytorchBatch instance containing input data.

input_embeds: Tensor | None = None

Precomputed embeddings for the input data. Currently unused.

past: tuple[FloatTensor] | None = None

Past hidden states for more efficient decoding.

seq_attention_mask: Tensor | None = None

Mask for the sequential attention mechanism.

head_mask: Tensor | None = None

Mask to nullify selected heads of the self-attention module.

use_cache: bool | None = None

Specifies whether caching should be used.

output_attentions: bool | None = None

Specifies whether attention probabilities should be returned in the output.

output_hidden_states: bool | None = None

Specifies whether hidden states should be returned in the output.

return_dict: bool | None = None

Specifies whether the output should be an object with key names (True) or a tuple.

Returns:

A tuple containing hidden states, or a TransformerOutputWithPast object if return_dict is True.
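
The two return forms selected by return_dict can be pictured with a minimal, framework-free sketch. The `OutputSketch` class, its field names, and the `finish_forward` helper are hypothetical stand-ins, not the fields or code of TransformerOutputWithPast; dropping unset fields from the tuple form follows a common Hugging Face convention and is an assumption here.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class OutputSketch:
    """Hypothetical stand-in for a keyed model output."""

    last_hidden_state: Any
    attentions: Optional[Any] = None

    def to_tuple(self) -> tuple:
        # Keep only the fields that were actually populated.
        values = (getattr(self, f.name) for f in fields(self))
        return tuple(v for v in values if v is not None)


def finish_forward(hidden, attentions=None, return_dict=True):
    """Return a keyed object when return_dict is True, else a plain tuple."""
    out = OutputSketch(last_hidden_state=hidden, attentions=attentions)
    return out if return_dict else out.to_tuple()


keyed = finish_forward([0.1, 0.2])
plain = finish_forward([0.1, 0.2], return_dict=False)
```

With the keyed form, callers read `keyed.last_hidden_state`; with the tuple form, the same value is found at `plain[0]`.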

class EventStream.transformer.transformer.InnerAttention(config: StructuredTransformerConfig, layer_id: int = 0, is_seq: bool = True)

Bases: Module

The inner attention module used by the GPTs in this codebase.

This module selects which kind of attention computation is used in this layer and delegates the computation to it.

Parameters:
config: StructuredTransformerConfig

The model configuration object.

layer_id: int = 0

The integer index of the layer that this attention computation belongs to.

is_seq: bool = True

Whether this is a sequential (True) or dependency-graph (False) attention layer.

Raises:

ValueError – If an invalid attention type is provided.

forward(hidden_states, attention_mask=None, layer_past=None, head_mask=None, use_cache=False, output_attentions=False, static_kv_first: bool = False)

Forward pass.

This returns the pre-selected attention calculation over the inputs (run through a layer norm).

Parameters:
hidden_states

The input hidden states.

attention_mask=None

A mask to be applied on the attention weights.

layer_past=None

The past layer states.

head_mask=None

A mask to be applied on the attention heads.

use_cache=False

A flag indicating whether to cache the layer’s past states.

output_attentions=False

A flag indicating whether to output the attention weights.

static_kv_first: bool = False

In the case of attention over the dependency graph, the history embedding is dropped after processing, so we want to only use it as a KV, not as a query.

class EventStream.transformer.transformer.InnerBlock(config: StructuredTransformerConfig, layer_id: int, is_seq: bool)

Bases: Module

An inner block in a transformer architecture that consists of attention and MLP layers.

Parameters:
config: StructuredTransformerConfig

Configuration parameters for the structured transformer.

layer_id: int

Unique identifier for the layer.

is_seq: bool

Flag indicating whether the block is sequential.

forward(hidden_states, attention_mask=None, layer_past=None, head_mask=None, use_cache=False, output_attentions=False, static_kv_first: bool = False) → tuple[Tensor, dict[str, Tensor]]

Conducts the forward pass for the inner block.

Parameters:
hidden_states

Input tensor.

attention_mask=None

Mask to avoid attending to padded token positions.

layer_past=None

Cache of past hidden states for more efficient decoding.

head_mask=None

Mask to nullify selected heads of the self-attention module.

use_cache=False

Whether to use caching.

output_attentions=False

Whether to return attention probabilities in the output.

static_kv_first: bool = False

Whether the first element of the input is a static history embedding that should be used only as a key and value, not as a query.

Returns:

Modified hidden states and a dictionary containing present key-value pair and attention weights (if output_attentions=True).

Return type:

tuple

class EventStream.transformer.transformer.InnerMLP(config: StructuredTransformerConfig)

Bases: Module

Applies a multilayer perceptron (MLP) to the hidden_states.

Parameters:
config: StructuredTransformerConfig

Configuration parameters for the structured transformer.

forward(hidden_states)

Conducts forward pass for the MLP.

Parameters:
hidden_states

Input tensor.

Returns:

Modified hidden states after applying MLP.

class EventStream.transformer.transformer.InnerSelfAttention(config: StructuredTransformerConfig, attention_type: str, window_size: int)

Bases: Module

This class implements the inner self-attention mechanism.

This involves performing the self-attention operation and returning the result along with some optional additional outputs. The constructor of this class accepts three arguments, which determine the configuration of the self-attention mechanism.

Parameters:
config: StructuredTransformerConfig

An instance of StructuredTransformerConfig which contains various configuration parameters.

attention_type: str

A string indicating the type of attention to be applied. Currently, only “local” is implemented.

window_size: int

An integer specifying the size of the attention window.
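
To make the role of window_size concrete, here is a small framework-free sketch of a local causal attention mask. The helper name `local_causal_mask` is hypothetical, and the exact window convention (each query sees itself plus the preceding window_size - 1 positions, as in the GPT-Neo code this module is based on) is an assumption rather than something this page confirms.

```python
def local_causal_mask(seq_len: int, window_size: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        # Causal: j <= i. Local: j lies within the last `window_size` positions.
        [i - window_size < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = local_causal_mask(seq_len=5, window_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row marks with `x` the key positions visible to one query position; no row has more than window_size marks.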

Raises:

ValueError – If the product of num_heads and head_dim from the config does not match embed_dim.
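
The condition behind this error can be sketched as follows. The function name `split_embed_dim` is hypothetical, and deriving head_dim by integer division is an assumption borrowed from the GPT-Neo reference implementation; the actual constructor may obtain head_dim differently.

```python
def split_embed_dim(embed_dim: int, num_heads: int) -> int:
    """Return the per-head size, or raise if the heads do not tile embed_dim."""
    head_dim = embed_dim // num_heads
    if head_dim * num_heads != embed_dim:
        raise ValueError(
            "embed_dim must be divisible by num_heads "
            f"(got embed_dim={embed_dim}, num_heads={num_heads})."
        )
    return head_dim


head_dim = split_embed_dim(embed_dim=64, num_heads=8)
```

A configuration such as embed_dim=10 with num_heads=3 would fail this check, because three heads of size 3 cover only 9 of the 10 dimensions.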

forward(hidden_states, attention_mask=None, layer_past=None, head_mask=None, use_cache=False, output_attentions=False, static_kv_first: bool = False)

Applies the attention mechanism to the input hidden states.

Parameters:
hidden_states

The input hidden states.

attention_mask=None

A mask to be applied on the attention weights.

layer_past=None

The past layer states.

head_mask=None

A mask to be applied on the attention heads.

use_cache=False

A flag indicating whether to cache the layer’s past states.

output_attentions=False

A flag indicating whether to output the attention weights.

static_kv_first: bool = False

In the case of attention over the dependency graph, the history embedding is dropped after processing, so we want to only use it as a KV, not as a query.

Returns:

A tuple containing the output of the attention mechanism and a dictionary of optional outputs.
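
The effect of static_kv_first can be illustrated with a deliberately simplified, framework-free sketch: the first position supplies a key and value but issues no query, so the output is one element shorter than the input. The helper names are hypothetical, and the sketch omits the learned projections, multiple heads, and causal masking of the real module; it is an assumption-laden illustration, not the module's code.

```python
import math


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of float vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return outputs


def attend_static_kv_first(hidden_states):
    """Use position 0 as key/value only; emit outputs for positions 1..n-1."""
    queries = hidden_states[1:]
    return attend(queries, keys=hidden_states, values=hidden_states)


# Position 0 plays the role of the history embedding.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attend_static_kv_first(hidden)
```

Here `hidden` has three positions and `out` has two: the history embedding informs every output but produces none of its own.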

class EventStream.transformer.transformer.LearnableFrequencySinusoidalTemporalPositionEncoding(embedding_dim: int, max_timepoint: float = 10000.0)

Bases: Module

A module for applying time-based position encodings to a PytorchBatch.

Adapted from (link).

Parameters:
embedding_dim: int

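The desired size of the output embedding. Unlike many position embedding implementations, this does not need to be even.

max_timepoint: float = 10000.0

The maximum observed timepoint, used to initialize the frequency space.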

forward(batch: PytorchBatch) → Tensor

Forward pass.

Parameters:
batch: PytorchBatch

The input batch to process.

Returns:

The temporal position embeddings tensor of shape (bsz, seq_len, embedding_dim).

class EventStream.transformer.transformer.NestedAttentionPointProcessInputLayer(config: StructuredTransformerConfig)

Bases: Module

Processes input batch and produces input dependency graph element embeddings.

This layer accepts a batch from an event-stream PyTorch dataset and returns input embeddings from it. This is designed for nested attention models, as it splits the input embeddings into different components corresponding to different dependency graph positions. Combines time and data embeddings.

Parameters:
config: StructuredTransformerConfig

Configuration parameters for the structured transformer.

forward(batch: PytorchBatch, dep_graph_el_generation_target: int | None = None) → Tensor

Returns input dependency graph element embeddings for the provided batch.

Parameters:
batch: PytorchBatch

A PytorchBatch instance containing input data.

class EventStream.transformer.transformer.NestedAttentionPointProcessTransformer(config: StructuredTransformerConfig)

Bases: StructuredTransformerPreTrainedModel

A transformer model specifically for nested attention point processes.

This model uses an input layer to generate embeddings from an event-stream PyTorch dataset, and an InnerBlock layer for non-structured processing. As a nested attention model, event covariates are predicted in the sequence of the dependency graph elements, specified in the config’s measurements_per_dep_graph_level parameter, depending on both the historical event embeddings and the prior dependency graph elements.

Parameters:
config: StructuredTransformerConfig

Configuration parameters for the structured transformer.

Raises:

ValueError – If the provided configuration indicates a conditionally independent model.

forward(batch: PytorchBatch | None = None, input_embeds: Tensor | None = None, past: tuple[FloatTensor] | None = None, seq_attention_mask: Tensor | None = None, head_mask: Tensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, dep_graph_past: tuple[FloatTensor] | None = None, dep_graph_el_generation_target: int | None = None) → tuple[Tensor] | TransformerOutputWithPast

Performs a forward pass on the transformer model.

Parameters:
batch: PytorchBatch | None = None

A PytorchBatch instance containing input data.

input_embeds: Tensor | None = None

Precomputed embeddings for the input data. Currently unused.

past: tuple[FloatTensor] | None = None

Past hidden states for more efficient decoding.

seq_attention_mask: Tensor | None = None

Mask for the sequential attention mechanism.

head_mask: Tensor | None = None

Mask to nullify selected heads of the self-attention module.

use_cache: bool | None = None

Specifies whether caching should be used.

output_attentions: bool | None = None

Specifies whether attention probabilities should be returned in the output.

output_hidden_states: bool | None = None

Specifies whether hidden states should be returned in the output.

return_dict: bool | None = None

Specifies whether the output should be an object with key names (True) or a tuple.

Returns:

A tuple containing hidden states, or a TransformerOutputWithPast object if return_dict is True.

class EventStream.transformer.transformer.StructuredTransformerBlock(config: StructuredTransformerConfig, layer_id: int)

Bases: Module

A block for structured attention with both sequential and dependency graph modules.

Parameters:
config: StructuredTransformerConfig

Configuration parameters for the structured transformer.

layer_id: int

Unique identifier (depth index) for the layer.

forward(*args, **kwargs) → tuple[Tensor, dict[str, dict[str, Tensor | None] | None]]

Conducts the forward pass for the structured transformer block.

Parameters:
*args

Variable length argument list.

**kwargs

Arbitrary keyword arguments.

Returns:

Modified input tensor and a dictionary containing present key-value pair and attention weights.

Return type:

tuple

class EventStream.transformer.transformer.StructuredTransformerPreTrainedModel(*inputs, **kwargs)

Bases: PreTrainedModel

The base pre-trained model class for Transformer models.

base_model_prefix = 'transformer'
config_class

alias of StructuredTransformerConfig

supports_gradient_checkpointing = True

class EventStream.transformer.transformer.TemporalPositionEncoding(embedding_dim: int, max_timepoint: float = 10000.0)

Bases: Module

A module for applying time-based position encodings to a PytorchBatch.

Adapted from https://pytorch.org/tutorials/beginner/transformer_tutorial.html

Parameters:
embedding_dim: int

The desired size of the output embedding. Unlike many position embedding implementations, this does not need to be even.

max_timepoint: float = 10000.0

The maximum observed timepoint, used to initialize the frequency space.
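
As a rough illustration of a sinusoidal encoding over continuous timepoints that also accepts an odd embedding size, the sketch below encodes a single timepoint in plain Python, in the style of the PyTorch tutorial cited above. The function name `temporal_encoding_sketch` is hypothetical, and the way odd sizes are handled (the last sine slot simply has no cosine partner) is an assumption about one reasonable approach, not a statement of what this module does.

```python
import math


def temporal_encoding_sketch(
    t: float, embedding_dim: int, max_timepoint: float = 10000.0
) -> list[float]:
    """Sinusoidal embedding of one continuous timepoint `t`."""
    # One frequency per sine slot (even indices 0, 2, 4, ...), decaying
    # geometrically from 1 toward 1 / max_timepoint.
    freqs = [
        math.exp(-math.log(max_timepoint) * i / embedding_dim)
        for i in range(0, embedding_dim, 2)
    ]
    out = [0.0] * embedding_dim
    for k, freq in enumerate(freqs):
        out[2 * k] = math.sin(t * freq)
        # For odd embedding_dim the last sine slot has no cosine partner.
        if 2 * k + 1 < embedding_dim:
            out[2 * k + 1] = math.cos(t * freq)
    return out


even = temporal_encoding_sketch(4.2, embedding_dim=4)
odd = temporal_encoding_sketch(4.2, embedding_dim=5)
```

Applying this to every event time in a batch yields one vector of length embedding_dim per event.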

forward(batch: PytorchBatch) → Tensor

Forward pass.

Parameters:
batch: PytorchBatch

The input batch to process.

Returns:

The temporal position embeddings tensor of shape (bsz, seq_len, embedding_dim).

EventStream.transformer.transformer.expand_mask(mask: BoolTensor, dtype: dtype) → Tensor

Expands attention_mask from [bsz, seq_len] to [bsz, 1, 1, seq_len] and converts to float.

This enables broadcasting to [bsz, num_heads, from_seq_len, to_seq_len]. The mask is reshaped from [bsz, seq_len] to [bsz, 1, 1, seq_len] and converted from boolean form to an additive attention-masking form, which has 0 where the original mask was True and the most negative value representable in dtype where it was False.

Parameters:
mask: BoolTensor

The event presence/absence mask of shape [bsz, seq_len].

dtype: dtype

The target dtype of the attention mask.

Returns:

The passed event indicator mask reshaped and type converted, unless mask is None in which case returns None.
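
To show why the masking form uses 0 and the most negative representable value, the following framework-free sketch adds such a mask to one row of attention scores before a softmax. The helper names `additive_mask` and `softmax`, and the use of plain Python lists and double-precision floats in place of tensors, are assumptions made for illustration; they are not part of this module.

```python
import math
import sys


def additive_mask(mask: list[bool]) -> list[float]:
    """Map True -> 0.0 and False -> the most negative finite float."""
    lowest = -sys.float_info.max
    return [0.0 if keep else lowest for keep in mask]


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a single row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.5, 1.5, 2.0, 3.0]
mask = [True, True, False, False]
masked_scores = [s + m for s, m in zip(scores, additive_mask(mask))]
weights = softmax(masked_scores)
```

Positions where the mask is False receive zero attention weight, while the remaining weights still sum to one.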

Examples

>>> import torch
>>> assert expand_mask(None, None) is None
>>> mask = torch.BoolTensor([
...     [True, True, False, False],
...     [True, True, True, False],
... ])
>>> dtype = torch.float16
>>> print(expand_mask(mask, dtype))
tensor([[[[    -0.,     -0., -65504., -65504.]]],


        [[[    -0.,     -0.,     -0., -65504.]]]], dtype=torch.float16)

EventStream.transformer.transformer.time_from_deltas(batch: PytorchBatch) → Tensor

Given a batch of time deltas, compute the relative time-since-start for each event.

Parameters:
batch: PytorchBatch

The input batch

Examples

>>> batch = PytorchBatch(
...     event_mask=torch.BoolTensor([
...         [True, True, True], [True, True, False]
...     ]),
...     time_delta=torch.Tensor([[1.0, 3.2, 0.0], [1.4, 0.0, 1.0]])
... )
>>> print(time_from_deltas(batch))
tensor([[0.0000, 1.0000, 4.2000],
        [0.0000, 1.4000, 1.4000]])
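
The arithmetic behind this output can be sketched without tensors as an exclusive cumulative sum over each row of deltas. The function name `time_from_deltas_sketch` and the list-based inputs are hypothetical; zeroing the deltas of masked-out events is an assumption that is consistent with the example above but not confirmed by it.

```python
from itertools import accumulate


def time_from_deltas_sketch(
    time_delta: list[list[float]], event_mask: list[list[bool]]
) -> list[list[float]]:
    """Exclusive cumulative sum of masked deltas, one row per sequence."""
    times = []
    for deltas, mask in zip(time_delta, event_mask):
        # Deltas at padded (masked-out) events contribute no elapsed time.
        kept = [d if m else 0.0 for d, m in zip(deltas, mask)]
        # Shift right by one: the first event is at time 0, and each later
        # event sits at the sum of all deltas before it.
        times.append([0.0, *accumulate(kept[:-1])])
    return times


result = time_from_deltas_sketch(
    time_delta=[[1.0, 3.2, 0.0], [1.4, 0.0, 1.0]],
    event_mask=[[True, True, True], [True, True, False]],
)
```

Each delta is read as the gap between an event and the one that follows it, which is why the final delta in a row never affects the result.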