EventStream.transformer.transformer module¶

The internal transformer module code.

TODO(mmcdermott): Can use transformers.apply_chunking_to_forward to save memory.

Based on https://raw.githubusercontent.com/huggingface/transformers/e3cc4487fe66e03ec85970ea2db8e5fb34c455f4/src/transformers/models/gpt_neo/modeling_gpt_neo.py

class EventStream.transformer.transformer.ConditionallyIndependentPointProcessInputLayer(config: StructuredTransformerConfig)[source]¶

Bases: Module

Processes input batch and produces event embeddings.

This layer accepts a batch from an event-stream PyTorch dataset and returns input embeddings from it. This is designed for conditionally independent models, as it does not split the input embeddings into different components corresponding to different dependency graph positions. Combines time and data embeddings.

Parameters:¶

config: StructuredTransformerConfig¶: Configuration parameters for the structured transformer.

forward(batch: PytorchBatch) → Tensor[source]¶

Returns input event embeddings for the provided batch.

Parameters:¶

batch: PytorchBatch¶: A PytorchBatch instance containing input data.

class EventStream.transformer.transformer.ConditionallyIndependentPointProcessTransformer(config: StructuredTransformerConfig)[source]¶

Bases: StructuredTransformerPreTrainedModel

A transformer model specifically for conditionally independent point processes.

This model uses an input layer to generate embeddings from an event-stream PyTorch dataset, and an InnerBlock layer for non-structured processing. As a conditionally independent model, all event covariates are predicted simultaneously from the history embedding.

Parameters:¶

config: StructuredTransformerConfig¶: Configuration parameters for the structured transformer.

Raises:¶

ValueError – If the provided configuration indicates a nested attention model.

The forward method performs a forward pass on the transformer model.

Parameters:¶

batch: PytorchBatch | None = None¶: A PytorchBatch instance containing input data.
input_embeds: Tensor | None = None¶: Precomputed embeddings for the input data. Currently unused.
past: tuple[FloatTensor] | None = None¶: Past hidden states for more efficient decoding.
seq_attention_mask: Tensor | None = None¶: Mask for the sequential attention mechanism.
head_mask: Tensor | None = None¶: Mask to nullify selected heads of the self-attention module.
use_cache: bool | None = None¶: Specifies whether caching should be used.
output_attentions: bool | None = None¶: Specifies whether attention probabilities should be returned in the output.
output_hidden_states: bool | None = None¶: Specifies whether hidden states should be returned in the output.
return_dict: bool | None = None¶: Specifies whether the output should be an object with key names (True) or a tuple.

Returns:¶

A tuple containing hidden states, or a TransformerOutputWithPast object if return_dict is True.

class EventStream.transformer.transformer.InnerAttention(config: StructuredTransformerConfig, layer_id: int = 0, is_seq: bool = True)[source]¶

Bases: Module

The inner attention module used by the GPTs in this codebase.

This module mainly selects which kind of attention computation is used in this layer and delegates the computation to it.

Parameters:¶

config: StructuredTransformerConfig¶: The model configuration object.
layer_id: int = 0¶: The integer index of the layer that this attention computation belongs to.
is_seq: bool = True¶: Whether this is a sequence attention layer (True) or a dependency-graph attention layer (False).

Raises:¶

ValueError – If an invalid attention type is provided.

forward(hidden_states, attention_mask=None, layer_past=None, head_mask=None, use_cache=False, output_attentions=False, static_kv_first: bool = False)[source]¶

Forward pass.

Returns the result of the pre-selected attention computation, applied to the inputs after they are run through a layer norm.

Parameters:¶

hidden_states¶: The input hidden states.
attention_mask=None¶: A mask to be applied on the attention weights.
layer_past=None¶: The past layer states.
head_mask=None¶: A mask to be applied on the attention heads.
use_cache=False¶: A flag indicating whether to cache the layer’s past states.
output_attentions=False¶: A flag indicating whether to output the attention weights.
static_kv_first: bool = False¶: For attention over the dependency graph, the history embedding is dropped after processing, so it should be used only as a key/value (KV) and not as a query.
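One way to picture the static_kv_first behaviour described above is the sketch below, in which the first position stands in for the history embedding: it contributes keys and values but issues no query. This is a single-head, unmasked simplification; the helper name kv_only_first_attention is hypothetical and this is not the module's actual implementation.

```python
import math

import torch


def kv_only_first_attention(hidden_states: torch.Tensor) -> torch.Tensor:
    """Single-head attention in which position 0 is key/value only (hypothetical helper)."""
    dim = hidden_states.shape[-1]
    # Every position, including the history embedding at index 0, supplies keys and values.
    keys = values = hidden_states
    # Only positions 1..n-1 issue queries, so the output has one fewer position.
    queries = hidden_states[:, 1:, :]
    scores = queries @ keys.transpose(-1, -2) / math.sqrt(dim)
    weights = torch.softmax(scores, dim=-1)
    return weights @ values


out = kv_only_first_attention(torch.randn(2, 4, 8))
print(out.shape)  # torch.Size([2, 3, 8])
```

Dropping the first position from the queries means no output is computed for an element that would be discarded anyway.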

class EventStream.transformer.transformer.InnerBlock(config: StructuredTransformerConfig, layer_id: int, is_seq: bool)[source]¶

Bases: Module

An inner block in a transformer architecture that consists of attention and MLP layers.

Parameters:¶

config: StructuredTransformerConfig¶: Configuration parameters for the structured transformer.
layer_id: int¶: Unique identifier for the layer.
is_seq: bool¶: Flag indicating whether the block is sequential.

forward(hidden_states, attention_mask=None, layer_past=None, head_mask=None, use_cache=False, output_attentions=False, static_kv_first: bool = False) → tuple[Tensor, dict[str, Tensor]][source]¶

Conducts the forward pass for the inner block.

Parameters:¶

hidden_states¶: Input tensor.
attention_mask=None¶: Mask to avoid attending to padded token positions.
layer_past=None¶: Cache of past hidden states for more efficient decoding.
head_mask=None¶: Mask to nullify selected heads of the self-attention module.
use_cache=False¶: Whether to use caching.
output_attentions=False¶: Whether to return attention probabilities in the output.
static_kv_first: bool = False¶: Whether the static key-value pair comes first.

Returns:¶

Modified hidden states and a dictionary containing present key-value pair and attention weights (if output_attentions=True).

Return type:¶

tuple
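The sketch below shows how an attention sub-layer and an MLP sub-layer can be combined in a block that returns the hidden states plus a dictionary of extras. The documentation above only says the block consists of attention and MLP layers. The layer norms, residual connections, 4x MLP width, and the use of torch.nn.MultiheadAttention as a stand-in are all assumptions borrowed from common GPT-style blocks, and SketchBlock is a hypothetical name.

```python
import torch
from torch import nn


class SketchBlock(nn.Module):
    """Pre-layer-norm attention + MLP block with residual connections (illustrative only)."""

    def __init__(self, hidden_size: int = 16, num_heads: int = 4):
        super().__init__()
        self.ln_1 = nn.LayerNorm(hidden_size)
        # Stand-in attention; the documented block uses its own attention module.
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ln_2 = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, hidden_states: torch.Tensor, output_attentions: bool = False):
        normed = self.ln_1(hidden_states)
        attn_out, attn_weights = self.attn(normed, normed, normed, need_weights=output_attentions)
        hidden_states = hidden_states + attn_out  # residual around attention
        hidden_states = hidden_states + self.mlp(self.ln_2(hidden_states))  # residual around MLP
        extras = {"attn_weights": attn_weights if output_attentions else None}
        return hidden_states, extras


block = SketchBlock()
x = torch.randn(2, 5, 16)
out, extras = block(x)
print(out.shape, extras["attn_weights"])  # torch.Size([2, 5, 16]) None
```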

class EventStream.transformer.transformer.InnerMLP(config: StructuredTransformerConfig)[source]¶

Bases: Module

Applies a multilayer perceptron (MLP) to the hidden_states.

Parameters:¶

config: StructuredTransformerConfig¶: Configuration parameters for the structured transformer.

forward(hidden_states)[source]¶

Conducts forward pass for the MLP.

Parameters:¶

hidden_states¶: Input tensor.

Returns:¶

Modified hidden states after applying MLP.

class EventStream.transformer.transformer.InnerSelfAttention(config: StructuredTransformerConfig, attention_type: str, window_size: int)[source]¶

Bases: Module

This class implements the inner self-attention mechanism.

It performs the self-attention operation and returns the result together with optional additional outputs. The three constructor arguments determine how the attention is configured.

Parameters:¶

config: StructuredTransformerConfig¶: An instance of StructuredTransformerConfig which contains various configuration parameters.
attention_type: str¶: A string indicating the type of attention to be applied. Currently, only “local” is implemented.
window_size: int¶: An integer specifying the size of the attention window.

Raises:¶

ValueError – If the product of num_heads and head_dim from the config does not match embed_dim.
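To make the "local" attention type and the head-dimension check more concrete, here is a minimal sketch. It assumes a causal window that includes the current position and derives head_dim by integer division. Both helpers (local_causal_mask, check_head_dims) are hypothetical and may differ from the actual implementation.

```python
import torch


def local_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask; True where query i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(-1)  # query positions, as a column
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, as a row
    # Causal (j <= i) and within the last `window_size` positions (j > i - window_size).
    return (j <= i) & (j > i - window_size)


def check_head_dims(embed_dim: int, num_heads: int) -> int:
    """Returns head_dim, raising if num_heads * head_dim does not equal embed_dim."""
    head_dim = embed_dim // num_heads
    if head_dim * num_heads != embed_dim:
        raise ValueError(f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads}).")
    return head_dim


mask = local_causal_mask(seq_len=5, window_size=2)
print(mask.int())
```

With window_size=2, each query attends to itself and at most one preceding position.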

forward(hidden_states, attention_mask=None, layer_past=None, head_mask=None, use_cache=False, output_attentions=False, static_kv_first: bool = False)[source]¶

Applies the attention mechanism to the input hidden states.

Parameters:¶

hidden_states¶: The input hidden states.
attention_mask=None¶: A mask to be applied on the attention weights.
layer_past=None¶: The past layer states.
head_mask=None¶: A mask to be applied on the attention heads.
use_cache=False¶: A flag indicating whether to cache the layer’s past states.
output_attentions=False¶: A flag indicating whether to output the attention weights.
static_kv_first: bool = False¶: For attention over the dependency graph, the history embedding is dropped after processing, so it should be used only as a key/value (KV) and not as a query.

Returns:¶

A tuple containing the output of the attention mechanism and a dictionary of optional outputs.

class EventStream.transformer.transformer.LearnableFrequencySinusoidalTemporalPositionEncoding(embedding_dim: int, max_timepoint: float = 10000.0)[source]¶

Bases: Module

A module for applying time-based position encodings to a PytorchBatch.

Adapted from (link).

Parameters:¶

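embedding_dim: int¶: The desired size of the output embedding. Unlike many position embedding implementations, this does not need to be even.
max_timepoint: float = 10000.0¶: The maximum observed timepoint, used to initialize the frequency space.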

forward(batch: PytorchBatch) → Tensor[source]¶

Forward pass.

Parameters:¶

batch: PytorchBatch¶: The input batch to process.

Returns:¶

The temporal position embeddings tensor of shape (bsz, seq_len, embedding_dim)
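What makes the frequencies "learnable" can be sketched as follows: they are stored as an nn.Parameter, so gradients reach them during training. The initialization schedule, the class name LearnableFrequencies, and a forward that takes a raw tensor of times rather than a PytorchBatch are all assumptions made for illustration.

```python
import math

import torch
from torch import nn


def initial_frequencies(embedding_dim: int, max_timepoint: float = 10000.0) -> torch.Tensor:
    """Geometric frequency schedule commonly used for sinusoidal encodings."""
    exponents = torch.arange(0, embedding_dim, 2) * (-math.log(max_timepoint) / embedding_dim)
    return torch.exp(exponents)


class LearnableFrequencies(nn.Module):
    """Holds trainable frequencies and applies them to event times (illustrative only)."""

    def __init__(self, embedding_dim: int, max_timepoint: float = 10000.0):
        super().__init__()
        # nn.Parameter makes the frequencies visible to the optimizer.
        self.freqs = nn.Parameter(initial_frequencies(embedding_dim, max_timepoint))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (bsz, seq_len) -> phases: (bsz, seq_len, n_freqs)
        return torch.sin(t.unsqueeze(-1) * self.freqs)


module = LearnableFrequencies(embedding_dim=6)
phases = module(torch.tensor([[0.0, 1.0, 4.2]]))
phases.sum().backward()
print(module.freqs.grad is not None)  # True
```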

class EventStream.transformer.transformer.NestedAttentionPointProcessInputLayer(config: StructuredTransformerConfig)[source]¶

Bases: Module

Processes input batch and produces input dependency graph element embeddings.

This layer accepts a batch from an event-stream PyTorch dataset and returns input embeddings from it. This is designed for nested attention models, as it splits the input embeddings into different components corresponding to different dependency graph positions. Combines time and data embeddings.

Parameters:¶

config: StructuredTransformerConfig¶: Configuration parameters for the structured transformer.

forward(batch: PytorchBatch, dep_graph_el_generation_target: int | None = None) → Tensor[source]¶

Returns input dependency graph element embeddings for the provided batch.

Parameters:¶

batch: PytorchBatch¶: A PytorchBatch instance containing input data.

class EventStream.transformer.transformer.NestedAttentionPointProcessTransformer(config: StructuredTransformerConfig)[source]¶

Bases: StructuredTransformerPreTrainedModel

A transformer model specifically for nested attention point processes.

This model uses an input layer to generate embeddings from an event-stream PyTorch dataset, and an InnerBlock layer for non-structured processing. As a nested attention model, event covariates are predicted in the sequence of the dependency graph elements, specified in the config’s measurements_per_dep_graph_level parameter, depending on both the historical event embeddings and the prior dependency graph elements.

Parameters:¶

config: StructuredTransformerConfig¶: Configuration parameters for the structured transformer.

Raises:¶

ValueError – If the provided configuration indicates a conditionally independent model.

The forward method performs a forward pass on the transformer model.

Parameters:¶

batch: PytorchBatch | None = None¶: A PytorchBatch instance containing input data.
input_embeds: Tensor | None = None¶: Precomputed embeddings for the input data. Currently unused.
past: tuple[FloatTensor] | None = None¶: Past hidden states for more efficient decoding.
seq_attention_mask: Tensor | None = None¶: Mask for the sequential attention mechanism.
head_mask: Tensor | None = None¶: Mask to nullify selected heads of the self-attention module.
use_cache: bool | None = None¶: Specifies whether caching should be used.
output_attentions: bool | None = None¶: Specifies whether attention probabilities should be returned in the output.
output_hidden_states: bool | None = None¶: Specifies whether hidden states should be returned in the output.
return_dict: bool | None = None¶: Specifies whether the output should be an object with key names (True) or a tuple.

Returns:¶

A tuple containing hidden states, or a TransformerOutputWithPast object if return_dict is True.

class EventStream.transformer.transformer.StructuredTransformerBlock(config: StructuredTransformerConfig, layer_id: int)[source]¶

Bases: Module

A block for structured attention with both sequential and dependency graph modules.

Parameters:¶

config: StructuredTransformerConfig¶: Configuration parameters for the structured transformer.
layer_id: int¶: Unique identifier (depth index) for the layer.

forward(*args, **kwargs) → tuple[Tensor, dict[str, dict[str, Tensor | None] | None]][source]¶

Conducts the forward pass for the structured transformer block.

Parameters:¶

*args¶: Variable length argument list.
**kwargs¶: Arbitrary keyword arguments.

Returns:¶

Modified input tensor and a dictionary containing present key-value pair and attention weights.

Return type:¶

tuple

class EventStream.transformer.transformer.StructuredTransformerPreTrainedModel(*inputs, **kwargs)[source]¶

Bases: PreTrainedModel

The base pre-trained model class for Transformer models.

base_model_prefix = 'transformer'¶

config_class¶: alias of StructuredTransformerConfig

supports_gradient_checkpointing = True¶

class EventStream.transformer.transformer.TemporalPositionEncoding(embedding_dim: int, max_timepoint: float = 10000.0)[source]¶

Bases: Module

A module for applying time-based position encodings to a PytorchBatch.

Adapted from https://pytorch.org/tutorials/beginner/transformer_tutorial.html

Parameters:¶

embedding_dim: int¶: The desired size of the output embedding. Unlike many position embedding implementations, this does not need to be even.
max_timepoint: float = 10000.0¶: The maximum observed timepoint, used to initialize the frequency space.

forward(batch: PytorchBatch) → Tensor[source]¶

Forward pass.

Parameters:¶

batch: PytorchBatch¶: The input batch to process.

Returns:¶

The temporal position embeddings tensor of shape (bsz, seq_len, embedding_dim)
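The following sketch shows a sinusoidal encoding over continuous event times that also works for odd embedding_dim, in the spirit of the PyTorch tutorial linked above. It takes a tensor of times directly rather than a PytorchBatch. The function name and the choice to drop the last cosine frequency when embedding_dim is odd are assumptions.

```python
import math

import torch


def sinusoidal_time_encoding(
    t: torch.Tensor, embedding_dim: int, max_timepoint: float = 10000.0
) -> torch.Tensor:
    """Maps times of shape (bsz, seq_len) to encodings of shape (bsz, seq_len, embedding_dim)."""
    div_term = torch.exp(torch.arange(0, embedding_dim, 2) * (-math.log(max_timepoint) / embedding_dim))
    t = t.unsqueeze(-1)  # (bsz, seq_len, 1), so it broadcasts over frequencies
    out = torch.zeros(t.shape[0], t.shape[1], embedding_dim)
    out[..., 0::2] = torch.sin(t * div_term)  # even slots: ceil(embedding_dim / 2) of them
    out[..., 1::2] = torch.cos(t * div_term[: embedding_dim // 2])  # odd slots: one fewer if dim is odd
    return out


times = torch.tensor([[0.0, 1.0, 4.2], [0.0, 1.4, 1.4]])
emb = sinusoidal_time_encoding(times, embedding_dim=5)
print(emb.shape)  # torch.Size([2, 3, 5])
```

At time 0 every sine slot is 0 and every cosine slot is 1, which is a quick sanity check on the slot layout.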

EventStream.transformer.transformer.expand_mask(mask: BoolTensor, dtype: dtype) → Tensor[source]¶

Expands attention_mask from [bsz, seq_len] to [bsz, 1, 1, seq_len] and converts to float.

This enables broadcasting to [bsz, num_heads, from_seq_len, to_seq_len]. The mask is reshaped from [bsz, seq_len] to [bsz, 1, 1, seq_len] and converted from boolean form to an attention-weight masking form, which is 0 where the original mask was True and the most negative finite value representable in dtype where it was False.

Parameters:¶

mask: BoolTensor¶: The event presence/absence mask of shape [bsz, seq_len].
dtype: dtype¶: The target dtype of the attention mask.

Returns:¶

The passed event indicator mask, reshaped and converted to dtype, or None if mask is None.

Examples

>>> import torch
>>> assert expand_mask(None, None) is None
>>> mask = torch.BoolTensor([
...     [True, True, False, False],
...     [True, True, True, False],
... ])
>>> dtype = torch.float16
>>> print(expand_mask(mask, dtype))
tensor([[[[    -0.,     -0., -65504., -65504.]]],


        [[[    -0.,     -0.,     -0., -65504.]]]], dtype=torch.float16)
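The behaviour shown in the example above can be reproduced with a few tensor operations. The sketch below is one plausible way to write it, not necessarily the library's code, and expand_mask_sketch is a hypothetical name.

```python
from __future__ import annotations

import torch


def expand_mask_sketch(mask: torch.Tensor | None, dtype: torch.dtype) -> torch.Tensor | None:
    """Reshapes a [bsz, seq_len] boolean mask to [bsz, 1, 1, seq_len] masking values."""
    if mask is None:
        return None
    # Insert singleton head and query dimensions; present events become 1.0, absent ones 0.0.
    expanded = mask[:, None, None, :].to(dtype=dtype)
    # Present events map to 0; absent events map to the most negative finite value of dtype.
    return (1.0 - expanded) * torch.finfo(dtype).min


mask = torch.tensor([[True, True, False, False], [True, True, True, False]])
out = expand_mask_sketch(mask, torch.float32)
print(out.shape)  # torch.Size([2, 1, 1, 4])
```

Adding such a tensor to attention scores before the softmax leaves present events unchanged and drives the weight of absent events to effectively zero.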

EventStream.transformer.transformer.time_from_deltas(batch: PytorchBatch) → Tensor[source]¶

Given a batch of time deltas, compute the relative time-since-start for each event.

Parameters:¶

batch: PytorchBatch¶: The input batch

Examples

>>> batch = PytorchBatch(
...     event_mask=torch.BoolTensor([
...         [True, True, True], [True, True, False]
...     ]),
...     time_delta=torch.Tensor([[1.0, 3.2, 0.0], [1.4, 0.0, 1.0]])
... )
>>> print(time_from_deltas(batch))
tensor([[0.0000, 1.0000, 4.2000],
        [0.0000, 1.4000, 1.4000]])
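The computation illustrated above amounts to a masked, right-shifted cumulative sum. The sketch below takes the two tensors directly instead of a PytorchBatch. The function name is hypothetical, and this is one way to reproduce the documented output rather than a copy of the library code.

```python
import torch


def time_from_deltas_sketch(time_delta: torch.Tensor, event_mask: torch.Tensor) -> torch.Tensor:
    """Computes time-since-start for each event from inter-event deltas."""
    # Zero the deltas of padded events so they do not advance the clock.
    deltas = torch.where(event_mask, time_delta, torch.zeros_like(time_delta))
    # Event i starts after the first i deltas, so shift the cumulative sum right by one.
    start = torch.zeros_like(deltas[:, :1])
    return torch.cat([start, deltas.cumsum(-1)[:, :-1]], dim=-1)


event_mask = torch.tensor([[True, True, True], [True, True, False]])
time_delta = torch.tensor([[1.0, 3.2, 0.0], [1.4, 0.0, 1.0]])
times = time_from_deltas_sketch(time_delta, event_mask)
print(times)
```

The first event always sits at time 0, and the final delta in each row is never used because no event follows it.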