EventStream.transformer.transformer module
The internal transformer module code.
TODO(mmcdermott): Can use transformers.apply_chunking_to_forward to save memory.
- class EventStream.transformer.transformer.ConditionallyIndependentPointProcessInputLayer(config: StructuredTransformerConfig)
Bases: Module
Processes an input batch and produces event embeddings.
This layer accepts a batch from an event-stream PyTorch dataset and returns input embeddings from it. This is designed for conditionally independent models, as it does not split the input embeddings into different components corresponding to different dependency graph positions. Combines time and data embeddings.
- Parameters:
- config: StructuredTransformerConfig
Configuration parameters for the structured transformer.
- forward(batch: PytorchBatch) -> Tensor
Returns input event embeddings for the provided batch.
- Parameters:
- batch: PytorchBatch
A PytorchBatch instance containing input data.
- class EventStream.transformer.transformer.ConditionallyIndependentPointProcessTransformer(config: StructuredTransformerConfig)
Bases: StructuredTransformerPreTrainedModel
A transformer model specifically for conditionally independent point processes.
This model uses an input layer to generate embeddings from an event-stream PyTorch dataset, and an InnerBlock layer for non-structured processing. As a conditionally independent model, all event covariates are predicted simultaneously from the history embedding.
- Parameters:
- config: StructuredTransformerConfig
Configuration parameters for the structured transformer.
- Raises:
ValueError – If the provided configuration indicates a nested attention model.
- forward(batch: PytorchBatch | None = None, input_embeds: Tensor | None = None, past: tuple[FloatTensor] | None = None, seq_attention_mask: Tensor | None = None, head_mask: Tensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) -> tuple[Tensor] | TransformerOutputWithPast
Performs a forward pass on the transformer model.
- Parameters:
- batch: PytorchBatch | None = None
A PytorchBatch instance containing input data.
- input_embeds: Tensor | None = None
Precomputed embeddings for the input data. Currently unused.
- past: tuple[FloatTensor] | None = None
Past hidden states for more efficient decoding.
- seq_attention_mask: Tensor | None = None
Mask for the sequential attention mechanism.
- head_mask: Tensor | None = None
Mask to nullify selected heads of the self-attention module.
- use_cache: bool | None = None
Specifies whether caching should be used.
- output_attentions: bool | None = None
Specifies whether attention probabilities should be returned in the output.
- output_hidden_states: bool | None = None
Specifies whether hidden states should be returned in the output.
- return_dict: bool | None = None
Specifies whether the output should be an object with key names (True) or a tuple.
- Returns:
A tuple containing hidden states, or a TransformerOutputWithPast object if return_dict is True.
- class EventStream.transformer.transformer.InnerAttention(config: StructuredTransformerConfig, layer_id: int = 0, is_seq: bool = True)
Bases: Module
The inner attention module used by the GPTs in this codebase.
This module mainly selects which kind of attention computation this layer uses and delegates the computation to it.
- Parameters:
- Raises:
ValueError – If an invalid attention type is provided.
- forward(hidden_states, attention_mask=None, layer_past=None, head_mask=None, use_cache=False, output_attentions=False, static_kv_first: bool = False)
Forward pass.
This returns the pre-selected attention calculation over the inputs (run through a layer norm).
- Parameters:
- hidden_states
The input hidden states.
- attention_mask=None
A mask to be applied on the attention weights.
- layer_past=None
The past layer states.
- head_mask=None
A mask to be applied on the attention heads.
- use_cache=False
A flag indicating whether to cache the layer's past states.
- output_attentions=False
A flag indicating whether to output the attention weights.
- static_kv_first: bool = False
For attention over the dependency graph, the history embedding is dropped after processing, so it should be used only as a key/value (KV), not as a query.
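The static_kv_first behaviour described above can be illustrated with plain tensors. The following is a minimal sketch, not the module's implementation: the helper name is hypothetical, and it assumes a single head with no learned query/key/value projections. It only shows the first position contributing keys and values while being excluded from the queries.

```python
import torch


def attend_static_kv_first(hidden_states: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: causal self-attention where position 0 is KV-only.

    Assumption: single head, no learned projections, for illustration only.
    """
    bsz, seq_len, dim = hidden_states.shape
    keys = values = hidden_states          # every position serves as key and value
    queries = hidden_states[:, 1:, :]      # position 0 is dropped from the queries

    scores = queries @ keys.transpose(-1, -2) / dim**0.5  # (bsz, seq_len - 1, seq_len)
    # Causal mask: the query at original position p may attend to keys 0..p.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))[1:, :]
    scores = scores.masked_fill(~causal, torch.finfo(scores.dtype).min)
    return torch.softmax(scores, dim=-1) @ values         # (bsz, seq_len - 1, dim)


out = attend_static_kv_first(torch.randn(2, 4, 8))
print(out.shape)  # torch.Size([2, 3, 8])
```

The output has one fewer sequence position than the input because the first element is never used as a query.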
- class EventStream.transformer.transformer.InnerBlock(config: StructuredTransformerConfig, layer_id: int, is_seq: bool)
Bases: Module
An inner block in a transformer architecture that consists of attention and MLP layers.
- Parameters:
- forward(hidden_states, attention_mask=None, layer_past=None, head_mask=None, use_cache=False, output_attentions=False, static_kv_first: bool = False) -> tuple[Tensor, dict[str, Tensor]]
Conducts the forward pass for the inner block.
- Parameters:
- hidden_states
Input tensor.
- attention_mask=None
Mask to avoid attending to padded token positions.
- layer_past=None
Cache of past hidden states for more efficient decoding.
- head_mask=None
Mask to nullify selected heads of the self-attention module.
- use_cache=False
Whether to use caching.
- output_attentions=False
Whether to return attention probabilities in the output.
- static_kv_first: bool = False
Whether the static key-value pair comes first.
- Returns:
Modified hidden states and a dictionary containing the present key-value pair and attention weights (if output_attentions=True).
- Return type:
tuple[Tensor, dict[str, Tensor]]
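To make the "attention and MLP layers" structure concrete, here is a minimal sketch of a generic block that returns the same kind of (hidden states, extras dictionary) pair. It is not the library's InnerBlock: the class name TinyBlock, the pre-layer-norm residual layout, the 4x MLP width, and the use of torch.nn.MultiheadAttention without a causal mask are all assumptions made for illustration.

```python
import torch
from torch import nn


class TinyBlock(nn.Module):
    """Hypothetical stand-in for an attention + MLP block (not InnerBlock itself)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.ln_attn = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, hidden_states: torch.Tensor, output_attentions: bool = False):
        # Attention sub-layer: layer norm first, then a residual connection (assumed layout).
        normed = self.ln_attn(hidden_states)
        attn_out, attn_weights = self.attn(normed, normed, normed, need_weights=output_attentions)
        hidden_states = hidden_states + attn_out
        # MLP sub-layer with its own layer norm and residual connection.
        hidden_states = hidden_states + self.mlp(self.ln_mlp(hidden_states))
        extras = {"attn_weights": attn_weights} if output_attentions else {}
        return hidden_states, extras


block = TinyBlock(dim=8, num_heads=2)
out, extras = block(torch.randn(2, 5, 8), output_attentions=True)
print(out.shape, extras["attn_weights"].shape)
```

The block preserves the input shape, so blocks of this kind can be stacked by depth, which is what a layer_id argument would index.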
- class EventStream.transformer.transformer.InnerMLP(config: StructuredTransformerConfig)
Bases: Module
Applies a multilayer perceptron (MLP) to the hidden_states.
- Parameters:
- config: StructuredTransformerConfig
Configuration parameters for the structured transformer.
- forward(hidden_states)
Conducts the forward pass for the MLP.
- class EventStream.transformer.transformer.InnerSelfAttention(config: StructuredTransformerConfig, attention_type: str, window_size: int)
Bases: Module
This class implements the inner self-attention mechanism.
This involves performing the self-attention operation and returning the result along with some optional additional outputs. The constructor of this class accepts three arguments, which determine the configuration of the self-attention mechanism.
- Parameters:
- config: StructuredTransformerConfig
An instance of StructuredTransformerConfig which contains various configuration parameters.
- attention_type: str
A string indicating the type of attention to be applied. Currently, only “local” is implemented.
- window_size: int
An integer specifying the size of the attention window.
- Raises:
ValueError – If the product of num_heads and head_dim from the config does not match embed_dim.
- forward(hidden_states, attention_mask=None, layer_past=None, head_mask=None, use_cache=False, output_attentions=False, static_kv_first: bool = False)
Applies the attention mechanism to the input hidden states.
- Parameters:
- hidden_states
The input hidden states.
- attention_mask=None
A mask to be applied on the attention weights.
- layer_past=None
The past layer states.
- head_mask=None
A mask to be applied on the attention heads.
- use_cache=False
A flag indicating whether to cache the layer's past states.
- output_attentions=False
A flag indicating whether to output the attention weights.
- static_kv_first: bool = False
For attention over the dependency graph, the history embedding is dropped after processing, so it should be used only as a key/value (KV), not as a query.
- Returns:
A tuple containing the output of the attention mechanism and a dictionary of optional outputs.
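The combination of “local” attention and window_size can be pictured as a banded causal mask. The sketch below is an illustration under assumptions, not the class's own code: the helper name is hypothetical, and it assumes the window counts the current position plus the window_size - 1 positions before it.

```python
import torch


def local_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
    """Hypothetical helper: True where query i may attend to key j.

    Assumed convention: j <= i (causal) and i - j < window_size (local).
    """
    idx = torch.arange(seq_len)
    offset = idx[:, None] - idx[None, :]  # offset[i, j] = i - j
    return (offset >= 0) & (offset < window_size)


print(local_causal_mask(4, 2).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [0, 1, 1, 0],
#         [0, 0, 1, 1]], dtype=torch.int32)
```

With a window_size at least as large as the sequence length, this reduces to an ordinary causal (lower-triangular) mask.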
- class EventStream.transformer.transformer.LearnableFrequencySinusoidalTemporalPositionEncoding(embedding_dim: int, max_timepoint: float = 10000.0)
Bases: Module
A module for applying time-based position encodings to a PytorchBatch.
Adapted from (link).
- Parameters:
- forward(batch: PytorchBatch) -> Tensor
Forward pass.
- Parameters:
- batch: PytorchBatch
The input batch to process.
- Returns:
The temporal position embeddings tensor of shape (bsz, seq_len, embedding_dim).
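As an illustration of what "learnable frequency" could mean for a sinusoidal time encoding, the sketch below stores the frequencies as a trainable parameter. It is an assumption-laden sketch rather than the class's implementation: the class name, the initialisation at the standard fixed frequencies, the concatenated (not interleaved) sine and cosine layout, and taking a plain time tensor instead of a PytorchBatch are all choices made here for brevity.

```python
import math

import torch
from torch import nn


class LearnableSinusoid(nn.Module):
    """Hypothetical sketch: sinusoidal encoding of continuous times with trainable frequencies."""

    def __init__(self, embedding_dim: int, max_timepoint: float = 10000.0):
        super().__init__()
        if embedding_dim % 2 != 0:
            raise ValueError("this sketch assumes an even embedding_dim")
        # Start from the usual fixed frequencies, then let training adjust them.
        init = torch.exp(torch.arange(0, embedding_dim, 2) * (-math.log(max_timepoint) / embedding_dim))
        self.freqs = nn.Parameter(init)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        angles = t.unsqueeze(-1) * self.freqs                    # (bsz, seq_len, embedding_dim // 2)
        return torch.cat([angles.sin(), angles.cos()], dim=-1)   # (bsz, seq_len, embedding_dim)


encoder = LearnableSinusoid(embedding_dim=8)
emb = encoder(torch.tensor([[0.0, 1.0, 4.2]]))
print(emb.shape)  # torch.Size([1, 3, 8])
```

Because freqs is an nn.Parameter, it receives gradients like any other weight, which is the only difference from a fixed sinusoidal encoding in this sketch.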
- class EventStream.transformer.transformer.NestedAttentionPointProcessInputLayer(config: StructuredTransformerConfig)
Bases: Module
Processes an input batch and produces input dependency graph element embeddings.
This layer accepts a batch from an event-stream PyTorch dataset and returns input embeddings from it. This is designed for nested attention models, as it splits the input embeddings into different components corresponding to different dependency graph positions. Combines time and data embeddings.
- Parameters:
- config: StructuredTransformerConfig
Configuration parameters for the structured transformer.
- forward(batch: PytorchBatch, dep_graph_el_generation_target: int | None = None) -> Tensor
Returns input dependency graph element embeddings for the provided batch.
- Parameters:
- batch: PytorchBatch
A PytorchBatch instance containing input data.
- class EventStream.transformer.transformer.NestedAttentionPointProcessTransformer(config: StructuredTransformerConfig)
Bases: StructuredTransformerPreTrainedModel
A transformer model specifically for nested attention point processes.
This model uses an input layer to generate embeddings from an event-stream PyTorch dataset, and an InnerBlock layer for non-structured processing. As a nested attention model, event covariates are predicted in the order of the dependency graph elements specified in the config's measurements_per_dep_graph_level parameter, conditioned on both the historical event embeddings and the prior dependency graph elements.
- Parameters:
- config: StructuredTransformerConfig
Configuration parameters for the structured transformer.
- Raises:
ValueError – If the provided configuration indicates a conditionally independent model.
- forward(batch: PytorchBatch | None = None, input_embeds: Tensor | None = None, past: tuple[FloatTensor] | None = None, seq_attention_mask: Tensor | None = None, head_mask: Tensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, dep_graph_past: tuple[FloatTensor] | None = None, dep_graph_el_generation_target: int | None = None) -> tuple[Tensor] | TransformerOutputWithPast
Performs a forward pass on the transformer model.
- Parameters:
- batch: PytorchBatch | None = None
A PytorchBatch instance containing input data.
- input_embeds: Tensor | None = None
Precomputed embeddings for the input data. Currently unused.
- past: tuple[FloatTensor] | None = None
Past hidden states for more efficient decoding.
- seq_attention_mask: Tensor | None = None
Mask for the sequential attention mechanism.
- head_mask: Tensor | None = None
Mask to nullify selected heads of the self-attention module.
- use_cache: bool | None = None
Specifies whether caching should be used.
- output_attentions: bool | None = None
Specifies whether attention probabilities should be returned in the output.
- output_hidden_states: bool | None = None
Specifies whether hidden states should be returned in the output.
- return_dict: bool | None = None
Specifies whether the output should be an object with key names (True) or a tuple.
- Returns:
A tuple containing hidden states, or a TransformerOutputWithPast object if return_dict is True.
- class EventStream.transformer.transformer.StructuredTransformerBlock(config: StructuredTransformerConfig, layer_id: int)
Bases: Module
A block for structured attention with both sequential and dependency graph modules.
- Parameters:
- config: StructuredTransformerConfig
Configuration parameters for the structured transformer.
- layer_id: int
Unique identifier (depth index) for the layer.
- class EventStream.transformer.transformer.StructuredTransformerPreTrainedModel(*inputs, **kwargs)
Bases: PreTrainedModel
The base pre-trained model class for Transformer models.
- base_model_prefix = 'transformer'
- config_class
alias of StructuredTransformerConfig
- supports_gradient_checkpointing = True
- class EventStream.transformer.transformer.TemporalPositionEncoding(embedding_dim: int, max_timepoint: float = 10000.0)
Bases: Module
A module for applying time-based position encodings to a PytorchBatch.
Adapted from https://pytorch.org/tutorials/beginner/transformer_tutorial.html
- Parameters:
- forward(batch: PytorchBatch) -> Tensor
Forward pass.
- Parameters:
- batch: PytorchBatch
The input batch to process.
- Returns:
The temporal position embeddings tensor of shape (bsz, seq_len, embedding_dim).
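A time-based sinusoidal encoding in the style of the linked PyTorch tutorial can be sketched as follows. This is an illustrative sketch, not the module's source: the function name is hypothetical, it takes a tensor of event times directly instead of a PytorchBatch, and it assumes an even embedding_dim with sine on even channels and cosine on odd channels.

```python
import math

import torch


def sinusoidal_time_encoding(t: torch.Tensor, embedding_dim: int, max_timepoint: float = 10000.0) -> torch.Tensor:
    """Hypothetical helper: encode continuous times t of shape (bsz, seq_len).

    Assumption: embedding_dim is even.
    """
    div_term = torch.exp(torch.arange(0, embedding_dim, 2) * (-math.log(max_timepoint) / embedding_dim))
    angles = t.unsqueeze(-1) * div_term          # (bsz, seq_len, embedding_dim // 2)
    out = torch.zeros(*t.shape, embedding_dim)
    out[..., 0::2] = torch.sin(angles)           # even channels
    out[..., 1::2] = torch.cos(angles)           # odd channels
    return out


enc = sinusoidal_time_encoding(torch.tensor([[0.0, 1.0, 4.2]]), embedding_dim=4)
print(enc[0, 0])  # time 0 -> tensor([0., 1., 0., 1.])
```

Unlike an index-based position encoding, the argument of the sinusoids here is the elapsed time of each event, so irregularly spaced events receive correspondingly spaced encodings.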
- EventStream.transformer.transformer.expand_mask(mask: BoolTensor, dtype: dtype) -> Tensor
Expands attention_mask from [bsz, seq_len] to [bsz, 1, 1, seq_len] and converts to float.
This enables broadcasting to [bsz, num_heads, from_seq_len, to_seq_len] by reshaping the mask from [bsz, seq_len] to [bsz, 1, 1, seq_len] and converting it from boolean form to an additive attention-masking form, which has 0 where the original mask was True and the most negative finite value representable in dtype where it was False.
- Parameters:
- Returns:
The passed event indicator mask reshaped and type converted, unless mask is None, in which case None is returned.
Examples
>>> import torch
>>> assert expand_mask(None, None) is None
>>> mask = torch.BoolTensor([
...     [True, True, False, False],
...     [True, True, True, False],
... ])
>>> dtype = torch.float16
>>> print(expand_mask(mask, dtype))
tensor([[[[ -0., -0., -65504., -65504.]]],
        [[[ -0., -0., -0., -65504.]]]], dtype=torch.float16)
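To show why the expanded mask has this additive form, the sketch below re-implements the described behaviour under a hypothetical name and adds the result to a tensor of attention scores before the softmax. It is written from the description above, not copied from the library, and it omits the None handling.

```python
import torch


def expand_mask_sketch(mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Hypothetical re-implementation for illustration; not the library source."""
    expanded = mask[:, None, None, :].to(dtype)        # [bsz, 1, 1, seq_len], 1.0 where True
    return (1.0 - expanded) * torch.finfo(dtype).min   # 0 where True, dtype minimum where False


mask = torch.tensor([[True, True, False, False]])
additive = expand_mask_sketch(mask, torch.float32)

scores = torch.zeros(1, 2, 4, 4)                       # [bsz, num_heads, from_seq_len, to_seq_len]
probs = torch.softmax(scores + additive, dim=-1)       # the mask broadcasts over heads and queries
print(probs[0, 0, 0])  # tensor([0.5000, 0.5000, 0.0000, 0.0000])
```

Adding the dtype minimum drives the masked positions to zero probability after the softmax, while the unmasked positions share the full probability mass.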
- EventStream.transformer.transformer.time_from_deltas(batch: PytorchBatch) -> Tensor
Given a batch of time deltas, compute the relative time-since-start for each event.
- Parameters:
- batch: PytorchBatch
The input batch.
Examples
>>> batch = PytorchBatch(
...     event_mask=torch.BoolTensor([
...         [True, True, True], [True, True, False]
...     ]),
...     time_delta=torch.Tensor([[1.0, 3.2, 0.0], [1.4, 0.0, 1.0]])
... )
>>> print(time_from_deltas(batch))
tensor([[0.0000, 1.0000, 4.2000],
        [0.0000, 1.4000, 1.4000]])
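The computation in this example can be reproduced with a masked cumulative sum. The following is a sketch consistent with the example output, not the library's source: the function name is hypothetical, and it takes the two tensors directly instead of a PytorchBatch so that it runs with PyTorch alone.

```python
import torch


def time_from_deltas_sketch(event_mask: torch.Tensor, time_delta: torch.Tensor) -> torch.Tensor:
    """Hypothetical re-implementation for illustration; not the library source."""
    # Deltas of padded (masked-out) events should not advance the clock.
    deltas = torch.where(event_mask, time_delta, torch.zeros_like(time_delta))
    elapsed = deltas.cumsum(dim=-1)              # elapsed[:, i] is the time of event i + 1
    start = torch.zeros_like(elapsed[:, :1])     # the first event defines time 0
    return torch.cat([start, elapsed[:, :-1]], dim=-1)


event_mask = torch.tensor([[True, True, True], [True, True, False]])
time_delta = torch.tensor([[1.0, 3.2, 0.0], [1.4, 0.0, 1.0]])
times = time_from_deltas_sketch(event_mask, time_delta)
print(times)
# tensor([[0.0000, 1.0000, 4.2000],
#         [0.0000, 1.4000, 1.4000]])
```

Each delta is read as the gap between an event and the next one, which is why the cumulative sum is shifted right by one position.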