EventStream.transformer.config module¶

Core EventStream GPT model configuration classes.

EventStream.transformer.config.MEAS_INDEX_GROUP_T¶: The type of acceptable measurement index group option specifications.

EventStream.transformer.config.ATTENTION_TYPES_LIST_T¶: The type of acceptable attention type configuration options.

class EventStream.transformer.config.AttentionLayerType(value)[source]¶

Bases: StrEnum

Attention layer type options.

GLOBAL = 'global'¶: Attention is global over all sequence elements (respecting a causal mask).

LOCAL = 'local'¶: Attention is limited to a local window of a config-determined size.

class EventStream.transformer.config.Averaging(value)[source]¶

Bases: StrEnum

Describes the different ways metric values can be averaged in multi-class or multi-label settings.

MACRO = 'macro'¶: Macro-averaging; metrics are computed per label and then averaged without regard for label frequency.

MICRO = 'micro'¶: Micro-averaging; metrics are computed globally by pooling predictions across all labels, so frequent labels contribute more.

WEIGHTED = 'weighted'¶: Weighted-averaging; metrics are computed per label and then averaged, weighted by label/class frequency.
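
As a worked illustration of the three averaging modes, the sketch below computes precision for two labels from made-up counts. The counts, the choice of precision as the metric, and the function name `averaged_precision` are assumptions for this example; none of them come from EventStream.

```python
# Hypothetical per-label counts: true positives, false positives, and
# support (the number of true instances of the label).
counts = {
    "A": {"tp": 8, "fp": 2, "support": 10},
    "B": {"tp": 1, "fp": 3, "support": 2},
}


def averaged_precision(counts: dict, averaging: str) -> float:
    """Average per-label precision using 'macro', 'micro', or 'weighted'."""
    per_label = {k: c["tp"] / (c["tp"] + c["fp"]) for k, c in counts.items()}
    if averaging == "macro":
        # Every label counts equally, regardless of how frequent it is.
        return sum(per_label.values()) / len(per_label)
    if averaging == "micro":
        # Pool the raw counts across labels, then compute the metric once.
        tp = sum(c["tp"] for c in counts.values())
        fp = sum(c["fp"] for c in counts.values())
        return tp / (tp + fp)
    if averaging == "weighted":
        # Per-label values weighted by how often each label truly occurs.
        total = sum(c["support"] for c in counts.values())
        return sum(per_label[k] * c["support"] for k, c in counts.items()) / total
    raise ValueError(f"Unknown averaging mode: {averaging}")


for mode in ("macro", "micro", "weighted"):
    print(mode, round(averaged_precision(counts, mode), 3))
```

The rare label B has low precision, so it pulls the macro average down more than it does the micro or weighted averages.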

class EventStream.transformer.config.MetricCategories(value)[source]¶

Bases: StrEnum

Describes different categories of metrics.

Used for configuring what metrics to track.

CLASSIFICATION = 'classification'¶: Track metrics for generative prediction of classification targets.

LOSS_PARTS = 'loss_parts'¶: Track the different loss components.

REGRESSION = 'regression'¶: Track metrics for generative prediction of regression targets.

TTE = 'TTE'¶: Track metrics related to time-to-event prediction.

class EventStream.transformer.config.Metrics(value)[source]¶

Bases: StrEnum

Describes the different supported metric functions.

ACCURACY = 'accuracy'¶: Raw accuracy.

AUPRC = 'AUPRC'¶

The area under the precision recall curve.

Also commonly referred to as “Average Precision”.

AUROC = 'AUROC'¶

The area under the receiver operating characteristic curve.

Also commonly called “AUC”.

EXPLAINED_VARIANCE = 'explained_variance'¶: The extent to which the predicted regression label explains the variance in the true label.

MSE = 'MSE'¶: The mean squared error between predicted and true regression labels.

MSLE = 'MSLE'¶: The mean squared log error between predicted and true regression labels.

class EventStream.transformer.config.MetricsConfig(n_auc_thresholds: int | None = 50, do_skip_all_metrics: bool = False, do_validate_args: bool = False, include_metrics: dict[str, ~typing.Any] = <factory>)[source]¶

Bases: JSONableMixin

An overall configuration for what metrics should be tracked.

Parameters:¶

n_auc_thresholds: The number of thresholds to be used when computing AUROCs, for memory efficiency.
do_skip_all_metrics: If True, all metrics will be skipped by the model. This can save significant time.
do_validate_args: If True, torchmetrics metrics objects will validate their arguments during computation. This costs time.
include_metrics: A dictionary detailing which metrics should be tracked over which splits, for which measurements, and in which ways. If do_skip_all_metrics is True, this will be silently overwritten with {}. The dictionary is nested as follows. The outermost keys are splits. Each split maps to another dictionary whose keys are the metric categories that should be tracked in some form on that split. Each metric category maps either to the boolean True, in which case all relevant metrics in that category are tracked, or to a dictionary mapping metric functions either to the boolean True, indicating they should be tracked over all relevant weightings, or to a list of the weightings that should be tracked.
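
The nesting described for include_metrics can be easier to read as a literal. The sketch below shows one possible value, written with the plain string values of Split, MetricCategories, Metrics, and Averaging, together with a small lookup helper. The helper `is_tracked` is hypothetical and written only for this illustration; it is not the library's do_log implementation.

```python
# Hypothetical include_metrics value following the nesting described above:
# split -> metric category -> True | {metric function -> True | [weightings]}
include_metrics = {
    "train": {"loss_parts": True},
    "tuning": {
        "loss_parts": True,
        "classification": {"AUROC": ["weighted"], "accuracy": True},
        "regression": True,
    },
}


def is_tracked(include_metrics, split, category, metric=None, weighting=None):
    """Illustrative lookup over the nested format (not EventStream's own code)."""
    categories = include_metrics.get(split, {})
    if category not in categories:
        return False
    spec = categories[category]
    if spec is True or metric is None:
        # The category is tracked across all of its relevant metrics,
        # or the caller only asked about the category as a whole.
        return True
    if metric not in spec:
        return False
    weightings = spec[metric]
    if weightings is True or weighting is None:
        return True
    return weighting in weightings


print(is_tracked(include_metrics, "tuning", "classification", "AUROC", "weighted"))
print(is_tracked(include_metrics, "tuning", "classification", "AUROC", "macro"))
print(is_tracked(include_metrics, "train", "classification"))
```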

do_log(split: Split, cat: MetricCategories, metric_name: str | None = None) → bool[source]¶: Returns True if metric_name should be tracked for split and cat.

do_log_any(cat: MetricCategories, metric_name: str | None = None) → bool[source]¶: Returns True if metric_name should be tracked for cat and any split.

do_log_only_loss(split: Split) → bool[source]¶: Returns True if only loss should be logged for this split.

do_skip_all_metrics : bool = False¶

do_validate_args : bool = False¶

include_metrics : dict[str, Any]¶

n_auc_thresholds : int | None = 50¶

class EventStream.transformer.config.OptimizationConfig(init_lr: float = 0.01, end_lr: float | None = None, end_lr_frac_of_init_lr: float | None = 0.001, max_epochs: int = 100, batch_size: int = 32, validation_batch_size: int = 32, lr_frac_warmup_steps: float | None = 0.01, lr_num_warmup_steps: int | None = None, max_training_steps: int | None = None, lr_decay_power: float = 1.0, weight_decay: float = 0.01, patience: int | None = None, gradient_accumulation: int | None = None, num_dataloader_workers: int = 0)[source]¶

Bases: JSONableMixin

Configuration for optimization variables for training a model.

Parameters:¶

init_lr: float = 0.01¶: The initial learning rate used by the optimizer. When warmup is used, this is the peak learning rate reached at the end of the warmup period.
end_lr: float | None = None¶: The final learning rate at the end of all learning rate decay.
end_lr_frac_of_init_lr: float | None = 0.001¶: The fraction of the initial learning rate that the end learning rate should be. Must be consistent with end_lr, when both are set. If only one is set, the other will be correctly inferred upon initialization. This is largely useful during hyperparameter tuning, to avoid sampling hyperparameters where end_lr is larger than init_lr, which is not compatible with the supported learning rate scheduler.
max_epochs: int = 100¶: The maximum number of training epochs.
batch_size: int = 32¶: The batch size used during stochastic gradient descent.
validation_batch_size: int = 32¶: The batch size used during evaluation.
lr_frac_warmup_steps: float | None = 0.01¶: What fraction of the total training steps should be spent increasing the learning rate during the learning rate warmup period. Should not be set simultaneously with lr_num_warmup_steps. This is largely used in the set_to_dataset function, which initializes missing parameters given the dataset size, such as inferring max_training_steps and setting lr_num_warmup_steps given this parameter and the inferred max_training_steps.
lr_num_warmup_steps: int | None = None¶: How many training steps should be spent on learning rate warmup. If this is set then lr_frac_warmup_steps should be set to None, and lr_frac_warmup_steps will be properly inferred during set_to_dataset.
max_training_steps: int | None = None¶: The maximum number of training steps the system will run for given max_epochs, batch_size, and the size of the used dataset (as inferred via set_to_dataset). Generally should not be set at initialization.
lr_decay_power: float = 1.0¶: The decay power in the learning rate polynomial decay with warmup. 1.0 corresponds to linear decay.
weight_decay: float = 0.01¶: The L2 weight regularization penalty that is applied during training.
patience: int | None = None¶: The number of epochs to wait before early stopping if the validation loss does not improve. If None, early stopping is not used.
gradient_accumulation: int | None = None¶: The number of gradient accumulation steps to use. If None, gradient accumulation is not used.
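
To make the interplay of init_lr, end_lr, the warmup steps, and lr_decay_power concrete, here is a minimal sketch of a polynomial decay schedule with linear warmup. It follows the widely used form of this schedule; that EventStream's scheduler matches it step for step is an assumption, and the function name `lr_at_step` is hypothetical.

```python
def lr_at_step(
    step: int,
    init_lr: float,
    end_lr: float,
    lr_num_warmup_steps: int,
    max_training_steps: int,
    lr_decay_power: float = 1.0,
) -> float:
    """Learning rate at `step` under linear warmup then polynomial decay."""
    if step < lr_num_warmup_steps:
        # Linear warmup from 0 up to the peak learning rate, init_lr.
        return init_lr * step / max(1, lr_num_warmup_steps)
    if step >= max_training_steps:
        return end_lr
    # Fraction of the decay phase that has elapsed, in [0, 1).
    progress = (step - lr_num_warmup_steps) / (max_training_steps - lr_num_warmup_steps)
    # lr_decay_power == 1.0 makes this a straight line from init_lr to end_lr.
    return end_lr + (init_lr - end_lr) * (1.0 - progress) ** lr_decay_power


for step in (0, 5, 10, 60, 110):
    print(step, lr_at_step(step, 0.01, 1e-5, 10, 110))
```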

Raises:¶

ValueError – If end_lr, init_lr, and end_lr_frac_of_init_lr are not consistent, or if end_lr and end_lr_frac_of_init_lr are both unset.
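
The consistency rule between end_lr and end_lr_frac_of_init_lr can be sketched as follows. This is an illustration of the documented behaviour under the assumption that "consistent" means end_lr equals init_lr times end_lr_frac_of_init_lr; the function `resolve_end_lr` is hypothetical and not part of the library.

```python
import math


def resolve_end_lr(init_lr, end_lr=None, end_lr_frac_of_init_lr=None):
    """Infer whichever of end_lr / end_lr_frac_of_init_lr is missing."""
    if end_lr is None and end_lr_frac_of_init_lr is None:
        raise ValueError("Set at least one of end_lr and end_lr_frac_of_init_lr.")
    if end_lr is None:
        end_lr = init_lr * end_lr_frac_of_init_lr
    elif end_lr_frac_of_init_lr is None:
        end_lr_frac_of_init_lr = end_lr / init_lr
    elif not math.isclose(end_lr, init_lr * end_lr_frac_of_init_lr):
        # Both were given but they disagree with each other.
        raise ValueError("end_lr, init_lr, and end_lr_frac_of_init_lr are inconsistent.")
    return end_lr, end_lr_frac_of_init_lr


print(resolve_end_lr(0.01, end_lr_frac_of_init_lr=0.001))
```

Sampling end_lr_frac_of_init_lr from (0, 1] during hyperparameter tuning then guarantees an end learning rate no larger than init_lr.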

batch_size : int = 32¶

end_lr : float | None = None¶

end_lr_frac_of_init_lr : float | None = 0.001¶

gradient_accumulation : int | None = None¶

init_lr : float = 0.01¶

lr_decay_power : float = 1.0¶

lr_frac_warmup_steps : float | None = 0.01¶

lr_num_warmup_steps : int | None = None¶

max_epochs : int = 100¶

max_training_steps : int | None = None¶

num_dataloader_workers : int = 0¶

patience : int | None = None¶

set_to_dataset(dataset: PytorchDataset)[source]¶

Sets parameters in the config to appropriate values given dataset.

Some parameters for optimization are dependent upon the total size of the dataset (e.g., converting between a fraction of training and a concrete number of steps). This function sets these parameters based on dataset’s size.

Parameters:¶

dataset: PytorchDataset¶: The dataset to set the internal parameters to.

Raises:¶

ValueError – If the setting process does not yield consistent results.
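
A rough sketch of the kind of arithmetic involved is shown below. It assumes one optimizer step per batch, a final partial batch that is kept, and rounding of the warmup step count to the nearest integer; the library's exact rules may differ, and `infer_step_counts` is a hypothetical helper rather than an EventStream function.

```python
import math


def infer_step_counts(n_train: int, batch_size: int, max_epochs: int, lr_frac_warmup_steps: float):
    """Derive step counts from the dataset size (illustrative assumptions only)."""
    # Assumption: the last, possibly smaller, batch still counts as a step.
    steps_per_epoch = math.ceil(n_train / batch_size)
    max_training_steps = steps_per_epoch * max_epochs
    # Assumption: the fractional warmup is rounded to the nearest whole step.
    lr_num_warmup_steps = int(round(lr_frac_warmup_steps * max_training_steps))
    return max_training_steps, lr_num_warmup_steps


print(infer_step_counts(n_train=1000, batch_size=32, max_epochs=100, lr_frac_warmup_steps=0.01))
```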

validation_batch_size : int = 32¶

weight_decay : float = 0.01¶

class EventStream.transformer.config.Split(value)[source]¶

Bases: StrEnum

What data split is being used.

HELD_OUT = 'held_out'¶

The held out test set split.

Also often called “test”.

TRAIN = 'train'¶: The train split.

TUNING = 'tuning'¶

The hyperparameter tuning split.

Also often called “dev”, “validation”, or “val”.

class EventStream.transformer.config.StructuredEventProcessingMode(value)[source]¶

Bases: StrEnum

Structured event sequence processing modes.

CONDITIONALLY_INDEPENDENT = 'conditionally_independent'¶: Intra-event covariates are independent of one another, conditioned on history.

NESTED_ATTENTION = 'nested_attention'¶: Intra-event covariates are predicted according to a user-specified intra-event dependency chain.

class EventStream.transformer.config.StructuredTransformerConfig[source]¶

Bases: PretrainedConfig

The configuration class for Event Stream GPT models.

It is used to instantiate a Transformer model according to the specified arguments. Depending on the use of the model, some parameters will be unused. For example, measurements_per_generative_mode and parameters in the Model Output Config section are only used for generative tasks, not fine-tuning tasks.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information. Of particular interest, note that all PretrainedConfig objects inherit the following properties, to be used for fine-tuning tasks:

finetuning_task (str, optional) — Name of the task used to fine-tune the model. This can be used when converting from an original (TensorFlow or PyTorch) checkpoint.
id2label (Dict[int, str], optional) — A map from index (for instance prediction index, or target index) to label.
label2id (Dict[str, int], optional) — A map from label to index for the model.
num_labels (int, optional) — Number of labels to use in the last layer added to the model, typically for a classification task.
task_specific_params (Dict[str, Any], optional) — Additional keyword arguments to store for the current task.
problem_type (str, optional) — Problem type for fine-tuning models. Can be one of “regression”, “single_label_classification” or “multi_label_classification”.

Parameters:¶

vocab_sizes_by_measurement: dict[str, int] | None = None¶

The size of the vocabulary per data type.

vocab_offsets_by_measurement: dict[str, int] | None = None¶

The vocab offset per data type.

measurement_configs: dict[str, MeasurementConfig] | None = None¶

A map per measurement to the fit, pre-processed configuration object for that measurement. Used only during generation.

measurements_idxmap: dict[str, dict[Hashable, int]] | None = None¶

A map per measurement of the integer index corresponding to that measurement in the unified measurements vocabulary.

measurements_per_generative_mode: dict[DataModality, list[str]] | None = None¶

Which measurements (by str name) are generated in which mode.

event_types_idxmap: dict[str, int] | None = None¶

A map of the integer index corresponding to each event type.

measurements_per_dep_graph_level: list[list[str | tuple[str, MeasIndexGroupOptions]]] | None = None¶

A list of the measurements (by name) and whether the categorical values, the numerical values, or both associated values of that measurement are used in each dependency graph level. By default, this assumes the dependency graph has exactly one non-whole-event level and uses that to predict the entirety of the event contents.

max_seq_len: int = 256¶

The maximum sequence length for the model.

do_split_embeddings: bool = False¶

Whether or not embeddings should be split into separate categorical and numerical embedding layers, or all embedded jointly. See DataEmbeddingLayer for more information.

categorical_embedding_dim: int | None = None¶

If specified, the input embedding layer will use a split embedding layer, with one embedding for categorical data and one for continuous data. The embedding dimension for the categorical data will be this value. In this case, numerical_embedding_dim must also be specified.

numerical_embedding_dim: int | None = None¶

If specified, the input embedding layer will use a split embedding layer, with one embedding for categorical data and one for continuous data. The embedding dimension for the continuous data will be this value. In this case, categorical_embedding_dim must also be specified.

static_embedding_mode: StaticEmbeddingMode = StaticEmbeddingMode.SUM_ALL¶

Specifies how the static embeddings are combined with dynamic embeddings. Options and their effects are described in the StaticEmbeddingMode documentation.

static_embedding_weight: float = 0.5¶

The relative weight of the static embedding in the combined embedding. Only used if the static_embedding_mode is not StaticEmbeddingMode.DROP.

dynamic_embedding_weight: float = 0.5¶

The relative weight of the dynamic embedding in the combined embedding. Only used if the static_embedding_mode is not StaticEmbeddingMode.DROP.

categorical_embedding_weight: float = 0.5¶

The relative weight of the categorical embedding in the combined embedding. Only used if categorical_embedding_dim and numerical_embedding_dim are not None.

numerical_embedding_weight: float = 0.5¶

The relative weight of the numerical embedding in the combined embedding. Only used if categorical_embedding_dim and numerical_embedding_dim are not None.

do_normalize_by_measurement_index: bool = False¶

If True, the input embeddings are normalized such that each unique measurement index contributes equally to the embedding.

do_use_learnable_sinusoidal_ATE: bool = False¶

If True, then the model will produce temporal position embeddings via a sinusoidal position embedding whose frequencies are learnable, rather than fixed and regular.

structured_event_processing_mode: StructuredEventProcessingMode = StructuredEventProcessingMode.CONDITIONALLY_INDEPENDENT¶

Specifies how the internal structure of each event is processed by the model. Can be either:

StructuredEventProcessingMode.NESTED_ATTENTION: In this case, the whole-event embeddings are first processed via a sequential encoder into historical embeddings, then the intra-event dependency graph elements are processed via a second sequential encoder alongside the relevant historical embedding. Sequential processing types are either full attention / MLP blocks or just self attention layers, as controlled by do_full_block_in_seq_attention and do_full_block_in_dep_graph_attention.
StructuredEventProcessingMode.CONDITIONALLY_INDEPENDENT: In this case, the input dependency graph embedding elements are all summed and processed as a single event sequence, with each event’s output embedding being used to simultaneously predict all elements of the subsequent event (thereby treating them all as conditionally independent). In this case, the following parameters should all be None:
- measurements_per_dep_graph_level
- do_full_block_in_seq_attention
- do_full_block_in_dep_graph_attention
- dep_graph_attention_types
- dep_graph_window_size
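
The constraint on the conditionally independent mode can be checked mechanically. The sketch below scans a dictionary of keyword arguments and reports which of the listed parameters are set when they should be None. It is a hypothetical pre-flight check written for this illustration; how the library itself reacts to such values (for example, raising or overwriting them) is not shown here.

```python
# Parameters that the text above says should be None in
# conditionally independent mode.
NESTED_ONLY_PARAMS = (
    "measurements_per_dep_graph_level",
    "do_full_block_in_seq_attention",
    "do_full_block_in_dep_graph_attention",
    "dep_graph_attention_types",
    "dep_graph_window_size",
)


def unexpected_nested_params(config_kwargs: dict) -> list:
    """Return the nested-attention-only parameters that are not None."""
    mode = config_kwargs.get(
        "structured_event_processing_mode", "conditionally_independent"
    )
    if mode != "conditionally_independent":
        return []
    return [p for p in NESTED_ONLY_PARAMS if config_kwargs.get(p) is not None]


kwargs = {
    "structured_event_processing_mode": "conditionally_independent",
    "dep_graph_window_size": 2,
    "hidden_size": 256,
}
print(unexpected_nested_params(kwargs))
```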

hidden_size: int | None = None¶

The hidden size of the model. Must be consistent with head_dim, if specified.

head_dim: int | None = 64¶

The hidden size per attention head. Useful for hyperparameter tuning to avoid setting infeasible hidden sizes. Must be consistent with hidden_size, if specified.

num_hidden_layers: int = 2¶

Number of encoder layers.

num_attention_heads: int = 4¶

Number of attention heads for each attention layer in the Transformer encoder.
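
One natural reading of "consistent" for hidden_size and head_dim is that hidden_size equals head_dim times num_attention_heads. The sketch below encodes that reading; it is an assumption about the relationship rather than a copy of the library's validation, and `resolve_hidden_size` is a hypothetical name.

```python
def resolve_hidden_size(hidden_size, head_dim, num_attention_heads):
    """Infer or validate hidden_size, assuming hidden_size = head_dim * num_heads."""
    if hidden_size is None and head_dim is None:
        raise ValueError("Set at least one of hidden_size and head_dim.")
    if hidden_size is None:
        # Tuning over head_dim always yields a size divisible by the head count.
        return head_dim * num_attention_heads
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads.")
    if head_dim is not None and head_dim * num_attention_heads != hidden_size:
        raise ValueError("hidden_size and head_dim are inconsistent.")
    return hidden_size


print(resolve_hidden_size(None, head_dim=64, num_attention_heads=4))
```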

seq_attention_types: AttentionLayerType | list[AttentionLayerType] | list[tuple[list[AttentionLayerType], int]] | None = None¶

The type of attention for each sequence self attention layer.

seq_window_size: int = 32¶

The window size used in local attention for sequence self attention layers.

dep_graph_attention_types: AttentionLayerType | list[AttentionLayerType] | list[tuple[list[AttentionLayerType], int]] | None = None¶

The type of attention for each dependency graph self attention layer. Defaults to global attention, as dependency graphs are in general much shorter than sequences.

dep_graph_window_size: int | None = 2¶

The window size used in local attention for dependency graph self attention layers. Default is set much lower as dependency graphs are in general much shorter than sequences.

do_full_block_in_seq_attention: bool | None = False¶

If True, use a full attention block (including layer normalization and MLP layers) for the sequence processing module. If False, just use a self attention layer.

do_full_block_in_dep_graph_attention: bool | None = True¶

If True, use a full attention block (including layer normalization and MLP layers) for the dependency graph processing module. If False, just use a self attention layer.

intermediate_size: int = 32¶

Dimension of the “intermediate” (often named feed-forward) layer in the encoder.

activation_function: str = 'gelu'¶

The non-linear activation function (function or string) in the encoder. If string, "gelu" and "relu" are supported.

input_dropout: float = 0.1¶

The dropout probability for the input layer.

attention_dropout: float = 0.1¶

The dropout probability for the attention probabilities.

resid_dropout: float = 0.1¶

The dropout probability used on the residual connections.

layer_norm_epsilon: float = 1e-05¶

The epsilon used by the layer normalization layers.

init_std: float = 0.02¶

The standard deviation of the truncated normal weight initialization distribution.

TTE_generation_layer_type: TimeToEventGenerationHeadType = 'exponential'¶

What kind of TTE generation layer to use.

TTE_lognormal_generation_num_components: int | None = None¶

If the TTE generation layer is 'log_normal_mixture', this specifies the number of mixture components to include. Must be None if TTE_generation_layer_type == 'exponential'.

mean_log_inter_event_time_min: float | None = None¶

The mean of the log of the time between events in the underlying data. Used for normalizing TTE predictions. Must be None if TTE_generation_layer_type == 'exponential'.

std_log_inter_event_time_min: float | None = None¶

The standard deviation of the log of the time between events in the underlying data. Used for normalizing TTE predictions. Must be None if TTE_generation_layer_type == 'exponential'.

use_cache: bool = True¶

Whether to use the past key/values attentions (if applicable to the model) to speed up decoding.

Raises:¶

ValueError – If configuration parameters are not fully self-consistent.

expand_attention_types_params(attention_types: AttentionLayerType | list[AttentionLayerType] | list[tuple[list[AttentionLayerType], int]]) → list[AttentionLayerType][source]¶: Expands the attention syntax from the easy-to-enter syntax to one for the model.
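
The three accepted input shapes in the signature suggest an expansion along the following lines. This is a standalone sketch under assumptions: a single value is repeated once per layer, a flat list is used as given, and each (pattern, repeats) tuple contributes the pattern repeated that many times. The function name and the explicit `num_hidden_layers` argument are hypothetical; the real method may differ in detail.

```python
def expand_attention_types(attention_types, num_hidden_layers: int) -> list:
    """Expand a compact attention-type spec into one entry per layer."""
    if isinstance(attention_types, str):
        # A single type (StrEnum members are strings) applies to every layer.
        return [attention_types] * num_hidden_layers
    expanded = []
    for item in attention_types:
        if isinstance(item, str):
            # A flat list already names one type per layer.
            expanded.append(item)
        else:
            # A (pattern, repeats) pair, e.g. (["global", "local"], 2).
            pattern, repeats = item
            expanded.extend(list(pattern) * repeats)
    return expanded


print(expand_attention_types([(["global", "local"], 2)], num_hidden_layers=4))
print(expand_attention_types("global", num_hidden_layers=2))
```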

classmethod from_dict(*args, **kwargs) → StructuredTransformerConfig[source]¶

Instantiates a PretrainedConfig from a Python dictionary of parameters.

Parameters:¶

config_dict : Dict[str, Any]: Dictionary that will be used to instantiate the configuration object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the get_config_dict method.
**kwargs¶: Additional parameters from which to initialize the configuration object.

Returns:¶

The configuration object instantiated from those parameters.

Return type:¶

PretrainedConfig

measurements_for(modality: DataModality) → list[str][source]¶

set_to_dataset(dataset: PytorchDataset)[source]¶: Set various configuration parameters to match dataset.

to_dict() → dict[str, Any][source]¶

Serializes this instance to a Python dictionary.

Returns:¶: Dictionary of all the attributes that make up this configuration instance.
Return type:¶: Dict[str, Any]

class EventStream.transformer.config.TimeToEventGenerationHeadType(value)[source]¶

Bases: StrEnum

Options for model TTE generation heads.

EXPONENTIAL = 'exponential'¶: TTE is modeled by an exponential distribution with a model-determined rate parameter.

LOG_NORMAL_MIXTURE = 'log_normal_mixture'¶: TTE is modeled by a mixture of log-normal distributions.
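
For reference, the log-densities of the two distribution families named above can be written in a few lines. The sketch uses only the standard textbook formulas; the function names and argument layout are hypothetical and say nothing about how EventStream parameterizes its generation heads.

```python
import math


def exponential_log_prob(t: float, rate: float) -> float:
    """Log-density of an exponential distribution with the given rate at time t."""
    return math.log(rate) - rate * t


def log_normal_mixture_log_prob(t: float, weights, mus, sigmas) -> float:
    """Log-density of a mixture of log-normal distributions at time t > 0."""
    log_t = math.log(t)
    terms = []
    for w, mu, sigma in zip(weights, mus, sigmas):
        # Log-normal log-density: a normal density in log(t), divided by t.
        log_pdf = (
            -log_t
            - math.log(sigma)
            - 0.5 * math.log(2 * math.pi)
            - 0.5 * ((log_t - mu) / sigma) ** 2
        )
        terms.append(math.log(w) + log_pdf)
    # Log-sum-exp over components for numerical stability.
    m = max(terms)
    return m + math.log(sum(math.exp(x - m) for x in terms))


print(exponential_log_prob(1.0, rate=1.0))
print(log_normal_mixture_log_prob(2.0, [0.3, 0.7], [0.0, 1.0], [1.0, 0.5]))
```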