EventStream.transformer.config module¶
Core EventStream GPT model configuration classes.
- EventStream.transformer.config.MEAS_INDEX_GROUP_T¶
The type of acceptable measurement index group option specifications.
- EventStream.transformer.config.ATTENTION_TYPES_LIST_T¶
The type of acceptable attention type configuration options.
- class EventStream.transformer.config.AttentionLayerType(value)[source]¶
Bases:
StrEnum
Attention layer type options.
-
GLOBAL =
'global'¶ Attention is global over all sequence elements (respecting a causal mask).
-
LOCAL =
'local'¶ Attention is limited to a local window of a config-determined size.
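The difference between the two options can be sketched as a boolean attention mask. This is a minimal, library-free illustration; the function name `causal_attention_mask` and the exact window convention (the current element plus the `window_size - 1` elements before it) are assumptions, not the module's implementation.

```python
from __future__ import annotations


def causal_attention_mask(
    seq_len: int, attention_type: str, window_size: int = 32
) -> list[list[bool]]:
    """Return mask[i][j] == True where query position i may attend to key j."""
    if attention_type not in ("global", "local"):
        raise ValueError(f"Unknown attention type: {attention_type!r}")

    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Causal constraint: never attend to future sequence elements.
            allowed = j <= i
            if attention_type == "local":
                # Assumed convention: the window covers the current element
                # and the (window_size - 1) elements immediately before it.
                allowed = allowed and (i - j) < window_size
            row.append(allowed)
        mask.append(row)
    return mask


global_mask = causal_attention_mask(4, "global")
local_mask = causal_attention_mask(4, "local", window_size=2)
```

With a window of 2, the last position in `local_mask` can see only itself and its immediate predecessor, while the same position in `global_mask` sees the whole prefix.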
- class EventStream.transformer.config.Averaging(value)[source]¶
Bases:
StrEnum
Describes the different ways metric values can be averaged in multi-class or multi-label settings.
-
MACRO =
'macro'¶ Macro-averaging; Metrics across different labels are averaged without regard for label frequency.
-
MICRO =
'micro'¶ Micro-averaging; Metrics are computed once over the statistics pooled across all labels, so more frequent labels contribute more.
-
WEIGHTED =
'weighted'¶ Weighted-averaging; Metrics across different labels are averaged weighted by label/class frequency.
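A small worked example may help separate the three options. The sketch below computes precision under each averaging scheme using the usual definitions (the ones `torchmetrics` documents for its `average` argument); the label names and counts are made up for illustration.

```python
from __future__ import annotations

# Hypothetical per-label counts: (true positives, false positives, support).
counts = {
    "common_label": (90, 10, 100),
    "rare_label": (1, 1, 10),
}

# Precision computed separately for each label.
per_label_precision = {
    label: tp / (tp + fp) for label, (tp, fp, _) in counts.items()
}

# Macro: plain mean of the per-label values, ignoring label frequency.
macro = sum(per_label_precision.values()) / len(per_label_precision)

# Micro: pool the raw counts across labels, then compute the metric once.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
micro = total_tp / (total_tp + total_fp)

# Weighted: mean of the per-label values, weighted by label support.
total_support = sum(support for _, _, support in counts.values())
weighted = (
    sum(per_label_precision[label] * support for label, (_, _, support) in counts.items())
    / total_support
)
```

Here the rare label pulls the macro average down much more than it affects the micro or weighted averages, which is the practical reason to choose between them.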
- class EventStream.transformer.config.MetricCategories(value)[source]¶
Bases:
StrEnum
Describes different categories of metrics.
Used for configuring what metrics to track.
-
CLASSIFICATION =
'classification'¶ Track metrics for generative prediction of classification metrics.
-
LOSS_PARTS =
'loss_parts'¶ Track the different loss components.
-
REGRESSION =
'regression'¶ Track metrics for generative prediction of regression metrics.
-
TTE =
'TTE'¶ Track metrics related to time-to-event prediction.
- class EventStream.transformer.config.Metrics(value)[source]¶
Bases:
StrEnum
Describes the different supported metric functions.
-
ACCURACY =
'accuracy'¶ Raw accuracy.
-
AUPRC =
'AUPRC'¶ The area under the precision recall curve.
Also commonly referred to as “Average Precision”.
-
AUROC =
'AUROC'¶ The area under the receiver operating characteristic.
Also commonly called “AUC”.
-
EXPLAINED_VARIANCE =
'explained_variance'¶ The extent to which the predicted regression label explains the variance in the true label.
-
MSE =
'MSE'¶ The mean squared error between predicted and true regression labels.
-
MSLE =
'MSLE'¶ The mean squared log error between predicted and true regression labels.
- class EventStream.transformer.config.MetricsConfig(n_auc_thresholds: int | None = 50, do_skip_all_metrics: bool = False, do_validate_args: bool = False, include_metrics: dict[str, ~typing.Any] = <factory>)[source]¶
Bases:
JSONableMixin
An overall configuration for what metrics should be tracked.
- Parameters:¶
- n_auc_thresholds
The number of thresholds to be used when computing AUROCs, for memory efficiency.
- do_skip_all_metrics
If True, all metrics will be skipped by the model. This can save significant time.
- do_validate_args
If True, torchmetrics metrics objects will validate their arguments during computation. This costs time.
- include_metrics
A dictionary detailing what metrics should be tracked over what splits, for what measurements, in what ways. If do_skip_all_metrics, this will be silently overwritten with {}. The format for this dictionary is as follows. The outermost level of keys is splits. Within each split, there is another dictionary, whose keys are metric categories that should be tracked in some form on that split. Each metric category maps to either the boolean True, in which case that metric category should be tracked across all relevant metrics, or to a dictionary mapping metric functions to either the boolean True, indicating they should be tracked over all relevant weightings, or to a list of weightings which should be tracked.
-
do_log(split: Split, cat: MetricCategories, metric_name: str | None =
None) bool[source]¶ Returns True if metric_name should be tracked for split and cat.
-
do_log_any(cat: MetricCategories, metric_name: str | None =
None) bool[source]¶ Returns True if metric_name should be tracked for cat and any split.
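The nested include_metrics format described above can be sketched as a plain dictionary. The example below uses string keys matching the enum values on this page; the helper `should_log` is a hypothetical, simplified stand-in for do_log and is not the library's implementation (in particular, how metric names are matched is an assumption).

```python
from __future__ import annotations

from typing import Any, Optional

# split -> metric category -> True | {metric function -> True | [weightings]}
include_metrics: dict[str, Any] = {
    "train": {"loss_parts": True},
    "tuning": {
        "loss_parts": True,
        "classification": {
            "AUROC": ["macro", "weighted"],  # only these weightings
            "accuracy": True,  # all relevant weightings
        },
    },
}


def should_log(
    include: dict[str, Any],
    split: str,
    cat: str,
    metric_name: Optional[str] = None,
) -> bool:
    """Hypothetical check: is metric_name tracked for this split and category?"""
    categories = include.get(split, {})
    if cat not in categories:
        return False
    spec = categories[cat]
    # True means "track every relevant metric in this category".
    if spec is True or metric_name is None:
        return True
    return metric_name in spec
```

For instance, `should_log(include_metrics, "tuning", "classification", "AUROC")` is true under this sketch, while nothing in the classification category is tracked on the train split.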
-
class EventStream.transformer.config.OptimizationConfig(init_lr: float =
0.01, end_lr: float | None =None, end_lr_frac_of_init_lr: float | None =0.001, max_epochs: int =100, batch_size: int =32, validation_batch_size: int =32, lr_frac_warmup_steps: float | None =0.01, lr_num_warmup_steps: int | None =None, max_training_steps: int | None =None, lr_decay_power: float =1.0, weight_decay: float =0.01, patience: int | None =None, gradient_accumulation: int | None =None, num_dataloader_workers: int =0)[source]¶ Bases:
JSONableMixin
Configuration for optimization variables for training a model.
- Parameters:¶
- init_lr: float =
0.01¶ The initial learning rate used by the optimizer. Given warmup is used, this will be the peak learning rate after the warmup period.
- end_lr: float | None =
None¶ The final learning rate at the end of all learning rate decay.
- end_lr_frac_of_init_lr: float | None =
0.001¶ The fraction of the initial learning rate that the end learning rate should be. Must be consistent with end_lr when both are set. If only one is set, the other will be inferred upon initialization. This is largely useful during hyperparameter tuning, to avoid sampling hyperparameters where end_lr is larger than init_lr, which is not compatible with the supported learning rate scheduler.
- max_epochs: int =
100¶ The maximum number of training epochs.
- batch_size: int =
32¶ The batch size used during stochastic gradient descent.
- validation_batch_size: int =
32¶ The batch size used during evaluation.
- lr_frac_warmup_steps: float | None =
0.01¶ What fraction of the total training steps should be spent increasing the learning rate during the learning rate warmup period. Should not be set simultaneously with lr_num_warmup_steps. This is largely used in the set_to_dataset function, which initializes missing parameters given the dataset size, such as inferring max_training_steps and setting lr_num_warmup_steps given this parameter and the inferred max_training_steps.
- lr_num_warmup_steps: int | None =
None¶ How many training steps should be spent on learning rate warmup. If this is set, then lr_frac_warmup_steps should be set to None, and lr_frac_warmup_steps will be properly inferred during set_to_dataset.
- max_training_steps: int | None =
None¶ The maximum number of training steps the system will run for given max_epochs, batch_size, and the size of the used dataset (as inferred via set_to_dataset). Generally should not be set at initialization.
- lr_decay_power: float =
1.0¶ The decay power in the learning rate polynomial decay with warmup. 1.0 corresponds to linear decay.
- weight_decay: float =
0.01¶ The L2 weight regularization penalty that is applied during training.
- patience: int | None =
None¶ The number of epochs to wait before early stopping if the validation loss does not improve. If None, early stopping is not used.
- gradient_accumulation: int | None =
None¶ The number of gradient accumulation steps to use. If None, gradient accumulation is not used.
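To make the roles of init_lr, end_lr, the warmup steps, and lr_decay_power concrete, here is a minimal sketch of a polynomial decay schedule with linear warmup. It assumes the same shape as Hugging Face's `get_polynomial_decay_schedule_with_warmup`; the function name `polynomial_decay_lr` is hypothetical, and whether the library uses exactly this schedule is an assumption.

```python
def polynomial_decay_lr(
    step: int,
    *,
    init_lr: float,
    end_lr: float,
    num_warmup_steps: int,
    max_training_steps: int,
    lr_decay_power: float = 1.0,
) -> float:
    """Learning rate at a given training step (assumed schedule shape)."""
    if step < num_warmup_steps:
        # Linear warmup from 0 up to the peak learning rate init_lr.
        return init_lr * step / max(1, num_warmup_steps)
    if step >= max_training_steps:
        # Decay is finished; hold the final learning rate.
        return end_lr
    # Fraction of the decay phase still remaining, in (0, 1].
    remaining = 1 - (step - num_warmup_steps) / (max_training_steps - num_warmup_steps)
    # lr_decay_power == 1.0 gives a straight line from init_lr to end_lr.
    return (init_lr - end_lr) * remaining**lr_decay_power + end_lr


schedule_kwargs = dict(
    init_lr=0.01, end_lr=1e-5, num_warmup_steps=10, max_training_steps=110
)
peak = polynomial_decay_lr(10, **schedule_kwargs)
midpoint = polynomial_decay_lr(60, **schedule_kwargs)
final = polynomial_decay_lr(110, **schedule_kwargs)
```

Under this shape, a schedule with end_lr larger than init_lr would increase rather than decay, which is consistent with the incompatibility noted for end_lr_frac_of_init_lr.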
- Raises:¶
ValueError – If end_lr, init_lr, and end_lr_frac_of_init_lr are not consistent, or if end_lr and end_lr_frac_of_init_lr are both unset.
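The consistency rule between end_lr and end_lr_frac_of_init_lr can be sketched as follows. This is an illustration of the documented behaviour, not the class's code; `resolve_end_lr` is a hypothetical helper and the tolerance used by `math.isclose` is an assumption.

```python
from __future__ import annotations

import math
from typing import Optional


def resolve_end_lr(
    init_lr: float,
    end_lr: Optional[float] = None,
    end_lr_frac_of_init_lr: Optional[float] = None,
) -> tuple[float, float]:
    """Return (end_lr, end_lr_frac_of_init_lr), inferring whichever is unset."""
    if end_lr is None and end_lr_frac_of_init_lr is None:
        raise ValueError("Set at least one of end_lr and end_lr_frac_of_init_lr.")
    if end_lr is None:
        end_lr = end_lr_frac_of_init_lr * init_lr
    elif end_lr_frac_of_init_lr is None:
        end_lr_frac_of_init_lr = end_lr / init_lr
    elif not math.isclose(end_lr, end_lr_frac_of_init_lr * init_lr):
        raise ValueError(
            f"end_lr={end_lr} is inconsistent with "
            f"end_lr_frac_of_init_lr={end_lr_frac_of_init_lr} and init_lr={init_lr}."
        )
    return end_lr, end_lr_frac_of_init_lr


# The documented defaults: init_lr=0.01, end_lr=None, end_lr_frac_of_init_lr=0.001.
default_end_lr, default_frac = resolve_end_lr(0.01, None, 0.001)
```

Sampling only init_lr and end_lr_frac_of_init_lr (with the fraction below 1) during tuning guarantees that the resolved end_lr never exceeds init_lr.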
- set_to_dataset(dataset: PytorchDataset)[source]¶
Sets parameters in the config to appropriate values given dataset.
Some parameters for optimization are dependent upon the total size of the dataset (e.g., converting between a fraction of training and a concrete number of steps). This function sets these parameters based on dataset’s size.
- Parameters:¶
- dataset: PytorchDataset¶
The dataset to set the internal parameters to.
- Raises:¶
ValueError – If the setting process does not yield consistent results.
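The kind of inference set_to_dataset performs can be sketched with plain arithmetic. The helper name `infer_training_steps`, the use of ceiling division for steps per epoch, and the rounding of the warmup step count are all assumptions made for illustration; the library may handle partial batches, gradient accumulation, and rounding differently.

```python
import math


def infer_training_steps(
    n_samples: int,
    batch_size: int,
    max_epochs: int,
    lr_frac_warmup_steps: float,
) -> tuple[int, int]:
    """Return (max_training_steps, lr_num_warmup_steps) for a dataset size."""
    # Assumed: the last partial batch of each epoch still counts as a step.
    steps_per_epoch = math.ceil(n_samples / batch_size)
    max_training_steps = steps_per_epoch * max_epochs
    # Convert the warmup fraction into a concrete number of steps.
    lr_num_warmup_steps = int(round(lr_frac_warmup_steps * max_training_steps))
    return max_training_steps, lr_num_warmup_steps


# Hypothetical dataset of 1000 samples with the documented defaults
# batch_size=32, max_epochs=100, lr_frac_warmup_steps=0.01.
total_steps, warmup_steps = infer_training_steps(1000, 32, 100, 0.01)
```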
- class EventStream.transformer.config.Split(value)[source]¶
Bases:
StrEnum
What data split is being used.
-
HELD_OUT =
'held_out'¶ The held out test set split.
Also often called “test”.
-
TRAIN =
'train'¶ The train split.
-
TUNING =
'tuning'¶ The hyperparameter tuning split.
Also often called “dev”, “validation”, or “val”.
- class EventStream.transformer.config.StructuredEventProcessingMode(value)[source]¶
Bases:
StrEnum
Structured event sequence processing modes.
-
CONDITIONALLY_INDEPENDENT =
'conditionally_independent'¶ Intra-event covariates are independent of one another, conditioned on history.
-
NESTED_ATTENTION =
'nested_attention'¶ Intra-event covariates are predicted according to a user-specified intra-event dependency chain.
-
class EventStream.transformer.config.StructuredTransformerConfig(vocab_sizes_by_measurement: dict[str, int] | None =
None, vocab_offsets_by_measurement: dict[str, int] | None =None, measurement_configs: dict[str, MeasurementConfig] | None =None, measurements_idxmap: dict[str, dict[Hashable, int]] | None =None, measurements_per_generative_mode: dict[DataModality, list[str]] | None =None, event_types_idxmap: dict[str, int] | None =None, measurements_per_dep_graph_level: list[list[str | tuple[str, MeasIndexGroupOptions]]] | None =None, max_seq_len: int =256, do_split_embeddings: bool =False, categorical_embedding_dim: int | None =None, numerical_embedding_dim: int | None =None, static_embedding_mode: StaticEmbeddingMode =StaticEmbeddingMode.SUM_ALL, static_embedding_weight: float =0.5, dynamic_embedding_weight: float =0.5, categorical_embedding_weight: float =0.5, numerical_embedding_weight: float =0.5, do_normalize_by_measurement_index: bool =False, do_use_learnable_sinusoidal_ATE: bool =False, structured_event_processing_mode: StructuredEventProcessingMode =StructuredEventProcessingMode.CONDITIONALLY_INDEPENDENT, hidden_size: int | None =None, head_dim: int | None =64, num_hidden_layers: int =2, num_attention_heads: int =4, seq_attention_types: AttentionLayerType | list[AttentionLayerType] | list[tuple[list[AttentionLayerType], int]] | None =None, seq_window_size: int =32, dep_graph_attention_types: AttentionLayerType | list[AttentionLayerType] | list[tuple[list[AttentionLayerType], int]] | None =None, dep_graph_window_size: int | None =2, intermediate_size: int =32, activation_function: str ='gelu', attention_dropout: float =0.1, input_dropout: float =0.1, resid_dropout: float =0.1, init_std: float =0.02, layer_norm_epsilon: float =1e-05, do_full_block_in_dep_graph_attention: bool | None =True, do_full_block_in_seq_attention: bool | None =False, TTE_generation_layer_type: TimeToEventGenerationHeadType ='exponential', TTE_lognormal_generation_num_components: int | None =None, mean_log_inter_event_time_min: float | None =None, std_log_inter_event_time_min: float | None =None, 
use_cache: bool =True, **kwargs)[source]¶ Bases:
PretrainedConfig
The configuration class for Event Stream GPT models.
It is used to instantiate a Transformer model according to the specified arguments. Depending on the use of the model, some parameters will be unused. For example, measurements_per_generative_mode and parameters in the Model Output Config section are only used for generative tasks, not fine-tuning tasks.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information. Of particular interest, note that all PretrainedConfig objects inherit the following properties, to be used for fine-tuning tasks:
finetuning_task (str, optional) — Name of the task used to fine-tune the model. This can be used when converting from an original (TensorFlow or PyTorch) checkpoint.
id2label (Dict[int, str], optional) — A map from index (for instance prediction index, or target index) to label.
label2id (Dict[str, int], optional) — A map from label to index for the model.
num_labels (int, optional) — Number of labels to use in the last layer added to the model, typically for a classification task.
task_specific_params (Dict[str, Any], optional) — Additional keyword arguments to store for the current task.
problem_type (str, optional) — Problem type for fine-tuning models. Can be one of “regression”, “single_label_classification” or “multi_label_classification”.
- Parameters:¶
- vocab_sizes_by_measurement: dict[str, int] | None =
None¶ The size of the vocabulary per data type.
- vocab_offsets_by_measurement: dict[str, int] | None =
None¶ The vocab offset per data type.
- measurement_configs: dict[str, MeasurementConfig] | None =
None¶ A map per measurement to the fit, pre-processed configuration object for that measurement. Used only during generation.
- measurements_idxmap: dict[str, dict[Hashable, int]] | None =
None¶ A map per measurement of the integer index corresponding to that measurement in the unified measurements vocabulary.
- measurements_per_generative_mode: dict[DataModality, list[str]] | None =
None¶ Which measurements (by str name) are generated in which mode.
- event_types_idxmap: dict[str, int] | None =
None¶ A map of the integer index corresponding to each event type.
- measurements_per_dep_graph_level: list[list[str | tuple[str, MeasIndexGroupOptions]]] | None =
None¶ A list of the measurements (by name) and whether or not categorical, numerical, or both associated values of that measurement are used in each dependency graph level. At the default, this assumes the dependency graph has exactly one non-whole-event level and uses that to predict the entirety of the event contents.
- max_seq_len: int =
256¶ The maximum sequence length for the model.
- do_split_embeddings: bool =
False¶ Whether or not embeddings should be split into separate categorical and numerical embedding layers, or all embedded jointly. See DataEmbeddingLayer for more information.
- categorical_embedding_dim: int | None =
None¶ If specified, the input embedding layer will use a split embedding layer, with one embedding for categorical data and one for continuous data. The embedding dimension for the categorical data will be this value. In this case, numerical_embedding_dim must be specified.
- numerical_embedding_dim: int | None =
None¶ If specified, the input embedding layer will use a split embedding layer, with one embedding for categorical data and one for continuous data. The embedding dimension for the continuous data will be this value. In this case, categorical_embedding_dim must be specified.
- static_embedding_mode: StaticEmbeddingMode =
StaticEmbeddingMode.SUM_ALL¶ Specifies how the static embeddings are combined with dynamic embeddings. Options and their effects are described in the StaticEmbeddingMode documentation.
- static_embedding_weight: float =
0.5¶ The relative weight of the static embedding in the combined embedding. Only used if the static_embedding_mode is not StaticEmbeddingMode.DROP.
- dynamic_embedding_weight: float =
0.5¶ The relative weight of the dynamic embedding in the combined embedding. Only used if the static_embedding_mode is not StaticEmbeddingMode.DROP.
- categorical_embedding_weight: float =
0.5¶ The relative weight of the categorical embedding in the combined embedding. Only used if categorical_embedding_dim and numerical_embedding_dim are not None.
- numerical_embedding_weight: float =
0.5¶ The relative weight of the numerical embedding in the combined embedding. Only used if categorical_embedding_dim and numerical_embedding_dim are not None.
- do_normalize_by_measurement_index: bool =
False¶ If True, the input embeddings are normalized such that each unique measurement index contributes equally to the embedding.
- do_use_learnable_sinusoidal_ATE: bool =
False¶ If True, then the model will produce temporal position embeddings via a sinusoidal position embedding such that the frequencies are learnable, rather than fixed and regular.
- structured_event_processing_mode: StructuredEventProcessingMode =
StructuredEventProcessingMode.CONDITIONALLY_INDEPENDENT¶ Specifies how the contents of each event are processed internally by the model. Can be either:
  - StructuredEventProcessingMode.NESTED_ATTENTION: In this case, the whole-event embeddings are first processed via a sequential encoder into historical embeddings, then the intra-event dependency graph elements are processed via a second sequential encoder alongside the relevant historical embedding. Sequential processing types are either full attention / MLP blocks or just self attention layers, as controlled by do_full_block_in_seq_attention and do_full_block_in_dep_graph_attention.
  - StructuredEventProcessingMode.CONDITIONALLY_INDEPENDENT: In this case, the input dependency graph embedding elements are all summed and processed as a single event sequence, with each event’s output embedding being used to simultaneously predict all elements of the subsequent event (thereby treating them all as conditionally independent). In this case, the following parameters should all be None:
    - measurements_per_dep_graph_level
    - do_full_block_in_seq_attention
    - do_full_block_in_dep_graph_attention
    - dep_graph_attention_types
    - dep_graph_window_size
- hidden_size: int | None =
None¶ The hidden size of the model. Must be consistent with head_dim, if specified.
- head_dim: int | None =
64¶ The hidden size per attention head. Useful for hyperparameter tuning to avoid setting infeasible hidden sizes. Must be consistent with hidden_size, if specified.
- num_hidden_layers: int =
2¶ Number of encoder layers.
- num_attention_heads: int =
4¶ Number of attention heads for each attention layer in the Transformer encoder.
- seq_attention_types: AttentionLayerType | list[AttentionLayerType] | list[tuple[list[AttentionLayerType], int]] | None =
None¶ The type of attention for each sequence self attention layer.
- seq_window_size: int =
32¶ The window size used in local attention for sequence self attention layers.
- dep_graph_attention_types: AttentionLayerType | list[AttentionLayerType] | list[tuple[list[AttentionLayerType], int]] | None =
None¶ The type of attention for each dependency graph self attention layer. Defaults to global attention as dependency graphs are in general much shorter than sequences.
- dep_graph_window_size: int | None =
2¶ The window size used in local attention for dependency graph self attention layers. Default is set much lower as dependency graphs are in general much shorter than sequences.
- do_full_block_in_seq_attention: bool | None =
False¶ If True, use a full attention block (including layer normalization and MLP layers) for the sequence processing module. If False, just use a self attention layer.
- do_full_block_in_dep_graph_attention: bool | None =
True¶ If True, use a full attention block (including layer normalization and MLP layers) for the dependency graph processing module. If False, just use a self attention layer.
- intermediate_size: int =
32¶ Dimension of the “intermediate” (often named feed-forward) layer in encoder.
- activation_function: str =
'gelu'¶ The non-linear activation function (function or string) in the encoder. If string, "gelu" and "relu" are supported.
- input_dropout: float =
0.1¶ The dropout probability for the input layer.
- attention_dropout: float =
0.1¶ The dropout probability for the attention probabilities.
- resid_dropout: float =
0.1¶ The dropout probability used on the residual connections.
- layer_norm_epsilon: float =
1e-05¶ The epsilon used by the layer normalization layers.
- init_std: float =
0.02¶ The standard deviation of the truncated normal weight initialization distribution.
- TTE_generation_layer_type: TimeToEventGenerationHeadType =
'exponential'¶ What kind of TTE generation layer to use.
- TTE_lognormal_generation_num_components: int | None =
None¶ If the TTE generation layer is 'log_normal_mixture', this specifies the number of mixture components to include. Must be None if TTE_generation_layer_type == 'exponential'.
- mean_log_inter_event_time_min: float | None =
None¶ The mean of the log of the time between events in the underlying data. Used for normalizing TTE predictions. Must be None if TTE_generation_layer_type == 'exponential'.
- std_log_inter_event_time_min: float | None =
None¶ The standard deviation of the log of the time between events in the underlying data. Used for normalizing TTE predictions. Must be None if TTE_generation_layer_type == 'exponential'.
- use_cache: bool =
True¶ Whether to use the past key/values attentions (if applicable to the model) to speed up decoding.
- Raises:¶
ValueError – If configuration parameters are not fully self-consistent.
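Two of the self-consistency constraints described in the parameters above can be sketched as simple checks. Both helpers are hypothetical. The first assumes that "consistent" means hidden_size equals head_dim times num_attention_heads, which the page does not state explicitly; the second encodes only the documented rule that TTE_lognormal_generation_num_components must be None for the exponential head.

```python
from __future__ import annotations

from typing import Optional


def resolve_hidden_size(
    hidden_size: Optional[int], head_dim: Optional[int], num_attention_heads: int
) -> int:
    """Infer or validate hidden_size (assumed rule: head_dim * num heads)."""
    if hidden_size is None and head_dim is None:
        raise ValueError("Specify at least one of hidden_size and head_dim.")
    if hidden_size is None:
        return head_dim * num_attention_heads
    if head_dim is not None and hidden_size != head_dim * num_attention_heads:
        raise ValueError(
            f"hidden_size={hidden_size} is inconsistent with head_dim={head_dim} "
            f"and num_attention_heads={num_attention_heads}."
        )
    return hidden_size


def check_tte_args(
    tte_generation_layer_type: str, num_components: Optional[int]
) -> None:
    """Reject mixture-only arguments when the exponential head is selected."""
    if tte_generation_layer_type == "exponential" and num_components is not None:
        raise ValueError(
            "TTE_lognormal_generation_num_components must be None for the "
            "'exponential' TTE generation layer."
        )


# Documented defaults: hidden_size=None, head_dim=64, num_attention_heads=4.
default_hidden_size = resolve_hidden_size(None, 64, 4)
check_tte_args("exponential", None)
```

Tuning head_dim and num_attention_heads rather than hidden_size directly means every sampled configuration yields a hidden size that divides evenly across heads.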
- expand_attention_types_params(attention_types: AttentionLayerType | list[AttentionLayerType] | list[tuple[list[AttentionLayerType], int]]) list[AttentionLayerType][source]¶
Expands the attention syntax from the easy-to-enter syntax to one for the model.
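Based on the type signature above, the easy-to-enter syntax accepts a single attention type, a flat list of types, or a list of (pattern, repeat count) tuples. A minimal sketch of such an expansion follows; the function name `expand_attention_types`, the use of plain strings instead of AttentionLayerType members, and the behaviour for a single type (repeat it once per layer) are assumptions.

```python
from __future__ import annotations

from typing import Optional, Union

AttentionSpec = Union[str, list]


def expand_attention_types(
    attention_types: AttentionSpec, num_hidden_layers: Optional[int] = None
) -> list[str]:
    """Expand a compact attention specification into one entry per layer."""
    if isinstance(attention_types, str):
        # Assumed: a single type is applied to every layer.
        if num_hidden_layers is None:
            raise ValueError("num_hidden_layers is required for a single type.")
        return [attention_types] * num_hidden_layers

    expanded: list[str] = []
    for item in attention_types:
        if isinstance(item, str):
            # Flat list: each entry already describes one layer.
            expanded.append(item)
        else:
            # (pattern, repeats): repeat the whole pattern `repeats` times.
            pattern, repeats = item
            for _ in range(repeats):
                expanded.extend(pattern)
    return expanded


alternating = expand_attention_types([(["global", "local"], 2)])
```

The tuple form keeps configurations short for deep models: one (pattern, repeats) pair describes an alternating stack of any depth.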
- classmethod from_dict(*args, **kwargs) StructuredTransformerConfig[source]¶
Instantiates a PretrainedConfig from a Python dictionary of parameters.
- Parameters:¶
- config_dict : Dict[str, Any]
Dictionary that will be used to instantiate the configuration object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the get_config_dict method.
- **kwargs¶
Additional parameters from which to initialize the configuration object.
- Returns:¶
The configuration object instantiated from those parameters.
- Return type:¶
PretrainedConfig
- measurements_for(modality: DataModality) list[str][source]¶
- set_to_dataset(dataset: PytorchDataset)[source]¶
Set various configuration parameters to match dataset.
- class EventStream.transformer.config.TimeToEventGenerationHeadType(value)[source]¶
Bases:
StrEnum
Options for model TTE generation heads.
-
EXPONENTIAL =
'exponential'¶ TTE is modeled by an exponential distribution with a model-determined rate parameter.
-
LOG_NORMAL_MIXTURE =
'log_normal_mixture'¶ TTE is modeled by a mixture of log-normal distributions.
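The two distribution families can be compared through their log-likelihoods for an observed time-to-event. The sketch below uses only the standard library and the textbook densities; the function names and the way parameters are passed are hypothetical, and in the model these parameters would be produced by the generation head rather than supplied by hand.

```python
import math


def exponential_log_likelihood(t: float, rate: float) -> float:
    """log p(t) for an exponential distribution: log(rate) - rate * t."""
    return math.log(rate) - rate * t


def log_normal_mixture_log_likelihood(
    t: float, weights: list, means: list, stds: list
) -> float:
    """log p(t) for a mixture of log-normal components.

    weights are the mixture probabilities (assumed to sum to 1); means and
    stds parameterize log(t) for each component.
    """
    density = 0.0
    for weight, mu, sigma in zip(weights, means, stds):
        z = (math.log(t) - mu) / sigma
        # Log-normal density of one component, scaled by its mixture weight.
        density += weight * math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2 * math.pi))
    return math.log(density)


exp_ll = exponential_log_likelihood(0.5, rate=2.0)
mix_ll = log_normal_mixture_log_likelihood(1.0, [1.0], [0.0], [1.0])
```

The exponential head has a single rate parameter, while the mixture needs a weight, mean, and standard deviation per component, which is why the number of components is configured separately.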