EventStream.baseline.FT_task_baseline module

Utilities for collecting baseline performance of fine-tuning tasks defined over ESGPT datasets.

class EventStream.baseline.FT_task_baseline.BaseSklearnModuleConfig

Bases: ABC

CLS : str = '???'

SKIP_PARAMS = ['CLS', 'SKLEARN_COMPONENTS', 'SKIP_PARAMS']

SKLEARN_COMPONENTS = {'ESDFlatFeatureLoader': <class 'EventStream.baseline.FT_task_baseline.ESDFlatFeatureLoader'>, 'KNNImputer': <class 'sklearn.impute._knn.KNNImputer'>, 'MinMaxScaler': <class 'sklearn.preprocessing._data.MinMaxScaler'>, 'NMF': <class 'sklearn.decomposition._nmf.NMF'>, 'PCA': <class 'sklearn.decomposition._pca.PCA'>, 'RandomForestClassifier': <class 'sklearn.ensemble._forest.RandomForestClassifier'>, 'SelectKBest': <class 'sklearn.feature_selection._univariate_selection.SelectKBest'>, 'SimpleImputer': <class 'sklearn.impute._base.SimpleImputer'>, 'StandardScaler': <class 'sklearn.preprocessing._data.StandardScaler'>, 'mutual_info_classif': <function mutual_info_classif>}

get_model(seed: int | None = None, **additional_kwargs) → Any

property module_kwargs : dict[str, Any]
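
The attributes above suggest a pattern in which CLS names an entry in a component registry, SKIP_PARAMS lists bookkeeping names that are not constructor arguments, and module_kwargs collects the remaining fields. The following is a minimal sketch of that pattern, not the library's implementation: DummyScaler, DummyScalerConfig and COMPONENTS are hypothetical stand-ins, and forwarding the seed as random_state is an assumption.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


class DummyScaler:
    """Hypothetical stand-in for a scikit-learn estimator."""

    def __init__(self, with_mean: bool = True, random_state: int | None = None):
        self.with_mean = with_mean
        self.random_state = random_state


# Hypothetical registry playing the role of SKLEARN_COMPONENTS.
COMPONENTS: dict[str, Any] = {"DummyScaler": DummyScaler}
SKIP_PARAMS = {"CLS", "SKLEARN_COMPONENTS", "SKIP_PARAMS"}


@dataclass
class DummyScalerConfig:
    CLS: str = "DummyScaler"
    with_mean: bool = True

    @property
    def module_kwargs(self) -> dict[str, Any]:
        # Every field except the bookkeeping names becomes a constructor kwarg.
        return {k: v for k, v in asdict(self).items() if k not in SKIP_PARAMS}

    def get_model(self, seed: int | None = None, **additional_kwargs: Any) -> Any:
        kwargs = {**self.module_kwargs, **additional_kwargs}
        if seed is not None:
            # Assumption: the seed is forwarded under the name `random_state`.
            kwargs["random_state"] = seed
        return COMPONENTS[self.CLS](**kwargs)


cfg = DummyScalerConfig(with_mean=False)
model = cfg.get_model(seed=3)
```

Keeping the class name as a plain string field means the whole config stays serializable, which is presumably why the registry lookup happens only inside get_model.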

class EventStream.baseline.FT_task_baseline.ESDFlatFeatureLoader(ESD: Dataset, window_sizes: list[str], feature_inclusion_frequency: float | dict[str, float] | None = None, include_only_measurements: set[str] | None = None, convert_to_mean_var: bool = True, **kwargs)

Bases: object

A flat feature pre-processor in line with scikit-learn’s APIs.

This can dynamically apply window-size, feature-inclusion-frequency, and measurement restrictions, as well as mean variable conversions (convert_to_mean_var), to flat feature sets. Every window size configured in this featurizer must be present in the passed dataframes.

fit(flat_rep_df: DataFrame, _) → ESDFlatFeatureLoader

set_params(ESD: Dataset | None = None, window_sizes: list[str] | None = None, feature_inclusion_frequency: float | dict[str, float] | None = None, include_only_measurements: set[str] | None = None, convert_to_mean_var: bool | None = None)

transform(flat_rep_df: DataFrame) → ndarray
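
To illustrate the scikit-learn-style fit/transform shape and the requirement that every configured window size be present in the input, here is a small sketch using plain Python rows instead of dataframes. WindowColumnSelector is a hypothetical class, and the column naming scheme (window size as the first "/"-separated component) is an assumption made only for this example.

```python
from __future__ import annotations

from typing import Any


class WindowColumnSelector:
    """Hypothetical stand-in mimicking the loader's fit/transform shape."""

    def __init__(self, window_sizes: list[str]):
        self.window_sizes = list(window_sizes)

    def fit(self, rows: list[dict[str, Any]], _: Any = None) -> "WindowColumnSelector":
        # Assumption: column names start with the window size, e.g. "7d/...".
        columns = list(rows[0])
        available = {c.split("/", 1)[0] for c in columns}
        missing = [w for w in self.window_sizes if w not in available]
        if missing:
            raise ValueError(f"Window sizes missing from input: {missing}")
        # Remember which columns survive so transform() is consistent with fit().
        self.columns_ = [c for c in columns if c.split("/", 1)[0] in self.window_sizes]
        return self

    def transform(self, rows: list[dict[str, Any]]) -> list[list[Any]]:
        # Emit a dense row-major matrix restricted to the selected columns.
        return [[row[c] for c in self.columns_] for row in rows]


rows = [
    {"7d/lab/count": 3, "30d/lab/count": 9},
    {"7d/lab/count": 1, "30d/lab/count": 4},
]
features = WindowColumnSelector(["7d"]).fit(rows, None).transform(rows)
```

The second positional argument to fit is ignored, mirroring the `_` placeholder in the signature above; scikit-learn pipelines pass the labels there even to steps that do not use them.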

class EventStream.baseline.FT_task_baseline.ESDFlatFeatureLoaderConfig(CLS: str = 'ESDFlatFeatureLoader', window_sizes: list[str] | None = None, feature_inclusion_frequency: float | None = None, include_only_measurements: list[str] | None = None, convert_to_mean_var: bool = True)

Bases: BaseSklearnModuleConfig

CLS : str = 'ESDFlatFeatureLoader'

convert_to_mean_var : bool = True

feature_inclusion_frequency : float | None = None

include_only_measurements : list[str] | None = None

window_sizes : list[str] | None = None

class EventStream.baseline.FT_task_baseline.KNNImputerConfig(CLS: str = 'KNNImputer', n_neighbors: int = 5, weights: str = 'uniform', add_indicator: bool = True)

Bases: BaseSklearnModuleConfig

CLS : str = 'KNNImputer'

add_indicator : bool = True

n_neighbors : int = 5

weights : str = 'uniform'

class EventStream.baseline.FT_task_baseline.MinMaxScalerConfig(CLS: str = 'MinMaxScaler')

Bases: BaseSklearnModuleConfig

CLS : str = 'MinMaxScaler'

class EventStream.baseline.FT_task_baseline.NMFConfig(CLS: str = 'NMF', n_components: int = 2)

Bases: BaseSklearnModuleConfig

CLS : str = 'NMF'

n_components : int = 2

class EventStream.baseline.FT_task_baseline.PCAConfig(CLS: str = 'PCA', n_components: int = 2)

Bases: BaseSklearnModuleConfig

CLS : str = 'PCA'

n_components : int = 2

class EventStream.baseline.FT_task_baseline.RandomForestClassifierConfig(CLS: str = 'RandomForestClassifier', n_estimators: int = 100, criterion: str = 'gini', max_depth: int | None = None, min_samples_split: int = 2, min_samples_leaf: int = 1, min_weight_fraction_leaf: float = 0.0, max_features: str | None = 'sqrt', max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0.0, bootstrap: bool = True, oob_score: bool = False, class_weight: str | None = None, ccp_alpha: float = 0.0, max_samples: int | float | None = None)

Bases: BaseSklearnModuleConfig

CLS : str = 'RandomForestClassifier'

bootstrap : bool = True

ccp_alpha : float = 0.0

class_weight : str | None = None

criterion : str = 'gini'

max_depth : int | None = None

max_features : str | None = 'sqrt'

max_leaf_nodes : int | None = None

max_samples : int | float | None = None

min_impurity_decrease : float = 0.0

min_samples_leaf : int = 1

min_samples_split : int = 2

min_weight_fraction_leaf : float = 0.0

n_estimators : int = 100

oob_score : bool = False

class EventStream.baseline.FT_task_baseline.SelectKBestConfig(CLS: str = 'SelectKBest', k: int = 2)

Bases: BaseSklearnModuleConfig

CLS : str = 'SelectKBest'

k : int = 2

class EventStream.baseline.FT_task_baseline.SimpleImputerConfig(CLS: str = 'SimpleImputer', strategy: str = 'constant', fill_value: float = 0, add_indicator: bool = True)

Bases: BaseSklearnModuleConfig

CLS : str = 'SimpleImputer'

add_indicator : bool = True

fill_value : float = 0

strategy : str = 'constant'

class EventStream.baseline.FT_task_baseline.SklearnConfig(defaults: list[typing.Any] = <factory>, seed: int = 1, experiment_dir: str | pathlib.Path = '???', dataset_dir: str | pathlib.Path = '???', save_dir: str | pathlib.Path = '${experiment_dir}/sklearn_baselines/${task_df_name}/${finetuning_task_label}/${now:%Y-%m-%d_%H-%M-%S}', train_subset_size: int | float | str | None = None, do_overwrite: bool = False, task_df_name: str | None = '???', finetuning_task_label: str | None = '???', feature_selector: Any = '???', scaling: Any = None, imputation: Any = None, dim_reduce: Any = None, model: Any = '???', wandb_logger_kwargs: dict[str, typing.Any] = <factory>)

Bases: object

PIPELINE_COMPONENTS = ['feature_selector', 'scaling', 'imputation', 'dim_reduce', 'model']

dataset_dir : str | Path = '???'

defaults : list[Any]

dim_reduce : Any = None

do_overwrite : bool = False

experiment_dir : str | Path = '???'

feature_selector : Any = '???'

finetuning_task_label : str | None = '???'

get_model(dataset: Dataset) → Any

imputation : Any = None

model : Any = '???'

save_dir : str | Path = '${experiment_dir}/sklearn_baselines/${task_df_name}/${finetuning_task_label}/${now:%Y-%m-%d_%H-%M-%S}'

scaling : Any = None

seed : int = 1

task_df_name : str | None = '???'

train_subset_size : int | float | str | None = None

wandb_logger_kwargs : dict[str, Any]
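
PIPELINE_COMPONENTS fixes an order for the five stages, and scaling, imputation and dim_reduce default to None. One plausible reading, sketched below, is that the stages are visited in that order and unset ones are skipped. This is an assumption about the behaviour rather than a description of get_model; build_steps and example_cfg are hypothetical names, and strings stand in for the real component configs.

```python
from __future__ import annotations

from typing import Any

PIPELINE_COMPONENTS = ["feature_selector", "scaling", "imputation", "dim_reduce", "model"]


def build_steps(cfg: dict[str, Any]) -> list[tuple[str, Any]]:
    """Return (name, component) pairs in pipeline order, skipping unset stages."""
    steps: list[tuple[str, Any]] = []
    for name in PIPELINE_COMPONENTS:
        component = cfg.get(name)
        if component is None:
            # Assumption: a stage left as None is simply omitted.
            continue
        steps.append((name, component))
    return steps


# Strings stand in for component configs in this sketch.
example_cfg = {
    "feature_selector": "ESDFlatFeatureLoader",
    "scaling": None,
    "imputation": "SimpleImputer",
    "dim_reduce": None,
    "model": "RandomForestClassifier",
}
steps = build_steps(example_cfg)
```

A list of (name, component) pairs is the shape that sklearn.pipeline.Pipeline accepts for its steps argument, so a result like this could be passed on once the strings are replaced by real estimators.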

class EventStream.baseline.FT_task_baseline.StandardScalerConfig(CLS: str = 'StandardScaler')

Bases: BaseSklearnModuleConfig

CLS : str = 'StandardScaler'

EventStream.baseline.FT_task_baseline.eval_binary_classification(Y: ndarray, probs: ndarray) → dict[str, float]

EventStream.baseline.FT_task_baseline.eval_multi_class_classification(Y: ndarray, probs: ndarray, task_vocab: list[Any])

EventStream.baseline.FT_task_baseline.load_flat_rep(ESD: Dataset, window_sizes: list[str], feature_inclusion_frequency: float | dict[str, float] | None = None, include_only_measurements: set[str] | None = None, do_update_if_missing: bool = True, task_df_name: str | None = None, do_cache_filtered_task: bool = True, subjects_included: dict[str, set[int]] | None = None) → dict[str, LazyFrame]

Loads a set of flat representations from a passed dataset that satisfy the given constraints.

Parameters:

ESD: Dataset – The dataset for which the flat representations should be loaded.
window_sizes: list[str] – Beyond writing out a raw, per-event flattened representation, the dataset can also summarize these flattened representations over the historical windows specified in this argument. These are strings specifying time deltas, using this syntax: link_. Each window size is summarized to a separate directory and shares the same subject file split as the raw representation files.
feature_inclusion_frequency: float | dict[str, float] | None = None – The base feature inclusion frequency that dictates which features can be included in the flat representation. It can be a float, in which case it applies across all measurements; None, in which case no filtering is applied; or a dictionary from measurement type to a float giving a per-measurement-type inclusion cutoff.
include_only_measurements: set[str] | None = None – Measurement types can also be filtered out wholesale from both representations. If this set is not None, only these measurements will be included.
do_update_if_missing: bool = True – If True and any window sizes or features are missing, the function will try to update the stored flat representations to include them. If False and information is missing, it raises a FileNotFoundError instead.
task_df_name: str | None = None – If specified, the flat representations loaded will be (inner) joined against the task dataframe of this name on the columns "subject_id" and "end_time" (which will be renamed to "timestamp"). This avoids loading the full dataset in flattened form into memory. The name is also used as a cache key: if a pre-filtered dataset has been written to disk at the path for this task, the data will be loaded from there rather than from the base dataset.
do_cache_filtered_task: bool = True – If True, the flat representations will, after being filtered to just the relevant rows for the task, be cached to disk for faster re-use.
subjects_included: dict[str, set[int]] | None = None – A dictionary by split of the subjects to include in the task. Omitted splits are used wholesale.

Raises:

FileNotFoundError – If do_update_if_missing is False and the requested historical representations are not already written to disk.
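
The three accepted forms of feature_inclusion_frequency, combined with include_only_measurements, can be illustrated with a small normalization helper. This is a sketch under stated assumptions, not code from the library: resolve_cutoffs is a hypothetical function, the measurement names are invented, and treating a measurement absent from the dictionary as unfiltered is an assumption.

```python
from __future__ import annotations


def resolve_cutoffs(
    measurements: list[str],
    feature_inclusion_frequency: float | dict[str, float] | None = None,
    include_only_measurements: set[str] | None = None,
) -> dict[str, float | None]:
    """Map each retained measurement to its inclusion cutoff (None = no filtering)."""
    if include_only_measurements is not None:
        # Drop whole measurement types that were not requested.
        measurements = [m for m in measurements if m in include_only_measurements]

    if feature_inclusion_frequency is None:
        return {m: None for m in measurements}
    if isinstance(feature_inclusion_frequency, dict):
        # Assumption: measurements missing from the dict are left unfiltered.
        return {m: feature_inclusion_frequency.get(m) for m in measurements}
    # A single float applies to every measurement.
    return {m: float(feature_inclusion_frequency) for m in measurements}


all_measurements = ["labs", "vitals", "diagnoses"]  # hypothetical names
uniform = resolve_cutoffs(all_measurements, 0.01)
per_type = resolve_cutoffs(all_measurements, {"labs": 0.05}, {"labs", "vitals"})
```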

EventStream.baseline.FT_task_baseline.registered_sklearn_config(dataclass: Any) → Any

Decorator that allows you to use a dataclass as a Hydra config via the ConfigStore.

Adds the decorated dataclass as a Hydra StructuredConfig object to the Hydra ConfigStore. The name of the stored config in the ConfigStore is the snake case version of the CamelCase class name.
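
The CamelCase-to-snake-case naming rule can be sketched with two regular-expression passes. This is an illustrative conversion only: to_snake_case is a hypothetical helper, and the decorator's own handling of acronyms such as KNN or PCA may differ from what is shown here.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a CamelCase identifier to snake_case, keeping acronyms together."""
    # Split an acronym from a following capitalized word: "KNNImputer" -> "KNN_Imputer".
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Split a lowercase letter or digit from a following capital: "rC" -> "r_C".
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return name.lower()


stored_name = to_snake_case("StandardScalerConfig")
```

Under this sketch a class named StandardScalerConfig would be stored under the name standard_scaler_config.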

EventStream.baseline.FT_task_baseline.train_sklearn_pipeline(cfg: SklearnConfig)

EventStream.baseline.FT_task_baseline.wandb_train_sklearn(cfg: SklearnConfig)