EventStream.baseline.FT_task_baseline module

Utilities for collecting baseline performance of fine-tuning tasks defined over ESGPT datasets.

class EventStream.baseline.FT_task_baseline.BaseSklearnModuleConfig[source]

Bases: ABC

CLS : str = '???'
SKIP_PARAMS = ['CLS', 'SKLEARN_COMPONENTS', 'SKIP_PARAMS']
SKLEARN_COMPONENTS = {'ESDFlatFeatureLoader': <class 'EventStream.baseline.FT_task_baseline.ESDFlatFeatureLoader'>, 'KNNImputer': <class 'sklearn.impute._knn.KNNImputer'>, 'MinMaxScaler': <class 'sklearn.preprocessing._data.MinMaxScaler'>, 'NMF': <class 'sklearn.decomposition._nmf.NMF'>, 'PCA': <class 'sklearn.decomposition._pca.PCA'>, 'RandomForestClassifier': <class 'sklearn.ensemble._forest.RandomForestClassifier'>, 'SelectKBest': <class 'sklearn.feature_selection._univariate_selection.SelectKBest'>, 'SimpleImputer': <class 'sklearn.impute._base.SimpleImputer'>, 'StandardScaler': <class 'sklearn.preprocessing._data.StandardScaler'>, 'mutual_info_classif': <function mutual_info_classif>}
get_model(seed: int | None = None, **additional_kwargs) → Any[source]
property module_kwargs : dict[str, Any]
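
The class is listed here without a docstring, so the following is only a sketch of how CLS, SKIP_PARAMS, SKLEARN_COMPONENTS, module_kwargs, and get_model might fit together. It assumes that module_kwargs returns every dataclass field not named in SKIP_PARAMS and that get_model looks CLS up in the component registry. ToyScaler, ToyConfigBase, and ToyScalerConfig are hypothetical names, and seed handling is left out.

```python
import dataclasses
from typing import Any, ClassVar


class ToyScaler:
    """Hypothetical stand-in for a scikit-learn component."""

    def __init__(self, low: float = 0.0, high: float = 1.0):
        self.low = low
        self.high = high


@dataclasses.dataclass
class ToyConfigBase:
    # Assumption: these names are bookkeeping, not constructor arguments.
    SKIP_PARAMS: ClassVar[list] = ["CLS", "SKLEARN_COMPONENTS", "SKIP_PARAMS"]
    SKLEARN_COMPONENTS: ClassVar[dict] = {"ToyScaler": ToyScaler}

    CLS: str = "???"

    @property
    def module_kwargs(self) -> dict:
        # Every dataclass field except the skipped ones becomes a keyword argument.
        return {
            f.name: getattr(self, f.name)
            for f in dataclasses.fields(self)
            if f.name not in self.SKIP_PARAMS
        }

    def get_model(self, **additional_kwargs: Any) -> Any:
        # Look the component up by name, then instantiate it with the config values.
        component = self.SKLEARN_COMPONENTS[self.CLS]
        return component(**{**self.module_kwargs, **additional_kwargs})


@dataclasses.dataclass
class ToyScalerConfig(ToyConfigBase):
    CLS: str = "ToyScaler"
    low: float = 0.0
    high: float = 1.0


config = ToyScalerConfig(high=10.0)
model = config.get_model()
```

Under these assumptions, adding a new component would only require a registry entry and a small dataclass whose fields mirror the constructor arguments.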
class EventStream.baseline.FT_task_baseline.ESDFlatFeatureLoader(ESD: Dataset, window_sizes: list[str], feature_inclusion_frequency: float | dict[str, float] | None = None, include_only_measurements: set[str] | None = None, convert_to_mean_var: bool = True, **kwargs)[source]

Bases: object

A flat feature pre-processor in line with scikit-learn’s APIs.

This can dynamically apply window size, feature inclusion frequency, measurement restrictions, and mean variable conversions to flat feature sets. All window sizes indicated in this featurizer must be included in the passed dataframes.

fit(flat_rep_df: DataFrame, _) → ESDFlatFeatureLoader[source]
set_params(ESD: Dataset | None = None, window_sizes: list[str] | None = None, feature_inclusion_frequency: float | dict[str, float] | None = None, include_only_measurements: set[str] | None = None, convert_to_mean_var: bool | None = None)[source]
transform(flat_rep_df: DataFrame) → ndarray[source]
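
To illustrate the fit / set_params / transform shape and the requirement that every requested window size be present in the input, here is a toy sketch that works on plain lists of dictionaries rather than dataframes. ToyWindowSelector and the "<window>/<feature>" column naming are hypothetical and chosen only for illustration; the real class may name columns and report errors differently.

```python
class ToyWindowSelector:
    """Hypothetical stand-in that mimics the fit / transform / set_params shape."""

    def __init__(self, window_sizes):
        self.window_sizes = list(window_sizes)
        self.columns_ = None

    def set_params(self, window_sizes=None):
        # Only overwrite parameters that were actually passed.
        if window_sizes is not None:
            self.window_sizes = list(window_sizes)
        return self

    def fit(self, rows, _=None):
        # Assumption: columns are named "<window>/<feature>".
        available = sorted({col for row in rows for col in row})
        present = {col.split("/", 1)[0] for col in available}
        missing = [w for w in self.window_sizes if w not in present]
        if missing:
            raise ValueError(f"Window sizes missing from input: {missing}")
        # Remember which columns belong to the requested windows.
        self.columns_ = [
            col for col in available if col.split("/", 1)[0] in self.window_sizes
        ]
        return self

    def transform(self, rows):
        # Emit a row-major matrix; values absent from a row become None.
        return [[row.get(col) for col in self.columns_] for row in rows]


rows = [
    {"1d/heart_rate": 72.0, "7d/heart_rate": 70.5},
    {"1d/heart_rate": 80.0, "7d/heart_rate": 75.0},
]
selector = ToyWindowSelector(window_sizes=["7d"]).fit(rows, None)
matrix = selector.transform(rows)
```

Because fit returns the object itself, a component of this shape can be chained and placed as a step in a scikit-learn style pipeline.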
class EventStream.baseline.FT_task_baseline.ESDFlatFeatureLoaderConfig(CLS: str = 'ESDFlatFeatureLoader', window_sizes: list[str] | None = None, feature_inclusion_frequency: float | None = None, include_only_measurements: list[str] | None = None, convert_to_mean_var: bool = True)[source]

Bases: BaseSklearnModuleConfig

CLS : str = 'ESDFlatFeatureLoader'
convert_to_mean_var : bool = True
feature_inclusion_frequency : float | None = None
include_only_measurements : list[str] | None = None
window_sizes : list[str] | None = None
class EventStream.baseline.FT_task_baseline.KNNImputerConfig(CLS: str = 'KNNImputer', n_neighbors: int = 5, weights: str = 'uniform', add_indicator: bool = True)[source]

Bases: BaseSklearnModuleConfig

CLS : str = 'KNNImputer'
add_indicator : bool = True
n_neighbors : int = 5
weights : str = 'uniform'
class EventStream.baseline.FT_task_baseline.MinMaxScalerConfig(CLS: str = 'MinMaxScaler')[source]

Bases: BaseSklearnModuleConfig

CLS : str = 'MinMaxScaler'
class EventStream.baseline.FT_task_baseline.NMFConfig(CLS: str = 'NMF', n_components: int = 2)[source]

Bases: BaseSklearnModuleConfig

CLS : str = 'NMF'
n_components : int = 2
class EventStream.baseline.FT_task_baseline.PCAConfig(CLS: str = 'PCA', n_components: int = 2)[source]

Bases: BaseSklearnModuleConfig

CLS : str = 'PCA'
n_components : int = 2
class EventStream.baseline.FT_task_baseline.RandomForestClassifierConfig(CLS: str = 'RandomForestClassifier', n_estimators: int = 100, criterion: str = 'gini', max_depth: int | None = None, min_samples_split: int = 2, min_samples_leaf: int = 1, min_weight_fraction_leaf: float = 0.0, max_features: str | None = 'sqrt', max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0.0, bootstrap: bool = True, oob_score: bool = False, class_weight: str | None = None, ccp_alpha: float = 0.0, max_samples: int | float | None = None)[source]

Bases: BaseSklearnModuleConfig

CLS : str = 'RandomForestClassifier'
bootstrap : bool = True
ccp_alpha : float = 0.0
class_weight : str | None = None
criterion : str = 'gini'
max_depth : int | None = None
max_features : str | None = 'sqrt'
max_leaf_nodes : int | None = None
max_samples : int | float | None = None
min_impurity_decrease : float = 0.0
min_samples_leaf : int = 1
min_samples_split : int = 2
min_weight_fraction_leaf : float = 0.0
n_estimators : int = 100
oob_score : bool = False
class EventStream.baseline.FT_task_baseline.SelectKBestConfig(CLS: str = 'SelectKBest', k: int = 2)[source]

Bases: BaseSklearnModuleConfig

CLS : str = 'SelectKBest'
k : int = 2
class EventStream.baseline.FT_task_baseline.SimpleImputerConfig(CLS: str = 'SimpleImputer', strategy: str = 'constant', fill_value: float = 0, add_indicator: bool = True)[source]

Bases: BaseSklearnModuleConfig

CLS : str = 'SimpleImputer'
add_indicator : bool = True
fill_value : float = 0
strategy : str = 'constant'
class EventStream.baseline.FT_task_baseline.SklearnConfig(defaults: list[typing.Any] = <factory>, seed: int = 1, experiment_dir: str | pathlib.Path = '???', dataset_dir: str | pathlib.Path = '???', save_dir: str | pathlib.Path = '${experiment_dir}/sklearn_baselines/${task_df_name}/${finetuning_task_label}/${now:%Y-%m-%d_%H-%M-%S}', train_subset_size: int | float | str | None = None, do_overwrite: bool = False, task_df_name: str | None = '???', finetuning_task_label: str | None = '???', feature_selector: Any = '???', scaling: Any = None, imputation: Any = None, dim_reduce: Any = None, model: Any = '???', wandb_logger_kwargs: dict[str, typing.Any] = <factory>)[source]

Bases: object

PIPELINE_COMPONENTS = ['feature_selector', 'scaling', 'imputation', 'dim_reduce', 'model']
dataset_dir : str | Path = '???'
defaults : list[Any]
dim_reduce : Any = None
do_overwrite : bool = False
experiment_dir : str | Path = '???'
feature_selector : Any = '???'
finetuning_task_label : str | None = '???'
get_model(dataset: Dataset) → Any[source]
imputation : Any = None
model : Any = '???'
save_dir : str | Path = '${experiment_dir}/sklearn_baselines/${task_df_name}/${finetuning_task_label}/${now:%Y-%m-%d_%H-%M-%S}'
scaling : Any = None
seed : int = 1
task_df_name : str | None = '???'
train_subset_size : int | float | str | None = None
wandb_logger_kwargs : dict[str, Any]
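
PIPELINE_COMPONENTS fixes the order of the stages, while scaling, imputation, and dim_reduce default to None. The sketch below shows one way such a configuration could be turned into an ordered list of steps. It assumes, without confirmation from this page, that a stage set to None is simply skipped; pipeline_steps and the plain-dictionary config are hypothetical.

```python
PIPELINE_COMPONENTS = ["feature_selector", "scaling", "imputation", "dim_reduce", "model"]


def pipeline_steps(cfg: dict) -> list:
    """Return (name, component) pairs in pipeline order, skipping unset stages."""
    steps = []
    for name in PIPELINE_COMPONENTS:
        component = cfg.get(name)
        if component is None:
            continue  # Assumption: an unset optional stage is left out.
        steps.append((name, component))
    return steps


# Component names stand in for the instantiated objects in this toy example.
cfg = {
    "feature_selector": "ESDFlatFeatureLoader",
    "scaling": None,
    "imputation": "SimpleImputer",
    "dim_reduce": None,
    "model": "RandomForestClassifier",
}
steps = pipeline_steps(cfg)
```

A list of (name, component) pairs is the form that scikit-learn's Pipeline accepts for its steps argument, which is why the sketch returns that shape.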
class EventStream.baseline.FT_task_baseline.StandardScalerConfig(CLS: str = 'StandardScaler')[source]

Bases: BaseSklearnModuleConfig

CLS : str = 'StandardScaler'
EventStream.baseline.FT_task_baseline.eval_binary_classification(Y: ndarray, probs: ndarray) → dict[str, float][source]
EventStream.baseline.FT_task_baseline.eval_multi_class_classification(Y: ndarray, probs: ndarray, task_vocab: list[Any])[source]
EventStream.baseline.FT_task_baseline.load_flat_rep(ESD: Dataset, window_sizes: list[str], feature_inclusion_frequency: float | dict[str, float] | None = None, include_only_measurements: set[str] | None = None, do_update_if_missing: bool = True, task_df_name: str | None = None, do_cache_filtered_task: bool = True, subjects_included: dict[str, set[int]] | None = None) → dict[str, LazyFrame][source]

Loads a set of flat representations from a passed dataset that satisfy the given constraints.

Parameters:
ESD: Dataset

The dataset for which the flat representations should be loaded.

window_sizes: list[str]

In addition to writing out a raw, per-event flattened representation, the dataset can summarize these flattened representations over the historical windows specified in this argument. Each window is a string specifying a time delta, using this syntax: link_. Each window size is summarized to a separate directory and shares the same subject file split as the raw representation files.

feature_inclusion_frequency: float | dict[str, float] | None = None

The base feature inclusion frequency that determines which features can be included in the flat representation. It can be a float, in which case it applies across all measurements; None, in which case no filtering is applied; or a dictionary mapping each measurement type to a float that gives a per-measurement-type inclusion cutoff.

include_only_measurements: set[str] | None = None

Measurement types can also be filtered out wholesale from both representations. If this set is not None, only these measurements will be included.

do_update_if_missing: bool = True

If True and any window sizes or features are missing, the function will try to update the stored flat representations to include them. If False and information is missing, it raises a FileNotFoundError instead.

task_df_name: str | None = None

If specified, the flat representations loaded will be (inner) joined against the task dataframe of this name on the columns "subject_id" and "end_time" (which will be renamed to "timestamp"). This is to avoid needing to load the full dataset in flattened form into memory. This is also used as a cache key; if a pre-filtered dataset is written to disk at a specified path for this task, then the data will be loaded from there, rather than from the base dataset.

do_cache_filtered_task: bool = True

If True, the flat representations will, after being filtered to just the relevant rows for the task, be cached to disk for faster re-use.

subjects_included: dict[str, set[int]] | None = None

A dictionary by split of the subjects to include in the task. Omitted splits are used wholesale.
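
The three accepted forms of feature_inclusion_frequency, together with the include_only_measurements filter, can be pictured with the small sketch below. resolve_cutoff and keep_measurement are hypothetical helpers, not part of the library. The treatment of a measurement that is absent from the dictionary (no filtering) is an assumption, since the text above does not specify it.

```python
from typing import Optional, Union


def resolve_cutoff(
    feature_inclusion_frequency: Union[float, dict, None],
    measurement: str,
) -> Optional[float]:
    """Return the inclusion cutoff for one measurement, or None for no filtering."""
    if feature_inclusion_frequency is None:
        return None
    if isinstance(feature_inclusion_frequency, dict):
        # Assumption: a measurement absent from the dictionary is not filtered.
        return feature_inclusion_frequency.get(measurement)
    # A single float applies to every measurement.
    return float(feature_inclusion_frequency)


def keep_measurement(measurement: str, include_only_measurements: Optional[set]) -> bool:
    """Apply the wholesale measurement filter; None means keep everything."""
    return include_only_measurements is None or measurement in include_only_measurements


global_cutoff = resolve_cutoff(0.01, "lab")
per_type_cutoff = resolve_cutoff({"lab": 0.05}, "lab")
no_cutoff = resolve_cutoff(None, "lab")
```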

Raises:

FileNotFoundError – If do_update_if_missing is False and the requested historical representations are not already written to disk.
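
The task_df_name behavior described in the parameters, an inner join on "subject_id" and "end_time" (renamed to "timestamp"), can be sketched with plain Python dictionaries. The function itself returns LazyFrame objects, so this is only an illustration of the join semantics. inner_join_on_task, the "label" column, and the sample rows are hypothetical, and the sketch assumes each (subject_id, end_time) pair appears at most once in the task rows.

```python
def inner_join_on_task(flat_rows, task_rows):
    """Keep flat rows whose (subject_id, timestamp) appears in the task rows."""
    # The task dataframe's "end_time" column plays the role of "timestamp".
    task_by_key = {(t["subject_id"], t["end_time"]): t for t in task_rows}
    joined = []
    for row in flat_rows:
        task = task_by_key.get((row["subject_id"], row["timestamp"]))
        if task is None:
            continue  # Inner join: unmatched flat rows are dropped.
        # Carry over the task columns other than the join keys.
        extra = {k: v for k, v in task.items() if k not in ("subject_id", "end_time")}
        joined.append({**row, **extra})
    return joined


flat_rows = [
    {"subject_id": 1, "timestamp": "2020-01-01", "7d/heart_rate": 70.5},
    {"subject_id": 1, "timestamp": "2020-01-08", "7d/heart_rate": 71.0},
    {"subject_id": 2, "timestamp": "2020-01-01", "7d/heart_rate": 64.0},
]
task_rows = [
    {"subject_id": 1, "end_time": "2020-01-08", "label": True},
]
joined = inner_join_on_task(flat_rows, task_rows)
```

Only the rows relevant to the task survive the join, which is what lets the full flattened dataset stay out of memory.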

EventStream.baseline.FT_task_baseline.registered_sklearn_config(dataclass: Any) Any[source]

Decorator that allows you to use a dataclass as a Hydra config via the ConfigStore.

Adds the decorated dataclass as a Hydra StructuredConfig object to the Hydra ConfigStore. The name of the stored config in the ConfigStore is the snake case version of the CamelCase class name.
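
As an illustration of the naming rule, the sketch below converts CamelCase class names to snake case with two regular-expression passes. to_snake_case is a hypothetical helper using a common recipe; the library's own conversion may treat edge cases such as runs of capitals differently.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a CamelCase identifier to snake_case."""
    # First pass: split before a capitalized word that follows any character.
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Second pass: split between a lowercase letter or digit and a capital.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


forest_name = to_snake_case("RandomForestClassifierConfig")
pca_name = to_snake_case("PCAConfig")
```

With this recipe, acronyms such as "PCA" stay together as a single word instead of being split letter by letter.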

EventStream.baseline.FT_task_baseline.train_sklearn_pipeline(cfg: SklearnConfig)[source]
EventStream.baseline.FT_task_baseline.wandb_train_sklearn(cfg: SklearnConfig)[source]