EventStream.evaluation.MCF_evaluation module¶
This file contains code to aid in longitudinal, MCF-based evaluation over measurement predicates.
- EventStream.evaluation.MCF_evaluation.align_time_and_eval_predicates(df: DataFrame, measurement_predicates: dict[int, bool | tuple[None | tuple[float, bool] | float, None | tuple[float, bool]]]) DataFrame[source]¶
Adjusts the input DataFrame’s time column and evaluates the measurement predicates.
- Parameters:¶
- df: DataFrame¶
The dataframe to be adjusted. Must have the columns subject_id, time, dynamic_indices, dynamic_values, and align_time.
- measurement_predicates: dict[int, bool | tuple[None | tuple[float, bool] | float, None | tuple[float, bool]]]¶
A dictionary from dynamic measurement index to either the boolean True, in which case the presence of the measurement alone satisfies the predicate, or a range giving the bounds the measurement’s value must fall within to satisfy the predicate. The range is in the format (LOWER_BOUND, UPPER_BOUND), where each *_BOUND can be either None (no bound on that side), a floating point value (an exclusive bound), or a tuple of a floating point value and a boolean, where the boolean indicates whether the bound is inclusive (True) or exclusive (False).
- Returns:¶
A modified dataframe in which the elements of the (nested) time column are normalized so that 0 corresponds to a time value of align_time, and in which the dynamic indices and values columns are replaced by a set of boolean list columns indicating whether or not the event at that index satisfies the given predicate.
Examples
>>> df = pl.DataFrame({
...     'subject_id': [1, 2, 3],
...     'time': [
...         [0., 10, 20],
...         [0., 100],
...         [0., 1, 2, 3],
...     ],
...     'dynamic_indices': [
...         [[1, 2], [3, 3, 2], [4]],
...         [[1], [3]],
...         [[2, 3], [1], [8], [3, 1, 1]],
...     ],
...     'dynamic_values': [
...         [[None, 0], [-1, 4, 0.2], [None]],
...         [[None], [3]],
...         [[-0.1, 10], [None], [None], [6, None, None]],
...     ],
...     'align_time': [10, 100, 1.5],
... })
>>> measurement_predicates = {
...     3: (3.5, None),
...     1: True,
... }
>>> out = align_time_and_eval_predicates(df, measurement_predicates)
>>> pl.Config.set_tbl_width_chars(80)
<class 'polars.config.Config'>
>>> out
shape: (3, 4)
┌────────────┬─────────────────────┬─────────────────┬─────────────────────────┐
│ subject_id ┆ time                ┆ pred_3          ┆ pred_1                  │
│ ---        ┆ ---                 ┆ ---             ┆ ---                     │
│ i64        ┆ list[f64]           ┆ list[bool]      ┆ list[bool]              │
╞════════════╪═════════════════════╪═════════════════╪═════════════════════════╡
│ 1          ┆ [-10.0, 0.0, 10.0]  ┆ [false, true,   ┆ [true, false, false]    │
│            ┆                     ┆ false]          ┆                         │
│ 2          ┆ [-100.0, 0.0]       ┆ [false, false]  ┆ [true, false]           │
│ 3          ┆ [-1.5, -0.5, … 1.5] ┆ [true, false, … ┆ [false, true, … true]   │
│            ┆                     ┆ true]           ┆                         │
└────────────┴─────────────────────┴─────────────────┴─────────────────────────┘
>>> out[2]['time'].item().to_list()
[-1.5, -0.5, 0.5, 1.5]
>>> out[2]['pred_3'].item().to_list()
[True, False, False, True]
>>> out[2]['pred_1'].item().to_list()
[False, True, False, True]
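To make the per-event semantics in the example above concrete, here is a minimal pure-Python sketch for a single subject. It is not the module's implementation (which operates on polars dataframes); the name `align_and_eval` is hypothetical, only plain exclusive bounds are handled, and treating a missing value (`None`) as never satisfying a range is an assumption.

```python
def align_and_eval(times, indices, values, align_time, predicates):
    """Sketch for one subject: shift times and evaluate predicates per event.

    times:      list of event times
    indices:    per-event lists of measurement indices
    values:     per-event lists of measurement values (None if missing)
    predicates: {index: True | (lower, upper)}, bounds exclusive or None
    """
    # After the shift, 0 corresponds to align_time.
    aligned = [t - align_time for t in times]

    def satisfied(rule, value):
        if rule is True:
            return True  # presence of the measurement is enough
        if value is None:
            return False  # assumption: a missing value never satisfies a range
        lower, upper = rule
        return (lower is None or value > lower) and (upper is None or value < upper)

    preds = {}
    for idx, rule in predicates.items():
        # An event satisfies the predicate if any of its measurements does.
        preds[f"pred_{idx}"] = [
            any(i == idx and satisfied(rule, v) for i, v in zip(ev_idx, ev_val))
            for ev_idx, ev_val in zip(indices, values)
        ]
    return aligned, preds


# Third subject from the example above.
aligned, preds = align_and_eval(
    times=[0.0, 1, 2, 3],
    indices=[[2, 3], [1], [8], [3, 1, 1]],
    values=[[-0.1, 10], [None], [None], [6, None, None]],
    align_time=1.5,
    predicates={3: (3.5, None), 1: True},
)
print(aligned, preds)
```

For this subject the sketch agrees with the documented output rows for `time`, `pred_3`, and `pred_1`.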
- EventStream.evaluation.MCF_evaluation.crps(samples: ndarray, true: ndarray) ndarray[source]¶
Computes the Continuous Ranked Probability Score (CRPS) [1].
Given an empirical distribution and a true observation, this computes the CRPS between the two. For a single sample, this reduces to absolute error. The empirical distribution should be arranged such that independent samples of the distribution are on the first axis, and the remaining axes should match the shape of the true observations.
Initial Source: https://docs.pyro.ai/en/stable/_modules/pyro/ops/stats.html#crps_empirical
- [1] Tilmann Gneiting, Adrian E. Raftery (2007)
Strictly Proper Scoring Rules, Prediction, and Estimation https://www.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf
- Parameters:¶
- samples: ndarray¶
A numpy array of shape (n_samples, …) containing the drawn empirical samples for the distribution in question. May contain NaNs, which represent missing or censored samples.
- true: ndarray¶
A numpy array of shape (…) containing true observations. May contain NaNs, which represent missing or censored true observations.
- Returns:¶
A numpy array of shape (…) containing the CRPS score results for the true observations and empirical distributions corresponding to each position. Will be NaN if either the true observation was NaN at that position or if all sampled observations were NaN at that position.
- Raises:¶
ValueError – If the shape of true does not match the shape of samples absent the first dimension.
Examples
>>> import numpy as np
>>> true = np.array([0])
>>> samples = np.array([[-2]])
>>> crps(samples, true)
array([2])
>>> true = np.array([0])
>>> samples = np.array([[-2], [np.nan], [np.nan], [1], [2]])
>>> crps(samples, true)
array([0.77777778])
>>> true = np.array([0])
>>> samples = np.array([[-2], [-1], [0], [1], [2]])
>>> crps(samples, true)
array([0.4])
>>> true = np.array([-2, 0, -2, np.nan])
>>> samples = np.array([
...     [-1, 1, -1, -1],
...     [1, -2, 1, 1],
...     [2, -20, np.nan, 2],
...     [0, 10, 0, 0],
...     [3, 1, 3, 3],
...     [1, 1, 1, 1]
... ])
>>> crps(samples, true)
array([2.27777778, 1.41666667, 2.08      ,        nan])
>>> crps(np.array([-2, -1, 0, 1, 2]), true)
Traceback (most recent call last):
    ...
ValueError: The shape of true (4,) must match that of samples (5,) after the 1st dimension.
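For intuition about the numbers in these examples, the sketch below computes the empirical CRPS for a single position using the energy form from [1]: the mean absolute error of the samples minus half the mean absolute difference between all pairs of samples. The name `crps_1d` is hypothetical, and this quadratic-time form is only an illustration; it is not claimed to be how `crps` itself is implemented.

```python
import numpy as np


def crps_1d(samples, true):
    """Empirical CRPS for one observation (illustrative sketch only)."""
    x = np.asarray(samples, dtype=float)
    x = x[~np.isnan(x)]  # drop missing or censored samples
    if x.size == 0 or np.isnan(true):
        return float("nan")
    # Mean absolute error between the samples and the observation.
    abs_err = np.abs(x - true).mean()
    # Mean absolute difference over all ordered sample pairs (including i == j).
    spread = np.abs(x[:, None] - x[None, :]).mean()
    return float(abs_err - 0.5 * spread)


print(crps_1d([-2, -1, 0, 1, 2], 0))
print(crps_1d([-2, np.nan, np.nan, 1, 2], 0))
```

With one sample the spread term is zero, which is why the score reduces to absolute error in that case.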
- EventStream.evaluation.MCF_evaluation.eval_range(rng: tuple[None | tuple[float, bool] | float, None | tuple[float, bool]], val: Expr) Expr[source]¶
Returns true if val satisfies the range rng.
Examples
>>> pl.select(eval_range(True, pl.lit(0.1))).item()
True
>>> pl.select(eval_range(False, pl.lit(0.1))).item()
False
>>> pl.select(eval_range((1, 2), pl.lit(0.1))).item()
False
>>> pl.select(eval_range((None, 2), pl.lit(0.1))).item()
True
>>> pl.select(eval_range((1, 2), pl.lit(1))).item()
False
>>> pl.select(eval_range(((1, False), 2), pl.lit(1))).item()
False
>>> pl.select(eval_range(((1, True), 2), pl.lit(1))).item()
True
>>> pl.select(eval_range((1, 2), pl.lit(3))).item()
False
>>> pl.select(eval_range((1, None), pl.lit(3))).item()
True
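The bound format exercised by these examples can be mirrored on plain Python scalars. The following is a sketch under that assumption, not the polars expression logic used by `eval_range`; the names `in_range` and `split` are hypothetical.

```python
def in_range(rng, val):
    """Scalar sketch of the range semantics shown in the examples above."""
    if isinstance(rng, bool):
        # A bare boolean is passed through unchanged.
        return rng
    lower, upper = rng

    def split(bound):
        # None: unbounded; bare number: exclusive bound;
        # (number, flag): flag is True for an inclusive bound.
        if bound is None:
            return None, False
        if isinstance(bound, tuple):
            return bound
        return bound, False

    lo, lo_inclusive = split(lower)
    hi, hi_inclusive = split(upper)
    ok_lo = lo is None or (val >= lo if lo_inclusive else val > lo)
    ok_hi = hi is None or (val <= hi if hi_inclusive else val < hi)
    return ok_lo and ok_hi


print(in_range(((1, True), 2), 1))
print(in_range((1, 2), 1))
```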
- EventStream.evaluation.MCF_evaluation.get_MCF(aligned_Ts: list[float], MCF_cols: list[str], *dfs: list[DataFrame]) tuple[ndarray, ndarray][source]¶
Returns the population censor mask and the cumulative predicate incidence delta function for dfs.
- Parameters:¶
- aligned_Ts: list[float]¶
The timestamps for which the final MCF and censoring mask should be computed.
- MCF_cols: list[str]¶
A list of pl.List[pl.Boolean] columns in the dataframes to compute the MCF over.
- *dfs: list[DataFrame]¶
A list of dataframes to include in the final MCF. Each must be in the same order and have the columns time and MCF_cols[i] for all i.
- Returns:¶
- A boolean numpy array of shape (len(dfs), dfs[0].shape[0], len(aligned_Ts)) which contains a 1 at index [n, i, j] if subject i has any data at or after time aligned_Ts[j] in dfs[n].
- A uint numpy array of shape (len(dfs), dfs[0].shape[0], len(aligned_Ts), len(MCF_cols)) such that the value at index [n, i, j, k] is the count of new instances where MCF_cols[k] is True for subject i between time aligned_Ts[j-1] (or negative infinity if j == 0) and aligned_Ts[j] in dfs[n].
Examples
>>> df_1 = pl.DataFrame({
...     "subject_id": [1, 2],
...     "time": [
...         [-3.2, -2, 0, 10.2],
...         [0., 1.],
...     ],
...     "pred_1": [
...         [False, True, True, False],
...         [True, True],
...     ],
...     "pred_2": [
...         [True, False, False, True],
...         [False, False],
...     ],
... })
>>> df_2 = pl.DataFrame({
...     "subject_id": [1, 2],
...     "time": [
...         [-1.9, 0., 0.2],
...         [-10., 0., 2.3],
...     ],
...     "pred_1": [
...         [False, True, False],
...         [True, True, False],
...     ],
...     "pred_2": [
...         [True, False, True],
...         [True, False, False],
...     ],
... })
>>> aligned_Ts = [-3, 3, 6, 10]
>>> out = get_MCF(aligned_Ts, ["pred_1", "pred_2"], df_1, df_2)
>>> print(f"Got a {type(out)} of len {len(out)}")
Got a <class 'tuple'> of len 2
>>> out[0]
array([[[ True,  True,  True,  True,  True],
        [ True,  True, False, False, False]],
       [[ True,  True, False, False, False],
        [ True,  True, False, False, False]]])
>>> out[1]
array([[[[ 0.,  1.],
         [ 2.,  0.],
         [ 0.,  0.],
         [ 0.,  0.],
         [ 0.,  1.]],
        [[nan, nan],
         [ 2.,  0.],
         [ 0.,  0.],
         [ 0.,  0.],
         [nan, nan]]],
       [[[nan, nan],
         [ 1.,  2.],
         [ 0.,  0.],
         [ 0.,  0.],
         [ 0.,  0.]],
        [[ 1.,  1.],
         [ 1.,  0.],
         [ 0.,  0.],
         [ 0.,  0.],
         [ 0.,  0.]]]])
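The counting step behind the second output can be sketched for one subject and one predicate column with NumPy. In the example output the time axis appears to have len(aligned_Ts) + 1 bins, with the final bin collecting events after the last timestamp; the sketch follows that layout. The name `count_incidences` is hypothetical, placing an event that falls exactly on a timestamp into the bin ending at that timestamp is an assumption, and the NaN marking of censored bins seen above is deliberately not reproduced.

```python
import numpy as np


def count_incidences(times, flags, aligned_Ts):
    """Count True flags per time bin for one subject (illustrative sketch).

    Bin j covers (aligned_Ts[j-1], aligned_Ts[j]], with bin 0 open to
    negative infinity and a final bin for times after aligned_Ts[-1].
    """
    times = np.asarray(times, dtype=float)
    flags = np.asarray(flags, dtype=bool)
    edges = np.asarray(aligned_Ts, dtype=float)
    # side="left" maps a time equal to an edge into the bin ending at that edge.
    bins = np.searchsorted(edges, times, side="left")
    # Only events where the predicate is True contribute to the counts.
    return np.bincount(bins[flags], minlength=len(edges) + 1)


aligned_Ts = [-3, 3, 6, 10]
times = [-3.2, -2, 0, 10.2]  # subject 1 of df_1 above
print(count_incidences(times, [False, True, True, False], aligned_Ts))  # pred_1
print(count_incidences(times, [True, False, False, True], aligned_Ts))  # pred_2
```

For this subject, who has no NaN entries in the documented output, the counts match the corresponding columns of `out[1][0][0]`.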
- EventStream.evaluation.MCF_evaluation.get_MCF_coordinates(control_df: DataFrame, sample_dfs: list[DataFrame], measurement_predicates: dict[int, bool | tuple[None | tuple[float, bool] | float, None | tuple[float, bool]] | list[tuple[None | tuple[float, bool] | float, None | tuple[float, bool]]]], n_timestamps: int | None = None) tuple[list[int], list[float], list[int], ndarray, ndarray, ndarray, ndarray][source]¶
Returns aligned MCF coordinates per subject comparing the control and sample dataframes.
- Parameters:¶
- control_df: DataFrame¶
A dataframe in the “deep-learning friendly format” containing the control data for comparison. Must have the columns subject_id, time, dynamic_indices, and dynamic_values.
- sample_dfs: list[DataFrame]¶
A list of dataframes in the “deep-learning friendly format” containing the comparison population. Must have the same columns as the control_df, plus the additional column control_align_idx, which states what event index within the control dataframe is the temporal alignment point. Each entry of the list is interpreted to be an independent sample for comparison, and list order is presumed to be meaningless.
- measurement_predicates: dict[int, bool | tuple[None | tuple[float, bool] | float, None | tuple[float, bool]] | list[tuple[None | tuple[float, bool] | float, None | tuple[float, bool]]]]¶
A dictionary from dynamic measurement index to either the boolean True, in which case the presence of the measurement is used alone, or a range dictating bounds for the measurement’s value to satisfy the predicate. The range is in the format (LOWER_BOUND, UPPER_BOUND), where *_BOUND can be either None (in which case there is no bound on that side), a floating point value (in which case the bound is considered to be exclusive), or a tuple of a floating point value and a boolean value where the boolean value indicates an inclusive or exclusive bound.
- n_timestamps: int | None = None¶
Downsample (without replacement) the set of possible aligned timepoints to this number if specified.
- Returns:¶
1. The subject IDs in order of the rows of the returned coordinates.
2. The aligned MCF time-values (aligned so that 0 is the alignment point between control and sample dataframes per subject).
3. The output index of dynamic measurement indices.
4. A boolean numpy array indicating whether or not a given subject (row) in the control population has data at or after a timepoint (column).
5. A boolean numpy array containing incidence markers for measurement predicates (dimension 0) by subject (dimension 1) and time (dimension 3).
6. A boolean numpy array indicating whether or not a given subject (dimension 0) in the sample population has data at or after a timepoint (dimension 1) across all sample populations (dimension 2).
7. A boolean numpy array containing incidence markers for measurement predicates (dimension 0) by subject (dimension 1) and time (dimension 2) across all sample populations (dimension 3).
Examples
>>> control_df = pl.DataFrame({
...     'subject_id': [1, 2, 3],
...     'control_align_idx': [1, 1, 0],
...     'time': [
...         [0., 10, 20],
...         [0., 100],
...         [0., 1, 2, 3],
...     ],
...     'dynamic_indices': [
...         [[1, 2], [3, 3, 2], [4]],
...         [[1], [3]],
...         [[2, 3], [1], [8], [3, 1, 1]],
...     ],
...     'dynamic_values': [
...         [[None, 0], [-1, 4, 0.2], [None]],
...         [[None], [3]],
...         [[-0.1, 10], [None], [None], [6, None, None]],
...     ],
... })
>>> sample_df_1 = pl.DataFrame({
...     'subject_id': [2, 1, 3],
...     'time': [
...         [200, 300, 400],
...         [18, 24, 33],
...         [2.1, 3, 4.1],
...     ],
...     'dynamic_indices': [
...         [[1], [3], [1, 2]],
...         [[3], [2], [1]],
...         [[2, 3], [], [3, 3]],
...     ],
...     'dynamic_values': [
...         [[None], [3.1], [None, 0.03]],
...         [[0], [0.21], [None]],
...         [[-0.1, 10], [], [6, -1]],
...     ],
... })
>>> sample_df_2 = pl.DataFrame({
...     'subject_id': [3, 1, 2],
...     'time': [
...         [5.1, 6, 7.1],
...         [11, 14, 23],
...         [110, 202, 250],
...     ],
...     'dynamic_indices': [
...         [[], [1, 2], [1]],
...         [[1, 2], [1], [1]],
...         [[1], [3], [3, 3]],
...     ],
...     'dynamic_values': [
...         [[], [None, 0.1], [None]],
...         [[None, -0.04], [None], [None]],
...         [[None], [13.1], [0.5, 0.3]],
...     ],
... })
>>> measurement_predicates = {
...     3: (3.5, None),
...     1: True,
... }
>>> out = get_MCF_coordinates(control_df, [sample_df_1, sample_df_2], measurement_predicates)
>>> subject_ids, Ts, dynamic_indices, control_censor_mask, control_MCF, sample_mask, sample_MCF = out
>>> subject_ids
[1, 2, 3]
>>> len(Ts)
20
>>> Ts[:10]
[-100.0, -10.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.1, 6.0, 7.1]
>>> Ts[10:]
[8.0, 10.0, 13.0, 14.0, 23.0, 100.0, 102.0, 150.0, 200.0, 300.0]
>>> dynamic_indices
[3, 1]
>>> control_censor_mask.shape
(1, 3, 21)
>>> control_MCF.shape
(1, 3, 21, 2)
>>> sample_mask.shape
(2, 3, 21)
>>> sample_MCF.shape
(2, 3, 21, 2)
- EventStream.evaluation.MCF_evaluation.get_aligned_timestamps(control_T: Series, *sample_Ts: list[Series], n_timestamps: int | None = None) list[float][source]¶
Gets the aligned timestamps given the input raw timestamps.
Examples
>>> control_T = pl.Series([
...     [-10., 0, 1, 2], [-105, 1, 4],
... ])
>>> sample_T_1 = pl.Series([
...     [8, 21.1], [46, 132, 188, 200.],
... ])
>>> sample_T_2 = pl.Series([
...     [1.1], None
... ])
>>> get_aligned_timestamps(control_T, sample_T_1, sample_T_2)
[-105.0, -10.0, 0.0, 1.0, 1.1, 2.0, 4.0, 8.0, 21.1, 46.0, 132.0, 188.0, 200.0]
>>> get_aligned_timestamps(control_T, sample_T_1, sample_T_2, n_timestamps=40)
[-105.0, -10.0, 0.0, 1.0, 1.1, 2.0, 4.0, 8.0, 21.1, 46.0, 132.0, 188.0, 200.0]
>>> import numpy as np
>>> np.random.seed(1)
>>> get_aligned_timestamps(control_T, sample_T_1, sample_T_2, n_timestamps=4)
[1.1, 2.0, 4.0, 46.0]
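Judging from the outputs above, the result looks like the sorted union of all distinct timestamps, optionally downsampled without replacement. The sketch below reproduces that behaviour on plain nested lists. The name `aligned_timestamps_sketch` and the `rng` parameter are hypothetical, and because the random draw here uses its own generator it will not necessarily select the same four values as the seeded example.

```python
import numpy as np


def aligned_timestamps_sketch(*time_lists, n_timestamps=None, rng=None):
    """Sorted unique timestamps across inputs, optionally downsampled (sketch).

    Each positional argument is a list of per-subject timestamp lists;
    a subject entry may be None, in which case it is skipped.
    """
    values = [
        t
        for subject_lists in time_lists
        for times in subject_lists
        if times is not None
        for t in times
    ]
    unique = np.unique(np.asarray(values, dtype=float))  # sorted ascending
    if n_timestamps is not None and n_timestamps < len(unique):
        if rng is None:
            rng = np.random.default_rng()
        # Sample without replacement, then restore time order.
        unique = np.sort(rng.choice(unique, size=n_timestamps, replace=False))
    return unique.tolist()


control_T = [[-10.0, 0, 1, 2], [-105, 1, 4]]
sample_T_1 = [[8, 21.1], [46, 132, 188, 200.0]]
sample_T_2 = [[1.1], None]
print(aligned_timestamps_sketch(control_T, sample_T_1, sample_T_2))
```

Requesting more timestamps than exist, as in the n_timestamps=40 call above, leaves the full set unchanged.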