h2o_sonar.methods.core package

Submodules

h2o_sonar.methods.core.method module

class h2o_sonar.methods.core.method.FeatureTypes

Bases: object

DEFAULT_DATE_FEATURE_FORMAT = '%Y%m%d'

KEY_CATEGORICAL_FEATURES = 'categorical'

KEY_CATNUM_FEATURES = 'catnum'

KEY_DATE_FEATURES = 'date'

KEY_DATE_FEATURES_FORMAT = 'date-format'

KEY_DATE_TIME_FEATURES = 'datetime'

KEY_ID_FEATURES = 'id'

KEY_IMAGE_FEATURES = 'image'

KEY_NUMERIC_FEATURES = 'numeric'

KEY_QUANTILE_BINS = 'quantile-bin'

KEY_TEXT_FEATURES = 'text'

KEY_TIME_FEATURES = 'time'

class h2o_sonar.methods.core.method.FeaturesMetadata(features_meta: dict | None = None)

Bases: FeatureTypes

Utility class for building a dictionary of feature metadata as used or determined by a machine learning model. Every feature used by the model is marked with its type (numeric, categorical, or both) and its characteristic (date, time, datetime, text, image, ID).

add(feature_type: str, feature_name: str)

property categorical_features: list: Categorical features - can overlap with numeric features.

property categorical_numeric_features: list

static create_blank_dict()

property date_features: list

property date_time_features: list

empty() → bool: Return True if no feature metadata are set.

property format_date_features: list: Format for date features - index of the format corresponds to the index of date feature.

get(feature_name: str, default_value)

property id_features: list: ID features.

property image_features: list: Image features - column contains images and is used by the model.

property numeric_features: list: Numeric features - can overlap with categorical features.

property qtile_binning_features: dict: Quantile binning specification for given features - key is the feature, value is the quantile binning specification (the number of quantile bins to create, e.g. 4 for quartiles).

set(features_meta: dict)

property text_features: list: Text features - dataset column is used as text feature by the model.

property time_features: list

to_dict()

to_json(indent=None)
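The following is a minimal sketch of the kind of dictionary this class manages, written without importing h2o_sonar. The key strings are copied from the FeatureTypes constants above; the helper add_feature, the feature names, and the assumption that every key maps to a plain list are hypothetical and are not taken from the library.

```python
import json

# Keys copied from the FeatureTypes constants listed above.
KEY_CATEGORICAL_FEATURES = "categorical"
KEY_NUMERIC_FEATURES = "numeric"
KEY_DATE_FEATURES = "date"
KEY_DATE_FEATURES_FORMAT = "date-format"


def add_feature(features_meta: dict, feature_type: str, feature_name: str) -> dict:
    """Hypothetical helper: record feature_name under feature_type once."""
    names = features_meta.setdefault(feature_type, [])
    if feature_name not in names:
        names.append(feature_name)
    return features_meta


features_meta: dict = {}
add_feature(features_meta, KEY_NUMERIC_FEATURES, "AGE")
add_feature(features_meta, KEY_CATEGORICAL_FEATURES, "EDUCATION")
# A feature may be marked as both numeric and categorical.
add_feature(features_meta, KEY_CATEGORICAL_FEATURES, "AGE")
add_feature(features_meta, KEY_DATE_FEATURES, "SIGNUP_DATE")
# Assumed layout: the format at index i describes the date feature at index i.
add_feature(features_meta, KEY_DATE_FEATURES_FORMAT, "%Y%m%d")

# Serialize the metadata, similar in spirit to to_json(indent=...).
features_meta_json = json.dumps(features_meta, indent=2)
```

With the real class, the equivalent steps would presumably go through add(feature_type, feature_name) and to_json(indent), but the exact dictionary layout they produce is not documented here.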

class h2o_sonar.methods.core.method.Method(method_name, method_type, interpretable_model=None)

Bases: ABC, FeatureTypes

Abstract class for all MLI objects exposing interpretation mechanisms.

DEFAULT_GRID_RESOLUTION = 10

KEY_CAT_WITH_NUM_BIN = 'categorical_with_numeric_bin'

LABEL_PREFIX_CLASS = 'p_'

LABEL_REGRESSION = 'p_0'

MISSING_VALUES = ['', '?', 'None', 'nan', 'NA', 'N/A', 'unknown', 'inf', '-inf', '1.7976931348623157e+308', '-1.7976931348623157e+308']

static create_date_aware_bins(features: list, frame, features_meta: dict = None, grid_resolution: int = 10, out_of_range_resolution: int = 0, date_format: str | list[str] = '%Y%m%d')

Create date-aware bins (for basic formats) with the given grid resolution.

Parameters:

features: list[int or str]: A list of features for which date-aware bins should be created.
frame: datatable.Frame or pandas.core.frame.DataFrame: Original data for which partial dependence should be computed.
grid_resolution: int: The number of equally spaced points used to create bins if the number of unique values is large.
features_meta: dict: Optional features metadata indicating whether a given feature is a date (use the date key with a list of feature names).
out_of_range_resolution: int: Number of out-of-range bins to create below / above the binning interval.
date_format: str or [str]: Pandas (Python string format based) date format used to decode features. An optional list allows a per-feature date format to be specified. https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

Returns:

bins, oor_bins: tuple(list[list[object]], list[list[object]]): Data values for each target feature for which partial dependence is to be computed: a vector for a single target feature, otherwise a matrix.
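To illustrate what equally spaced, date-aware grid points can look like, here is a minimal standard-library sketch. The function date_grid is hypothetical and is not the library's implementation; it only handles a single feature given as a list of date strings and ignores out-of-range bins.

```python
from datetime import datetime


def date_grid(values, grid_resolution=10, date_format="%Y%m%d"):
    """Hypothetical sketch: equally spaced dates between min and max of values."""
    parsed = [datetime.strptime(str(v), date_format) for v in values]
    low, high = min(parsed), max(parsed)
    if grid_resolution < 2 or low == high:
        # Nothing to space out: return the single lower bound.
        return [low.strftime(date_format)]
    # timedelta divided by an int gives the spacing between grid points.
    step = (high - low) / (grid_resolution - 1)
    return [(low + i * step).strftime(date_format) for i in range(grid_resolution)]


grid = date_grid(["20200101", "20200111", "20200105"], grid_resolution=3)
```

The grid is computed on parsed dates rather than on the raw integers, so the points stay valid calendar dates even when the range crosses a month or year boundary.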

property diagnostics: Method diagnostics data.

abstractmethod explain(model, **kwargs)

property interpretable_model: Interpretable model.

static is_missing_value(value)

Determine whether input represents a missing value.

Parameters:

value: Input value.

Returns:

bool: True if the input is a missing value, False otherwise.
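One plausible way such a check could work is to compare the string form of the input against the MISSING_VALUES list shown above. The sketch below assumes exactly that; the function looks_missing is hypothetical, and the library's actual logic may differ.

```python
# Copied from Method.MISSING_VALUES as listed above.
MISSING_VALUES = [
    "", "?", "None", "nan", "NA", "N/A", "unknown", "inf", "-inf",
    "1.7976931348623157e+308", "-1.7976931348623157e+308",
]


def looks_missing(value) -> bool:
    """Hypothetical check: the string form of value is a known missing marker."""
    return str(value) in MISSING_VALUES


checks = {
    "empty string": looks_missing(""),
    "None": looks_missing(None),
    "NaN": looks_missing(float("nan")),
    "zero": looks_missing(0),
}
```

Under this assumption, None and float("nan") count as missing because their string forms ("None" and "nan") appear in the list, while 0 does not.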

property method_name: Method name.

property method_type: Method type e.g. ‘loco’ or ‘ice’.

static opt_import_err_msg(pckg_names: list[str] | str, method_name: str = '', method_type: str = '')

h2o_sonar.methods.core.stats module

class h2o_sonar.methods.core.stats.KolmogorovSmirnovResult(statistic, p_value, same_distribution, p_value_method)

Bases: tuple

p_value: Alias for field number 1

p_value_method: Alias for field number 3

same_distribution: Alias for field number 2

statistic: Alias for field number 0

h2o_sonar.methods.core.stats.jensen_shannon_divergence(sample_u: list, sample_v: list) → float

Calculate the Jensen-Shannon divergence (not distance) between two distributions.

Parameters:

sample_u: list: First probability distribution.
sample_v: list: Second probability distribution.

Returns:

float: Jensen-Shannon divergence between the two distributions.
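For reference, the Jensen-Shannon divergence of distributions P and Q is 0.5 * KL(P || M) + 0.5 * KL(Q || M), where M is the average of P and Q; the Jensen-Shannon distance is its square root. The sketch below is a plain-Python illustration of that formula, not the library's code. It assumes both inputs are non-negative weights of equal length and uses the natural logarithm; the base used by h2o_sonar is not stated in this reference.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Illustrative Jensen-Shannon divergence (natural log) of two distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Normalize so that each input sums to 1.
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

With the natural logarithm the divergence ranges from 0 for identical distributions to ln 2 for distributions with disjoint support.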

h2o_sonar.methods.core.stats.kolmogorov_smirnov(sample_u: list, sample_v: list, p_calc_method: str = '', logger: SonarLogger | None = None) → KolmogorovSmirnovResult

Discrete Kolmogorov-Smirnov (KS) test for two samples.

Compare two samples and understand if they come from the same discrete distribution.

KS metric interpretation:

The KS statistic is the maximum distance between the empirical cumulative distribution functions (ECDFs) of the two samples. A larger KS statistic indicates a greater difference between the two distributions.

The p-value is the probability of observing a KS statistic at least as extreme as the one calculated, assuming the null hypothesis (that the two samples come from the same distribution) is true. A small p-value (typically < 0.05) suggests that the null hypothesis can be rejected, indicating a significant difference between the two distributions.

The KS test is sensitive to differences in both the location and the shape of the ECDFs of the two samples.

The KS test is non-parametric: it does not assume any specific distribution for the data. This makes it a versatile tool for comparing distributions, especially when the underlying distributions are unknown or not normally distributed.

Parameters:

sample_u: list: First sample.
sample_v: list: Second sample.
p_calc_method: str: Method used to calculate the p-value - options are “auto”, “exact”, and “asymp”.

exact method: exact distribution of the test statistic - used with “auto” for small samples
asymp method: asymptotic distribution of the test statistic - used with “auto” for large samples

logger: loggers.SonarLogger | None: Logger.

Returns:

KolmogorovSmirnovResult: Kolmogorov-Smirnov statistic (0.0 meaning perfect agreement, 1.0 disagreement), p-value (hypothesis testing), same distribution flag, and p-value method.

References

[1] scipy.stats.ks_2samp: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html
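The KS statistic described above, the maximum distance between the two ECDFs, can be sketched in a few lines of plain Python. The function ks_statistic is hypothetical and covers only the statistic; it does not compute the p-value, the same-distribution flag, or anything else the library function returns.

```python
from bisect import bisect_right


def ks_statistic(sample_u, sample_v):
    """Illustrative two-sample KS statistic: max |ECDF_u(x) - ECDF_v(x)|."""
    u, v = sorted(sample_u), sorted(sample_v)
    if not u or not v:
        raise ValueError("both samples must be non-empty")
    # The ECDFs only change at observed values, so checking those is enough.
    return max(
        abs(bisect_right(u, x) / len(u) - bisect_right(v, x) / len(v))
        for x in u + v
    )


same = ks_statistic([1, 2, 3], [3, 1, 2])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
partial = ks_statistic([1, 2], [2, 3])
```

Identical samples give 0.0 (perfect agreement) and samples with no overlap give 1.0, matching the range described in the Returns entry.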

h2o_sonar.methods.core.stats.wasserstein_distance(sample_u: list, sample_v: list) → float

Calculate the Wasserstein distance between two distributions.

The inputs do not need to be sorted; the function sorts them internally if necessary.
The function works correctly even if the value ranges of the two samples do not fully overlap.

Wasserstein distance interpretation: the distance represents the minimum “cost” (amount of probability mass multiplied by the distance moved) required to transform distribution 1 into distribution 2.

Parameters:

sample_u: list: First sample.
sample_v: list: Second sample.

Returns:

float: Wasserstein distance between the two distributions (lower value meaning higher agreement).

References

[1] scipy.stats.wasserstein_distance: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html
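For one-dimensional samples, the transport cost described above equals the area between the two empirical cumulative distribution functions. The following plain-Python sketch computes it that way for unweighted samples; the function wasserstein_1d is hypothetical and is not the library's implementation.

```python
from bisect import bisect_right


def wasserstein_1d(sample_u, sample_v):
    """Illustrative 1-D Wasserstein distance: area between the two ECDFs."""
    u, v = sorted(sample_u), sorted(sample_v)
    if not u or not v:
        raise ValueError("both samples must be non-empty")
    points = sorted(u + v)
    distance = 0.0
    # Between consecutive observed values both ECDFs are constant.
    for left, right in zip(points, points[1:]):
        cdf_u = bisect_right(u, left) / len(u)
        cdf_v = bisect_right(v, left) / len(v)
        distance += abs(cdf_u - cdf_v) * (right - left)
    return distance


shifted = wasserstein_1d([0, 1, 3], [5, 6, 8])
identical = wasserstein_1d([3, 1, 0], [0, 1, 3])
```

Here the second sample is the first shifted by 5, so every unit of probability mass moves a distance of 5 and the result is 5.0; identical samples give 0.0.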
