h2o_sonar.methods.core package

Submodules

h2o_sonar.methods.core.method module

class h2o_sonar.methods.core.method.FeatureTypes

Bases: object

DEFAULT_DATE_FEATURE_FORMAT = '%Y%m%d'
KEY_CATEGORICAL_FEATURES = 'categorical'
KEY_CATNUM_FEATURES = 'catnum'
KEY_DATE_FEATURES = 'date'
KEY_DATE_FEATURES_FORMAT = 'date-format'
KEY_DATE_TIME_FEATURES = 'datetime'
KEY_ID_FEATURES = 'id'
KEY_IMAGE_FEATURES = 'image'
KEY_NUMERIC_FEATURES = 'numeric'
KEY_QUANTILE_BINS = 'quantile-bin'
KEY_TEXT_FEATURES = 'text'
KEY_TIME_FEATURES = 'time'
class h2o_sonar.methods.core.method.FeaturesMetadata(features_meta: dict | None = None)

Bases: FeatureTypes

Utility class for building a dictionary of features metadata, for instance as used or determined by a machine learning model. Every feature used by the model is marked with its type (numeric, categorical or both) and its characteristic (date, time, datetime, text, image, ID).
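As a rough illustration of the kind of dictionary this class manages, the sketch below builds a plain Python dict keyed by the FeatureTypes constants listed above. The feature names are hypothetical, and the exact dictionary shape accepted by FeaturesMetadata is an assumption here rather than something this page confirms.

```python
# Key strings are copied from the FeatureTypes constants on this page.
# Feature names and the overall dict layout are illustrative assumptions.
features_meta = {
    "numeric": ["age", "income"],
    "categorical": ["education", "age"],  # "age" is both numeric and categorical
    "date": ["signup_date"],
    "date-format": ["%Y%m%d"],            # one format per date feature, same index
    "id": ["customer_id"],
    "quantile-bin": {"income": 4},        # 4 quantile bins, i.e. quartiles
}

# Features marked with both types.
overlap = sorted(set(features_meta["numeric"]) & set(features_meta["categorical"]))
print(overlap)
```

The parallel "date" and "date-format" lists mirror the description of format_date_features, where the index of a format corresponds to the index of the date feature.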

add(feature_type: str, feature_name: str)
property categorical_features: list

Categorical features - can overlap with numeric features.

property categorical_numeric_features: list
static create_blank_dict()
property date_features: list
property date_time_features: list
empty() → bool

Return True if no feature metadata are set.

property format_date_features: list

Format for date features - index of the format corresponds to the index of date feature.

get(feature_name: str, default_value)
property id_features: list

ID features.

property image_features: list

Image features - column contains images and is used by the model.

property numeric_features: list

Numeric features - can overlap with categorical features.

property qtile_binning_features: dict

Quantile binning specification for given features - the key is the feature, the value is the quantile binning specification (the number of quantile bins to create, e.g. 4 for quartiles).

set(features_meta: dict)
property text_features: list

Text features - dataset column is used as text feature by the model.

property time_features: list
to_dict()
to_json(indent=None)
class h2o_sonar.methods.core.method.Method(method_name, method_type, interpretable_model=None)

Bases: ABC, FeatureTypes

Abstract class for all MLI objects exposing interpretation mechanisms.

DEFAULT_GRID_RESOLUTION = 10
KEY_CAT_WITH_NUM_BIN = 'categorical_with_numeric_bin'
LABEL_PREFIX_CLASS = 'p_'
LABEL_REGRESSION = 'p_0'
MISSING_VALUES = ['', '?', 'None', 'nan', 'NA', 'N/A', 'unknown', 'inf', '-inf', '1.7976931348623157e+308', '-1.7976931348623157e+308']
static create_date_aware_bins(features: list, frame, features_meta: dict = None, grid_resolution: int = 10, out_of_range_resolution: int = 0, date_format: str | list[str] = '%Y%m%d')

Create date aware bins (for basic formats) with given grid resolution.

Parameters:
features: list[int or str]

A list of features for which date aware bins should be created.

frame: datatable.Frame or pandas.core.frame.DataFrame

Original data for which partial dependence should be computed.

grid_resolution: int

The number of equally spaced points used to create bins when the number of unique values is large.

features_meta: dict

Optional features metadata indicating whether a given feature is a date (use the date key with a list of feature names).

out_of_range_resolution: int

Number of out of range bins to create below / above the binning interval.

date_format: str or list[str]

Pandas (Python string format based) date format used to decode features. An optional list allows a date format to be specified per feature. See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

Returns:
bins, oor_bins: tuple(list[list[object]], list[list[object]])

Data values for each target feature for which partial dependence is to be computed: a vector for a single target feature, otherwise a matrix.
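To illustrate why date aware binning matters, consider dates encoded as '%Y%m%d': treated as integers, 20240131 and 20240201 are 70 apart although they are consecutive days. The following is a minimal standard-library sketch of the idea, not the library's implementation; the function name date_bins_sketch is hypothetical and out of range bins are not covered.

```python
from datetime import datetime


def date_bins_sketch(values, grid_resolution=10, date_format="%Y%m%d"):
    """Return at most grid_resolution bin values, evenly spaced in calendar time."""
    if grid_resolution < 2:
        raise ValueError("grid_resolution must be at least 2")
    # Parse to datetimes so that spacing is measured in time, not in digits.
    parsed = sorted({datetime.strptime(str(v), date_format) for v in values})
    if len(parsed) <= grid_resolution:
        # Few unique values: use them directly as bins.
        return [d.strftime(date_format) for d in parsed]
    start, end = parsed[0], parsed[-1]
    step = (end - start) / (grid_resolution - 1)
    return [(start + i * step).strftime(date_format) for i in range(grid_resolution)]


january = [f"202401{day:02d}" for day in range(1, 32)]
print(date_bins_sketch(january, grid_resolution=4))
```

With 31 unique days and a grid resolution of 4, the sketch steps 10 days at a time from the first to the last date.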

property diagnostics

Method diagnostics data.

abstractmethod explain(model, **kwargs)
property interpretable_model

Interpretable model.

static is_missing_value(value)

Determine whether input represents a missing value.

Parameters:
value:

Input value.

Returns:
bool:

True in case of missing value, False otherwise.
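One plausible way such a check could work is to compare the string form of the input against the MISSING_VALUES list shown above. The sketch below does exactly that; the function name and the handling of None and surrounding whitespace are assumptions, and the actual implementation may differ.

```python
# Copied from the MISSING_VALUES constant documented on this page.
MISSING_VALUES = [
    "", "?", "None", "nan", "NA", "N/A", "unknown", "inf", "-inf",
    "1.7976931348623157e+308", "-1.7976931348623157e+308",
]


def is_missing_value_sketch(value):
    """Hypothetical stand-in: True if value looks like a missing-value marker."""
    if value is None:
        return True
    # Assumption: compare the stripped string representation with the markers.
    return str(value).strip() in MISSING_VALUES


print(is_missing_value_sketch("N/A"), is_missing_value_sketch(3.14))
```

Because Python renders float("nan") as 'nan' and float("inf") as 'inf', this string comparison also catches those float values.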

property method_name

Method name.

property method_type

Method type, e.g. ‘loco’ or ‘ice’.

static opt_import_err_msg(pckg_names: list[str] | str, method_name: str = '', method_type: str = '')

h2o_sonar.methods.core.stats module

class h2o_sonar.methods.core.stats.KolmogorovSmirnovResult(statistic, p_value, same_distribution, p_value_method)

Bases: tuple

p_value

Alias for field number 1

p_value_method

Alias for field number 3

same_distribution

Alias for field number 2

statistic

Alias for field number 0
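The "Alias for field number" entries indicate a named tuple, so fields can be read by name or by position. The sketch below recreates a tuple with the same field order using collections.namedtuple; the class name and the values are illustrative only and are not output of the library.

```python
from collections import namedtuple

# Stand-in with the same field order as documented above (illustrative only).
KsResultSketch = namedtuple(
    "KsResultSketch",
    ["statistic", "p_value", "same_distribution", "p_value_method"],
)

result = KsResultSketch(
    statistic=0.5, p_value=0.77, same_distribution=True, p_value_method="exact"
)

# Access by name, by index, or by unpacking.
statistic, p_value, same_distribution, p_value_method = result
print(result.p_value == result[1])
```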

h2o_sonar.methods.core.stats.jensen_shannon_divergence(sample_u: list, sample_v: list) → float

Calculate the Jensen-Shannon divergence (not distance) between two distributions.

Parameters:
sample_u: list

First probability distribution.

sample_v: list

Second probability distribution.

Returns:
float

Jensen-Shannon divergence between the two distributions.
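For reference, the Jensen-Shannon divergence of two discrete distributions P and Q is the average of the Kullback-Leibler divergences of each from their mixture M = (P + Q) / 2, and the Jensen-Shannon distance is its square root. The sketch below is a standard-library illustration under the assumptions that both inputs are probability vectors of equal length and that base-2 logarithms are used (so the result lies in [0, 1]); the logarithm base and input handling of the library function are not stated on this page.

```python
import math


def js_divergence_sketch(p, q):
    """Jensen-Shannon divergence (base 2) of two probability vectors."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution M = (P + Q) / 2.
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


divergence = js_divergence_sketch([1.0, 0.0], [0.0, 1.0])
distance = math.sqrt(divergence)  # the distance is the square root of the divergence
print(divergence, distance)
```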

h2o_sonar.methods.core.stats.kolmogorov_smirnov(sample_u: list, sample_v: list, p_calc_method: str = '', logger: SonarLogger | None = None) → KolmogorovSmirnovResult

Discrete Kolmogorov-Smirnov (KS) test for two samples.

Compare two samples and understand if they come from the same discrete distribution.

KS metric interpretation: The KS statistic is the maximum distance between the empirical cumulative distribution functions (ECDFs) of the two samples. A larger KS statistic indicates a greater difference between the two distributions.

The p-value indicates the probability of observing a KS statistic at least as extreme as the one calculated, assuming the null hypothesis (that the two samples come from the same distribution) is true. A small p-value (typically < 0.05) suggests that the null hypothesis can be rejected, indicating a significant difference between the two distributions.

The KS test is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples. The KS test is non-parametric, meaning it does not assume any specific distribution for the data. This makes it a versatile tool for comparing distributions, especially when the underlying distributions are unknown or not normally distributed.

Parameters:
sample_u: list

First sample.

sample_v: list

Second sample.

p_calc_method: str

Method to calculate p-value - options are “auto”, “exact”, and “asymp”.

  • exact method: exact distribution of the test statistic - used by “auto” for small samples

  • asymp method: asymptotic distribution of the test statistic - used by “auto” for large samples

logger: loggers.SonarLogger | None

Logger.

Returns:
KolmogorovSmirnovResult

Kolmogorov-Smirnov statistic (0.0 meaning perfect agreement, 1.0 disagreement), p-value (hypothesis testing), same distribution flag, and p-value method.
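The statistic described above, the maximum distance between the two ECDFs, can be sketched with the standard library alone. This illustration computes only the statistic (no p-value and no same distribution flag), the function name is hypothetical, and it is not the library's implementation.

```python
from bisect import bisect_right


def ks_statistic_sketch(sample_u, sample_v):
    """Maximum absolute difference between the ECDFs of two samples."""
    u, v = sorted(sample_u), sorted(sample_v)
    # The ECDFs only change at observed values, so checking those is enough.
    points = sorted(set(u) | set(v))
    return max(
        abs(bisect_right(u, x) / len(u) - bisect_right(v, x) / len(v))
        for x in points
    )


print(ks_statistic_sketch([1, 2, 3, 4], [3, 4, 5, 6]))
```

Identical samples give 0.0 and samples with no overlap in range give 1.0, matching the interpretation of the statistic given in the return value description.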


h2o_sonar.methods.core.stats.wasserstein_distance(sample_u: list, sample_v: list) → float

Calculate the Wasserstein distance between two distributions.

  • The input values do not need to be sorted; the function sorts them internally when necessary.

  • The function works correctly even if the value arrays don’t perfectly overlap.

Wasserstein distance interpretation: the distance represents the minimum “cost” (amount of probability mass multiplied by the distance moved) required to transform distribution 1 into distribution 2.

Parameters:
sample_u: list

First sample.

sample_v: list

Second sample.

Returns:
float

Wasserstein distance between the two distributions (lower value meaning higher agreement).
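For one-dimensional data, the "cost" interpretation above is equivalent to the area between the two empirical cumulative distribution functions. The sketch below computes that area with the standard library, treating both inputs as raw samples with equal weight per observation; that reading of the inputs and the function name are assumptions, and this is not the library's implementation.

```python
from bisect import bisect_right


def wasserstein_1d_sketch(sample_u, sample_v):
    """Area between the ECDFs of two one-dimensional samples."""
    u, v = sorted(sample_u), sorted(sample_v)
    points = sorted(set(u) | set(v))
    total = 0.0
    # Both ECDFs are constant between consecutive observed values.
    for left, right in zip(points, points[1:]):
        cdf_u = bisect_right(u, left) / len(u)
        cdf_v = bisect_right(v, left) / len(v)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


print(wasserstein_1d_sketch([0, 1, 3], [5, 6, 8]))
```

In this example every value in the first sample has to move 5 units to reach its counterpart in the second, so the distance is 5 (up to floating-point rounding).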


Module contents