h2o_sonar.methods.core package
Submodules
h2o_sonar.methods.core.method module
- class h2o_sonar.methods.core.method.FeatureTypes
Bases: object
- DEFAULT_DATE_FEATURE_FORMAT = '%Y%m%d'
- KEY_CATEGORICAL_FEATURES = 'categorical'
- KEY_CATNUM_FEATURES = 'catnum'
- KEY_DATE_FEATURES = 'date'
- KEY_DATE_FEATURES_FORMAT = 'date-format'
- KEY_DATE_TIME_FEATURES = 'datetime'
- KEY_ID_FEATURES = 'id'
- KEY_IMAGE_FEATURES = 'image'
- KEY_NUMERIC_FEATURES = 'numeric'
- KEY_QUANTILE_BINS = 'quantile-bin'
- KEY_TEXT_FEATURES = 'text'
- KEY_TIME_FEATURES = 'time'
- class h2o_sonar.methods.core.method.FeaturesMetadata(features_meta: dict | None = None)
Bases: FeatureTypes
Utility class for building a dictionary of features metadata, for instance as used or determined by a machine learning model. Every feature used by the model is marked with its type (numeric, categorical or both) and characteristic (date, time, datetime, text, image, ID).
- static create_blank_dict()
- property format_date_features: list
Formats for date features: the format at a given index applies to the date feature at the same index.
- property qtile_binning_features: dict
Quantile binning specification for the given features: each key is a feature and each value is the number of quantile bins to create (e.g. 4 for quartiles).
- to_dict()
- to_json(indent=None)
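As a rough illustration of the structure described above, the following sketch builds a plain dictionary keyed by the FeatureTypes string constants and serializes it to JSON. It does not import h2o_sonar: the feature names are hypothetical, and the layout (each type key mapping to a list of feature names) is an assumption rather than the documented output of to_dict().

```python
import json

# Key strings copied from the FeatureTypes constants listed above.
KEY_NUMERIC_FEATURES = "numeric"
KEY_CATEGORICAL_FEATURES = "categorical"
KEY_DATE_FEATURES = "date"
KEY_DATE_FEATURES_FORMAT = "date-format"
KEY_QUANTILE_BINS = "quantile-bin"

# Hypothetical feature names; the layout is an assumption, not the library's schema.
features_meta = {
    KEY_NUMERIC_FEATURES: ["age", "income"],
    KEY_CATEGORICAL_FEATURES: ["state"],
    KEY_DATE_FEATURES: ["signup_date"],
    # The format at index i applies to the date feature at index i.
    KEY_DATE_FEATURES_FORMAT: ["%Y%m%d"],
    # Feature name -> number of quantile bins (4 means quartiles).
    KEY_QUANTILE_BINS: {"income": 4},
}

# Serialize the dictionary, comparable in spirit to to_json(indent=2).
serialized = json.dumps(features_meta, indent=2)
```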
- class h2o_sonar.methods.core.method.Method(method_name, method_type, interpretable_model=None)
Bases: ABC, FeatureTypes
Abstract class for all MLI objects exposing interpretation mechanisms.
- DEFAULT_GRID_RESOLUTION = 10
- KEY_CAT_WITH_NUM_BIN = 'categorical_with_numeric_bin'
- LABEL_PREFIX_CLASS = 'p_'
- LABEL_REGRESSION = 'p_0'
- MISSING_VALUES = ['', '?', 'None', 'nan', 'NA', 'N/A', 'unknown', 'inf', '-inf', '1.7976931348623157e+308', '-1.7976931348623157e+308']
- static create_date_aware_bins(features: list, frame, features_meta: dict = None, grid_resolution: int = 10, out_of_range_resolution: int = 0, date_format: str | list[str] = '%Y%m%d')
Create date aware bins (for basic formats) with given grid resolution.
- Parameters:
- features: list[int or str]
A list of features for which date aware bins should be created.
- frame: datatable.Frame or pandas.core.frame.DataFrame
Original data for which partial dependence should be computed.
- grid_resolution: int
The number of equally spaced points used to create bins if the number of unique values is large.
- features_meta: dict
Optional features metadata that indicates whether a given feature is a date (use the date key and a list of feature names).
- out_of_range_resolution: int
Number of out of range bins to create below / above the binning interval.
- date_format: str or list[str]
Pandas (Python string format based) date format used to decode features. An optional list allows a per-feature date format to be specified. https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
- Returns:
- bins, oor_bins: tuple(list[list[object]], list[list[object]])
Data values for each target feature for which partial dependence is to be computed: a vector for a single target feature, otherwise a matrix.
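The idea of date-aware binning with a grid resolution can be sketched with the standard library alone. This is a simplified illustration, not the implementation of create_date_aware_bins: the helper name date_bins is hypothetical, it handles a single feature, and it ignores features_meta and out-of-range bins. It assumes that all unique values are kept when there are few of them, and that equally spaced dates are generated otherwise.

```python
from datetime import datetime


def date_bins(values, grid_resolution=10, date_format="%Y%m%d"):
    """Return date bins for one feature as strings in `date_format`.

    Hypothetical helper: keeps every unique date when there are at most
    `grid_resolution` of them, otherwise returns `grid_resolution`
    equally spaced dates between the earliest and the latest value.
    """
    if grid_resolution < 2:
        raise ValueError("grid_resolution must be at least 2")
    # Parse, deduplicate and sort the raw values.
    parsed = sorted({datetime.strptime(str(v), date_format) for v in values})
    if len(parsed) <= grid_resolution:
        return [d.strftime(date_format) for d in parsed]
    start, end = parsed[0], parsed[-1]
    # A timedelta divided by an int gives the spacing between grid points.
    step = (end - start) / (grid_resolution - 1)
    return [(start + i * step).strftime(date_format) for i in range(grid_resolution)]


# Eleven consecutive days reduced to three equally spaced bins.
days = [f"202401{day:02d}" for day in range(1, 12)]
bins = date_bins(days, grid_resolution=3)
```

Formatting the grid points back with the same format keeps the bins comparable with the original column values.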
- property diagnostics
Method diagnostics data.
- abstractmethod explain(model, **kwargs)
- property interpretable_model
Interpretable model.
- static is_missing_value(value)
Determine whether input represents a missing value.
- Parameters:
- value:
Input value.
- Returns:
- bool:
True in case of missing value, False otherwise.
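One plausible way to perform such a check is to compare the string form of the input against the MISSING_VALUES list shown above. The sketch below is an assumption about the behavior, not the library's code, and the function name looks_missing is hypothetical.

```python
# Copied from Method.MISSING_VALUES as listed above.
MISSING_VALUES = [
    "", "?", "None", "nan", "NA", "N/A", "unknown", "inf", "-inf",
    "1.7976931348623157e+308", "-1.7976931348623157e+308",
]


def looks_missing(value) -> bool:
    """Return True if the string form of `value` is a known missing marker.

    Assumption: values are compared by their stripped string representation,
    so None, float('nan') and float('inf') match 'None', 'nan' and 'inf'.
    """
    return str(value).strip() in MISSING_VALUES
```

With this approach, None, "?" and float("nan") are reported as missing, while ordinary numbers and strings are not.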
- property method_name
Method name.
- property method_type
Method type e.g. ‘loco’ or ‘ice’.
h2o_sonar.methods.core.stats module
- class h2o_sonar.methods.core.stats.KolmogorovSmirnovResult(statistic, p_value, same_distribution, p_value_method)
Bases: tuple
- p_value
Alias for field number 1
- p_value_method
Alias for field number 3
- same_distribution
Alias for field number 2
- statistic
Alias for field number 0
- h2o_sonar.methods.core.stats.jensen_shannon_divergence(sample_u: list, sample_v: list) -> float
Calculate the Jensen-Shannon divergence (not distance) between two distributions.
- Parameters:
- sample_u: list
First probability distribution.
- sample_v: list
Second probability distribution.
- Returns:
- float
Jensen-Shannon divergence between the two distributions.
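For reference, the Jensen-Shannon divergence of two discrete probability distributions P and Q is the average of the Kullback-Leibler divergences of P and Q from their mixture M = (P + Q) / 2. The sketch below is a minimal standard-library version under stated assumptions: both inputs are probability vectors of equal length, and the logarithm base is 2 (so the result lies in [0, 1]). The base used by h2o_sonar is not stated here, and the name js_divergence is hypothetical.

```python
import math


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence (not distance) between two distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution M = (P + Q) / 2.
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl_from_mixture(dist):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log(a / m, base) for a, m in zip(dist, mixture) if a > 0)

    return 0.5 * kl_from_mixture(p) + 0.5 * kl_from_mixture(q)
```

The Jensen-Shannon distance is the square root of this value, which is why the two should not be confused.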
- h2o_sonar.methods.core.stats.kolmogorov_smirnov(sample_u: list, sample_v: list, p_calc_method: str = '', logger: SonarLogger | None = None) -> KolmogorovSmirnovResult
Discrete Kolmogorov-Smirnov (KS) test for two samples.
Compare two samples and understand if they come from the same discrete distribution.
KS metric interpretation: The KS statistic is the maximum distance between the empirical cumulative distribution functions (ECDFs) of the two samples. A larger KS statistic indicates a greater difference between the two distributions. The p-value indicates the probability of observing a KS statistic at least as extreme as the one calculated, assuming the null hypothesis (that the two samples come from the same distribution) is true. A small p-value (typically < 0.05) suggests that the null hypothesis can be rejected, indicating a significant difference between the two distributions. The KS test is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples. The KS test is non-parametric, meaning it does not assume any specific distribution for the data. This makes it a versatile tool for comparing distributions, especially when the underlying distributions are unknown or not normally distributed.
- Parameters:
- sample_u: list
First sample.
- sample_v: list
Second sample.
- p_calc_method: str
Method to calculate p-value - options are “auto”, “exact”, and “asymp”.
The exact method uses the exact distribution of the test statistic and is used with “auto” for small samples. The asymp method uses the asymptotic distribution of the test statistic and is used with “auto” for large samples.
- logger: loggers.SonarLogger | None
Logger.
- Returns:
- KolmogorovSmirnovResult
Kolmogorov-Smirnov statistic (0.0 meaning perfect agreement, 1.0 disagreement), p-value (hypothesis testing), same distribution flag, and p-value method.
References
[1] scipy.stats.ks_2samp: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html
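The KS statistic described above, the maximum distance between the two empirical cumulative distribution functions, can be computed with the standard library alone. The sketch below covers only the statistic, not the p-value or the same-distribution flag, and the function name ks_statistic is hypothetical rather than part of h2o_sonar.

```python
from bisect import bisect_right


def ks_statistic(sample_u, sample_v):
    """Maximum absolute difference between the ECDFs of two samples."""
    if not sample_u or not sample_v:
        raise ValueError("both samples must be non-empty")
    u, v = sorted(sample_u), sorted(sample_v)
    # The ECDFs only change at observed values, so it is enough to
    # evaluate them at every distinct point of the pooled sample.
    points = sorted(set(u) | set(v))
    return max(
        abs(bisect_right(u, x) / len(u) - bisect_right(v, x) / len(v))
        for x in points
    )
```

Identical samples give 0.0 and samples with no overlap in range give 1.0, matching the interpretation of the returned statistic.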
- h2o_sonar.methods.core.stats.wasserstein_distance(sample_u: list, sample_v: list) -> float
Calculate the Wasserstein distance between two distributions.
The function expects sorted values but sorts them internally if they are not.
The function works correctly even if the value arrays don’t perfectly overlap.
Wasserstein distance interpretation: the distance represents the minimum “cost” (amount of probability mass multiplied by the distance moved) required to transform distribution 1 into distribution 2.
- Parameters:
- sample_u: list
First sample.
- sample_v: list
Second sample.
- Returns:
- float
Wasserstein distance between the two distributions (lower value meaning higher agreement).
References
[1] scipy.stats.wasserstein_distance: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html
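For one-dimensional samples, the "cost" interpretation above is equivalent to the area between the two empirical cumulative distribution functions. The sketch below computes that area with the standard library. It assumes unweighted numeric samples, and the name wasserstein_1d is hypothetical; it is not the h2o_sonar implementation.

```python
from bisect import bisect_right


def wasserstein_1d(sample_u, sample_v):
    """First Wasserstein distance between two 1-D empirical distributions."""
    if not sample_u or not sample_v:
        raise ValueError("both samples must be non-empty")
    # Sorting is done here, so the inputs may be in any order.
    u, v = sorted(sample_u), sorted(sample_v)
    points = sorted(set(u) | set(v))
    total = 0.0
    # Between consecutive points both ECDFs are constant, so the area is
    # the ECDF gap at the left point times the width of the interval.
    for left, right in zip(points, points[1:]):
        gap = abs(bisect_right(u, left) / len(u) - bisect_right(v, left) / len(v))
        total += gap * (right - left)
    return total
```

Shifting every value of a sample by a constant c yields a distance of |c| from the original sample, which is a quick sanity check for any implementation.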