Change Log
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
v2.15.0 — 2025/5/29
This is a minor H2O Eval Studio release.
Added
Features:
Added configurable GPU acceleration to the following evaluators:
Answer Relevancy (Sentence Similarity) evaluator.
Answer Semantic Sentence Similarity evaluator.
Context Relevancy (Soft Recall and Precision) evaluator.
Fairness Bias evaluator.
Groundedness (semantic similarity) evaluator.
Hallucination evaluator.
Machine Translation (GPTScore) evaluator.
Perplexity evaluator.
Question Answering (GPTScore) evaluator.
Summarization with reference (GPTScore) evaluator.
Summarization without reference (GPTScore) evaluator.
Step Alignment and Completeness evaluator.
Summarization (Completeness and Faithfulness) evaluator.
Toxicity evaluator.
Enhancements:
Added hf-xet to improve Hugging Face models handling performance.
Added onnxruntime-gpu to improve ONNX models performance when a GPU is available.
Changed
lmppl Python dependency 0.0.1 patched with https://github.com/asahi417/lmppl/pull/13 and the wheel moved to the public S3 bucket.
v2.14.0 — 2025/5/22
This is a minor H2O Eval Studio release.
Security
Package langchain upgraded to version 0.3.1 to fix the vulnerability CVE-2024-7042.
Package langchain-community upgraded to version 0.3.1 to fix the vulnerability CVE-2024-7042.
Package openai upgraded to version 1.81.0 as a dependency of langchain to fix the vulnerability CVE-2024-7042.
v2.13.0 — 2025/5/20
This is a minor H2O Eval Studio release.
Added
Evaluators:
Encoding guardrail evaluator - a tool designed to assess the LLM/RAG’s ability to handle encoding attacks. It evaluates whether the system can be tricked into generating incorrect or unexpected outputs through manipulation of the prompt encoding, such as encoding the prompt text using Base64 or Base16, which should be discarded by the guardrails or the system.
Features:
Introducing stateful conversations / multi-turn chats / contextual conversation support for the h2oGPTe client - the client can now maintain the context of the conversation across multiple turns, allowing for new types of evaluations and attacks.
Encoding perturbator - a perturbator which encodes the prompt text using Base16 encoding.
Added the ability to configure and enforce CPU, GPU, or automatic device selection for running predictive and generative models. Automatic device selection is the default.
Added a module which calculates various statistics to compare distributions: Kolmogorov-Smirnov test, Wasserstein distance, and Jensen-Shannon divergence.
H2O Eval Studio now automatically applies configuration overrides from shell environment variables starting with the H2O_SONAR_CFG_ prefix. These environment variables are automatically converted to H2O Eval Studio configuration parameters (primitive values only).
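For illustration only, a minimal sketch of such an override set from Python before the library is used; the parameter name log_level is an assumption, not a documented H2O Eval Studio parameter:

    import os

    # Hypothetical override of a primitive configuration parameter via an environment
    # variable with the H2O_SONAR_CFG_ prefix (the parameter name is an assumption).
    os.environ["H2O_SONAR_CFG_log_level"] = "DEBUG"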
Enhancements:
Introduced NaN tolerance in the heatmap leaderboard, which brings tolerance for NaN values in the average metric value calculation - evaluation results with NaN metric values are ignored if their number is lower than or equal to a given percentage of the total number of evaluation results.
RAGAs family evaluators now support the NaN tolerance, which can be configured using the evaluator parameters.
Test lab completion now supports the auto, "" and None LLM selectors when the test lab is built from h2oGPTe collections. The auto selector lets h2oGPTe automatically select the LLM model for the test lab completion; "" and None inherit the LLM model from the h2oGPTe collection configuration.
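A hedged sketch of how these settings might be expressed; the parameter name nan_tolerance and the variable llm_model_name are assumptions used only for illustration, not documented H2O Eval Studio names:

    # Hypothetical evaluator parameters (the key name is an assumption):
    evaluator_params = {
        "nan_tolerance": 0.1,  # ignore NaN metric values if they are at most 10% of the results
    }

    # LLM selector for test lab completion built from an h2oGPTe collection:
    # "auto" lets h2oGPTe pick the LLM; "" or None inherit the collection's LLM.
    llm_model_name = "auto"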
Fixed
Classification evaluator was fixed to correctly handle unknown labels in HTML report confusion matrices.
Classification leaderboard explanation improved to provide stable confusion matrices in the HTML report with unexpected labels.
Perplexity evaluator no longer requires Open AI API key.
Markdown representation texts are now escaped to preserve formatting and avoid XSS.
Test suites and labs corpus URLs fixed to reflect the AWS S3 bucket migration from eu-central-1 to us-east-1.
Changed
HMLI moved from public to private S3 bucket, which is accessible only from the H2O.ai infrastructure. Therefore, the HMLI wheel dependency must be installed from the private S3 bucket before installing H2O Sonar.
h2oGPTe client upgraded to the custom S3-hosted build h2ogpte-1.6.28.dev8-py3-none-any.whl, which has been moved from a public to a private S3 bucket.
Package pip upgraded to version 25.1.1.
Deprecated
No deprecations.
Removed
No removals.
Security
Package langchain-community upgraded to version 0.2.19 to fix the vulnerability CVE-2024-8309.
v2.12.2 — 2025/4/25
This is a minor H2O Eval Studio release.
Changed
Changed the AWS region for the H2O Eval Studio artifacts from eu-central-1 to us-east-1.
v2.12.1 — 2025/4/4
This is a minor H2O Eval Studio release.
Fixed
Perplexity evaluator no longer requires Open AI API key.
Changed
h2oGPTe client upgraded to version 1.6.27.post1.
v2.12.0 — 2025/3/27
This is a minor H2O Eval Studio release.
Fixed
Uploaded documents purging fixed for h2oGPTe client.
Changed
h2oGPTe client upgraded to version 1.6.25.
NLTK upgraded to version 3.9.1 to fix the vulnerability CVE-2024-39705.
Hugging Face Transformers library upgraded to version 4.50.2.
v2.11.1 — 2025/3/17
This is a patch H2O Eval Studio release.
Added
Enhancements:
Improved performance of test lab completion parallelization when the suite has fewer than 20 test cases.
v2.11.0 — 2025/3/12
This is a minor H2O Eval Studio release.
Added
Enhancements:
The Text matching evaluator uses the expected answer as the condition (exact match) if available, when no condition is specified by the test case.
Fixed
Fixed JSon representations of the LLM evaluation result explanation to contain evaluator descriptor again.
The LLM evaluation result JSon representation no longer includes the typed, structure-friendly metrics serialization by default.
Changed
Tokens presence evaluator renamed to Text matching evaluator.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.10.0 — 2025/3/6
This is a minor H2O Eval Studio release.
Added
Features:
The LLM evaluation result JSon representation now includes serialized metrics which can be described using the proto definitions.
Exponential backoff-driven timeout added to the h2oGPTe client to better handle and report h2oGPTe timeouts.
Enhancements:
Test lab completion parallelization and sharding improved to also parallelize inputs assigned to a particular RAG/LLM model if the number of RAG/LLM models is smaller than a configurable threshold.
Fixed
HTML report performance statistics fixed to handle missing keys.
Changed
h2oGPTe client upgraded to version 1.6.23.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.9.0 — 2025/2/17
This is a minor H2O Eval Studio release.
Added
Enhancements:
Improved Step alignment and completeness evaluator - better step extraction from the retrieved context and model answer, propagation of the dynamic programming metrics and alignment matrix to the HTML report, and a new ability to combine multiple steps into one if the reference or the generated text contains a compound step (left combined, right without the step combination).
Markdown summary of the evaluation newly includes statistics for response times per LLM model.
Identical insights reported by different evaluators are newly deduplicated and reported as a single insight.
Fixed
No fixes.
Changed
The default threshold for the Toxicity evaluator has been changed from 0.75 to 0.25 based on empirical observations and feedback from users.
h2oGPTe client upgraded to version 1.6.22.
Deprecated
No deprecations.
Removed
No removals.
Security
Upgraded scikit-learn to version 1.5.2 to fix the vulnerability CVE-2024-5206.
v2.8.2 — 2025/2/05
This is a minor H2O Eval Studio release.
Changed
h2oGPTe client upgraded to version 1.6.18.post1.
v2.8.1 — 2025/1/13
This is a minor H2O Eval Studio release.
Changed
HMLI wheel dependency location changed to the H2O Eval Studio AWS account.
v2.8.0 — 2025/1/10
This is a minor H2O Eval Studio release.
Added
Evaluators:
Step alignment and completeness evaluator (preview) - a tool for evaluating the steps of procedures, sequences, or process descriptions.
Features:
Support for agent-based and LLM-based perturbators.
New Contextual misinformation perturbator.
Evaluation data:
Test suite evaluation library with 1M+ test cases published at https://eval-studio-artifacts.s3.us-east-1.amazonaws.com/h2o-eval-studio-suite-library/index.html and https://eval-studio-artifacts.s3.us-east-1.amazonaws.com/h2o-eval-studio-suite-library/index.json. Makefile targets to maintain the test suite evaluation library were added to the project.
Documentation
ReStructuredText documentation for the new Step alignment and completeness evaluator.
Added ReStructuredText documentation for the Fact-check evaluator parameters.
Fixed
No fixes.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.7.0 — 2024/12/16
This is a minor H2O Eval Studio release.
Added
Evaluators:
Fact-check evaluator (agent-based).
Enhancements:
Enhanced test lab prompt cache which is meant for testing/demo purposes: improved configuration (environment variable and H2O Sonar configuration), added retrieved context caching.
Documentation
New evaluator documentation for the Fact-check evaluator.
Added documentation for new perturbators which were added in H2O Sonar 2.6.0.
Fixed
Heatmap leaderboard explanation no longer shows empty most difficult prompts section.
v2.6.0 — 2024/12/05
This is a minor H2O Eval Studio release.
Added
Evaluators:
Answer Semantic Sentence Similarity evaluator.
Features:
New character level perturbators - insert/delete random character(s), QWERTY keyboard typos, and common OCR errors.
Fixed
Keywords of RAGAs evaluator, Classification evaluator and GPTScore Q&A evaluator fixed.
RAGAs evaluator metadata in leaderboard serializations fixed to include exactly the metrics it calculates.
Fixed the escaping of special characters in classification class names for the multi-class Classification evaluator.
httpx Python dependency pinned to 0.27.0 to avoid openai Python library issues with an unexpected proxy parameter.
Resolved random hangs that occurred during h2oGPTe RAG retrieved context fetching when using a session connection managed by the resource manager.
Changed
The H2O Eval Studio online models cache has been moved from the root location to the H2O Eval Studio tenant so that models are downloaded from the right location in deployments with internet access (and cached from the right location in air-gapped deployments).
v2.5.4 — 2024/10/14
This is a patch H2O Eval Studio release.
Fixed
Fixed missing and non-float bool metrics in BYOP evaluators.
Fixed punkt caching in the Context relevancy (soft recall and precision) and Answer relevancy (sentence similarity) evaluators.
Fixed keywords metadata in multiple evaluators.
v2.5.3 — 2024/11/13
This is a minor H2O Eval Studio release.
Added
Enhancements:
Perturbators can now work without raising exceptions - instead, they gather the errors and return them in the passed lists.
v2.5.2 — 2024/10/12
This is a patch H2O Eval Studio release.
Fixed
Fixed missing taglines in the evaluator descriptors.
Fixed singular/plural in classification evaluator metadata.
v2.5.1 — 2024/10/09
This is a patch H2O Eval Studio release.
Added
Enhancements:
Improved - shorter and more concise - taglines in evaluators.
v2.5.0 — 2024/10/08
This is a minor H2O Eval Studio release.
Added
Enhancements:
Tagline added to all evaluators to provide a brief description of the evaluator.
Fixed
Fixed bugs / inconsistencies between evaluator metadata and keywords like LLM vs. RAG compatibility.
Fixed Classification evaluator metrics values included in the evaluation result to be consistent with the declared metrics in the evaluator metadata.
Ensured caching of the punkt tokenizer for the Fairness Bias evaluator, Groundedness evaluator, and Hallucination evaluator works correctly in air-gapped deployments.
Fixed Groundedness evaluator AVID error codes and tokenization unpacking.
Changed
Summarization (Completeness and Faithfulness) evaluator excluded from the explainer container as it is resource intensive, expensive, difficult to interpret, and not suitable for use without GPU hardware support.
h2oGPTe client upgraded to version 1.5.26.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.4.0 — 2024/10/14
This is a minor H2O Eval Studio release.
Added
Features:
Amazon Bedrock RAG now supports creation of knowledge bases (collections) from test suites as part of the test lab build and completion.
Enhancements:
The following evaluators now report metric values at sentence granularity in the evaluation results (as actual answer metadata) and highlight problems in the HTML report:
Groundedness evaluator
Toxicity evaluator
Fairness Bias evaluator
Answer Relevancy (sentence similarity) evaluator
Hallucination evaluator
PII evaluator
Sensitive Data evaluator
Token presence evaluator reports which part of the condition caused the evaluation failure. The error message is provided in the meta section of the actual answer metadata and highlighted in the HTML report (error message section).
Summarization evaluator error messages improved to indicate the root cause of the summarization evaluation failure.
Test lab now accepts custom HTTP headers for document caching when building the test lab or synchronizing the documents.
Documentation
Added generative AI section to the introduction of the ReStructuredText documentation.
Added missing licenses to the ReStructuredText documentation.
Fixed
AVID problem taxonomy fixed to report codes in the problems.
Failure to get LLM statistics in the h2oGPTe client no longer causes the evaluation to fail (it is optional to get the statistics).
Fixed casing in the names of evaluators to ensure naming consistency.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.3.0 — 2024/10/07
This is a minor H2O Eval Studio release.
Added
Enhancements:
Text matching evaluator reports result parsing failures in the evaluation results.
The Text matching evaluator's ability to evaluate both the actual answer and the retrieved context is now configurable - the default is to evaluate the actual answer only.
The PII evaluator's and Sensitive data evaluator's ability to evaluate both the actual answer and the retrieved context is now configurable - the default is to evaluate both.
Fixed
Evaluation result JSon representation fixed to correctly serialize infinity and NaN values.
Generation/Retrieval/Generation+Retrieval prefix of model failure errors in the HTML report fixed to be visible again.
Passed and failed test cases counts in the test lab completion progress report fixed to be correctly calculated (when retrieved context failures are not considered).
Fixed missing resolved test cases when building a lab using parallel job completion - if resolution of all test cases fails in a job, the result is kept rather than discarded.
Changed
Boolean leaderboard (JSon, Markdown, dataset) results changed to fail the test case evaluation if the generation fails, and/or retrieval fails, and/or generation+retrieval fails. Previously, retrieval failures were not considered a failure of the test case evaluation, which led to confusing results. Users can enable/disable the retrieval checks in the Text matching evaluator, PII evaluator, and Sensitive data evaluator.
h2oGPTe client upgraded to version 1.5.22.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.2.0 — 2024/09/26
This is a minor H2O Eval Studio release.
Added
Enhancements:
h2oGPTe LLM performance statistics - like cost, input tokens, output tokens and time to the first token - added to the explainable model and Markdown boolean leaderboard explanation.
Markdown report newly includes h2oGPTe LLM vision model associated with the evaluated model.
Conditional evaluation by the Text matching evaluator now reports the sub-condition which caused the evaluation failure.
All row keys and all test cases added to problems reporting that a model did not pass a metric threshold check.
Evaluator descriptor added to LLM result JSon.
Fixed
No fixes.
Changed
Added LLM model metadata to the explainable (LLM and RAG) model, which changes the serialization and deserialization of the model metadata, test labs (impacts H2O Eval Studio) and test results.
Problem attribute test_case_key renamed to test_case_keys and its type changed to a list.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.1.0 — 2024/09/25
This minor H2O Eval Studio release brings the Looping Detection evaluator and smaller enhancements.
Added
Evaluators:
Looping Detection evaluator.
Enhancements:
Amazon Bedrock RAG client models listing sped up.
Evaluated models added to the JSon representation of the evaluation results.
Test case key added to the JSon representation of the evaluation results.
Documentation
Added reStructuredText documentation of the evaluators.
Added prompts documentation for LLM judge-based evaluators.
Fixed
Row key(s), test case keys and model keys added to the problems and insights (where applicable) to simplify the mapping of the evaluation results to the original data.
Changed
Metrics column names of boolean leaderboard evaluators changed from ad hoc names to actual boolean metrics names.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.0.0 — 2024/09/18
This major H2O Eval Studio release brings generative AI evaluation.
Added
Evaluators
Generation evaluation
Answer Correctness evaluator.
Answer Relevancy evaluator.
Answer Relevancy (Sentence Similarity) evaluator.
Answer Semantic Similarity evaluator.
Bring Your Own Prompt (BYOP) evaluator.
Faithfulness evaluator.
Groundedness (semantic similarity) evaluator.
Hallucination evaluator.
Language Mismatch evaluator.
Machine Translation (GPTScore) evaluator.
Perplexity evaluator.
Question Answering (GPTScore) evaluator.
RAGAS evaluator.
Text matching evaluator.
Retrieval evaluation
Context Precision evaluator.
Context Recall evaluator.
Context Relevancy evaluator.
Context Relevancy (Soft Recall and Precision) evaluator.
Privacy evaluation
Contact Information evaluator.
PII evaluator.
Sensitive Data evaluator.
Fairness evaluation
Fairness Bias evaluator.
Sexism evaluator.
Stereotype evaluator.
Toxicity evaluator.
Summarization evaluation
BLEU evaluator.
ROUGE evaluator.
Summarization (Completeness and Faithfulness) evaluator.
Summarization (Judge) evaluator.
Summarization with reference (GPTScore) evaluator.
Summarization without reference (GPTScore) evaluator.
Classification evaluation
Classification evaluator.
Features
Introducing Evaluators as a new type of explainers which are able to evaluate the quality of Retrieval-Augmented Generation (RAG) products.
New evaluator API - the evaluate module to run evaluators and the evaluators module to implement new evaluators and Bring Your Own Evaluator (BYOE); see the usage sketch after this list.
New evaluator-specific datasets based on LlmDataset, and ExplainableRagModel models with implementations for h2oGPTe and OpenAI Assistants with retrieval.
New evaluator testing module with test support bringing test suites, tests, test cases, and test labs.
New genai module with LLM/RAG host clients:
H2O Enterprise h2oGPTe
H2O GPT
H2O LLMOps
OpenAI Chat
OpenAI Assistants with Retrieval tool (version 1) or File Search tool (version 2)
Microsoft Azure hosted OpenAI Chat
OpenAI Chat compatible endpoints
Amazon Bedrock
ollama
HTML report branding for H2O Eval Studio.
Insights - new feature allowing explainers and explanations to provide insights into the evaluation results and suggest actions to be taken.
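A minimal usage sketch of the new evaluator API (referenced in the list above); the package name h2o_sonar and the exact function and argument names are assumptions inferred from the module names, not a verbatim API reference:

    # Hedged sketch with assumed names - run an evaluation over a completed test lab.
    from h2o_sonar import evaluate  # import path is an assumption

    def run_rag_evaluation(test_lab, evaluator_ids, results_dir="./results"):
        # test_lab: a completed test lab; evaluator_ids: evaluator identifiers.
        # The function name and parameters below are assumptions for illustration.
        return evaluate.run_evaluation(
            dataset=test_lab,
            evaluators=evaluator_ids,
            results_location=results_dir,
        )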
Explanations and formats
New leaderboard (heatmap and bool) explanations with support for multiple evaluation metrics along with HTML, JSon and Markdown formats.
New (normalized) evaluator result (EvalResult) and explanation formats (JSon, Markdown).
Enhancements
Installation of H2O Eval Studio using package extras - install only what you need: core, explainers and/or evaluators.
ragas library integration (license).
Testing
New llm pytest label for LLM and RAG tests.
Test suites, test labs and test datasets for the LLM and RAG evaluation: h2oGPTe benchmark, Kaggle LLM Data Science competition, Talk to report, and evalgpt.ai.
Changes
Cython Python dependency upgraded from 0.29.32 to 0.29.37.
Backward compatibility breaking changes:
Python 3.8 is no longer officially supported.
Python 3.9 is no longer officially supported.
Python 3.10 is no longer officially supported.
The JSon file with interpretation parameters which was stored in the interpretation directory is no longer persisted, as it contained duplicate information which can be found in the interpretation.json file.
Documentation
Updated documentation of new features and enhancements.
v2.0.0 Release Candidates
List of 2.0.0 release candidates with the detailed description of the changes:
RC 68 - 2024/09/13
Evaluators:
The new Answer Relevancy (Sentence Similarity) evaluator assesses how relevant the actual answer is by computing the semantic similarity between the question and the actual answer sentences.
The new Context Relevancy (Soft Recall and Precision) evaluator measures the relevancy of the retrieved context based on the semantic similarity of the question and context sentences.
Enhancements:
Toxicity evaluator improved to calculate the toxicity metrics at sentence granularity and report the maximum of the toxicity metric values. This enhancement makes the evaluator results more valuable as it can detect toxic content in the generated text regardless of its length (toxic content can no longer hide in longer actual answers).
Amazon Bedrock model host newly checks the accessibility of the LLM models supported by the RAG and filters out the inaccessible models.
RC 67 - 2024/09/06
Fixes:
Summarization (Completeness and Faithfulness) evaluator fixed to safely use MD5 for the metrics calculation.
Changes:
h2oGPTe client downgraded to version 1.5.16 to integrate with old(er) servers.
RC 66 - 2024/09/06
Fixes:
Evaluated model ID added to the HTML report to simplify mapping of model IDs in the evaluation results (JSon, CSV, frame) to human readable model metadata.
Changes:
h2oGPTe client upgraded to version 1.6.0.dev3.
H2O Eval Studio leaderboards Markdown representation title heading level changed to H2.
RC 65 - 2024/09/05
Features:
Amazon Bedrock model host support - evaluation of Amazon Bedrock RAG - knowledge bases (collections) and configured LLM models.
Fixes:
Perturbation flip detection fixed - it didn’t consider answers created by the different RAG/LLM models and reported false negatives.
RC 64 - 2024/09/03
Fixes:
h2oGPTe LLM models listing retries fixed to avoid flakiness and ensure the listing is performed at least once.
Documentation:
Comprehensive update of evaluator documentation: formulas, methods, prompts, links to used models, and fixes.
RC 63 - 2024/08/29
Enhancements:
H2O Eval Studio Markdown representation revamp - new header section for bool/heat/class based leaderboard summaries, model/prompt/… failure sections truncated to at most 3 entries to scale the UI in case of many failures.
Model vectara/hallucination_evaluation_model, which is used by the Hallucination evaluator, updated to HHEM-2.1-Open and frozen to avoid model changes.
Added retries to the h2oGPTe client to avoid flakiness when listing base LLM models.
Improved rendering of the multinomial classification confusion matrices in the HTML report.
RC 62 - 2024/08/27
Fixes:
Fixed broken retrieval and generation error messages construction in the Text matching evaluator.
Model and prompt leaderboard in the HTML report/Markdown/JSon representations - result failures are shown based on generation failures (not the union of retrieval and generation failures), which ensures that failures and passes sum to 100%.
Model failure entries colors in the HTML report fixed - if the problem is in retrieval, only the context is in red. If the problem is in a generation, then the actual answer is in red.
Input field in the model failure list of the H2O Eval Studio markdown de-duplicated. Missing fields added to be on par with the HTML.
RC 61 - 2024/08/22
Enhancements:
Groundedness (semantic similarity) evaluator documentation updated.
Improved Hallucination evaluator error reporting on too long retrieved context chunks.
More robust perturbation flip direction detection.
RC 60 - 2024/08/21
Evaluators:
The new Groundedness evaluator assesses the groundedness of the generated text by considering the retrieved context - measuring hallucinations and fabricated text. It reports problems at sentence granularity in order to identify the root causes of hallucinations and fabricated text.
Enhancements:
Added infrastructure to detect the low number of evaluation examples in evaluators and report it as a problem.
Problems are newly categorized using the AVID taxonomy: https://docs.avidml.org/taxonomy/effect-sep-view/security
Fixes:
Threshold consistency between evaluator thresholds and metrics threshold defaults fixed.
Propagation of actual threshold values to the JSon leaderboard representation fixed.
Exception handling in the test lab completion on the parallel job failure fixed.
RC 59 - 2024/08/20
Enhancements:
Test lab completion progress reporting is now more detailed - it includes prompt, LLM, and RAG/LLM host names.
Fixes:
Rounding of metrics values in insights, problems and Markdown representations aligned to 4 decimal places. Percentage values are rounded to 1 decimal place.
Changes:
h2oGPTe client upgraded to version 1.5.11.
RC 58 - 2024/08/14
Fixes:
Links to explanation data in the HTML report changed from directories to files in case of H2O Eval Studio branding as (S3) directories cannot be listed in case of the H2O Eval Studio deployment.
Changes:
Rollback to the vulnerable NLTK 3.8.1 (CVE-2024-39705) Python dependency as 3.8.2 has been purged from pypi.org.
RC 57 - 2024/08/14
Enhancements:
The default h2oGPTe client timeout to get an answer from the LLM or RAG collection is now 420s (was 1000s).
Metrics values in Markdown are newly rounded to 4 decimal places.
Fixes:
Perturbed test suites are now cloned when the perturbation is not done in place.
GPTScore threshold parameter description fixed in the evaluator metadata.
Hiding H2O Eval Studio specific texts in the HTML report in case of H2O Eval Studio branding.
RC 56 - 2024/08/12
Enhancements:
Added detection of Summarization evaluator failures on all dataset rows and fail fast via raising an exception.
Added precondition check on empty evaluation results to all leaderboard types.
Evaluator metadata lookup made possible for incompatible evaluators in the HTML report.
Test lab completion no longer uses “shard” terminology, but “parallel job” instead.
The English variant of punkt from NLTK is newly cached as the model used by the evaluators.
Changes:
Updated vulnerable NLTK 3.8.1 (CVE-2024-39705) Python dependency to fixed version 3.8.2.
RC 55 - 2024/08/09
Fixes:
Minor robustness fix in the handling of extra argument passed to the h2oGPTe client.
RC 54 - 2024/08/08
Fixes:
Fixed problem detection in the Answer semantic similarity evaluator flip detection.
RAGAs evaluator fixed to declare all metrics it calculates in the metadata.
RAGAs evaluator docstring changed to announce RAGAs metrics only in the documentation.
RC 53 - 2024/08/08
Fixes:
Perturbation of a test suite using multiple perturbators no longer creates exponential number of perturbed test cases. Instead, there are original tests with their test cases and perturbed tests with their perturbed test cases. Thus the number of test cases is 2x the original number of test cases.
Changes:
Internal perturbation API of test suites, tests and test cases changed to support multiple perturbators so that the perturbations can be created in place and relationships properly set.
RC 52 - 2024/08/07
Enhancements:
Test lab completion now fails fast - raises an exception - if completion of all the test lab's test cases fails.
Evaluations, interpretations and their JSon representations have a new error field which contains the error message in case of evaluation/interpretation failure.
Changes:
h2oGPTe client upgraded to version 1.5.11-dev2.
RC 51 - 2024/08/06
Enhancements:
Missing expected answer in the test case is reported as a problem by the evaluators.
Fixes:
The HTML report generator doesn’t fail on an invalid explainer ID when getting the display name, but returns the ID with a prefix. An error message is logged.
RC 50 - 2024/08/02
Enhancements:
Progress report in the test lab completion no longer includes a full prompt, but just a prefix.
RC 49 - 2024/08/02
Enhancements:
Brief evaluator descriptions were shortened - they now contain just the first paragraph of the full description.
Evaluators check whether actual answers in test cases/suites/labs have the correct type and, if not, they generate the corresponding problems.
Air-gapped deployment support improved - 3rd party models used by the evaluators/evaluation libraries are newly frozen (where possible) to prevent model changes.
Fixes:
When attempting to complete the test lab for exactly one model in parallel, the test lab automatically switches to serial mode.
Insights about the fastest/slowest/cheapest/most expensive models are not generated for the evaluations with exactly one model.
Changes:
Progress reports generated by evaluators now start with the display names of the evaluators rather than IDs.
RC 48 - 2024/07/30
Enhancements:
Brief evaluator description added to the public API - list_evaluators() and describe_evaluator() now return it.
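A hedged usage sketch follows; list_evaluators() and describe_evaluator() are the public API functions named above, but the import path and the argument passed to describe_evaluator() are assumptions:

    from h2o_sonar import evaluate  # import path is an assumption

    evaluators = evaluate.list_evaluators()              # listing now includes the brief description
    print(evaluate.describe_evaluator(evaluators[0]))    # argument type is an assumption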
RC 47 - 2024/07/30
Features:
Added support of the Open AI RAG version 2.0 - Assistants with File Search tool.
New conditions in the Token Presence evaluator - new syntax which brings support of NOT and parentheses for complex conditions (see the illustrative condition below).
Red teaming test suite with various LLM/RAG attacks added to the repository. This test suite can be used for penetration testing of LLM/RAG models.
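An illustrative condition only - the exact condition grammar is not given in this release note, so the example below is an assumption:

    # Hypothetical Token Presence condition using the new NOT and parentheses support.
    condition = "('refund' AND 'policy') AND NOT 'credit card'"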
Enhancements:
The improved test lab API allows completing test labs of a RAG system using given (existing) collections instead of creating new ones. This API allows the user to create, configure and customize the collections, upload the corpus and documents, and then use them in the test lab completion.
Evaluator container newly detects invalid LLM dataset rows which contain RAG/LLM host error messages instead of the actual data and reports them as problems.
Evaluators now provide a brief description in addition to the full description.
Perturbators now ensure that the perturbed data are not equal to the original data and fail if the perturbation did not change the data.
Connection configuration has a new extra_params dictionary field which can be used to pass additional parameters to the connection client. For example, setting the timeout parameter on the h2oGPTe connection will apply it to all requests (that support it) made by the h2oGPTe client; see the sketch after this list.
Versions of cached/downloaded models used by evaluators - like vectara/hallucination_evaluation_model or gpt2-medium - are now frozen to avoid model changes.
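A hedged sketch of the extra_params usage referenced above; the surrounding connection fields and their values are placeholders, only the extra_params field and the timeout example come from this release note:

    # Hypothetical connection configuration expressed as a plain dictionary.
    h2ogpte_connection = {
        "server_url": "https://h2ogpte.example.com",  # placeholder
        "api_key": "<H2OGPTE_API_KEY>",               # placeholder
        "extra_params": {
            "timeout": 420,  # applied to all h2oGPTe client requests that support it
        },
    }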
Fixes:
Negative (RAG/LLM) cost of the prompt is reported as a problem by evaluators which create boolean leaderboards. The cost is also set to 0.0 in the evaluation results to minimize the impact of the cost on the evaluation.
Changes:
h2oGPTe client upgraded to version 1.5.8.
Perturbation probability intensity increased in the Qwerty and Antonym perturbators to ensure sufficient perturbation of the data.
Security:
setuptools upgraded to 70.0.0 to fix vulnerability CVE-2024-6345.
OpenAI RAG version 2.0 support brings an upgrade of the openai Python library from version 1.20.0 to version 1.35.13, which fixes LangChain community vulnerability CVE-2024-2965.
RC 46 - 2024/06/28
Evaluators:
Four new GPTScore-based evaluators: evaluation of summarizations with reference summaries, evaluation of summarizations without reference summaries, evaluation of machine translations, and evaluation of question answering.
Features:
Evaluation / interpretation API can list all and incompatible evaluators / explainers.
Enhancements:
Evaluators assessing boolean metrics, such as token presence or PII leakage, now have the ability to use custom metric names and descriptions to make reports and evaluation data more comprehensive.
Evaluators now have keywords indicating whether they require an LLM judge, prompt, expected answer, actual answer, retrieved context or constraints.
Significantly improved descriptions of all evaluators - descriptions are mostly generated from the evaluator class metadata.
Problems are newly sorted by severity (from highest to lowest).
Insights are sorted by type (alphabetically).
All and incompatible evaluators/explainers are now shown in the evaluation report.
Fixes:
Missing threshold added to parametrizable BYOP evaluator.
Breaking changes:
Evaluator keyword sr-11-7-ongoing-analysis has been fixed to the correct sr-11-7-ongoing-monitoring keyword.
Documentation:
reStructuredText documentation of the evaluators rewritten - every evaluator has brief description, requirements, evaluation method, evaluation metrics, insights, and problems sections.
RC 45 - 2024/06/25
Enhancements:
New Random character type perturbator.
Fixes:
Integrity checks and validation of the model configuration (like embeddings, tokenization, temperature, token limits) used to build the test lab.
Changes:
Interpretation/evaluation is marked as successful if at least one evaluator successfully finishes.
h2oGPTe client upgraded to version 1.5.1-dev7.
Python 3.11 dependencies upgraded: cryptography to version 42.0.8, scikit-learn to version 1.5.0, and toml to version 0.10.2.
RC 44 - 2024/06/14
Enhancements:
h2oGPTe client upgraded to version 1.5.0-dev21 to support the upcoming H2O Enterprise h2oGPTe release.
Colorized evaluation status added to the HTML report.
Crash of an evaluator is now reported as a high severity problem and causes the evaluation to be marked as failed. However, the evaluation continues with the other evaluators.
An attempt to run a non-registered evaluator is now reported as a high severity problem and causes the evaluation to be marked as failed. However, the evaluation continues with the other evaluators.
Improved measurements of the LLM latency in the GenAI client.
Fixes:
Fixed duplicate prompts in the model weak points (the most difficult prompts) section of the HTML report.
RC 43 - 2024/06/11
Features:
Ability to configure h2oGPTe, h2oGPT, H2O LLMOps, ollama, OpenAI chat, OpenAI RAG, and Microsoft Azure hosted OpenAI clients to control the evaluation of LLM models (for instance temperature) and RAG systems (for instance embeddings provider, system prompt or prompt template).
Enhancements:
All perturbators are now deterministic for improved robustness and testability (except the synonym and antonym perturbators, which are deterministic in testing only).
Synonym and antonym perturbators improved with an eager synonym/antonym swap which tries to match the percentage of words swapped (prior to the fix, perturbators tried only x times, and if the new synonym/antonym was the same word, nothing was swapped).
Fixes:
Fixed all perturbators for issues with special tokens in de/tokenization, such as undesired spaces around expressions in parentheses after detokenization.
Security:
Upgraded the scikit-learn library to version 1.5.0 to solve vulnerabilities detected by SNYK.
Upgraded the cryptography library to version 42.0.8 to mitigate vulnerabilities detected by SNYK.
Documentation:
reStructuredText documentation of the evaluation and new features (host configuration) with configuration prototype examples.
RC 42 - 2024/05/31
Enhancements:
Keyword groups for grouping of keywords which are used to tag evaluators.
H2O Eval Studio purpose keyword group which organizes evaluators into disjoint sets.
RC 41 - 2024/05/30
Evaluators:
New perplexity evaluator for LLMs which calculates the perplexity - “measure of uncertainty” - of the generated text.
Enhancements:
Safe JSon data decoder for NaN and infinities.
H2O Eval Studio can be configured whether to use GPU or CPU for the evaluation.
Fixes:
HTML report generation fixed in case that evaluation of all rows in the dataset fails.
RC 40 - 2024/05/29
Evaluators:
New summary evaluator provides completeness and faithfulness metrics for LLM summarization tasks evaluation without the need for a reference summary.
Features:
Insights - new feature allowing explainers and explanations to provide insights into the evaluation results and suggest actions to be taken.
Enhancements:
Evaluation JSon and HTML result includes overall evaluation result represented as one value which is based on the severity of the problems detected in the evaluation. It is represented as traffic light colors (green, yellow, red) in the HTML report.
All evaluators report insights about the evaluation results and suggest actions to be taken via insight enhancements in bool, heatmap and classification leaderboards explanations.
Text matching, PII and Sensitive data leakage evaluators report, apart from problems and accuracy-related insights, also insights about the cost and performance (speed) of evaluated models.
Models section in the HTML report rewritten to contain model details, insights, and problems.
Example PIIs (emails, credit cards, SSNs) in the PII evaluator are no longer reported as problems. These false positives are now marked as False in the evaluation results.
Test lab statistics.
Fixes:
Hallucination evaluator fixed to correctly handle low values as hallucinations (not vice versa).
Changes:
Bool leaderboard JSon representation values (and metrics metadata) changed from percentages to the [0.0, 1.0] float range.
RC 39 - 2024/05/06
Enhancements:
ragas library upgraded to version 0.1.7.
Fixes:
Added on-demand caching of tiktoken's BLOBs which are used by the ragas library.
Fixed Faithfulness evaluator and RAGAs evaluator flakiness (NaN) via the ragas library upgrade.
RC 38 - 2024/05/03
Features:
ollama (https://ollama.com/) hosted LLMs support - new connection, client and test lab builder.
Enhancements:
All evaluators detect metric flips and report them in the evaluation results as problems. In case of boolean metrics, a flip is detected as a change from True to False and vice versa. In case of numeric metrics, a flip is detected as a change from above to below the threshold and vice versa. In case of classification, a flip is detected as a change from correct to incorrect classification and vice versa.
Changes:
Introducing relationships among test cases, which adds a new relationships key to test case, test suite and test lab, as well as a relationships column to LLM dataset and LLM evaluation result. JSon representations (key) and CSV representations are extended (column). Old JSon files are deserialized in a loosely coupled way to avoid backward compatibility breaking changes.
Added key field to inputs in the test lab.
Added key field/column to LLM dataset inputs (rows).
Added key field/column to evaluation result inputs (rows).
Fixes:
Fixed undesired retries in the RAG/LLM test lab completion of h2oGPTe LLM and H2O LLMOps hosts in case of the successful completion of the test cases.
Fixed NaN (not a number) handling in leaderboard palette color lookup.
RC 37 - 2024/04/25
Evaluators:
New Classification evaluator for RAGs/LLMs used for classification problems. The evaluator calculates common metrics used for binomial and multinomial classification problems like accuracy, precision, recall and F1. The Classification evaluator also brings a new classification leaderboard explanation.
Features:
New perturbations module with the ability to perturb the input data (5 perturbation methods) in order to test the robustness of the RAGs/LLMs and the quality of the data: comma, word swap, QWERTY, synonym and antonym.
New public perturbations API with list, filtering and (multiple) perturbation methods application to string, test case, test suite or LLM dataset prompts.
3 new summarization tests for evaluation of summaries both with and without reference summary (Frank, SamSum and SummEval).
Enhancements:
Format specifier in evaluation metrics metadata changed from Python f-strings to JavaScript D3 format strings.
Fixes:
Ranges in evaluation metrics metadata fixed - [0, 1] vs. [0, 100].
Testing:
RAG/LLM test suite can finish successfully even if OpenAI API key is not set (auto reconfiguration to 3rd party judges; tests which use OpenAI endpoints are skipped).
RC 36 - 2024/04/18
Fixes:
OpenAI client fixed to version 1.20.0 to keep version 1 API compatibility (OpenAI Assistants code in H2O Eval Studio must be rewritten to version 2 to move from retrieval tool to file search).
RC 35 - 2024/04/18
Features:
New metrics metadata - all evaluators now declare the metrics they calculate with the metadata (name, description, type, unit, range, scale, …). Metrics metadata are used in the evaluator (descriptor, evaluation, results), in the leaderboards (JSon representation, HTML report generation), and explanation/evaluation formats (JSon, HTML, Markdown).
Loosely coupled serialization and deserialization of object/JSon data structures: ExplainerDescriptor, ExplanationDescriptor, ConfigItem and FilterEntry.
Caching of the models used (internally) by evaluators and explainers: public API, caching module, and caching configuration enabling air-gapped evaluator deployment.
Backward compatibility breaking changes:
data key added to heatmap and bool leaderboards JSon representations.
RC 34 - 2024/04/12
Fixes:
NaN (not a number) handling/encoding in the heatmap leaderboard JSon “all metrics” data file.
RC 33 - 2024/04/12
Features:
Microsoft Azure hosted OpenAI LLMs support - new connection, client and test lab builder.
H2O LLMOps hosted LLMs support - new connection, client and test lab builder.
Security:
HTTPS requests SSL certificate verification configuration: H2O Sonar configuration controls the SSL certificate verification process/level in requests library, LLM hosts client libraries and other HTTP(S) clients.
Changes:
H2O GPT client rewritten to OpenAI API client (please update server port and base URL).
H2O LLMOps client rewritten to OpenAI API client (no configuration changes needed).
Base URL parameter removed from OpenAI API client constructor (connection configuration is used).
RC 32 - 2024/04/10
Enhancements:
Constants for keys in the datasets.py Python module.
Documentation:
BLEU and ROUGE evaluators .rst documentation.
RC 31 - 2024/04/08
Evaluators:
BLEU evaluator.
ROUGE evaluator.
Enhancements:
New keywords for the most important ML problem types solved by RAGs/LLMs: question answering, information retrieval, summarization, classification (binomial and multinomial) and regression. All evaluators were decorated with relevant keywords.
New keyword for the referential user role: regulator.
Fixes:
NaN (not a number) handling in the evaluator results, formats and leaderboard.
Security:
nltk added as an evaluators Python extras dependency.
rouge-score added as an evaluators Python extras dependency.
punkt is a newly cached NLTK model for text-to-sentence tokenization.
RC 30 - 2024/03/27
Enhancements:
Toxicity evaluator reimplemented to directly use the toxicity library and show several metrics which explain what type of toxic content has been detected in the answer.
Fairness bias evaluator reimplemented to directly use a bias detection model (in ONNX format) for the evaluation.
Hallucination evaluator reimplemented to use an LLM judge for hallucination detection.
Security:
deepeval Python dependency removed. Evaluators based on deepeval were rewritten to use the underlying libraries without relying on deepeval.
TensorFlow and DBias Python dependencies removed. The Fairness Bias evaluator no longer relies on the DBias Python library as the underlying model was ported from TensorFlow to ONNX.
HMLI moved from the core H2O Eval Studio dependencies to the explainers package extras in order to avoid the CVE vulnerabilities which must be fixed for H2O Eval Studio cloud deployment certification.
H2O-3 moved from the core H2O Eval Studio dependencies to the explainers package extras in order to avoid the CVE vulnerabilities which must be fixed for H2O Eval Studio cloud deployment certification.
RC 29 - 2024/03/21
Security:
HMLI upgraded to MLI version 1.10.26 to mitigate CVE-2023-39013 (HMLI’s Duke dependency vulnerability).
RC 28 - 2024/03/17
Features:
Bring Your Own Judge (BYOJ) - ability to configure H2O Eval Studio so that evaluators use custom LLM judges, for instance to ensure privacy and avoid sending sensitive data to a 3rd party. This feature includes reconfiguration of the embeddings provider for the same reasons. Custom judges can be either forced from the H2O Eval Studio configuration or specified in the evaluator parameters.
Bring Your Own Prompt (BYOP) - ability to easily run an evaluation just by providing a prompt template, or implement a new evaluator just by inheriting from the BYOP abstract class and specifying a prompt which returns a boolean value.
OpenAI LLM client (only Assistants with retrieval tool was supported before). The client supports both OpenAI service (no base URL specified) and OpenAI compatible endpoints (base URL specified).
Evaluators:
Contact Information evaluator (BYOP).
Language Mismatch evaluator (BYOP).
Parametrizable BYOP evaluator with the ability to specify the prompt template in the evaluator parameters.
Sexism evaluator (BYOP).
Stereotype evaluator which detects undesired gender/race content in the answer (BYOP).
Summarization evaluator (BYOP).
Enhancements:
ragas library upgraded to version 0.1.3.
RC 27 - 2024/03/14
Security:
Fairness Bias evaluator removed as it used the dbias library which depends on a vulnerable TensorFlow version. This change ensures there is no TensorFlow, un-registers the evaluator and skips all evaluator tests (code is kept in the codebase).
Enhancements:
Problems are now loaded when loading the evaluation from its JSon representation.
QA:
MMC builds disabled (it was extra cost in addition to GH Actions build; MMC has old Python version)
RC 26 - 2024/03/08
Features and enhancements:
progress reporting:
end to end, evaluation, all evaluators, lab (build and completion)
callback or file-system
evaluators can be filtered by labels for:
SR 11-7
NIST AI MRM
HTML report refactored
sections shuffled by importance
new evaluation (details) group added
dataset section content reordering
explanation and title added to insight leaderboards
Markdown representation robustness
input/output escaping
LLM vs. RAG failures listing fixed
ES summary .md redesigned
improved text matching regexp error messages and docstring (ES UI)
improved .rst documentation
evaluator parameters refactored from standalone file to i/e.json
Changes:
evaluation result is stored on the file system (no longer discarded)
Fixed:
3x faster lab completion (fixed duplicate requests)
hangs/deadlocks in the lab completion (configurable multiprocessing)
QA:
new GH Actions test suite in GREEN since 09117c3ce891e410e68e62361d109f179ed4c79f
GHA builds and test H2O EvalStudio deployment runtime configuration only
improved h2oGPT/h2oGPTe test server selection (config switch)
method to purge h2oGPTe relics
new h2oGPT servers
RC 25 - 2024/02/05
Evaluators:
Fairness bias evaluator (deepeval based).
RC 24 - 2024/02/02
Evaluators:
Toxicity evaluator (deepeval based).
Fixed:
PII and sensitive data leakages (regexps).
RC 23 - 2024/01/26
Features:
LLM/RAG clients telemetry.
Prompt cache: LLM/RAG responses can be cached when building a test lab. The cache can be built from an existing test lab and used in RD only mode.
Enhancements:
LLM/RAG client retries (3 by default).
Evaluators which require OpenAI key are tagged using keywords.
…
RC 22 - 2024/01/22
Enhancements:
Changed h2oGPTe dependency to the latest Python package version.
Fixes:
Hiding retrieval errors in the bool leaderboard.
Evaluation tests:
Removal of constraint OR expressions from test suites/labs for the Atlanta event as H2O Eval Studio does not support them yet.
RC 21 - 2024/01/21
Fixes:
Fixed RAGAs leaderboard calculation.
Retrieved context builder enhancements.
Tests:
OpenAI end to end CI test which runs all evaluators.
Evaluation tests:
Polished, fixed (duplicate prompts) and extended SR 11-7 and Bank teller test suites.
RC 20 - 2024/01/18
Evaluators:
Sensitive data leakage evaluator.
Enhancements:
Test lab build fallbacks: dummy doc for RAG.
Fixes:
OpenAI test lab build (missing arguments).
Evaluation tests:
SR 11-7 test suite w/ 171 prompts.
RC 19 - 2024/01/18
Evaluators:
PII evaluator.
Fixes:
Asynchronous interpretation execution fixed (inconsistent method signatures).
Changes:
datatable upgraded from the AWS S3 hosted version to the 1.1.0 pypi.org hosted version.
v1.2.0 — 2023/10/31
Talk to H2O Eval Studio report - upload your interpretation report to the Enterprise h2oGPT in order to find out more about your model, data, problems, insights and suggested (mitigation) actions.
Added
Features
Ability to upload your interpretation report to Enterprise h2oGPT either using the Python API (run_interpretation() method parameter, upload_interpretation() method) or the CLI. The feature is supported with Python 3.10 and Python 3.11 only.
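A hedged sketch of the Python API route; run_interpretation() and upload_interpretation() are the methods named above, but the import path, argument names, and the name of the upload-triggering parameter are assumptions:

    from h2o_sonar import interpret  # import path is an assumption

    def interpret_and_upload(dataset_path, model, workdir="./results"):
        # Run an interpretation first; the exact parameters are assumptions.
        interpretation = interpret.run_interpretation(
            dataset=dataset_path,
            model=model,
            results_location=workdir,
        )
        # Then upload the existing interpretation report to Enterprise h2oGPT;
        # the argument accepted by upload_interpretation() is an assumption.
        interpret.upload_interpretation(interpretation)
        return interpretation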
Documentation
H2O.ai documentation theme.
Fixed
Wheels are no longer built with the legacy pip resolver which was causing dependency conflicts in some cases on certain platforms.
Test and validation dataset details are now shown in the HTML report.
Opened port / Driverless AI server port check is no longer verbose.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v1.1.2 — 2023/10/13
A patch release bringing minor fixes and enhancements.
Added
No additions.
Fixed
SHAP library version pinned to shap>=0.40.0,<=0.42.5 as the new version is causing instability in feature importance explainers.
Changed
H2O Model Validation upgraded to 0.16.3 with updated h2osteam and H2O MLOps clients which avoid version clashes in upcoming H2O.ai Cloud notebook kernels.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v1.1.1 — 2023/10/9
A patch release bringing minor fixes and enhancements.
Added
Both CLI and Python API accept library configuration and encryption key parameters in case the interpretation arguments are provided as JSon.
Fixed
HTML interpretation report path in the CLI output fixed (it was pointing to the interpretation HTML index).
False positive feature importance leak detection is no longer reported in case of multinomial problems.
Morris Sensitivity Analysis no longer fails in case of non-numeric boolean columns presence in the training dataset.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v1.1.0 — 2023/10/03
Integration of H2O Eval Studio and H2O Model Validation projects.
Added
New explainers
Adversarial Similarity explainer.
Backtesting explainer.
Drift Detection explainer (reports exceeded PSI threshold as a problem).
Size Dependency explainer.
Segment Performance explainer.
Calibration Score explainer.
Features
H2O Model Validation based explainers are able to use H2O AIEM hosted Driverless AI, H2O Enterprise Steam hosted Driverless AI or any H2O Driverless AI which uses username/password authentication.
Ability of H2O Eval Studio to run with or without H2O Model Validation library installed. If H2O Model Validation is not available, then H2O Model Validation based explainers just indicate incompatibility and do not cause the interpretation to fail.
Portable export and import of MVTest related instances like settings, results, artifacts, and logs. The implementation is based on JSon, CSV, and a directory hierarchy; therefore it can be used by a wide range of tools, programming languages, and runtimes.
RemoteHandles bring support for remote (Driverless AI) datasets and models. Apart from being a data structure, it is a part of explainer metadata and compatibility checks.
A model is no longer required when running a new interpretation, which allows running explainers on datasets only.
Automatic fallback guess of the model metadata - like problem type, labels, and used features - in case the model does not provide them.
Enhancements
Attributes (dictionary) added to the ProblemAndAction class, which enables explainers to pass machine-processable data from problems to actions for further actionability.
Connections and licenses are now identified by unique keys (identifiers) in the H2O Eval Studio configuration and throughout the runtime.
Python 3.10 support.
Python 3.11 support - H2O Model Validation explainers not available as transitive library dependencies do not support Python 3.11.
daimojo library pre-heat prediction to activate MOJO model introspection.
Interpretations index HTML path added to the CLI interpretation output.
Completion of the testset and validset handling implementation in the explainer container - datasets are now passed to explainers along with their metadata.
The following configuration keys were added to the H2O Eval Studio library configuration: server_id, environment_url, token_use_type.
Shapley Values for Original Features (Kernel SHAP Method) explainer is approximately 3x faster in case of multinomial problems (the speed up is proportional to the number of classes - more classes, more speed up).
Utilities
Shapley contributions sorter which can be used by all Shapley-based explainers whenever multi-class contributions are reported within the same frame - makes the code cleaner and simpler.
Documentation
Library configuration CLI API reStructuredText documentation.
Jupyter Notebook with examples of how to run H2O Model Validation explainers using the Python API and CLI.
reStructuredText documentation of all new H2O Model Validation based explainers.
New explainers overview table with per-explainer features and requirements added to both README.md and the reStructuredText documentation.
Explainers overview diagram is newly organized according to the functional architecture of explainers.
Tests
Python and CLI tests of all H2O Model Validation explainers.
Fixed
Shapley Values for Original Features (Kernel SHAP Method) explainer reports per-class contributions in the case of multinomial problems (contributions were mixed together).
Morris Sensitivity Analysis explainer fixed to work with InterpretML 0.1.20.
Pseudocode and Python code generated by the Decision Tree explainer is consistent again.
HTML report fixed to properly handle if no explainer is run within the interpretation.
Thread safe interpretation executor shutdown.
Changed
The following configuration key was changed in the H2O Eval Studio library configuration: client_refresh_token has been renamed to token.
Deprecated
No deprecations.
Removed
Test suites which were replaced by Pytest markers.
Tests of legacy Driverless AI models (Makefile targets, S3 archives).
Security
No security enhancements.
v1.0.0 — 2023/6/30
The first stable H2O Eval Studio release.
Added
Enhancements
Multiple sampling methods for the explainer dataset (stratified, random, head).
Configurable out-of-memory (OOM) protection.
Improved ability of the interpretable model to extract scikit-learn model metadata.
Utilities
Random attack utility that tests H2O Eval Studio on many datasets and models: it gets a directory with datasets as a parameter, trains a scikit-learn model for a random dataset and one of its columns, and finally runs all the explainers to test H2O Eval Studio (see the sketch below).
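An illustrative sketch of the random-attack idea described above, not the utility's actual code or API; run_all_explainers() is a hypothetical placeholder, and the sketch is classification-only for brevity.

```python
# Illustrative sketch of the random-attack idea: pick a random dataset and
# target column, train a throwaway scikit-learn model, run the explainers.
import random
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def run_all_explainers(model, dataset, target_col):
    """Hypothetical placeholder for the H2O Eval Studio interpretation run."""
    ...


def random_attack(dataset_dir: str) -> None:
    datasets = list(Path(dataset_dir).glob("*.csv"))
    dataset_path = random.choice(datasets)         # pick a random dataset
    df = pd.read_csv(dataset_path)
    target = random.choice(list(df.columns))       # pick a random target column
    X = pd.get_dummies(df.drop(columns=[target]))  # naive numeric encoding
    model = RandomForestClassifier().fit(X, df[target])
    run_all_explainers(model=model, dataset=df, target_col=target)
```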
Documentation
Explainers overview diagram indicates whether the explainer reports problem(s).
Configuration management documentation (including encryption).
Per-explainer problem reporting capabilities documentation.
Fixed
Summary Shapley explainer and Original feature importance explainer fixed to properly use the SHAP library to get Shapley values for regression vs. multinomial problems (experiment type detection).
Disparate Impact Analysis calculation fixes (comparisons in metrics) in the case of string features.
Decision tree Python code and pseudo-code generator fixed.
HTML report fixed to properly display explanations type and format(s).
Division by zero fixed in the progress reporting runtime.
Changed
CLI, JSON, and Python parameter names were unified - this change breaks backward compatibility and was intentionally done before the first stable release.
Deprecated
No deprecations.
Removed
No removals.
Security
Added encryption of sensitive fields in the H2O Eval Studio configuration (config, CLI, documentation).
v0.11.2 — 2023/7/26
Added
Enhancements
No enhancements.
Fixed
Fixed the .py/pseudocode generated by the Decision Tree explainer: > vs. >= comparison.
Changed
Upgrade MLI jar to 1.10.23
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.11.1 — 2023/5/22
Handle missing value bins for PD when OOR is enabled and output histogram data to PD results.
Added
Enhancements
Output previously missing histogram data to PD results.
Fixed
Correctly handle missing value bins for PD when OOR is enabled.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.11.0 — 2023/4/24
Leak detection added to feature importance explainers.
Added
Enhancements
Leak detection added to feature importance explainers: Shapley Values for Original Features (naive method) explainer, Morris Sensitivity Analysis explainer, Shapley Values for Original Features (Kernel SHAP method) explainer.
Missing values are treated as a separate bin in the PD explainer.
H2O Eval Studio CLI can read arguments from a JSON file (see the sketch after this list).
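A hedged sketch of what an argument file for the CLI could look like; the parameter names, file name, and the CLI option used to point at the file are assumptions and may differ in your H2O Eval Studio version.

```python
# Illustrative only: write interpretation arguments into a JSON file that the
# CLI can consume. All names below are hypothetical.
import json

cli_args = {
    "dataset": "data/train.csv",        # hypothetical parameter name
    "model": "models/model.pickle",     # hypothetical parameter name
    "target_col": "label",              # hypothetical parameter name
}

with open("interpretation-args.json", "w") as fh:
    json.dump(cli_args, fh, indent=2)
```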
Fixed
Fixed display of plots in Jupyter notebooks.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.10.1 — 2023/3/9
Patch release bringing Result (documentation) enhancements.
Added
No new features.
Fixed
No fixes.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.10.0 — 2023/2/2
New Dataset and Model Insights explainer and fixes of bugs found by a new random attack.
Added
Explainers
New Dataset and Model Insights explainer.
Enhancements
Residual Decision Tree explainer newly highlights the whole path to the highest residual in the visualized tree.
DIA result API help related to the reference level improved.
Fixed
Surrogate Decision Tree Python code generator fixed: added missing parentheses in boolean expressions; features can have any characters in their names.
Moved from os.rename to shutil.move in order to ensure that the operation will not fail if the source and target are on different file systems (see the sketch after this list).
Missing isna symbol used in the Disparate Impact Analysis explainer.
Comparison of strings and bools in the ICE method.
Float division by zero in the Residual Decision Tree explainer.
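The reason for preferring shutil.move over os.rename, shown with standard-library behavior only; the paths are illustrative.

```python
# os.rename() can fail with OSError (e.g. EXDEV, "Invalid cross-device link")
# when source and target live on different file systems; shutil.move() falls
# back to copy-and-delete, so it works across file system boundaries.
import os
import shutil

src = "/tmp/results/archive.zip"           # illustrative path
dst = "/mnt/shared/results/archive.zip"    # may be on a different file system

try:
    os.rename(src, dst)
except OSError:
    shutil.move(src, dst)
```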
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.9.0 — 2023/1/13
Minor H2O Eval Studio release which brings asynchronous interpretation execution.
Added
Features
New option that allows interpretations to be run asynchronously.
Enhancements
New introspection API for Result classes (method parameters).
Fixed
Square root of MSE is now used to get RMSE in the Surrogate Decision Tree explainer (see the sketch after this list).
Handling of date, time, and datetime features in the PD/ICE explainer.
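For reference, the corrected relationship between MSE and RMSE; a minimal sketch using scikit-learn's metric, not the explainer's internal code, with illustrative values.

```python
# RMSE is the square root of MSE.
from math import sqrt

from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]   # illustrative values
y_pred = [2.5, 5.0, 4.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
rmse = sqrt(mse)
```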
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.8.0 — 2022/12/8
New Partial Dependence for 2 Features explainer and enhancements for H2O Sonar explainer container implementation for Driverless AI.
Added
Explainers
New Partial Dependence for 2 Features explainer.
Features
New Global 3D Data result, explanation and associated formats (JSON, CSV).
Enhancements
Command-line interface with pretty-printed listing of explainers, improved formatting of explainer descriptions, and an H2O Eval Studio version show action.
Residual PD/ICE for multinomial problems added.
Improved explainer container resolution and creation (identifier, instance).
Model agnostic API to indicate the ability to provide/calculate Shapley values added.
Improved compatibility checks and new compatibility error type.
Explainable model’s features metadata simplification, completion and consolidation.
Explainable dataset’s metadata simplification, completion and consolidation.
Improved HTML report highlights failed explainers, brings a comprehensive overview section, and shows new model and dataset metadata fields.
Documentation
Added Jupyter Notebook documentation of how to run H2O Eval Studio in the Internal H2O.ai Cloud.
Fixed
Disparate Impact Analysis explanations completed to be 100% binary compatible with Driverless AI’s Grammar of MLI (entities).
Disparate Impact Analysis explainer feature resolution for DIA calculation rewritten.
Disparate Impact Analysis explainer and PD/ICE explainer fixed to work on a dataset with string (target) column(s).
Residual PD/ICE no longer returns the regular PD/ICE as the default representation (with residuals as an extension); it now returns the residual PD/ICE.
Residual PD/ICE HTML fragment representation path to images fixed so that it no longer renders the same charts for all classes.
Summary Shapley explainer name correctly indicates the SHAP method (not the incorrect naive Shapley method).
Changed
Features metadata class of the explainable model has been refactored to the h2o_sonar.methods.core.method module and all constant references consolidated to this class.
The operating system used to build the Linux distribution and wheels has been changed from Ubuntu 20.04 to Ubuntu 18.04 to ensure that H2O Eval Studio wheels work both on this and newer Ubuntu versions.
Deprecated
No deprecations.
Removed
No removals.
Security
MLI upgrade to 1.10.21 to mitigate CVE-2022-2048 and CVE-2022-25647.
v0.7.0 — 2022/10/18
H2O Eval Studio beta release with Bring Your Own Explainer based extensibility, reporting of model problems, a new Residual PD/ICE explainer, a new Morris Sensitivity Analysis explainer, and various smaller enhancements.
Added
Features
BYOE - Bring Your Own Explainer.
Model problems and actions.
Explainers
New Residual Partial Dependence/Individual Conditional Expectations explainer.
New Morris Sensitivity Analysis explainer.
Residual Decision Tree explainer reports problems and actions.
Explanations
New interpretation report - structure, content, and theme in H2O.ai colors.
Organization of explainers to functional groups.
Utilities
Improved label encoder to simplify the use of 3rd party libraries that require numeric (non-categorical) features. Label encoder is integrated into both explainable dataset and explainable model APIs.
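Conceptually, a label encoder maps categorical values to integers so that libraries requiring numeric features can be used. The sketch below uses scikit-learn's encoder only as an analogy for the idea; it is not H2O Eval Studio's own implementation.

```python
# Conceptual illustration of label encoding (scikit-learn used as an analogy).
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green", "red"]   # categorical feature values

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)             # e.g. [2, 1, 0, 1, 2]
decoded = encoder.inverse_transform(encoded)        # back to the original labels
```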
Command-line interface
All Python API’s interpretation parameters are newly available on CLI.
Documentation
Added Getting started with BYOE.
Fixed
HTML report paths to images and explanations are relative and valid regardless of the results directory location.
Explainer container runtime and explainers stabilized to work on raw (non-sanitized) datasets.
Explainers listing action help fixed on the command line interface.
Changed
The list_explainers() method on both the Python API and CLI lists all explainers by default (before this change it listed only basic explainers with run-by-default keywords).
Logging consolidated to the single module h2o_sonar.loggers and loggers renamed/refactored so that logging can be used both in methods and explainers.
Migration of the explainer container runtime from the HMLI to the h2o wheel dependency.
Parameter path of the zip() method used by the explainer's Result class has been changed to file_path to make it consistent with other Result parameters.
Result classes refactored from explainer implementations into consolidated and reusable result classes for the main supported explanation types.
The summary() method's functionality is moved to params() and the new summary() method returns the summary of the explanation (content of result_descriptor.json); see the sketch after this list.
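A minimal caller-side sketch of the Result API changes above. Only the method and parameter names (zip(file_path=...), params(), summary()) come from this change log; the dummy class and its return values are assumptions used to make the example self-contained.

```python
# Illustrative only: a dummy stand-in for an explainer result object.
class DummyResult:
    def zip(self, file_path: str) -> None:   # 'path' parameter renamed to 'file_path'
        ...

    def params(self) -> dict:                # former summary() functionality
        return {"explainer": "...", "parameters": {}}

    def summary(self) -> dict:               # content of result_descriptor.json
        return {"explanation_types": ["..."]}


result = DummyResult()
result.zip(file_path="explanation.zip")
run_parameters = result.params()
explanation_summary = result.summary()
```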
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.6.0 — 2022/9/8
New Friedman’s H-statistic and Residual Surrogate Decision Tree explainers, Driverless AI REST interface model support and improved HTML interpretation representation.
Added
Explainers
Friedman’s H-statistic explainer for feature behavior explanations.
Residual Surrogate Decision Tree for model debugging (new default explainer).
Model support
Added Driverless AI REST interface model support.
Explanations
Significantly improved HTML interpretation representation with new explanation charts for every explainer, interpretation parameters, and explainer parameters.
Command line interface
Added parameter to run all explainers (not just basic explainers).
Interpretation listing including HTML representation.
Documentation
Bring Your Own Explainer templates and examples added to distributions.
Fixed
Improved scikit-learn multinomial models support with labels lookup.
Compatibility check function gets all available parameters for more advanced checks.
DIA HTML fragment representation path to images.
In-memory persistence store (keys) stabilization.
Logging names and interpretation and explainer logging keys consistency.
Changed
hmli and daimojo dependencies updated.
Source distribution - tarball - build changed so that it doesn't contain .whl files.
Binary distributions are built for every supported platform.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.5.0 — 2022/8/16
Fix release which brings binary distribution with improved documentation and Jupyter Notebook examples.
Added
Documentation
Improved ReStructuredText documentation with getting started, library documentation (interpretation, configuration, explainers), licenses, and change log.
New and improved Jupyter Notebook examples.
Model support
Added pickled (scikit-learn) model interpretability.
Command line interface
Added parameters to specify features used by the model and per-explainer parameters.
Fixed
Summary Shapley explainer stabilization: scatter plot feature values fixed, main chart includes all features, regression/binomial/multinomial labels fixed, max_features parameter honored, and per-class multinomial explanations are generated in all supported formats.
Fixed the simple mock model prediction function and added SHAP method support for mock models.
Changed
Models and datasets - used by examples, demos and tests - consolidated and refactored to indicate dataset and model type.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.4.2 — 2022/11/29
Fix of the following MLI Java backend security issues: CVE-2022-2048 and CVE-2022-25647.
Added
Fixed
Changed
Deprecated
No deprecations.
Removed
No removals.
Security
MLI upgrade to 1.10.17.2 to mitigate CVE-2022-2048 and CVE-2022-25647.
v0.4.1 — 2022/11/17
Fix of the following MLI Java backend security issues: CVE-2022-2048 and CVE-2022-25647.
Added
Fixed
Changed
Deprecated
No deprecations.
Removed
No removals.
Security
MLI upgrade to 1.10.17.1 to mitigate CVE-2022-2048 and CVE-2022-25647.
v0.4.0 — 2022/6/29
New Transformed Feature Importance explainer for Driverless AI MOJO models and preparation for H2O Eval Studio integration to Driverless AI.
Added
Explainers
Transformed Feature Importance explainer for Driverless AI MOJO models.
Explainer container API and CLI
H2O Eval Studio version available in runtime.
Documentation
Jupyter Notebook with interpretation result API for the new explainer.
H2O Eval Studio explainers overview diagram updated.
Fixed
Changed
Core H2O Eval Studio dependencies updated to be aligned with Driverless AI 1.10.4; two separate builds will be available going forward - regular and Driverless AI.
Deprecated
No deprecations.
Removed
No removals.
Security
MLI upgrade to 0.10.17 to mitigate CVE-2022-25647.
v0.3.0 — 2022/6/22
New Kernel SHAP feature Importance explainer.
Added
Explainers
Kernel SHAP Feature Importance explainer for all supported interpretable models.
Explainer container API and CLI
H2O-3 is automatically started (or reused) based on the H2O-3 configuration.
CLI rewrite to provide more accurate help, error reporting and robust execution.
Documentation
Jupyter Notebook with interpretation result API for the new explainer.
Fixed
Interpretation HTML representation links are no longer broken when a relative path is used.
Explainers’ summary method returns the correct (non-empty) parameters of the explainer run.
Disparate Impact Analysis explainer core dump on invalid target column specification.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.2.0 — 2022/6/3
New Feature Importance explainer for Driverless AI MOJO models.
Added
Explainers
Naive Shapley Feature Importance explainer for Driverless AI MOJO models.
Explainer container API and CLI
List-explainers action to get available explainer IDs or descriptors.
Documentation
Jupyter Notebook with interpretation result API for the new explainer.
Fixed
CLI: log level specification case insensitivity.
macOS: Driverless AI MOJO import made local.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.1.0 — 2022/5/27
Initial H2O Eval Studio internal MVP release.
Added
Explainers
Partial dependence/Individual Conditional Expectations explainer (PD/ICE)
Shapley summary plot explainer
Decision tree explainer
Disparate Impact Analysis explainer (DIA)
Explainer container with public explainer APIs
Interpretation, model, dataset, explainer and persistence API.
Explainer container (runtime).
File-system and in-memory persistence.
Easy to use API for retrieval of explainer results.
Model vendor support
Scikit-learn models.
H2O-3 models.
Driverless AI MOJO models.
Command line interface
CLI support of MOJO and pickled models interpretations.
Documentation
Per-explainer Jupyter Notebook with interpretation result API.
Installation, Getting Started and Reference Guide (Sphinx/HTML).
Fixed
No fixes (initial release).
Changed
No changes (initial release).
Deprecated
No deprecations (initial release).
Removed
No removals (initial release).
Security
No security fixes (initial release).