Change Log
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
v2.15.0 — 2025/5/29
This is a minor H2O Eval Studio release.
Added
Features:
Added configurable GPU acceleration to the following evaluators:
Answer Relevancy (Sentence Similarity) evaluator.
Answer Semantic Sentence Similarity evaluator.
Context Relevancy (Soft Recall and Precision) evaluator.
Fairness Bias evaluator.
Groundedness (semantic similarity) evaluator.
Hallucination evaluator.
Machine Translation (GPTScore) evaluator.
Perplexity evaluator.
Question Answering (GPTScore) evaluator.
Summarization with reference (GPTScore) evaluator.
Summarization without reference (GPTScore) evaluator.
Step Alignment and Completeness evaluator.
Summarization (Completeness and Faithfulness) evaluator.
Toxicity evaluator.
Enhancements:
Added hf-xet to improve Hugging Face models handling performance.
Added onnxruntime-gpu to improve ONNX models performance when a GPU is available.
Changed
lmppl Python dependency 0.0.1 patched with https://github.com/asahi417/lmppl/pull/13 and the wheel moved to the public S3 bucket.
v2.14.0 — 2025/5/22
This is a minor H2O Eval Studio release.
Security
Package langchain upgraded to version 0.3.1 to fix the vulnerability CVE-2024-7042.
Package langchain-community upgraded to version 0.3.1 to fix the vulnerability CVE-2024-7042.
Package openai upgraded to version 1.81.0 as a dependency of langchain to fix the vulnerability CVE-2024-7042.
v2.13.0 — 2025/5/20
This is a minor H2O Eval Studio release.
Added
Evaluators:
Encoding guardrail evaluator - a tool designed to assess the LLM/RAG’s ability to handle encoding attacks. It evaluates whether the system can be tricked into generating incorrect or unexpected outputs through manipulation of the prompt encoding, such as encoding the prompt text using Base64 or Base16, which should be discarded by the guardrails or the system.
Features:
Introducing stateful conversations / multi-turn chats / contextual conversation support for the h2oGPTe client - the client can now maintain the context of the conversation across multiple turns, allowing for new types of evaluations and attacks.
Encoding perturbator - a perturbator which encodes the prompt text using Base16 encoding.
Added the ability to configure and enforce CPU, GPU, or automatic device selection for running predictive and generative models. Automatic device selection is the default.
Added a module which calculates various statistics to compare distributions: Kolmogorov-Smirnov test, Wasserstein distance, and Jensen-Shannon divergence.
H2O Eval Studio now automatically applies configuration overrides from shell environment variables starting with the H2O_SONAR_CFG_ prefix. These environment variables are automatically converted to H2O Eval Studio configuration parameters (primitive values only).
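For illustration only, a minimal sketch of such an override set from Python before the library is used; the parameter name log_level is an assumption, not a documented H2O Eval Studio parameter:

    import os

    # Hypothetical override of a primitive configuration parameter via an environment
    # variable with the H2O_SONAR_CFG_ prefix (the parameter name is an assumption).
    os.environ["H2O_SONAR_CFG_log_level"] = "DEBUG"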
Enhancements:
Introduced NaN tolerance in the heatmap leaderboard, which brings tolerance for NaN values in the average metric value calculation - evaluation results with NaN metric values are ignored if their number is lower than or equal to a given percentage of the total number of evaluation results.
RAGAs family evaluators now support the NaN tolerance, which can be configured using the evaluator parameters.
Test lab completion now supports the auto, "" and None LLM selectors when the test lab is built from h2oGPTe collections. The auto selector lets h2oGPTe automatically select the LLM model for the test lab completion; "" and None inherit the LLM model from the h2oGPTe collection configuration.
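A hedged sketch of how these settings might be expressed; the parameter name nan_tolerance and the variable llm_model_name are assumptions used only for illustration, not documented H2O Eval Studio names:

    # Hypothetical evaluator parameters (the key name is an assumption):
    evaluator_params = {
        "nan_tolerance": 0.1,  # ignore NaN metric values if they are at most 10% of the results
    }

    # LLM selector for test lab completion built from an h2oGPTe collection:
    # "auto" lets h2oGPTe pick the LLM; "" or None inherit the collection's LLM.
    llm_model_name = "auto"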
Fixed
Classification evaluator was fixed to correctly handle unknown labels in HTML report confusion matrices.
Classification leaderboard explanation improved to provide stable confusion matrices in the HTML report with unexpected labels.
Perplexity evaluator no longer requires Open AI API key.
Markdown representation texts are now escaped to preserve formatting and avoid XSS.
Test suites and labs corpus URLs fixed to reflect the AWS S3 bucket migration from eu-central-1 to us-east-1.
Changed
HMLI moved from public to private S3 bucket, which is accessible only from the H2O.ai infrastructure. Therefore, the HMLI wheel dependency must be installed from the private S3 bucket before installing H2O Sonar.
h2oGPTe client upgraded to the custom S3-hosted build h2ogpte-1.6.28.dev8-py3-none-any.whl, which has been moved from a public to a private S3 bucket.
Package pip upgraded to version 25.1.1.
Deprecated
No deprecations.
Removed
No removals.
Security
Package langchain-community upgraded to version 0.2.19 to fix the vulnerability CVE-2024-8309.
v2.12.2 — 2025/4/25
This is a minor H2O Eval Studio release.
Changed
Changed the AWS region for the H2O Eval Studio artifacts from eu-central-1 to us-east-1.
v2.12.1 — 2025/4/4
This is a minor H2O Eval Studio release.
Fixed
Perplexity evaluator no longer requires Open AI API key.
Changed
h2oGPTe client upgraded to version 1.6.27.post1.
v2.12.0 — 2025/3/27
This is a minor H2O Eval Studio release.
Fixed
Uploaded documents purging fixed for h2oGPTe client.
Changed
h2oGPTe client upgraded to version 1.6.25.
NLTK upgraded to version 3.9.1 to fix the vulnerability CVE-2024-39705.
Hugging Face Transformers library upgraded to version 4.50.2.
v2.11.1 — 2025/3/17
This is a patch H2O Eval Studio release.
Added
Enhancements:
Improved performance of test lab completion parallelization when the suite has fewer than 20 test cases.
v2.11.0 — 2025/3/12
This is a minor H2O Eval Studio release.
Added
Enhancements:
The Text matching evaluator uses the expected answer as the condition (exact match) if available, when no condition is specified by the test case.
Fixed
Fixed JSon representations of the LLM evaluation result explanation to contain evaluator descriptor again.
The LLM evaluation result JSon representation no longer includes the typed, structure-friendly metrics serialization by default.
Changed
Tokens presence evaluator renamed to Text matching evaluator.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.10.0 — 2025/3/6
This is a minor H2O Eval Studio release.
Added
Features:
The LLM evaluation result JSon representation now includes serialized metrics which can be described using the proto definitions.
Exponential backoff-driven timeout added to the h2oGPTe client to better handle and report h2oGPTe timeouts.
Enhancements:
Test lab completion parallelization and sharding improved to also parallelize inputs assigned to a particular RAG/LLM model if the number of RAG/LLM models is smaller than a configurable threshold.
Fixed
HTML report performance statistics fixed to handle missing keys.
Changed
h2oGPTe client upgraded to version 1.6.23.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.9.0 — 2025/2/17
This is a minor H2O Eval Studio release.
Added
Enhancements:
Improved Step alignment and completeness evaluator - better step extraction from the retrieved context and model answer, propagation of the dynamic programming metrics and alignment matrix to the HTML report, and a new ability to combine multiple steps into one if the reference or the generated text contains a compound step (left combined, right without the step combination).
Markdown summary of the evaluation newly includes statistics for response times per LLM model.
Identical insights reported by different evaluators are newly deduplicated and reported as a single insight.
Fixed
No fixes.
Changed
The default threshold for the Toxicity evaluator has been changed from 0.75 to 0.25 based on empirical observations and feedback from users.
h2oGPTe client upgraded to version 1.6.22.
Deprecated
No deprecations.
Removed
No removals.
Security
Upgraded scikit-learn to version 1.5.2 to fix the vulnerability CVE-2024-5206.
v2.8.2 — 2025/2/05
This is a minor H2O Eval Studio release.
Changed
h2oGPTe client upgraded to version 1.6.18.post1.
v2.8.1 — 2025/1/13
This is a minor H2O Eval Studio release.
Changed
HMLI wheel dependency location changed to the H2O Eval Studio AWS account.
v2.8.0 — 2025/1/10
This is a minor H2O Eval Studio release.
Added
Evaluators:
Step alignment and completeness evaluator (preview) - a tool for evaluating the steps of procedures, sequences, or process descriptions.
Features:
Support for agent-based and LLM-based perturbators.
New Contextual misinformation perturbator.
Evaluation data:
Test suite evaluation library with 1M+ test cases published at https://eval-studio-artifacts.s3.us-east-1.amazonaws.com/h2o-eval-studio-suite-library/index.html and https://eval-studio-artifacts.s3.us-east-1.amazonaws.com/h2o-eval-studio-suite-library/index.json. Makefile targets to maintain the test suite evaluation library were added to the project.
Documentation
ReStructuredText documentation for the new Step alignment and completeness evaluator.
Added ReStructuredText documentation for the Fact-check evaluator parameters.
Fixed
No fixes.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.7.0 — 2024/12/16
This is a minor H2O Eval Studio release.
Added
Evaluators:
Fact-check evaluator (agent-based).
Enhancements:
Enhanced test lab prompt cache which is meant for testing/demo purposes: improved configuration (environment variable and H2O Sonar configuration), added retrieved context caching.
Documentation
New evaluator documentation for the Fact-check evaluator.
Added documentation for new perturbators which were added in H2O Sonar 2.6.0.
Fixed
Heatmap leaderboard explanation no longer shows empty most difficult prompts section.
v2.6.0 — 2024/12/05
This is a minor H2O Eval Studio release.
Added
Evaluators:
Answer Semantic Sentence Similarity evaluator.
Features:
New character level perturbators - insert/delete random character(s), QWERTY keyboard typos, and common OCR errors.
Fixed
Keywords of RAGAs evaluator, Classification evaluator and GPTScore Q&A evaluator fixed.
RAGAs evaluator metadata in leaderboard serializations fixed to include exactly the metrics it calculates.
Fixed the escaping of special characters in classification class names for the multi-class Classification evaluator.
httpx Python dependency pinned to 0.27.0 to avoid openai Python library issues with an unexpected proxy parameter.
Resolved random hangs that occurred during h2oGPTe RAG retrieved context fetching when using a session connection managed by the resource manager.
Changed
The H2O Eval Studio online models cache has been moved from the root location to the H2O Eval Studio tenant so that models are downloaded from the right location in deployments with internet access (and cached from the right location in air-gapped deployments).
v2.5.4 — 2024/10/14
This is a patch H2O Eval Studio release.
Fixed
Fixed missing and non-float bool metrics in BYOP evaluators.
Fixed punkt caching in the Context relevancy (soft recall and precision) and Answer relevancy (sentence similarity) evaluators.
Fixed keywords metadata in multiple evaluators.
v2.5.3 — 2024/11/13
This is a minor H2O Eval Studio release.
Added
Enhancements:
Perturbators can now work without raising exceptions - instead, they gather the errors and return them in the passed lists.
v2.5.2 — 2024/10/12
This is a patch H2O Eval Studio release.
Fixed
Fixed missing taglines in the evaluator descriptors.
Fixed singular/plural in classification evaluator metadata.
v2.5.1 — 2024/10/09
This is a patch H2O Eval Studio release.
Added
Enhancements:
Improved - shorter and more concise - taglines in evaluators.
v2.5.0 — 2024/10/08
This is a minor H2O Eval Studio release.
Added
Enhancements:
Tagline added to all evaluators to provide a brief description of the evaluator.
Fixed
Fixed bugs / inconsistencies between evaluator metadata and keywords like LLM vs. RAG compatibility.
Fixed Classification evaluator metrics values included in the evaluation result to be consistent with the declared metrics in the evaluator metadata.
Ensured caching of the punkt tokenizer for the Fairness Bias evaluator, Groundedness evaluator, and Hallucination evaluator works correctly in air-gapped deployments.
Fixed Groundedness evaluator AVID error codes and tokenization unpacking.
Changed
Summarization (Completeness and Faithfulness) evaluator excluded from the explainer container as it is resource intensive, expensive, difficult to interpret, and not suitable for use without GPU hardware support.
h2oGPTe client upgraded to version 1.5.26.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.4.0 — 2024/10/14
This is a minor H2O Eval Studio release.
Added
Features:
Amazon Bedrock RAG now supports creation of knowledge bases (collections) from test suites as part of the test lab build and completion.
Enhancements:
The following evaluators now report metric values at sentence granularity in the evaluation results (as actual answer metadata) and highlight problems in the HTML report:
Groundedness evaluator
Toxicity evaluator
Fairness Bias evaluator
Answer Relevancy (sentence similarity) evaluator
Hallucination evaluator
PII evaluator
Sensitive Data evaluator
Token presence evaluator reports which part of the condition caused the evaluation failure. The error message is provided in the meta section of the actual answer metadata and highlighted in the HTML report (error message section).
Summarization evaluator error messages improved to indicate the root cause of the summarization evaluation failure.
Test lab now accepts custom HTTP headers for document caching when building the test lab or synchronizing the documents.
Documentation
Added generative AI section to the introduction of the ReStructuredText documentation.
Added missing licenses to the ReStructuredText documentation.
Fixed
AVID problem taxonomy fixed to report codes in the problems.
Failure to get LLM statistics in the h2oGPTe client no longer causes the evaluation to fail (it is optional to get the statistics).
Fixed casing in the names of evaluators to ensure naming consistency.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.3.0 — 2024/10/07
This is a minor H2O Eval Studio release.
Added
Enhancements:
Text matching evaluator reports result parsing failures in the evaluation results.
The Text matching evaluator's ability to evaluate both the actual answer and the retrieved context is now configurable - the default is to evaluate the actual answer only.
The PII evaluator's and Sensitive data evaluator's ability to evaluate both the actual answer and the retrieved context is now configurable - the default is to evaluate both.
Fixed
Evaluation result JSon representation fixed to correctly serialize infinity and NaN values.
Generation/Retrieval/Generation+Retrieval prefix of model failure errors in the HTML report fixed to be visible again.
Passed and failed test cases counts in the test lab completion progress report fixed to be correctly calculated (when retrieved context failures are not considered).
Fixed missing resolved test cases when building a lab using parallel job completion - if resolution of all test cases fails in a job, the result is kept rather than discarded.
Changed
Boolean leaderboard (JSon, Markdown, dataset) results changed to fail the test case evaluation if the generation fails, and/or retrieval fails, and/or generation+retrieval fails. Previously, retrieval failures were not considered a failure of the test case evaluation, which led to confusing results. Users can enable/disable the retrieval checks in the Text matching evaluator, PII evaluator, and Sensitive data evaluator.
h2oGPTe client upgraded to version 1.5.22.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.2.0 — 2024/09/26
This is a minor H2O Eval Studio release.
Added
Enhancements:
h2oGPTe LLM performance statistics - like cost, input tokens, output tokens and time to the first token - added to the explainable model and Markdown boolean leaderboard explanation.
Markdown report newly includes h2oGPTe LLM vision model associated with the evaluated model.
Conditional evaluation by the Text matching evaluator now reports the sub-condition which caused the evaluation failure.
All row keys and all test cases added to problems reporting that a model did not pass a metric threshold check.
Evaluator descriptor added to LLM result JSon.
Fixed
No fixes.
Changed
Added LLM model metadata to the explainable (LLM and RAG) model, which changes the serialization and deserialization of the model metadata, test labs (impacts H2O Eval Studio) and test results.
Problem attribute test_case_key renamed to test_case_keys and its type changed to a list.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.1.0 — 2024/09/25
This minor H2O Eval Studio release brings the Looping Detection evaluator and smaller enhancements.
Added
Evaluators:
Looping Detection evaluator.
Enhancements:
Amazon Bedrock RAG client models listing sped up.
Evaluated models added to the JSon representation of the evaluation results.
Test case key added to the JSon representation of the evaluation results.
Documentation
Added reStructuredText documentation of the evaluators.
Added prompts documentation for LLM judge-based evaluators.
Fixed
Row key(s), test case keys and model keys added to the problems and insights (where applicable) to simplify the mapping of the evaluation results to the original data.
Changed
Metrics column names of boolean leaderboard evaluators changed from ad hoc names to actual boolean metrics names.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v2.0.0 — 2024/09/18
This major H2O Eval Studio release brings generative AI evaluation.
Added
Evaluators
Generation evaluation
Answer Correctness evaluator.
Answer Relevancy evaluator.
Answer Relevancy (Sentence Similarity) evaluator.
Answer Semantic Similarity evaluator.
Bring Your Own Prompt (BYOP) evaluator.
Faithfulness evaluator.
Groundedness (semantic similarity) evaluator.
Hallucination evaluator.
Language Mismatch evaluator.
Machine Translation (GPTScore) evaluator.
Perplexity evaluator.
Question Answering (GPTScore) evaluator.
RAGAS evaluator.
Text matching evaluator.
Retrieval evaluation
Context Precision evaluator.
Context Recall evaluator.
Context Relevancy evaluator.
Context Relevancy (Soft Recall and Precision) evaluator.
Privacy evaluation
Contact Information evaluator.
PII evaluator.
Sensitive Data evaluator.
Fairness evaluation
Fairness Bias evaluator.
Sexism evaluator.
Stereotype evaluator.
Toxicity evaluator.
Summarization evaluation
BLEU evaluator.
ROUGE evaluator.
Summarization (Completeness and Faithfulness) evaluator.
Summarization (Judge) evaluator.
Summarization with reference (GPTScore) evaluator.
Summarization without reference (GPTScore) evaluator.
Classification evaluation
Classification evaluator.
Features
Introducing Evaluators as a new type of explainers which are able to evaluate the quality of Retrieval-Augmented Generation (RAG) products.
New evaluator API - the evaluate module to run evaluators and the evaluators module to implement new evaluators and Bring Your Own Evaluator (BYOE); see the usage sketch after this list.
New evaluator-specific datasets based on LlmDataset, and ExplainableRagModel models with implementations for h2oGPTe and OpenAI Assistants with retrieval.
New evaluator testing module with test support bringing test suites, tests, test cases, and test labs.
New genai module with LLM/RAG host clients:
H2O Enterprise h2oGPTe
H2O GPT
H2O LLMOps
OpenAI Chat
OpenAI Assistants with Retrieval tool (version 1) or File Search tool (version 2)
Microsoft Azure hosted OpenAI Chat
OpenAI Chat compatible endpoints
Amazon Bedrock
ollama
HTML report branding for H2O Eval Studio.
Insights - new feature allowing explainers and explanations to provide insights into the evaluation results and suggest actions to be taken.
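A minimal usage sketch of the new evaluator API (referenced in the list above); the package name h2o_sonar and the exact function and argument names are assumptions inferred from the module names, not a verbatim API reference:

    # Hedged sketch with assumed names - run an evaluation over a completed test lab.
    from h2o_sonar import evaluate  # import path is an assumption

    def run_rag_evaluation(test_lab, evaluator_ids, results_dir="./results"):
        # test_lab: a completed test lab; evaluator_ids: evaluator identifiers.
        # The function name and parameters below are assumptions for illustration.
        return evaluate.run_evaluation(
            dataset=test_lab,
            evaluators=evaluator_ids,
            results_location=results_dir,
        )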
Explanations and formats
New leaderboard (heatmap and bool) explanations with support for multiple evaluation metrics along with HTML, JSon and Markdown formats.
New (normalized) evaluator result (EvalResult) and explanation formats (JSon, Markdown).
Enhancements
Installation of H2O Eval Studio using package extras - install only what you need: core, explainers and/or evaluators.
ragas library integration (license).
Testing
New llm pytest label for LLM and RAG tests.
Test suites, test labs and test datasets for the LLM and RAG evaluation: h2oGPTe benchmark, Kaggle LLM Data Science competition, Talk to report, and evalgpt.ai.
Changes
Cython Python dependency upgraded from 0.29.32 to 0.29.37.
Backward compatibility breaking changes:
Python 3.8 is no longer officially supported.
Python 3.9 is no longer officially supported.
Python 3.10 is no longer officially supported.
The JSon file with interpretation parameters which was stored in the interpretation directory is no longer persisted, as it contained duplicate information which can be found in the interpretation.json file.
Documentation
Updated documentation of new features and enhancements.
v2.0.0 Release Candidates
List of 2.0.0 release candidates with the detailed description of the changes:
RC 68 - 2024/09/13
Evaluators:
The new Answer Relevancy (Sentence Similarity) evaluator assesses how relevant the actual answer is by computing the semantic similarity between the question and the actual answer sentences.
The new Context Relevancy (Soft Recall and Precision) evaluator measures the relevancy of the retrieved context based on the semantic similarity of the question and context sentences.
Enhancements:
Toxicity evaluator improved to calculate the toxicity metrics at sentence granularity and report the maximum of the toxicity metric values. This enhancement makes the evaluator results more valuable as it can detect toxic content in the generated text regardless of its length (toxic content can no longer hide in longer actual answers).
Amazon Bedrock model host newly checks the accessibility of the LLM models supported by the RAG and filters out the inaccessible models.
RC 67 - 2024/09/06
Fixes:
Summarization (Completeness and Faithfulness) evaluator fixed to safely use MD5 for the metrics calculation.
Changes:
h2oGPTe client downgraded to version 1.5.16 to integrate with old(er) servers.
RC 66 - 2024/09/06
Fixes:
Evaluated model ID added to the HTML report to simplify mapping of model IDs in the evaluation results (JSon, CSV, frame) to human readable model metadata.
Changes:
h2oGPTe client upgraded to version 1.6.0.dev3.
H2O Eval Studio leaderboards Markdown representation title heading level changed to H2.
RC 65 - 2024/09/05
Features:
Amazon Bedrock model host support - evaluation of Amazon Bedrock RAG - knowledge bases (collections) and configured LLM models.
Fixes:
Perturbation flip detection fixed - it didn’t consider answers created by the different RAG/LLM models and reported false negatives.
RC 64 - 2024/09/03
Fixes:
h2oGPTe LLM models listing retries fixed to avoid flakiness and ensure the listing is performed at least once.
Documentation:
Comprehensive update of evaluator documentation: formulas, methods, prompts, links to used models, and fixes.
RC 63 - 2024/08/29
Enhancements:
H2O Eval Studio Markdown representation revamp - new header section for bool/heat/class based leaderboard summaries, model/prompt/… failure sections truncated to at most 3 entries to scale the UI in case of many failures.
Model vectara/hallucination_evaluation_model, which is used by the Hallucination evaluator, updated to HHEM-2.1-Open and frozen to avoid model changes.
Added retries to the h2oGPTe client to avoid flakiness when listing base LLM models.
Improved rendering of the multinomial classification confusion matrices in the HTML report.
RC 62 - 2024/08/27
Fixes:
Fixed broken retrieval and generation error messages construction in the Text matching evaluator.
Model and prompt leaderboard in the HTML report/Markdown/JSon representations - result failures are shown based on generation failures (not the union of retrieval and generation failures), which ensures that failures and passes sum to 100%.
Model failure entries colors in the HTML report fixed - if the problem is in retrieval, only the context is in red. If the problem is in a generation, then the actual answer is in red.
Input field in the model failure list of the H2O Eval Studio markdown de-duplicated. Missing fields added to be on par with the HTML.
RC 61 - 2024/08/22
Enhancements:
Groundedness (semantic similarity) evaluator documentation updated.
Improved Hallucination evaluator error reporting on too long retrieved context chunks.
More robust perturbation flip direction detection.
RC 60 - 2024/08/21
Evaluators:
The new Groundedness evaluator assesses the groundedness of the generated text by considering the retrieved context - measuring hallucinations and fabricated text. It reports problems at sentence granularity in order to identify the root causes of hallucinations and fabricated text.
Enhancements:
Added infrastructure to detect the low number of evaluation examples in evaluators and report it as a problem.
Problems are newly categorized using the AVID taxonomy: https://docs.avidml.org/taxonomy/effect-sep-view/security
Fixes:
Threshold consistency between evaluator thresholds and metrics threshold defaults fixed.
Propagation of actual threshold values to the JSon leaderboard representation fixed.
Exception handling in the test lab completion on the parallel job failure fixed.
RC 59 - 2024/08/20
Enhancements:
Test lab completion progress reporting is now more detailed - it includes prompt, LLM, and RAG/LLM host names.
Fixes:
Rounding of metrics values in insights, problems and Markdown representations aligned to 4 decimal places. Percentage values are rounded to 1 decimal place.
Changes:
h2oGPTe client upgraded to version 1.5.11.
RC 58 - 2024/08/14
Fixes:
Links to explanation data in the HTML report changed from directories to files in case of H2O Eval Studio branding as (S3) directories cannot be listed in case of the H2O Eval Studio deployment.
Changes:
Rollback to the vulnerable NLTK 3.8.1 (CVE-2024-39705) Python dependency as 3.8.2 has been purged from pypi.org.
RC 57 - 2024/08/14
Enhancements:
The default h2oGPTe client timeout to get an answer from the LLM or RAG collection is now 420s (was 1000s).
Metrics values in Markdown are newly rounded to 4 decimal places.
Fixes:
Perturbed test suites are now cloned when the perturbation is not done in place.
GPTScore threshold parameter description fixed in the evaluator metadata.
Hiding H2O Eval Studio specific texts in the HTML report in case of H2O Eval Studio branding.
RC 56 - 2024/08/12
Enhancements:
Added detection of Summarization evaluator failures on all dataset rows and fail fast via raising an exception.
Added precondition check on empty evaluation results to all leaderboard types.
Evaluator metadata lookup made possible for incompatible evaluators in the HTML report.
Test lab completion no longer uses “shard” terminology, but “parallel job” instead.
The English variant of punkt from NLTK is newly cached as the model used by the evaluators.
Changes:
Updated vulnerable NLTK 3.8.1 (CVE-2024-39705) Python dependency to fixed version 3.8.2.
RC 55 - 2024/08/09
Fixes:
Minor robustness fix in the handling of extra argument passed to the h2oGPTe client.
RC 54 - 2024/08/08
Fixes:
Fixed problem detection in the Answer semantic similarity evaluator flip detection.
RAGAs evaluator fixed to declare all metrics it calculates in the metadata.
RAGAs evaluator docstring changed to announce RAGAs metrics only in the documentation.
RC 53 - 2024/08/08
Fixes:
Perturbation of a test suite using multiple perturbators no longer creates exponential number of perturbed test cases. Instead, there are original tests with their test cases and perturbed tests with their perturbed test cases. Thus the number of test cases is 2x the original number of test cases.
Changes:
Internal perturbation API of test suites, tests and test cases changed to support multiple perturbators so that the perturbations can be created in place and relationships properly set.
RC 52 - 2024/08/07
Enhancements:
Test lab completion now fails fast - raises an exception - if completion of all the test lab's test cases fails.
Evaluations, interpretations and their JSon representations have a new error field which contains the error message in case of evaluation/interpretation failure.
Changes:
h2oGPTe client upgraded to version 1.5.11-dev2.
RC 51 - 2024/08/06
Enhancements:
Missing expected answer in the test case is reported as a problem by the evaluators.
Fixes:
The HTML report generator doesn’t fail on an invalid explainer ID when getting the display name, but returns the ID with a prefix. An error message is logged.
RC 50 - 2024/08/02
Enhancements:
Progress report in the test lab completion no longer includes a full prompt, but just a prefix.
RC 49 - 2024/08/02
Enhancements:
Brief evaluator descriptions were shortened - they now contain just the first paragraph of the full description.
Evaluators check whether actual answers in test cases/suites/labs have the correct type and, if not, they generate the corresponding problems.
Air-gapped deployment support improved - 3rd party models used by the evaluators/evaluation libraries are newly frozen (where possible) to prevent model changes.
Fixes:
When attempting to complete the test lab for exactly one model in parallel, the test lab automatically switches to serial mode.
Insights about the fastest/slowest/cheapest/most expensive models are not generated for the evaluations with exactly one model.
Changes:
Progress reports generated by evaluators now start with the display names of the evaluators rather than IDs.
RC 48 - 2024/07/30
Enhancements:
Brief evaluator description added to the public API - list_evaluators() and describe_evaluator() now return it.
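A hedged usage sketch follows; list_evaluators() and describe_evaluator() are the public API functions named above, but the import path and the argument passed to describe_evaluator() are assumptions:

    from h2o_sonar import evaluate  # import path is an assumption

    evaluators = evaluate.list_evaluators()              # listing now includes the brief description
    print(evaluate.describe_evaluator(evaluators[0]))    # argument type is an assumption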
RC 47 - 2024/07/30
Features:
Added support of the Open AI RAG version 2.0 - Assistants with File Search tool.
New conditions in the Token Presence evaluator - new syntax which brings support of NOT and parentheses for complex conditions (see the illustrative condition below).
Red teaming test suite with various LLM/RAG attacks added to the repository. This test suite can be used for penetration testing of LLM/RAG models.
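An illustrative condition only - the exact condition grammar is not given in this release note, so the example below is an assumption:

    # Hypothetical Token Presence condition using the new NOT and parentheses support.
    condition = "('refund' AND 'policy') AND NOT 'credit card'"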
Enhancements:
The improved test lab API allows completing test labs of a RAG system using given (existing) collections instead of creating new ones. This API allows the user to create, configure and customize the collections, upload the corpus and documents, and then use them in the test lab completion.
Evaluator container newly detects invalid LLM dataset rows which contain RAG/LLM host error messages instead of the actual data and reports them as problems.
Evaluators now provide a brief description in addition to the full description.
Perturbators now ensure that the perturbed data are not equal to the original data and fail if the perturbation did not change the data.
Connection configuration has a new extra_params dictionary field which can be used to pass additional parameters to the connection client. For example, setting the timeout parameter on the h2oGPTe connection will apply it to all requests (that support it) made by the h2oGPTe client; see the sketch after this list.
Versions of cached/downloaded models used by evaluators - like vectara/hallucination_evaluation_model or gpt2-medium - are now frozen to avoid model changes.
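A hedged sketch of the extra_params usage referenced above; the surrounding connection fields and their values are placeholders, only the extra_params field and the timeout example come from this release note:

    # Hypothetical connection configuration expressed as a plain dictionary.
    h2ogpte_connection = {
        "server_url": "https://h2ogpte.example.com",  # placeholder
        "api_key": "<H2OGPTE_API_KEY>",               # placeholder
        "extra_params": {
            "timeout": 420,  # applied to all h2oGPTe client requests that support it
        },
    }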
Fixes:
Negative (RAG/LLM) cost of the prompt is reported as a problem by evaluators which create boolean leaderboards. The cost is also set to 0.0 in the evaluation results to minimize the impact of the cost on the evaluation.
Changes:
h2oGPTe client upgraded to version 1.5.8.
Perturbation probability intensity increased in the Qwerty and Antonym perturbators to ensure sufficient perturbation of the data.
Security:
setuptools upgraded to 70.0.0 to fix vulnerability CVE-2024-6345.
OpenAI RAG version 2.0 support brings an upgrade of the openai Python library from version 1.20.0 to version 1.35.13, which fixes LangChain community vulnerability CVE-2024-2965.
RC 46 - 2024/06/28
Evaluators:
Four new GPTScore-based evaluators: evaluation of summarizations with reference summaries, evaluation of summarizations without reference summaries, evaluation of machine translations, and evaluation of question answering.
Features:
Evaluation / interpretation API can list all and incompatible evaluators / explainers.
Enhancements:
Evaluators assessing boolean metrics, such as token presence or PII leakage, now have the ability to use custom metric names and descriptions to make reports and evaluation data more comprehensive.
Evaluators now have keywords indicating whether they require an LLM judge, prompt, expected answer, actual answer, retrieved context or constraints.
Significantly improved descriptions of all evaluators - descriptions are mostly generated from the evaluator class metadata.
Problems are newly sorted by severity (from highest to lowest).
Insights are sorted by type (alphabetically).
All and incompatible evaluators/explainers are now shown in the evaluation report.
Fixes:
Missing threshold added to parametrizable BYOP evaluator.
Breaking changes:
Evaluator keyword sr-11-7-ongoing-analysis has been fixed to the correct sr-11-7-ongoing-monitoring keyword.
Documentation:
reStructuredText documentation of the evaluators rewritten - every evaluator has brief description, requirements, evaluation method, evaluation metrics, insights, and problems sections.
RC 45 - 2024/06/25
Enhancements:
New Random character type perturbator.
Fixes:
Integrity checks and validation of the model configuration (like embeddings, tokenization, temperature, token limits) used to build the test lab.
Changes:
Interpretation/evaluation is marked as successful if at least one evaluator successfully finishes.
h2oGPTe client upgraded to version 1.5.1-dev7.
Python 3.11 dependencies upgraded: cryptography to version 42.0.8, scikit-learn to version 1.5.0, and toml to version 0.10.2.
RC 44 - 2024/06/14
Enhancements:
h2oGPTe client upgraded to version 1.5.0-dev21 to support the upcoming H2O Enterprise h2oGPTe release.
Colorized evaluation status added to the HTML report.
Crash of an evaluator is now reported as a high severity problem and causes the evaluation to be marked as failed. However, the evaluation continues with the other evaluators.
An attempt to run a non-registered evaluator is now reported as a high severity problem and causes the evaluation to be marked as failed. However, the evaluation continues with the other evaluators.
Improved measurements of the LLM latency in the GenAI client.
Fixes:
Fixed duplicate prompts in the model weak points (the most difficult prompts) section of the HTML report.
RC 43 - 2024/06/11
Features:
Ability to configure h2oGPTe, h2oGPT, H2O LLMOps, ollama, OpenAI chat, OpenAI RAG, and Microsoft Azure hosted OpenAI clients to control the evaluation of LLM models (for instance temperature) and RAG systems (for instance embeddings provider, system prompt or prompt template).
Enhancements:
All perturbators are now deterministic for improved robustness and testability (except the synonym and antonym perturbators, which are deterministic in testing only).
Synonym and antonym perturbators improved with an eager synonym/antonym swap which tries to match the percentage of words swapped (prior to the fix, perturbators tried only x times, and if the new synonym/antonym was the same word, nothing was swapped).
Fixes:
Fixed all perturbators for issues with special tokens in de/tokenization, such as undesired spaces around expressions in parentheses after detokenization.
Security:
Upgraded the scikit-learn library to version 1.5.0 to solve vulnerabilities detected by SNYK.
Upgraded the cryptography library to version 42.0.8 to mitigate vulnerabilities detected by SNYK.
Documentation:
reStructuredText documentation of the evaluation and new features (host configuration) with configuration prototype examples.
RC 42 - 2024/05/31
Enhancements:
Keyword groups for grouping of keywords which are used to tag evaluators.
H2O Eval Studio purpose keyword group which organizes evaluators into disjoint sets.
RC 41 - 2024/05/30
Evaluators:
New perplexity evaluator for LLMs which calculates the perplexity - “measure of uncertainty” - of the generated text.
Enhancements:
Safe JSon data decoder for NaN and infinities.
H2O Eval Studio can be configured whether to use GPU or CPU for the evaluation.
Fixes:
HTML report generation fixed in case that evaluation of all rows in the dataset fails.
RC 40 - 2024/05/29
Evaluators:
New summary evaluator provides completeness and faithfulness metrics for LLM summarization tasks evaluation without the need for a reference summary.
Features:
Insights - new feature allowing explainers and explanations to provide insights into the evaluation results and suggest actions to be taken.
Enhancements:
Evaluation JSon and HTML result includes overall evaluation result represented as one value which is based on the severity of the problems detected in the evaluation. It is represented as traffic light colors (green, yellow, red) in the HTML report.
All evaluators report insights about the evaluation results and suggest actions to be taken via insight enhancements in bool, heatmap and classification leaderboards explanations.
Text matching, PII and Sensitive data leakage evaluators report, apart from problems and accuracy-related insights, also insights about the cost and performance (speed) of evaluated models.
Models section in the HTML report rewritten to contain model details, insights, and problems.
Example PIIs (emails, credit cards, SSNs) in the PII evaluator are no longer reported as problems. These false positives are now marked as False in the evaluation results.
Test lab statistics.
Fixes:
Hallucination evaluator fixed to correctly handle low values as hallucinations (not vice versa).
Changes:
Bool leaderboard JSon representation values (and metrics metadata) changed from percentages to the [0.0, 1.0] float range.
RC 39 - 2024/05/06
Enhancements:
ragas library upgraded to version 0.1.7.
Fixes:
Added on-demand caching of tiktoken's BLOBs which are used by the ragas library.
Fixed Faithfulness evaluator and RAGAs evaluator flakiness (NaN) via the ragas library upgrade.
RC 38 - 2024/05/03
Features:
ollama (https://ollama.com/) hosted LLMs support - new connection, client and test lab builder.
Enhancements:
All evaluators detect metric flips and report them in the evaluation results as problems. In case of boolean metrics, a flip is detected as a change from True to False and vice versa. In case of numeric metrics, a flip is detected as a change from above to below the threshold and vice versa. In case of classification, a flip is detected as a change from correct to incorrect classification and vice versa.
Changes:
Introducing relationships among test cases, which adds a new relationships key to test case, test suite and test lab, as well as a relationships column to LLM dataset and LLM evaluation result. JSon representations (key) and CSV representations are extended (column). Old JSon files are deserialized in a loosely coupled way to avoid backward compatibility breaking changes.
Added key field to inputs in the test lab.
Added key field/column to LLM dataset inputs (rows).
Added key field/column to evaluation result inputs (rows).
Fixes:
Fixed undesired retries in the RAG/LLM test lab completion of h2oGPTe LLM and H2O LLMOps hosts in case of the successful completion of the test cases.
Fixed NaN (not a number) handling in leaderboard palette color lookup.
RC 37 - 2024/04/25
Evaluators:
New Classification evaluator for RAGs/LLMs used for classification problems. The evaluator calculates common metrics used for binomial and multinomial classification problems like accuracy, precision, recall and F1. The Classification evaluator also brings a new classification leaderboard explanation.
Features:
New perturbations module with the ability to perturb the input data (5 perturbation methods) in order to test the robustness of the RAGs/LLMs and the quality of the data: comma, word swap, QWERTY, synonym and antonym.
New public perturbations API with list, filtering and (multiple) perturbation methods application to string, test case, test suite or LLM dataset prompts.
3 new summarization tests for evaluation of summaries both with and without reference summary (Frank, SamSum and SummEval).
Enhancements:
Format specifier in evaluation metrics metadata changed from Python f-strings to JavaScript D3 format strings.
Fixes:
Ranges in evaluation metrics metadata fixed - [0, 1] vs. [0, 100].
Testing:
RAG/LLM test suite can finish successfully even if OpenAI API key is not set (auto reconfiguration to 3rd party judges; tests which use OpenAI endpoints are skipped).
RC 36 - 2024/04/18
Fixes:
OpenAI client fixed to version 1.20.0 to keep version 1 API compatibility (OpenAI Assistants code in H2O Eval Studio must be rewritten to version 2 to move from retrieval tool to file search).
RC 35 - 2024/04/18
Features:
New metrics metadata - all evaluators now declare the metrics they calculate with the metadata (name, description, type, unit, range, scale, …). Metrics metadata are used in the evaluator (descriptor, evaluation, results), in the leaderboards (JSon representation, HTML report generation), and explanation/evaluation formats (JSon, HTML, Markdown).
Loosely coupled serialization and deserialization of object/JSon data structures: ExplainerDescriptor, ExplanationDescriptor, ConfigItem and FilterEntry.
Caching of the models used (internally) by evaluators and explainers: public API, caching module, and caching configuration enabling air-gapped evaluator deployment.
Backward compatibility breaking changes:
data key added to heatmap and bool leaderboards JSon representations.
RC 34 - 2024/04/12
Fixes:
NaN (not a number) handling/encoding in the heatmap leaderboard JSon “all metrics” data file.
RC 33 - 2024/04/12
Features:
Microsoft Azure hosted OpenAI LLMs support - new connection, client and test lab builder.
H2O LLMOps hosted LLMs support - new connection, client and test lab builder.
Security:
HTTPS requests SSL certificate verification configuration: H2O Sonar configuration controls the SSL certificate verification process/level in requests library, LLM hosts client libraries and other HTTP(S) clients.
Changes:
H2O GPT client rewritten to OpenAI API client (please update server port and base URL).
H2O LLMOps client rewritten to OpenAI API client (no configuration changes needed).
Base URL parameter removed from OpenAI API client constructor (connection configuration is used).
RC 32 - 2024/04/10
Enhancements:
Constants for keys in the datasets.py Python module.
Documentation:
BLEU and ROUGE evaluators .rst documentation.
RC 31 - 2024/04/08
Evaluators:
BLEU evaluator.
ROUGE evaluator.
Enhancements:
New keywords for the most important ML problem types solved by RAGs/LLMs: question answering, information retrieval, summarization, classification (binomial and multinomial) and regression. All evaluators were decorated with relevant keywords.
New keyword for the referential user role: regulator.
Fixes:
NaN (not a number) handling in the evaluator results, formats and leaderboard.
Security:
nltk added as an evaluators Python extras dependency.
rouge-score added as an evaluators Python extras dependency.
punkt is a newly cached NLTK model for text-to-sentence tokenization.
RC 30 - 2024/03/27
Enhancements:
Toxicity evaluator reimplemented to directly use the toxicity library and show several metrics which explain what type of toxic content has been detected in the answer.
Fairness bias evaluator reimplemented to directly use a bias detection model (in ONNX format) for the evaluation.
Hallucination evaluator reimplemented to use an LLM judge for hallucination detection.
Security:
deepeval Python dependency removed. Evaluators based on deepeval were rewritten to use the underlying libraries without relying on deepeval.
TensorFlow and DBias Python dependencies removed. The Fairness Bias evaluator no longer relies on the DBias Python library as the underlying model was ported from TensorFlow to ONNX.
HMLI moved from the core H2O Eval Studio dependencies to the explainers package extras in order to avoid the CVE vulnerabilities which must be fixed for H2O Eval Studio cloud deployment certification.
H2O-3 moved from the core H2O Eval Studio dependencies to the explainers package extras in order to avoid the CVE vulnerabilities which must be fixed for H2O Eval Studio cloud deployment certification.
RC 29 - 2024/03/21
Security:
HMLI upgraded to MLI version 1.10.26 to mitigate CVE-2023-39013 (HMLI’s Duke dependency vulnerability).
RC 28 - 2024/03/17
Features:
Bring Your Own Judge (BYOJ) - ability to configure H2O Eval Studio so that evaluators use custom LLM judges, for instance to ensure privacy and avoid sending sensitive data to a 3rd party. This feature includes reconfiguration of the embeddings provider for the same reasons. Custom judges can be either forced from the H2O Eval Studio configuration or specified in the evaluator parameters.
Bring Your Own Prompt (BYOP) - ability to easily run an evaluation just by providing a prompt template, or implement a new evaluator just by inheriting from the BYOP abstract class and specifying a prompt which returns a boolean value.
OpenAI LLM client (only Assistants with retrieval tool was supported before). The client supports both OpenAI service (no base URL specified) and OpenAI compatible endpoints (base URL specified).
Evaluators:
Contact Information evaluator (BYOP).
Language Mismatch evaluator (BYOP).
Parametrizable BYOP evaluator with the ability to specify the prompt template in the evaluator parameters.
Sexism evaluator (BYOP).
Stereotype evaluator which detects undesired gender/race content in the answer (BYOP).
Summarization evaluator (BYOP).
Enhancements:
ragas library upgraded to version 0.1.3.
RC 27 - 2024/03/14
Security:
Fairness Bias evaluator removed as it used the dbias library which depends on a vulnerable TensorFlow version. This change ensures there is no TensorFlow, un-registers the evaluator and skips all evaluator tests (code is kept in the codebase).
Enhancements:
Problems are now loaded when loading the evaluation from its JSon representation.
QA:
MMC builds disabled (it was extra cost in addition to GH Actions build; MMC has old Python version)
RC 26 - 2024/03/08
Features and enhancements:
progress reporting:
end to end, evaluation, all evaluators, lab (build and completion)
callback or file-system
evaluators can be filtered by labels for:
SR 11-7
NIST AI MRM
HTML report refactored
sections shuffled by importance
new evaluation (details) group added
dataset section content reordering
explanation and title added to insight leaderboards
Markdown representation robustness
input/output escaping
LLM vs. RAG failures listing fixed
ES summary .md redesigned
improved text matching regexp error messages and docstring (ES UI)
improved .rst documentation
evaluator parameters refactored from standalone file to i/e.json
Changes:
evaluation result is stored on the file system (no longer discarded)
Fixed:
3x faster lab completion (fixed duplicate requests)
hangs/deadlocks in the lab completion (configurable multiprocessing)
QA:
new GH Actions test suite in GREEN since 09117c3ce891e410e68e62361d109f179ed4c79f
GHA builds and test H2O EvalStudio deployment runtime configuration only
improved h2oGPT/h2oGPTe test server selection (config switch)
method to purge h2oGPTe relics
new h2oGPT servers
RC 25 - 2024/02/05
Evaluators:
Fairness bias evaluator (deepeval based).
RC 24 - 2024/02/02
Evaluators:
Toxicity evaluator (deepeval based).
Fixed:
PII and sensitive data leakages (regexps).
RC 23 - 2024/01/26
Features:
LLM/RAG clients telemetry.
Prompt cache: LLM/RAG responses can be cached when building a test lab. The cache can be built from an existing test lab and used in RD only mode.
Enhancements:
LLM/RAG client retries (3 by default).
Evaluators which require OpenAI key are tagged using keywords.
…
RC 22 - 2024/01/22
Enhancements:
Changed h2oGPTe dependency to the latest Python package version.
Fixes:
Hiding retrieval errors in the bool leaderboard.
Evaluation tests:
Removal of constraint OR expressions from test suites/labs for the Atlanta event as H2O Eval Studio does not support them yet.
RC 21 - 2024/01/21
Fixes:
Fixed RAGAs leaderboard calculation.
Retrieved context builder enhancements.
Tests:
OpenAI end to end CI test which runs all evaluators.
Evaluation tests:
Polished, fixed (duplicate prompts) and extended SR 11-7 and Bank teller test suites.
RC 20 - 2024/01/18
Evaluators:
Sensitive data leakage evaluator.
Enhancements:
Test lab build fallbacks: dummy doc for RAG.
Fixes:
OpenAI test lab build (missing arguments).
Evaluation tests:
SR 11-7 test suite w/ 171 prompts.
RC 19 - 2024/01/18
Evaluators:
PII evaluator.
Fixes:
Asynchronous interpretation execution fixed (inconsistent method signatures).
Changes:
datatable upgraded from the AWS S3 hosted version to the 1.1.0 pypi.org hosted version.
v1.2.0 — 2023/10/31
Talk to H2O Eval Studio report - upload your interpretation report to the Enterprise h2oGPT in order to find out more about your model, data, problems, insights and suggested (mitigation) actions.
Added
Features
Ability to upload your interpretation report to Enterprise h2oGPT either using the Python API (run_interpretation() method parameter, upload_interpretation() method) or the CLI. The feature is supported with Python 3.10 and Python 3.11 only.
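A hedged sketch of the Python API route; run_interpretation() and upload_interpretation() are the methods named above, but the import path, argument names, and the name of the upload-triggering parameter are assumptions:

    from h2o_sonar import interpret  # import path is an assumption

    def interpret_and_upload(dataset_path, model, workdir="./results"):
        # Run an interpretation first; the exact parameters are assumptions.
        interpretation = interpret.run_interpretation(
            dataset=dataset_path,
            model=model,
            results_location=workdir,
        )
        # Then upload the existing interpretation report to Enterprise h2oGPT;
        # the argument accepted by upload_interpretation() is an assumption.
        interpret.upload_interpretation(interpretation)
        return interpretation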
Documentation
H2O.ai documentation theme.
Fixed
Wheels are no longer built with the legacy pip resolver which was causing dependency conflicts in some cases on certain platforms.
Test and validation dataset details are now shown in the HTML report.
Opened port / Driverless AI server port check is no longer verbose.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v1.1.2 — 2023/10/13
A patch release bringing minor fixes and enhancements.
Added
No additions.
Fixed
SHAP library version pinned to shap>=0.40.0,<=0.42.5 as the new version is causing instability in feature importance explainers.
Changed
H2O Model Validation upgraded to 0.16.3 with updated h2osteam and H2O MLOps clients which avoid version clashes in upcoming H2O.ai Cloud notebook kernels.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v1.1.1 — 2023/10/9
A patch release bringing minor fixes and enhancements.
Added
Both CLI and Python API accept library configuration and encryption key parameters in case the interpretation arguments are provided as JSon.
Fixed
HTML interpretation report path in the CLI output fixed (it was pointing to the interpretation HTML index).
False positive feature importance leak detection is no longer reported in case of multinomial problems.
Morris Sensitivity Analysis no longer fails in case of non-numeric boolean columns presence in the training dataset.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v1.1.0 — 2023/10/03
Integration of H2O Eval Studio and H2O Model Validation projects.
Added
New explainers
Adversarial Similarity explainer.
Backtesting explainer.
Drift Detection explainer (reports exceeded PSI threshold as a problem).
Size Dependency explainer.
Segment Performance explainer.
Calibration Score explainer.
Features
H2O Model Validation based explainers are able to use H2O AIEM hosted Driverless AI, H2O Enterprise Steam hosted Driverless AI or any H2O Driverless AI which uses username/password authentication.
Ability of H2O Eval Studio to run with or without H2O Model Validation library installed. If H2O Model Validation is not available, then H2O Model Validation based explainers just indicate incompatibility and do not cause the interpretation to fail.
Portable export and import of MVTest related instances like settings, results, artifacts, and logs. The implementation is based on JSon, CSV, and a directory hierarchy; therefore it can be used by a wide range of tools, programming languages, and runtimes.
RemoteHandles bring support for remote (Driverless AI) datasets and models. Apart from being a data structure, it is a part of explainer metadata and compatibility checks.
A model is no longer required when running a new interpretation, which allows running explainers on datasets only.
Automatic fallback guess of the model metadata - like problem type, labels, and used features - in case the model does not provide them.
Enhancements
Attributes (dictionary) added to the ProblemAndAction class, which enables explainers to pass machine-processable data from problems to actions for further actionability.
Connections and licenses are now identified by unique keys (identifiers) in the H2O Eval Studio configuration and throughout the runtime.
Python 3.10 support.
Python 3.11 support - H2O Model Validation explainers not available as transitive library dependencies do not support Python 3.11.
daimojo library pre-heat prediction to activate MOJO model introspection.
Interpretations index HTML path added to the CLI interpretation output.
Completion of the testset and validset handling implementation in the explainer container - datasets are now passed to explainers along with their metadata.
The following configuration keys were added to the H2O Eval Studio library configuration: server_id, environment_url, token_use_type.
Shapley Values for Original Features (Kernel SHAP Method) explainer is approximately 3x faster in case of multinomial problems (the speed up is proportional to the number of classes - more classes, more speed up).
Utilities
Shapley contributions sorter which can be used by all Shapley-based explainers whenever multi-class contributions are reported within the same frame - makes the code cleaner and simpler.
Documentation
Library configuration CLI API reStructuredText documentation.
Jupyter Notebook with examples of how to run H2O Model Validation explainers using the Python API and CLI.
reStructuredText documentation of all new H2O Model Validation based explainers.
New explainers overview table with per-explainer features and requirements added to both README.md and the reStructuredText documentation.
Explainers overview diagram is newly organized according to the functional architecture of explainers.
Tests
Python and CLI tests of all H2O Model Validation explainers.
Fixed
Shapley Values for Original Features (Kernel SHAP Method) explainer reports per-class contributions in the case of multinomial problems (contributions were mixed together).
Morris Sensitivity Analysis explainer fixed to work with InterpretML 0.1.20.
Pseudocode and Python code generated by the Decision Tree explainer is consistent again.
HTML report fixed to properly handle if no explainer is run within the interpretation.
Thread safe interpretation executor shutdown.
Changed
The following configuration key was changed in the H2O Eval Studio library configuration: client_refresh_token has been renamed to token.
Deprecated
No deprecations.
Removed
Test suites which were replaced by Pytest markers.
Tests of legacy Driverless AI models (Makefile targets, S3 archives).
Security
No security enhancements.
v1.0.0 — 2023/6/30
The first stable H2O Eval Studio release.
Added
Enhancements
Multiple sampling methods for the explainer dataset (stratified, random, head).
Configurable out-of-memory (OOM) protection.
Improved ability of the interpretable model to extract scikit-learn model metadata.
Utilities
Random attack utility that tests H2O Eval Studio on many datasets and models: it gets a directory with datasets as a parameter, trains a scikit-learn model for a random dataset and one of its columns, and finally runs all the explainers to test H2O Eval Studio (see the sketch below).
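An illustrative sketch of the random-attack idea described above, not the utility's actual code or API; run_all_explainers() is a hypothetical placeholder, and the sketch is classification-only for brevity.

```python
# Illustrative sketch of the random-attack idea: pick a random dataset and
# target column, train a throwaway scikit-learn model, run the explainers.
import random
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def run_all_explainers(model, dataset, target_col):
    """Hypothetical placeholder for the H2O Eval Studio interpretation run."""
    ...


def random_attack(dataset_dir: str) -> None:
    datasets = list(Path(dataset_dir).glob("*.csv"))
    dataset_path = random.choice(datasets)         # pick a random dataset
    df = pd.read_csv(dataset_path)
    target = random.choice(list(df.columns))       # pick a random target column
    X = pd.get_dummies(df.drop(columns=[target]))  # naive numeric encoding
    model = RandomForestClassifier().fit(X, df[target])
    run_all_explainers(model=model, dataset=df, target_col=target)
```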
Documentation
Explainers overview diagram indicates whether the explainer reports problem(s).
Configuration management documentation (including encryption).
Per-explainer problem reporting capabilities documentation.
Fixed
Summary Shapley explainer and Original feature importance explainer fixed to properly use the SHAP library to get Shapley values for regression vs. multinomial problems (experiment type detection).
Disparate Impact Analysis calculation fixes (comparisons in metrics) in the case of string features.
Decision tree Python code and pseudo-code generator fixed.
HTML report fixed to properly display explanations type and format(s).
Division by zero fixed in the progress reporting runtime.
Changed
CLI, JSON, and Python parameter names were unified - this change breaks backward compatibility and was intentionally done before the first stable release.
Deprecated
No deprecations.
Removed
No removals.
Security
Added encryption of sensitive fields in the H2O Eval Studio configuration (config, CLI, documentation).
v0.11.2 — 2023/7/26
Added
Enhancements
No enhancements.
Fixed
Fixed the .py/pseudocode generated by the Decision Tree explainer: > vs. >= comparison.
Changed
Upgrade MLI jar to 1.10.23
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.11.1 — 2023/5/22
Handle missing value bins for PD when OOR is enabled and output histogram data to PD results.
Added
Enhancements
Output previously missing histogram data to PD results.
Fixed
Correctly handle missing value bins for PD when OOR is enabled.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.11.0 — 2023/4/24
Leak detection added to feature importance explainers.
Added
Enhancements
Leak detection added to feature importance explainers: Shapley Values for Original Features (naive method) explainer, Morris Sensitivity Analysis explainer, Shapley Values for Original Features (Kernel SHAP method) explainer.
Missing values are treated as a separate bin in the PD explainer.
H2O Eval Studio CLI can read arguments from a JSON file (see the sketch after this list).
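A hedged sketch of what an argument file for the CLI could look like; the parameter names, file name, and the CLI option used to point at the file are assumptions and may differ in your H2O Eval Studio version.

```python
# Illustrative only: write interpretation arguments into a JSON file that the
# CLI can consume. All names below are hypothetical.
import json

cli_args = {
    "dataset": "data/train.csv",        # hypothetical parameter name
    "model": "models/model.pickle",     # hypothetical parameter name
    "target_col": "label",              # hypothetical parameter name
}

with open("interpretation-args.json", "w") as fh:
    json.dump(cli_args, fh, indent=2)
```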
Fixed
Fixed display of plots in Jupyter notebooks.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.10.1 — 2023/3/9
Patch release bringing Result (documentation) enhancements.
Added
No new features.
Fixed
No fixes.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.10.0 — 2023/2/2
New Dataset and Model Insights explainer and fixes of bugs found by a new random attack.
Added
Explainers
New Dataset and Model Insights explainer.
Enhancements
Residual Decision Tree explainer newly highlights the whole path to the highest residual in the visualized tree.
DIA result API help related to the reference level improved.
Fixed
Surrogate Decision Tree Python code generator fixed: added missing parentheses in boolean expressions; features can have any characters in their names.
Moved from os.rename to shutil.move in order to ensure that the operation will not fail if the source and target are on different file systems (see the sketch after this list).
Missing isna symbol used in the Disparate Impact Analysis explainer.
Comparison of strings and bools in the ICE method.
Float division by zero in the Residual Decision Tree explainer.
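The reason for preferring shutil.move over os.rename, shown with standard-library behavior only; the paths are illustrative.

```python
# os.rename() can fail with OSError (e.g. EXDEV, "Invalid cross-device link")
# when source and target live on different file systems; shutil.move() falls
# back to copy-and-delete, so it works across file system boundaries.
import os
import shutil

src = "/tmp/results/archive.zip"           # illustrative path
dst = "/mnt/shared/results/archive.zip"    # may be on a different file system

try:
    os.rename(src, dst)
except OSError:
    shutil.move(src, dst)
```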
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.9.0 — 2023/1/13
Minor H2O Eval Studio release which brings asynchronous interpretation execution.
Added
Features
New option that allows interpretations to be run asynchronously.
Enhancements
New introspection API for Result classes (method parameters).
Fixed
Square root of MSE is now used to get RMSE in the Surrogate Decision Tree explainer (see the sketch after this list).
Handling of date, time, and datetime features in the PD/ICE explainer.
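For reference, the corrected relationship between MSE and RMSE; a minimal sketch using scikit-learn's metric, not the explainer's internal code, with illustrative values.

```python
# RMSE is the square root of MSE.
from math import sqrt

from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]   # illustrative values
y_pred = [2.5, 5.0, 4.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
rmse = sqrt(mse)
```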
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.8.0 — 2022/12/8
New Partial Dependence for 2 Features explainer and enhancements for H2O Sonar explainer container implementation for Driverless AI.
Added
Explainers
New Partial Dependence for 2 Features explainer.
Features
New Global 3D Data result, explanation and associated formats (JSON, CSV).
Enhancements
Command-line interface with pretty-printed listing of explainers, improved formatting of explainer descriptions, and an H2O Eval Studio version show action.
Residual PD/ICE for multinomial problems added.
Improved explainer container resolution and creation (identifier, instance).
Model agnostic API to indicate the ability to provide/calculate Shapley values added.
Improved compatibility checks and new compatibility error type.
Explainable model’s features metadata simplification, completion and consolidation.
Explainable dataset’s metadata simplification, completion and consolidation.
Improved HTML report highlights failed explainers, brings a comprehensive overview section, and shows new model and dataset metadata fields.
Documentation
Added Jupyter Notebook documentation of how to run H2O Eval Studio in the Internal H2O.ai Cloud.
Fixed
Disparate Impact Analysis explanations completed to be 100% binary compatible with Driverless AI’s Grammar of MLI (entities).
Disparate Impact Analysis explainer feature resolution for DIA calculation rewritten.
Disparate Impact Analysis explainer and PD/ICE explainer fixed to work on a dataset with string (target) column(s).
Residual PD/ICE no longer returns the regular PD/ICE as the default representation (with residuals as an extension); it now returns the residual PD/ICE.
Residual PD/ICE HTML fragment representation path to images fixed so that it no longer renders the same charts for all classes.
Summary Shapley explainer name correctly indicates the SHAP method (not the incorrect naive Shapley method).
Changed
Features metadata class of the explainable model has been refactored to the h2o_sonar.methods.core.method module and all constant references consolidated to this class.
The operating system used to build the Linux distribution and wheels has been changed from Ubuntu 20.04 to Ubuntu 18.04 to ensure that H2O Eval Studio wheels work both on this and newer Ubuntu versions.
Deprecated
No deprecations.
Removed
No removals.
Security
MLI upgrade to 1.10.21 to mitigate CVE-2022-2048 and CVE-2022-25647.
v0.7.0 — 2022/10/18
H2O Eval Studio beta release with Bring Your Own Explainer based extensibility, reporting of model problems, a new Residual PD/ICE explainer, a new Morris Sensitivity Analysis explainer, and various smaller enhancements.
Added
Features
BYOE - Bring Your Own Explainer.
Model problems and actions.
Explainers
New Residual Partial Dependence/Individual Conditional Expectations explainer.
New Morris Sensitivity Analysis explainer.
Residual Decision Tree explainer reports problems and actions.
Explanations
New interpretation report - structure, content, and theme in H2O.ai colors.
Organization of explainers to functional groups.
Utilities
Improved label encoder to simplify the use of 3rd party libraries that require numeric (non-categorical) features. Label encoder is integrated into both explainable dataset and explainable model APIs.
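Conceptually, a label encoder maps categorical values to integers so that libraries requiring numeric features can be used. The sketch below uses scikit-learn's encoder only as an analogy for the idea; it is not H2O Eval Studio's own implementation.

```python
# Conceptual illustration of label encoding (scikit-learn used as an analogy).
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green", "red"]   # categorical feature values

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)             # e.g. [2, 1, 0, 1, 2]
decoded = encoder.inverse_transform(encoded)        # back to the original labels
```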
Command-line interface
All Python API’s interpretation parameters are newly available on CLI.
Documentation
Added Getting started with BYOE.
Fixed
HTML report paths to images and explanations are relative and valid regardless of the results directory location.
Explainer container runtime and explainers stabilized to work on raw (non-sanitized) datasets.
Explainers listing action help fixed on the command line interface.
Changed
The list_explainers() method on both the Python API and CLI lists all explainers by default (before this change it listed only basic explainers with run-by-default keywords).
Logging consolidated to the single module h2o_sonar.loggers and loggers renamed/refactored so that logging can be used both in methods and explainers.
Migration of the explainer container runtime from the HMLI to the h2o wheel dependency.
Parameter path of the zip() method used by the explainer's Result class has been changed to file_path to make it consistent with other Result parameters.
Result classes refactored from explainer implementations into consolidated and reusable result classes for the main supported explanation types.
The summary() method's functionality is moved to params() and the new summary() method returns the summary of the explanation (content of result_descriptor.json); see the sketch after this list.
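A minimal caller-side sketch of the Result API changes above. Only the method and parameter names (zip(file_path=...), params(), summary()) come from this change log; the dummy class and its return values are assumptions used to make the example self-contained.

```python
# Illustrative only: a dummy stand-in for an explainer result object.
class DummyResult:
    def zip(self, file_path: str) -> None:   # 'path' parameter renamed to 'file_path'
        ...

    def params(self) -> dict:                # former summary() functionality
        return {"explainer": "...", "parameters": {}}

    def summary(self) -> dict:               # content of result_descriptor.json
        return {"explanation_types": ["..."]}


result = DummyResult()
result.zip(file_path="explanation.zip")
run_parameters = result.params()
explanation_summary = result.summary()
```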
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.6.0 — 2022/9/8
New Friedman’s H-statistic and Residual Surrogate Decision Tree explainers, Driverless AI REST interface model support and improved HTML interpretation representation.
Added
Explainers
Friedman’s H-statistic explainer for feature behavior explanations.
Residual Surrogate Decision Tree for model debugging (new default explainer).
Model support
Added Driverless AI REST interface model support.
Explanations
Significantly improved HTML interpretation representation with new explanation charts for every explainer, interpretation parameters, and explainer parameters.
Command line interface
Added parameter to run all explainers (not just basic explainers).
Interpretation listing including HTML representation.
Documentation
Bring Your Own Explainer templates and examples added to distributions.
Fixed
Improved scikit-learn multinomial models support with labels lookup.
Compatibility check function gets all available parameters for more advanced checks.
DIA HTML fragment representation path to images.
In-memory persistence store (keys) stabilization.
Logging names and interpretation and explainer logging keys consistency.
Changed
hmli and daimojo dependencies updated.
Source distribution - tarball - build changed so that it doesn't contain .whl files.
Binary distributions are built for every supported platform.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.5.0 — 2022/8/16
Fix release which brings binary distribution with improved documentation and Jupyter Notebook examples.
Added
Documentation
Improved ReStructuredText documentation with getting started, library documentation (interpretation, configuration, explainers), licenses, and change log.
New and improved Jupyter Notebook examples.
Model support
Added pickled (scikit-learn) model interpretability.
Command line interface
Added parameters to specify features used by the model and per-explainer parameters.
Fixed
Summary Shapley explainer stabilization: scatter plot feature values fixed, main chart includes all features, regression/binomial/multinomial labels fixed, max_features parameter honored, and per-class multinomial explanations are generated in all supported formats.
Fixed the simple mock model prediction function and added SHAP method support for mock models.
Changed
Models and datasets - used by examples, demos and tests - consolidated and refactored to indicate dataset and model type.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.4.2 — 2022/11/29
Fix of the following MLI Java backend security issues: CVE-2022-2048 and CVE-2022-25647.
Added
Fixed
Changed
Deprecated
No deprecations.
Removed
No removals.
Security
MLI upgrade to 1.10.17.2 to mitigate CVE-2022-2048 and CVE-2022-25647.
v0.4.1 — 2022/11/17
Fix of the following MLI Java backend security issues: CVE-2022-2048 and CVE-2022-25647.
Added
Fixed
Changed
Deprecated
No deprecations.
Removed
No removals.
Security
MLI upgrade to 1.10.17.1 to mitigate CVE-2022-2048 and CVE-2022-25647.
v0.4.0 — 2022/6/29
New Transformed Feature Importance explainer for Driverless AI MOJO models and preparation for H2O Eval Studio integration to Driverless AI.
Added
Explainers
Transformed Feature Importance explainer for Driverless AI MOJO models.
Explainer container API and CLI
H2O Eval Studio version available in runtime.
Documentation
Jupyter Notebook with interpretation result API for the new explainer.
H2O Eval Studio explainers overview diagram updated.
Fixed
Changed
Core H2O Eval Studio dependencies updated to be aligned with Driverless AI 1.10.4; two separate builds will be available going forward - regular and Driverless AI.
Deprecated
No deprecations.
Removed
No removals.
Security
MLI upgrade to 0.10.17 to mitigate CVE-2022-25647.
v0.3.0 — 2022/6/22
New Kernel SHAP feature Importance explainer.
Added
Explainers
Kernel SHAP Feature Importance explainer for all supported interpretable models.
Explainer container API and CLI
H2O-3 is automatically started (or reused) based on the H2O-3 configuration.
CLI rewrite to provide more accurate help, error reporting and robust execution.
Documentation
Jupyter Notebook with interpretation result API for the new explainer.
Fixed
Interpretation HTML representation links are no longer broken when a relative path is used.
Explainers’ summary method returns the correct (non-empty) parameters of the explainer run.
Disparate Impact Analysis explainer core dump on invalid target column specification.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.2.0 — 2022/6/3
New Feature Importance explainer for Driverless AI MOJO models.
Added
Explainers
Naive Shapley Feature Importance explainer for Driverless AI MOJO models.
Explainer container API and CLI
List-explainers action to get available explainer IDs or descriptors.
Documentation
Jupyter Notebook with interpretation result API for the new explainer.
Fixed
CLI: log level specification case insensitivity.
macOS: Driverless AI MOJO import made local.
Changed
No changes.
Deprecated
No deprecations.
Removed
No removals.
Security
No security fixes.
v0.1.0 — 2022/5/27
Initial H2O Eval Studio internal MVP release.
Added
Explainers
Partial dependence/Individual Conditional Expectations explainer (PD/ICE)
Shapley summary plot explainer
Decision tree explainer
Disparate Impact Analysis explainer (DIA)
Explainer container with public explainer APIs
Interpretation, model, dataset, explainer and persistence API.
Explainer container (runtime).
File-system and in-memory persistence.
Easy to use API for retrieval of explainer results.
Model vendor support
Scikit-learn models.
H2O-3 models.
Driverless AI MOJO models.
Command line interface
CLI support of MOJO and pickled models interpretations.
Documentation
Per-explainer Jupyter Notebook with interpretation result API.
Installation, Getting Started and Reference Guide (Sphinx/HTML).
Fixed
No fixes (initial release).
Changed
No changes (initial release).
Deprecated
No deprecations (initial release).
Removed
No removals (initial release).
Security
No security fixes (initial release).