Test Case, Suite, Lab and LLM Dataset
H2O Eval Studio can evaluate standalone LLMs (Large Language Models) and LLMs used by RAG (Retrieval-augmented generation) applications using the evaluation data shipped with the product and/or custom evaluation data provided by the user.
Evaluation data can be provided in the form of test cases, tests, test suites, and test labs. An LLM dataset - a normalized test lab - is used as the input to the evaluation process (a sketch of how these artifacts nest follows the terminology list below).
Terminology:
- Raw Test Data
A dataset - provided by the user, exported from a system, or generated - from which the evaluation data can be created.
- Test Case
A single evaluation unit - it consists of a prompt, ground truth, conditions, and other parameters that are used for the evaluation.
- Test
A set of test cases linked with a collection of documents (corpus), which must be provided in case of RAG evaluation (test cases typically make sense only in the context of a particular corpus; the corpus is empty in case of LLM evaluation).
- Test Suite
A set of tests which can be used to evaluate the LLM / RAG model.
- Test Lab
A test suite completed with the actual data - answers generated by the LLM / RAG models for given prompts, retrieved contexts (in case of RAG), durations, and costs.
- LLM Dataset
A normalized test lab that is used as the input of the evaluation process.
- Report
A collection of metrics and visualizations that describe the validity of a RAG / LLM model on a given test configuration.
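A minimal sketch of how these artifacts nest (plain Python dictionaries using the JSON field names documented in the sections below, not the H2O Eval Studio API):

# test suite > tests > test cases: a test binds a corpus to its test cases
test_suite = {
    "tests": [
        {
            # corpus document URLs; an empty list for (non-RAG) LLM evaluation
            "documents": ["<corpus document URL>"],
            "test_cases": [
                {
                    "prompt": "<question to ask>",
                    "expected_output": "<ground truth>",
                    "condition": "<optional Text Matching condition>",
                }
            ],
        }
    ]
}
# Resolving the suite against one or more models yields a test lab; its
# normalized form - the LLM dataset - adds the actual outputs, retrieved
# contexts, durations, and costs for every (test case, model) pair.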
Test Case
A test case is the smallest evaluation unit - it consists of a prompt, ground truth, conditions, and other parameters that are used for the evaluation.
Example:
{
  "key": "d981bf23-9790-4c5e-905e-2ab56cba7ece",
  "prompt": "What was the revenue of Brazil?",
  "categories": [
    "question_answering",
    "perturbed",
    "perturbed_by:h2o_sonar.utils.perturbations.SynonymPerturbator:PerturbationIntensity.HIGH"
  ],
  "relationships": [
    {
      "type": "perturbation_source",
      "target": "9c3a7df3-67df-4819-babb-20636611f077",
      "target_type": "test_case"
    }
  ],
  "expected_output": "Brazil revenue was 15,969 million.",
  "condition": "\"15,969\" AND \"million\""
}
Format description:
- key
The key is optional. If specified, it must be unique within the test suite.
- prompt
Prompt which is used to evaluate the LLM / RAG model.
- categories
Optional categories which are used to classify the test cases.
- relationships
Optional relationships which are used to interlink the test cases.
- relationships/type
Type of the relationship - for instance, perturbation_source, which indicates the test case from which this test case was derived by perturbation.
- relationships/target
Key of the relationship target - a test case key.
- relationships/target_type
Target type - for instance test_case.
- expected_output
Ground truth / expected output.
- condition
Optional condition used by the Text Matching evaluator (and possibly other evaluators) to evaluate the LLM / RAG model - see the sketch below.
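The examples in this document use two condition styles: a conjunction of quoted substrings (e.g. "15,969" AND "million") and a regular expression (e.g. regexp(^A.*)). The following sketch illustrates the apparent semantics in plain Python - it is an assumption for illustration, not the Text Matching evaluator's actual implementation:

import re

def matches_condition(answer: str, condition: str) -> bool:
    # regexp(...) style: match the embedded regular expression
    if condition.startswith("regexp(") and condition.endswith(")"):
        return re.match(condition[len("regexp("):-1], answer) is not None
    # conjunction style: every quoted term must appear in the answer
    terms = re.findall(r'"([^"]+)"', condition)
    return all(term in answer for term in terms)

assert matches_condition("Brazil revenue was 15,969 million.", '"15,969" AND "million"')
assert matches_condition("A. The first option.", "regexp(^A.*)")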
Python code which can be used to create the test case:
# `testing` is H2O Eval Studio's testing module (assumed to be imported)
testing.RagTestCaseConfig(
    prompt="What was the revenue of Brazil?",
    expected_output="Brazil revenue was 15,969 million.",
    condition='"15,969" AND "million"',
)
Test
A test is a set of test cases linked with a collection of documents (corpus), which must be provided in case of RAG evaluation (test cases typically make sense only in the context of a particular corpus; the corpus is empty in case of LLM evaluation).
Example:
{
  "key": "48099214-7b33-443a-a456-eb44519521f1",
  "documents": [],
  "test_cases": [
    {
      "key": "9c3a7df3-67df-4819-babb-20636611f077",
      "prompt": "Respond to the following question with single letter answer...",
      "categories": [
        "question-answering"
      ],
      "relationships": [],
      "condition": "regexp(^A.*)",
      "expected_output": "A"
    },
    {
      "key": "6709c0e1-de57-4b5b-b826-2060bf935574",
      "prompt": "Respond to the next question with one letter answer...",
      "categories": [
        "question-answering",
        "perturbed",
        "perturbed_by:h2o_sonar.utils.perturbations.SynonymPerturbator:PerturbationIntensity.HIGH"
      ],
      "relationships": [
        {
          "type": "perturbation_source",
          "target": "9c3a7df3-67df-4819-babb-20636611f077",
          "target_type": "test_case"
        }
      ],
      "condition": "regexp(^D.*)",
      "expected_output": "D"
    }
  ]
}
Format description:
- key
The key is optional. If specified, it must be unique within the test suite.
- documents
In case of RAG evaluation, the test has a set of documents - the corpus.
- test_cases
A test is formed by a set of :ref:`Test Case`s which are used to evaluate the LLM / RAG model in the context of the corpus specified in the test's ``documents`` field.
Python code which can be used to create the test:
# test (doc_url is the URL of a corpus document)
test = testing.RagTestConfig(documents=[doc_url])
# test cases
test_case_1 = testing.RagTestCaseConfig(
    prompt="What was the revenue of Brazil?",
    expected_output="Brazil revenue was 15,969 million.",
    condition='"15,969" AND "million"',
    config=test,
)
...
test_case_n = testing.RagTestCaseConfig(
    prompt="What was the revenue of Argentina?",
    expected_output="Argentina revenue was 5,969 million.",
    condition='"5,969" AND "million"',
    config=test,
)
Test Suite
A test suite is a set of tests which can be used to evaluate the LLM / RAG model.
Example:
{
  "name": "Test Suite Example",
  "description": "This is a test suite example.",
  "tests": [
    {
      "key": "da503caf-a2dc-4c0a-b5f7-618bebd7860d",
      "documents": [
        "https://public-data.s3.amazonaws.com/Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf"
      ],
      "test_cases": [
        {
          "key": "d981bf23-9790-4c5e-905e-2ab56cba7ece",
          "prompt": "What was the revenue of Brazil?",
          "categories": [
            "question_answering"
          ],
          "relationships": [],
          "expected_output": "Brazil revenue was 15,969 million.",
          "condition": "\"15,969\" AND \"million\""
        }
      ]
    },
    {
      "key": "be087c6b-ad41-4b2e-a9f1-0ffd5e6743ec",
      "documents": [
        "https://public-data.s3.amazonaws.com/bradesco-2022-integrated-report.pdf"
      ],
      "test_cases": [
        {
          "key": "3db6878b-1511-4b58-b42f-985ca9699d1d",
          "prompt": "Who is the chairman of the board?",
          "categories": [
            "question_answering"
          ],
          "relationships": [],
          "expected_output": "The chairman of the board was Luiz Carlos Trabuco Cappi.",
          "condition": "\"Luiz Carlos Trabuco Cappi\""
        },
        {
          "key": "13011931-70c4-40b9-aacc-3870f521dbaa",
          "prompt": "What was the number of agreements that include human rights clauses, in 2022?",
          "expected_output": "The number of agreements that include human rights clauses in 2022 was 22.",
          "categories": [
            "question_answering"
          ],
          "relationships": [],
          "condition": "\"22\""
        }
      ]
    }
  ]
}
Format description:
- /name
Test suite name.
- /description
Test suite description.
- /tests
Test suite is formed by a set of :ref:`Test`s.
- /tests/documents
In case of RAG evaluation, a test has a set of documents - the corpus.
- /tests/key
The key is optional. If specified, it must be unique within the test suite.
- /tests/test_cases
Each test is formed by a set of :ref:`Test Case`s which are used to evaluate the LLM / RAG model in the context of the given corpus.
Python code which can be used to create the test suite:
# test suite
test_suite = testing.RagTestSuiteConfig()
# test (doc_url_1 and doc_url_2 are URLs of corpus documents)
test = testing.RagTestConfig(documents=[doc_url_1, doc_url_2])
# test cases
test_suite.add_test_case(
    testing.RagTestCaseConfig(
        prompt="What was the revenue of Brazil?",
        expected_output="Brazil revenue was 15,969 million.",
        condition='"15,969" AND "million"',
        config=test,
    )
)
...
test_suite.add_test_case(
    testing.RagTestCaseConfig(
        prompt="What was the revenue of Argentina?",
        expected_output="Argentina revenue was 5,969 million.",
        condition='"5,969" AND "million"',
        config=test,
    )
)
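Because the test suite format is plain JSON, a suite can also be authored directly. A minimal sketch using only the Python standard library (field names exactly as documented above):

import json

suite = {
    "name": "Test Suite Example",
    "description": "This is a test suite example.",
    "tests": [
        {
            # corpus; leave the list empty for plain LLM evaluation
            "documents": [
                "https://public-data.s3.amazonaws.com/Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf"
            ],
            "test_cases": [
                {
                    "prompt": "What was the revenue of Brazil?",
                    "expected_output": "Brazil revenue was 15,969 million.",
                    "condition": "\"15,969\" AND \"million\"",
                }
            ],
        }
    ],
}
with open("test_suite.json", "w") as f:
    json.dump(suite, f, indent=2)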
Test Lab
A test lab is a test suite completed with the actual data - answers generated by the LLM / RAG models for the given prompts, retrieved contexts (in case of RAG), durations, and costs.
Example:
{
  "name": "TestLab",
  "description": "Test lab for RAG evaluation.",
  "raw_dataset": {
    "inputs": [
      {
        "input": "What is the purpose of the document?",
        "corpus": [
          "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
        ],
        "context": [],
        "categories": [
          "question_answering"
        ],
        "relationships": [],
        "expected_output": "The purpose of this document is to provide comprehensive guid...",
        "output_condition": "\"guidance\" AND \"model risk management\"",
        "actual_output": "",
        "actual_duration": 0.0,
        "cost": 0.0,
        "model_key": "0c34cdae-d2a4-4557-a305-d182adb01f6f"
      },
      {
        "input": "What is model validation?",
        "corpus": [
          "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
        ],
        "context": [],
        "categories": [
          "question_answering"
        ],
        "relationships": [],
        "expected_output": "Model validation is the set of processes and activities inte...",
        "output_condition": "\"processes\" AND \"activities\"",
        "actual_output": "",
        "actual_duration": 0.0,
        "cost": 0.0,
        "model_key": "14ac0946-6547-48d1-b19a-41bc90d34849"
      }
    ]
  },
  "dataset": {
    "inputs": [
      {
        "input": "What is the purpose of the document?",
        "corpus": [
          "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
        ],
        "context": [
          "Changes in\nregulation have spurred some of the recent developments, parti...",
          "Analysis of in-\nsample fit and of model performance in holdout samples (d...",
          "Validation reports should articulate model aspects that were reviewed, hig...",
          "Page 1\nSR Letter 11-7\nAttachment\nBoard of Governors of the Federal Rese...",
          "Documentation of model development and validation should be sufficiently\n..."
        ],
        "categories": [
          "question_answering"
        ],
        "relationships": [],
        "expected_output": "The purpose of this document is to provide comprehensive gu...",
        "output_condition": "\"guidance\" AND \"model risk management\"",
        "actual_output": "The purpose of the document is to provide comprehensive guida...",
        "actual_duration": 8.280992269515991,
        "cost": 0.0036560000000000013,
        "model_key": "0c34cdae-d2a4-4557-a305-d182adb01f6f"
      },
      {
        "input": "What is model validation?",
        "corpus": [
          "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
        ],
        "context": [
          "But it is not enough for model developers and users\nto understand and acce...",
          "While they may grant\nexceptions to typical procedures of model validation ...",
          "Management should have a clear plan for using the results of sensitivity an...",
          "Validation reports should articulate model aspects that were reviewed, high...",
          "Rigorous model validation plays a critical role in\nmodel risk management; ..."
        ],
        "categories": [
          "question_answering"
        ],
        "relationships": [],
        "expected_output": "Model validation is the set of processes and activities int...",
        "output_condition": "\"processes\" AND \"activities\"",
        "actual_output": "Model validation is the process of assessing the quality of a...",
        "actual_duration": 19.800140142440796,
        "cost": 0.004117999999999998,
        "model_key": "14ac0946-6547-48d1-b19a-41bc90d34849"
      }
    ]
  },
  "models": [
    {
      "connection": "612db877-cba2-4ab8-bacf-88d84b396450",
      "model_type": "h2ogpte",
      "name": "RAG model h2oai/h2ogpt-4096-llama2-70b-chat (docs: ['sr1107a1.pdf'])",
      "collection_id": "263c58e6-7424-4f9e-aa02-347d7281439b",
      "collection_name": "RAG collection (docs: ['sr1107a1.pdf'])",
      "llm_model_name": "h2oai/h2ogpt-4096-llama2-70b-chat",
      "documents": [
        "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
      ],
      "key": "0c34cdae-d2a4-4557-a305-d182adb01f6f"
    },
    {
      "connection": "612db877-cba2-4ab8-bacf-88d84b396450",
      "model_type": "h2ogpte",
      "name": "RAG model h2oai/h2ogpt-4096-llama2-13b-chat (docs: ['sr1107a1.pdf'])",
      "collection_id": "263c58e6-7424-4f9e-aa02-347d7281439b",
      "collection_name": "RAG collection (docs: ['sr1107a1.pdf'])",
      "llm_model_name": "h2oai/h2ogpt-4096-llama2-13b-chat",
      "documents": [
        "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
      ],
      "key": "14ac0946-6547-48d1-b19a-41bc90d34849"
    }
  ],
  "llm_model_names": [
    "h2oai/h2ogpt-4096-llama2-70b-chat",
    "h2oai/h2ogpt-4096-llama2-13b-chat"
  ],
  "docs_cache": {
    "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf": "/tmp/pytest-of-me/sr1107a1.pdf"
  }
}
Format description:
- /name
Test lab name.
- /description
Test lab description.
- /raw_dataset
The test suite before resolution - no actual data yet - normalized as an LLM dataset.
- /dataset
The test suite with resolved data, normalized as an LLM dataset.
- /models
List of models which were used to resolve the test suite - that is, to generate the actual data for the test cases, such as answers, durations, and costs.
- /models/connection
Key of the connection (stored in the H2O Eval Studio configuration) which was used to connect to the model host.
- /models/model_type
Type of the model host (e.g. h2ogpte or h2ogpt).
- /models/name
Name of the model.
- /models/collection_id
Corpus collection ID in case of RAG evaluation (not used in case of LLM evaluation).
- /models/collection_name
Corpus collection name in case of RAG evaluation (not used in case of LLM evaluation).
- /models/llm_model_name
Name of the LLM model which was evaluated and which generated the actual data.
- /models/documents
List of documents which were used in the evaluation of a RAG as corpus (not used in case of LLM evaluation).
- /models/key
Key of the model which is used to identify the model in the test suite. It is used in /dataset/inputs/model_key to link the resolved data with the model which created the data (see the sketch below).
- /llm_model_names
List of LLM model names which are hosted by the models above and which generated the actual data.
- /docs_cache
Cache of the documents used in the test suite in case of RAG evaluation.
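The linkage between /models and the resolved /dataset rows can be illustrated with plain Python (not the Eval Studio API; the file name is hypothetical). Each resolved row points back to the model that produced it through model_key, while its raw_dataset counterpart still has an empty actual_output:

import json

# load a serialized test lab (hypothetical file name)
with open("test_lab.json") as f:
    lab = json.load(f)

# index the evaluated models by their key
models_by_key = {m["key"]: m for m in lab["models"]}

# rows are assumed to appear in the same order in both datasets, as in the example
for raw, row in zip(lab["raw_dataset"]["inputs"], lab["dataset"]["inputs"]):
    assert raw["actual_output"] == ""  # raw_dataset rows are not yet resolved
    model = models_by_key[row["model_key"]]  # resolved row -> producing model
    print(model["llm_model_name"], "answered in", row["actual_duration"], "s at cost", row["cost"])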
Python code which can be used to create the test lab:
# test lab
test_lab = testing.RagTestLab.from_llm_test_suite(
    llm_host_connection=h2o_gpte_connection,
    llm_test_suite=test_suite,
    llm_model_type=models.ExplainableModelType.h2ogpte_llm,
    llm_model_names=llm_model_names,
)
# create collections and upload corpus documents (RAG only)
test_lab.build()
# resolve test lab
test_lab.complete_dataset()
A host-specific configuration can also be used to create the test lab. In this case, the host client can be used to obtain the available parameters, which can then be modified by the user and passed to the code that resolves the test lab. In order to compare RAGs/LLMs with different configurations, it is possible to provide multiple configurations for the same model. A configuration is a dictionary containing the parameters used to create the model - see RAG and LLM Hosts for reference. For instance:
#
# LLMs to be evaluated
#
llm_model_names = ["h2oai/h2ogpt-4096-llama2-70b-chat"]

#
# custom host configurations
#

# A: host configuration proto w/ default values
custom_config_a = genai.H2oGpteRagClient.config_factory()
# modify system prompt
custom_config_a["system_prompt"] = (
    "You are h2oGPTe, an expert question-answering AI system created by H2O.ai "
    "that performs like GPT-4 by OpenAI."
)
# modify LLM temperature in order to make the model more creative
custom_config_a["llm_args"] = {"temperature": 0.9}

# B: host configuration proto w/ default values
custom_config_b = genai.H2oGpteRagClient.config_factory()
# modify system prompt
custom_config_b["system_prompt"] = (
    "You are flaky, a rookie question-answering AI system that performs poorly."
)
# modify LLM temperature to avoid hallucinations
custom_config_b["llm_args"] = {"temperature": 0.0}

#
# LLM configurations: LLM name -> list of host configurations
#
llm_host_configs = {
    llm_model_names[0]: [
        custom_config_a,
        custom_config_b,
    ]
}

#
# test lab
#
test_lab = testing.RagTestLab.from_llm_test_suite(
    llm_host_connection=h2o_gpte_connection,
    llm_test_suite=test_suite,
    llm_model_type=models.ExplainableModelType.h2ogpte_llm,
    llm_model_names=llm_model_names,
    llm_host_configs=llm_host_configs,
)
# create collections and upload corpus documents (RAG only)
test_lab.build()
# resolve test lab
test_lab.complete_dataset()
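Note that providing two host configurations for the same LLM name presumably resolves to two model entries in the test lab, so the resulting report can compare the creative (temperature 0.9) and the deterministic (temperature 0.0) variant of the same LLM side by side.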
LLM Dataset
An LLM dataset is a normalized test lab that is used as the input of the evaluation process. The LLM dataset can be serialized as JSON, CSV, or a Pandas frame (see the sketch at the end of this section).
Example:
{
  "inputs": [
    {
      "input": "Which of the following statements accurately describes the impact of ...",
      "corpus": [
        "https://www.wikipedia.org/"
      ],
      "context": [
        "Physical Review Letters. 99 (14): 141302. arXiv:0704.1932. Bibcode:2007P...",
        "The homogeneously distributed mass of the universe would result in a rou...",
        "Modified Newtonian dynamics (MOND) is a hypothesis that proposes a modif..."
      ],
      "categories": [
        "question_answering"
      ],
      "relationships": [],
      "expected_output": "MOND is a theory that reduces the discrepancy between the...",
      "output_condition": "",
      "actual_output": "MOND is a theory that reduces the discrepancy between the o...",
      "actual_duration": 0.415,
      "model_key": "4b60abf6-0c8d-4765-804a-e830a211f152"
    }
  ]
}
Format description:
- /inputs
List of test cases.
- /inputs/input
Prompt.
- /inputs/corpus
List of documents used as the corpus for this test case.
- /inputs/context
List of chunks - retrieved contexts.
- /inputs/categories
List of categories used to classify the test case.
- /inputs/relationships
List of relationships.
- /inputs/expected_output
Expected output.
- /inputs/output_condition
Condition used by the Text Matching evaluator (and possibly other evaluators) to evaluate the LLM / RAG model.
- /inputs/actual_output
Actual output.
- /inputs/actual_duration
Duration of the request to the RAG / LLM model.
- /inputs/model_key
Key of the model which generated the actual data - it must be a valid model key when running the evaluation process.
Python code which can be used to create the LLM dataset:
# LLM dataset
dataset = datasets.LlmDataset()
# add row(s) to the dataset
dataset.add_input(
    i="What is the target column of the model?",
    actual_output="default payment next month",
)
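As noted above, the LLM dataset can be serialized as JSON, CSV, or a Pandas frame. A generic sketch of inspecting a JSON-serialized dataset with pandas (not the Eval Studio API; the file name is hypothetical):

import json

import pandas as pd

# load a JSON-serialized LLM dataset (hypothetical file name)
with open("llm_dataset.json") as f:
    llm_dataset = json.load(f)

# flatten the "inputs" rows into a frame for inspection
frame = pd.json_normalize(llm_dataset["inputs"])
print(frame[["input", "expected_output", "actual_output", "actual_duration"]])

# CSV serialization of the flattened frame
frame.to_csv("llm_dataset.csv", index=False)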