Test Case, Suite, Lab and LLM Dataset

H2O Eval Studio can evaluate standalone LLMs (Large Language Models) and LLMs used by RAG (Retrieval-augmented generation) applications using the evaluation data shipped with the product and/or custom evaluation data provided by the user.

Evaluation data can be provided in the form of test cases, tests, test suites, and test labs. An LLM dataset - a normalized test lab - is used as the input of the evaluation process.

Terminology:

  • Raw Test Data
    • A dataset - provided by the user, exported from a system, or generated - from which the evaluation data can be created.

  • Test Case
    • A single evaluation unit for the LLM / RAG - it consists of a prompt, ground truth, conditions, and other parameters used for the evaluation.

  • Test
    • A set of test cases linked to a collection of documents (corpus), which must be provided for RAG evaluation - test cases typically make sense only in the context of a particular corpus. The corpus is empty for LLM evaluation.

  • Test Suite
    • A set of tests which can be used to evaluate the LLM / RAG model.

  • Test Lab
    • A test suite completed with actual data - answers generated by the LLM / RAG models for the given prompts, retrieved contexts (RAG only), durations, and costs.

  • LLM Dataset
    • A normalized test lab that is used as the input of the evaluation process.

  • Report
    • A collection of metrics and visualizations that describe the validity of a RAG / LLM model on a given test configuration.

Test Case

A test case is a single evaluation unit for the LLM / RAG - it consists of a prompt, ground truth, conditions, and other parameters used for the evaluation.

Example:

{
    "key": "d981bf23-9790-4c5e-905e-2ab56cba7ece",
    "prompt": "What was the revenue of Brazil?",
    "categories": [
        "question_answering",
        "perturbed",
        "perturbed_by:h2o_sonar.utils.perturbations.SynonymPerturbator:PerturbationIntensity.HIGH"
    ],
    "relationships": [
        {
            "type": "perturbation_source",
            "target": "9c3a7df3-67df-4819-babb-20636611f077",
            "target_type": "test_case"
        }
    ],
    "expected_output": "Brazil revenue was 15,969 million.",
    "condition": "\"15,969\" AND \"million\""
}

Format description:

  • key
    • The key is optional. If specified, it must be unique within the test suite.

  • prompt
    • The prompt used to evaluate the LLM / RAG model.

  • categories
    • Optional categories used to classify the test case.

  • relationships
    • Optional relationships used to interlink test cases.

  • relationships/type
    • Type of the relationship - for instance, perturbation_source indicates the test case from which this one was derived by perturbation.

  • relationships/target
    • Key of the relationship target - a test case key.

  • relationships/target_type
    • Target type - for instance test_case.

  • expected_output
    • Ground truth / expected output.

  • condition
    • An optional condition used by the Text Matching evaluator (and possibly by other evaluators) to check the actual output - see the sketch below.
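
For illustration, a condition like the one above could be checked with plain Python. The following is a minimal, hypothetical sketch covering the two condition forms shown in this document (quoted literals joined by AND, and regexp(...)) - it is not the actual Text Matching evaluator implementation:

import re

def matches_condition(answer: str, condition: str) -> bool:
    # Illustrative only - not the product's Text Matching evaluator.
    # regexp(...) conditions: search the answer for the pattern.
    m = re.fullmatch(r"regexp\((.*)\)", condition)
    if m:
        return re.search(m.group(1), answer) is not None
    # Quoted literals joined by AND: every literal must occur in the answer.
    literals = re.findall(r'"([^"]+)"', condition)
    return bool(literals) and all(lit in answer for lit in literals)

assert matches_condition("Brazil revenue was 15,969 million.", '"15,969" AND "million"')
assert matches_condition("A is the answer.", "regexp(^A.*)")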

Python code which can be used to create the test case:

# assuming the H2O Eval Studio Python API is imported, e.g.:
#   from h2o_sonar import testing
testing.RagTestCaseConfig(
    prompt="What was the revenue of Brazil?",
    expected_output="Brazil revenue was 15,969 million.",
    condition='"15,969" AND "million"',
)

Test

A test is a set of test cases linked to a collection of documents (corpus), which must be provided for RAG evaluation - test cases typically make sense only in the context of a particular corpus. The corpus is empty for LLM evaluation.

Example:

{
    "key": "48099214-7b33-443a-a456-eb44519521f1",
    "documents": [],
    "test_cases": [
        {
            "key": "9c3a7df3-67df-4819-babb-20636611f077",
            "prompt": "Respond to the following question with single letter answer...",
            "categories": [
                "question-answering"
            ],
            "relationships": [],
            "condition": "regexp(^A.*)",
            "expected_output": "A"
        },
        {
            "key": "6709c0e1-de57-4b5b-b826-2060bf935574",
            "prompt": "Respond to the next question with one letter answer...",
            "categories": [
                "question-answering",
                "perturbed",
                "perturbed_by:h2o_sonar.utils.perturbations.SynonymPerturbator:PerturbationIntensity.HIGH"
            ],
            "relationships": [
                {
                    "type": "perturbation_source",
                    "target": "9c3a7df3-67df-4819-babb-20636611f077",
                    "target_type": "test_case"
                }
            ],
            "condition": "regexp(^D.*)",
            "expected_output": "D"
        }
    ]
}

Format description:

  • key
    • The key is optional. If specified, it must be unique within the test suite.

  • documents
    • For RAG evaluation, the test has a set of documents - the corpus.

  • test_cases
    • The test is formed by a set of Test Cases used to evaluate the LLM / RAG model in the context of the corpus specified in the test's documents field; see the traversal sketch below.
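
A test serialized in the JSON format above can be traversed with plain Python. The following is an illustrative sketch (the helper is hypothetical, operating directly on the documented JSON structure):

import json

def iter_test_cases(test_json: str):
    # Yield (prompt, expected_output, condition) triples of a serialized test.
    test = json.loads(test_json)
    for case in test.get("test_cases", []):
        yield case["prompt"], case.get("expected_output"), case.get("condition")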

Python code which can be used to create the test:

# test
test = testing.RagTestConfig(documents=[doc_url])

# test cases
test_case_1 = testing.RagTestCaseConfig(
    prompt="What was the revenue of Brazil?",
    expected_output="Brazil revenue was 15,969 million.",
    condition='"15,969" AND "million"',
    config=test,
)
...
test_case_n = testing.RagTestCaseConfig(
    prompt="What was the revenue of Argentina?",
    expected_output="Argentina revenue was 5,969 million.",
    condition='"5,969" AND "million"',
    config=test,
)

Test Suite

A test suite is a set of tests which can be used to evaluate the LLM / RAG model.

Example:

{
    "name": "Test Suite Example",
    "description": "This is a test suite example.",
    "tests": [
        {
            "key": "da503caf-a2dc-4c0a-b5f7-618bebd7860d",
            "documents": [
                "https://public-data.s3.amazonaws.com/Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf"
            ],
            "test_cases": [
                {
                    "key": "d981bf23-9790-4c5e-905e-2ab56cba7ece",
                    "prompt": "What was the revenue of Brazil?",
                    "categories": [
                        "question_answering"
                    ],
                    "relationships": [],
                    "expected_output": "Brazil revenue was 15,969 million.",
                    "condition": "\"15,969\" AND \"million\"",
                }
            ]
        },
        {
            "key": "be087c6b-ad41-4b2e-a9f1-0ffd5e6743ec",
            "documents": [
                "https://public-data.s3.amazonaws.com/bradesco-2022-integrated-report.pdf"
            ],
            "test_cases": [
                {
                    "key": "3db6878b-1511-4b58-b42f-985ca9699d1d",
                    "prompt": "Who is the chairman of the board?",
                    "categories": [
                        "question_answering"
                    ],
                    "relationships": [],
                    "expected_output": "The chairman of the board was Luiz Carlos Trabuco Cappi.",
                    "condition": "\"Luiz Carlos Trabuco Cappi\"",
                },
                {
                    "key": "13011931-70c4-40b9-aacc-3870f521dbaa",
                    "prompt": "What was the number of agreements that include human rights clauses, in 2022?",
                    "expected_output": "The number of agreements that include human rights clauses in 2022 was 22.",
                    "categories": [
                        "question_answering"
                    ],
                    "relationships": [],
                    "condition": "\"22\""
                }
            ]
        }
    ]
}

Format description:

  • /name
    • Test suite name.

  • /description
    • Test suite description.

  • /tests
    • The test suite is formed by a set of Tests.

  • /tests/documents
    • For RAG evaluation, each test has a set of documents - the corpus.

  • /tests/key
    • The key is optional. If specified, it must be unique within the test suite (see the key-uniqueness sketch below).

  • /tests/test_cases
    • Each test is formed by a set of Test Cases which are used to evaluate the LLM / RAG model in the context of the given corpus.
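
Since test and test case keys must be unique within a suite, a serialized suite can be sanity-checked with plain Python. The following is a minimal, hypothetical sketch over the documented JSON structure:

import json

def check_unique_keys(suite_json: str) -> None:
    # Collect every explicit test and test case key in the suite.
    suite = json.loads(suite_json)
    keys = []
    for test in suite.get("tests", []):
        if "key" in test:
            keys.append(test["key"])
        for case in test.get("test_cases", []):
            if "key" in case:
                keys.append(case["key"])
    duplicates = {key for key in keys if keys.count(key) > 1}
    if duplicates:
        raise ValueError(f"Duplicate keys in test suite: {duplicates}")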

Python code which can be used to create the test suite:

# test suite
test_suite = testing.RagTestSuiteConfig()

# test
test = testing.RagTestConfig(documents=[doc_url_1, doc_url_2])

# test cases
test_suite.add_test_case(
    testing.RagTestCaseConfig(
        prompt="What was the revenue of Brazil?",
        expected_output="Brazil revenue was 15,969 million.",
        condition='"15,969" AND "million"',
        config=test,
    )
)
...
test_suite.add_test_case(
    testing.RagTestCaseConfig(
        prompt="What was the revenue of Argentina?",
        expected_output="Argentina revenue was 5,969 million.",
        condition='"5,969" AND "million"',
        config=test,
    )
)

Test Lab

A test lab is a test suite completed with actual data - answers generated by the LLM / RAG models for the given prompts, retrieved contexts (RAG only), durations, and costs.

Example:

{
    "name": "TestLab",
    "description": "Test lab for RAG evaluation.",
    "raw_dataset": {
        "inputs": [
            {
                "input": "What is the purpose of the document?",
                "corpus": [
                    "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
                ],
                "context": [],
                "categories": [
                    "question_answering"
                ],
                "relationships": [],
                "expected_output": "The purpose of this document is to provide comprehensive guid...",
                "output_condition": "\"guidance\" AND \"model risk management\"",
                "actual_output": "",
                "actual_duration": 0.0,
                "cost": 0.0,
                "model_key": "0c34cdae-d2a4-4557-a305-d182adb01f6f"
            },
            {
                "input": "What is model validation?",
                "corpus": [
                    "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
                ],
                "context": [],
                "categories": [
                    "question_answering"
                ],
                "relationships": [],
                "expected_output": "Model validation is the set of processes and activities inte...",
                "output_condition": "\"processes\" AND \"activities\"",
                "actual_output": "",
                "actual_duration": 0.0,
                "cost": 0.0,
                "model_key": "14ac0946-6547-48d1-b19a-41bc90d34849"
            }
        ]
    },
    "dataset": {
        "inputs": [
            {
                "input": "What is the purpose of the document?",
                "corpus": [
                    "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
                ],
                "context": [
                    "Changes in\nregulation have spurred some of the recent developments, parti...",
                    "Analysis of in-\nsample fit and of model performance in holdout samples (d...",
                    "Validation reports should articulate model aspects that were reviewed, hig...",
                    "Page 1\nSR Letter 11-7\nAttachment\nBoard of Governors of the Federal Rese...",
                    "Documentation of model development and validation should be sufficiently\n...",
                ],
                "categories": [
                    "question_answering"
                ],
                "relationships": [],
                "expected_output": "The purpose of this document is to provide comprehensive gu...",
                "output_condition": "\"guidance\" AND \"model risk management\"",
                "actual_output": "The purpose of the document is to provide comprehensive guida...",
                "actual_duration": 8.280992269515991,
                "cost": 0.0036560000000000013,
                "model_key": "0c34cdae-d2a4-4557-a305-d182adb01f6f"
            },
            {
                "input": "What is model validation?",
                "corpus": [
                    "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
                ],
                "context": [
                    "But it is not enough for model developers and users\nto understand and acce...",
                    "While they may grant\nexceptions to typical procedures of model validation ...",
                    "Management should have a clear plan for using the results of sensitivity an...",
                    "Validation reports should articulate model aspects that were reviewed, high...",
                    "Rigorous model validation plays a critical role in\nmodel risk management; ..."
                ],
                "categories": [
                    "question_answering"
                ],
                "relationships": [],
                "expected_output": "Model validation is the set of processes and activities int...",
                "output_condition": "\"processes\" AND \"activities\"",
                "actual_output": "Model validation is the process of assessing the quality of a...",
                "actual_duration": 19.800140142440796,
                "cost": 0.004117999999999998,
                "model_key": "14ac0946-6547-48d1-b19a-41bc90d34849"
            }
        ]
    },
    "models": [
        {
            "connection": "612db877-cba2-4ab8-bacf-88d84b396450",
            "model_type": "h2ogpte",
            "name": "RAG model h2oai/h2ogpt-4096-llama2-70b-chat (docs: ['sr1107a1.pdf'])",
            "collection_id": "263c58e6-7424-4f9e-aa02-347d7281439b",
            "collection_name": "RAG collection (docs: ['sr1107a1.pdf'])",
            "llm_model_name": "h2oai/h2ogpt-4096-llama2-70b-chat",
            "documents": [
                "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
            ],
            "key": "0c34cdae-d2a4-4557-a305-d182adb01f6f"
        },
        {
            "connection": "612db877-cba2-4ab8-bacf-88d84b396450",
            "model_type": "h2ogpte",
            "name": "RAG model h2oai/h2ogpt-4096-llama2-13b-chat (docs: ['sr1107a1.pdf'])",
            "collection_id": "263c58e6-7424-4f9e-aa02-347d7281439b",
            "collection_name": "RAG collection (docs: ['sr1107a1.pdf'])",
            "llm_model_name": "h2oai/h2ogpt-4096-llama2-13b-chat",
            "documents": [
                "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
            ],
            "key": "14ac0946-6547-48d1-b19a-41bc90d34849"
        }
    ],
    "llm_model_names": [
        "h2oai/h2ogpt-4096-llama2-70b-chat",
        "h2oai/h2ogpt-4096-llama2-13b-chat"
    ],
    "docs_cache": {
        "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf": "/tmp/pytest-of-me/sr1107a1.pdf"
    }
}

Format description:

  • /name
    • Test lab name.

  • /description
    • Test lab description.

  • /raw_dataset
    • The test suite, normalized as an LLM dataset, before resolution - the actual outputs, durations, and costs are not filled in yet.

  • /dataset
    • The test suite with resolved data, normalized as an LLM dataset - the actual outputs, durations, and costs generated by the models.

  • /models
    • List of models which were used to resolve the test suite - i.e., which generated the actual data for the test cases: answers, durations, and costs.

  • /models/connection
    • Key of the connection (stored in the H2O Eval Studio configuration) which was used to connect to the model host.

  • /models/model_type
    • Type of the model host (e.g. h2ogpte or h2ogpt).

  • /models/name
    • Name of the model.

  • /models/collection_id
    • Corpus collection ID in case of RAG evaluation (not used in case of LLM evaluation).

  • /models/collection_name
    • Corpus collection name in case of RAG evaluation (not used in case of LLM evaluation).

  • /models/llm_model_name
    • Name of the LLM model which was evaluated and which generated the actual data.

  • /models/documents
    • List of documents which were used in the evaluation of a RAG as corpus (not used in case of LLM evaluation).

  • /models/key
    • Key which identifies the model within the test lab. It is referenced by /dataset/inputs/model_key to link the resolved data with the model which generated it (see the grouping sketch below).

  • /llm_model_names
    • List of the names of the LLM models hosted by the entries in /models which generated the actual data.

  • /docs_cache
    • Cache of the documents used in the test suite for RAG evaluation - a mapping from document URL to local file path.
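
The /models/key to /dataset/inputs/model_key linkage can be used to group the resolved answers by the model that generated them. The following is a minimal, hypothetical sketch over the documented JSON structure:

import json

def outputs_by_model(test_lab_json: str) -> dict:
    # Join /dataset/inputs/model_key against /models/key and group
    # the actual outputs by the LLM model name.
    lab = json.loads(test_lab_json)
    models_by_key = {m["key"]: m for m in lab["models"]}
    grouped: dict = {}
    for row in lab["dataset"]["inputs"]:
        model = models_by_key[row["model_key"]]
        grouped.setdefault(model["llm_model_name"], []).append(row["actual_output"])
    return grouped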

Python code which can be used to create the test lab:

# test lab
test_lab = testing.RagTestLab.from_llm_test_suite(
    llm_host_connection=h2o_gpte_connection,
    llm_test_suite=test_suite,
    llm_model_type=models.ExplainableModelType.h2ogpte_llm,
    llm_model_names=llm_model_names,
)

# create collections and upload corpus documents (RAG only)
test_lab.build()

# resolve test lab
test_lab.complete_dataset()

A host-specific configuration can be used when creating the test lab. In this case, the host client can be used to obtain the available parameters, which the user may modify and then pass to the code that resolves the test lab. To compare RAGs / LLMs with different configurations, multiple configurations can be provided for the same model. A configuration is a dictionary of the parameters used to create the model - see RAG and LLM Hosts for reference. For instance:

#
# LLMs to be evaluated
#
llm_model_names = ["h2oai/h2ogpt-4096-llama2-70b-chat"]

#
# custom host configurations
#

# A
# host configuration proto w/ default values
custom_config_a = genai.H2oGpteRagClient.config_factory()
# modify system prompt
custom_config_a["system_prompt"] = (
    "You are h2oGPTe, an expert question-answering AI system created by H2O.ai "
    "that performs like GPT-4 by OpenAI."
)
# modify LLM temperature in order to make the model more creative
custom_config_a["llm_args"] = {"temperature": 0.9}

# B
# host configuration proto w/ default values
custom_config_b = genai.H2oGpteRagClient.config_factory()
# modify system prompt
custom_config_b["system_prompt"] = (
    "You are flaky, a rookie question-answering AI system that performs poorly."
)
# modify LLM temperature to avoid hallucinations
custom_config_b["llm_args"] = {"temperature": 0.0}

#
# LLM configurations: LLM name -> list of host configurations
#
llm_host_configs = {
    llm_model_names[0]: [
        custom_config_a, custom_config_b,
    ]
}

#
# test lab
#
test_lab = testing.RagTestLab.from_llm_test_suite(
    llm_host_connection=h2o_gpte_connection,
    llm_test_suite=test_suite,
    llm_model_type=models.ExplainableModelType.h2ogpte_llm,
    llm_model_names=llm_model_names,
    llm_host_configs=llm_host_configs,
)
# create collections and upload corpus documents (RAG only)
test_lab.build()
# resolve test lab
test_lab.complete_dataset()
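
When comparing several configurations of the same LLM, it can help to record exactly which parameters differ between them. The following hypothetical helper is a plain-Python sketch over the configuration dictionaries from the example above:

def config_diff(config_a: dict, config_b: dict) -> dict:
    # Report parameters whose values differ between two host configurations.
    keys = set(config_a) | set(config_b)
    return {
        key: (config_a.get(key), config_b.get(key))
        for key in keys
        if config_a.get(key) != config_b.get(key)
    }

# e.g. config_diff(custom_config_a, custom_config_b) reports the differing
# "system_prompt" and "llm_args" values used in the example above.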

LLM Dataset

An LLM dataset is a normalized test lab that is used as the input of the evaluation process. The LLM dataset can be serialized as JSON, CSV, or a Pandas DataFrame (see the conversion sketch below).

Example:

{
    "inputs": [
        {
            "input": "Which of the following statements accurately describes the impact of ...",
            "corpus": [
                "https://www.wikipedia.org/"
            ],
            "context": [
                "Physical Review Letters. 99 (14): 141302. arXiv:0704.1932. Bibcode:2007P...",
                "The homogeneously distributed mass of the universe would result in a rou...",
                "Modified Newtonian dynamics (MOND) is a hypothesis that proposes a modif..."
            ],
            "categories": [
                "question_answering"
            ],
            "relationships": [],
            "expected_output": "MOND is a theory that reduces the discrepancy between the...",
            "output_condition": "",
            "actual_output": "MOND is a theory that reduces the discrepancy between the o...",
            "actual_duration": 0.415,
            "model_key": "4b60abf6-0c8d-4765-804a-e830a211f152"
        }
    ]
}

Format description:

  • /inputs
    • List of test cases.

  • /inputs/input
    • Prompt.

  • /inputs/corpus
    • List of documents used as the corpus for this test case.

  • /inputs/context
    • List of chunks - retrieved contexts.

  • /inputs/categories
    • List of categories used to classify the test case.

  • /inputs/relationships
    • List of relationships.

  • /inputs/expected_output
    • Expected output.

  • /inputs/output_condition
    • Condition used by the Text Matching evaluator (and possibly by other evaluators) to evaluate the LLM / RAG model.

  • /inputs/actual_output
    • Actual output.

  • /inputs/actual_duration
    • Duration of the request to the RAG / LLM model.

  • /inputs/model_key
    • Key of the model which generated the actual data - it must be a valid model key when running the evaluation process.
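
Because the serialized form is a flat list of inputs, the JSON representation above converts naturally to tabular form. The following is a minimal sketch using pandas, operating on the documented JSON structure rather than on the product's own serialization API:

import json

import pandas as pd

def llm_dataset_to_frame(dataset_json: str) -> pd.DataFrame:
    # Each entry of /inputs becomes one row; list-valued fields
    # (corpus, context, categories, relationships) remain Python lists.
    rows = json.loads(dataset_json)["inputs"]
    return pd.json_normalize(rows)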

Python code which can be used to create the LLM dataset:

# LLM dataset
dataset = datasets.LlmDataset()

# add row(s) to the dataset - `i` is the input (prompt)
dataset.add_input(
    i="What is the target column of the model?",
    actual_output="default payment next month",
)