Evaluating RAGs and LLMs
========================

H2O Sonar can evaluate standalone LLMs (Large Language Models) and LLMs used by
RAG (Retrieval-augmented generation) systems.

.. image:: images/evaluation-high-level.png
   :alt: Evaluation diagram

LLMs are evaluated as follows:

1. :ref:`Test Data Preparation`
2. :ref:`System Under Evaluation Connection`
3. :ref:`Test Data Resolution`
4. :ref:`LLM Evaluation`
5. :ref:`Evaluation Results Analysis`

Test Data Preparation
---------------------

The first step is the preparation of test data - a set of test cases. Each test
case is a dictionary with the **query** (prompt) and the **expected answer** in
the context of a given **corpus** (used for RAG evaluation; standalone LLM
evaluation uses an empty corpus). Test cases which use the same corpus are
grouped into a **test**. Tests are grouped into a **test suite**
(see :ref:`Test Case, Suite, Lab and LLM Dataset` for more details).

Test suite example:

.. code-block:: json

    {
      "name": "Test Suite Example",
      "description": "This is a test suite example.",
      "tests": [
        {
          "key": "da503caf-a2dc-4c0a-b5f7-618bebd7860d",
          "documents": [
            "https://public-data.s3.amazonaws.com/Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf"
          ],
          "test_cases": [
            {
              "key": "d981bf23-9790-4c5e-905e-2ab56cba7ece",
              "prompt": "What was the revenue of Brazil?",
              "categories": [
                "question_answering"
              ],
              "relationships": [],
              "expected_output": "Brazil revenue was 15,969 million.",
              "condition": "\"15,969\" AND \"million\""
            }
          ]
        }
      ]
    }
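Before connecting to the system under evaluation, it may be useful to sanity-check
the test suite structure. The following is a minimal sketch that uses only the
Python standard library; the `test_suite.json` file name is an assumption, and the
checks simply mirror the fields of the example above:

.. code-block:: python

    import json

    # Load a test suite file (hypothetical name: test_suite.json) and verify the
    # structure shown in the example above - nothing H2O Sonar specific.
    with open("test_suite.json", encoding="utf-8") as f:
        suite = json.load(f)

    for test in suite["tests"]:
        assert test["documents"], "each test should reference at least one corpus document (RAG)"
        for test_case in test["test_cases"]:
            assert test_case["prompt"], "each test case needs a prompt"
            assert test_case["expected_output"], "each test case needs an expected answer"

    print(f"Test suite '{suite['name']}' with {len(suite['tests'])} test(s) looks well-formed.")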
System Under Evaluation Connection
----------------------------------

The next step is to provide the connection configuration of the system to be
evaluated using the test suite. The set of supported systems is documented in
the :ref:`RAG and LLM Hosts` section.

Example of the Enterprise h2oGPTe LLM host connection configuration:

.. code-block:: python

    h2o_gpte_connection = h2o_sonar_config.ConnectionConfig(
        connection_type=h2o_sonar_config.ConnectionConfigType.H2O_GPT_E.name,
        name="H2O GPT Enterprise",
        description="H2O GPT Enterprise LLM host example.",
        server_url="https://h2ogpte.genai-training.h2o.ai/",
        token="sk-IZQ9ioZBdRFMv6o31MAmkHzk5AHf8Bjs9q08lRbRLalNYHcT",
        token_use_type=h2o_sonar_config.TokenUseType.API_KEY.name,
    )

Test Data Resolution
--------------------

Given a test suite and a connection to the system to be evaluated, the next step
is to resolve them into a **test lab** - a set of test cases with resolved data:
**actual** answers, duration, cost, etc. for each test case.

Example of the test lab resolution for an LLM host connection and test suite:

.. code-block:: python

    test_lab = testing.RagTestLab.from_llm_test_suite(
        llm_host_connection=h2o_gpte_connection,
        llm_test_suite=test_suite,
        llm_model_type=models.ExplainableModelType.h2ogpte_llm,
        llm_model_names=llm_model_names,
    )

    # create collections and upload corpus documents (RAG only)
    test_lab.build()

    # resolve test lab
    test_lab.complete_dataset()

The argument `llm_model_names` is a list of LLM model names to be evaluated - for
example `["h2oai/h2ogpt-4096-llama2-70b-chat", "h2oai/h2ogpt-4096-llama2-13b-chat"]`.

Example of the **test lab** with resolved data:

.. code-block:: json

    {
      "name": "TestLab",
      "description": "Test lab for RAG evaluation.",
      "raw_dataset": {
        "inputs": [
          {
            "input": "What is the purpose of the document?",
            "corpus": [
              "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
            ],
            "context": [],
            "categories": [
              "question_answering"
            ],
            "relationships": [],
            "expected_output": "The purpose of this document is to provide comprehensive guid...",
            "output_condition": "\"guidance\" AND \"model risk management\"",
            "actual_output": "",
            "actual_duration": 0.0,
            "cost": 0.0,
            "model_key": "0c34cdae-d2a4-4557-a305-d182adb01f6f"
          }
        ]
      },
      "dataset": {
        "inputs": [
          {
            "input": "What is the purpose of the document?",
            "corpus": [
              "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
            ],
            "context": [
              "Changes in\nregulation have spurred some of the recent developments, parti...",
              "Analysis of in-\nsample fit and of model performance in holdout samples (d...",
              "Validation reports should articulate model aspects that were reviewed, hig...",
              "Page 1\nSR Letter 11-7\nAttachment\nBoard of Governors of the Federal Rese...",
              "Documentation of model development and validation should be sufficiently\n..."
            ],
            "categories": [
              "question_answering"
            ],
            "relationships": [],
            "expected_output": "The purpose of this document is to provide comprehensive gu...",
            "output_condition": "\"guidance\" AND \"model risk management\"",
            "actual_output": "The purpose of the document is to provide comprehensive guida...",
            "actual_duration": 8.280992269515991,
            "cost": 0.0036560000000000013,
            "model_key": "0c34cdae-d2a4-4557-a305-d182adb01f6f"
          }
        ]
      },
      "models": [
        {
          "connection": "612db877-cba2-4ab8-bacf-88d84b396450",
          "model_type": "h2ogpte",
          "name": "RAG model h2oai/h2ogpt-4096-llama2-70b-chat (docs: ['sr1107a1.pdf'])",
          "collection_id": "263c58e6-7424-4f9e-aa02-347d7281439b",
          "collection_name": "RAG collection (docs: ['sr1107a1.pdf'])",
          "llm_model_name": "h2oai/h2ogpt-4096-llama2-70b-chat",
          "documents": [
            "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
          ],
          "key": "0c34cdae-d2a4-4557-a305-d182adb01f6f"
        },
        {
          "connection": "612db877-cba2-4ab8-bacf-88d84b396450",
          "model_type": "h2ogpte",
          "name": "RAG model h2oai/h2ogpt-4096-llama2-13b-chat (docs: ['sr1107a1.pdf'])",
          "collection_id": "263c58e6-7424-4f9e-aa02-347d7281439b",
          "collection_name": "RAG collection (docs: ['sr1107a1.pdf'])",
          "llm_model_name": "h2oai/h2ogpt-4096-llama2-13b-chat",
          "documents": [
            "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf"
          ],
          "key": "14ac0946-6547-48d1-b19a-41bc90d34849"
        }
      ],
      "llm_model_names": [
        "h2oai/h2ogpt-4096-llama2-70b-chat",
        "h2oai/h2ogpt-4096-llama2-13b-chat"
      ],
      "docs_cache": {
        "https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf": "/tmp/pytest-of-me/sr1107a1.pdf"
      }
    }

LLM Evaluation
--------------

A test lab with resolved data has everything needed to evaluate the LLMs with
:ref:`Evaluators`.

Example of the LLM evaluation:

.. code-block:: python

    evaluation = evaluate.run_evaluation(
        # test lab provides resolved test data for evaluation as dataset
        dataset=test_lab.dataset,
        # models to be evaluated
        models=list(test_lab.evaluated_models.values()),
        # evaluators
        evaluators=evaluators,
        # where to save the report
        results_location=result_dir,
    )
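The `evaluators` argument is a list of evaluator instances. The snippet below is a
minimal sketch, assuming the tokens-presence evaluator used later in this section;
the import path is an assumption, and the complete list of available evaluators and
their configuration is documented in the :ref:`Evaluators` section:

.. code-block:: python

    # Assumed import path - see the Evaluators section for the actual module layout.
    from h2o_sonar.evaluators import rag_tokens_presence_evaluator

    # A minimal evaluator list: check that the expected tokens (output condition)
    # are present in the actual answers.
    evaluators = [
        rag_tokens_presence_evaluator.RagStrStrEvaluator(),
    ]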
Evaluation Results Analysis
---------------------------

The last step is the analysis of the evaluation results.

The results are saved in the `results_location` directory as an HTML report (for
humans), a JSON report (for machines), and a per-evaluator directory with all the
data used for the evaluation and created by the evaluators - see the
:ref:`Report and Results` and :ref:`Evaluators` sections for more details.

Example of the evaluation results analysis:

.. code-block:: python

    # HTML report
    html_report_path = evaluation.result.get_html_report_location()

    # evaluator results
    evaluator_result = evaluation.get_evaluator_result(
        rag_tokens_presence_evaluator.RagStrStrEvaluator().evaluator_id()
    )
    evaluator_result.summary()
    evaluator_result.data()
    evaluator_result.params()
    evaluator_result.plot()
    evaluator_result.log()
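The machine-readable results can also be post-processed outside of H2O Sonar. The
following is a minimal sketch using only the Python standard library; it assumes
only that JSON artifacts are stored under the `result_dir` directory passed as
`results_location` above:

.. code-block:: python

    import json
    from pathlib import Path

    # Walk the results directory and list the JSON artifacts produced by the evaluation.
    for json_path in sorted(Path(result_dir).rglob("*.json")):
        with json_path.open(encoding="utf-8") as f:
            report = json.load(f)
        keys = list(report)[:5] if isinstance(report, dict) else type(report).__name__
        print(f"{json_path}: {keys}")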