BYOJ: Bring Your Own Judge
==========================

H2O Sonar can be configured to use custom evaluation LLM judges, for instance to ensure privacy
and avoid sending sensitive data to a 3rd party and/or to the cloud.

There are two ways to configure a custom judge:

- **Forced from the H2O Sonar configuration** - the custom judge is forced from the H2O Sonar
  configuration and is used by all evaluators which need an LLM judge to evaluate the model.
- **Specified in the evaluator parameters** - the custom judge is specified in the parameters
  of a particular evaluator.

A custom LLM judge can be configured for the following evaluators:

- :ref:`Answer Correctness Evaluator`
- :ref:`Answer Semantic Similarity Evaluator`
- :ref:`Context Relevancy Evaluator`
- :ref:`RAGAS Evaluator`
- :ref:`Context Precision Evaluator`
- :ref:`Faithfulness Evaluator`
- :ref:`Context Recall Evaluator`
- :ref:`Answer Relevancy Evaluator`

For the same reasons, this feature also allows the embeddings provider to be reconfigured.

Force Custom Judge from H2O Sonar Configuration
-----------------------------------------------

To force a custom judge from the H2O Sonar configuration, the custom judge must be configured
in the H2O Sonar configuration and ``force_eval_judge`` must be set either to the judge's key
or to ``true``.

.. code-block:: python

    import os

    from h2o_sonar import config as h2o_sonar_config

    # create connection to the server where the judge's LLM is hosted
    my_h2ogpt_connection = h2o_sonar_config.ConnectionConfig(
        connection_type=h2o_sonar_config.ConnectionConfigType.H2O_GPT.name,
        name="H2O GPT",
        description="My H2O GPT server.",
        server_url="https://gpt.host.ai/",
        # KEY_H2OGPT_API_KEY holds the name of the environment variable with the API key
        token=os.getenv(KEY_H2OGPT_API_KEY),
        token_use_type=h2o_sonar_config.TokenUseType.API_KEY.name,
    )

    # register the connection in H2O Sonar configuration
    h2o_sonar_config.config.add_connection(my_h2ogpt_connection)

    # create custom judge configuration - note LLM model name
    judge_config = h2o_sonar_config.EvaluationJudgeConfig(
        name="My custom judge",
        description="My custom LLM judge to be used by evaluators.",
        judge_type=judge_type,  # the judge type matching the connection above
        connection=my_h2ogpt_connection.key,
        llm_model_name="h2oai/h2ogpt-4096-llama2-70b-chat",
    )

    # register the custom judge in H2O Sonar configuration
    h2o_sonar_config.config.add_evaluation_judge(judge_config)

    # force the custom judge in H2O Sonar configuration
    h2o_sonar_config.config.force_eval_judge = "true"

    ...

    # run evaluation

    ...

In the example above, the first LLM evaluation judge from the H2O Sonar configuration will be
used. To use a particular LLM evaluation judge, the key of the custom judge must be specified
instead of ``true``.

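For example, a minimal sketch reusing the ``judge_config`` object registered in the example
above:

.. code-block:: python

    # force the specific custom judge registered above (instead of "true",
    # which uses the first LLM evaluation judge from the configuration)
    h2o_sonar_config.config.force_eval_judge = judge_config.key
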
LLM Judge Specified in the Evaluator Parameters
-----------------------------------------------

To specify the custom judge in the evaluator parameters, the custom judge must be configured
in the H2O Sonar configuration and the judge key must be specified in the evaluator parameter.

.. code-block:: python

    import os

    from h2o_sonar import config as h2o_sonar_config
    from h2o_sonar import evaluate

    # create connection to the server where the judge's LLM is hosted
    my_h2ogpt_connection = h2o_sonar_config.ConnectionConfig(
        connection_type=h2o_sonar_config.ConnectionConfigType.H2O_GPT.name,
        name="H2O GPT",
        description="My H2O GPT server.",
        server_url="https://gpt.host.ai/",
        # KEY_H2OGPT_API_KEY holds the name of the environment variable with the API key
        token=os.getenv(KEY_H2OGPT_API_KEY),
        token_use_type=h2o_sonar_config.TokenUseType.API_KEY.name,
    )

    # register the connection in H2O Sonar configuration
    h2o_sonar_config.config.add_connection(my_h2ogpt_connection)

    # create custom judge configuration - note LLM model name
    judge_config = h2o_sonar_config.EvaluationJudgeConfig(
        name="My custom judge",
        description="My custom LLM judge to be used by evaluators.",
        judge_type=judge_type,  # the judge type matching the connection above
        connection=my_h2ogpt_connection.key,
        llm_model_name="h2oai/h2ogpt-4096-llama2-70b-chat",
    )

    # register the custom judge in H2O Sonar configuration
    h2o_sonar_config.config.add_evaluation_judge(judge_config)

    # use evaluator parameter to specify the custom judge
    evaluators = [
        commons.EvaluatorToRun(
            evaluator_id=ContextRelevancyEvaluator.evaluator_id(),
            params={
                ContextRelevancyEvaluator.PARAM_EVAL_JUDGE_CFG_KEY: judge_config.key,
            },
        )
    ]

    # run evaluation
    evaluation = evaluate.run_evaluation(
        # dataset w/ prompts, conditions and model keys
        dataset=test_lab.dataset,
        # models to be evaluated / compared to get leaderboard
        models=list(test_lab.evaluated_models.values()),
        # evaluators
        evaluators=evaluators,
        # where to save the report
        results_location=result_dir,
    )

    ...

In the example above, the custom judge specified in the evaluator parameters will be used.
If a custom judge is specified, then ``ragas`` evaluators are also reconfigured to use a
privacy safe embeddings provider which runs on the same machine as the custom judge.

Privacy Safe LLM Models
-----------------------

The decision which LLM model to use for the evaluation is crucial. The user must consider the
quality of the model, privacy, cost, availability and performance. Different models have
different trade-offs, and the user can choose the most suitable model for the particular
problem or task which is being solved.

The following leaderboard of LLM models gives a rough overview of the quality of the models
and how much precision / performance might be sacrificed when changing the judge, for instance
in order to ensure privacy.

+--------------+-----------------------------+-------------+-------+
| Privacy safe | LLM                         | Correlation | MAE   |
+==============+=============================+=============+=======+
| ❌           | gpt-4-1106-preview          | 0.964       | 0.024 |
+--------------+-----------------------------+-------------+-------+
| ❌           | claude-3-opus-20240229      | 0.891       | 0.078 |
+--------------+-----------------------------+-------------+-------+
| ❌           | mistral-large-latest        | 0.827       | 0.160 |
+--------------+-----------------------------+-------------+-------+
| ❌           | mistral-medium              | 0.823       | 0.128 |
+--------------+-----------------------------+-------------+-------+
| ✓            | Mixtral-8x7B-Instruct-v0.1  | 0.819       | 0.179 |
+--------------+-----------------------------+-------------+-------+
| ❌           | claude-3-sonnet-20240229    | 0.795       | 0.155 |
+--------------+-----------------------------+-------------+-------+
| ✓            | Nous-Capybara-34B           | 0.794       | 0.109 |
+--------------+-----------------------------+-------------+-------+
| ✓            | Mistral-7B-Instruct-v0.2    | 0.784       | 0.190 |
+--------------+-----------------------------+-------------+-------+
| ❌           | claude-2.1                  | 0.779       | 0.184 |
+--------------+-----------------------------+-------------+-------+
| ✓            | openchat-3.5-1210           | 0.736       | 0.221 |
+--------------+-----------------------------+-------------+-------+
| ✓            | gemma-7b-it                 | 0.696       | 0.226 |
+--------------+-----------------------------+-------------+-------+
| ❌           | gpt-3.5-turbo-0613          | 0.630       | 0.258 |
+--------------+-----------------------------+-------------+-------+
| ✓            | h2ogpt-4096-llama2-70b-chat | 0.604       | 0.347 |
+--------------+-----------------------------+-------------+-------+
| ✓            | zephyr-7b-beta              | 0.513       | 0.378 |
+--------------+-----------------------------+-------------+-------+
| ✓            | h2ogpt-4096-llama2-13b-chat | 0.343       | 0.441 |
+--------------+-----------------------------+-------------+-------+

The comparison of LLM models in the above table is based on an H2O.ai KGM benchmark as of
2024/3/15. The benchmark uses a custom prompt with a detailed description of how to construct
the LLM model score. The score is then used to calculate the metrics in the table. The table
shows both commercial and non-commercial (open) models side-by-side to compare the quality of
the models.

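The Correlation and MAE columns can be read as agreement metrics between a judge's scores and
reference scores. As an illustration (not the exact benchmark implementation), a minimal sketch
assuming hypothetical judge and reference scores on a 0-1 scale, ``numpy``, and Pearson
correlation:

.. code-block:: python

    import numpy as np

    # hypothetical scores on a 0-1 scale: benchmark reference vs. a candidate judge
    reference_scores = np.array([0.9, 0.2, 0.7, 0.5, 1.0, 0.3])
    judge_scores = np.array([0.8, 0.3, 0.6, 0.5, 0.9, 0.5])

    # correlation between the judge's scores and the reference scores
    correlation = np.corrcoef(reference_scores, judge_scores)[0, 1]

    # mean absolute error (MAE) between the judge's scores and the reference scores
    mae = np.abs(reference_scores - judge_scores).mean()

    print(f"Correlation: {correlation:.3f}, MAE: {mae:.3f}")

Leaderboards which compare LLM models from various perspectives - like the Chatbot Arena
Leaderboard - are available on the internet and can be used to choose the most suitable model
for the evaluation.
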
+--------------+-----------------------------+-------------+-------+
| Privacy safe | LLM                         | Correlation | MAE   |
+==============+=============================+=============+=======+
| ✓            | Mixtral-8x7B-Instruct-v0.1  | 0.819       | 0.179 |
+--------------+-----------------------------+-------------+-------+
| ✓            | Nous-Capybara-34B           | 0.794       | 0.109 |
+--------------+-----------------------------+-------------+-------+
| ✓            | Mistral-7B-Instruct-v0.2    | 0.784       | 0.190 |
+--------------+-----------------------------+-------------+-------+
| ✓            | openchat-3.5-1210           | 0.736       | 0.221 |
+--------------+-----------------------------+-------------+-------+
| ✓            | gemma-7b-it                 | 0.696       | 0.226 |
+--------------+-----------------------------+-------------+-------+
| ✓            | h2ogpt-4096-llama2-70b-chat | 0.604       | 0.347 |
+--------------+-----------------------------+-------------+-------+
| ✓            | zephyr-7b-beta              | 0.513       | 0.378 |
+--------------+-----------------------------+-------------+-------+
| ✓            | h2ogpt-4096-llama2-13b-chat | 0.343       | 0.441 |
+--------------+-----------------------------+-------------+-------+

The table above shows only the privacy safe LLM models, i.e. models which can be hosted
on-premise / within the organization. As described above, privacy safe LLM models are suitable
for the evaluation of sensitive data, since they ensure privacy and avoid sending sensitive
data to a 3rd party and/or to the cloud.

See also:

- :ref:`Library Connections Configuration`