# ChatCompletionRequest

## Properties
Name | Type | Description | Notes |
---|---|---|---|
message | str | A query or an instruction from the end user to the LLM. | |
system_prompt | str | A text sent to models which support system prompts. It gives the model overall context on how to respond. Use 'auto' for the model default. Don't specify for no system prompt. | [optional] |
pre_prompt_query | str | A text that is prepended before the contextual document chunks. The default can be customized per environment. | [optional] [default to 'Pay attention and remember the information below, which will help to answer the question or imperative after the context ends.\\n'] |
prompt_query | str | A text that is added to the beginning of the user's message. The default can be customized per environment. | [optional] [default to 'According to only the information in the document sources provided within the context above,'] |
image_batch_final_prompt | str | A prompt for each image batch for vision models. | [optional] |
image_batch_image_prompt | str | A prompt to reduce all answers from each image batch for vision models. | [optional] |
llm | str | LLM name to send the query to. Use "auto" for automatic model routing; set cost_controls in llm_args for detailed control over automatic routing. | [optional] |
llm_args | Dict[str, object] | A map of arguments sent to the LLM with the query. * `temperature` (type=double, default=0.0) - A value used to modulate the next token probabilities. 0 is the most deterministic and 1 is the most creative. * `top_k` (type=integer, default=1) - The number of highest-probability vocabulary tokens to keep for top-k filtering. * `top_p` (type=double, default=0.0) - If set to a value < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. * `seed` (type=integer, default=0) - A seed for the random number generator when sampling during generation (if temp>0 or top_k>1 or top_p<1); seed=0 picks a random seed. * `repetition_penalty` (type=double, default=1.07) - A parameter for repetition penalty. 1.0 means no penalty. * `max_new_tokens` (type=double, default=1024) - The maximum number of new tokens to generate. This limit applies to each (map+reduce) step during summarization and each (map) step during extraction. * `min_max_new_tokens` (type=integer, default=512) - A minimum value for max_new_tokens when auto-adjusting for content of prompt, docs, etc. * `response_format` (type=enum[text, json_object, json_code], default=text) - The output type of the LLM. * `guided_json` (type=map) - If specified, the output will follow the JSON schema. * `guided_regex` (type=string) - If specified, the output will follow the regex pattern. Only for models that support guided generation. * `guided_choice` (type=array[string]) - If specified, the output will be exactly one of the choices. Only for models that support guided generation. * `guided_grammar` (type=string) - If specified, the output will follow the context-free grammar. Only for models that support guided generation. * `guided_whitespace_pattern` (type=string) - If specified, will override the default whitespace pattern for guided JSON decoding. Only for models that support guided generation. * `enable_vision` (type=enum[on, off, auto], default=auto) - Controls vision mode, i.e., whether to send images to the LLM in addition to text chunks. * `visible_vision_models` (type=array[string], default=[auto]) - Controls which vision model to use when processing images. Must provide exactly one model. [auto] for automatic. * `cost_controls` (type=map) - A map with cost control settings: * `max_cost` (type=double) - Sets the maximum allowed cost in USD per LLM call when doing automatic model routing. If the estimated cost based on input and output token counts is higher than this limit, the request will fail as early as possible. * `max_cost_per_million_tokens` (type=double) - Only consider models that cost less than this value in USD per million tokens when doing automatic routing, using the max of input and output cost. * `model` (type=array[string]) - Optional subset of models to consider when doing automatic routing. If not specified, all models are considered. * `willingness_to_pay` (type=double) - Controls the willingness to pay extra for a more accurate model for every LLM call when doing automatic routing, in units of USD per +10% increase in accuracy. We start with the least accurate model. For each more accurate model, we accept it if the increase in estimated cost divided by the increase in estimated accuracy is no more than this value divided by 10%, up to the upper limit specified above. Lower values will try to keep the cost as low as possible; higher values will approach the cost limit to increase accuracy. 0 means unlimited. * `willingness_to_wait` (type=double) - Controls the willingness to wait longer for a more accurate model for every LLM call when doing automatic routing, in units of seconds per +10% increase in accuracy. We start with the least accurate model. For each more accurate model, we accept it if the increase in estimated time divided by the increase in estimated accuracy is no more than this value divided by 10%. Lower values will try to keep the time as low as possible; higher values will take longer to increase accuracy. 0 means unlimited. (A construction sketch using these arguments follows this table.) | [optional] |
self_reflection_config | Dict[str, object] | A map with self reflection settings: * `llm_reflection` (type=string, example=gpt-4-0613) * `prompt_reflection` (type=string, example=\"\"\"Prompt:\\n%s\\n\"\"\"\\n\\n\"\"\") * `system_prompt_reflection` (type=string) * `llm_args_reflection` (type=string, example={}) | [optional] |
rag_config | object | A map with arguments to control RAG (retrieval-augmented generation) behavior: * `rag_type` (type=enum[auto, llm_only, rag, hyde1, hyde2, rag+, all_data]) - RAG type options: * `auto` - Automatically select the best rag_type. * `llm_only` LLM Only - Answer the query without any supporting document contexts. Requires 1 LLM call. * `rag` RAG (Retrieval Augmented Generation) - Use supporting document contexts to answer the query. Requires 1 LLM call. * `hyde1` LLM Only + RAG composite - HyDE RAG (Hypothetical Document Embedding). Use the 'LLM Only' response to find relevant contexts from a collection for generating a response. Requires 2 LLM calls. * `hyde2` HyDE + RAG composite - Use the 'HyDE RAG' response to find relevant contexts from a collection for generating a response. Requires 3 LLM calls. * `rag+` Summary RAG - Like RAG, but uses more context and recursive summarization to overcome LLM context limits. Keeps all retrieved chunks, puts them in order, adds neighboring chunks, then uses the summary API to get the answer. Can require several LLM calls. * `all_data` All Data RAG - Like Summary RAG, but includes all document chunks. Uses recursive summarization to overcome LLM context limits. Can require several LLM calls. * `hyde_no_rag_llm_prompt_extension` (type=string, example=\\nKeep the answer brief, and list the 5 most relevant key words at the end.) - Add this prompt to every user's prompt when generating answers to be used for subsequent retrieval during HyDE. Only used when rag_type is `hyde1` or `hyde2`. * `num_neighbor_chunks_to_include` (type=integer, default=1) - The number of neighboring chunks to include for every retrieved relevant chunk. It helps to keep surrounding context together. Only enabled for rag_type `rag+`. * `meta_data_to_include` (type=map) - A map with flags that indicate whether each piece of document metadata is to be included as part of the context for a chat with a collection. * `name` (type: boolean, default=True) * `text` (type: boolean, default=True) * `page` (type: boolean, default=True) * `captions` (type: boolean, default=True) * `uri` (type: boolean, default=False) * `connector` (type: boolean, default=False) * `original_mtime` (type: boolean, default=False) * `age` (type: boolean, default=False) * `score` (type: boolean, default=False) * `rag_max_chunks` (type=integer, default=-1) - Maximum number of document chunks to retrieve for RAG. The actual number depends on rag_type and admin configuration. Set to >0 values to enable. Can be combined with rag_min_chunk_score. * `rag_min_chunk_score` (type=double, default=0.0) - Minimum score of document chunks to retrieve for RAG. Set to >0 values to enable. Can be combined with rag_max_chunks. (A RAG configuration sketch appears after the Example section below.) | [optional] |
include_chat_history | str | Whether to include chat history. Includes previous questions and answers for the current chat session with each new chat request. Disable if you require deterministic answers for a given question. | [optional] |
tags | List[str] | A list of tags from which to pull the context for RAG. | [optional] |
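
The `llm_args` map is passed along with the query to the selected LLM. The snippet below is a minimal sketch of building a request with sampling parameters and cost controls for automatic routing; it assumes the generated model accepts its properties as keyword arguments, and the specific values (question text, cost limits) are illustrative, not defaults.

```python
from h2ogpte.rest.models.chat_completion_request import ChatCompletionRequest

# Hypothetical example: ask a question with automatic model routing and
# explicit sampling / cost settings. Keys mirror the llm_args row above.
request = ChatCompletionRequest(
    message="Summarize the key risks mentioned in the filings.",
    llm="auto",  # let the server route to a model automatically
    llm_args={
        "temperature": 0.0,         # deterministic decoding
        "max_new_tokens": 1024,     # cap per map/reduce step
        "response_format": "text",  # or "json_object" / "json_code"
        "cost_controls": {
            "max_cost": 0.05,                     # fail early if a call would exceed this USD amount
            "max_cost_per_million_tokens": 30.0,  # only consider models cheaper than this
            "willingness_to_pay": 0.01,           # USD per +10% estimated accuracy
            "willingness_to_wait": 2.0,           # seconds per +10% estimated accuracy
        },
    },
)
print(request.to_json())
```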
## Example
```python
from h2ogpte.rest.models.chat_completion_request import ChatCompletionRequest

# A minimal JSON payload; "message" is the only required property
json = '{"message": "What does the document say?"}'

# create an instance of ChatCompletionRequest from a JSON string
chat_completion_request_instance = ChatCompletionRequest.from_json(json)

# print the JSON string representation of the object
print(chat_completion_request_instance.to_json())

# convert the object into a dict
chat_completion_request_dict = chat_completion_request_instance.to_dict()

# create an instance of ChatCompletionRequest from a dict
chat_completion_request_from_dict = ChatCompletionRequest.from_dict(chat_completion_request_dict)
```
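
Similarly, `rag_config` and `tags` control how document context is retrieved. The following is a sketch, not a definitive recipe: the keys come from the Properties table above, while the tag names, chunk limits, and score threshold are placeholder assumptions.

```python
from h2ogpte.rest.models.chat_completion_request import ChatCompletionRequest

# Hypothetical example: Summary RAG over tagged documents, with neighboring
# chunks and a minimum relevance score. Values are illustrative placeholders.
request = ChatCompletionRequest(
    message="List all contract termination clauses.",
    rag_config={
        "rag_type": "rag+",                   # Summary RAG with recursive summarization
        "num_neighbor_chunks_to_include": 1,  # keep surrounding context together
        "rag_max_chunks": 20,                 # cap retrieved chunks (>0 to enable)
        "rag_min_chunk_score": 0.3,           # drop weakly matching chunks (>0 to enable)
        "meta_data_to_include": {"name": True, "page": True, "uri": False},
    },
    tags=["contracts", "2024"],  # restrict retrieval to documents with these tags
)
print(request.to_json())
```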