CollectionSettings
Properties
Name | Type | Description | Notes |
---|---|---|---|
max_tokens_per_chunk | int | Approximate maximum number of tokens per chunk for text-dominated document pages. Chunks for images can be larger. | [optional] |
chunk_overlap_tokens | int | Approximate number of tokens that overlap between successive chunks. | [optional] |
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses an LLM). | [optional] [default to False] |
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses an LLM). | [optional] [default to False] |
audio_input_language | str | Language of audio files. Defaults to "auto" (automatic language detection). Pass an empty string to see the choices. | [optional] [default to 'auto'] |
ocr_model | str | Which AI-enabled optical character recognition (OCR) model to use for extracting text from images. Pass an empty string to see the choices. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, and Tesseract covers a wide range of languages. Mississippi works well on handwriting. `auto` selects the best OCR model for every page. `off` disables OCR for speed, but all images are then skipped (and no image captions are generated). | [optional] [default to 'auto'] |
tesseract_lang | str | Which language to use when `ocr_model="tesseract"`. Pass an empty string to see the choices. | [optional] |
keep_tables_as_one_chunk | bool | When the table parser identifies a table, the table's tokens are kept in a single chunk. | [optional] |
chunk_by_page | bool | Each page becomes one chunk. `keep_tables_as_one_chunk` is ignored if this is `true`. | [optional] |
handwriting_check | bool | Check pages for handwriting. Specialized models are used if handwriting is found. | [optional] |
follow_links | bool | Whether to import all web pages linked from this URL. External links are ignored. Links to other pages on the same domain are followed as long as they are at the same level as or below the URL you specify. Each page is transformed into a PDF document. | [optional] |
max_depth | int | Maximum recursion depth when following links; applies only when `follow_links` is `true`. A `max_depth` of 0 means don't follow any links, 1 means follow only top-level links, and so on. Use -1 for automatic (system settings). | [optional] |
max_documents | int | Maximum number of documents when following links; applies only when `follow_links` is `true`. Use None for automatic (system defaults). Use -1 for the maximum (system limit). | [optional] |
root_dir | str | Root directory for document storage. | [optional] |
copy_document | bool | Whether to copy the document when importing an existing document. | [optional] |
guardrails_settings |  |  | [optional] |
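
To make the relationship between `max_tokens_per_chunk` and `chunk_overlap_tokens` concrete, here is a minimal sliding-window sketch. It is an illustration only: the function `chunk_tokens` is hypothetical and is not the chunking logic the service actually uses, which the table describes only as approximate. Integers stand in for tokens.

```python
def chunk_tokens(tokens, max_tokens_per_chunk, chunk_overlap_tokens):
    """Split a token list into windows that share `chunk_overlap_tokens` tokens.

    Hypothetical helper for illustration; not part of h2ogpte.
    """
    if max_tokens_per_chunk <= 0:
        raise ValueError("max_tokens_per_chunk must be positive")
    if not 0 <= chunk_overlap_tokens < max_tokens_per_chunk:
        raise ValueError("chunk_overlap_tokens must be >= 0 and smaller than the chunk size")

    # Each new chunk starts this many tokens after the previous one.
    step = max_tokens_per_chunk - chunk_overlap_tokens
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens_per_chunk])
        # Stop once the current window has reached the end of the input.
        if start + max_tokens_per_chunk >= len(tokens):
            break
        start += step
    return chunks


# Ten stand-in tokens, chunks of at most 4 tokens, 1 token shared between neighbors.
example_chunks = chunk_tokens(list(range(10)), max_tokens_per_chunk=4, chunk_overlap_tokens=1)
print(example_chunks)
```

With these illustrative values, the last token of each chunk reappears as the first token of the next one, which is what an overlap of 1 means in this sketch.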
Example
from h2ogpte.rest.models.collection_settings import CollectionSettings
# TODO update the JSON string below
json = "{}"
# create an instance of CollectionSettings from a JSON string
collection_settings_instance = CollectionSettings.from_json(json)
# print the JSON string representation of the object
print(collection_settings_instance.to_json())
# convert the object into a dict
collection_settings_dict = collection_settings_instance.to_dict()
# create an instance of CollectionSettings from a dict
collection_settings_from_dict = CollectionSettings.from_dict(collection_settings_dict)
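
The `"{}"` placeholder in the example above has to be replaced with a real JSON string. The following sketch shows one way to build such a string with the standard library only. The helper `build_settings_json` and the chosen values are assumptions made for illustration; only the field names come from the Properties table.

```python
import json

# Field names copied from the Properties table; nothing else is assumed about the model.
KNOWN_FIELDS = {
    "max_tokens_per_chunk", "chunk_overlap_tokens", "gen_doc_summaries",
    "gen_doc_questions", "audio_input_language", "ocr_model", "tesseract_lang",
    "keep_tables_as_one_chunk", "chunk_by_page", "handwriting_check",
    "follow_links", "max_depth", "max_documents", "root_dir", "copy_document",
    "guardrails_settings",
}


def build_settings_json(**settings):
    """Serialize settings to JSON, rejecting names that are not in the table.

    Hypothetical helper for illustration; not part of h2ogpte.
    """
    unknown = set(settings) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"Unknown CollectionSettings fields: {sorted(unknown)}")
    return json.dumps(settings)


# Illustrative values only; pick values that suit your documents.
settings_json = build_settings_json(
    max_tokens_per_chunk=320,
    chunk_overlap_tokens=0,
    ocr_model="auto",
    follow_links=True,
    max_depth=1,
)
print(settings_json)
```

The resulting string can take the place of `"{}"` in the call to `CollectionSettings.from_json` shown in the example. Checking the names up front catches typos early, independently of how the model itself handles unexpected keys.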