CollectionSettings

Properties

| Name | Type | Description | Notes |
| --- | --- | --- | --- |
| max_tokens_per_chunk | int | Approximate maximum number of tokens per chunk for text-dominated document pages. For images, chunks can be larger. | [optional] |
| chunk_overlap_tokens | int | Approximate number of tokens that overlap between successive chunks. | [optional] |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses an LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses an LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. Defaults to "auto" language detection. Pass an empty string to see the choices. | [optional] [default to 'auto'] |
| ocr_model | str | Method used to extract text from images with AI-enabled optical character recognition (OCR) models. Pass an empty string to see the choices. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, and Tesseract covers a wide range of languages. Mississippi works well on handwriting. `auto` selects the best OCR model for every page. `off` disables OCR for speed, but all images are then skipped and no image captions are generated. | [optional] [default to 'auto'] |
| tesseract_lang | str | Language to use when `ocr_model="tesseract"`. Pass an empty string to see the choices. | [optional] |
| keep_tables_as_one_chunk | bool | When the table parser identifies a table, the table's tokens are kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page becomes one chunk. `keep_tables_as_one_chunk` is ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Specialized models are used if handwriting is found. | [optional] |
| follow_links | bool | Whether to import all web pages linked from this URL. External links are ignored. Links to other pages on the same domain are followed as long as they are at the same level as or below the URL you specify. Each page is converted into a PDF document. | [optional] |
| max_depth | int | Maximum recursion depth when following links; applies only when `follow_links` is `true`. A `max_depth` of 0 means don't follow any links, 1 means follow only top-level links, and so on. Use -1 for automatic (system settings). | [optional] |
| max_documents | int | Maximum number of documents to import when following links; applies only when `follow_links` is `true`. Use None for automatic (system defaults). Use -1 for the maximum (system limit). | [optional] |
| root_dir | str | Root directory for document storage. | [optional] |
| copy_document | bool | Whether to copy the document when importing an existing document. | [optional] |
| guardrails_settings | GuardrailsSettings |  | [optional] |
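Several of the properties above interact: `chunk_by_page` overrides `keep_tables_as_one_chunk`, and `max_depth` and `max_documents` apply only when `follow_links` is `true`. The following is a minimal sketch of those rules using a plain dict keyed by the property names. The helpers `build_collection_settings` and `describe_interactions` are hypothetical and not part of the h2ogpte package, and the numeric values are illustrative, not library defaults.

```python
import json


def build_collection_settings(**overrides):
    """Hypothetical helper: return a dict keyed by the property names above.

    The starting values are illustrative assumptions, not library defaults.
    """
    settings = {
        "max_tokens_per_chunk": 320,
        "chunk_overlap_tokens": 0,
        "ocr_model": "auto",
        "chunk_by_page": False,
        "keep_tables_as_one_chunk": True,
        "follow_links": False,
    }
    settings.update(overrides)
    return settings


def describe_interactions(settings):
    """Hypothetical helper: list the documented interactions that apply."""
    notes = []
    # chunk_by_page takes precedence over keep_tables_as_one_chunk.
    if settings.get("chunk_by_page") and settings.get("keep_tables_as_one_chunk"):
        notes.append(
            "keep_tables_as_one_chunk is ignored because chunk_by_page is true"
        )
    # max_depth and max_documents are only used when follow_links is true.
    if not settings.get("follow_links"):
        for name in ("max_depth", "max_documents"):
            if settings.get(name) is not None:
                notes.append(f"{name} has no effect because follow_links is not true")
    return notes


settings = build_collection_settings(chunk_by_page=True, max_depth=2)
notes = describe_interactions(settings)
payload = json.dumps(settings)  # JSON string with the property names as keys

print(payload)
for note in notes:
    print(note)
```

Checking these combinations before sending a payload can catch settings that would otherwise be silently ignored.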

Example

from h2ogpte.rest.models.collection_settings import CollectionSettings

# TODO update the JSON string below
json = "{}"
# create an instance of CollectionSettings from a JSON string
collection_settings_instance = CollectionSettings.from_json(json)
# print the JSON string representation of the object
print(collection_settings_instance.to_json())

# convert the object into a dict
collection_settings_dict = collection_settings_instance.to_dict()
# create an instance of CollectionSettings from a dict
collection_settings_from_dict = CollectionSettings.from_dict(collection_settings_dict)
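
As a variation on the example above, the empty `"{}"` placeholder can be replaced with a populated JSON string. The sketch below assumes the JSON keys match the property names in the table; the values are illustrative only. The import is guarded so the snippet still runs where the h2ogpte package is not installed.

```python
import json

# Illustrative values only; keys are assumed to match the documented property names.
settings_json = json.dumps(
    {
        "max_tokens_per_chunk": 320,
        "chunk_overlap_tokens": 0,
        "gen_doc_summaries": True,
        "ocr_model": "auto",
    }
)

try:
    from h2ogpte.rest.models.collection_settings import CollectionSettings
except ImportError:
    # Package not available: fall back to inspecting the raw JSON only.
    CollectionSettings = None

if CollectionSettings is not None:
    # Same from_json / to_dict calls as in the example above.
    instance = CollectionSettings.from_json(settings_json)
    print(instance.to_dict())
else:
    print(json.loads(settings_json))
```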