DocumentIngestionApi

| Method | HTTP request | Description |
|--------|--------------|-------------|
| create_ingest_agent_only_to_standard_job | POST /ingest/agent-only-to-standard/job | Creates a job to parse files uploaded in "agent_only" ingest mode. |
| create_ingest_from_azure_blob_storage_job | POST /ingest/azure-blob-storage/job | Creates a job to add files from Azure Blob Storage into a collection. |
| create_ingest_from_file_system_job | POST /ingest/file-system/job | Creates a job to add files from the local file system into a collection. |
| create_ingest_from_gcs_job | POST /ingest/gcs/job | Creates a job to add files from Google Cloud Storage into a collection. |
| create_ingest_from_plain_text_job | POST /ingest/plain-text/job | Creates a job to add plain text to a collection. |
| create_ingest_from_s3_job | POST /ingest/s3/job | Creates a job to add files from AWS S3 storage into a collection. |
| create_ingest_from_website_job | POST /ingest/website/job | Creates a job to crawl and ingest a URL into a collection. |
| create_ingest_upload_job | POST /uploads/{upload-ids}/ingest/job | Creates a job to ingest uploaded documents. |
| ingest_agent_only_to_standard | POST /ingest/agent-only-to-standard | Converts files uploaded in "agent_only" ingest mode to PDF and parses them. |
| ingest_from_azure_blob_storage | POST /ingest/azure-blob-storage | Adds files from Azure Blob Storage into a collection. |
| ingest_from_file_system | POST /ingest/file-system | Adds files from the local file system into a collection. |
| ingest_from_gcs | POST /ingest/gcs | Adds files from Google Cloud Storage into a collection. |
| ingest_from_plain_text | POST /ingest/plain-text | Adds plain text to a collection. |
| ingest_from_s3 | POST /ingest/s3 | Adds files from AWS S3 storage into a collection. |
| ingest_from_website | POST /ingest/website | Crawls and ingests a URL into a collection. |
| ingest_upload | POST /uploads/{upload-ids}/ingest | Ingests uploaded documents into a collection. |
| upload_file | PUT /uploads | |

create_ingest_agent_only_to_standard_job

JobDetails create_ingest_agent_only_to_standard_job(collection_id, document_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, restricted=restricted, permissions=permissions, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)

Creates a job to parse files uploaded in "agent_only" ingest mode.

Example

  • Bearer Authentication (bearerAuth):

import os

import h2ogpte.rest_async
from h2ogpte.rest_async.models.job_details import JobDetails
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    document_id = 'document_id_example' # str | String id of the document to be parsed.
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    restricted = False # bool | Whether the document should be restricted only to certain users. (optional) (default to False)
    permissions = ['permissions_example'] # List[str] | The list of usernames having permissions to the document. (optional)
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Creates a job to parse files uploaded in \"agent_only\" ingest mode.
        api_response = await api_instance.create_ingest_agent_only_to_standard_job(collection_id, document_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, restricted=restricted, permissions=permissions, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)
        print("The response of DocumentIngestionApi->create_ingest_agent_only_to_standard_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_agent_only_to_standard_job: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| document_id | str | String id of the document to be parsed. | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages, and Mississippi works well on handwriting. `auto` auto-selects the best OCR model for every page; `off` disables OCR for speed, but all images will then be skipped (and no image captions will be made). | [optional] [default to 'auto'] |
| restricted | bool | Whether the document should be restricted only to certain users. | [optional] [default to False] |
| permissions | List[str] | The list of usernames having permissions to the document. | [optional] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| timeout | float | Timeout in seconds. | [optional] |

Return type

JobDetails

HTTP request headers

  • Content-Type: Not defined

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|-------------------|
| 201 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |
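
All examples in this reference use `await` inside a top-level `async with` block for brevity. A minimal sketch of driving the same call from a script with `asyncio.run`, reusing the configuration shown above (the collection and document ids are placeholders):

import asyncio
import os

import h2ogpte.rest_async


async def main():
    configuration = h2ogpte.rest_async.Configuration(
        host="https://h2ogpte.genai.h2o.ai/api/v1",
        access_token=os.environ["BEARER_TOKEN"],
    )
    async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
        api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
        # Placeholder ids; replace with a real collection and document.
        job = await api_instance.create_ingest_agent_only_to_standard_job(
            'collection_id_example', 'document_id_example'
        )
        print(job)


if __name__ == "__main__":
    asyncio.run(main())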

create_ingest_from_azure_blob_storage_job

JobDetails create_ingest_from_azure_blob_storage_job(collection_id, ingest_from_azure_blob_storage_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)

Creates a job to add files from Azure Blob Storage into a collection.

Example

  • Bearer Authentication (bearerAuth):

import os

import h2ogpte.rest_async
from h2ogpte.rest_async.models.ingest_from_azure_blob_storage_body import IngestFromAzureBlobStorageBody
from h2ogpte.rest_async.models.job_details import JobDetails
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_azure_blob_storage_body = h2ogpte.rest_async.IngestFromAzureBlobStorageBody() # IngestFromAzureBlobStorageBody | 
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_mode = 'ingest_mode_example' # str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Creates a job to add files from the Azure Blob Storage into a collection.
        api_response = await api_instance.create_ingest_from_azure_blob_storage_job(collection_id, ingest_from_azure_blob_storage_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)
        print("The response of DocumentIngestionApi->create_ingest_from_azure_blob_storage_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_from_azure_blob_storage_job: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| ingest_from_azure_blob_storage_body | IngestFromAzureBlobStorageBody | | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages, and Mississippi works well on handwriting. `auto` auto-selects the best OCR model for every page; `off` disables OCR for speed, but all images will then be skipped (and no image captions will be made). | [optional] [default to 'auto'] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| ingest_mode | str | Ingest mode to use. `standard` ingests files for use with RAG; `agent_only` bypasses standard ingestion, so files can only be used with agents. | [optional] |
| timeout | float | Timeout in seconds. | [optional] |

Return type

JobDetails

HTTP request headers

  • Content-Type: application/json

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|-------------------|
| 201 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |
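
The examples catch a generic Exception, but when the server rejects a request (for example the 401 above) the client raises ApiException, which carries the HTTP status and response body. A short sketch of inspecting it inside the `async with` block from the example above; the attribute names follow the generated rest module, so verify them against your installed client version:

try:
    api_response = await api_instance.create_ingest_from_azure_blob_storage_job(
        collection_id, ingest_from_azure_blob_storage_body
    )
except ApiException as e:
    # status, reason, and body are populated from the HTTP response.
    print(f"Request failed: {e.status} {e.reason}")
    print(e.body)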

create_ingest_from_file_system_job

JobDetails create_ingest_from_file_system_job(collection_id, ingest_from_file_system_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)

Creates a job to add files from the local file system into a collection.

Example

  • Bearer Authentication (bearerAuth):

import os

import h2ogpte.rest_async
from h2ogpte.rest_async.models.ingest_from_file_system_body import IngestFromFileSystemBody
from h2ogpte.rest_async.models.job_details import JobDetails
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_file_system_body = h2ogpte.rest_async.IngestFromFileSystemBody() # IngestFromFileSystemBody | 
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_mode = 'ingest_mode_example' # str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Creates a job to add files from the local system into a collection.
        api_response = await api_instance.create_ingest_from_file_system_job(collection_id, ingest_from_file_system_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)
        print("The response of DocumentIngestionApi->create_ingest_from_file_system_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_from_file_system_job: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| ingest_from_file_system_body | IngestFromFileSystemBody | | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages, and Mississippi works well on handwriting. `auto` auto-selects the best OCR model for every page; `off` disables OCR for speed, but all images will then be skipped (and no image captions will be made). | [optional] [default to 'auto'] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| ingest_mode | str | Ingest mode to use. `standard` ingests files for use with RAG; `agent_only` bypasses standard ingestion, so files can only be used with agents. | [optional] |
| timeout | float | Timeout in seconds. | [optional] |

Return type

JobDetails

HTTP request headers

  • Content-Type: application/json

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|-------------------|
| 201 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |

create_ingest_from_gcs_job

JobDetails create_ingest_from_gcs_job(collection_id, ingest_from_gcs_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)

Creates a job to add files from Google Cloud Storage into a collection.

Example

  • Bearer Authentication (bearerAuth):

import os

import h2ogpte.rest_async
from h2ogpte.rest_async.models.ingest_from_gcs_body import IngestFromGcsBody
from h2ogpte.rest_async.models.job_details import JobDetails
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_gcs_body = h2ogpte.rest_async.IngestFromGcsBody() # IngestFromGcsBody | 
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_mode = 'ingest_mode_example' # str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Creates a job to add files from the Google Cloud Storage into a collection.
        api_response = await api_instance.create_ingest_from_gcs_job(collection_id, ingest_from_gcs_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)
        print("The response of DocumentIngestionApi->create_ingest_from_gcs_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_from_gcs_job: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| ingest_from_gcs_body | IngestFromGcsBody | | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages, and Mississippi works well on handwriting. `auto` auto-selects the best OCR model for every page; `off` disables OCR for speed, but all images will then be skipped (and no image captions will be made). | [optional] [default to 'auto'] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| ingest_mode | str | Ingest mode to use. `standard` ingests files for use with RAG; `agent_only` bypasses standard ingestion, so files can only be used with agents. | [optional] |
| timeout | float | Timeout in seconds. | [optional] |

Return type

JobDetails

HTTP request headers

  • Content-Type: application/json

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|-------------------|
| 201 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |
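
As the parameter table above notes, `keep_tables_as_one_chunk` is ignored when `chunk_by_page` is `true`, and `ocr_model='off'` skips images entirely. A sketch of two contrasting configurations for the same GCS ingest job, reusing the `collection_id` and `ingest_from_gcs_body` from the example above:

# Fast path: no OCR, page-level chunks (chunk_by_page takes precedence,
# so keep_tables_as_one_chunk would be ignored here).
fast_job = await api_instance.create_ingest_from_gcs_job(
    collection_id,
    ingest_from_gcs_body,
    ocr_model='off',
    chunk_by_page=True,
)

# Thorough path: automatic OCR model selection, handwriting detection,
# and tables kept together in a single chunk.
thorough_job = await api_instance.create_ingest_from_gcs_job(
    collection_id,
    ingest_from_gcs_body,
    ocr_model='auto',
    handwriting_check=True,
    keep_tables_as_one_chunk=True,
)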

create_ingest_from_plain_text_job

JobDetails create_ingest_from_plain_text_job(collection_id, file_name, body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, metadata=metadata, timeout=timeout)

Creates a job to add plain text to a collection.

Example

  • Bearer Authentication (bearerAuth):

import os

import h2ogpte.rest_async
from h2ogpte.rest_async.models.job_details import JobDetails
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    file_name = 'file_name_example' # str | String of the file name to use for the document.
    body = 'body_example' # str | The text that will be ingested into a collection.
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    metadata = 'metadata_example' # str | String with json-encoded metadata for the document. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Creates a job to add plain text to a collection.
        api_response = await api_instance.create_ingest_from_plain_text_job(collection_id, file_name, body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, metadata=metadata, timeout=timeout)
        print("The response of DocumentIngestionApi->create_ingest_from_plain_text_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_from_plain_text_job: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| file_name | str | String of the file name to use for the document. | |
| body | str | The text that will be ingested into a collection. | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| metadata | str | String with json-encoded metadata for the document. | [optional] |
| timeout | float | Timeout in seconds. | [optional] |

Return type

JobDetails

HTTP request headers

  • Content-Type: text/plain

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|-------------------|
| 201 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |
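
Because this endpoint takes the text itself as the request body, a common pattern is to read a local file and pass its contents as `body`, with `metadata` supplied as a JSON-encoded string. A sketch to place inside the `async with` block from the example above (the file name and metadata values are illustrative):

import json
from pathlib import Path

text = Path("notes.txt").read_text(encoding="utf-8")  # illustrative local file
job = await api_instance.create_ingest_from_plain_text_job(
    collection_id,
    "notes.txt",                                # file_name used for the stored document
    text,                                       # the plain text to ingest
    metadata=json.dumps({"source": "notes"}),   # json-encoded metadata string
)
pprint(job)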

create_ingest_from_s3_job

JobDetails create_ingest_from_s3_job(collection_id, ingest_from_s3_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)

Creates a job to add files from AWS S3 storage into a collection.

Example

  • Bearer Authentication (bearerAuth):

import os

import h2ogpte.rest_async
from h2ogpte.rest_async.models.ingest_from_s3_body import IngestFromS3Body
from h2ogpte.rest_async.models.job_details import JobDetails
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_s3_body = h2ogpte.rest_async.IngestFromS3Body() # IngestFromS3Body | 
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_mode = 'ingest_mode_example' # str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Creates a job to add files from the AWS S3 storage into a collection.
        api_response = await api_instance.create_ingest_from_s3_job(collection_id, ingest_from_s3_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)
        print("The response of DocumentIngestionApi->create_ingest_from_s3_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_from_s3_job: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| ingest_from_s3_body | IngestFromS3Body | | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages, and Mississippi works well on handwriting. `auto` auto-selects the best OCR model for every page; `off` disables OCR for speed, but all images will then be skipped (and no image captions will be made). | [optional] [default to 'auto'] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| ingest_mode | str | Ingest mode to use. `standard` ingests files for use with RAG; `agent_only` bypasses standard ingestion, so files can only be used with agents. | [optional] |
| timeout | float | Timeout in seconds. | [optional] |

Return type

JobDetails

HTTP request headers

  • Content-Type: application/json

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|-------------------|
| 201 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |

create_ingest_from_website_job

JobDetails create_ingest_from_website_job(collection_id, ingest_from_website_body, follow_links=follow_links, max_depth=max_depth, max_documents=max_documents, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)

Creates a job to crawl and ingest a URL into a collection. The web page or document linked from this URL will be imported.

Example

  • Bearer Authentication (bearerAuth):

import os

import h2ogpte.rest_async
from h2ogpte.rest_async.models.ingest_from_website_body import IngestFromWebsiteBody
from h2ogpte.rest_async.models.job_details import JobDetails
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_website_body = h2ogpte.rest_async.IngestFromWebsiteBody() # IngestFromWebsiteBody | 
    follow_links = False # bool | Whether to import all web pages linked from this URL. External links will be ignored. Links to other pages on the same domain will be followed as long as they are at the same level or below the URL you specify. Each page will be transformed into a PDF document. (optional) (default to False)
    max_depth = -1 # int | Max depth of recursion when following links, only when follow_links is `true`. Max_depth of 0 means don't follow any links, max_depth of 1 means follow only top-level links, etc. Use -1 for automatic (system settings). (optional) (default to -1)
    max_documents = 56 # int | Max number of documents when following links, only when follow_links is `true`. Use None for automatic (system defaults). Use -1 for max (system limit). (optional)
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_mode = 'ingest_mode_example' # str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Creates a job to crawl and ingest a URL into a collection.
        api_response = await api_instance.create_ingest_from_website_job(collection_id, ingest_from_website_body, follow_links=follow_links, max_depth=max_depth, max_documents=max_documents, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)
        print("The response of DocumentIngestionApi->create_ingest_from_website_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_from_website_job: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| ingest_from_website_body | IngestFromWebsiteBody | | |
| follow_links | bool | Whether to import all web pages linked from this URL. External links will be ignored. Links to other pages on the same domain will be followed as long as they are at the same level or below the URL you specify. Each page will be transformed into a PDF document. | [optional] [default to False] |
| max_depth | int | Max depth of recursion when following links, only when follow_links is `true`. A max_depth of 0 means don't follow any links, a max_depth of 1 means follow only top-level links, etc. Use -1 for automatic (system settings). | [optional] [default to -1] |
| max_documents | int | Max number of documents when following links, only when follow_links is `true`. Use None for automatic (system defaults). Use -1 for max (system limit). | [optional] |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages, and Mississippi works well on handwriting. `auto` auto-selects the best OCR model for every page; `off` disables OCR for speed, but all images will then be skipped (and no image captions will be made). | [optional] [default to 'auto'] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| ingest_mode | str | Ingest mode to use. `standard` ingests files for use with RAG; `agent_only` bypasses standard ingestion, so files can only be used with agents. | [optional] |
| timeout | float | Timeout in seconds. | [optional] |

Return type

JobDetails

HTTP request headers

  • Content-Type: application/json

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|-------------------|
| 201 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |
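
A sketch of a shallow crawl that follows same-domain links one level deep and caps the number of imported documents, to place inside the `async with` block from the example above. The `url` field on IngestFromWebsiteBody is an assumption for illustration; check the IngestFromWebsiteBody model for the exact field names.

ingest_from_website_body = h2ogpte.rest_async.IngestFromWebsiteBody(
    url="https://docs.h2o.ai"  # assumed field name; see the model definition
)
job = await api_instance.create_ingest_from_website_job(
    collection_id,
    ingest_from_website_body,
    follow_links=True,   # follow same-domain links at or below this URL
    max_depth=1,         # only top-level links
    max_documents=10,    # stop after at most 10 documents
)
pprint(job)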

create_ingest_upload_job

JobDetails create_ingest_upload_job(upload_ids, collection_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, restricted=restricted, permissions=permissions, timeout=timeout, ingest_upload_body=ingest_upload_body)

Creates a job to ingest previously uploaded documents, identified by their upload ids, into a given collection.

Example

  • Bearer Authentication (bearerAuth):

import os

import h2ogpte.rest_async
from h2ogpte.rest_async.models.ingest_upload_body import IngestUploadBody
from h2ogpte.rest_async.models.job_details import JobDetails
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    upload_ids = ['upload_ids_example'] # List[str] | Ids of the uploaded documents
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_mode = 'ingest_mode_example' # str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. (optional)
    restricted = False # bool | Whether the document should be restricted only to certain users. (optional) (default to False)
    permissions = ['permissions_example'] # List[str] | The list of usernames having permissions to the document. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)
    ingest_upload_body = h2ogpte.rest_async.IngestUploadBody() # IngestUploadBody |  (optional)

    try:
        # Creates a job to ingest uploaded document
        api_response = await api_instance.create_ingest_upload_job(upload_ids, collection_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, restricted=restricted, permissions=permissions, timeout=timeout, ingest_upload_body=ingest_upload_body)
        print("The response of DocumentIngestionApi->create_ingest_upload_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_upload_job: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| upload_ids | List[str] | Ids of the uploaded documents. | |
| collection_id | str | String id of the collection to add the ingested documents into. | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages, and Mississippi works well on handwriting. `auto` auto-selects the best OCR model for every page; `off` disables OCR for speed, but all images will then be skipped (and no image captions will be made). | [optional] [default to 'auto'] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| ingest_mode | str | Ingest mode to use. `standard` ingests files for use with RAG; `agent_only` bypasses standard ingestion, so files can only be used with agents. | [optional] |
| restricted | bool | Whether the document should be restricted only to certain users. | [optional] [default to False] |
| permissions | List[str] | The list of usernames having permissions to the document. | [optional] |
| timeout | float | Timeout in seconds. | [optional] |
| ingest_upload_body | IngestUploadBody | | [optional] |

Return type

JobDetails

HTTP request headers

  • Content-Type: application/json

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|-------------------|
| 201 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |
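
This operation pairs with upload_file (PUT /uploads, listed in the table at the top of this page): files are uploaded first, and the returned upload ids are then passed to create_ingest_upload_job. The sketch below assumes upload_file accepts a file argument and returns an object exposing an id; the exact signature and return model are documented under upload_file, so treat those two names as placeholders.

# Step 1: upload the file (argument name and return shape are assumptions).
upload = await api_instance.upload_file(file="report.pdf")

# Step 2: ingest the uploaded document into a collection as a background job.
job = await api_instance.create_ingest_upload_job(
    [upload.id],          # upload_ids returned by the upload step
    collection_id,
    gen_doc_summaries=True,
)
pprint(job)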

ingest_agent_only_to_standard

ingest_agent_only_to_standard(collection_id, document_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, restricted=restricted, permissions=permissions, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)

Converts files uploaded in "agent_only" ingest mode to PDF and parses them.

Example

  • Bearer Authentication (bearerAuth):

import os

import h2ogpte.rest_async
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    document_id = 'document_id_example' # str | String id of the document to be parsed.
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    restricted = False # bool | Whether the document should be restricted only to certain users. (optional) (default to False)
    permissions = ['permissions_example'] # List[str] | The list of usernames having permissions to the document. (optional)
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Converts files uploaded in \"agent_only\" ingest mode  to PDF and parses them.
        await api_instance.ingest_agent_only_to_standard(collection_id, document_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, restricted=restricted, permissions=permissions, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_agent_only_to_standard: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| document_id | str | String id of the document to be parsed. | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto'] |
| restricted | bool | Whether the document should be restricted only to certain users. | [optional] [default to False] |
| permissions | List[str] | The list of usernames having permissions to the document. | [optional] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| timeout | float | Timeout in seconds | [optional] |

Return type

void (empty response body)

HTTP request headers

  • Content-Type: Not defined

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|------------------|
| 204 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |
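A minimal variant of the example above, run inside the same `async with` block: only the two required arguments are passed and every optional parsing setting falls back to its documented default. The ids are placeholders.

    # Minimal call: rely on the server-side defaults for OCR, chunking, etc.
    await api_instance.ingest_agent_only_to_standard(
        collection_id="my-collection-id",  # placeholder collection id
        document_id="my-document-id",      # placeholder id of the agent-only document
    )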

ingest_from_azure_blob_storage

ingest_from_azure_blob_storage(collection_id, ingest_from_azure_blob_storage_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)

Adds files from Azure Blob Storage into a collection.

Example

  • Bearer Authentication (bearerAuth):

import os
import h2ogpte.rest_async
from h2ogpte.rest_async.models.ingest_from_azure_blob_storage_body import IngestFromAzureBlobStorageBody
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_azure_blob_storage_body = h2ogpte.rest_async.IngestFromAzureBlobStorageBody() # IngestFromAzureBlobStorageBody | 
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_mode = 'ingest_mode_example' # str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Adds files from the Azure Blob Storage into a collection.
        await api_instance.ingest_from_azure_blob_storage(collection_id, ingest_from_azure_blob_storage_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_from_azure_blob_storage: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| ingest_from_azure_blob_storage_body | IngestFromAzureBlobStorageBody | | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto'] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| ingest_mode | str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. | [optional] |
| timeout | float | Timeout in seconds | [optional] |

Return type

void (empty response body)

HTTP request headers

  • Content-Type: application/json

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|------------------|
| 204 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |

ingest_from_file_system

ingest_from_file_system(collection_id, ingest_from_file_system_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)

Adds files from the local system into a collection.

Example

  • Bearer Authentication (bearerAuth):

import os
import h2ogpte.rest_async
from h2ogpte.rest_async.models.ingest_from_file_system_body import IngestFromFileSystemBody
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_file_system_body = h2ogpte.rest_async.IngestFromFileSystemBody() # IngestFromFileSystemBody | 
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_mode = 'ingest_mode_example' # str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Adds files from the local system into a collection.
        await api_instance.ingest_from_file_system(collection_id, ingest_from_file_system_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_from_file_system: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| ingest_from_file_system_body | IngestFromFileSystemBody | | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto'] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| ingest_mode | str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. | [optional] |
| timeout | float | Timeout in seconds | [optional] |

Return type

void (empty response body)

HTTP request headers

  • Content-Type: application/json

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|------------------|
| 204 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |
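If the ingested documents are scans in a known language, the OCR behaviour can be pinned instead of leaving it on `auto`. A sketch, assuming it runs inside the `async with` block from the example above and that the `IngestFromFileSystemBody` is populated with the paths to ingest (see that model for its exact fields); `deu` is the Tesseract language code for German.

    # Force Tesseract OCR with a fixed language instead of automatic model selection.
    body = h2ogpte.rest_async.IngestFromFileSystemBody()  # fill in the paths to ingest (see the model)
    await api_instance.ingest_from_file_system(
        collection_id,
        body,
        ocr_model="tesseract",  # use Tesseract for text extraction from images
        tesseract_lang="deu",   # Tesseract language code (German here, as an example)
    )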

ingest_from_gcs

ingest_from_gcs(collection_id, ingest_from_gcs_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)

Adds files from Google Cloud Storage into a collection.

Example

  • Bearer Authentication (bearerAuth):

import os
import h2ogpte.rest_async
from h2ogpte.rest_async.models.ingest_from_gcs_body import IngestFromGcsBody
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_gcs_body = h2ogpte.rest_async.IngestFromGcsBody() # IngestFromGcsBody | 
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_mode = 'ingest_mode_example' # str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Adds files from the Google Cloud Storage into a collection.
        await api_instance.ingest_from_gcs(collection_id, ingest_from_gcs_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_from_gcs: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| ingest_from_gcs_body | IngestFromGcsBody | | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto'] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| ingest_mode | str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. | [optional] |
| timeout | float | Timeout in seconds | [optional] |

Return type

void (empty response body)

HTTP request headers

  • Content-Type: application/json

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|------------------|
| 204 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |

ingest_from_plain_text

ingest_from_plain_text(collection_id, file_name, body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, timeout=timeout, metadata=metadata)

Adds plain text to a collection.

Example

  • Bearer Authentication (bearerAuth):

import os
import h2ogpte.rest_async
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    file_name = 'file_name_example' # str | String of the file name to use for the document.
    body = 'body_example' # str | The text that will be ingested into a collection.
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    timeout = 3.4 # float | Timeout in seconds (optional)
    metadata = 'metadata_example' # str | String with json-encoded metadata for the document. (optional)

    try:
        # Adds plain text to a collection.
        await api_instance.ingest_from_plain_text(collection_id, file_name, body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, timeout=timeout, metadata=metadata)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_from_plain_text: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| file_name | str | String of the file name to use for the document. | |
| body | str | The text that will be ingested into a collection. | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| timeout | float | Timeout in seconds | [optional] |
| metadata | str | String with json-encoded metadata for the document. | [optional] |

Return type

void (empty response body)

HTTP request headers

  • Content-Type: text/plain

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|------------------|
| 204 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |
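Because `metadata` must be a JSON-encoded string rather than a dict, it is usually easiest to build it with `json.dumps`. A sketch, assuming it runs inside the `async with` block from the example above; the file name, text, and metadata fields are arbitrary examples.

    import json

    # Attach JSON-encoded metadata to the ingested text document.
    await api_instance.ingest_from_plain_text(
        collection_id,
        "notes.txt",                     # file name to store the text under
        "Plain text content to index.",  # the document body
        metadata=json.dumps({"source": "crm", "author": "jane.doe"}),  # example fields
    )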

ingest_from_s3

ingest_from_s3(collection_id, ingest_from_s3_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)

Adds files from AWS S3 storage into a collection.

Example

  • Bearer Authentication (bearerAuth):

import os
import h2ogpte.rest_async
from h2ogpte.rest_async.models.ingest_from_s3_body import IngestFromS3Body
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_s3_body = h2ogpte.rest_async.IngestFromS3Body() # IngestFromS3Body | 
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_mode = 'ingest_mode_example' # str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Adds files from the AWS S3 storage into a collection.
        await api_instance.ingest_from_s3(collection_id, ingest_from_s3_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_from_s3: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| ingest_from_s3_body | IngestFromS3Body | | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto'] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| ingest_mode | str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. | [optional] |
| timeout | float | Timeout in seconds | [optional] |

Return type

void (empty response body)

HTTP request headers

  • Content-Type: application/json

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|------------------|
| 204 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |
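When the files are only meant to be used by agents rather than for RAG retrieval, standard parsing can be bypassed with `ingest_mode`. A sketch, assuming it runs inside the `async with` block from the example above and that the `IngestFromS3Body` is populated with the S3 locations to ingest (see that model for its exact fields).

    # Skip standard RAG ingestion; the files will only be usable by agents.
    body = h2ogpte.rest_async.IngestFromS3Body()  # fill in the S3 locations/credentials (see the model)
    await api_instance.ingest_from_s3(
        collection_id,
        body,
        ingest_mode="agent_only",  # `standard` is the normal RAG ingestion path
    )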

ingest_from_website

ingest_from_website(collection_id, ingest_from_website_body, follow_links=follow_links, max_depth=max_depth, max_documents=max_documents, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)

Crawls and ingests a URL into a collection. The web page or document linked from this URL will be imported.

Example

  • Bearer Authentication (bearerAuth):

import os
import h2ogpte.rest_async
from h2ogpte.rest_async.models.ingest_from_website_body import IngestFromWebsiteBody
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_website_body = h2ogpte.rest_async.IngestFromWebsiteBody() # IngestFromWebsiteBody | 
    follow_links = False # bool | Whether all web pages linked from this URL will be imported. External links will be ignored. Links to other pages on the same domain will be followed as long as they are at the same level or below the URL you specify. Each page will be transformed into a PDF document. (optional) (default to False)
    max_depth = -1 # int | Max depth of recursion when following links, only when follow_links is `true`. Max_depth of 0 means don't follow any links, max_depth of 1 means follow only top-level links, etc. Use -1 for automatic (system settings). (optional) (default to -1)
    max_documents = 56 # int | Max number of documents when following links, only when follow_links is `true`. Use None for automatic (system defaults). Use -1 for max (system limit). (optional)
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_mode = 'ingest_mode_example' # str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)

    try:
        # Crawls and ingest a URL into a collection.
        await api_instance.ingest_from_website(collection_id, ingest_from_website_body, follow_links=follow_links, max_depth=max_depth, max_documents=max_documents, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, timeout=timeout)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_from_website: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| collection_id | str | String id of the collection to add the ingested documents into. | |
| ingest_from_website_body | IngestFromWebsiteBody | | |
| follow_links | bool | Whether all web pages linked from this URL will be imported. External links will be ignored. Links to other pages on the same domain will be followed as long as they are at the same level or below the URL you specify. Each page will be transformed into a PDF document. | [optional] [default to False] |
| max_depth | int | Max depth of recursion when following links, only when follow_links is `true`. Max_depth of 0 means don't follow any links, max_depth of 1 means follow only top-level links, etc. Use -1 for automatic (system settings). | [optional] [default to -1] |
| max_documents | int | Max number of documents when following links, only when follow_links is `true`. Use None for automatic (system defaults). Use -1 for max (system limit). | [optional] |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto'] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| ingest_mode | str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. | [optional] |
| timeout | float | Timeout in seconds | [optional] |

Return type

void (empty response body)

HTTP request headers

  • Content-Type: application/json

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|------------------|
| 204 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |
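To crawl beyond the single page, enable `follow_links` and bound the crawl with `max_depth` and `max_documents`. A sketch, assuming it runs inside the `async with` block from the example above and that the `IngestFromWebsiteBody` carries the target URL (check that model for the exact field name).

    # Crawl one level of same-domain links below the starting URL, capped at 25 documents.
    body = h2ogpte.rest_async.IngestFromWebsiteBody()  # set the target URL on this body (see the model)
    await api_instance.ingest_from_website(
        collection_id,
        body,
        follow_links=True,  # also import pages linked from the URL (same domain only)
        max_depth=1,        # follow only top-level links
        max_documents=25,   # stop after 25 documents
    )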

ingest_upload

ingest_upload(upload_ids, collection_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, restricted=restricted, permissions=permissions, timeout=timeout, ingest_upload_body=ingest_upload_body)

Ingest uploaded document

Ingests previously uploaded documents, identified by their upload ids, into a given collection.

Example

  • Bearer Authentication (bearerAuth):

import os
import h2ogpte.rest_async
from h2ogpte.rest_async.models.ingest_upload_body import IngestUploadBody
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    upload_ids = ['upload_ids_example'] # List[str] | Id of uploaded document
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model=\"tesseract\". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_mode = 'ingest_mode_example' # str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. (optional)
    restricted = False # bool | Whether the document should be restricted only to certain users. (optional) (default to False)
    permissions = ['permissions_example'] # List[str] | The list of usernames having permissions to the document. (optional)
    timeout = 3.4 # float | Timeout in seconds (optional)
    ingest_upload_body = h2ogpte.rest_async.IngestUploadBody() # IngestUploadBody |  (optional)

    try:
        # Ingest uploaded document
        await api_instance.ingest_upload(upload_ids, collection_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_mode=ingest_mode, restricted=restricted, permissions=permissions, timeout=timeout, ingest_upload_body=ingest_upload_body)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_upload: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| upload_ids | List[str] | Ids of the uploaded documents | |
| collection_id | str | String id of the collection to add the ingested documents into. | |
| gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False] |
| gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False] |
| audio_input_language | str | Language of audio files. | [optional] [default to 'auto'] |
| ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto'] |
| tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional] |
| keep_tables_as_one_chunk | bool | When tables are identified by the table parser the table tokens will be kept in a single chunk. | [optional] |
| chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional] |
| handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional] |
| ingest_mode | str | Ingest mode to use. - `standard` - Files will be ingested for use with RAG - `agent_only` - Bypasses standard ingestion. Files can only be used with agents. | [optional] |
| restricted | bool | Whether the document should be restricted only to certain users. | [optional] [default to False] |
| permissions | List[str] | The list of usernames having permissions to the document. | [optional] |
| timeout | float | Timeout in seconds | [optional] |
| ingest_upload_body | IngestUploadBody | | [optional] |

Return type

void (empty response body)

HTTP request headers

  • Content-Type: application/json

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|------------------|
| 204 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |

upload_file

UploadedFile upload_file(file=file)

Uploads a file to the H2OGPTe instance.

Example

  • Bearer Authentication (bearerAuth):

import os
import h2ogpte.rest_async
from h2ogpte.rest_async.models.uploaded_file import UploadedFile
from h2ogpte.rest_async.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest_async.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1"
)

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest_async.Configuration(
    access_token = os.environ["BEARER_TOKEN"]
)

# Enter a context with an instance of the API client
async with h2ogpte.rest_async.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest_async.DocumentIngestionApi(api_client)
    file = None # bytearray |  (optional)

    try:
        api_response = await api_instance.upload_file(file=file)
        print("The response of DocumentIngestionApi->upload_file:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->upload_file: %s\n" % e)

Parameters

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| file | bytearray | | [optional] |

Return type

UploadedFile

HTTP request headers

  • Content-Type: multipart/form-data

  • Accept: application/json

HTTP response details

| Status code | Description | Response headers |
|-------------|-------------|------------------|
| 200 | Successful operation | - |
| 401 | Unauthorized - Invalid or missing API key | - |
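upload_file and ingest_upload are designed to be used as a two-step flow: upload the raw bytes first, then ingest the resulting upload id into a collection. A sketch, assuming it runs inside the `async with` block from the example above; the attribute holding the upload id on the returned UploadedFile is assumed to be `id` (check the UploadedFile model), and the file path and collection id are placeholders.

    # Step 1: upload the raw file bytes.
    with open("report.pdf", "rb") as f:  # placeholder path
        uploaded = await api_instance.upload_file(file=f.read())

    # Step 2: ingest the uploaded file into a collection.
    await api_instance.ingest_upload(
        upload_ids=[uploaded.id],          # assumed id attribute; see the UploadedFile model
        collection_id="my-collection-id",  # placeholder collection id
    )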