DocumentIngestionApi
Method | HTTP request | Description
---|---|---|
create_ingest_from_azure_blob_storage_job | POST /ingest/azure-blob-storage/job | Creates a job to add files from Azure Blob Storage into a collection.
create_ingest_from_file_system_job | POST /ingest/file-system/job | Creates a job to add files from the local system into a collection.
create_ingest_from_gcs_job | POST /ingest/gcs/job | Creates a job to add files from Google Cloud Storage into a collection.
create_ingest_from_plain_text_job | POST /ingest/plain-text/job | Creates a job to add plain text to a collection.
create_ingest_from_s3_job | POST /ingest/s3/job | Creates a job to add files from AWS S3 into a collection.
create_ingest_from_website_job | POST /ingest/website/job | Creates a job to crawl and ingest a URL into a collection.
create_ingest_upload_job | POST /uploads/{upload-id}/ingest/job | Creates a job to ingest an uploaded document.
ingest_from_azure_blob_storage | POST /ingest/azure-blob-storage | Adds files from Azure Blob Storage into a collection.
ingest_from_file_system | POST /ingest/file-system | Adds files from the local system into a collection.
ingest_from_gcs | POST /ingest/gcs | Adds files from Google Cloud Storage into a collection.
ingest_from_plain_text | POST /ingest/plain-text | Adds plain text to a collection.
ingest_from_s3 | POST /ingest/s3 | Adds files from AWS S3 into a collection.
ingest_from_website | POST /ingest/website | Crawls and ingests a URL into a collection.
ingest_upload | POST /uploads/{upload-id}/ingest | Ingests an uploaded document.
 | PUT /uploads | 
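Each ingestion source is exposed twice: a `POST .../job` endpoint that returns immediately with a `JobDetails` object describing the background ingestion job, and a synchronous endpoint that blocks until ingestion finishes (subject to its `timeout` parameter). A minimal sketch of the difference, using the S3 methods documented below (the collection id and request body are placeholders):

import os
import h2ogpte.rest

configuration = h2ogpte.rest.Configuration(
    host="https://h2ogpte.genai.h2o.ai/api/v1",
    access_token=os.environ["BEARER_TOKEN"],
)

with h2ogpte.rest.ApiClient(configuration) as api_client:
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)
    body = h2ogpte.rest.IngestFromS3Body()  # placeholder; fill in the S3 location fields for your bucket

    # Asynchronous: returns a JobDetails right away; ingestion continues in the background.
    job = api_instance.create_ingest_from_s3_job('my_collection_id', body)

    # Synchronous: the call returns only once ingestion has completed (subject to timeout).
    api_instance.ingest_from_s3('my_collection_id', body, timeout=300)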
create_ingest_from_azure_blob_storage_job
JobDetails create_ingest_from_azure_blob_storage_job(collection_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_from_azure_blob_storage_body=ingest_from_azure_blob_storage_body)
Creates a job to add files from Azure Blob Storage into a collection.
Example
Bearer Authentication (bearerAuth):
import os
import h2ogpte.rest
from h2ogpte.rest.models.ingest_from_azure_blob_storage_body import IngestFromAzureBlobStorageBody
from h2ogpte.rest.models.job_details import JobDetails
from h2ogpte.rest.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest.Configuration(
host = "https://h2ogpte.genai.h2o.ai/api/v1"
)
# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.
# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest.Configuration(
access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)

    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model="tesseract". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    ingest_from_azure_blob_storage_body = h2ogpte.rest.IngestFromAzureBlobStorageBody() # IngestFromAzureBlobStorageBody | (optional)

    try:
        # Creates a job to add files from Azure Blob Storage into a collection.
        api_response = api_instance.create_ingest_from_azure_blob_storage_job(collection_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, ingest_from_azure_blob_storage_body=ingest_from_azure_blob_storage_body)
        print("The response of DocumentIngestionApi->create_ingest_from_azure_blob_storage_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_from_azure_blob_storage_job: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---|
collection_id | str | String id of the collection to add the ingested documents into. | 
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
audio_input_language | str | Language of audio files. | [optional] [default to 'auto']
ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto']
tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional]
keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional]
chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional]
handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional]
ingest_from_azure_blob_storage_body | IngestFromAzureBlobStorageBody | | [optional]
Return type
JobDetails
HTTP request headers
Content-Type: application/json
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---|
201 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
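The returned `JobDetails` describes an asynchronous job, so callers typically poll until it finishes. The job-status API is not part of this section; the sketch below is hypothetical — the `JobsApi` class, `get_job` method, and status values are assumptions to illustrate the pattern, not confirmed names.

import time

import h2ogpte.rest

def wait_for_ingest(api_client, job_id, poll_interval=5.0):
    # Hypothetical job-status client; substitute your client's actual job API.
    jobs_api = h2ogpte.rest.JobsApi(api_client)      # assumed class name
    while True:
        job = jobs_api.get_job(job_id)               # assumed method name
        if job.status in ("completed", "failed", "canceled"):  # assumed status values
            return job
        time.sleep(poll_interval)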
create_ingest_from_file_system_job
JobDetails create_ingest_from_file_system_job(collection_id, ingest_from_file_system_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check)
Creates a job to add files from the local system into a collection.
Example
Bearer Authentication (bearerAuth):
import os
import h2ogpte.rest
from h2ogpte.rest.models.ingest_from_file_system_body import IngestFromFileSystemBody
from h2ogpte.rest.models.job_details import JobDetails
from h2ogpte.rest.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest.Configuration(
host = "https://h2ogpte.genai.h2o.ai/api/v1"
)
# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.
# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest.Configuration(
access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)

    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_file_system_body = h2ogpte.rest.IngestFromFileSystemBody() # IngestFromFileSystemBody |
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model="tesseract". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)

    try:
        # Creates a job to add files from the local system into a collection.
        api_response = api_instance.create_ingest_from_file_system_job(collection_id, ingest_from_file_system_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check)
        print("The response of DocumentIngestionApi->create_ingest_from_file_system_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_from_file_system_job: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---|
collection_id | str | String id of the collection to add the ingested documents into. | 
ingest_from_file_system_body | IngestFromFileSystemBody | | 
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
audio_input_language | str | Language of audio files. | [optional] [default to 'auto']
ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto']
tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional]
keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional]
chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional]
handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional]
Return type
JobDetails
HTTP request headers
Content-Type: application/json
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---|
201 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
create_ingest_from_gcs_job
JobDetails create_ingest_from_gcs_job(collection_id, ingest_from_gcs_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check)
Creates a job to add files from Google Cloud Storage into a collection.
Example
Bearer Authentication (bearerAuth):
import os
import h2ogpte.rest
from h2ogpte.rest.models.ingest_from_gcs_body import IngestFromGcsBody
from h2ogpte.rest.models.job_details import JobDetails
from h2ogpte.rest.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest.Configuration(
host = "https://h2ogpte.genai.h2o.ai/api/v1"
)
# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.
# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest.Configuration(
access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)

    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_gcs_body = h2ogpte.rest.IngestFromGcsBody() # IngestFromGcsBody |
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model="tesseract". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)

    try:
        # Creates a job to add files from Google Cloud Storage into a collection.
        api_response = api_instance.create_ingest_from_gcs_job(collection_id, ingest_from_gcs_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check)
        print("The response of DocumentIngestionApi->create_ingest_from_gcs_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_from_gcs_job: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---|
collection_id | str | String id of the collection to add the ingested documents into. | 
ingest_from_gcs_body | IngestFromGcsBody | | 
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
audio_input_language | str | Language of audio files. | [optional] [default to 'auto']
ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto']
tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional]
keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional]
chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional]
handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional]
Return type
JobDetails
HTTP request headers
Content-Type: application/json
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---|
201 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
create_ingest_from_plain_text_job
JobDetails create_ingest_from_plain_text_job(collection_id, file_name, body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions)
Creates a job to add plain text to a collection.
Example
Bearer Authentication (bearerAuth):
import os
import h2ogpte.rest
from h2ogpte.rest.models.job_details import JobDetails
from h2ogpte.rest.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest.Configuration(
host = "https://h2ogpte.genai.h2o.ai/api/v1"
)
# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.
# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest.Configuration(
access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)

    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    file_name = 'file_name_example' # str | String of the file name to use for the document.
    body = 'body_example' # str | The text that will be ingested into a collection.
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)

    try:
        # Creates a job to add plain text to a collection.
        api_response = api_instance.create_ingest_from_plain_text_job(collection_id, file_name, body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions)
        print("The response of DocumentIngestionApi->create_ingest_from_plain_text_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_from_plain_text_job: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---|
collection_id | str | String id of the collection to add the ingested documents into. | 
file_name | str | String of the file name to use for the document. | 
body | str | The text that will be ingested into a collection. | 
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
Return type
JobDetails
HTTP request headers
Content-Type: text/plain
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---|
201 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
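Because the request's Content-Type is `text/plain`, the `body` argument is the raw document text itself rather than a JSON wrapper. For example, to ingest a short Markdown note as a document named `notes.md` (reusing `api_instance` and `collection_id` from the example above; the text is illustrative):

note = """# Meeting notes

- Shipped the Q3 report to the shared collection.
- OCR left at `auto` for scanned attachments.
"""

api_response = api_instance.create_ingest_from_plain_text_job(
    collection_id,
    'notes.md',  # file name the ingested document will appear under
    note,        # raw text body, sent as text/plain
    gen_doc_summaries=True,
)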
create_ingest_from_s3_job
JobDetails create_ingest_from_s3_job(collection_id, ingest_from_s3_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check)
Creates a job to add files from AWS S3 into a collection.
Example
Bearer Authentication (bearerAuth):
import os
import h2ogpte.rest
from h2ogpte.rest.models.ingest_from_s3_body import IngestFromS3Body
from h2ogpte.rest.models.job_details import JobDetails
from h2ogpte.rest.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest.Configuration(
host = "https://h2ogpte.genai.h2o.ai/api/v1"
)
# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.
# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest.Configuration(
access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)

    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_s3_body = h2ogpte.rest.IngestFromS3Body() # IngestFromS3Body |
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model="tesseract". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)

    try:
        # Creates a job to add files from AWS S3 into a collection.
        api_response = api_instance.create_ingest_from_s3_job(collection_id, ingest_from_s3_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check)
        print("The response of DocumentIngestionApi->create_ingest_from_s3_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_from_s3_job: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---|
collection_id | str | String id of the collection to add the ingested documents into. | 
ingest_from_s3_body | IngestFromS3Body | | 
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
audio_input_language | str | Language of audio files. | [optional] [default to 'auto']
ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto']
tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional]
keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional]
chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional]
handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional]
Return type
JobDetails
HTTP request headers
Content-Type: application/json
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---|
201 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
create_ingest_from_website_job
JobDetails create_ingest_from_website_job(collection_id, ingest_from_website_body, follow_links=follow_links, max_depth=max_depth, max_documents=max_documents, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check)
Creates a job to crawl and ingest a URL into a collection. The web page or document linked from this URL will be imported.
Example
Bearer Authentication (bearerAuth):
import os
import h2ogpte.rest
from h2ogpte.rest.models.ingest_from_website_body import IngestFromWebsiteBody
from h2ogpte.rest.models.job_details import JobDetails
from h2ogpte.rest.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest.Configuration(
host = "https://h2ogpte.genai.h2o.ai/api/v1"
)
# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.
# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest.Configuration(
access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)

    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_website_body = h2ogpte.rest.IngestFromWebsiteBody() # IngestFromWebsiteBody |
    follow_links = False # bool | Whether to import all web pages linked from this URL. External links will be ignored. Links to other pages on the same domain will be followed as long as they are at the same level or below the URL you specify. Each page will be transformed into a PDF document. (optional) (default to False)
    max_depth = -1 # int | Max depth of recursion when following links, only when follow_links is `true`. A max_depth of 0 means don't follow any links, a max_depth of 1 means follow only top-level links, etc. Use -1 for automatic (system settings). (optional) (default to -1)
    max_documents = 56 # int | Max number of documents when following links, only when follow_links is `true`. Use None for automatic (system defaults). Use -1 for max (system limit). (optional)
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model="tesseract". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)

    try:
        # Creates a job to crawl and ingest a URL into a collection.
        api_response = api_instance.create_ingest_from_website_job(collection_id, ingest_from_website_body, follow_links=follow_links, max_depth=max_depth, max_documents=max_documents, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check)
        print("The response of DocumentIngestionApi->create_ingest_from_website_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_from_website_job: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---|
collection_id | str | String id of the collection to add the ingested documents into. | 
ingest_from_website_body | IngestFromWebsiteBody | | 
follow_links | bool | Whether to import all web pages linked from this URL. External links will be ignored. Links to other pages on the same domain will be followed as long as they are at the same level or below the URL you specify. Each page will be transformed into a PDF document. | [optional] [default to False]
max_depth | int | Max depth of recursion when following links, only when follow_links is `true`. A max_depth of 0 means don't follow any links, a max_depth of 1 means follow only top-level links, etc. Use -1 for automatic (system settings). | [optional] [default to -1]
max_documents | int | Max number of documents when following links, only when follow_links is `true`. Use None for automatic (system defaults). Use -1 for max (system limit). | [optional]
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
audio_input_language | str | Language of audio files. | [optional] [default to 'auto']
ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto']
tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional]
keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional]
chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional]
handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional]
Return type
JobDetails
HTTP request headers
Content-Type: application/json
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---|
201 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
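By default only the given URL is imported; `follow_links`, `max_depth`, and `max_documents` control how far the crawl expands. For instance, to also ingest same-domain pages linked directly from the start page, capped at 20 documents (reusing `api_instance` and `collection_id` from the example above; the `url` field name on `IngestFromWebsiteBody` is an assumption):

body = h2ogpte.rest.IngestFromWebsiteBody(url='https://docs.example.com/guide/')  # field name assumed

api_response = api_instance.create_ingest_from_website_job(
    collection_id,
    body,
    follow_links=True,  # follow same-domain links from the start URL
    max_depth=1,        # only pages linked directly from the start page
    max_documents=20,   # stop the crawl after 20 documents
)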
create_ingest_upload_job
JobDetails create_ingest_upload_job(upload_id, collection_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check)
Creates a job to ingest an uploaded document, identified by `upload_id`, into a given collection.
Example
Bearer Authentication (bearerAuth):
import os
import h2ogpte.rest
from h2ogpte.rest.models.job_details import JobDetails
from h2ogpte.rest.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest.Configuration(
host = "https://h2ogpte.genai.h2o.ai/api/v1"
)
# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.
# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest.Configuration(
access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)

    upload_id = 'upload_id_example' # str | ID of the uploaded document
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model="tesseract". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)

    try:
        # Creates a job to ingest an uploaded document
        api_response = api_instance.create_ingest_upload_job(upload_id, collection_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check)
        print("The response of DocumentIngestionApi->create_ingest_upload_job:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->create_ingest_upload_job: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---|
upload_id | str | ID of the uploaded document | 
collection_id | str | String id of the collection to add the ingested documents into. | 
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
audio_input_language | str | Language of audio files. | [optional] [default to 'auto']
ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto']
tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional]
keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional]
chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional]
handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional]
Return type
JobDetails
HTTP request headers
Content-Type: Not defined
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---|
201 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
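Ingesting an upload is a two-step flow: first upload the raw file with `PUT /uploads` (listed in the method table above), then create an ingest job with the resulting upload id. The sketch below uses raw HTTP for the upload step because the upload method and its response shape are not documented in this section; the `file` form field and the `id` response field are assumptions. It reuses `api_instance` and `collection_id` from the example above.

import os
import requests

base_url = "https://h2ogpte.genai.h2o.ai/api/v1"
headers = {"Authorization": "Bearer %s" % os.environ["BEARER_TOKEN"]}

# Step 1: upload the raw file (PUT /uploads). Form field and response shape are assumed.
with open("report.pdf", "rb") as f:
    resp = requests.put(base_url + "/uploads", headers=headers, files={"file": f})
resp.raise_for_status()
upload_id = resp.json()["id"]  # assumed response field

# Step 2: create the ingest job for that upload (documented above).
api_response = api_instance.create_ingest_upload_job(upload_id, collection_id)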
ingest_from_azure_blob_storage
ingest_from_azure_blob_storage(collection_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout, ingest_from_azure_blob_storage_body=ingest_from_azure_blob_storage_body)
Adds files from Azure Blob Storage into a collection.
Example
Bearer Authentication (bearerAuth):
import os
import h2ogpte.rest
from h2ogpte.rest.models.ingest_from_azure_blob_storage_body import IngestFromAzureBlobStorageBody
from h2ogpte.rest.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest.Configuration(
host = "https://h2ogpte.genai.h2o.ai/api/v1"
)
# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.
# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest.Configuration(
access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)

    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model="tesseract". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    timeout = 300 # int | Timeout in seconds (optional) (default to 300)
    ingest_from_azure_blob_storage_body = h2ogpte.rest.IngestFromAzureBlobStorageBody() # IngestFromAzureBlobStorageBody | (optional)

    try:
        # Adds files from Azure Blob Storage into a collection.
        api_instance.ingest_from_azure_blob_storage(collection_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout, ingest_from_azure_blob_storage_body=ingest_from_azure_blob_storage_body)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_from_azure_blob_storage: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---|
collection_id | str | String id of the collection to add the ingested documents into. | 
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
audio_input_language | str | Language of audio files. | [optional] [default to 'auto']
ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto']
tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional]
keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional]
chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional]
handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional]
timeout | int | Timeout in seconds | [optional] [default to 300]
ingest_from_azure_blob_storage_body | IngestFromAzureBlobStorageBody | | [optional]
Return type
void (empty response body)
HTTP request headers
Content-Type: application/json
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---|
204 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
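Because the synchronous variant holds the HTTP request open until ingestion completes, large containers may need a longer `timeout`, or the job variant documented earlier to avoid blocking at all. A sketch, reusing the variables from the example above:

# Give a large batch more time on the blocking call...
api_instance.ingest_from_azure_blob_storage(
    collection_id,
    timeout=1800,
    ingest_from_azure_blob_storage_body=ingest_from_azure_blob_storage_body,
)

# ...or create a background job instead and return immediately.
job = api_instance.create_ingest_from_azure_blob_storage_job(
    collection_id,
    ingest_from_azure_blob_storage_body=ingest_from_azure_blob_storage_body,
)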
ingest_from_file_system
ingest_from_file_system(collection_id, ingest_from_file_system_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)
Adds files from the local system into a collection.
Example
Bearer Authentication (bearerAuth):
import os
import h2ogpte.rest
from h2ogpte.rest.models.ingest_from_file_system_body import IngestFromFileSystemBody
from h2ogpte.rest.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest.Configuration(
host = "https://h2ogpte.genai.h2o.ai/api/v1"
)
# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.
# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest.Configuration(
access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)

    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_file_system_body = h2ogpte.rest.IngestFromFileSystemBody() # IngestFromFileSystemBody |
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model="tesseract". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    timeout = 300 # int | Timeout in seconds (optional) (default to 300)

    try:
        # Adds files from the local system into a collection.
        api_instance.ingest_from_file_system(collection_id, ingest_from_file_system_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_from_file_system: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---|
collection_id | str | String id of the collection to add the ingested documents into. | 
ingest_from_file_system_body | IngestFromFileSystemBody | | 
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
audio_input_language | str | Language of audio files. | [optional] [default to 'auto']
ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto']
tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional]
keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional]
chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional]
handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional]
timeout | int | Timeout in seconds | [optional] [default to 300]
Return type
void (empty response body)
HTTP request headers
Content-Type: application/json
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---|
204 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
ingest_from_gcs
ingest_from_gcs(collection_id, ingest_from_gcs_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)
Adds files from Google Cloud Storage into a collection.
Example
Bearer Authentication (bearerAuth):
import os
import h2ogpte.rest
from h2ogpte.rest.models.ingest_from_gcs_body import IngestFromGcsBody
from h2ogpte.rest.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = h2ogpte.rest.Configuration(
host = "https://h2ogpte.genai.h2o.ai/api/v1"
)
# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.
# Configure Bearer authorization: bearerAuth
configuration = h2ogpte.rest.Configuration(
access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)

    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_gcs_body = h2ogpte.rest.IngestFromGcsBody() # IngestFromGcsBody |
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model="tesseract". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    timeout = 300 # int | Timeout in seconds (optional) (default to 300)

    try:
        # Adds files from Google Cloud Storage into a collection.
        api_instance.ingest_from_gcs(collection_id, ingest_from_gcs_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_from_gcs: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---|
collection_id | str | String id of the collection to add the ingested documents into. | 
ingest_from_gcs_body | IngestFromGcsBody | | 
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
audio_input_language | str | Language of audio files. | [optional] [default to 'auto']
ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto']
tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional]
keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional]
chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional]
handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional]
timeout | int | Timeout in seconds | [optional] [default to 300]
Return type
void (empty response body)
HTTP request headers
Content-Type: application/json
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---
204 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
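The example above instantiates an empty IngestFromGcsBody as a placeholder; a real request must name the GCS locations to ingest. A minimal sketch of populating the body follows. The `urls` field name is an assumption and should be verified against the generated model in h2ogpte/rest/models/ingest_from_gcs_body.py:
# Minimal sketch, assuming the body exposes a `urls` field (verify against
# the generated IngestFromGcsBody model); the bucket path is hypothetical.
ingest_from_gcs_body = h2ogpte.rest.IngestFromGcsBody(
    urls=["gs://my-bucket/reports/"]
)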
ingest_from_plain_text
ingest_from_plain_text(collection_id, file_name, body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, timeout=timeout)
Adds plain text to a collection.
Example
Bearer Authentication (bearerAuth):
import os

import h2ogpte.rest
from h2ogpte.rest.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
# The client must configure authentication in accordance with the API server
# security policy; this example uses Bearer authorization (bearerAuth).
configuration = h2ogpte.rest.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1",
    access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    file_name = 'file_name_example' # str | String of the file name to use for the document.
    body = 'body_example' # str | The text that will be ingested into a collection.
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    timeout = 300 # int | Timeout in seconds (optional) (default to 300)

    try:
        # Adds plain text to a collection.
        api_instance.ingest_from_plain_text(collection_id, file_name, body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, timeout=timeout)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_from_plain_text: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---
collection_id | str | String id of the collection to add the ingested documents into. |
file_name | str | String of the file name to use for the document. |
body | str | The text that will be ingested into a collection. |
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
timeout | int | Timeout in seconds | [optional] [default to 300]
Return type
void (empty response body)
HTTP request headers
Content-Type: text/plain
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---
204 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
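Because the request body is the raw text itself (Content-Type: text/plain), ingesting a local text file reduces to a read plus one call. A minimal sketch, assuming api_instance and collection_id are set up as in the example above; the file path is hypothetical:
# Minimal sketch: ingest the contents of a local file as plain text.
with open("notes.txt", encoding="utf-8") as f:
    text = f.read()
api_instance.ingest_from_plain_text(collection_id, "notes.txt", text)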
ingest_from_s3
ingest_from_s3(collection_id, ingest_from_s3_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)
Adds files from the AWS S3 storage into a collection.
Example
Bearer Authentication (bearerAuth):
import os

import h2ogpte.rest
from h2ogpte.rest.models.ingest_from_s3_body import IngestFromS3Body
from h2ogpte.rest.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
# The client must configure authentication in accordance with the API server
# security policy; this example uses Bearer authorization (bearerAuth).
configuration = h2ogpte.rest.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1",
    access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_s3_body = h2ogpte.rest.IngestFromS3Body() # IngestFromS3Body
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model="tesseract". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    timeout = 300 # int | Timeout in seconds (optional) (default to 300)

    try:
        # Adds files from the AWS S3 storage into a collection.
        api_instance.ingest_from_s3(collection_id, ingest_from_s3_body, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_from_s3: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---
collection_id | str | String id of the collection to add the ingested documents into. |
ingest_from_s3_body | IngestFromS3Body | |
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
audio_input_language | str | Language of audio files. | [optional] [default to 'auto']
ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto']
tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional]
keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional]
chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional]
handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional]
timeout | int | Timeout in seconds | [optional] [default to 300]
Return type
void (empty response body)
HTTP request headers
Content-Type: application/json
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---
204 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
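As with GCS, the empty IngestFromS3Body in the example is only a placeholder; the body must carry the S3 locations to read. A minimal sketch, assuming a `urls` field (verify against the generated model in h2ogpte/rest/models/ingest_from_s3_body.py); the bucket prefix is hypothetical:
# Minimal sketch, assuming the body exposes a `urls` field (verify against
# the generated IngestFromS3Body model); the bucket prefix is hypothetical.
ingest_from_s3_body = h2ogpte.rest.IngestFromS3Body(
    urls=["s3://my-bucket/contracts/"]
)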
ingest_from_website
ingest_from_website(collection_id, ingest_from_website_body, follow_links=follow_links, max_depth=max_depth, max_documents=max_documents, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)
Crawls and ingests a URL into a collection.
Crawls and ingests a URL into a collection. The web page or document linked from this URL will be imported.
Example
Bearer Authentication (bearerAuth):
import os

import h2ogpte.rest
from h2ogpte.rest.models.ingest_from_website_body import IngestFromWebsiteBody
from h2ogpte.rest.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
# The client must configure authentication in accordance with the API server
# security policy; this example uses Bearer authorization (bearerAuth).
configuration = h2ogpte.rest.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1",
    access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    ingest_from_website_body = h2ogpte.rest.IngestFromWebsiteBody() # IngestFromWebsiteBody
    follow_links = False # bool | Whether to import all web pages linked from this URL. External links will be ignored. Links to other pages on the same domain will be followed as long as they are at the same level or below the URL you specify. Each page will be transformed into a PDF document. (optional) (default to False)
    max_depth = -1 # int | Max depth of recursion when following links, only when follow_links is `true`. A max_depth of 0 means don't follow any links, a max_depth of 1 means follow only top-level links, etc. Use -1 for automatic (system settings). (optional) (default to -1)
    max_documents = 56 # int | Max number of documents when following links, only when follow_links is `true`. Use None for automatic (system defaults). Use -1 for max (system limit). (optional)
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model="tesseract". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    timeout = 300 # int | Timeout in seconds (optional) (default to 300)

    try:
        # Crawls and ingests a URL into a collection.
        api_instance.ingest_from_website(collection_id, ingest_from_website_body, follow_links=follow_links, max_depth=max_depth, max_documents=max_documents, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_from_website: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---
collection_id | str | String id of the collection to add the ingested documents into. |
ingest_from_website_body | IngestFromWebsiteBody | |
follow_links | bool | Whether to import all web pages linked from this URL. External links will be ignored. Links to other pages on the same domain will be followed as long as they are at the same level or below the URL you specify. Each page will be transformed into a PDF document. | [optional] [default to False]
max_depth | int | Max depth of recursion when following links, only when follow_links is `true`. A max_depth of 0 means don't follow any links, a max_depth of 1 means follow only top-level links, etc. Use -1 for automatic (system settings). | [optional] [default to -1]
max_documents | int | Max number of documents when following links, only when follow_links is `true`. Use None for automatic (system defaults). Use -1 for max (system limit). | [optional]
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
audio_input_language | str | Language of audio files. | [optional] [default to 'auto']
ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto']
tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional]
keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional]
chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional]
handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional]
timeout | int | Timeout in seconds | [optional] [default to 300]
Return type
void (empty response body)
HTTP request headers
Content-Type: application/json
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---
204 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
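Likewise, the empty IngestFromWebsiteBody must be populated with the URL to crawl. A minimal sketch, assuming the body exposes a `url` field (verify against the generated model in h2ogpte/rest/models/ingest_from_website_body.py); the URL is hypothetical:
# Minimal sketch, assuming the body exposes a `url` field (verify against
# the generated IngestFromWebsiteBody model); the URL is hypothetical.
ingest_from_website_body = h2ogpte.rest.IngestFromWebsiteBody(
    url="https://docs.example.com/"
)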
ingest_upload
ingest_upload(upload_id, collection_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)
Ingest uploaded document
Ingests an uploaded document, identified by its upload id, into a given collection.
Example
Bearer Authentication (bearerAuth):
import os

import h2ogpte.rest
from h2ogpte.rest.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
# The client must configure authentication in accordance with the API server
# security policy; this example uses Bearer authorization (bearerAuth).
configuration = h2ogpte.rest.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1",
    access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)
    upload_id = 'upload_id_example' # str | Id of uploaded document
    collection_id = 'collection_id_example' # str | String id of the collection to add the ingested documents into.
    gen_doc_summaries = False # bool | Whether to auto-generate document summaries (uses LLM). (optional) (default to False)
    gen_doc_questions = False # bool | Whether to auto-generate sample questions for each document (uses LLM). (optional) (default to False)
    audio_input_language = 'auto' # str | Language of audio files. (optional) (default to 'auto')
    ocr_model = 'auto' # str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). (optional) (default to 'auto')
    tesseract_lang = 'tesseract_lang_example' # str | Which language to use when using ocr_model="tesseract". (optional)
    keep_tables_as_one_chunk = True # bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. (optional)
    chunk_by_page = True # bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. (optional)
    handwriting_check = True # bool | Check pages for handwriting. Will use specialized models if handwriting is found. (optional)
    timeout = 300 # int | Timeout in seconds (optional) (default to 300)

    try:
        # Ingest uploaded document
        api_instance.ingest_upload(upload_id, collection_id, gen_doc_summaries=gen_doc_summaries, gen_doc_questions=gen_doc_questions, audio_input_language=audio_input_language, ocr_model=ocr_model, tesseract_lang=tesseract_lang, keep_tables_as_one_chunk=keep_tables_as_one_chunk, chunk_by_page=chunk_by_page, handwriting_check=handwriting_check, timeout=timeout)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->ingest_upload: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---
upload_id | str | Id of uploaded document |
collection_id | str | String id of the collection to add the ingested documents into. |
gen_doc_summaries | bool | Whether to auto-generate document summaries (uses LLM). | [optional] [default to False]
gen_doc_questions | bool | Whether to auto-generate sample questions for each document (uses LLM). | [optional] [default to False]
audio_input_language | str | Language of audio files. | [optional] [default to 'auto']
ocr_model | str | Which method to use to extract text from images using AI-enabled optical character recognition (OCR) models. docTR is best for Latin text, PaddleOCR is best for certain non-Latin languages, Tesseract covers a wide range of languages. Mississippi works well on handwriting. - `auto` - Automatic will auto-select the best OCR model for every page. - `off` - Disable OCR for speed, but all images will then be skipped (also no image captions will be made). | [optional] [default to 'auto']
tesseract_lang | str | Which language to use when using ocr_model="tesseract". | [optional]
keep_tables_as_one_chunk | bool | When tables are identified by the table parser, the table tokens will be kept in a single chunk. | [optional]
chunk_by_page | bool | Each page will be a chunk. `keep_tables_as_one_chunk` will be ignored if this is `true`. | [optional]
handwriting_check | bool | Check pages for handwriting. Will use specialized models if handwriting is found. | [optional]
timeout | int | Timeout in seconds | [optional] [default to 300]
Return type
void (empty response body)
HTTP request headers
Content-Type: Not defined
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---
204 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
upload_file
UploadedFile upload_file(file=file)
Uploads a file to the H2OGPTe instance.
Example
Bearer Authentication (bearerAuth):
import os

import h2ogpte.rest
from h2ogpte.rest.models.uploaded_file import UploadedFile
from h2ogpte.rest.rest import ApiException
from pprint import pprint

# Defining the host is optional and defaults to https://h2ogpte.genai.h2o.ai/api/v1
# See configuration.py for a list of all supported configuration parameters.
# The client must configure authentication in accordance with the API server
# security policy; this example uses Bearer authorization (bearerAuth).
configuration = h2ogpte.rest.Configuration(
    host = "https://h2ogpte.genai.h2o.ai/api/v1",
    access_token = os.environ["BEARER_TOKEN"]
)
# Enter a context with an instance of the API client
with h2ogpte.rest.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = h2ogpte.rest.DocumentIngestionApi(api_client)
    file = None # bytearray | (optional)

    try:
        api_response = api_instance.upload_file(file=file)
        print("The response of DocumentIngestionApi->upload_file:\n")
        pprint(api_response)
    except Exception as e:
        print("Exception when calling DocumentIngestionApi->upload_file: %s\n" % e)
Parameters
Name | Type | Description | Notes
---|---|---|---
file | bytearray | | [optional]
Return type
UploadedFile
HTTP request headers
Content-Type: multipart/form-data
Accept: application/json
HTTP response details
Status code | Description | Response headers
---|---|---
200 | Successful operation | -
401 | Unauthorized - Invalid or missing API key | -
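Taken together, upload_file (PUT /uploads) and ingest_upload (POST /uploads/{upload-id}/ingest) form a two-step path for local files: upload the bytes, then ingest the returned upload id into a collection. A minimal end-to-end sketch, assuming the UploadedFile response exposes the upload id as `id` (verify against h2ogpte/rest/models/uploaded_file.py); the file path and collection id are placeholders:
# Minimal sketch: upload local bytes, then ingest the returned upload id.
# Assumes api_instance is configured as in the examples above.
with open("report.pdf", "rb") as f:  # hypothetical local file
    api_response = api_instance.upload_file(file=f.read())
api_instance.ingest_upload(
    upload_id=api_response.id,  # assumed attribute name; verify in UploadedFile
    collection_id='collection_id_example',
)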