Supported data connectors and format
Data connectors
H2O LLM Studio supports the following data connectors to access or upload external data sources.
- Upload: Upload a local dataset from your machine.
- Local: Specify the file location of the dataset on your machine.
- AWS S3: Connect to a dataset in an Amazon S3 bucket.
- Azure Data Lake: Connect to a dataset in Azure Data Lake.
- H2O Drive: Upload a dataset from H2O Drive.
- Kaggle: Connect to a dataset hosted on Kaggle.
- Hugging Face: Connect to a dataset on Hugging Face.
Data format
Each data connector requires either a single `.csv` or `.pq` file, or the data packed in a `.zip` file, for a successful import.

H2O LLM Studio requires a `.csv` file with a minimum of two columns, where one contains the instructions and the other has the model’s expected output. You can also include an additional validation dataframe in the same format or allow for an automatic train/validation split to assess the model’s performance. Optionally, a Parent Id can be used for training nested data prompts that are linked to a parent question.
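As a minimal sketch, a dataset in this format could be created with pandas as shown below. The column names `instruction` and `output` are assumptions for illustration; any names work as long as they are selected as the Prompt Column and Answer Column described in the settings that follow.

```python
# Minimal sketch of a two-column dataset (column names are illustrative).
import pandas as pd

train = pd.DataFrame(
    {
        "instruction": [
            "What is H2O LLM Studio?",
            "Summarize the benefits of fine-tuning.",
        ],
        "output": [
            "H2O LLM Studio is a no-code GUI for fine-tuning large language models.",
            "Fine-tuning adapts a pretrained model to a specific domain or task.",
        ],
    }
)

train.to_csv("train.csv", index=False)      # a single .csv file
train.to_parquet("train.pq", index=False)   # or a .pq file (requires pyarrow)
```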
During an experiment, you can adapt the data representation with the following settings:
- Prompt Column: The column in the dataset containing the user prompt.
- Answer Column: The column in the dataset containing the expected output.
- Parent Id Column: An optional column specifying the parent id used for chained conversations. The values in this column need to match an additional column named `id`. If provided, the prompt is concatenated after the preceding parent rows, as sketched below.
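The following sketch illustrates chained (multi-turn) data. The `parent_id` column name is an assumption and must be selected as the Parent Id Column in the experiment settings; the referenced column must be named exactly `id`.

```python
# Sketch of chained conversation data using a parent id (column names other
# than "id" are illustrative and configurable in the experiment settings).
import pandas as pd

chat = pd.DataFrame(
    {
        "id": ["a1", "a2"],
        "parent_id": [None, "a1"],  # row "a2" continues the conversation started by "a1"
        "instruction": [
            "What file formats can I import?",
            "And how do I add a validation split?",
        ],
        "output": [
            "You can import .csv or .pq files, optionally packed into a .zip archive.",
            "Either upload a separate validation dataframe or use an automatic split.",
        ],
    }
)

chat.to_csv("chained_train.csv", index=False)
# During training, the prompt of row "a2" is concatenated after the prompt and
# answer of its parent row "a1", forming the full conversation context.
```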
To train a chatbot-style model, you need to convert your data into a question-and-answer format. Other enterprise solutions by H2O.ai may help you prepare your data. For more information, see H2O.ai's Generative AI page and this blog post about H2O LLM DataStudio: Streamlining Data Curation and Data Preparation for LLMs related tasks.
Example data
H2O LLM Studio provides a sample dataset (converted from OpenAssistant/oasst2) that can be downloaded here. It is recommended to use `train_full.csv` for training. This dataset is also downloaded and prepared by default when first starting the GUI. Multiple dataframes can be uploaded into a single dataset by uploading a `.zip` archive.
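For example, a `.zip` archive containing several dataframes could be assembled as sketched below. The file names are illustrative, and the files are assumed to already exist on disk.

```python
# Pack multiple dataframes into one .zip archive for upload as a single dataset.
import zipfile

with zipfile.ZipFile("my_dataset.zip", "w") as archive:
    archive.write("train_full.csv")   # training dataframe
    archive.write("validation.csv")   # optional validation dataframe
```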