Experiment settings
The settings for creating an experiment are grouped into the following sections:
- General settings
- Dataset settings
- Tokenizer settings
- Architecture settings
- Training settings
- Augmentation settings
- Prediction settings
- Environment settings
- Logging settings
The settings under each category are listed and described below.
General settings
Dataset
Select the dataset for the experiment.
Problem type
Defines the problem type of the experiment, which also defines the settings H2O LLM Studio displays for the experiment.
Causal Language Modeling: Used to fine-tune large language models
Causal Classification Modeling: Used to fine-tune causal classification models
Causal Regression Modeling: Used to fine-tune causal regression models
Sequence To Sequence Modeling: Used to fine-tune large sequence to sequence models
DPO Modeling: Used to fine-tune large language models using Direct Preference Optimization
Import config from YAML
Defines the .yml file that contains the experiment settings.
- H2O LLM Studio supports .yml file import and export functionality. You can download the config settings of finished experiments, make changes, and re-upload them when starting a new experiment in any instance of H2O LLM Studio.
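An exported config can also be edited outside the UI before re-uploading. The following is a minimal sketch, assuming a downloaded file named cfg.yaml; the key names shown are hypothetical, not the exact exported schema.

```python
# Hedged sketch: edit an exported experiment config before re-uploading it.
import yaml

with open("cfg.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["training"]["epochs"] = 3  # hypothetical key path; adjust to the exported layout

with open("cfg_modified.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```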
Experiment name
Defines the name of the experiment.
LLM backbone
The LLM Backbone option is the most important setting as it sets the pretrained model weights.
- Use smaller models for quicker experiments and larger models for higher accuracy
- Aim to leverage models pre-trained on tasks similar to your use case when possible
- Select a model from the dropdown list or type in the name of a Hugging Face model of your preference
Dataset settings
Train dataframe
Defines a .csv or .pq file containing a dataframe with training records that H2O LLM Studio uses to train the model.
- The records are combined into mini-batches when training the model.
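A minimal sketch of preparing such a file with pandas; the "prompt" and "answer" column names are assumptions for illustration.

```python
# Illustrative only: build a training dataframe and save it as .csv or .pq.
import pandas as pd

df = pd.DataFrame(
    {
        "prompt": ["What is MLOps?", "Explain overfitting in one sentence."],
        "answer": ["MLOps is ...", "Overfitting is ..."],
    }
)
df.to_csv("train.csv", index=False)
df.to_parquet("train.pq", index=False)  # requires pyarrow or fastparquet
```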
Validation strategy
Specifies the validation strategy H2O LLM Studio uses for the experiment.
To properly assess the performance of your trained models, it is common practice to evaluate them on separate holdout data that the model has not seen during training. H2O LLM Studio allows you to specify different strategies for this task that fit your needs.
Options
- Custom holdout validation
- Specifies a separate holdout dataframe.
- Automatic holdout validation
- Allows you to specify a holdout validation sample size that is automatically generated.
Validation size
Defines an optional relative size of the holdout validation set. H2O LLM Studio automatically samples the selected percentage from the full training data and builds a holdout dataset that the model is validated on.
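Conceptually, this corresponds to sampling a fraction of the training data as a holdout set. A hedged pandas sketch, with the file name and 1% size assumed:

```python
# Illustrative only: sample a 1% holdout set from the training dataframe.
import pandas as pd

df = pd.read_csv("train.csv")
val_df = df.sample(frac=0.01, random_state=42)  # validation size = 0.01
train_df = df.drop(val_df.index)
```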
Data sample
Defines the percentage of the data to use for the experiment. The default percentage is 100% (1).
Changing the default value can significantly increase the training speed. Still, it might lead to substantially worse accuracy. Using 100% (1) of the data for final models is highly recommended.
System column
The column in the dataset containing the system input which is always prepended for a full sample.
Prompt column
One column or multiple columns in the dataset containing the user prompt. If multiple columns are selected, the columns are concatenated with a separator defined in Prompt Column Separator.
Prompt column separator
If multiple prompt columns are selected, the columns are concatenated with the separator defined here. If only a single prompt column is selected, this setting is ignored.
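The following sketch illustrates the concatenation, assuming two prompt columns named "instruction" and "context" and a "\n\n" separator.

```python
# Illustrative only: combine multiple prompt columns with a separator.
import pandas as pd

df = pd.DataFrame({"instruction": ["Summarize the text:"], "context": ["LLMs are ..."]})
separator = "\n\n"
df["full_prompt"] = df["instruction"] + separator + df["context"]
```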
Answer column
The column in the dataset containing the expected output.
For classification, this needs to be an integer column starting from zero containing the class label, while for regression, it needs to be a float column.
Multiple target columns can be selected for classification and regression, supporting multilabel problems. In detail, we support the following cases (illustrated in the sketch after this list):
- Multi-class classification requires a single column containing the class label
- Binary classification requires a single column containing a binary integer label
- Multilabel classification requires each column to refer to one label encoded with a binary integer label
- For regression, each target column requires a float value
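The sketch below illustrates the expected encodings; column names are assumptions.

```python
# Hedged sketch of answer-column encodings for each problem setup.
import pandas as pd

# Multi-class classification: one integer column with labels starting at 0.
multi_class = pd.DataFrame({"answer": [0, 2, 1]})

# Binary classification: one column with 0/1 labels.
binary = pd.DataFrame({"answer": [0, 1, 1]})

# Multilabel classification: one binary integer column per label.
multilabel = pd.DataFrame({"label_a": [1, 0], "label_b": [0, 1]})

# Regression: one or more float target columns.
regression = pd.DataFrame({"target": [0.7, 3.2]})
```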
Parent ID column
An optional column specifying the parent ID to be used for chained conversations. The value of this column needs to match an additional column with the name id. If provided, the prompt will be concatenated after the preceding parent rows.
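A small sketch of a chained conversation, assuming columns named id, parent_id, prompt, and answer:

```python
# Illustrative only: the second turn references the first via parent_id,
# so its prompt is concatenated after the parent row during training.
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["turn-1", "turn-2"],
        "parent_id": [None, "turn-1"],
        "prompt": ["Hi, who are you?", "What can you do?"],
        "answer": ["I am an assistant.", "I can answer questions."],
    }
)
```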
Text prompt start
Optional text to prepend to each prompt.
Text answer separator
Optional text to append to each prompt / prepend to each answer.
Add EOS token to prompt
Adds EOS token at end of prompt.
Add EOS token to answer
Adds EOS token at end of answer.
Mask prompt labels
Whether to mask the prompt labels during training and only train on the loss of the answer.
Num classes
The number of possible classes for the classification task. For binary classification, a single class should be selected.
The Num classes field should be set to the total number of classes in the answer column of the dataset.
Tokenizer settings
Max length
Defines the maximum length of the input sequence H2O LLM Studio uses during model training. In other words, this setting specifies the maximum number of tokens an input text is transformed into for model training.
A higher token count leads to higher memory usage, which slows down training, while also increasing the probability of obtaining a higher accuracy value.
In case of Causal Language Modeling, this includes both prompt and answer, or all prompts and answers in case of chained samples.
In Sequence to Sequence Modeling, this refers to the length of the prompt, or the length of a full chained sample.
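The behavior is comparable to truncating inputs with a Hugging Face tokenizer, as in this generic sketch; the model name is only a placeholder.

```python
# Generic sketch: limit the tokenized input to a maximum number of tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
encoded = tokenizer("A very long prompt ...", max_length=512, truncation=True)
print(len(encoded["input_ids"]))
```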
Add prompt answer tokens
Adds system, prompt, and answer tokens as new tokens to the tokenizer. It is recommended to also set Force Embedding Gradients in this case.
Padding quantile
Defines the padding quantile H2O LLM Studio uses to select the maximum token length per batch. H2O LLM Studio performs padding of shorter sequences up to the specified padding quantile instead of the selected Max length. H2O LLM Studio truncates longer sequences.
- Lowering the quantile can significantly reduce training runtime and memory usage for unevenly distributed sequence lengths, but can hurt performance
- The setting depends on the batch size and should be adjusted accordingly
- No padding is done in inference, and the selected Max Length is guaranteed
- Setting to 0 disables padding
- In case of distributed training, the quantile will be calculated across all GPUs
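A conceptual sketch of quantile-based padding; the function below is illustrative, not the H2O LLM Studio implementation.

```python
# Pad each batch only up to the chosen quantile of its sequence lengths,
# never beyond the selected Max length; longer sequences are truncated.
import numpy as np

def batch_pad_length(lengths, padding_quantile=0.9, max_length=512):
    q_len = int(np.ceil(np.quantile(lengths, padding_quantile)))
    return min(q_len, max_length)

print(batch_pad_length([12, 18, 25, 400]))  # 288: most of the batch needs far less than 512
```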
Architecture settings
Backbone Dtype
The datatype of the weights in the LLM backbone.
Gradient Checkpointing
Determines whether H2O LLM Studio activates gradient checkpointing (GC) when training the model. Turning GC On reduces the video random access memory (VRAM) footprint at the cost of a longer runtime (an additional forward pass).
Caution Gradient checkpointing is an experimental setting that is not compatible with all backbones or all other settings.
Activating GC comes at the cost of a longer training time; for that reason, try training without GC first and only activate when experiencing GPU out-of-memory (OOM) errors.
Intermediate dropout
Defines the custom dropout rate H2O LLM Studio uses for intermediate layers in the transformer model.
Pretrained weights
Allows you to specify a local path to the pretrained weights.
Training settings
Loss function
Defines the loss function H2O LLM Studio utilizes during model training. The loss function is a differentiable function measuring the prediction error. The model utilizes gradients of the loss function to update the model weights during training. The options depend on the selected Problem Type.
For multiclass classification problems, set the loss function to Cross-entropy.
Optimizer
Defines the algorithm or method (optimizer) to use for model training. The selected algorithm or method defines how the model should change the attributes of the neural network, such as weights and learning rate. Optimizers solve optimization problems and make more accurate updates to attributes to reduce learning losses.
Options:
- Adadelta
- To learn about Adadelta, see ADADELTA: An Adaptive Learning Rate Method.
- Adam
- To learn about Adam, see Adam: A Method for Stochastic Optimization.
- AdamW
- To learn about AdamW, see Decoupled Weight Decay Regularization.
- AdamW8bit
- An 8-bit variant of AdamW. To learn about AdamW, see Decoupled Weight Decay Regularization.
- RMSprop
- To learn about RMSprop, see Neural Networks for Machine Learning.
- SGD
- H2O LLM Studio uses a stochastic gradient descent optimizer.
Learning rate
Defines the learning rate H2O LLM Studio uses when training the model, specifically when updating the neural network's weights. The learning rate is the speed at which the model updates its weights after processing each mini-batch of data.
- Learning rate is an important setting to tune as it balances under- and overfitting.
- The number of epochs highly impacts the optimal value of the learning rate.
Differential learning rate layers
Defines the learning rate to apply to certain layers of a model. H2O LLM Studio applies the regular learning rate to layers without a specified learning rate.
- Backbone
- H2O LLM Studio applies a different learning rate to a body of the neural network architecture.
- Value Head
- H2O LLM Studio applies a different learning rate to a value head of the neural network architecture.
A common strategy is to apply a lower learning rate to the backbone of a model for better convergence and training stability.
By default, H2O LLM Studio applies Differential learning rate Layers, with the learning rate for the classification_head being 10 times smaller than the learning rate for the rest of the model.
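In generic PyTorch terms, differential learning rates correspond to optimizer parameter groups, as in the following sketch; module names and values are illustrative, not H2O LLM Studio internals.

```python
# Illustrative only: a lower learning rate for the backbone, a higher one for the head.
import torch

backbone = torch.nn.Linear(16, 16)
head = torch.nn.Linear(16, 2)

optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},  # backbone: smaller LR
        {"params": head.parameters(), "lr": 1e-4},      # head: regular LR
    ]
)
```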
Freeze layers
An optional list of layers to freeze during training. Full layer names will be matched against selected substrings. Only available without LoRA training.
Attention Implementation
Allows you to change the attention implementation that is used.
- Auto selection will automatically choose the implementation based on system availability.
- Eager relies on vanilla attention implementation in Python.
- SDPA uses scaled dot product attention in PyTorch.
- Flash Attention 2 explicitly uses FA2 which requires the flash_attn package.
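For reference, this roughly maps to the attn_implementation argument when loading a Hugging Face backbone; the model name below is only a placeholder, and availability depends on the transformers version and backbone.

```python
# Illustrative only: select the attention implementation at load time.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",                      # placeholder backbone
    attn_implementation="sdpa",  # "eager", "sdpa", or "flash_attention_2"
)
```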
Batch size
Defines the number of training examples in a mini-batch that H2O LLM Studio uses during an iteration of model training to estimate the error gradient before updating the model weights. The batch size is applied per single GPU.
During model training, the training data is packed into mini-batches of a fixed size.
Epochs
Defines the number of epochs to train the model. In other words, it specifies the number of times the learning algorithm goes through the entire training dataset.
- The Epochs setting is an important setting to tune because it balances under- and overfitting.
- The learning rate highly impacts the optimal value of the epochs.
- H2O LLM Studio enables you to train for zero epochs, in which case the model is not trained and the pretrained model (experiment) can be evaluated as-is.
Schedule
Defines the learning rate schedule H2O LLM Studio utilizes during model training. Specifying a learning rate schedule prevents the learning rate from staying the same. Instead, a learning rate schedule causes the learning rate to change over iterations, typically decreasing the learning rate to achieve a better model performance and training convergence.
Options
- Constant
- H2O LLM Studio applies a constant learning rate during the training process.
- Cosine
- H2O LLM Studio applies a cosine learning rate that follows the values of the cosine function.
- Linear
- H2O LLM Studio applies a linear learning rate that decreases the learning rate linearly.
Min Learning Rate Ratio
The minimum learning rate ratio determines the lowest learning rate that will be used during training as a fraction of the initial learning rate. This is particularly useful when using learning rate schedules like "Cosine" or "Linear" that decrease the learning rate over time.
For example, if the initial learning rate is 0.001 and the min_learning_rate_ratio is set to 0.1, the learning rate will never drop below 0.0001 (0.001 * 0.1) during training.
Setting this to a value greater than 0 can help prevent the learning rate from becoming too small, which might slow down training or cause the model to get stuck in local optima.
- A value of 0.0 allows the learning rate to potentially reach zero by the end of training.
- Typical values range from 0.01 to 0.1, depending on the specific task and model.
This parameter cannot be set when using the Constant learning rate schedule.
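The following sketch reproduces the arithmetic above for a cosine schedule; it is an illustrative formula, not the exact scheduler implementation.

```python
# Cosine decay from the initial LR down to initial_lr * min_learning_rate_ratio.
import math

def cosine_lr(step, total_steps, initial_lr=0.001, min_lr_ratio=0.1):
    min_lr = initial_lr * min_lr_ratio
    cosine = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return min_lr + (initial_lr - min_lr) * cosine

print(cosine_lr(0, 1000))     # 0.001 at the start of training
print(cosine_lr(1000, 1000))  # 0.0001 at the end of training
```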
Warmup epochs
Defines the number of epochs to warm up the learning rate where the learning rate should increase linearly from 0 to the desired learning rate. Can be a fraction of an epoch.
Weight decay
Defines the weight decay that H2O LLM Studio uses for the optimizer during model training.
Weight decay is a regularization technique that adds an L2 norm of all model weights to the loss function, increasing the probability of improving the model generalization.
Gradient clip
Defines the maximum norm of the gradients H2O LLM Studio specifies during model training. Defaults to 0, no clipping. When a value greater than 0 is specified, H2O LLM Studio modifies the gradients during model training. H2O LLM Studio uses the specified value as an upper limit for the norm of the gradients, calculated using the Euclidean norm over all gradients per batch.
This setting can help model convergence when extreme gradient values cause high volatility of weight updates.
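In plain PyTorch, this corresponds to clipping by global norm, as in this generic sketch:

```python
# Illustrative only: clip the global gradient norm after the backward pass.
import torch

model = torch.nn.Linear(8, 1)
loss = model(torch.randn(4, 8)).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Gradient clip value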
Grad accumulation
Defines the number of gradient accumulations before H2O LLM Studio updates the neural network weights during model training.
- Grad accumulation can be beneficial if only small batches are selected for training. With gradient accumulation, the loss and gradients are calculated after each batch, but it waits for the selected accumulations before updating the model weights. You can control the batch size through the Batch size setting.
- Changing the default value of Grad Accumulation might require adjusting the learning rate and batch size.
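A generic PyTorch sketch of the mechanism; the batch count, batch size, and accumulation value are illustrative.

```python
# Accumulate gradients over several mini-batches before one optimizer step.
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
grad_accumulation = 4

for step, batch in enumerate(torch.randn(16, 4, 8)):  # 16 mini-batches of size 4
    loss = model(batch).mean() / grad_accumulation     # scale loss per accumulation step
    loss.backward()
    if (step + 1) % grad_accumulation == 0:
        optimizer.step()
        optimizer.zero_grad()
```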
Lora
Whether to use low rank approximations (LoRA) during training.
Use Dora
Enables Weight-Decomposed Low-Rank Adaptation (DoRA) to be used instead of low-rank approximations (LoRA) during training. This parameter-efficient training method is built on top of LoRA and has shown promising results. Especially at lower ranks (e.g., r=4), it is expected to outperform LoRA.
Lora R
The dimension of the matrix decomposition used in LoRA.
Lora Alpha
The scaling factor for the LoRA weights.
Lora dropout
The probability of applying dropout to the LoRA weights during training.
Use RS Lora
When active, H2O LLM Studio uses Rank-Stabilized LoRA which sets the LoRA adapter scaling factor to lora_alpha/math.sqrt(lora_r). The creators suggest that this works especially better for very large ranks. Otherwise, it will use the original default value of lora_alpha/lora_r.
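The two scaling conventions side by side, as a small sketch:

```python
# Classic LoRA scaling vs. rank-stabilized (RS) LoRA scaling.
import math

lora_alpha, lora_r = 16, 4
default_scaling = lora_alpha / lora_r            # 4.0
rslora_scaling = lora_alpha / math.sqrt(lora_r)  # 8.0
```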
Lora target modules
The modules in the model to apply the LoRA approximation to. Defaults to all linear layers.
Lora unfreeze layers
An optional list of backbone layers to unfreeze during training. By default, all backbone layers are frozen when training with LoRA; here, certain layers can be additionally trained, such as the embedding or head layer. Full layer names will be matched against selected substrings. Only available with LoRA training.
Save checkpoint
Specifies how H2O LLM Studio should save the model checkpoints.
When set to Last, it always saves the last checkpoint; this is the recommended setting.
When set to Best it saves the model weights for the epoch exhibiting the best validation metric.
- This setting should be turned on with care as it has the potential to lead to overfitting of the validation data.
- The default goal should be to attempt to tune models so that the last epoch is the best epoch.
- Suppose an evident decline for later epochs is observed in logging. In that case, it is usually better to adjust hyperparameters, such as reducing the number of epochs or increasing regularization, instead of turning this setting on.
When set to Each evaluation epoch, it saves the model weights for each evaluation epoch.
- This can be useful for debugging and experimenting, but will consume more disk space.
- Models uploaded to Hugging Face Hub will only contain the last checkpoint.
- Local downloads will contain all checkpoints.
When set to Disable, no checkpoint is saved at all. This can be useful for debugging and experimenting in order to save disk space, but it disables certain functionalities like chatting or pushing to HF.
Evaluation epochs
Defines the number of epochs H2O LLM Studio uses before each validation loop for model training. In other words, it determines the frequency (in a number of epochs) to run the model evaluation on the validation data.
- Increasing the number of Evaluation Epochs can speed up an experiment.
- The Evaluation epochs setting is only available if the Save Best Checkpoint setting is turned Off.
- Can be a fraction of an epoch
Evaluate before training
This option lets you evaluate the model before training, which can help you judge the quality of the LLM backbone before fine-tuning.
Train validation data
Defines whether the model should use the entire train and validation dataset during model training. When turned On, H2O LLM Studio uses the whole train dataset and validation data to train the model.
- H2O LLM Studio still evaluates the model on the provided validation fold; validation is always performed only on the provided validation fold.
- H2O LLM Studio uses both datasets for model training if you provide a train and validation dataset.
- To define a training dataset, use the Train dataframe setting.
- To define a validation dataset, use the Validation dataframe setting.
- Turning On the Train validation data setting should produce a model that you can expect to perform better because H2O LLM Studio trained the model on more data. Note, however, that using the entire train and validation dataset generally causes the model's accuracy to be overstated, as information from the validation data is incorporated into the model during the training process.
Augmentation settings
Token mask probability
Defines the probability with which input text tokens are randomly masked during training.
- Increasing this setting can be helpful to avoid overfitting and apply regularization
- Each token is randomly replaced by a masking token based on the specified probability
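Conceptually, this works as in the following sketch; the token and mask IDs are illustrative.

```python
# Replace each input token with a mask token id with the given probability.
import torch

input_ids = torch.tensor([101, 2054, 2003, 1037, 2944, 102])
mask_token_id = 0
token_mask_probability = 0.1

mask = torch.rand(input_ids.shape) < token_mask_probability
masked_ids = input_ids.masked_fill(mask, mask_token_id)
```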
Skip parent probability
If Parent Column is set, this random augmentation will skip parent concatenation during training at each parent with the specified probability.
Random parent probability
While training, each sample will be concatenated to a random other sample, simulating unrelated chained conversations. Can be specified without using a Parent Column.
Neftune noise alpha
Adds noise to the input embeddings, as proposed in NEFTune: Noisy Embeddings Improve Instruction Finetuning (https://arxiv.org/abs/2310.05914).
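A hedged sketch of the noise described in the NEFTune paper, where uniform noise is scaled by alpha / sqrt(sequence_length * hidden_dim):

```python
# Add uniform noise, scaled as in the NEFTune paper, to the input embeddings.
import torch

def neftune_noise(embeddings, noise_alpha=5.0):
    seq_len, hidden_dim = embeddings.shape[-2], embeddings.shape[-1]
    scale = noise_alpha / (seq_len * hidden_dim) ** 0.5
    return embeddings + torch.zeros_like(embeddings).uniform_(-scale, scale)

noisy = neftune_noise(torch.randn(1, 10, 32))
```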
Prediction settings
Metric
Defines the metric to evaluate the model's performance.
We provide several metric options for evaluating the performance of your model. The options depend on the selected Problem Type:
Causal Language Modeling, DPO Modeling, Sequence to Sequence Modeling
- In addition to the BLEU and the Perplexity score, we offer GPT metrics that utilize the OpenAI API to determine whether the predicted answer is more favorable than the ground truth answer.
- To use these metrics, you can either export your OpenAI API key as an environment variable before starting LLM Studio, or you can specify it in the Settings Menu within the UI.
Causal Classification Modeling
- AUC: Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC).
- Accuracy: Compute the accuracy of the model.
- LogLoss: Compute the log loss of the model.
Causal Regression Modeling
- MSE: Compute Mean Squared Error of the model.
- MAE: Compute Mean Absolute Error of the model.
Metric GPT model
Defines the OpenAI model endpoint for the GPT metric.
Metric GPT template
The template to use for GPT-based evaluation. Note that for mt-bench, the validation dataset will be replaced accordingly; to approximate the original implementation as closely as possible, we suggest using gpt-4-0613 as the GPT judge model and 1024 for the max length inference.
Min length inference
Defines the min length value H2O LLM Studio uses for the generated text.
- This setting impacts the evaluation metrics and should depend on the dataset and average output sequence length that is expected to be predicted.
Max length inference
Defines the max length value H2O LLM Studio uses for the generated text.
- Similar to the Max Length setting in the tokenizer settings section, this setting specifies the maximum number of tokens to predict for a given prediction sample.
- This setting impacts the evaluation metrics and should depend on the dataset and average output sequence length that is expected to be predicted.
Batch size inference
Defines the size of the mini-batch used during an iteration of inference. The batch size is applied per GPU.
Do sample
Determines whether to sample from the next token distribution instead of choosing the token with the highest probability. If turned On, the next token in a predicted sequence is sampled based on the probabilities. If turned Off, the highest probability is always chosen.
Num beams
Defines the number of beams to use for beam search. The default value of Num Beams is 1 (a single beam), which means no beam search.
A higher Num Beams value can increase prediction runtime while potentially improving accuracy.
Temperature
Defines the temperature to use for sampling from the next token distribution during validation and inference. In other words, the defined temperature controls the randomness of predictions by scaling the logits before applying softmax. A higher temperature makes the distribution more random.
- Modify the temperature value if you have the Do Sample setting enabled (On).
- To learn more about this setting, refer to the following article: How to generate text: using different decoding methods for language generation with Transformers.
Repetition penalty
The parameter for repetition penalty. 1.0 means no penalty. See https://arxiv.org/pdf/1909.05858.pdf for more details.
Stop tokens
Will stop generation at the occurrence of these additional tokens; multiple tokens should be separated by a comma (,).
Top K
If > 0, only keep the top k tokens with the highest probability (top-k filtering).
Top P
If < 1.0, only keep the top tokens with cumulative probability >= top_p (nucleus filtering).
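For reference, these prediction settings roughly map onto the arguments of a Hugging Face generate() call, as in the following sketch with a placeholder backbone and illustrative values.

```python
# Illustrative only: generation arguments corresponding to the settings above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Explain beam search in one sentence.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    min_new_tokens=2,        # Min length inference
    max_new_tokens=256,      # Max length inference
    do_sample=True,          # Do sample
    num_beams=1,             # Num beams
    temperature=0.7,         # Temperature
    repetition_penalty=1.2,  # Repetition penalty
    top_k=50,                # Top K
    top_p=0.9,               # Top P
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```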
Environment settings
GPUs
Determines the list of GPUs H2O LLM Studio can use for the experiment. GPUs are listed by name, referring to their system ID (starting from 1).
Mixed precision
Determines whether to use mixed-precision. When turned Off, H2O LLM Studio does not use mixed-precision.
Mixed-precision is a technique that helps decrease memory consumption and increases training speed.
Compile model
Compiles the model with Torch. Experimental!
Find unused parameters
In Distributed Data Parallel (DDP) mode, prepare_for_backward() is called at the end of the DDP forward pass. It traverses the autograd graph to find unused parameters when find_unused_parameters is set to True in the DDP constructor.
Note that traversing the autograd graph introduces extra overhead, so applications should only set this to True when necessary.
Trust remote code
Trust remote code. This can be necessary for some models that use code which is not (yet) part of the transformers package. You should always first try running with this option switched Off.
Hugging Face branch
The Hugging Face Branch defines which branch to use in a Hugging Face repository. The default value is "main".
Number of workers
Defines the number of workers H2O LLM Studio uses for the DataLoader. In other words, it defines the number of CPU processes to use when reading and loading data to GPUs during model training.
Seed
Defines the random seed value that H2O LLM Studio uses during model training. It defaults to -1, an arbitrary value. When the value is modified (not -1), the random seed allows results to be reproducible—defining a seed aids in obtaining predictable and repeatable results every time. Otherwise, not modifying the default seed value (-1) leads to random numbers at every invocation.
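Fixing a seed is equivalent to the usual Python/NumPy/PyTorch seeding pattern, sketched below.

```python
# Generic reproducibility sketch: seed all relevant random number generators.
import random

import numpy as np
import torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
```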
Logging settings
Log step size
Specifies the interval for logging during training. Two options are available:
- Absolute: The default setting. Uses the total number of training samples processed as the x-axis for logging.
- Relative: Uses the proportion of training data seen so far as the x-axis for logging.
Log all ranks
If used, the local logging will include the output of all ranks (DDP mode).
Logger
Defines the logger type that H2O LLM Studio uses for model training.
Options
- None
- H2O LLM Studio does not use any logger.
- Neptune
- H2O LLM Studio uses Neptune as a logger to track the experiment. To use Neptune, you must specify a Neptune API token in the settings or as a NEPTUNE_API_TOKEN environment variable, and a Neptune project.
- W&B
- H2O LLM Studio uses W&B as a logger to track the experiment. To use W&B, you must specify a W&B API key in the settings or as a WANDB_API_KEY environment variable, and a W&B project and W&B entity.
Neptune project
Defines the Neptune project to access if you selected Neptune in the Logger setting.
W&B project
This is the name of the project in your W&B account.
W&B entity
This is the name of the entity (user name or organization name) in your W&B account. If you are using W&B as a logger, you will need to set this.