
Set Up Hyperparameters


Hyperparameters control how the model’s weights are updated during the training process. To make configuration easier, we categorize hyperparameters into 5 distinct groups based on their function and relevance:

Group 1 - General

The core settings of your training process.

| Name | Description | Type | Supported value |
| --- | --- | --- | --- |
| Batch size | The number of examples the model processes in one forward and backward pass before updating its weights. Larger batches make each step slower, but may produce more stable results. In distributed training, this is the batch size on each device. | Int | [1, +∞) |
| Epochs | An epoch is a single complete pass through your entire training data. You will typically run multiple epochs so the model can iteratively refine its weights. | Int | [1, +∞) |
| Learning rate | Controls the size of each update made to the model's learned parameters. | Float | (0, 1) |
| Max sequence length | Maximum input length. Longer sequences are truncated to this value. | Int | [1, +∞) |
| Distributed backend | Backend to use for distributed training. | Enum[string] | DDP, DeepSpeed |
| ZeRO stage | The DeepSpeed ZeRO stage to apply. Applies only when Distributed backend = DeepSpeed. | Enum[int] | 1, 2, 3 |
| Training type | Which parameter mode to use. | Enum[string] | Full, LoRA |
| Resume from checkpoint | Relative path of the checkpoint that the training engine will resume from. | Union[bool, string] | No, Last checkpoint, Path/to/checkpoint |
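
Because Batch size is counted per device in distributed training, the number of weight updates per epoch depends on the device count as well. The sketch below estimates it under two stated assumptions that the page does not confirm: no gradient accumulation is used, and the last partial batch of each epoch is kept. The function name and arguments are illustrative, not part of the platform.

```python
import math

def training_steps(num_examples, batch_size, num_devices, epochs):
    """Estimate update steps for a run.

    Assumes no gradient accumulation and that the final partial
    batch of each epoch still triggers an update.
    """
    # Batch size is per device, so the global batch grows with device count.
    global_batch = batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / global_batch)
    total_steps = steps_per_epoch * epochs
    return global_batch, steps_per_epoch, total_steps

# 10,000 examples, batch size 8 on each of 4 devices, 3 epochs
print(training_steps(num_examples=10_000, batch_size=8, num_devices=4, epochs=3))
# → (32, 313, 939)
```

An estimate like this is useful when choosing step-based settings such as LR warmup steps or Checkpoint steps.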

Group 2 - Training runtime

Optimize the efficiency and performance of your training.

| Name | Description | Type | Supported value |
| --- | --- | --- | --- |
| Gradient accumulation steps | Number of steps over which gradients are accumulated before performing a weight update. | Int | [1, +∞) |
| Mixed precision | Type of mixed precision to use. | Enum[string] | Bf16, Fp16, None |
| Quantization bit | The number of bits used to quantize the model with on-the-fly quantization. Currently only applicable when Training type = LoRA. | Enum[string] | None |
| Optimizer | Optimizer to use for training. | Enum[string] | Adamw, Sgd |
| Weight decay | Weight decay to apply to the optimizer. | Float | [0, +∞) |
| Max gradient norm | Maximum norm for gradient clipping. | Float | [0, +∞) |
| Disable gradient checkpointing | Whether or not to disable gradient checkpointing. | Bool | True, False |
| Flash attention v2 | Whether to use flash attention version 2. Currently only False is supported. | Bool | False |
| LR warmup steps | Number of steps used for a linear warmup from 0 to Learning rate. | Int | [0, +∞) |
| LR warmup ratio | Ratio of total training steps used for a linear warmup. | Float | [0, 1) |
| LR scheduler | Learning rate scheduler to use. | Enum[string] | Linear, Cosine, Constant |
| Full determinism | Ensures reproducible results in distributed training. Important: this negatively impacts performance, so use it only for debugging. If True, setting Seed will not take effect. | Bool | True, False |
| Seed | Random seed for reproducibility. | Int | [0, +∞) |
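
To illustrate how LR warmup steps, LR warmup ratio, and LR scheduler interact, here is a minimal sketch of a linear warmup from 0 to the Learning rate followed by a linear, cosine, or constant phase. It follows the common definitions of these schedules; the platform's exact implementation is not shown on this page, so the decay to zero at the final step and the rounding up of the warmup ratio are assumptions, and the function names are illustrative.

```python
import math

def warmup_steps_from_ratio(total_steps, warmup_ratio):
    # Assumption: the ratio is converted to a step count by rounding up.
    return math.ceil(total_steps * warmup_ratio)

def lr_at_step(step, total_steps, base_lr, warmup_steps=0, scheduler="linear"):
    """Learning rate at a given update step (0-indexed)."""
    # Linear warmup from 0 to base_lr.
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    if scheduler == "constant":
        return base_lr
    # Fraction of the post-warmup phase that has elapsed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    if scheduler == "linear":
        return base_lr * (1.0 - progress)
    if scheduler == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown scheduler: {scheduler}")

# Inspect a few points of a 1,000-step cosine schedule with 100 warmup steps.
for s in (0, 50, 100, 550, 1000):
    print(s, lr_at_step(s, total_steps=1000, base_lr=1e-4,
                        warmup_steps=100, scheduler="cosine"))
```

The rate climbs during the first 100 steps, peaks at the configured Learning rate, and then decays toward zero.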

Group 3 - DPO

Enable this group when using trainer = DPO.

| Name | Description | Type | Supported value |
| --- | --- | --- | --- |
| DPO label smoothing | The label smoothing parameter for robust DPO. Must be between 0 and 0.5. | Float | [0, 0.5] |
| Preference beta | The beta parameter in the preference loss. | Float | [0, 1] |
| Preference fine-tuning mix | The SFT loss coefficient in DPO training. | Float | [0, 10] |
| Preference loss | The type of DPO loss to use. | Enum[string] | Sigmoid, Hinge, Ipo, Kto pair, Orpo, Simpo |
| SimPO gamma | The target reward margin in SimPO loss. Used only when applicable. | Float | (0, +∞) |
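
As a rough illustration of how Preference beta and DPO label smoothing enter the loss, the sketch below computes the widely used sigmoid DPO loss with label smoothing for a single preference pair. It is assumed, not confirmed by this page, that Preference loss = Sigmoid corresponds to this formula; the function and argument names are hypothetical, and the inputs are summed log-probabilities of the chosen and rejected responses under the policy and reference models.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp,
                     beta=0.1, label_smoothing=0.0):
    """Sigmoid DPO loss for one preference pair (illustrative sketch)."""
    # How much more the policy prefers the chosen response than the reference does.
    logits = ((policy_chosen_logp - policy_rejected_logp)
              - (ref_chosen_logp - ref_rejected_logp))
    # label_smoothing is the assumed probability that the preference label is flipped.
    return (-(1.0 - label_smoothing) * log_sigmoid(beta * logits)
            - label_smoothing * log_sigmoid(-beta * logits))

# A policy that prefers the chosen response more strongly than the reference
# gets a loss below log(2), the value for an undecided policy.
print(dpo_sigmoid_loss(-12.0, -20.0, -15.0, -18.0, beta=0.1))
```

A larger beta sharpens the penalty for deviating from the preference ordering, while a label smoothing of 0.5 makes the loss symmetric, which is why values above 0.5 are not meaningful.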

Group 4 - LoRA

Enable this group when using Training type = LoRA.

| Name | Description | Type | Supported value |
| --- | --- | --- | --- |
| Merge adapter | Whether or not to merge the LoRA adapter into the base model to produce the final model. If not, only the LoRA adapter will be saved after training is done. | Bool | True, False |
| LoRA alpha | Alpha parameter for LoRA. | Int | [1, +∞) |
| LoRA dropout | Dropout rate for LoRA. | Float | [0, 1] |
| LoRA rank | Rank of the LoRA matrices. | Int | [1, +∞) |
| Target modules | Target modules for quantization or fine-tuning. | String | All linear |
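
To give a feel for what LoRA rank and LoRA alpha control, the sketch below computes the number of trainable parameters that a rank-r adapter adds to one linear layer, and the scaling factor alpha / rank applied to the adapter output in the standard LoRA formulation. Whether the platform uses this exact scaling is an assumption; the function names and the 4096 x 4096 layer size are illustrative.

```python
def lora_trainable_params(d_in, d_out, rank):
    # LoRA adds two matrices per adapted layer: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)

def lora_scaling(alpha, rank):
    # Standard LoRA scales the adapter update B @ A by alpha / rank.
    return alpha / rank

d_in, d_out, rank, alpha = 4096, 4096, 16, 32
full = d_in * d_out
adapter = lora_trainable_params(d_in, d_out, rank)
print(adapter, full, adapter / full)   # → 131072 16777216 0.0078125
print(lora_scaling(alpha, rank))       # → 2.0
```

In this example the adapter trains under 1% of the parameters of the layer it adapts, which is the main reason LoRA needs less memory than Full training.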

Group 5 - Others

Control how fine-tuning progress is tracked and saved.

| Name | Description | Type | Supported value |
| --- | --- | --- | --- |
| Checkpoint strategy | The checkpoint save strategy to adopt during training. "best" is only applicable when Evaluation strategy is not "no". | Enum[string] | No, Epoch, Steps |
| Checkpoint steps | Number of training steps between two checkpoint saves if Checkpoint strategy = Steps. | Int | [1, +∞) |
| Evaluation strategy | The evaluation strategy to adopt during training. | Enum[string] | No, Epoch, Steps |
| Evaluation steps | Number of update steps between two evaluations if Evaluation strategy = Steps. Defaults to the same value as Logging steps if not set. | Int | [1, +∞) |
| No. of checkpoints | If a value is passed, it limits the total number of checkpoints. | Int | [1, +∞) |
| Save best checkpoint | Whether or not to track and keep the best checkpoint. Currently only False is supported. | Bool | False |
| Logging steps | Number of steps between logging events, including stdout logs and MLflow data points. Logging steps = -1 means log on every step. | Int | [0, +∞) |

Alternatively, you can set hyperparameters quickly by switching on the JSON toggle:
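
When editing hyperparameters as JSON, it can help to check values against the supported ranges listed in the tables before submitting. The sketch below does this for a handful of settings. The JSON key names (`batch_size`, `learning_rate`, and so on) are hypothetical, since this page does not show the platform's actual JSON schema; only the ranges and enum values are taken from the tables.

```python
import json

# (low, high, low_inclusive, high_inclusive); high=None means unbounded.
# Key names are hypothetical; ranges come from the tables on this page.
RANGES = {
    "batch_size": (1, None, True, True),
    "epochs": (1, None, True, True),
    "learning_rate": (0, 1, False, False),
    "lora_dropout": (0, 1, True, True),
    "dpo_label_smoothing": (0, 0.5, True, True),
}
ENUMS = {
    "training_type": {"Full", "LoRA"},
    "distributed_backend": {"DDP", "DeepSpeed"},
    "zero_stage": {1, 2, 3},
    "lr_scheduler": {"Linear", "Cosine", "Constant"},
}

def validate(config):
    """Return a list of human-readable problems found in config."""
    errors = []
    for key, value in config.items():
        if key in RANGES:
            low, high, low_inc, high_inc = RANGES[key]
            too_low = value < low or (value == low and not low_inc)
            too_high = high is not None and (
                value > high or (value == high and not high_inc))
            if too_low or too_high:
                errors.append(f"{key}={value} is outside the supported range")
        elif key in ENUMS and value not in ENUMS[key]:
            errors.append(f"{key}={value!r} is not a supported value")
    # ZeRO stage applies only when the distributed backend is DeepSpeed.
    if "zero_stage" in config and config.get("distributed_backend") != "DeepSpeed":
        errors.append("zero_stage requires distributed_backend = DeepSpeed")
    return errors

config = json.loads('{"batch_size": 8, "learning_rate": 1, "training_type": "LoRA"}')
print(validate(config))
```

Here the learning rate of 1 is reported because the supported range (0, 1) excludes both endpoints.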
