num_tokens. Instruction fine-tuning and downstream task evaluations are potential use cases for loss scaling by num_tokens if the sequences are "unpacked", i.e. each sequence contains one prompt-response pair.
Since the attention mask has a batch dimension, in runs with gradient accumulation or on multiple CSX systems it is appropriately reduced over all boxes and gradient accumulation micro-batches. That is, the num_tokens value used in the run correctly reflects the global batch size of the model.
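For intuition, the sketch below shows the general idea of token-based loss scaling: the summed per-token loss is divided by the number of non-padded tokens counted from the attention mask. This is an illustrative sketch only, not the framework's implementation; the function name and tensor shapes are assumptions, and in a real run the token count is accumulated over all micro-batches and devices as described above.

```python
import torch

def scale_loss_by_num_tokens(summed_loss: torch.Tensor,
                             attention_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative only: divide the summed loss by the non-padded token count.

    attention_mask is a 0/1 tensor of shape [batch, seq_len]; here the count
    covers a single batch, whereas the framework reduces it globally.
    """
    num_tokens = attention_mask.sum().clamp(min=1)  # avoid division by zero
    return summed_loss / num_tokens
```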
Enable this behavior in the model section of the configuration by setting the loss_scaling and loss_weight parameters.
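As a minimal sketch, the corresponding entries in a YAML config could look like the following. All other model parameters are omitted, and setting loss_weight to 1.0 reflects the assumption that no additional constant weighting is wanted on top of the token-count scaling:

```yaml
model:
  # Scale the loss by the number of non-padded tokens in the global batch
  loss_scaling: "num_tokens"
  # No extra constant weighting on top of the token-based scaling
  loss_weight: 1.0
```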
Scaling the loss by num_tokens in the model's configuration allows for more accurate and balanced training, which is particularly important when sequence lengths vary widely, and thereby improves the model's overall performance and robustness.