Dynamic loss scaling is needed when fp16_type is either float16 or cbfloat16; it is not needed for bfloat16. For supported precision formats on the Cerebras Wafer-Scale cluster, see Control numerical precision level.
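For illustration, here is a minimal sketch of how the precision setting might look in a Model Zoo params YAML; the placement under the model section is an assumption based on common Model Zoo configs, so adjust it to match your own params file:

```yaml
model:
  # Choose the 16-bit format; dynamic loss scaling applies to
  # float16 and cbfloat16, but not to bfloat16.
  fp16_type: cbfloat16
```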
Dynamic Loss Scaling (DLS) is supported when training with cbfloat16 precision. This can improve training speed and stability.
To activate Dynamic Loss Scaling, set the value of the loss_scaling_factor YAML hyperparameter, an optimizer parameter available within the Model Zoo, to “dynamic”:
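For example, a minimal sketch of the optimizer section of a params YAML; the other keys shown are illustrative placeholders, so keep your existing optimizer settings and only add or change loss_scaling_factor:

```yaml
optimizer:
  optimizer_type: AdamW       # illustrative; keep your existing optimizer settings
  learning_rate: 1.0e-4       # illustrative
  loss_scaling_factor: "dynamic"   # enables Dynamic Loss Scaling
```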
If you are resuming from a checkpoint that was trained with bfloat16 and without loss scaling, you need to include the --load_checkpoint_states flag (or its equivalent in your run configuration) to make sure the parameters are loaded correctly from the params.yaml file.
Once you’ve loaded your model and trained it with the new dynamic loss scaling, any checkpoints you save afterwards will automatically include this feature and won’t require the special flag anymore.
When you build your own training loop with the Cerebras PyTorch API, Dynamic Loss Scaling is implemented in the cstorch.amp.GradScaler module. Supported parameters include:
loss_scale: Set to “dynamic” to activate dynamic scaling.
initial_loss_scale: Defines the starting scaling factor. Default value: 2e15.
steps_per_increase: Controls the frequency of scaling factor increments. Default value: 2000.
min_loss_scale: Sets the lower bound for the scaling factor. Default value: 2e-14.
max_loss_scale: Sets the upper bound for the scaling factor. Default value: 2e15.
max_gradient_norm: For dynamic loss scaling with global gradient clipping.
These parameters are configured in the amp.GradScaler constructor during initialization by passing the appropriate arguments:
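For example, a sketch of constructing the scaler with the parameters listed above; the values shown are the documented defaults, and the import line assumes the usual import cerebras.pytorch as cstorch alias:

```python
import cerebras.pytorch as cstorch

grad_scaler = cstorch.amp.GradScaler(
    loss_scale="dynamic",      # activate dynamic loss scaling
    initial_loss_scale=2e15,   # starting scaling factor
    steps_per_increase=2000,   # steps between scale increases
    min_loss_scale=2e-14,      # lower bound for the scaling factor
    max_loss_scale=2e15,       # upper bound for the scaling factor
)
```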
In your training loop, use amp.GradScaler to automatically adjust the loss value (scaling up or down) before feeding it to the optimizer. This helps maintain numerical stability and can improve training speed. See the code below:
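As a sketch of one training step, assuming the scaler follows the familiar GradScaler interface (scale, step, update) and that model, loss_fn, batch, and optimizer are defined elsewhere in your script:

```python
inputs, targets = batch
loss = loss_fn(model(inputs), targets)

# Scale the loss so small gradients stay representable in
# 16-bit precision during the backward pass.
grad_scaler.scale(loss).backward()

# Step the optimizer through the scaler; if an overflow is
# detected, the step is skipped and the scale is lowered.
grad_scaler.step(optimizer)

# Adjust the loss scale for the next iteration.
grad_scaler.update()
```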
You can use these parameters with the amp.GradScaler constructor to build your own training pipeline. We encourage using DLS when training with cbfloat16 for increased throughput and faster training.