Dynamic Loss Scaling
Learn how to enable dynamic loss scaling to improve stability and performance.
Mixed-precision training offers substantial performance gains through FP16 computation; however, it introduces the risk of gradients vanishing through underflow during backpropagation. Gradient values that are representable in FP32 can fall below the smallest value representable in FP16 and flush to zero, stalling training progress.
Dynamic Loss Scaling (DLS) addresses this issue by scaling the loss before backpropagation and unscaling the resulting gradients afterwards, adjusting the scaling factor automatically as training progresses. Here’s the breakdown:
- Loss scaling: The loss value is multiplied by a scaling factor before backpropagation. This artificially inflates the gradients, preventing them from underflowing to zero in FP16.
- Backpropagation: Gradients are computed from the scaled loss, so their magnitude stays within the representable FP16 range during backpropagation.
- Unscaling: After backpropagation, the gradients are divided by the same scaling factor used in step 1. This reverses the artificial inflation, ensuring accurate updates to the network weights (see the sketch after this list).
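For intuition, here is a minimal PyTorch-style sketch of the scale/backpropagate/unscale cycle with a fixed scaling factor; the model, optimizer, and loss are placeholders, and real dynamic loss scaling additionally adjusts the factor automatically, as described next.

```python
import torch

SCALE = 2.0 ** 15  # illustrative fixed scaling factor


def scaled_step(model: torch.nn.Module, optimizer: torch.optim.Optimizer, loss: torch.Tensor) -> None:
    # 1. Loss scaling: inflate the loss so small gradients stay representable in FP16.
    (loss * SCALE).backward()

    # 2. Unscaling: divide the gradients by the same factor before the weight update.
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(SCALE)

    # 3. Apply the weight update on the unscaled gradients.
    optimizer.step()
    optimizer.zero_grad()
```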
DLS automates the choice of scaling factor, raising it while gradients stay finite and lowering it when overflows occur, which eliminates the need for manual tuning. This simplifies mixed-precision training and improves its stability.
Key benefits of DLS:
- Prevents gradient vanishing: Maintains gradient information during backpropagation, leading to improved training progress.
- Improves training stability: Reduces divergence and stalling, leading to smoother convergence.
- Simplifies mixed-precision training: Eliminates the need for manual loss scale tuning.
- Boosts performance: Can achieve faster training times with less memory usage compared to full FP32 training.
Supported Precision
Dynamic Loss Scaling should be used when the `fp16_type` is either `float16` or `cbfloat16`. It is not needed for `bfloat16`. For supported precision formats on the Cerebras Wafer-Scale cluster, see Control numerical precision level.
Enable Dynamic Loss Scaling
Dynamic Loss Scaling is available for training models with `cbfloat16` precision and can improve training speed and stability. To activate it, set the `loss_scaling_factor` field under the precision settings in the Trainer YAML configuration:
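For example, a precision block along the following lines enables dynamic loss scaling. This is a sketch: the exact nesting and the `"dynamic"` value are assumptions based on the `GradScaler` options described below, so check your release’s Trainer YAML reference for the definitive layout.

```yaml
trainer:
  init:
    precision:
      fp16_type: cbfloat16           # see Supported Precision above
      loss_scaling_factor: "dynamic" # assumed value that enables dynamic loss scaling
```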
If you’re loading a model from an older checkpoint (created before version 2.1.0) that used `bfloat16` training without loss scaling, you need to include the `--load_checkpoint_states` flag (or its equivalent in your run configuration) to make sure the parameters are loaded correctly from the `params.yaml` file.
Once you’ve loaded your model and trained it with the new dynamic loss scaling, any checkpoints you save afterwards will automatically include this feature and won’t require the special flag anymore.
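As an illustration only, the run-configuration equivalent of the flag might look like the snippet below. The `runconfig` key, the checkpoint path, and the `"model"` value are assumptions about a typical setup, not taken from this page; consult your release’s documentation for the exact form.

```yaml
runconfig:
  checkpoint_path: /path/to/pre_2.1.0_checkpoint.mdl  # hypothetical path to the old bfloat16 checkpoint
  load_checkpoint_states: "model"                     # assumption: load only model weights so fresh loss-scaling state is built from params.yaml
```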
Enable Dynamic Loss Scaling with the GradScaler Module
Dynamic Loss Scaling offers flexible configuration through the `cstorch.amp.GradScaler` module. Supported parameters include:
- `loss_scale`: Set to `"dynamic"` to activate dynamic scaling.
- `initial_loss_scale`: Defines the starting scaling factor. Default value: `2e15`.
- `steps_per_increase`: Controls the frequency of scaling factor increments. Default value: `2000`.
- `min_loss_scale`: Sets the lower bound for the scaling factor. Default value: `2e-14`.
- `max_loss_scale`: Sets the upper bound for the scaling factor. Default value: `2e15`.
- `max_gradient_norm`: For dynamic loss scaling with global gradient clipping.
To activate Dynamic Loss Scaling, pass the appropriate arguments to the `cstorch.amp.GradScaler` constructor when you initialize it:
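A minimal sketch, assuming `cerebras.pytorch` is importable as `cstorch` and that the constructor accepts the parameters listed above; the values shown are the documented defaults and are illustrative rather than required.

```python
import cerebras.pytorch as cstorch

# Construct a GradScaler with dynamic loss scaling enabled.
# Keyword names follow the parameter list above; values are illustrative.
grad_scaler = cstorch.amp.GradScaler(
    loss_scale="dynamic",
    initial_loss_scale=2e15,   # starting scaling factor
    steps_per_increase=2000,   # steps between scaling factor increases
    min_loss_scale=2e-14,      # lower bound for the scaling factor
    max_loss_scale=2e15,       # upper bound for the scaling factor
)
```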
During training, use `amp.GradScaler` to scale the loss before backpropagation and to unscale the gradients before they reach the optimizer; when non-finite gradients are detected, the update is skipped and the scale is reduced. This helps maintain numerical stability and can improve training speed. See the code below:
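A sketch of a single training step using the scaler, assuming `cstorch.amp.GradScaler` follows the familiar `scale`/`step`/`update` pattern of `torch.cuda.amp.GradScaler`; the model, optimizer, data, and loss function are placeholders.

```python
# Inside the training loop (model, optimizer, batch, and loss_fn are placeholders).
optimizer.zero_grad()

outputs = model(batch["inputs"])
loss = loss_fn(outputs, batch["labels"])

# Scale the loss so small FP16 gradients do not underflow during backpropagation.
grad_scaler.scale(loss).backward()

# Unscale the gradients and apply the optimizer step; the step is skipped
# if non-finite gradients are detected at the current scale.
grad_scaler.step(optimizer)

# Adjust the scaling factor for the next iteration.
grad_scaler.update()
```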