Dynamic Loss Scaling
Learn how to enable dynamic loss scaling to improve stability and performance.
Mixed-precision training offers substantial performance gains through FP16 computation; however, it introduces the risk of gradients vanishing through underflow during backpropagation. Gradient values that are representable in FP32 can fall below the smallest value representable in FP16 and flush to zero, stalling training progress.
Dynamic Loss Scaling (DLS) addresses this issue by scaling the loss before backpropagation and unscaling the resulting gradients afterwards, adjusting the scaling factor automatically as training progresses. Here’s the breakdown:
- Loss scaling: The loss value is multiplied by a scaling factor before backpropagation. This artificially inflates the gradients, preventing them from underflowing to zero in FP16.
- Backpropagation: Gradients are computed from the scaled loss, so their magnitude stays within the representable FP16 range during backpropagation.
- Unscaling: After backpropagation, the gradients are divided by the same scaling factor used in step 1. This reverses the artificial inflation, ensuring accurate updates to the network weights (see the sketch after this list).
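For intuition, here is a minimal PyTorch-style sketch of the scale/backpropagate/unscale cycle with a fixed scaling factor; the model, optimizer, and loss are placeholders, and real dynamic loss scaling additionally adjusts the factor automatically, as described next.

```python
import torch

SCALE = 2.0 ** 15  # illustrative fixed scaling factor


def scaled_step(model: torch.nn.Module, optimizer: torch.optim.Optimizer, loss: torch.Tensor) -> None:
    # 1. Loss scaling: inflate the loss so small gradients stay representable in FP16.
    (loss * SCALE).backward()

    # 2. Unscaling: divide the gradients by the same factor before the weight update.
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(SCALE)

    # 3. Apply the weight update on the unscaled gradients.
    optimizer.step()
    optimizer.zero_grad()
```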
DLS automates the choice of scaling factor, raising it while gradients stay finite and lowering it when overflows occur, which eliminates the need for manual tuning. This simplifies mixed-precision training and improves its stability.
Key benefits of DLS:
- Prevents gradient vanishing: Maintains gradient information during backpropagation, leading to improved training progress.
- Improves training stability: Reduces divergence and stalling, leading to smoother convergence.
- Simplifies mixed-precision training: Eliminates the need for manual loss scale tuning.
- Boosts performance: Can achieve faster training times with less memory usage compared to full FP32 training.
Supported Precision
Dynamic Loss Scaling should be used when the `fp16_type` is either `float16` or `cbfloat16`. It is not needed for `bfloat16`. For supported precision formats on the Cerebras Wafer-Scale cluster, see Control numerical precision level.
Enable Dynamic Loss Scaling
Dynamic Loss Scaling is available for training models with `cbfloat16` precision and can improve training speed and stability. To activate it, set the `loss_scaling_factor` field under the precision settings in the Trainer YAML configuration:
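For example, a precision block along the following lines enables dynamic loss scaling. This is a sketch: the exact nesting and the `"dynamic"` value are assumptions based on the `GradScaler` options described below, so check your release’s Trainer YAML reference for the definitive layout.

```yaml
trainer:
  init:
    precision:
      fp16_type: cbfloat16           # see Supported Precision above
      loss_scaling_factor: "dynamic" # assumed value that enables dynamic loss scaling
```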
If you’re loading a model from an older checkpoint (created before version 2.1.0) that used `bfloat16` training without loss scaling, you need to include the `--load_checkpoint_states` flag (or its equivalent in your run configuration) to make sure the parameters are loaded correctly from the `params.yaml` file.
Once you’ve loaded your model and trained it with the new dynamic loss scaling, any checkpoints you save afterwards will automatically include this feature and won’t require the special flag anymore.
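As an illustration only, the run-configuration equivalent of the flag might look like the snippet below. The `runconfig` key, the checkpoint path, and the `"model"` value are assumptions about a typical setup, not taken from this page; consult your release’s documentation for the exact form.

```yaml
runconfig:
  checkpoint_path: /path/to/pre_2.1.0_checkpoint.mdl  # hypothetical path to the old bfloat16 checkpoint
  load_checkpoint_states: "model"                     # assumption: load only model weights so fresh loss-scaling state is built from params.yaml
```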
Enable Dynamic Loss Scaling with the GradScaler Module
Dynamic Loss Scaling offers flexible configuration through the `cstorch.amp.GradScaler` module. Supported parameters include:
- `loss_scale`: Set to `"dynamic"` to activate dynamic scaling.
- `initial_loss_scale`: Defines the starting scaling factor. Default value: `2e15`.
- `steps_per_increase`: Controls the frequency of scaling factor increments. Default value: `2000`.
- `min_loss_scale`: Sets the lower bound for the scaling factor. Default value: `2e-14`.
- `max_loss_scale`: Sets the upper bound for the scaling factor. Default value: `2e15`.
- `max_gradient_norm`: For dynamic loss scaling with global gradient clipping.
To activate Dynamic Loss Scaling, pass the appropriate arguments to the `cstorch.amp.GradScaler` constructor when you initialize it:
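A minimal sketch, assuming `cerebras.pytorch` is importable as `cstorch` and that the constructor accepts the parameters listed above; the values shown are the documented defaults and are illustrative rather than required.

```python
import cerebras.pytorch as cstorch

# Construct a GradScaler with dynamic loss scaling enabled.
# Keyword names follow the parameter list above; values are illustrative.
grad_scaler = cstorch.amp.GradScaler(
    loss_scale="dynamic",
    initial_loss_scale=2e15,   # starting scaling factor
    steps_per_increase=2000,   # steps between scaling factor increases
    min_loss_scale=2e-14,      # lower bound for the scaling factor
    max_loss_scale=2e15,       # upper bound for the scaling factor
)
```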
During training, use `amp.GradScaler` to scale the loss before backpropagation and to unscale the gradients before they reach the optimizer; when non-finite gradients are detected, the update is skipped and the scale is reduced. This helps maintain numerical stability and can improve training speed. See the code below:
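A sketch of a single training step using the scaler, assuming `cstorch.amp.GradScaler` follows the familiar `scale`/`step`/`update` pattern of `torch.cuda.amp.GradScaler`; the model, optimizer, data, and loss function are placeholders.

```python
# Inside the training loop (model, optimizer, batch, and loss_fn are placeholders).
optimizer.zero_grad()

outputs = model(batch["inputs"])
loss = loss_fn(outputs, batch["labels"])

# Scale the loss so small FP16 gradients do not underflow during backpropagation.
grad_scaler.scale(loss).backward()

# Unscale the gradients and apply the optimizer step; the step is skipped
# if non-finite gradients are detected at the current scale.
grad_scaler.step(optimizer)

# Adjust the scaling factor for the next iteration.
grad_scaler.update()
```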