This page covers how to configure the numeric precision used by the Trainer and ways you can mitigate some of the adverse effects of using lower precision.
Prerequisites
Please ensure that you have read through the Trainer Overview beforehand. The rest of this page assumes that you already have at least a cursory understanding of what the Cerebras Model Zoo Trainer is and how to use the Python API. Also, read through the Gradient scaling guide for more context on the concepts used below.

Automatic Mixed Precision
Using a lower precision for computing activations while storing weights in a higher precision is a good way to get improved performance whilst keeping some numeric stability. To facilitate this, you can construct a MixedPrecision instance and pass it into the Trainer’s precision argument as follows.
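The following is a minimal sketch of what this could look like. The import paths and the fp16_type keyword are assumptions based on the Model Zoo’s Python API and may need to be adjusted for your installation; the remaining Trainer arguments (model, optimizer, etc.) are omitted for brevity.

```python
# Sketch: construct a MixedPrecision instance and pass it to the Trainer
# via its `precision` argument.
# NOTE: the import paths and the `fp16_type` keyword are assumptions;
# adjust them to match your Cerebras Model Zoo installation.
from cerebras.modelzoo.trainer import Trainer
from cerebras.modelzoo.trainer.callbacks import MixedPrecision

trainer = Trainer(
    # ... other Trainer arguments (model, optimizer, loop, etc.) ...
    precision=MixedPrecision(
        fp16_type="cbfloat16",  # lower precision type used for activations
    ),
)
```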
In this example, the lower precision type is set to cbfloat16. The supported lower precision values include:
- float16
- bfloat16
- cbfloat16
About CB16 Half-Precision
CB16 is Cerebras’ 16-bit format, also referred to as cbfloat16. It’s a floating-point format with a 6-bit exponent and 9-bit explicit mantissa. This allows for double the dynamic range of FP16. With 1 more bit for the exponent compared to FP16, CB16 provides a bigger range with the following benefits:
- Denormals are far less frequent.
- Dynamic loss scaling is not necessary on many networks.
The cbfloat16 data format is different from the bfloat16 Brain Floating Point format.

Gradient Scaling
When using a lower numeric precision, you will often encounter gradient underflow. To mitigate this, you can employ gradient scaling (see Gradient scaling for a more in-depth explanation). To configure gradient scaling, you can pass in the loss_scaling_factor argument to MixedPrecision as follows:
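A hedged sketch, using the same assumed imports and keywords as the example above:

```python
# Sketch: enable dynamic loss scaling to counteract gradient underflow.
# `loss_scaling_factor` is named in the text above; the import paths and
# `fp16_type` keyword are assumptions.
from cerebras.modelzoo.trainer import Trainer
from cerebras.modelzoo.trainer.callbacks import MixedPrecision

trainer = Trainer(
    # ... other Trainer arguments ...
    precision=MixedPrecision(
        fp16_type="cbfloat16",
        loss_scaling_factor="dynamic",  # or a float such as 2.0**15 for static scaling
    ),
)
```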
loss_scaling_factor accepts either a float for static loss scaling or the string "dynamic" to enable dynamic loss scaling (see Dynamic loss scaling for more details).
Gradient Clipping
Even with all of the above, you may encounter exploding gradients (inf or NaN gradients). To mitigate this, you can employ gradient clipping.
To configure gradient clipping, you can pass in one of max_gradient_norm or max_gradient_value to MixedPrecision as follows:
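A sketch of gradient clipping by global norm (imports assumed as in the earlier examples):

```python
# Sketch: clip gradients by global norm. Alternatively, pass
# `max_gradient_value` to clip gradients by value instead; the two
# options are mutually exclusive.
from cerebras.modelzoo.trainer import Trainer
from cerebras.modelzoo.trainer.callbacks import MixedPrecision

trainer = Trainer(
    # ... other Trainer arguments ...
    precision=MixedPrecision(
        fp16_type="cbfloat16",
        loss_scaling_factor="dynamic",
        max_gradient_norm=1.0,  # clip the global gradient norm to 1.0
    ),
)
```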
max_gradient_norm and max_gradient_value are mutually exclusive, so only one may be passed in.

Precision Optimization Level (POL)
One additional setting you can configure to improve performance is the precision optimization level (POL) used by the Cerebras compiler. The POL determines the level of precision to use for the model’s weights and activations and can thus affect the model’s accuracy and performance. You can set the precision optimization level by passing in the precision_opt_level argument to MixedPrecision as follows:
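A sketch under the same assumptions as the previous examples:

```python
# Sketch: set the precision optimization level used by the Cerebras compiler.
from cerebras.modelzoo.trainer import Trainer
from cerebras.modelzoo.trainer.callbacks import MixedPrecision

trainer = Trainer(
    # ... other Trainer arguments ...
    precision=MixedPrecision(
        fp16_type="cbfloat16",
        precision_opt_level=1,  # one of 0, 1, or 2; 1 is the default
    ),
)
```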
precision_opt_level accepts one of the values [0, 1, 2]. The precision optimization level is set to 1 by default.
Conclusion
That is all there is to know about configuring numeric precision in the Trainer!
Further Reading
To learn more about how you can extend the capabilities of the Trainer class, you can check out:

