Reproducibility
Reproducibility is an essential component of training ML models.
The Trainer class provides a way to enable determinism across runs, if desired.
On this page, you will learn how to configure the Trainer to ensure reproducibility.
Prerequisites
Make sure you have read Trainer Overview and Trainer Configuration Overview, which give a basic overview of how to run Model Zoo models. This document uses the tools and configurations described in those pages.
Trainer Seed
The Trainer class supports reproducibility by building on PyTorch's seed settings. Although you can set the torch seed manually outside the Trainer class, it is strongly recommended that you use the seed argument of the Trainer class to handle this for you. The following example shows how you can set the seed to 1234:
If the seed is not provided or is None (the default value), determinism across runs is not ensured.
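The idea behind the seed argument can be illustrated with a toy sketch in plain Python. It does not use the Model Zoo Trainer. `ToyTrainer` is a hypothetical stand-in, and Python's `random` module stands in for torch's global RNG. With a fixed seed, two runs draw identical values. With `seed=None`, the RNG is not reseeded and the values generally differ.

```python
import random
from typing import List, Optional


class ToyTrainer:
    """Hypothetical stand-in for a trainer that owns RNG seeding.

    Python's `random` module plays the role of torch's global RNG here;
    this is not the Model Zoo Trainer API.
    """

    def __init__(self, seed: Optional[int] = None) -> None:
        self.seed = seed

    def run(self, num_values: int = 3) -> List[float]:
        # Reseed only when a seed is given, mirroring the
        # "None means no determinism guarantee" behavior described above.
        if self.seed is not None:
            random.seed(self.seed)
        # Stand-in for anything random during a run (init, dropout, shuffling).
        return [random.random() for _ in range(num_values)]


run_a = ToyTrainer(seed=1234).run()
run_b = ToyTrainer(seed=1234).run()
assert run_a == run_b  # same seed, same "random" values
```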
Torch modules initialize their weights when they are instantiated, so setting the seed after a Module has already been instantiated does not guarantee determinism. To avoid this pitfall, pass the Trainer class a callable that returns a torch Module instead of an already-constructed model instance. The Trainer sets the seed before invoking the callable, which ensures that weight initialization is reproducible. This is in line with deferred weight initialization, as described in Defer Weight Initialization.
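The difference between passing a callable and passing an instance can be sketched without torch. The sketch assumes a hypothetical `ToyModule` that draws its "weights" from Python's `random` module at construction time, mimicking how torch modules initialize on instantiation. The functions `train_with_callable` and `train_with_instance` are also hypothetical and only model the order in which seeding and construction happen.

```python
import random
from typing import Callable, List


class ToyModule:
    """Hypothetical module whose 'weights' are drawn when it is constructed."""

    def __init__(self, size: int = 4) -> None:
        # Like torch modules, weights are initialized at instantiation time.
        self.weights: List[float] = [random.gauss(0.0, 1.0) for _ in range(size)]


def train_with_callable(model_fn: Callable[[], ToyModule], seed: int) -> ToyModule:
    # Seed first, then build the model: initialization is covered by the seed.
    random.seed(seed)
    return model_fn()


def train_with_instance(model: ToyModule, seed: int) -> ToyModule:
    # Seeding here is too late: the weights were already drawn.
    random.seed(seed)
    return model


# Deferred construction: identical weights on every run.
a = train_with_callable(ToyModule, seed=1234)
b = train_with_callable(ToyModule, seed=1234)
assert a.weights == b.weights

# Eager construction: simulate different RNG state before each run starts.
random.seed(0)
c = train_with_instance(ToyModule(), seed=1234)
random.seed(1)
d = train_with_instance(ToyModule(), seed=1234)
assert c.weights != d.weights  # same Trainer seed, different weights
```

The eager path fails because the seed is applied after the random draws it was meant to control, which is why deferring construction behind a callable matters.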
For a given run, the seed settings may affect any of the following:
- The order of input data;
- The global seed captured in the graph, which may affect the values generated by random ops in the model;
- The compile hash. For example, a model that contains a random op such as Dropout may have a different compile hash for different seed settings. To avoid unnecessary recompiles, make sure to set the Trainer seed.
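The first item, the order of input data, can be illustrated with a small stdlib sketch. `epoch_order` is a hypothetical helper and not part of the Trainer. The real data pipeline may shuffle differently. With a fixed seed, every run visits samples in the same order. With `seed=None`, `random.Random` seeds itself from system entropy, so the order generally changes between runs.

```python
import random
from typing import List, Optional


def epoch_order(num_samples: int, seed: Optional[int]) -> List[int]:
    """Hypothetical shuffling helper: returns sample indices in visit order."""
    indices = list(range(num_samples))
    # A dedicated Random instance keeps shuffling independent of other RNG users;
    # with seed=None it is seeded from OS entropy, so the order varies per run.
    rng = random.Random(seed)
    rng.shuffle(indices)
    return indices


# Same seed, same data order across runs.
assert epoch_order(10, seed=1234) == epoch_order(10, seed=1234)
```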
Conclusion
Ensuring reproducibility in ML model training is crucial for consistent, reliable results. By using the seed argument of the Trainer class, you can achieve deterministic behavior across runs. This guide has shown how to configure the Trainer for reproducibility using both YAML and Python.
Further Reading
To learn more about how you can extend the capabilities of the Trainer class, you can check out: