The Trainer class provides a way to enable determinism across runs, if desired.

On this page, you will learn how to configure the Trainer to ensure reproducibility.

Prerequisites

Make sure you have read through the Trainer Overview and Trainer Configuration Overview, which provide a basic overview of how to run Model Zoo models. In this document, you will use the tools and configurations outlined in those pages.

Trainer Seed

The Trainer supports configuring reproducibility by building on PyTorch's seed settings. While it is possible to set the torch seed manually outside of the Trainer class, it is strongly recommended to use the seed argument of the Trainer class to handle that for you. The following example shows how you can set the seed to 1234:
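The seed can be set either in the trainer's YAML configuration or directly in Python. The snippets below are minimal sketches: the trainer.init.seed YAML field, the cerebras.modelzoo import path, and the omission of the remaining Trainer arguments are assumptions based on the conventions described in Trainer Configuration Overview, not a complete configuration.

```yaml
trainer:
  init:
    seed: 1234  # ensures deterministic behavior across runs
    # ... remaining trainer configuration (model, optimizer, loop, etc.) omitted
```

```python
# Assumed import path; adjust to match your Model Zoo installation.
from cerebras.modelzoo import Trainer

trainer = Trainer(
    seed=1234,  # the Trainer seeds torch before constructing the model
    # ... remaining Trainer arguments (model, optimizer, loop, etc.) omitted
)
```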

If the seed is not provided or is None (the default value), determinism across runs is not ensured.

Torch modules initialize their weights upon instantiation. Setting the seed after a module has already been instantiated may not ensure determinism. To avoid this pitfall, instead of passing an already-constructed model instance to the Trainer class, pass a callable that returns a torch.nn.Module. The Trainer sets the seed before invoking the callable, thus ensuring reproducibility. This is in line with deferred weight initialization, as described in Defer Weight Initialization.
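The sketch below contrasts the two approaches. It is illustrative only: the import path and the toy torch.nn.Linear model are placeholders, and the other Trainer arguments are again omitted.

```python
import torch
# Assumed import path; adjust to match your Model Zoo installation.
from cerebras.modelzoo import Trainer

# Discouraged: the weights of `model` are initialized here, before the
# Trainer has had a chance to apply the seed, so runs may not be reproducible.
model = torch.nn.Linear(16, 2)
trainer = Trainer(model=model, seed=1234)

# Recommended: pass a callable that returns the module. The Trainer sets the
# seed first and only then invokes the callable, so weight initialization
# (and hence the run) is reproducible.
trainer = Trainer(model=lambda: torch.nn.Linear(16, 2), seed=1234)
```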

For a given run, the seed settings may affect any of the following:

  • The order of input data;

  • The global seed captured in the graph, which may affect the values generated by random ops in the model;

  • The compile hash. For example, a model that has a random op, such as Dropout, may have a different compile hash for different seed settings. To avoid unnecessary recompiles, make sure to set the trainer seed.

Conclusion

Ensuring reproducibility in ML model training is crucial for consistency and reliability of results. By leveraging the seed argument in the Trainer class, you can achieve deterministic behavior across runs. This guide has provided step-by-step instructions on configuring the Trainer for reproducibility using both YAML and Python.

Further Reading

To learn more about how to extend the capabilities of the Trainer class, check out: