By deferring the initialization of model weights, you can significantly reduce the time-to-first-loss, leading to faster iteration times and a more efficient training process. This guide describes how to defer weight initialization when constructing a model with the Trainer class. By the end of this guide, you will understand how to implement deferred initialization and the advantages it brings to your model training.
Typically, you construct a torch.nn.Module and pass it into the Trainer. However, this requires a concrete torch.nn.Module object to pass into the Trainer, and due to PyTorch's eager nature, initializing a model can be very time consuming, especially for extremely large models.
To improve your experience, we provide a mechanism for deferring your model's weight initialization. To use it, pass a function to the model argument of the Trainer that takes no arguments and returns a torch.nn.Module object.
For example, you can pass in a lambda function as follows.
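A minimal sketch is shown below. The Trainer import path, the MyModel class, and its constructor arguments are placeholders for your own setup, and any other arguments your Trainer configuration requires are omitted.

```python
from cerebras.modelzoo import Trainer  # assumed import path; adjust to your ModelZoo version

from my_project.models import MyModel  # hypothetical torch.nn.Module subclass

trainer = Trainer(
    # Instead of constructing the model eagerly (model=MyModel(...)),
    # pass a zero-argument callable. The Trainer invokes it only when it
    # is ready to initialize weights, deferring the expensive work.
    model=lambda: MyModel(hidden_size=4096, num_layers=32),
    # ... any other Trainer arguments your configuration requires ...
)
```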
Doing so allows the Trainer to employ the efficient weight initialization mechanism built into the Cerebras PyTorch API.
Empirically, deferring model weight initialization can reduce the time-to-first-loss (the time it takes to get the first loss value back from the Wafer-Scale Cluster) by over 50%. This means less time spent waiting and faster iteration overall.
The Trainer can also accept a function for the optimizer argument, which is expected to take in a torch.nn.Module and return an Optimizer object.
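For example, you can define a function that builds the optimizer from the materialized model. This is a sketch: cstorch.optim.SGD and its arguments are assumptions based on the Cerebras PyTorch API, and any optimizer construction from your own setup works the same way.

```python
import cerebras.pytorch as cstorch  # assumed: Cerebras PyTorch API


def make_optimizer(model):
    # The Trainer calls this with the materialized torch.nn.Module, so the
    # optimizer is only constructed once the model actually exists.
    return cstorch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)


trainer = Trainer(
    model=lambda: MyModel(hidden_size=4096, num_layers=32),
    optimizer=make_optimizer,
)
```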
Similarly, the Trainer can also accept a function for the schedulers argument, which is expected to take in an Optimizer object and return a Scheduler object.
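Continuing the sketch above, the scheduler can be deferred in the same way. The LinearLR class under cstorch.optim.lr_scheduler and its parameter names are assumptions; use whichever Scheduler your configuration calls for.

```python
def make_scheduler(optimizer):
    # The Trainer calls this with the constructed Optimizer, so the
    # scheduler is likewise created only after the optimizer exists.
    # LinearLR and its parameter names are assumptions; substitute your own.
    return cstorch.optim.lr_scheduler.LinearLR(
        optimizer,
        initial_learning_rate=0.01,
        end_learning_rate=0.001,
        total_iters=1000,
    )


trainer = Trainer(
    model=lambda: MyModel(hidden_size=4096, num_layers=32),
    optimizer=make_optimizer,
    schedulers=make_scheduler,
)
```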
To learn more about how to configure a Trainer instance using a YAML configuration file, you can check out:

To learn more about how to use the Trainer in some core workflows, you can check out:

To learn more about how you can extend the capabilities of the Trainer class, you can check out: