Efficient Weight Initialization
Learn how to enhance weight initialization efficiency and speed for large-scale models using advanced Cerebras techniques.
This guide explores a method to initialize model parameters more efficiently by leveraging Cerebras hardware acceleration. Initializing parameters on a CPU is typically slow for large models and can run into memory constraints: parameters may not fit in available RAM, spilling over into swap memory or, in some cases, failing to allocate entirely. To mitigate these issues, we introduce an approach that uses a `cerebras.pytorch.backend` instance to initialize parameters on a Cerebras device, much as one would use a `torch.device`.
For example:
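A minimal sketch of this pattern, assuming the `cerebras.pytorch` package (imported here as `cstorch`) and a placeholder `Model` class standing in for your own `torch.nn.Module`:

```python
import cerebras.pytorch as cstorch

# Construct a backend instance targeting Cerebras hardware ("CSX").
backend = cstorch.backend("CSX")

# Initializing the model inside the backend's device context moves each
# parameter to the Cerebras device as it is created, much like building
# a model under a torch.device context manager.
with backend.device:
    model = Model()  # Model is a placeholder for your torch.nn.Module
```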
This method automatically moves each parameter to the Cerebras device as it is created, which frees host memory for subsequent parameters, keeps overall memory usage low, and speeds up initialization.
This approach simplifies the process of achieving more efficient weight initialization. For a deeper understanding of the underlying mechanics, refer to the following subsections.
Lazy Weight Initialization
Lazy initialization improves the initialization process by tracing a model's initialization rather than executing it eagerly. It also removes redundant computations that occur before initialization, significantly decreasing the time to first loss. This functionality transforms how models are initialized, fostering a more efficient, resource-aware approach to training.
To invoke lazy initialization, initialize your model inside the backend's device context manager:
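A minimal sketch, assuming the same `cstorch` import and placeholder `Model` class as above:

```python
import cerebras.pytorch as cstorch

backend = cstorch.backend("CSX")

# Entering the device context traces parameter initialization instead of
# executing it eagerly; the traced graph is evaluated later.
with backend.device:
    model = Model()  # placeholder for your torch.nn.Module
```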
Initializing the model from within the backend's device context manager allows us to fully capture the initialization graph and compute it at a later time. Doing so creates many opportunities to optimize the initialization, such as:

- Rapidly initializing weights and kicking off model compilation early
- Minimizing the underlying computational effort
- Parallelizing weight initialization computations
- Substantially reducing overall memory usage
- Reducing the number of file write operations needed
Consequently, lazy initialization is the default setting, automatically applied when a model is initialized within the backend device’s context manager.
If you encounter numerical or convergence issues, you can disable this feature to check whether model initialization is the cause.
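One way to do this is via a backend flag set before the backend is created; the attribute name below is an assumption, so check the `cstorch.backends` reference for your release:

```python
import cerebras.pytorch as cstorch

# Hypothetical flag name for illustration; consult your release's
# cstorch.backends documentation for the exact attribute.
cstorch.backends.csx.debug.lazy_initialization = False
```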
Disabling lazy initialization shifts the scheme closer to eager model initialization, which substantially lengthens the time needed to initialize the model and delays the first results from the cluster.
To learn more about how the benefits described above are achieved, read the following subsections.
Parallelizing Weight Initialization
Once the model's initialization has been traced, we divide the weight initialization graph into distinct subgraphs for efficiency. Since these subgraphs have no dependencies between them, they can be executed in parallel.
By default, four initialization subgraphs are executed in parallel. This default was chosen empirically using data from many different experiments.
To change the number of subgraphs that are initialized at once, set the following:
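(The attribute name below is an assumption, not a confirmed API name; consult your release's configuration reference for the exact knob.)

```python
import cerebras.pytorch as cstorch

# Hypothetical attribute name for illustration: limit the number of
# initialization subgraphs executed concurrently (the default is 4).
cstorch.backends.csx.performance.max_initialization_parallelism = 2
```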
Increasing the parallelization may cause out-of-memory issues: in the worst case, N extremely large weights could be initialized at once, where N is the parallelization level. This is typically unlikely, as smaller weights outnumber large ones. If out-of-memory issues do occur, lowering the maximum parallelization level via the configuration variable shown above might help.
Deterministic Pseudo-Random Weight Initialization
When weights are initialized in parallel, multiple random sampling operations may execute simultaneously, raising concerns about whether weight initialization remains deterministic.
The adopted solution is to assign a unique pseudo-random number generator to each random sampling operation. These generators are individually seeded using the default random number generator, allowing for deterministic seeding with `torch.manual_seed(seed)`.
Example transformation for deterministic behavior:
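Consider an in-place sampling op as it appears in model code (a sketch; the layer shape and distribution parameters are illustrative):

```python
import torch

model = torch.nn.Linear(128, 128)

# As written, this in-place op samples from torch's default (global) RNG.
with torch.no_grad():
    model.weight.normal_(mean=0.0, std=1.0)
```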
We inject a generator so that what is effectively executed is:
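(A sketch of the effective result, using the same illustrative layer as above; the actual generator injection happens inside the captured initialization graph.)

```python
import torch

model = torch.nn.Linear(128, 128)

# A fresh generator is dedicated to this op and seeded from the default
# RNG, so results stay reproducible under torch.manual_seed(seed) even
# when many such ops run in parallel.
generator = torch.Generator()
generator.manual_seed(torch.randint(0, 2**63 - 1, ()).item())

with torch.no_grad():
    model.weight.normal_(mean=0.0, std=1.0, generator=generator)
```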
This is done for every random operator in the initialization graph, making the initialization subgraphs completely independent of one another. Consequently, even when multiple weights are initialized in parallel, the outcomes are deterministic.
Optimizing Weight Initialization
After capturing and dividing the initialization graph into subgraphs, there’s an opportunity to refine and optimize these subgraphs prior to their execution.
For example, take the initialization of the following weight:
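(A representative sketch; the layer and distribution parameters are illustrative.)

```python
import torch

model = torch.nn.Linear(128, 128)

with torch.no_grad():
    model.weight.normal_(mean=0.0, std=1.0)  # redundant: overwritten below
    model.weight.uniform_(-0.05, 0.05)       # final values come from here
```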
The call to `normal_` is redundant because the very next call to `uniform_` overwrites all the values in `model.weight` anyway. Eliminating the call to `normal_` therefore does not affect the final result, maintaining the same outcome while streamlining the process.
By default, this optimization process prunes a specific set of operations known to overwrite all the values of the input tensor. These operations include, but are not limited to:
- `fill`
- `normal`
- `random`
- `uniform`
- `zero`
If you come across a situation where numerical values, such as the weight distribution, do not align with the expected outcome, turn off the initialization optimization by configuring the following setting:
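(As with the lazy-initialization flag, the attribute name below is an assumption; consult your release's `cstorch.backends` reference for the exact flag.)

```python
import cerebras.pytorch as cstorch

# Hypothetical flag name for illustration; consult your release's
# cstorch.backends documentation for the exact attribute.
cstorch.backends.csx.debug.optimize_initialization = False
```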
Implementation Notes
- Ensure that the Cerebras backend is correctly initialized and accessible.
- When working with large models, monitor memory usage to prevent overflow.
- Know which operations are pruned by default and how this affects the initialization process.
Best Practices
- While lazy initialization can speed up the setup process, be aware of its implications and switch to eager initialization if you face issues related to model convergence or performance.
- Validate the initialization's impact on model training to ensure that efficiency gains don't compromise model performance.
- Ensure that your system has enough memory to support the chosen level of parallelization without running into out-of-memory issues.