Learn how to train an LLM with maximal update parameterization (μP).
μP configs follow the naming convention `<name_config>_mup.yaml`. You can find SP and μP parametrizations in the Cerebras Model Zoo; for more information, refer to the configuration YAML files.
1. Execute both config files using the `run.py` script of the preferred GPT-style model.
2. Specify the path to the config file using the `--config` flag, as sketched below. For more information on launching a Cerebras job, visit the Quickstart guide.
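A minimal sketch of the two launches, assuming the configs sit in a local `configs/` directory and that `run.py` is invoked from the chosen model's directory; any additional flags your environment requires (target device, mode, credentials) are covered in the Quickstart guide:

```bash
# Launch the SP run and the μP run with the same run.py script.
# The configs/ path is an assumption; point --config at your actual files.
python run.py --config configs/<name_config>.yaml       # standard parametrization (SP)
python run.py --config configs/<name_config>_mup.yaml   # μP parametrization
```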
μTransferred Across | μTransferable | Not μTransferable |
---|---|---|
They define the training scale | They can transfer from the small to the large model | They do not work with μTransfer because they depend on model size and data size |
Theoretically demonstrated: width; empirically demonstrated: depth, batch size, training time, sequence length | Optimization-related (learning rate (LR), momentum, Adam beta, LR schedule, etc.), initialization (per-layer init. variance), parameter multipliers (multiplicative constants after weights/biases) | Regularization (dropout, weight decay, etc.) |
Parameter | Description |
---|---|
`--input_yaml` or `-i` | [Required] Configuration YAML file of the target model. This can be a standard configuration file, like the ones found in the Cerebras Model Zoo |
`--base_layer_width` or `-d_base` | [Optional] Proxy-model's width; defaults to 256 |
`--base_lr` or `-lr_base` | [Optional] Proxy-model's learning rate as determined by the hyperparameter sweep; defaults to 6e-3. Currently, config generation is supported only for sequential linear learning rate schedules: the first LR scheduler should perform linear warm-up and the second should perform linear decay |
`--base_init_std` or `-std_base` | [Optional] Proxy-model's initial standard deviation; defaults to 0.08 |
`--m_embed` or `-m_base` | [Optional] Proxy-model's embeddings multiplier; defaults to 10.0 |
`--output_yaml` or `-o` | [Optional] Output YAML file to save the μP config. If not provided, the config is stored under the same path as the input, with a `_mup` suffix |
For example, you can generate a μP config for the GPT-3 2.7B target model from `params_gpt3_2p7b.yaml` using all of the default arguments.
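A minimal sketch of that invocation is below. The script name and location (`convert_config_to_mup.py`) are assumptions and may differ across Model Zoo releases; the flags are the ones listed in the table above.

```bash
# Hypothetical invocation of the μP config-generation script, relying on the
# default proxy-model arguments (d_base=256, lr_base=6e-3, std_base=0.08, m_base=10.0).
# Adjust the script path to match your Model Zoo checkout.
python convert_config_to_mup.py --input_yaml params_gpt3_2p7b.yaml
# With no --output_yaml, the result is saved next to the input as params_gpt3_2p7b_mup.yaml.
```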
From the hyperparameter sweep, select the 40M base model configuration with the best validation loss. Say the goal is to transfer the hyperparameters to the 13B scale: it is advised to first transfer the hyperparameters to an intermediate model size such as 256M or 590M. At the 256M or 590M scale, run training with the top-10 configs for 20 tokens per parameter and evaluate the models on a validation set. When picking the best configuration from the validation loss metric, it is recommended to look not only at the lowest loss value but also at the training dynamics. In our experience, some configurations show very low loss values but a few instabilities in the training dynamics (which manifest as loss spikes or an exploding gradient norm). In that scenario, it is best to pick the configuration with the second- or third-best validation loss that exhibits stable training dynamics.
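As a rough illustration of the 20-tokens-per-parameter budget (the batch size and sequence length below are assumed values, not ones prescribed by this guide), the token and step counts at the 256M scale work out to:

```bash
# Illustrative training budget for a 256M-parameter run at 20 tokens per parameter.
# BATCH_SIZE and MAX_SEQ_LEN are assumptions; use the values from your config.
PARAMS=256000000
TOKENS=$((20 * PARAMS))                          # ~5.12B training tokens
BATCH_SIZE=256
MAX_SEQ_LEN=2048
STEPS=$((TOKENS / (BATCH_SIZE * MAX_SEQ_LEN)))   # ~9,765 steps
echo "Train for ~${STEPS} steps (~${TOKENS} tokens)"
```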
Consult the base μP configuration for the GPT-3 model, available as the 40M config. This configuration was used internally for our hyperparameter sweeps; we then applied μTransfer to scale up to the 256M model. For other model architectures, such as Llama and Falcon, you need to create a new 40M configuration that matches their structural requirements.
If there is a significant change in the model architecture, such as changing the position embeddings from `Relative` to `RoPE` or `ALiBi`, or changing the activation function from `GeLU` to `SwiGLU`, it is advised to redo the hyperparameter sweep at the 40M scale to pick the best configs. In such cases, using the optimal set of hyperparameters from the Cerebras-GPT paper is not recommended.