Learn how to train an LLM with maximal update parameterization (μP).
| μTransferred Across (they define the training scale) | μTransferable | Not μTransferable (they depend on model and data size) |
|---|---|---|
| Width, Depth, Batch Size, Training Time, Sequence Length | Optimization Params, Per-Layer Initialization Variance, Parameter Multipliers | Regularization Params |
- `mup_base_hidden_size` (required to enable μP): The hidden size of the proxy model.
- `mup_base_filter_size` (required to enable μP): The filter size of the proxy model.
For example, when the 40M model is the proxy, set `mup_base_hidden_size = 256` and `mup_base_filter_size = 1024` in the config of the target model, since these correspond to the `hidden_size` and `filter_size` of the 40M model config.
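For instance, the target config carries the proxy dimensions alongside its own. In the sketch below, the `model` section name and the target `hidden_size`/`filter_size` values are illustrative assumptions, not values from this guide:

```yaml
model:
  hidden_size: 2048             # width of the target model (illustrative)
  filter_size: 8192             # FFN width of the target model (illustrative)
  mup_base_hidden_size: 256     # hidden_size of the 40M proxy config
  mup_base_filter_size: 1024    # filter_size of the 40M proxy config
```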
In addition, the initialization variance of the output projection layers is scaled by `2 * num_(encoder/decoder)_blocks` to aid with transferring across depth.
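Read concretely, and assuming the common convention of dividing the output-projection initialization variance by the number of residual branches, this scaling amounts to:

$$
\sigma_{\text{output\_proj}}^2 = \frac{\sigma_{\text{base}}^2}{2 \cdot \text{num\_blocks}}
$$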
- `embeddings_scale`: Scales the embedding hidden states (i.e. the tensor after embeddings & embedding layer norm are applied). Recommended to tune for stabilizing gradient flow during μP training.
- `output_logits_alpha`: Constant applied to the output logits scalar in μP training. The output logits are scaled by `output_logits_alpha * mup_base_hidden_size / hidden_size`. Recommended to tune for stabilizing output logits in μP training.
- `scale_qk_dot_by_d`: Scales the attention QK dot product by `d` instead of `sqrt(d)`. Must be enabled for μP training.
- `attention_logits_alpha`: Scales the attention QK dot product by the specified value. Recommended to tune for stabilizing attention logits in μP training.
- `scale_output_logits_by_d`: Scales the output logits in μP by `mup_base_hidden_size / hidden_size` if `True` and by `sqrt(mup_base_hidden_size / hidden_size)` if `False`. It is traditionally set to `True` in the μP implementation of this model.
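Putting these scalars together, a sketch of the resulting scalings (here $d$ is the attention head dimension and $W_{\text{out}}\,h$ the unscaled output logits; both symbols are notation introduced for this sketch, not taken from the implementation):

$$
\text{attention logits} = \texttt{attention\_logits\_alpha} \cdot \frac{q^{\top} k}{d},
\qquad
\text{output logits} = \texttt{output\_logits\_alpha} \cdot \frac{\texttt{mup\_base\_hidden\_size}}{\texttt{hidden\_size}} \cdot W_{\text{out}}\, h
$$

When `scale_output_logits_by_d` is `False`, the width ratio in the output logits is replaced by its square root. Independently, the embedding hidden states are multiplied by `embeddings_scale` before entering the decoder.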
- `embedding`: Targets the embedding weights.
- `decoder_attention`: Targets the dense layers in the decoder (Q, K, V, and output projections).
- `decoder_input_ffn`: Targets the first of the two FFN blocks in the decoder.
- `decoder_output_ffn`: Targets the final FFN block in the decoder.
These learning rate adjustment groups are defined in the `create_default_lr_adjustment_groups` function in the `model.py` of a given model. `<custom-value>` is a placeholder for a hyperparameter value that you can determine via sweeping or some other means; it is different for each parameter.
Appending those params to the final config:
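Below is a minimal sketch of what the appended parameters might look like. The `model`/`optimizer` section names and the `adjust_learning_rate` grouping key are assumptions about the config layout rather than values taken from this guide; every `<custom-value>` is a placeholder to be determined by sweeping:

```yaml
model:
  # μP base (proxy) dimensions taken from the 40M config
  mup_base_hidden_size: 256
  mup_base_filter_size: 1024
  # Tunable μP scalars
  embeddings_scale: <custom-value>
  output_logits_alpha: <custom-value>
  attention_logits_alpha: <custom-value>
  # Required / conventional flags for μP
  scale_qk_dot_by_d: True
  scale_output_logits_by_d: True

optimizer:
  # Per-group learning rate multipliers; group names come from
  # create_default_lr_adjustment_groups in the model's model.py
  adjust_learning_rate:
    embedding: <custom-value>
    decoder_attention: <custom-value>
    decoder_input_ffn: <custom-value>
    decoder_output_ffn: <custom-value>
```

Note that `scale_qk_dot_by_d` must be enabled for μP and `scale_output_logits_by_d` is traditionally set to `True`, while the `<custom-value>` entries are exactly the scalars recommended for tuning above.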
μP training is supported together with ALiBi, RoPE, Relative, and Fixed position embeddings, with efficient attention architectures like Multi-Query Attention, and with activation functions such as SwiGLU.
Please refer to the following guides for a detailed breakdown of model-specific μP params:
Start from the hyperparameter sweep on the 40M base model and select the configurations with the best validation loss. If the goal is to transfer the hyperparameters to the 13B model scale, it is advised to first transfer them to an intermediate model size such as 256M or 590M.
At the 256M or 590M scale, run training with the top-10 configs for 20 tokens per parameter and evaluate the models on a validation set. When picking the best configuration based on the validation loss metric, it is recommended not to look only at the lowest loss value but also at the training dynamics.
In our experience, some configurations may reach very low loss values but show a few instabilities in the training dynamics (which manifest as loss spikes or an exploding gradient norm). In that scenario, it is best to pick a configuration with the second- or third-best validation loss that exhibits stable training dynamics.
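For a sense of scale, the token budgets implied by the 20 tokens-per-parameter rule above work out to (arithmetic, not figures quoted in this guide):

$$
256\text{M} \times 20 \approx 5.1\text{B tokens},
\qquad
590\text{M} \times 20 \approx 11.8\text{B tokens}
$$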
Consult the base μP configuration for the GPT-3 model, available in the 40M config. This configuration was used internally for our hyperparameter optimization, and we subsequently applied μTransfer to scale up to the 256M model. For different model architectures such as Llama and Falcon, it is necessary to develop a new 40M configuration that aligns with their specific structural requirements.
If there is a significant change in the model architecture, such as changing the position embeddings from Relative to RoPE or ALiBi, or changing the activation function from GeLU to SwiGLU, it is advised to redo the hyperparameter sweep at the 40M scale to pick the best configs. In such cases, using the optimal set of hyperparameters from the Cerebras-GPT paper is not recommended.