Note: Please see this guide for instructions on how to conduct data preprocessing for Llama 3.3 70B.
null_expert_bias is a parameter that represents the model’s uncertainty, or a “none of the above” option, when routing. Including a null expert probability in the weighting calculation improves gradient flow back to the router, which improves loss, especially when only the single top expert (top_k=1) is selected. Users can still choose between normalizing expert weights into a probability distribution and using the raw router scores as attention-like weights. The added null expert probability works with both approaches.
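
To see why a null expert can help when top_k=1, consider the following minimal sketch. It is not the library's implementation: the function name route_with_null_expert, its arguments, and the choice to treat null_expert_bias as one extra logit in the router softmax are all assumptions made for illustration.

```python
import math


def route_with_null_expert(logits, null_expert_bias, top_k=1, normalize=True):
    """Return [(expert_index, weight), ...] for the top_k real experts.

    Hypothetical helper: assumes the null expert is an extra logit that
    takes part in the softmax but is never selected for dispatch.
    """
    # Softmax over the real experts plus the null expert (last position).
    extended = list(logits) + [null_expert_bias]
    peak = max(extended)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in extended]
    total = sum(exps)
    probs = [e / total for e in exps]
    null_prob = probs[-1]

    # Select the top_k real experts; the null expert is excluded here.
    ranked = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]

    if normalize:
        # Renormalize over the chosen experts plus the null expert, so the
        # weights sum to less than 1 and still depend on the router logits.
        denom = sum(probs[i] for i in chosen) + null_prob
        return [(i, probs[i] / denom) for i in chosen]

    # Raw mode: use the softmax scores directly as attention-like weights.
    return [(i, probs[i]) for i in chosen]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]
    print(route_with_null_expert(logits, null_expert_bias=0.0))
    print(route_with_null_expert(logits, null_expert_bias=0.0, normalize=False))
```

Under these assumptions, normalized top_k=1 routing without a null expert always gives the selected expert a weight of exactly 1, so the weight carries no gradient back to the router. With the null expert in the denominator, the weight falls below 1 and varies with the router logits.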