Model Description

Mixtral is a family of decoder-only transformer models that use a sparse Mixture of Experts (MoE) architecture to scale model capacity without a proportional increase in inference cost. Instead of activating all model parameters for every input, Mixtral routes each token to a small subset of experts (specialized feedforward networks) chosen by a learned or hash-based routing algorithm. This lets Mixtral keep per-token compute low during training and inference while significantly expanding the total parameter count.

The architecture builds on the Mistral base model, inheriting sliding window attention (SWA), grouped-query attention (GQA), SwiGLU activations, and a 32K maximum sequence length. Each MoE block contains multiple expert feedforward networks, and only a configurable top_k subset of them is selected for each token during the forward pass.
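A minimal PyTorch sketch of this routing pattern is shown below. The class and argument names (SparseMoEBlock, hidden_size, ffn_size) and the plain SiLU expert MLPs are illustrative assumptions for exposition only, not the ModelZoo implementation (see model.py for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEBlock(nn.Module):
    """Toy sparse MoE feedforward block: each token is routed to its top_k experts."""

    def __init__(self, hidden_size: int, ffn_size: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.SiLU(),  # simple stand-in for the SwiGLU FFN used in Mistral-style models
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size) -> flatten to a list of tokens
        tokens = x.reshape(-1, x.shape[-1])
        probs = F.softmax(self.router(tokens), dim=-1)          # (num_tokens, num_experts)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)   # keep only top_k experts per token
        topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Which tokens picked expert e, and in which of their top_k slots?
            token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # no token routed here; this expert does no work this step
            weight = topk_probs[token_idx, slot].unsqueeze(-1)
            out[token_idx] += weight * expert(tokens[token_idx])
        return out.reshape_as(x)
```

Each expert holds its own FFN weights, so the total parameter count grows with the number of experts, while each token only pays the compute cost of its top_k selected experts plus the router.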

Mixtral models are effective for tasks requiring high capacity — such as long-context reasoning, coding, and instruction following — while remaining efficient at inference time.

Code Structure

The code for this model is located in the /mixtral directory within ModelZoo. Here’s how it’s organized:

  • /configs: Contains YAML configuration files.
  • model.py: The implementation of the Mixtral MoE model.

Our implementation of Mixtral is built on top of our GPT-2 implementation. For more details, see gpt2_model.py.

Available Configurations

Configuration                               Description
params_mixtral_8x7b.yaml                    Mixtral model with 8 experts of size 7B each.
params_mixtral_8x22b.yaml                   Mixtral model with 8 experts of size 22B each.
params_moe_111M_base.yaml                   Small-scale MoE model with 111M parameters.
params_moe_111M_with_shared_expert.yaml     111M-parameter model with a shared expert enabled.

Expert Configuration Details

These YAML settings allow flexibility in training with different numbers of experts and degrees of specialization; a short routing example follows the list:

  • num_experts: Defines the total number of experts in the model.
  • top_k: Specifies how many experts are selected for each token during routing.
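As a concrete, hypothetical example of how these two settings interact, the snippet below uses 8 experts with top_k=2 (values chosen for illustration): the router scores every expert, but only the two highest-scoring experts run for each token.

```python
import torch

num_experts, top_k = 8, 2                     # illustrative values for a top-2-of-8 setup
router_logits = torch.randn(5, num_experts)   # 5 tokens, one routing score per expert
probs = torch.softmax(router_logits, dim=-1)
topk_probs, topk_idx = probs.topk(top_k, dim=-1)
print(topk_idx)   # for each token, the 2 experts (out of 8) whose FFNs will actually run
```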

Additional Features for Enhanced MoE Training

Our models include extra features, built on top of the Mixtral base model, that improve Mixture of Experts (MoE) training. These features can be configured in the YAML file (a routing sketch follows the list):

  • num_shared_experts (Optional[int]):
    Specifies the number of experts shared across all tokens. These shared experts are always activated and help capture common knowledge across different contexts. This concept is inspired by DeepSeekMoE.

  • null_expert_bias (Optional[float]):
    Adds an optional bias to the “null expert” probability in the routing process, which improves loss when top_k=1. The null expert represents the model’s uncertainty or its decision that “none of the above” is the best option. This bias enhances gradient flow back to the router, leading to better performance.

  • routing_algorithm (Literal["hash", "learned"]):
    Allows users to choose between hash-based routing and learned routing methods for determining which experts to activate.

  • router_selection_nonlinearity (Literal["sigmoid", "sinkhorn", "softmax"]):
    Specifies the type of non-linearity used in the routing algorithm to generate expert probabilities. This option is applicable when using the "learned" routing method.
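As a rough illustration of how these options fit into routing, the sketch below extends a learned-routing step with an optional biased null expert, a selectable nonlinearity, and always-on shared experts. This is one plausible reading of the options described above, using hypothetical function names; sinkhorn normalization is omitted, and the actual ModelZoo behavior may differ in details.

```python
from typing import Optional

import torch
import torch.nn.functional as F


def select_experts(
    router_logits: torch.Tensor,        # (num_tokens, num_experts), learned routing scores
    top_k: int,
    nonlinearity: str = "softmax",      # mirrors router_selection_nonlinearity
    null_expert_bias: Optional[float] = None,
):
    """Toy learned-routing step with an optional biased "null expert"."""
    logits = router_logits
    if null_expert_bias is not None:
        # Append a "none of the above" column and bias it. Probability mass on this
        # column keeps gradients flowing back to the router even when only a single
        # expert is selected (top_k=1).
        null_col = torch.full_like(logits[:, :1], null_expert_bias)
        logits = torch.cat([logits, null_col], dim=-1)

    if nonlinearity == "softmax":
        probs = F.softmax(logits, dim=-1)
    elif nonlinearity == "sigmoid":
        probs = torch.sigmoid(logits)
    else:
        raise NotImplementedError("sinkhorn normalization is not shown in this sketch")

    if null_expert_bias is not None:
        probs = probs[:, :-1]           # only real experts are eligible for selection

    topk_probs, topk_idx = probs.topk(top_k, dim=-1)
    return topk_probs, topk_idx


def moe_output(tokens: torch.Tensor, routed_output: torch.Tensor, shared_experts) -> torch.Tensor:
    # num_shared_experts: these experts run on every token, with no routing involved,
    # and their outputs are simply added to the routed experts' output.
    out = routed_output
    for expert in shared_experts:
        out = out + expert(tokens)
    return out
```

In a full block, routed_output would come from the top_k selected experts (as in the earlier sketch), while the shared experts bypass routing entirely and contribute to every token.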

Workflow

For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.

For a complete list of Cerebras ModelZoo CLI commands, see the command reference.

References