Mixtral is a family of Sparse Mixture of Experts (MoE) models that use routing and expert specialization for scalable language modeling. Rather than activating every expert for every token, only a `top_k` subset of experts is selected per token during the forward pass.
Mixtral models are effective for tasks requiring high capacity — such as long-context reasoning, coding, and instruction following — while remaining efficient at inference time.
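To make the routing step concrete, here is a minimal PyTorch sketch of learned top-k expert selection. The class name, layer layout, and tensor shapes are illustrative assumptions rather than the ModelZoo implementation; the sketch only shows how a `top_k` subset of experts is chosen and weighted per token.

```python
# Minimal sketch of learned top-k routing (illustrative; not the ModelZoo code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        # One routing logit per expert for each token.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: [batch, seq_len, hidden_size]
        logits = self.gate(x)              # [batch, seq_len, num_experts]
        probs = F.softmax(logits, dim=-1)
        # Keep only the top_k experts per token; the others are never evaluated,
        # which is what keeps the MoE layer sparse at inference time.
        top_probs, top_idx = probs.topk(self.top_k, dim=-1)
        # Renormalize so the selected experts' weights sum to 1 per token.
        top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)
        return top_probs, top_idx


router = TopKRouter(hidden_size=64, num_experts=8, top_k=2)
weights, experts = router(torch.randn(1, 4, 64))
print(experts.shape)  # torch.Size([1, 4, 2]): 2 experts chosen per token
```

Each token's output is then a weighted sum of the outputs of its selected experts, so compute grows with `top_k` rather than with the total number of experts.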
The code for this model is located in the `/mixtral` directory within ModelZoo. Here's how it's organized:

- `gpt2_model.py`: the underlying model implementation, which Mixtral shares with the GPT-2 model.
The following configurations are provided:

| Configuration | Description |
| --- | --- |
| `params_mixtral_8x7b.yaml` | Mixtral model with 8 experts of size 7B each. |
| `params_mixtral_8x22b.yaml` | Mixtral model with 8 experts of size 22B each. |
| `params_moe_111M_base.yaml` | Small-scale MoE model with 111M parameters. |
| `params_moe_111M_with_shared_expert.yaml` | 111M model with a shared expert enabled. |
The key MoE-specific parameters in these configurations are listed below; an illustrative sketch of how they fit into a params file follows the list.

- `num_experts`: Defines the total number of experts in the model.
- `top_k`: Specifies how many experts are selected for each token during routing.
- `num_shared_experts` (`Optional[int]`): Number of shared experts that are always active for every token, in addition to the routed experts.
- `null_expert_bias` (`Optional[float]`): Bias applied to a "null expert", used when `top_k=1`. The null expert represents the model's uncertainty or its decision that "none of the above" is the best option. This bias enhances gradient flow back to the router, leading to better performance.
- `routing_algorithm` (`Literal["hash", "learned"]`): Selects how tokens are assigned to experts, either with a deterministic hash-based scheme or with a learned router.
- `router_selection_nonlinearity` (`Literal["sigmoid", "sinkhorn", "softmax"]`): Non-linearity used to compute expert selection probabilities with the `"learned"` routing method.
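As an illustration of how these fields fit together, the MoE-related portion of a params file might look like the sketch below. The nesting and values here are hypothetical; refer to the shipped YAML files listed above for the actual structure and defaults.

```yaml
# Hypothetical sketch; the real nesting and defaults come from the YAML files above.
model:
  num_experts: 8                   # total experts per MoE layer
  top_k: 2                         # experts activated per token during routing
  num_shared_experts: 1            # optional always-active expert(s)
  routing_algorithm: "learned"     # "hash" or "learned"
  router_selection_nonlinearity: "softmax"  # used with the learned router
  # null_expert_bias: 0.1          # only applicable when top_k is 1
```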