Mixtral is a family of Sparse Mixture of Experts (MoE) models that use routing and expert specialization for scalable language modeling. Rather than activating every expert for every token, only a `top_k` subset of experts is selected per token during the forward pass.
Mixtral models are effective for tasks requiring high capacity — such as long-context reasoning, coding, and instruction following — while remaining efficient at inference time.
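To make the routing step concrete, here is a minimal PyTorch sketch of learned top-k expert selection. The class name, layer layout, and tensor shapes are illustrative assumptions rather than the ModelZoo implementation; the sketch only shows how a `top_k` subset of experts is chosen and weighted per token.

```python
# Minimal sketch of learned top-k routing (illustrative; not the ModelZoo code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        # One routing logit per expert for each token.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: [batch, seq_len, hidden_size]
        logits = self.gate(x)              # [batch, seq_len, num_experts]
        probs = F.softmax(logits, dim=-1)
        # Keep only the top_k experts per token; the others are never evaluated,
        # which is what keeps the MoE layer sparse at inference time.
        top_probs, top_idx = probs.topk(self.top_k, dim=-1)
        # Renormalize so the selected experts' weights sum to 1 per token.
        top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)
        return top_probs, top_idx


router = TopKRouter(hidden_size=64, num_experts=8, top_k=2)
weights, experts = router(torch.randn(1, 4, 64))
print(experts.shape)  # torch.Size([1, 4, 2]): 2 experts chosen per token
```

Each token's output is then a weighted sum of the outputs of its selected experts, so compute grows with `top_k` rather than with the total number of experts.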
The code for this model is located in the `/mixtral` directory within ModelZoo. Here's how it's organized:

- `gpt2_model.py`: the underlying model implementation, which Mixtral shares with the GPT-2 model.
The following configurations are provided:

| Configuration | Description |
| --- | --- |
| `params_mixtral_8x7b.yaml` | Mixtral model with 8 experts of size 7B each. |
| `params_mixtral_8x22b.yaml` | Mixtral model with 8 experts of size 22B each. |
| `params_moe_111M_base.yaml` | Small-scale MoE model with 111M parameters. |
| `params_moe_111M_with_shared_expert.yaml` | 111M model with a shared expert enabled. |
The key MoE-specific parameters in these configurations are listed below; an illustrative sketch of how they fit into a params file follows the list.

- `num_experts`: Defines the total number of experts in the model.
- `top_k`: Specifies how many experts are selected for each token during routing.
- `num_shared_experts` (`Optional[int]`): Number of shared experts that are always active for every token, in addition to the routed experts.
- `null_expert_bias` (`Optional[float]`): Bias applied to a "null expert", used when `top_k=1`. The null expert represents the model's uncertainty or its decision that "none of the above" is the best option. This bias enhances gradient flow back to the router, leading to better performance.
- `routing_algorithm` (`Literal["hash", "learned"]`): Selects how tokens are assigned to experts, either with a deterministic hash-based scheme or with a learned router.
- `router_selection_nonlinearity` (`Literal["sigmoid", "sinkhorn", "softmax"]`): Non-linearity used to compute expert selection probabilities with the `"learned"` routing method.
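As an illustration of how these fields fit together, the MoE-related portion of a params file might look like the sketch below. The nesting and values here are hypothetical; refer to the shipped YAML files listed above for the actual structure and defaults.

```yaml
# Hypothetical sketch; the real nesting and defaults come from the YAML files above.
model:
  num_experts: 8                   # total experts per MoE layer
  top_k: 2                         # experts activated per token during routing
  num_shared_experts: 1            # optional always-active expert(s)
  routing_algorithm: "learned"     # "hash" or "learned"
  router_selection_nonlinearity: "softmax"  # used with the learned router
  # null_expert_bias: 0.1          # only applicable when top_k is 1
```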