Mistral
Decoder-only transformer models by Mistral, using sliding window attention and grouped-query attention for fast, high-quality language generation.
Model Description
Mistral is a family of decoder-only transformer models optimized for efficiency and throughput while preserving strong general performance. Architecturally, Mistral builds on the transformer decoder backbone with several key enhancements: it adopts grouped-query attention (GQA) for faster inference, applies sliding window attention (SWA) so that attention cost stays bounded on long sequences, and uses SwiGLU activation functions in its feed-forward blocks. These models are well-suited for instruction following, reasoning, summarization, and coding tasks.
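As a point of reference for the SwiGLU feed-forward block mentioned above, here is a minimal PyTorch sketch. The layer names, sizes, and bias-free projections follow common Mistral-style implementations and are illustrative assumptions, not the ModelZoo module itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU MLP: a SiLU-gated linear unit followed by a down projection."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = W_down( SiLU(W_gate x) * W_up x )
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```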
Mistral's architecture is very similar to LLaMA's, except that:
- It uses grouped-query attention (GQA), which reduces the number of attention heads for keys and values.
- It applies sliding window attention (SWA) with a 4K window, enabling local attention over long sequences.
- It supports a higher default maximum sequence length (MSL) of 32K, rather than LLaMA’s 4K.
For more details on each technique, see the original papers in the References section.
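To make the two attention changes concrete, the following PyTorch sketch combines GQA with a sliding-window causal mask: key/value heads are expanded to match the query heads, and each position attends only to the most recent tokens within the window. The shapes, function names, and the 4096-token default are illustrative assumptions, not the ModelZoo implementation.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to position j only if i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)  # [S, 1]
    j = torch.arange(seq_len).unsqueeze(0)  # [1, S]
    return (j <= i) & (j > i - window)      # [S, S]

def grouped_query_sliding_window_attention(q, k, v, window: int = 4096):
    """q: [B, H_q, S, D]; k, v: [B, H_kv, S, D], where H_q is a multiple of H_kv."""
    _, h_q, s, d = q.shape
    h_kv = k.shape[1]
    # GQA: each group of (H_q // H_kv) query heads shares one key/value head.
    k = k.repeat_interleave(h_q // h_kv, dim=1)
    v = v.repeat_interleave(h_q // h_kv, dim=1)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5          # [B, H_q, S, S]
    mask = sliding_window_causal_mask(s, window).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))      # keep only the local causal window
    return torch.softmax(scores, dim=-1) @ v               # [B, H_q, S, D]

# Example shapes for a GQA setup: 32 query heads sharing 8 key/value heads.
q = torch.randn(1, 32, 16, 128)
k = torch.randn(1, 8, 16, 128)
v = torch.randn(1, 8, 16, 128)
out = grouped_query_sliding_window_attention(q, k, v, window=4)
```

Because each token only attends within its window, memory and compute per layer stay fixed as the sequence grows, while stacking layers lets information propagate beyond a single window.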
Code Structure
The code for this model is located in the /mistral directory within ModelZoo. Our implementation of Mistral is built on top of our GPT-2 implementation; for more details, see gpt2_model.py.
Available Configurations
Configuration | Description
---|---
params_mistral_7B.yaml | 7B parameter Mistral model.
params_mistral_7B_msl128k.yaml | 7B parameter Mistral model with 128K MSL.
params_mistral_12b.yaml | 12B parameter Mistral model.
Workflow
For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.
For a complete list of Cerebras ModelZoo CLI commands, see the command reference.
References
- Jiang, Albert, et al. (2023). Mistral 7B
- Ainslie, Joshua, et al. (2023). GQA: Training Multi-Query Transformer Models from Multi-Head Checkpoints
- Child, Rewon, et al. (2019). Generating Long Sequences with Sparse Transformers