Model Description

Mistral is a family of decoder-only transformer models optimized for efficiency and throughput while preserving strong general performance. Architecturally, Mistral builds on the transformer decoder backbone with several key enhancements: it adopts grouped-query attention (GQA) for faster inference, adds sliding window attention (SWA) to scale efficiently to long sequences, and utilizes SwiGLU activation functions. These models are well-suited for instruction following, reasoning, summarization, and coding tasks.

Mistral's architecture is very similar to LLaMA's, except that:

  • It uses grouped-query attention (GQA), in which several query heads share each key/value head, reducing the number of key/value heads and the size of the KV cache.
  • It applies sliding window attention (SWA) with a 4K window, so each token attends only to a local window of recent tokens over long sequences (both GQA and SWA are sketched after this list).
  • It supports a higher default maximum sequence length (MSL) of 32K, rather than LLaMA’s 4K.
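
The first two differences can be illustrated in a few lines of plain PyTorch. The sketch below is a minimal illustration only, not the ModelZoo implementation: grouped_query_attention and sliding_window_causal_mask are hypothetical helper names, the tensor sizes are toy values, and RoPE, dropout, and output projections are omitted.

```python
import torch
import torch.nn.functional as F


def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where attention is allowed: causal and within `window` positions."""
    pos = torch.arange(seq_len)
    dist = pos[:, None] - pos[None, :]      # query index minus key index
    return (dist >= 0) & (dist < window)    # attend only to the last `window` tokens


def grouped_query_attention(q, k, v, window: int):
    """q: [B, Hq, T, D]; k, v: [B, Hkv, T, D], with Hq an integer multiple of Hkv."""
    _, hq, t, d = q.shape
    hkv = k.shape[1]
    # GQA: each key/value head is shared by Hq // Hkv query heads.
    k = k.repeat_interleave(hq // hkv, dim=1)
    v = v.repeat_interleave(hq // hkv, dim=1)

    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    mask = sliding_window_causal_mask(t, window).to(scores.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


# Toy sizes; Mistral 7B itself uses 32 query heads, 8 key/value heads, and a 4K window.
B, T, D, HQ, HKV = 1, 16, 8, 4, 2
out = grouped_query_attention(
    torch.randn(B, HQ, T, D),
    torch.randn(B, HKV, T, D),
    torch.randn(B, HKV, T, D),
    window=8,
)
print(out.shape)  # torch.Size([1, 4, 16, 8])
```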

For more details on each technique, see the original papers in the References section.

Code Structure

The code for this model is located in the /mistral directory within ModelZoo. Here’s how it’s organized:

  • /configs: Contains YAML configuration files.
  • model.py: The implementation of the Mistral model.

Our implementation of Mistral is built on top of our GPT-2 implementation. For more details, see gpt2_model.py.

Available Configurations

Configuration                      Description
params_mistral_7B.yaml             7B parameter Mistral model.
params_mistral_7B_msl128k.yaml     7B parameter Mistral model with 128K MSL.
params_mistral_12b.yaml            12B parameter Mistral model.
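
To see how one of these files is laid out before editing it, a plain PyYAML load is enough. The snippet below is a hedged sketch: it assumes the directory layout described under Code Structure (mistral/configs/ inside the ModelZoo checkout) and uses no ModelZoo-specific API; adjust the path to your checkout.

```python
import yaml

# Path is an assumption based on the layout described above; adjust as needed.
with open("mistral/configs/params_mistral_7B.yaml") as f:
    params = yaml.safe_load(f)

# List the top-level sections and their keys to see how the file is organized
# before editing values such as the maximum sequence length.
for section, value in params.items():
    print(section, "->", sorted(value) if isinstance(value, dict) else value)
```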

Workflow

For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.

For a complete list of Cerebras ModelZoo CLI commands, see the command reference.

References