# Transformer
Implementation of the original Transformer architecture introduced in “Attention Is All You Need”.
## Model Description
This implementation reproduces the original Transformer architecture introduced in “Attention Is All You Need” and trains it for English–German translation on the WMT16 dataset. The original paper established the now-standard building blocks of modern NLP models: multi-head self-attention, layer normalization, feed-forward networks, residual connections, and positional encodings.
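To give a feel for the core building block, the sketch below shows scaled dot-product attention split across multiple heads in plain PyTorch. It is a minimal, self-contained illustration for orientation only, not the Model Zoo implementation; the class and argument names are placeholders.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention, for illustration only."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_kv = d_model // num_heads  # per-head dimension (64 in the base config)
        # Learned projections for queries, keys, values, and the output.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_kv)
            return t.view(batch, seq_len, self.num_heads, self.d_kv).transpose(1, 2)

        q = split_heads(self.q_proj(x))
        k = split_heads(self.k_proj(x))
        v = split_heads(self.v_proj(x))

        # Scaled dot-product attention: softmax(QK^T / sqrt(d_kv)) V
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_kv)
        weights = torch.softmax(scores, dim=-1)
        context = torch.matmul(weights, v)

        # Merge heads back to (batch, seq_len, d_model) and apply the output projection.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.out_proj(context)
```

In the full model, each encoder and decoder layer wraps a block like this with a residual connection, layer normalization, and a position-wise feed-forward network.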
While this implementation shares much of its code with the T5 model, it differs in architecture, dataset, model sizes, and training objective. In particular, this model uses learned absolute positional embeddings rather than T5's relative position encodings, and it is trained on a translation task rather than T5's general text-to-text objective.
## Code Structure
The code for this model is located in the transformer directory. It reuses shared infrastructure where possible, especially components from the T5 implementation.
- `configs/`: YAML configuration files for the available Transformer model sizes and training setups.
- `data_preparation/nlp/transformer/`: Scripts for preprocessing the WMT16 English–German dataset.
## Available Configurations
| Configuration | Description |
|---|---|
| `transformer_base.yaml` | Base Transformer model with `d_kv=64`, `num_heads=8`, and `encoder_num_hidden_layers=6`. |
| `transformer_large.yaml` | Large Transformer model with `d_kv=64`, `num_heads=16`, and `encoder_num_hidden_layers=6`. |
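As a quick sanity check on these hyperparameters, the model width is the product of the head count and the per-head dimension, which recovers the hidden sizes of the original base (8 × 64 = 512) and large (16 × 64 = 1024) models. The snippet below only illustrates that relationship; it does not parse the actual YAML files, and the dictionary keys mirror the table above rather than the exact config schema.

```python
# Illustrative only: relate the per-head dimension and head count to the model width.
# Keys mirror the table above; the real YAML schema may differ.
configs = {
    "transformer_base.yaml": {"d_kv": 64, "num_heads": 8},
    "transformer_large.yaml": {"d_kv": 64, "num_heads": 16},
}

for name, cfg in configs.items():
    d_model = cfg["d_kv"] * cfg["num_heads"]  # 512 for base, 1024 for large
    print(f"{name}: d_model = {d_model}")
```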
## Workflow
For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.
For a complete list of Cerebras Model Zoo CLI commands, see the command reference.
## Implementation Notes
This implementation includes two deviations from the original Transformer paper:
- Optimizer: Adafactor is not currently supported. We use AdamW instead, which may result in slightly higher final training loss.
- Positional Embeddings: Learned absolute position embeddings are used rather than the fixed sinusoidal encodings of the original paper. This can slightly degrade performance but simplifies the implementation (see the sketch after this list).
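To make the positional-embedding deviation concrete, the sketch below contrasts a learned absolute position embedding table with the fixed sinusoidal encoding of the original paper. Again, this is an illustrative PyTorch sketch with placeholder names, not the code used in this implementation.

```python
import math
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Learned absolute position embeddings (the approach used here)."""

    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        # One trainable vector per position, looked up by index.
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model)
        positions = torch.arange(token_emb.size(1), device=token_emb.device)
        return token_emb + self.pos_emb(positions)

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal encoding from the original paper, shown for comparison."""
    position = torch.arange(max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # (max_len, d_model), not trainable
```

For the optimizer deviation, a standard PyTorch training loop would simply construct `torch.optim.AdamW` over the model parameters in place of an Adafactor implementation.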