Model Description

This implementation reproduces the original Transformer model architecture introduced in Attention Is All You Need. The architecture was first applied to English–German translation on the WMT 2014 dataset and established the now-standard building blocks of modern NLP models: multi-head self-attention, layer normalization, position-wise feed-forward networks, residual connections, and positional encodings.

While this implementation shares much of its foundation with the T5 model, it includes important differences in architecture, datasets, model sizes, and training objectives. In particular, this model uses learned absolute positional embeddings rather than the relative position encodings used in T5, and the training task is machine translation rather than T5's general text-to-text objective.
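
To make the building blocks listed above concrete, the following is a minimal, illustrative PyTorch sketch of a single encoder layer. It is not the Model Zoo implementation; the class name and the default dimensions shown (d_model=512, 8 heads, d_ff=2048) are assumptions that simply follow the base Transformer sizing.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer (post-layer-norm, as in the original paper).
    Illustrative sketch only; not the Model Zoo implementation."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Multi-head self-attention, then residual connection + layer norm.
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward network, then residual connection + layer norm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```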

Code Structure

The code for this model is located in the transformer directory. It reuses shared infrastructure where possible, especially components from the T5 implementation.

Available Configurations

| Configuration | Description |
|---------------|-------------|
| transformer_base.yaml | Base Transformer model with d_kv=64, num_heads=8, and encoder_num_hidden_layers=6. |
| transformer_large.yaml | Large Transformer model with d_kv=64, num_heads=16, and encoder_num_hidden_layers=6. |
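
As a quick illustration of how these fields relate to one another, the sketch below simply mirrors the table; the dictionary keys and the derived hidden size are for illustration only, but in the standard Transformer sizing the model width equals num_heads * d_kv (512 for Base, 1024 for Large).

```python
# Minimal sketch: these dicts mirror the table above; the layout is illustrative,
# not the exact contents of the YAML files.
configs = {
    "transformer_base.yaml":  {"d_kv": 64, "num_heads": 8,  "encoder_num_hidden_layers": 6},
    "transformer_large.yaml": {"d_kv": 64, "num_heads": 16, "encoder_num_hidden_layers": 6},
}

for name, cfg in configs.items():
    # Standard Transformer sizing: hidden size = num_heads * d_kv
    # (8 * 64 = 512 for Base, 16 * 64 = 1024 for Large).
    d_model = cfg["num_heads"] * cfg["d_kv"]
    print(f"{name}: d_model = {d_model}, layers = {cfg['encoder_num_hidden_layers']}")
```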

Workflow

For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.

For a complete list of Cerebras ModelZoo CLI commands, see the command reference.

Implementation Notes

This implementation includes two deviations from the original Transformer paper:

  1. Optimizer: Adafactor is not currently supported; AdamW is used instead, which may result in a slightly higher final training loss.
  2. Positional Embeddings: Learned absolute position embeddings are used rather than the fixed sinusoidal encodings of the original paper. This can slightly degrade performance but simplifies the implementation (see the sketch after this list).
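
The sketch below illustrates both deviations in plain PyTorch. The module and variable names, as well as the hyperparameter values (vocabulary size, maximum sequence length, learning rate, weight decay), are assumptions for illustration and do not come from the Model Zoo code.

```python
import torch
import torch.nn as nn

class TransformerEmbedding(nn.Module):
    """Token embeddings plus learned absolute position embeddings
    (instead of the fixed sinusoidal encodings of the original paper)."""

    def __init__(self, vocab_size, d_model=512, max_position=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Deviation 2: a learned lookup table over absolute positions.
        self.pos_emb = nn.Embedding(max_position, d_model)

    def forward(self, input_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return self.token_emb(input_ids) + self.pos_emb(positions)[None, :, :]

# Stand-in model; in practice the embedding feeds the full encoder-decoder stack.
model = nn.Sequential(TransformerEmbedding(vocab_size=32000), nn.Linear(512, 32000))

# Deviation 1: AdamW in place of Adafactor.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
```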

References

  1. Vaswani, A., et al. "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NeurIPS 2017). arXiv:1706.03762.
  2. Raffel, C., et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." Journal of Machine Learning Research 21 (2020). arXiv:1910.10683.