T5 - Cerebras AI

Model Description

T5 (Text-To-Text Transfer Transformer) is a sequence-to-sequence model that frames all NLP tasks as text-to-text problems. Originally introduced by Raffel et al. in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, this model enables a single architecture to be applied across translation, summarization, classification, question answering, and more. This implementation follows the T5.1.1 variant, which focuses on self-supervised pretraining using the C4 dataset, excluding supervised datasets during pretraining. T5 modifies the standard Transformer block architecture by reordering normalization and residual connections, as illustrated below. T5’s key contributions include:

Proposing a unified text-to-text format for all NLP tasks (Section 2.4)
Comparing encoder-decoder vs. decoder-only variants (Section 3.2)
Evaluating different training objectives including denoising and language modeling (Section 3.3)

Code Structure

The code for this model is located in the t5 directory and reuses generic components for interfacing with training scripts and configuration systems.

configs/: YAML configuration files specifying training and model hyperparameters.
model.py: Wrapper for initializing and interfacing with the T5 model.
t5_model.py: Main model implementation including encoder-decoder structure and forward logic.
utils.py: Utility functions for config parsing and data handling.

Available Configurations

Configuration	Description
`t5_small.yaml`	T5-Small: `d_kv=64`, `num_heads=6`, `encoder_num_hidden_layers=8`.
`t5_base.yaml`	T5-Base: `d_kv=64`, `num_heads=12`, `encoder_num_hidden_layers=12`.
`t5_3B.yaml`	T5-3B: `d_kv=128`, `num_heads=32`, `encoder_num_hidden_layers=24`.
`t5_11B.yaml`	T5-11B: `d_kv=128`, `num_heads=128`, `encoder_num_hidden_layers=24`.

Workflow

For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning. For a complete list of Cerebras ModelZoo CLI commands, see the command reference.

Implementation Notes

This implementation includes the following deviations from the original T5.1.1 spec:

Optimizer: Adafactor is not currently supported. We use AdamW, which may lead to slightly higher final loss.
Normalization: We use LayerNorm instead of the originally proposed RMSNorm due to hardware support constraints.

Get Started

Setup and Installation

Models

Data Preparation

Model Configuration

Training and Eval

Configure and Run Jobs

Monitoring and Troubleshooting

Convert and Port

Advanced Usage

T5

Model Description

Code Structure

Available Configurations

Workflow

Implementation Notes

References

Get Started

Setup and Installation

Models

Data Preparation

Model Configuration

Training and Eval

Configure and Run Jobs

Monitoring and Troubleshooting

Convert and Port

Advanced Usage

​Model Description

​Code Structure

​Available Configurations

​Workflow

​Implementation Notes

​References

Model Description

Code Structure

Available Configurations

Workflow

Implementation Notes

References