Model Description

GPT-3 is a decoder-only transformer language model designed for large-scale autoregressive pretraining. It extends GPT-2 with significantly more parameters (ranging from 1.3B to 175B) and introduces architectural refinements such as sparse attention layers, used in alternating blocks to reduce compute costs during training. This implementation, however, uses GPT-2-style dense attention in all layers. Training is performed with a next-token prediction objective on large text corpora such as the Pile, with inputs represented as token sequences padded and masked to a fixed maximum sequence length.
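
To make the objective concrete, here is a minimal, self-contained sketch (not the ModelZoo implementation) of a decoder-only model with dense causal attention in every layer, trained with next-token prediction on padded, masked sequences. All module choices and hyperparameters below are illustrative assumptions.

```python
# Minimal sketch of dense-attention, decoder-only next-token prediction.
# Layer sizes and module choices are illustrative, not the ModelZoo model.
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size=50257, d_model=256, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids, attention_mask):
        seq_len = input_ids.size(1)
        pos = torch.arange(seq_len, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(pos)
        # Dense causal mask: True above the diagonal blocks attention to future tokens.
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=input_ids.device), diagonal=1
        )
        x = self.blocks(x, mask=causal, src_key_padding_mask=(attention_mask == 0))
        return self.lm_head(x)

model = TinyDecoderLM()
input_ids = torch.randint(1, 50257, (2, 16))       # toy batch of token IDs
attention_mask = torch.ones(2, 16, dtype=torch.int64)
labels = input_ids.clone()                         # next-token targets are the inputs themselves

logits = model(input_ids, attention_mask)
# Shift so position t predicts token t + 1, then ignore padded positions.
per_token = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)), labels[:, 1:].reshape(-1), reduction="none"
)
valid = attention_mask[:, 1:].reshape(-1).float()
loss = (per_token * valid).sum() / valid.sum()
```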

Code Structure

The code for this model is located in the gpt3 directory within ModelZoo. Here’s how it’s organized:
  • configs/: Contains YAML configuration files for various GPT-3-sized models.
  • run.py: Training and evaluation entry point. Accepts CLI arguments for mode, config path, checkpointing, and output directories; an example invocation is sketched at the end of this section.
Our implementation of GPT-3 is built on top of our GPT-2 backbone. For more details, see gpt2_model.py.
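
As a rough illustration of how run.py is driven, the snippet below launches a training run as a Python subprocess call (equivalent to running the same command in a shell). The flag names (--mode, --params, --model_dir) are assumptions based on the arguments described above; consult the Cerebras ModelZoo command reference for the exact interface.

```python
# Hypothetical launch of run.py; flag names are assumptions, and the command
# reference is authoritative for the real interface.
import subprocess

subprocess.run(
    [
        "python", "run.py",
        "--mode", "train",                # "train" or "eval"
        "--params", "configs/111m.yaml",  # one of the YAML configs listed below
        "--model_dir", "model_out",       # checkpoints and logs are written here
    ],
    check=True,
)
```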

Available Configurations

Configuration | Description
--- | ---
111m.yaml | 111M parameter model using standard parametrization.
111m_mup.yaml | 111M parameter model with Maximal Update Parametrization (µP).
256m.yaml | 256M parameter model using standard parametrization.
256m_mup.yaml | 256M parameter model with µP.
590m.yaml | 590M parameter model using standard parametrization.
590m_mup.yaml | 590M parameter model with µP.
1p3b.yaml | 1.3B parameter model (GPT-3 XL equivalent).
1p3b_mup.yaml | 1.3B parameter model with µP.
2p7b.yaml | 2.7B parameter model.
2p7b_mup.yaml | 2.7B parameter model with µP.
6p7b.yaml | 6.7B parameter model.
13b_bs720.yaml | 13B parameter model, batch size 720.
13b_bs1080.yaml | 13B parameter model, batch size 1080.

Configuration | Description
--- | ---
params_gpt3_125m_rigl75.yaml | 125M parameter model with 75% sparsity using RigL pruning.
params_gpt3_125m_set75.yaml | 125M parameter model with 75% sparsity using SET pruning.
params_gpt3_125m_static75.yaml | 125M parameter model with 75% fixed sparse weights.
params_gpt3_125m_sparsewide-ift_dense.yaml | 125M dense model for sparsewide-IFT comparison.
params_gpt3_125m_sparsewide-ift_rigl75.yaml | 125M model with 75% RigL sparsity in sparsewide-IFT setup.
params_gpt3_125m_sparsewide-ift_static50.yaml | 125M model with 50% static sparsity in sparsewide-IFT setup.
params_gpt3_6p7b_vspdf_phase1.yaml | 6.7B sparse model for VSPDF Phase 1 training.
params_gpt3_6p7b_vspdf_phase2.yaml | 6.7B sparse model for VSPDF Phase 2 training.
params_gpt3_6p7b_vspdf_dart.yaml | 6.7B model with DART sparsity applied for VSPDF fine-tuning.

The 1.3B (XL), 2.7B, 6.7B, and 13B configs above show an example of setting the micro batch size explicitly in the train_input section of the config. Without this setting, an automatic search for the best micro batch size is performed during compilation, which can take a long time for larger models.
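
For illustration only, the snippet below shows roughly what an explicit micro batch size setting in the train_input section might look like, wrapped in a small Python example; the key names and values are assumptions, and the shipped YAML configs are authoritative.

```python
# Illustrative only: an assumed shape of the train_input fragment with an
# explicit micro batch size; the real key names and values live in the
# shipped YAML configs (e.g. 13b_bs720.yaml).
import yaml  # PyYAML

train_input_fragment = """
train_input:
  batch_size: 720        # global batch size (per the 13b_bs720 config name)
  micro_batch_size: 60   # set explicitly to skip the automatic search at compile time
"""

cfg = yaml.safe_load(train_input_fragment)
print(cfg["train_input"]["micro_batch_size"])  # -> 60
```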

Model Input Tensor Specifications

Input Name | Shape | Data Type | Description
--- | --- | --- | ---
input_ids | (batch_size, max_sequence_length) | torch.int32 | Token IDs, padded to the full sequence length.
attention_mask | (batch_size, max_sequence_length) | torch.int32 | 1s for valid tokens, 0s for padding.
labels | (batch_size, max_sequence_length) | torch.int32 | Targets for language modeling (same as the inputs).
These tensors are produced by GptHDF5DataProcessor.py, which reads the .h5 files that preprocess_data.py generates from Pile-formatted datasets.
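
As a quick sanity check on the specification above, the snippet below builds a synthetic batch with the listed names, shapes, and dtypes; it is illustrative only and does not read the preprocessed .h5 files.

```python
# Build a synthetic batch matching the tensor spec above (names, shapes, dtypes).
import torch

batch_size, max_sequence_length, vocab_size, pad_id = 2, 8, 50257, 0

# Two "documents" of different lengths, padded to max_sequence_length.
lengths = [8, 5]
input_ids = torch.full((batch_size, max_sequence_length), pad_id, dtype=torch.int32)
attention_mask = torch.zeros(batch_size, max_sequence_length, dtype=torch.int32)
for i, n in enumerate(lengths):
    input_ids[i, :n] = torch.randint(1, vocab_size, (n,), dtype=torch.int32)
    attention_mask[i, :n] = 1

labels = input_ids.clone()  # same as the inputs, per the table above

for name, t in [("input_ids", input_ids), ("attention_mask", attention_mask), ("labels", labels)]:
    print(name, tuple(t.shape), t.dtype)
# input_ids (2, 8) torch.int32
# attention_mask (2, 8) torch.int32
# labels (2, 8) torch.int32
```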

Workflow

For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning. For a complete list of Cerebras ModelZoo CLI commands, see the command reference.

Advanced Features

This implementation supports:

References