Model Description

GPT-3 is a decoder-only transformer language model designed for large-scale autoregressive pretraining. It extends GPT-2 with significantly more parameters (ranging from 1.3B to 175B) and introduces architectural refinements such as sparse attention layers, used in alternating blocks to reduce compute costs during training. This implementation, however, uses GPT-2-style dense attention in all layers. Training is performed with a next-token prediction objective on large text corpora such as the Pile, with inputs represented as token sequences padded and masked to a fixed maximum sequence length.
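
To make the objective concrete, here is a minimal, self-contained sketch (not the ModelZoo implementation) of a decoder-only model with dense causal attention in every layer, trained with next-token prediction on padded, masked sequences. All module choices and hyperparameters below are illustrative assumptions.

```python
# Minimal sketch of dense-attention, decoder-only next-token prediction.
# Layer sizes and module choices are illustrative, not the ModelZoo model.
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size=50257, d_model=256, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids, attention_mask):
        seq_len = input_ids.size(1)
        pos = torch.arange(seq_len, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(pos)
        # Dense causal mask: True above the diagonal blocks attention to future tokens.
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=input_ids.device), diagonal=1
        )
        x = self.blocks(x, mask=causal, src_key_padding_mask=(attention_mask == 0))
        return self.lm_head(x)

model = TinyDecoderLM()
input_ids = torch.randint(1, 50257, (2, 16))       # toy batch of token IDs
attention_mask = torch.ones(2, 16, dtype=torch.int64)
labels = input_ids.clone()                         # next-token targets are the inputs themselves

logits = model(input_ids, attention_mask)
# Shift so position t predicts token t + 1, then ignore padded positions.
per_token = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)), labels[:, 1:].reshape(-1), reduction="none"
)
valid = attention_mask[:, 1:].reshape(-1).float()
loss = (per_token * valid).sum() / valid.sum()
```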

Code Structure

The code for this model is located in the gpt3 directory within ModelZoo. Here’s how it’s organized:
  • configs/: Contains YAML configuration files for various GPT-3-sized models.
  • run.py: Training and evaluation entry point. Accepts CLI arguments for mode, config path, checkpointing, and output directories; an example invocation is sketched at the end of this section.
Our implementation of GPT-3 is built on top of our GPT-2 backbone. For more details, see gpt2_model.py.
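
As a rough illustration of how run.py is driven, the snippet below launches a training run as a Python subprocess call (equivalent to running the same command in a shell). The flag names (--mode, --params, --model_dir) are assumptions based on the arguments described above; consult the Cerebras ModelZoo command reference for the exact interface.

```python
# Hypothetical launch of run.py; flag names are assumptions, and the command
# reference is authoritative for the real interface.
import subprocess

subprocess.run(
    [
        "python", "run.py",
        "--mode", "train",                # "train" or "eval"
        "--params", "configs/111m.yaml",  # one of the YAML configs listed below
        "--model_dir", "model_out",       # checkpoints and logs are written here
    ],
    check=True,
)
```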

Available Configurations

Configuration | Description
--- | ---
111m.yaml | 111M parameter model using standard parametrization.
111m_mup.yaml | 111M parameter model with Maximal Update Parametrization (µP).
256m.yaml | 256M parameter model using standard parametrization.
256m_mup.yaml | 256M parameter model with µP.
590m.yaml | 590M parameter model using standard parametrization.
590m_mup.yaml | 590M parameter model with µP.
1p3b.yaml | 1.3B parameter model (GPT-3 XL equivalent).
1p3b_mup.yaml | 1.3B parameter model with µP.
2p7b.yaml | 2.7B parameter model.
2p7b_mup.yaml | 2.7B parameter model with µP.
6p7b.yaml | 6.7B parameter model.
13b_bs720.yaml | 13B parameter model, batch size 720.
13b_bs1080.yaml | 13B parameter model, batch size 1080.

Configuration | Description
--- | ---
params_gpt3_125m_rigl75.yaml | 125M parameter model with 75% sparsity using RigL pruning.
params_gpt3_125m_set75.yaml | 125M parameter model with 75% sparsity using SET pruning.
params_gpt3_125m_static75.yaml | 125M parameter model with 75% fixed sparse weights.
params_gpt3_125m_sparsewide-ift_dense.yaml | 125M dense model for sparsewide-IFT comparison.
params_gpt3_125m_sparsewide-ift_rigl75.yaml | 125M model with 75% RigL sparsity in sparsewide-IFT setup.
params_gpt3_125m_sparsewide-ift_static50.yaml | 125M model with 50% static sparsity in sparsewide-IFT setup.
params_gpt3_6p7b_vspdf_phase1.yaml | 6.7B sparse model for VSPDF Phase 1 training.
params_gpt3_6p7b_vspdf_phase2.yaml | 6.7B sparse model for VSPDF Phase 2 training.
params_gpt3_6p7b_vspdf_dart.yaml | 6.7B model with DART sparsity applied for VSPDF fine-tuning.

The 1.3B (XL), 2.7B, 6.7B, and 13B configs above show an example of setting the micro batch size explicitly in the train_input section of the config. Without this setting, an automatic search for the best micro batch size is performed during compilation, which can take a long time for larger models.
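
For illustration only, the snippet below shows roughly what an explicit micro batch size setting in the train_input section might look like, wrapped in a small Python example; the key names and values are assumptions, and the shipped YAML configs are authoritative.

```python
# Illustrative only: an assumed shape of the train_input fragment with an
# explicit micro batch size; the real key names and values live in the
# shipped YAML configs (e.g. 13b_bs720.yaml).
import yaml  # PyYAML

train_input_fragment = """
train_input:
  batch_size: 720        # global batch size (per the 13b_bs720 config name)
  micro_batch_size: 60   # set explicitly to skip the automatic search at compile time
"""

cfg = yaml.safe_load(train_input_fragment)
print(cfg["train_input"]["micro_batch_size"])  # -> 60
```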

Model Input Tensor Specifications

Input Name | Shape | Data Type | Description
--- | --- | --- | ---
input_ids | (batch_size, max_sequence_length) | torch.int32 | Token IDs, padded to the full sequence length.
attention_mask | (batch_size, max_sequence_length) | torch.int32 | 1s for valid tokens, 0s for padding.
labels | (batch_size, max_sequence_length) | torch.int32 | Targets for language modeling (same as the inputs).
These tensors are produced by GptHDF5DataProcessor.py, which reads the .h5 files that preprocess_data.py generates from Pile-formatted datasets.
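
As a quick sanity check on the specification above, the snippet below builds a synthetic batch with the listed names, shapes, and dtypes; it is illustrative only and does not read the preprocessed .h5 files.

```python
# Build a synthetic batch matching the tensor spec above (names, shapes, dtypes).
import torch

batch_size, max_sequence_length, vocab_size, pad_id = 2, 8, 50257, 0

# Two "documents" of different lengths, padded to max_sequence_length.
lengths = [8, 5]
input_ids = torch.full((batch_size, max_sequence_length), pad_id, dtype=torch.int32)
attention_mask = torch.zeros(batch_size, max_sequence_length, dtype=torch.int32)
for i, n in enumerate(lengths):
    input_ids[i, :n] = torch.randint(1, vocab_size, (n,), dtype=torch.int32)
    attention_mask[i, :n] = 1

labels = input_ids.clone()  # same as the inputs, per the table above

for name, t in [("input_ids", input_ids), ("attention_mask", attention_mask), ("labels", labels)]:
    print(name, tuple(t.shape), t.dtype)
# input_ids (2, 8) torch.int32
# attention_mask (2, 8) torch.int32
# labels (2, 8) torch.int32
```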

Workflow

For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning. For a complete list of Cerebras ModelZoo CLI commands, see the command reference.

Advanced Features

This implementation supports:

References