GPT-3 is a decoder-only transformer language model that scales to billions of parameters and is trained with an autoregressive next-token-prediction objective, with support for µP scaling and Cerebras-optimized workflows.
The model code lives in the `gpt3` directory within ModelZoo. Here's how it's organized:

- `configs/`: Contains YAML configuration files for various GPT-3-sized models.
- `run.py`: Training and evaluation entry point. Accepts CLI arguments for mode, config path, checkpointing, and output directories (see the example after this list).
- `gpt2_model.py`: The model implementation; GPT-3 shares its architecture with GPT-2, so the GPT-2 model code is reused.
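As a quick orientation, a typical launch might look like the following. The positional target and flag names reflect common ModelZoo runner conventions and may differ across releases, so treat this as a hedged sketch rather than the canonical invocation:

```bash
# Hedged example: train the 1.3B config on a Cerebras system (CSX).
# Flag names (--mode, --params, --model_dir) may vary by ModelZoo version.
python run.py CSX \
    --mode train \
    --params configs/1p3b.yaml \
    --model_dir model_dir_1p3b
```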
### Cerebras-GPT
| Configuration | Description |
|---|---|
| 111m.yaml | 111M parameter model using standard parametrization. |
| 111m_mup.yaml | 111M parameter model with Maximal Update Parametrization (µP). |
| 256m.yaml | 256M parameter model using standard parametrization. |
| 256m_mup.yaml | 256M parameter model with µP. |
| 590m.yaml | 590M parameter model using standard parametrization. |
| 590m_mup.yaml | 590M parameter model with µP. |
| 1p3b.yaml | 1.3B parameter model (GPT-3 XL equivalent). |
| 1p3b_mup.yaml | 1.3B parameter model with µP. |
| 2p7b.yaml | 2.7B parameter model. |
| 2p7b_mup.yaml | 2.7B parameter model with µP. |
| 6p7b.yaml | 6.7B parameter model. |
| 13b_bs720.yaml | 13B parameter model, batch size 720. |
| 13b_bs1080.yaml | 13B parameter model, batch size 1080. |
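µP reparametrizes initialization, learning rates, and attention scaling so that hyperparameters tuned on a small proxy model transfer to wider models. The sketch below is a simplified PyTorch illustration of two of its core rules, not the ModelZoo implementation; `base_width` is a hypothetical reference width:

```python
import torch

def mup_attention(q, k, v):
    # µP scales attention logits by 1/d_head rather than the standard
    # 1/sqrt(d_head), keeping logit scale stable as width grows.
    d_head = q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / d_head
    return torch.softmax(scores, dim=-1) @ v

def mup_hidden_lr(base_lr, width, base_width=256):
    # Under µP (with Adam), hidden-weight learning rates scale like
    # 1 / width-multiplier, so base_lr tuned at base_width transfers.
    return base_lr / (width / base_width)
```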
### Sparsity
| Configuration | Description |
|---|---|
| params_gpt3_125m_rigl75.yaml | 125M parameter model with 75% sparsity using RigL pruning. |
| params_gpt3_125m_set75.yaml | 125M parameter model with 75% sparsity using SET pruning. |
| params_gpt3_125m_static75.yaml | 125M parameter model with 75% fixed sparse weights. |
| params_gpt3_125m_sparsewide-ift_dense.yaml | 125M dense model for sparsewide-IFT comparison. |
| params_gpt3_125m_sparsewide-ift_rigl75.yaml | 125M model with 75% RigL sparsity in the sparsewide-IFT setup. |
| params_gpt3_125m_sparsewide-ift_static50.yaml | 125M model with 50% static sparsity in the sparsewide-IFT setup. |
| params_gpt3_6p7b_vspdf_phase1.yaml | 6.7B sparse model for VSPDF Phase 1 training. |
| params_gpt3_6p7b_vspdf_phase2.yaml | 6.7B sparse model for VSPDF Phase 2 training. |
| params_gpt3_6p7b_vspdf_dart.yaml | 6.7B model with DART sparsity applied for VSPDF fine-tuning. |
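The static configs fix a sparsity mask for the whole run, while RigL and SET are dynamic: they periodically drop low-magnitude weights and regrow connections (RigL using gradient magnitudes, SET randomly). Below is a minimal sketch of the static case, assuming magnitude-based mask selection; the ModelZoo drives sparsity from the YAML config rather than code like this:

```python
import torch

def static_sparsify(weight, sparsity=0.75):
    # Build a fixed mask that zeroes the smallest-magnitude 75% of weights;
    # static sparsity keeps this mask unchanged for the whole training run.
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask, mask

w = torch.randn(512, 512)
w_sparse, mask = static_sparsify(w)
print(f"achieved sparsity: {1 - mask.mean().item():.2f}")  # ~0.75
```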
The model expects the following input tensors:

| Input Name | Shape | Data Type | Description |
|---|---|---|---|
| input_ids | (batch_size, max_sequence_length) | torch.int32 | Token IDs, padded to full sequence length. |
| attention_mask | (batch_size, max_sequence_length) | torch.int32 | 1s for valid tokens, 0s for padding. |
| labels | (batch_size, max_sequence_length) | torch.int32 | Targets for language modeling (same as the inputs). |
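A small sketch of assembling one batch with these tensors; the five-token sequences, zero padding, and GPT-2 BPE vocabulary size (50257) are illustrative assumptions:

```python
import torch

batch_size, max_sequence_length, n_valid = 2, 8, 5
tokens = torch.randint(0, 50257, (batch_size, n_valid), dtype=torch.int32)

input_ids = torch.zeros(batch_size, max_sequence_length, dtype=torch.int32)
input_ids[:, :n_valid] = tokens  # real tokens first, zero padding after

attention_mask = torch.zeros(batch_size, max_sequence_length, dtype=torch.int32)
attention_mask[:, :n_valid] = 1  # 1 marks valid tokens, 0 marks padding

labels = input_ids.clone()  # targets mirror inputs; the shift happens in-model

batch = {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```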
Input data is handled by `GptHDF5DataProcessor.py`, which consumes PILE-formatted datasets that have been converted to `.h5` files via `preprocess_data.py`.
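To sanity-check a preprocessed shard, a minimal h5py sketch follows; the filename and the `data` key with a `(num_examples, 3, max_sequence_length)` layout are assumptions about the preprocessing output, so inspect the keys for your own files:

```python
import h5py

# Hypothetical shard name; the "data" key and its layout are assumptions.
with h5py.File("train_shard_0.h5", "r") as f:
    print(list(f.keys()))
    data = f["data"]
    print(data.shape, data.dtype)
```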
Use the configuration files in `configs/Cerebras_GPT/` to reproduce results from the Cerebras-GPT blog.