Cerebras Modelzoo Layers
class cerebras.modelzoo.layers.AlibiPositionEmbeddingLayer(*args, **kwargs) [source]
Bases: torch.nn.Module
Alibi position embedding layer; the symmetric case with bidirectional attention is supported.
Alibi bias as in the paper: https://arxiv.org/abs/2108.12409
Parameters
- num_heads (int): number of attention heads.
- slopes (Tensor): slope values to use for alibi heads. Shape: [num_heads, 1]. Defaults to None.
- alibi_trainable_slopes (bool): whether the alibi slopes are trainable parameters.
- slopes_initializer (str): initializer for alibi slopes if they're trainable. Defaults to xavier_uniform.
Returns
- Relative position bias, to be used in attention masking.
Return type
- position_bias (Tensor)
forward(seq_length, key_length, past_kv=None, constant_pos_mask=None, batch_size=None) [source]
Return the position bias based on the alibi slopes.
Parameters
- seq_length (int): the length of query tokens.
- key_length (int): the length of key tokens.
Returns
- Position bias tensor with shape [num_heads, query_length, key_length]
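A minimal usage sketch (illustrative; it assumes the constructor accepts the keyword arguments listed above and that calling the module returns the bias tensor directly):

import torch
from cerebras.modelzoo.layers import AlibiPositionEmbeddingLayer

num_heads = 8  # hypothetical value for illustration
alibi = AlibiPositionEmbeddingLayer(num_heads=num_heads)

# Bias for a self-attention call over 128 query and 128 key positions.
position_bias = alibi(seq_length=128, key_length=128)
# Expected shape: [num_heads, 128, 128]; pass it as position_bias to an
# attention layer (e.g. MultiheadAttention.forward) so it is added to the logits.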
MultiheadAttention Class
class cerebras.modelzoo.layers.MultiheadAttention(*args, **kwargs)
Bases: torch.nn.Module
Multi-head attention layer. Adapted from: https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention.
Parameters
- embed_dim (int) – Number of input units in each projection output.
- num_heads (int) – Number of attention heads.
- inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to embed_dim.
- dropout (float) – Dropout rate for key-query weights. Defaults to 0.0.
- batch_first (bool) – If True, the input and output tensors are provided as (batch, seq, feature); otherwise the format is (seq, batch, feature). Default: True (batch, seq, feature).
- add_bias_kv (bool) – If specified, adds bias to the key and value sequences at dim=0. Default: False.
- add_zero_attn (bool) – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False.
- kdim (int) – Number of input units in the key projection.
- vdim (int) – Number of input units in the value projection.
- use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.
- use_ffn_bias (bool) – Whether to use bias in the output projection.
- attention_initializer (str) – Projection kernel initializer. Defaults to xavier_uniform.
- attention_q_initializer – Query projection kernel initializer. If not specified, the query projection is initialized via attention_initializer.
- output_layer_initializer (str | initializer) – If not None, use this initializer for the output transform layer. Defaults to None.
- bias_initializer (str) – Bias initializer. Defaults to zeros.
- attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.
- scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (= hidden / d_head) instead of sqrt(d).
- attention_logits_alpha (float) – Scales the QK^T dot product. Used to stabilize logits in muP training.
- softmax_dtype_fp32 (bool) – Use an FP32 softmax implementation.
- attention_kernel (str | None) – Kernel to use. Uses default if None. Accepted values: None – default implementation; fast_attention – experimental optimized implementation.
- device (optional) – Device to create the model parameters on; can be a CUDA device or CS device.
MultiheadAttention.forward(q, k, v, attn_mask=None, key_padding_mask=None, need_weights=False, average_attn_weights=True, past_kv=None, cache_present_kv=False, past_kv_self_attn=True, position_bias=None, rotary_position_embedding_helper=None, layer_idx=None, **extra_args)
Applies the attention mechanism to queries q, keys k, and values v.
Parameters
- q (Tensor) – Queries, shape [batch_size, seq_length, embed_dim].
- k (Tensor) – Keys, shape [batch_size, seq_length, embed_dim].
- v (Tensor) – Values, shape [batch_size, seq_length, embed_dim].
- attn_mask (Tensor) – Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch_size, query_length, seq_length].
- key_padding_mask (Tensor) – If specified, a mask of shape (N, S) indicating which elements within key to ignore for the purpose of attention (i.e., treat as "padding"). Defaults to None.
- need_weights (bool) – If specified, returns attn_output_weights in addition to attn_outputs. Default: False.
- average_attn_weights (bool) – If True, the returned attn_weights are averaged across heads; otherwise attn_weights are provided separately per head. This flag only has an effect when need_weights=True. Default: True (average weights across heads).
- past_kv (tuple(Tensor, Tensor)) – Past keys and values. Tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads]. The 0th and 1st tensors contain the past keys and values, respectively. Defaults to None.
- cache_present_kv (bool) – Specifies whether the present keys and values must be cached and returned. Needed to speed up computation when the decoder is called within an autoregressive loop. Defaults to False.
- past_kv_self_attn (bool) – Specifies whether the past keys and values should be used for self-attention (True) or cross-attention (False). Ignored if past_kv is not provided. Default: True.
- position_bias (Tensor) – Tensor containing the position bias to apply in attention, with shape [num_heads, query_length, key_length].
- rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
Returns
Attention output tensor with shape [batch_size, seq_length, embed_dim].
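A minimal self-attention sketch (illustrative; the additive-float-mask convention for attn_mask is an assumption, since the docstring above only specifies mask shapes):

import torch
from cerebras.modelzoo.layers import MultiheadAttention

batch_size, seq_length, embed_dim, num_heads = 2, 16, 64, 4  # hypothetical sizes
attn = MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)

x = torch.rand(batch_size, seq_length, embed_dim)  # batch_first=True layout

# 3D causal mask of shape [batch_size, query_length, seq_length]
# (assumption: float masks are added to the attention logits).
causal = torch.triu(
    torch.full((seq_length, seq_length), float("-inf")), diagonal=1
).expand(batch_size, -1, -1)

out = attn(x, x, x, attn_mask=causal)  # self-attention
# out: [batch_size, seq_length, embed_dim]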
class cerebras.modelzoo.layers.BatchChannelNorm2D(*args, **kwargs)[source]#
Bases: torch.nn.Module
Implements the batch-channel normalization proposed in Micro-Batch Training with Batch-Channel Normalization and Weight Standardization (https://arxiv.org/abs/1903.10520).
Parameters
- num_groups (int) – number of groups to separate the channels into.
- num_channels (int) – number of channels. C from an expected input of size (N, C, H, W).
- eps (float) – a value added to the denominator for numerical stability. Default: 1e-5.
- momentum (float) – the update rate used for the running_mean and running_var computation. Default: 0.1.
- device (torch.device) – Device to place the learnable parameters on.
- dtype (torch.dtype) – Data type of the learnable parameters.
Shape:
- input: (N, C, H, W)
- output: (N, C, H, W) (same shape as input)
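A minimal usage sketch (illustrative; the keyword names follow the parameter list above and the sizes are hypothetical):

import torch
from cerebras.modelzoo.layers import BatchChannelNorm2D

bcn = BatchChannelNorm2D(num_groups=4, num_channels=32)

x = torch.rand(8, 32, 28, 28)  # (N, C, H, W)
y = bcn(x)                     # same shape as the input: (8, 32, 28, 28)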
class cerebras.modelzoo.layers.EmbeddingLayer(*args, **kwargs)[source]#
Bases: torch.nn.Module
Creates token and, optionally, position and segment embeddings.
Parameters
- vocab_size (int) – Size of the input vocabulary.
- embedding_size (int) – Dimension of the embedding space.
- pad_token_id (Optional[int]) – If specified, the entries at padding_idx do not contribute to the gradient; the embedding vector at padding_idx is therefore not updated during training.
- segment_embedding_size (int) – Dimension of the embedding space for segment embeddings. Useful when factorized embeddings are used for tokens, so the size of the embedding space for segments differs from that for tokens. Defaults to the same value as embedding_size.
- embeddings_initializer (Optional[str, Callable]) – Token embeddings initializer. Defaults to 'uniform'.
- max_position_embeddings (int) – Maximum sequence length to train the model with.
- position_embedding_type (str) – 'learned', 'fixed', or 'rotary'. Defaults to 'learned'. For 'rotary' embeddings, the embeddings are not created in this layer but are computed on the key and query tensors by RotaryPositionEmbeddingHelper.
- position_embedding_offset (int) – Offset for position embeddings. Defaults to 0.
- min_timescale (Optional[int]) – The scale of the shortest sinusoid. Defaults to 1.0. (Only needs to be specified when position_embedding_type is 'fixed'.)
- max_timescale (Optional[int]) – The scale of the longest sinusoid. Defaults to 1.0e4. (Only needs to be specified when position_embedding_type is 'fixed'.)
- position_embeddings_initializer (Optional[str, Callable]) – Position embeddings initializer. Defaults to 'uniform'.
- num_segments (Optional[int]) – Number of segments for the segment embedding layer. Defaults to None, in which case the segment embedding layer is not created.
- segment_embeddings_initializer (Optional[str, Callable]) – Segment embeddings initializer. Defaults to 'uniform'.
- device (optional) – Device to create the model parameters on; can be a CUDA device or CS device.
forward(input_ids, position_ids=None, segment_ids=None, past_length=0)[source]#
Convert input_ids to token embeddings according to the embedding type.
Word embeddings (required), segment embeddings (optional) and position embeddings (optional).
Parameters
- input_ids (Tensor) – Input token IDs with shape [batch_size, seq_length].
- position_ids (Tensor) – Position IDs with shape [batch_size, seq_length].
- segment_ids (Tensor) – Input segment IDs with shape [batch_size, seq_length].
Returns
Token embedding output with shape [batch_size, seq_length, embedding_size].
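A minimal usage sketch (illustrative; the keyword names follow the parameter list above and the sizes are hypothetical):

import torch
from cerebras.modelzoo.layers import EmbeddingLayer

embedding = EmbeddingLayer(
    vocab_size=32000,
    embedding_size=256,
    max_position_embeddings=128,
    position_embedding_type="learned",
)

input_ids = torch.randint(0, 32000, (2, 128))  # [batch_size, seq_length]
hidden = embedding(input_ids)
# hidden: [2, 128, 256], i.e. [batch_size, seq_length, embedding_size]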
class cerebras.modelzoo.layers.FeedForwardNetwork(*args, **kwargs)[source]#
Bases: torch.nn.Module
A feed forward network consisting of a stack of fully connected layers, arranged as a [LinearLayer -> Activation -> Dropout] block repeated len(layers_units) times.
Parameters
config (FeedForwardNetworkConfig) – Feed forward network config.
Initialize the FFN object instance.
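The stacked block structure described above can be sketched in plain PyTorch as follows (illustrative only; the actual layer is driven by FeedForwardNetworkConfig, whose fields are not reproduced here, and layers_units follows the name used in the description above):

import torch.nn as nn

def make_ffn_sketch(input_unit, layers_units, activation=nn.GELU, dropout=0.1):
    # [LinearLayer -> Activation -> Dropout] repeated once per entry of layers_units.
    blocks, in_features = [], input_unit
    for out_features in layers_units:
        blocks += [nn.Linear(in_features, out_features), activation(), nn.Dropout(dropout)]
        in_features = out_features
    return nn.Sequential(*blocks)

ffn = make_ffn_sketch(input_unit=256, layers_units=[1024, 256])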
class cerebras.modelzoo.layers.GPTJDecoderLayer(*args, **kwargs)[source]#
Bases: cerebras.modelzoo.layers.TransformerDecoderLayer.TransformerDecoderLayer
GPTJDecoderLayer inherits from TransformerDecoderLayer and makes two modifications:
- It uses a parallel decoder architecture instead of the sequential one.
- It supports both GPT-J and GPT-NeoX, the latter of which uses untied layer norm.
Reference: https://www.cerebras.net/blog/how-to-harness-the-predictive-power-of-gpt-j
Parameters
- d_model (int) – the number of expected features in the input (required).
- nhead (int) – the number of heads in the multihead-attention models (required).
- use_untied_layer_norm (bool) – whether to use untied layer norm. Should be False for GPT-J and True for NeoX.
- kwargs – the remaining arguments, which are the same as for TransformerDecoderLayer.
forward(tgt, memory=None, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, attention_mask=None, rotary_position_embedding_helper=None, past_kv=None, cache_present_kv=False, self_attn_position_bias=None, cross_attn_position_bias=None, layer_idx=None, expert_hash_idx=None)[source]#
GPTJ layer with rotary position embeddings and parallel decoder architecture
Parameters
- tgt (torch.Tensor) – the sequence to the decoder layer (required).
- memory (Optional[torch.Tensor]) – the sequence from the last layer of the encoder (optional).
- tgt_mask (Optional[torch.Tensor]) – the mask for the tgt sequence (optional).
- memory_mask (Optional[torch.Tensor]) – the mask for the memory sequence (optional).
- tgt_key_padding_mask (Optional[torch.Tensor]) – the mask for the tgt keys per batch (optional).
- memory_key_padding_mask (Optional[torch.Tensor]) – the mask for the memory keys per batch (optional).
- rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
- past_kv (Optional[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]]) – Past keys and values for the self-attention and (if applicable) cross-attention modules. Key/value tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads] (optional).
- cache_present_kv (bool) – Specifies whether the present keys and values must be cached and returned. Needed to speed up computation when the decoder is called within an autoregressive loop (optional).
- self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing the position bias to apply in self-attention; can be obtained from relative or alibi position embeddings.
- expert_hash_idx (Optional[torch.Tensor]) – tensor containing mixture-of-experts expert selection indices for each token in the batch. Only used with MoE when hash-based routing is enabled (optional).
Shape:
Output tensor with the same shape as tgt.
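The parallel decoder architecture mentioned above can be summarized with the following sketch (illustrative pseudo-layer in plain PyTorch, not the Modelzoo implementation):

import torch.nn as nn

class ParallelBlockSketch(nn.Module):
    # Contrasts the parallel residual used by GPT-J-style layers with the
    # sequential residual of a standard decoder layer.
    def __init__(self, d_model, nhead):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.ln(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # Parallel: attention and MLP both read the same normalized input and are
        # summed into a single residual, instead of attn -> residual -> mlp -> residual.
        return x + attn_out + self.mlp(h)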
class cerebras.modelzoo.layers.GroupInstanceNorm(*args, **kwargs)[source]#
Bases: torch.nn.Module
Uses torch.nn.GroupNorm to emulate InstanceNorm by setting number of groups equal to the number of channels.
Parameters
num_channels (int) – number of channels. C from an expected input of size (N, C, H, W).
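The equivalence described above can be checked directly with plain PyTorch (illustrative; not the Modelzoo class itself):

import torch
import torch.nn as nn

x = torch.rand(8, 32, 28, 28)  # (N, C, H, W)

# GroupNorm with num_groups == num_channels normalizes each channel of each
# sample independently over (H, W), which is exactly instance normalization.
gn = nn.GroupNorm(num_groups=32, num_channels=32)
inorm = nn.InstanceNorm2d(32, affine=True)

print(torch.allclose(gn(x), inorm(x), atol=1e-5))  # True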
class cerebras.modelzoo.layers.MultiQueryAttention(*args, **kwargs)[source]#
Bases: cerebras.modelzoo.layers.AttentionLayer.MultiheadAttention
Implements the multi-query attention layer from Fast Transformer Decoding: One Write-Head is All You Need (https://arxiv.org/abs/1911.02150).
Parameters
- embed_dim (int) – Number of input units in each projection output.
- num_heads (int) – Number of attention heads.
- inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to embed_dim.
- dropout (float) – Dropout rate for key-query weights. Defaults to 0.0.
- batch_first (bool) – If True, the input and output tensors are provided as (batch, seq, feature); otherwise the format is (seq, batch, feature). Default: True (batch, seq, feature).
- add_bias_kv (bool) – If specified, adds bias to the key and value sequences at dim=0. Default: False.
- add_zero_attn (bool) – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False.
- kdim (int) – Number of output units in the key projection.
- vdim (int) – Number of output units in the value projection.
- use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.
- use_ffn_bias (bool) – Whether to use bias in the output projection.
- attention_initializer (str) – Projection kernel initializer. Defaults to xavier_uniform.
- attention_q_initializer – Query projection kernel initializer. If not specified, the query projection is initialized via attention_initializer.
- output_layer_initializer (str or initializer) – If not None, use this initializer for the output transform layer. Defaults to None.
- bias_initializer (str) – Bias initializer. Defaults to zeros.
- attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.
- softmax_dtype_fp32 (bool) – Use an FP32 softmax implementation.
- attention_kernel (str | None) – Kernel to use. Uses default if None. Accepted values: None – default implementation; fast_attention – experimental optimized implementation.
- device (optional) – Device to create the model parameters on; can be a CUDA device or CS device.
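The defining property of multi-query attention, a single key/value head shared by every query head, can be sketched as follows (illustrative only; no masking or dropout, and not the Modelzoo implementation):

import torch

def multi_query_attention_sketch(q, k, v, num_heads):
    # q: [batch, seq, embed_dim]; k, v: [batch, seq, head_dim] (one shared KV head).
    # Sharing one KV head shrinks the KV cache by a factor of num_heads at decode time.
    batch, seq, embed_dim = q.shape
    head_dim = embed_dim // num_heads

    q = q.view(batch, seq, num_heads, head_dim).transpose(1, 2)      # [B, H, S, D]
    scores = torch.einsum("bhsd,btd->bhst", q, k) / head_dim ** 0.5  # [B, H, S, T]
    probs = scores.softmax(dim=-1)
    out = torch.einsum("bhst,btd->bhsd", probs, v)                   # [B, H, S, D]
    return out.transpose(1, 2).reshape(batch, seq, embed_dim)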
class cerebras.modelzoo.layers.RelativePositionEmbeddingLayer(*args, **kwargs)[source]#
Bases: torch.nn.Module
Relative Position Embedding Layer
Parameters
- num_heads (int) – number of attention heads.
- relative_attention_bias (Tensor) – Tensor with relative attention weights. Shape: [num_relative_attention_buckets, num_heads]. Defaults to None.
- num_relative_attention_buckets (int) – Number of buckets used to calculate relative position bias. Default: 32.
- max_relative_positions (int) – The maximum relative distance used when calculating relative position buckets. See the relative_position_bucket docs for more details. Default: 128.
- bidirectional_relative_attention (bool) – Whether attention is bidirectional.
- allow_negative_buckets (bool) – If enabled, position buckets will be both positive and negative (as required by certain models like DeBERTa). Default: False.
- relative_attn_bias_initializer (str) – Relative attention bias initializer. Defaults to xavier_uniform.
Returns
Relative position bias, to be used in attention masking
Return type
position_bias (Tensor)
forward(seq_length, key_length, past_kv=None)[source]#
Return the position bias.
Parameters
- seq_length (int) – the length of query tokens.
- key_length (int) – the length of key tokens.
Returns
Position bias tensor with shape [num_heads, query_length, key_length]
static relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128, allow_negative_buckets=False)[source]#
Translate a relative position to a bucket number for relative attention. The relative position is defined as memory_position - query_position, i.e., the distance in tokens from the attending position to the attended-to position.
If bidirectional_relative_attention = False, then positive relative positions are invalid. Smaller buckets are used for small absolute relative positions and larger buckets for larger absolute relative positions. All relative positions >= max_distance map to the same bucket, and all relative positions <= -max_distance map to the same bucket. This should allow for more graceful generalization to longer sequences than the model has been trained on.
Parameters
- relative_position (Tensor) – Tensor with relative positions.
- bidirectional (bool) – Whether attention is bidirectional.
- num_buckets (int) – Number of buckets for relative positions.
- max_distance (int) – Used to calculate relative position buckets.
- allow_negative_buckets (bool) – If enabled, position buckets will be both positive and negative (as required by certain models like DeBERTa). Default: False.
Returns
A Tensor with the same shape as relative_position, containing int32 values in the range [0, num_relative_attention_buckets).
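A minimal sketch of the bucketing step (illustrative; it calls the static method with the signature documented above and hypothetical lengths):

import torch
from cerebras.modelzoo.layers import RelativePositionEmbeddingLayer

query_length, key_length = 4, 4
context_position = torch.arange(query_length)[:, None]
memory_position = torch.arange(key_length)[None, :]
relative_position = memory_position - context_position  # memory_position - query_position

buckets = RelativePositionEmbeddingLayer.relative_position_bucket(
    relative_position, bidirectional=True, num_buckets=32, max_distance=128
)
# buckets has the same shape as relative_position ([4, 4]) and holds indices in
# [0, num_buckets); these index the learned relative_attention_bias table.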
class cerebras.modelzoo.layers.Transformer(*args, **kwargs)[source]#
Bases: torch.nn.Module
A transformer model. The user is able to modify the attributes as needed. The architecture is based on the paper "Attention Is All You Need". Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users can build the BERT (https://arxiv.org/abs/1810.04805) model with the corresponding parameters.
Parameters
- d_model (int) – the number of expected features in the encoder/decoder inputs (default=512).
- nhead (int) – the number of heads in the multihead attention models (default=8).
- num_encoder_layers (int) – the number of sub-encoder-layers in the encoder (default=6).
- num_decoder_layers (int) – the number of sub-decoder-layers in the decoder (default=6).
- dim_feedforward (int) – the dimension of the feedforward network model (default=2048).
- dropout (float) – the dropout value (default=0.1).
- activation (Union[str, Callable[[torch.Tensor], torch.Tensor]]) – the activation function of the encoder/decoder intermediate layer; can be a string ("relu" or "gelu") or a unary callable. Default: gelu.
- custom_encoder (Optional[Any]) – custom encoder (default=None).
- custom_decoder (Optional[Any]) – custom decoder (default=None).
- layer_norm_eps (float) – the eps value in layer normalization components (default=1e-5).
- batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
- norm_first (bool) – if True, encoder and decoder layers will perform LayerNorms before other attention and feedforward operations, otherwise after. Default: False (after).
- attention_type – Should be in ["scaled_dot_product", "dot_product"].
- use_projection_bias_in_attention – Add bias to the Q, K, V projections in the attention layer. Defaults to False.
- use_ffn_bias_in_attention – Add bias in the concluding FFN in the attention layer. Defaults to False.
- use_ffn_bias – Add bias in all dense layers of the decoder's ffn sublayer.
- attention_initializer – Attention layer initializer. Defaults to "xavier_uniform".
- ffn_initializer – FFN layer initializer. Defaults to "xavier_uniform".
- device (optional) – Device to create the model parameters on; can be a CUDA device or CS device.
Examples::
>>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
>>> src = torch.rand((10, 32, 512))
>>> tgt = torch.rand((20, 32, 512))
>>> out = transformer_model(src, tgt)
Note: A full example applying the nn.Transformer module to a word language model is available at https://github.com/pytorch/examples/tree/master/word_language_model
forward(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]#
Take in and process masked source/target sequences.
Parameters
- src (torch.Tensor) – the sequence to the encoder (required).
- tgt (torch.Tensor) – the sequence to the decoder (required).
- src_mask (Optional[torch.Tensor]) – the additive mask for the src sequence (optional).
- tgt_mask (Optional[torch.Tensor]) – the additive mask for the tgt sequence (optional).
- memory_mask (Optional[torch.Tensor]) – the additive mask for the encoder output (optional).
- src_key_padding_mask (Optional[torch.Tensor]) – the ByteTensor mask for src keys per batch (optional).
- tgt_key_padding_mask (Optional[torch.Tensor]) – the ByteTensor mask for tgt keys per batch (optional).
- memory_key_padding_mask (Optional[torch.Tensor]) – the ByteTensor mask for memory keys per batch (optional).
Shape:
- src: (S, E) for unbatched input, (S, N, E) if batch_first=False or (N, S, E) if batch_first=True.
- tgt: (T, E) for unbatched input, (T, N, E) if batch_first=False or (N, T, E) if batch_first=True.
- src_mask: (S, S) or (N⋅num_heads, S, S).
- tgt_mask: (T, T) or (N⋅num_heads, T, T).
- memory_mask: (T, S).
- src_key_padding_mask: (S) for unbatched input, otherwise (N, S).
- tgt_key_padding_mask: (T) for unbatched input, otherwise (N, T).
- memory_key_padding_mask: (S) for unbatched input, otherwise (N, S).
Note: [src/tgt/memory]_mask ensures that position i is allowed to attend the unmasked positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight. [src/tgt/memory]_key_padding_mask specifies elements in the key to be ignored by the attention. If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions will be unchanged. If a BoolTensor is provided, positions with the value True will be ignored while positions with the value False will be unchanged.
- output: (T, E) for unbatched input, (T, N, E) if batch_first=False or (N, T, E) if batch_first=True.
Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.
where S is the source sequence length, T is the target sequence length, N is the batch size, and E is the feature number.
Examples
>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
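The mask semantics in the note above can be exercised with a short sketch (illustrative; it reuses the plain nn.Transformer from the example, and the BoolTensor convention is the one stated in the note):

import torch
import torch.nn as nn

S, T, N, E = 10, 20, 32, 512
transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
src, tgt = torch.rand(S, N, E), torch.rand(T, N, E)  # batch_first=False layout

# Causal (T, T) BoolTensor mask: True marks positions that may NOT be attended.
tgt_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

output = transformer_model(src, tgt, tgt_mask=tgt_mask)
# output: (T, N, E), matching the target length as described in the note above.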
class cerebras.modelzoo.layers.TransformerDecoder(*args, **kwargs)[source]#
Bases: torch.nn.Module
TransformerDecoder is a stack of N decoder layers
Parameters
- decoder_layer – an instance of the TransformerDecoderLayer() class (required).
- num_layers – the number of sub-decoder-layers in the decoder (required).
- norm – the layer normalization component (optional).
Examples::
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
>>> memory = torch.rand(10, 32, 512)
>>> tgt = torch.rand(20, 32, 512)
>>> out = transformer_decoder(tgt, memory)
forward(tgt, memory=None, tgt_mask=None, sparse_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, self_attn_position_bias=None, cross_attn_position_bias=None, rotary_position_embedding_helper=None, past_kv=None, cache_present_kv=False, extract_layer_idx=None, expert_hash_idx=None, **extra_args)[source]#
Pass the inputs (and mask) through the decoder layer in turn.
Parameters
- tgt (torch.Tensor) – the sequence to the decoder (required).
- memory (Optional[torch.Tensor]) – the sequence from the last layer of the encoder (optional).
- tgt_mask (Optional[torch.Tensor]) – the mask for the tgt sequence (optional).
- memory_mask (Optional[torch.Tensor]) – the mask for the memory sequence (optional).
- tgt_key_padding_mask (Optional[torch.Tensor]) – the mask for the tgt keys per batch (optional).
- memory_key_padding_mask (Optional[torch.Tensor]) – the mask for the memory keys per batch (optional).
- self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing the position bias to apply in self-attention; can be obtained from relative or alibi position embeddings.
- cross_attn_position_bias (Optional[torch.Tensor]) – similar to self_attn_position_bias, this is the tensor containing the position bias to apply in cross-attention.
- rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
- past_kv (Optional[List[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]]]) – Past keys and values for each of the decoder layers (optional).
- cache_present_kv (bool) – Specifies whether the present keys and values must be cached and returned (optional).
- extract_layer_idx (Optional[int]) – (inclusive) layer index in the range [0, self.num_layers) (zero-indexed). Applies decoder layers up to (and including) extract_layer_idx instead of all decoder layers. For example, extract_layer_idx=3 would run the forward pass from decoder_block_0 to decoder_block_3 and return the outputs from decoder_block_3. If extract_layer_idx is None and norm is not None, the returned output is the final decoder block's output passed through norm.
- expert_hash_idx (Optional[torch.Tensor]) – Optional tensor for mixture-of-experts models with hash-based routing. The tensor contains the expert ID for each token in the batch, based on a hashing calculation.
Shape:
see the docs in Transformer class.
class cerebras.modelzoo.layers.TransformerDecoderLayer(*args, **kwargs)[source]#
Bases: torch.nn.Module
TransformerDecoderLayer is made up of self-attn, multihead-attn and feedforward network. This standard decoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.
Parameters
- d_model (int) – the number of expected features in the input (required).
- nhead (int) – the number of heads in the multihead-attention models (required).
- dim_feedforward (int) – the dimension of the feedforward network model (default=2048).
- dropout (float) – the dropout value (default=0.1).
- activation (Union[str, Callable[[torch.Tensor], torch.Tensor]]) – the activation function of the intermediate layer; can be a string ("relu" or "gelu") or a unary callable. Default: gelu.
- layer_norm_eps (float) – the eps value in layer normalization components (default=1e-5).
- batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
- norm_layer (Type[torch.nn.Module]) – the normalization class that will be used before/after FF layers (default=nn.LayerNorm).
- norm_first (bool) – if True, layer norm is done prior to the self-attention, multihead-attention, and feedforward operations, respectively; otherwise it is done after. Default: False (after).
- attention_dropout_rate (Optional[float]) – Attention dropout rate. If None, defaults to dropout.
- attention_softmax_fp32 (Optional[bool]) – Use FP32 softmax in the attention block.
- use_projection_bias_in_attention – Add bias to the Q, K, V projections in the attention layer. Defaults to False.
- attention_type – Should be in ["scaled_dot_product", "dot_product"].
- scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (= hidden / d_head) instead of sqrt(d).
- attention_logit_alpha (float) – Scales the QK^T dot product. Used to stabilize logits in muP training.
- attention_inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to d_model.
- add_cross_attention (bool) – If True, adds a cross-attention layer between encoder and decoder; otherwise, only self-attention is used in the decoder (GPT-style models should set this to False).
- use_ffn_bias_in_attention – Add bias in the concluding FFN in the attention layer. Defaults to False.
- use_ffn_bias – Add bias in all dense layers of the decoder's ffn sublayer.
- attention_initializer – Attention layer initializer. Defaults to "xavier_uniform".
- attention_q_initializer – Query projection kernel initializer. If not specified, the query projection is initialized via attention_initializer.
- attention_output_layer_initializer – Attention output layer projection initializer. If not specified, the output projection is initialized via attention_initializer.
- ffn_initializer – FFN layer initializer. Defaults to "xavier_uniform".
- ffn_output_layer_initializer – If not None, initialize the last FFN layer with this initializer. Defaults to None.
- use_ff_layer1_dropout (bool) – If True, dropout will be enabled after the first feed forward layer. Default: True.
- use_ff_layer2_dropout (bool) – If True, dropout will be enabled after the second feed forward layer. Default: True.
- ffn_dropout_rate (Optional[float]) – Controls the dropout rate of the FFN's first layer. If None, defaults to dropout.
- moe_params – A dict of MoE params including num_experts, top_k, and load_balancing_loss_coef.
Examples
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
>>> memory = torch.rand(32, 10, 512)
>>> tgt = torch.rand(32, 20, 512)
>>> out = decoder_layer(tgt, memory)
forward(tgt, memory=None, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, rotary_position_embedding_helper=None, past_kv=None, cache_present_kv=False, self_attn_position_bias=None, cross_attn_position_bias=None, layer_idx=None, expert_hash_idx=None, **extra_args)[source]#
Pass the inputs (and mask) through the decoder layer.
Parameters
- tgt (torch.Tensor) – the sequence to the decoder layer (required).
- memory (Optional[torch.Tensor]) – the sequence from the last layer of the encoder (optional).
- tgt_mask (Optional[torch.Tensor]) – the mask for the tgt sequence (optional).
- memory_mask (Optional[torch.Tensor]) – the mask for the memory sequence (optional).
- tgt_key_padding_mask (Optional[torch.Tensor]) – the mask for the tgt keys per batch (optional).
- memory_key_padding_mask (Optional[torch.Tensor]) – the mask for the memory keys per batch (optional).
- rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
- past_kv (Optional[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]]) – Past keys and values for the self-attention and (if applicable) cross-attention modules. Key/value tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads] (optional).
- cache_present_kv (bool) – Specifies whether the present keys and values must be cached and returned. Needed to speed up computation when the decoder is called within an autoregressive loop (optional).
- self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing the position bias to apply in self-attention; can be obtained from relative or alibi position embeddings.
- expert_hash_idx (Optional[torch.Tensor]) – tensor containing mixture-of-experts expert selection indices for each token in the batch. Only used with MoE when hash-based routing is enabled (optional).
Shape:
see the docs in Transformer class.
class cerebras.modelzoo.layers.TransformerEncoder(*args, **kwargs)[source]#
Bases: torch.nn.Module
TransformerEncoder is a stack of N encoder layers
Parameters
- encoder_layer – an instance of the TransformerEncoderLayer() class (required).
- num_layers – the number of sub-encoder-layers in the encoder (required).
- norm – the layer normalization component (optional).
- enable_nested_tensor – if True, the input will automatically be converted to a nested tensor (and converted back on output). This improves the overall performance of TransformerEncoder when the padding rate is high. Default: False (disabled).
Examples::
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
>>> src = torch.rand(10, 32, 512)
>>> out = transformer_encoder(src)
forward(src, mask=None, src_key_padding_mask=None, rotary_position_embedding_helper=None, self_attn_position_bias=None, extract_layer_idx=None, **extra_args)[source]#
Pass the input through the encoder layers in turn.
Parameters
- src (torch.Tensor) – the sequence to the encoder (required).
- mask (Optional[torch.Tensor]) – the mask for the src sequence (optional).
- src_key_padding_mask (Optional[torch.Tensor]) – the mask for the src keys per batch (optional).
- rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
- self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing the position bias to apply in self-attention; can be obtained from relative or alibi position embeddings.
- extract_layer_idx (Optional[int]) – (inclusive) layer index in the range [0, self.num_layers) (zero-indexed). Applies encoder layers up to (and including) extract_layer_idx instead of all encoder layers. For example, extract_layer_idx=3 would run the forward pass from encoder_block_0 to encoder_block_3 and return the outputs from encoder_block_3. If extract_layer_idx is None and norm is not None, the returned output is the final encoder block's output passed through norm.
Shape:
see the docs in Transformer class.
class cerebras.modelzoo.layers.TransformerEncoderLayer(*args, **kwargs)[source]#
Bases: torch.nn.Module
TransformerEncoderLayer is made up of self-attn and feedforward network. This standard encoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.
Parameters
- d_model (int) – the number of expected features in the input (required).
- nhead (int) – the number of heads in the multihead attention models (required).
- dim_feedforward (int) – the dimension of the feedforward network model (default=2048).
- dropout (float) – the dropout value (default=0.1).
- activation (Union[str, Callable[[torch.Tensor], torch.Tensor]]) – the activation function of the intermediate layer; can be a string ("relu" or "gelu") or a unary callable. Default: gelu.
- layer_norm_eps (float) – the eps value in layer normalization components (default=1e-5).
- batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
- norm_layer (Type[torch.nn.Module]) – the normalization class that will be used before/after FF layers (default=nn.LayerNorm).
- norm_first (bool) – if True, layer norm is done prior to the attention and feedforward operations, respectively; otherwise it is done after. Default: False (after).
- attention_dropout_rate (Optional[float]) – Attention dropout rate. If None, defaults to dropout.
- use_projection_bias_in_attention – Add bias to the Q, K, V projections in the attention layer. Defaults to False.
- attention_type – Should be in ["scaled_dot_product", "dot_product"].
- scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (= hidden / d_head) instead of sqrt(d).
- attention_softmax_fp32 (Optional[bool]) – Use FP32 softmax in the attention block.
- attention_inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to d_model.
- add_cross_attention – If True, adds a cross-attention layer between encoder and decoder; otherwise, only self-attention is used in the decoder (GPT-style models should set this to False).
- use_ffn_bias_in_attention – Add bias in the concluding FFN in the attention layer. Defaults to False.
- use_ffn_bias – Add bias in all dense layers of the decoder's ffn sublayer.
- attention_initializer – Attention layer initializer. Defaults to "xavier_uniform".
- attention_q_initializer – Query projection kernel initializer. If not specified, the query projection is initialized via attention_initializer.
- attention_output_layer_initializer – Attention output layer projection initializer. If not specified, the output projection is initialized via attention_initializer.
- ffn_initializer – FFN layer initializer. Defaults to "xavier_uniform".
- ffn_output_layer_initializer – If not None, initialize the last FFN layer with this initializer. Defaults to None.
- use_ff_layer1_dropout (bool) – If True, dropout will be enabled after the first feed forward layer. Default: True.
- use_ff_layer2_dropout (bool) – If True, dropout will be enabled after the second feed forward layer. Default: True.
- ffn_dropout_rate (Optional[float]) – Controls the dropout rate of the FFN's first layer. If None, defaults to dropout.
Example
When batch_first is True:
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
>>> src = torch.rand(32, 10, 512)
>>> out = encoder_layer(src)
forward(src, src_mask=None, src_key_padding_mask=None, rotary_position_embedding_helper=None, self_attn_position_bias=None, **extra_args)[source]#
Pass the input through the encoder layer.
Parameters
- src (torch.Tensor) – the sequence to the encoder layer (required).
- src_mask (Optional[torch.Tensor]) – the mask for the src sequence (optional).
- src_key_padding_mask (Optional[torch.Tensor]) – the mask for the src keys per batch (optional).
- rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
- self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing the position bias to apply in self-attention; can be obtained from relative or alibi position embeddings.
Shape:
see the docs in Transformer class.