# Cerebras Training

## Docs

- [Cerebras Job Scheduling and Monitoring](https://training-docs.cerebras.ai/rel-2.5.0/cluster-monitoring/cerebras-job-scheduling-and-monitoring.md): Learn more about how jobs are scheduled and monitored on Cerebras systems.
- [Job Monitoring CLI](https://training-docs.cerebras.ai/rel-2.5.0/cluster-monitoring/cerebras-job-scheduling-and-monitoring/cli-for-job-monitoring-csctl.md): Learn how to use the csctl CLI tool to manage and monitor jobs.
- [Grafana Dashboards](https://training-docs.cerebras.ai/rel-2.5.0/cluster-monitoring/cerebras-job-scheduling-and-monitoring/cluster-monitoring-with-grafana.md): Use Grafana to monitor your Cerebras cluster.
- [Job Priority](https://training-docs.cerebras.ai/rel-2.5.0/cluster-monitoring/cerebras-job-scheduling-and-monitoring/job-priority.md): Learn how jobs are prioritized on a Cerebras cluster.
- [Cerebras Wafer Scale Cluster](https://training-docs.cerebras.ai/rel-2.5.0/concepts/cerebras-wafer-scale-cluster.md)
- [Weight Streaming Execution](https://training-docs.cerebras.ai/rel-2.5.0/concepts/weight-streaming-execution.md)
- [Writing a Custom Training Loop](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop.md): Learn how to write a custom training loop for a simple, fully connected model on the MNIST dataset.
- [Efficient Weight Initialization](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop/efficient-weight-initialization.md): Learn how to enhance weight initialization efficiency and speed for large-scale models using advanced Cerebras techniques.
- [Evaluation Metrics](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop/evaluation-metrics.md): Learn to use and create metrics in Cerebras for evaluating PyTorch models, including predefined metrics like AccuracyMetric and custom metrics tailored to specific evaluation needs.
- [Gradient Scaling](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop/gradient-scaling.md)
- [Learning Rate Scheduling](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop/learning-rate-scheduling.md): Learn how to write a custom learning rate scheduler.
- [Limitations of PyTorch on Cerebras](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop/limitations-of-pytorch-on-cerebras.md)
- [Per Layer Precision Optimization Level](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop/per-layer-precision-optimization-level.md)
- [Profiling the Executor](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop/profiling-the-executor.md)
- [Restartable Dataloaders](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop/restartable-dataloaders.md)
- [Saving and Loading Checkpoints](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop/saving-loading-checkpoints.md)
- [Sparsifying Models](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop/sparsifying-models.md): Learn strategies for integrating sparsity into Cerebras models to optimize performance and computational efficiency across neural network architectures.
- [Step Closures](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop/step-closures.md)
- [Writing Custom Optimizers](https://training-docs.cerebras.ai/rel-2.5.0/cs-torch/writing-a-custom-training-loop/writing-custom-optimizers.md): Learn how to write a Cerebras-compliant custom optimizer.
- [Define Environment Variables For Input Workers](https://training-docs.cerebras.ai/rel-2.5.0/fundamentals/define-environment-variables-for-input-workers.md): On a Wafer-Scale Cluster, the input pipeline runs on input worker nodes, which are separate processes started by the Appliance on different CPU nodes within the cluster.
- [Import User Dependencies In Cerebras](https://training-docs.cerebras.ai/rel-2.5.0/fundamentals/import-user-dependencies-in-cerebras.md): If you are developing your own data loader, you might need to provide specific dependency packages in the virtual Python environments to support your data loader functions.
- [Configure Notifications](https://training-docs.cerebras.ai/rel-2.5.0/fundamentals/job-notifications.md): Learn how to configure email, Slack, or PagerDuty notifications for your jobs.
- [Kernel Autogeneration with Autogen](https://training-docs.cerebras.ai/rel-2.5.0/fundamentals/kernel-autogeneration-with-autogen.md): To optimize model performance on Cerebras hardware, we use a mix of pre-written and AutoGen (automatically generated) kernels.
- [Launch a Job](https://training-docs.cerebras.ai/rel-2.5.0/fundamentals/launch-your-job.md): Learn how to launch a job on a Cerebras cluster.
- [Cerebras Cluster Settings](https://training-docs.cerebras.ai/rel-2.5.0/fundamentals/launch-your-job/cerebras-cluster-settings.md)
- [Managing Cluster Access Controls](https://training-docs.cerebras.ai/rel-2.5.0/fundamentals/managing-cluster-access-controls.md): When multiple groups and users share a Cerebras Wafer-Scale cluster within your organization, specific access-control requirements may arise.
- [Measure Model Throughput](https://training-docs.cerebras.ai/rel-2.5.0/fundamentals/measure-throughput-of-your-model.md): Learn how to measure the training throughput of your model to evaluate performance and optimize efficiency.
- [Fine-Tune Your First Model](https://training-docs.cerebras.ai/rel-2.5.0/getting-started/fine-tune-your-first-model.md): Follow this guide to fine-tune your first model on a Cerebras system.
- [Get Started with Cerebras](https://training-docs.cerebras.ai/rel-2.5.0/getting-started/overview.md): Learn how to start training models on the Cerebras Wafer-Scale Cluster.
- [Pretrain Your First Model](https://training-docs.cerebras.ai/rel-2.5.0/getting-started/pre-train-your-first-model.md): Follow this guide to pretrain your first model on a Cerebras system.
- [Setup and Installation](https://training-docs.cerebras.ai/rel-2.5.0/getting-started/setup-and-installation.md): Learn how to install Model Zoo.
- [Automatic Job Restart](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/automatic-job-restart.md): Learn how to configure automatic job restart in your Trainer config.
- [CLI Overview](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/cli-overview.md): Learn how to use the Model Zoo CLI.
- [Data Deduplication Pipeline](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/data-deduplication-pipeline.md): Learn how to set up and run a deduplication pipeline on the Cerebras platform.
- [Custom Tokenizer](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/data-preprocessing/custom-tokenizer.md): Learn how to initialize a custom tokenizer to better handle domain-specific vocabulary or capture nuances that standard tokenizers miss.
- [Dataset Splitting and Preprocessing](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/data-preprocessing/data-context-splits.md): Learn how to split and preprocess datasets for LLM training.
- [Configure Data Preprocessing](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/data-preprocessing/data-preprocessing.md): Learn how to preprocess your data into HDF5 format for pretraining, fine-tuning, and custom processing tasks.
- [Configure Input Data](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/data-preprocessing/input-data-configuration.md): Learn how to configure your input data for preprocessing, whether you're working with a single directory of data or organizing large datasets into subsets.
- [On-The-Fly Data Processing](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/data-preprocessing/on-the-fly-data-processing.md): Learn how to enable on-the-fly (OTF) data preprocessing during training and/or evaluation.
- [Read Hooks](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/data-preprocessing/read-hooks.md): This guide details various read hooks you can use to convert different types of raw input data into HDF5 format for machine learning tasks on Cerebras Systems.
- [Write a Custom Read Hook](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/data-preprocessing/semantic-regions.md): Learn how to create custom read hooks for processing different data formats into semantic data arrays, enabling flexible data preprocessing for AI model training.
- [Token Generators](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/data-preprocessing/token-generators.md): Learn about supported Token Generators for data preprocessing.
- [Visualization and Debugging](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/data-preprocessing/visualization-and-debugging.md): Learn how to use our TokenFlow tool to visualize and debug your preprocessed data.
- [Creating Custom Dataloaders](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/dataloaders/creating-custom-dataloaders.md): Learn how to create and optimize custom PyTorch dataloaders for Cerebras systems.
- [Dataloaders for PyTorch](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/dataloaders/dataloaders-for-pytorch.md)
- [Model Zoo Config Classes](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/model-zoo-config-classes.md): Learn about the model configuration classes in the Cerebras Model Zoo, which are used to manage and customize model settings for efficient training and deployment.
- [Model Zoo Registry](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/model-zoo-registry.md): The Model Zoo registry serves as the central source of truth for all model definitions and their associated data processors.
- [Usage Examples](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/model-zoo-usage-examples.md): See usage examples for common operations within Model Zoo.
- [Backend](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/trainer-components/backend.md): Learn how to set up a backend or device for the Trainer class.
- [Checkpointing](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/trainer-components/checkpointing.md)
- [Customizing the Trainer with Callbacks](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/trainer-components/customizing-the-trainer-with-callbacks.md)
- [Defer Weight Initialization](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/trainer-components/defer-weight-initialization.md): By deferring the initialization of model weights, you can significantly reduce the time-to-first-loss, leading to faster iteration times and a more efficient training process.
- [Logging](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/trainer-components/logging.md)
- [Loop](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/trainer-components/loop.md): Learn how to configure the training and validation loops of the Trainer using two `LoopCallback` subclasses.
- [Model](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/trainer-components/model.md): Learn how to pass a model to the Trainer class.
- [Model Directory](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/trainer-components/model-directory.md)
- [Numeric Precision](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/trainer-components/numeric-precision.md)
- [Optimizer And Scheduler](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/trainer-components/optimizer-and-scheduler.md)
- [Performance Flags](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/trainer-components/performance-flags.md)
- [Reproducibility](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/components/trainer-components/reproducibility.md): Reproducibility is an essential component of training ML models.
- [Downstream Validation Using BigCode Eval Harness](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/core-workflows/downstream-validation-using-bigcode-eval-harness.md)
- [Downstream Validation Using Eleuther Eval Harness](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/core-workflows/downstream-validation-using-eleuther-eval-harness.md): Learn how to run downstream validation using EleutherAI’s Evaluation Harness (EEH).
- [Fine Tuning With Validation](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/core-workflows/fine-tuning-with-validation.md): Learn how to configure and execute a fine-tuning run with upstream validation.
- [Pretraining With Downstream Validation](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/core-workflows/pretraining-with-downstream-validation.md): Learn how to configure downstream validation as part of your pretraining workflow.
- [Pretraining With Upstream Validation](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/core-workflows/pretraining-with-upstream-validation.md): Learn how to configure and execute a pretraining run with upstream validation.
- [Quickstart: Preprocess Your Data](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/core-workflows/quickstart-guide-for-data-preprocessing.md): Learn how to preprocess text-only and multimodal datasets into HDF5 format for pretraining and fine-tuning.
- [Summarize Scalars And Tensors](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/core-workflows/summarize-scalars-and-tensors.md)
- [Converter Tool](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/migration/convert-checkpoints-and-model-configs/convert-checkpoints-and-model-configs.md): Learn how to convert checkpoints and config files between Model Zoo and other external code repositories using our converter tool.
- [Convert CS Checkpoints for GPUs](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/migration/convert-checkpoints-and-model-configs/work-with-cerebras-checkpoints.md): Learn how to work with Cerebras-format checkpoints, including how to load, convert, and reuse them in your training workflows.
- [Convert From Hugging Face](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/migration/port-a-hugging-face-model-to-cerebras-model-zoo.md): Learn how to convert Hugging Face models to the Cerebras Model Zoo format, enabling seamless deployment and optimization on Cerebras's advanced hardware.
- [Convert to Hugging Face](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/migration/port-a-trained-and-fine-tuned-model-to-hugging-face.md): Learn how to use the Model Zoo CLI to convert to Hugging Face.
- [Port PyTorch Models To Cerebras](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/migration/porting-pytorch-models-to-cerebras.md): Learn how to convert existing PyTorch models to run on Cerebras systems.
- [S3 Checkpointing](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/migration/s3-checkpoints.md): Save, load, and manage checkpoints using S3-compatible storage.
- [Intro to Model Zoo](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/model-zoo-overview.md): Get an overview of the Cerebras Model Zoo, including model portability, modules, and updated directory paths for enhanced usability.
- [LLaVA](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/multimodal/llava.md): Multimodal model that connects a vision encoder to a language model through instruction tuning on GPT-4-generated image-text data.
- [Multimodal Simple](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/multimodal/multimodal-simple.md): Cerebras' model library for implementing multimodal models.
- [BLOOM](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/bloom.md): Multilingual decoder-only language model with ALiBi positional embeddings, designed to generalize across 46 natural and 13 programming languages.
- [DPO](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/dpo.md): A simple and stable method for fine-tuning language models using human or synthetic preference data without reinforcement learning.
- [DPR](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/dpr.md): Dense Passage Retrieval (DPR) model for open-domain question answering using contrastive loss between question and passage encoders.
- [ESM-2](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/esm2.md): Protein language model trained on UniRef50, using a masked language modeling objective to learn evolutionary and structural properties of proteins.
- [Falcon](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/falcon.md): Series of decoder-only transformer models by TII, available in 7B, 40B, and 180B parameter sizes.
- [Gemma 2](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/gemma.md): Decoder-only language models by Google DeepMind, using interleaved attention and GQA for high-quality performance at practical scale.
- [GPT-3](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/gpt3.md): A decoder-only transformer language model, scaled to billions of parameters, trained on autoregressive next-token prediction with support for µP scaling and Cerebras-optimized workflows.
- [GPT-J & GPT-NeoX](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/gptj-neo.md): Decoder-only language models by EleutherAI, designed for high-throughput training and capable zero-shot performance on a range of natural language tasks.
- [Jais](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/jais.md): Decoder-only language models optimized for Arabic and English, developed by Inception, MBZUAI, and Cerebras.
- [LLaMA](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/llama.md): Series of decoder-only transformer LLMs from Meta.
- [Mistral](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/mistral.md): Decoder-only transformer models by Mistral, using sliding window attention and grouped-query attention for fast, high-quality language generation.
- [Mixtral](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/mixtral.md): Sparse Mixture of Experts models using routing and expert specialization for scalable language modeling.
- [SantaCoder](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/santacoder.md): Decoder-only language model for code generation by BigCode, trained on Java, JavaScript, and Python with support for fill-in-the-middle and multi-query attention.
- [StarCoder](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/starcoder.md): Decoder-only language models for code generation by BigCode, trained on permissively licensed code with support for fill-in-the-middle and multi-query attention.
- [T5](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/t5.md): Text-to-text transformer model trained on the C4 dataset using a denoising objective, capable of performing a wide range of NLP tasks in a unified format.
- [Transformer](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/nlp/transformer.md): Implementation of the original Transformer architecture introduced in "Attention Is All You Need".
- [DINOv2](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/vision/dino-v2.md): Self-supervised vision model that learns general-purpose visual features without labeled data, excelling in diverse image and pixel-level tasks.
- [Diffusion Transformer](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/vision/dit.md): Vision model based on the Diffusion Transformer architecture.
- [ViT](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/models/vision/vit.md): Implementation of Vision Transformers (ViT) for image classification on ImageNet-1K.
- [Trainer Configuration](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/trainer-configuration-overview.md): Learn how to set up and customize the Trainer using a YAML configuration file.
- [Convert Legacy to Trainer YAML](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/trainer-configuration-overview/correspondance-from-legacy-to-trainer.md): Learn how to convert legacy YAMLs into the new Trainer YAML configuration.
- [Trainer Essentials](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/trainer-overview.md): Learn how to use the Model Zoo Trainer to simplify large-scale model training on the Cerebras Wafer-Scale Cluster.
- [Dynamic Loss Scaling](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/dynamic-loss-scaling.md): Learn how to enable dynamic loss scaling to improve stability and performance.
- [Extend Context Length Using Position Interpolation](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/extend-context-length-using-position-interpolation.md)
- [Instruction Fine-Tuning for LLMs](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/instruction-fine-tuning-for-llms.md)
- [Advanced: Understanding and Manually Controlling MBS](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/microbatching-advanced.md): Learn how Micro Batch Size (MBS) works under the hood, how the platform picks or overrides it, and how to optimize it manually.
- [Beginner: Automatic Microbatching](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/microbatching-beginner.md): Learn how to set the global batch size and choose a mode to find the optimal microbatch size.
- [Automatic Microbatching](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/optimize-performance-with-automatic-microbatching.md): Learn how to optimize performance by configuring the Trainer with automatic microbatching.
- [Train a Model with Weight Sparsity](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/train-a-model-with-weight-sparsity.md): Learn how to configure the Trainer class with weight sparsity.
- [Train LLMs with μP: Legacy Parameters](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/train-an-llm-using-maximal-update-parameterization-with-legacy-params.md): Learn how to train an LLM with maximal update parameterization (μP).
- [Train an LLM with a Large or Small Context Window](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/train-an-llm-with-a-large-or-small-context-window.md)
- [Configure μP for BERT Pretrain (Beta)](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/training-an-llm-using-maximal-update-parameterization/configure-mp-for-bert-pretrain--beta.md)
- [Configure μP for GPT Style Models](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/training-an-llm-using-maximal-update-parameterization/configure-mp-for-gpt-style-models.md)
- [Configure μP for T5 (Beta)](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/training-an-llm-using-maximal-update-parameterization/configure-mp-for-t5--beta.md)
- [Train LLMs with μP](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/training-an-llm-using-maximal-update-parameterization/overview.md): Learn how to train an LLM with maximal update parameterization (μP).
- [Training With Number Of Tokens Loss Scaling](https://training-docs.cerebras.ai/rel-2.5.0/model-zoo/tutorials/training-with-number-of-tokens-loss-scaling.md)
- [Release Notes](https://training-docs.cerebras.ai/rel-2.5.0/release-notes/release-notes-tabs.md): Stay up to date with the latest features, enhancements, bug fixes, and improvements.
- [Previous Releases](https://training-docs.cerebras.ai/rel-2.5.0/support/previous-releases.md): Access documentation for previous releases using the links below.
- [Common Issues and Workarounds](https://training-docs.cerebras.ai/rel-2.5.0/support/troubleshooting.md): Learn how to fix common errors.
- [Out of Memory Errors and System Resources](https://training-docs.cerebras.ai/rel-2.5.0/support/troubleshooting/out-of-memory-errors-and-system-resources.md): Learn how to identify when the resources needed are larger than the resources available.

## OpenAPI Specs

- [openapi](https://training-docs.cerebras.ai/api-reference/openapi.yaml)

## Optional

- [Community](https://discord.com/invite/ZqvYS2e2rY)
- [API Reference](https://training-api.cerebras.ai)