rel-2.4.0
Troubleshooting
Cannot load Cerebras checkpoints in GPUs
Workaround
Custom PT training script spawns multiple compile jobs
Observed Error
Explanation
Workaround
Loss compilation issues with Autogen
Custom loss functions with AutoGen
Improving loss function performance
Error parsing metadata
Observed Error
Explanation
Workaround
Error Receiving Activation
cerebras.appliance.errors.ApplianceUnknownError: Ran into error while receiving activation tensor <custom-call …>
Failed mount directory during execution
Observed Error
Workaround
Failing to automatically load checkpoints
Explanation
Workaround
Failure to trace due to functionalization error
Observed Error
Explanation
Workaround
Input Starvation
Out of memory errors and system resources
Determining if your job is queued
Determining if job failed because of an OOM error
Determining if job failed because the system could not fit the requested memory
Troubleshooting OOM errors
Model is too large to fit in memory
Observed Error
Causes and Possible Solutions
ModuleNotFoundError
ModuleNotFoundError: No module named <'_bz2', '_sqlite3'>
ModuleNotFoundError: No module named <…>
Numerical issues
Observed Error
Explanation
Workaround
Throughput spike after saving checkpoints
Training fails when logged in as root
Observed Error
Explanation
Vocabulary Size Troubleshooting
Large vocabulary size
Small vocabulary size