    Troubleshooting

    • Cannot load Cerebras checkpoints in GPUs

      • Work around
    • Custom PT training script spawns multiple compile jobs

      • Observed Error

      • Explanation

      • Work around

    • Loss compilation issues with Autogen

      • Custom loss functions with AutoGen

      • Improving loss function performance

    • Error parsing metadata

      • Observed Error

      • Explanation

      • Work around

    • Error Receiving Activation

      • cerebras.appliance.errors.ApplianceUnknownError: Ran into error while receiving activation tensor <custom-call …>
    • Failed mount directory during execution

      • Observed Error

      • Work around

    • Failing to automatically load checkpoints

      • Explanation

      • Work around

    • Failure to trace due to functionalization error

      • Observed Error

      • Explanation

      • Work around

    • Input Starvation

    • Out of memory errors and system resources

      • Determining if your job is queued

      • Determining if job failed because of an OOM error

      • Determining if job failed because the system could not fit the requested memory

      • Troubleshooting OOM errors

    • Model is too large to fit in memory

      • Observed Error

      • Causes and Possible Solutions

    • ModuleNotFoundError

      • ModuleNotFoundError: No module named <'_bz2', '_sqlite3'>

      • ModuleNotFoundError: No module named <…>

    • Numerical issues

      • Observed Error

      • Explanation

      • Work around

    • Throughput spike after saving checkpoints

    • Training fails when logged in as root

      • Observed Error

      • Explanation

    • Vocabulary Size Troubleshooting

      • Large vocabulary size

      • Small vocabulary size
