Common Issues and Workarounds
Learn how to fix common errors.
Cannot Load Checkpoints on GPUs
When trying to load a model trained on a Cerebras cluster onto a GPU, there is an incompatibility between checkpoint formats. Models trained on a Cerebras cluster save checkpoints in HDF5 format, but when loading the model on a GPU, PyTorch expects the checkpoint to be in its pickle-based format.
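For illustration, loading such a checkpoint directly with torch.load fails (the filename below is hypothetical):

```python
import torch

# Raises an error: the file is HDF5, while torch.load expects
# PyTorch's pickle/zip serialization format.
state_dict = torch.load("model_dir/checkpoint_10000.mdl", map_location="cpu")
```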
Workaround
Learn how to convert between checkpoint file formats in our Convert Cerebras Checkpoints for GPUs guide.
Custom PyTorch Script Causes Infinite Loop or Multiple Compilation Jobs
When using a custom PyTorch training/eval script, the script gets stuck in an infinite loop, or multiple compilation jobs are launched.
Workaround
This issue occurs because the script lacks an if __name__ == "__main__" guard. During execution, subprocesses may be created (e.g., for weight transfer or surrogate jobs), which can cause the entire module to run unintentionally.

To prevent this, wrap your script's main logic inside an if __name__ == "__main__" block.
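A minimal sketch of the guard (function and argument names are illustrative):

```python
import argparse


def main():
    # Parse arguments and launch training/eval here. Keeping this inside
    # main() ensures subprocesses that re-import the module do not re-run it.
    parser = argparse.ArgumentParser()
    parser.add_argument("--params", help="Path to the params YAML")
    args = parser.parse_args()
    ...


if __name__ == "__main__":
    # Only executes when the script is run directly,
    # not when it is imported by a subprocess.
    main()
```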
Error Parsing Metadata
When compiling or running models, you may see this error message intermittently:
Error parsing metadata: error=invalid value key=content-type value=text/html
This error is caused by a bug in gRPC.
Workaround
The error itself does not affect the outcome of a run, but you can disable the error message by setting this environment variable:
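Assuming the standard gRPC logging controls, for example:

```bash
# Suppresses all gRPC log output.
export GRPC_VERBOSITY=NONE
```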
Note that this hides ALL log messages coming from gRPC, and this workaround has not been thoroughly validated.
Error Receiving Activation
When trying to run your own model, you may encounter this error:
This error has many possible causes, but one common issue relates to how the dataloader is structured in your run script.
When running custom models, the dataloader must be in a separate file within the same directory as the main execution or model script. If the dataloader is defined within the run script, the input workers may fail to pickle the input function from the __main__ module, leading to this error.
Workaround
Place the dataloader in a separate script rather than being defined within the main training script. Below is an example of an appropriate directory structure:
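A sketch with illustrative filenames:

```
my_model/
├── run.py           # main training/eval script; imports the dataloader
├── model.py         # model definition
└── dataloader.py    # dataloader defined in its own module, not in run.py
```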
Failed Mount Directory During Execution
When running a training job, the job may fail with the following error:
Workaround
This error is under investigation. In some cases, rerunning the command solves the issue. If you are still encountering this error, contact Cerebras for assistance.
Automatic Checkpoint Loading Failure
When adding custom checkpoints in model_dir, they aren't automatically loaded during runs. This is because the checkpoint naming convention doesn't match the expected format.

The auto-load feature searches for files named checkpoint_<step>.mdl in your model_dir, loading the one with the highest <step> value. This feature is enabled by default but can be disabled by setting runconfig.autoload_last_checkpoint to False in your params YAML.
Workaround
Either:
- Rename your checkpoint to follow the checkpoint_<step>.mdl format, or
- Explicitly specify the checkpoint path using the --checkpoint_path flag.
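For example, assuming the Model Zoo run.py launch pattern (paths and step value are illustrative):

```bash
# Option 1: rename the checkpoint so auto-loading can find it.
mv my_finetuned.mdl model_dir/checkpoint_20000.mdl

# Option 2: point to the checkpoint explicitly at launch time.
python run.py CSX --params params.yaml --mode train \
    --checkpoint_path /path/to/my_finetuned.mdl
```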
Functionalization Error
When tracing a model for Cerebras hardware, you might encounter the following error:
This happens because:
- In-place operations aren't allowed in the compute graph
- Cerebras uses "functionalization" to convert in-place operations to non-in-place alternatives
- For this to work, all tensors must be on the same device - specifically the device associated with the Cerebras backend
Workaround
To fix this error, ensure all tensors are on the same backend device, either by creating new tensors directly on the backend device or by moving existing tensors to it.
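A minimal sketch, assuming device is the torch.device associated with the Cerebras backend (how you obtain it depends on your run setup):

```python
import torch

def add_masked_bias(x: torch.Tensor, cpu_bias: torch.Tensor, device: torch.device):
    # Create new tensors directly on the backend device...
    mask = torch.ones(x.shape, device=device)
    # ...or move existing (e.g. CPU-resident) tensors onto it before use,
    # so functionalization sees every tensor on the same device.
    bias = cpu_bias.to(device)
    return x * mask + bias
```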
Input Starvation
If your dataloader isn’t keeping up with your model during a run, you’ll encounter the following error:
If the issue persists, you’ll encounter an additional error:
Workaround
To fix this issue, you’ll need to speed up your data pipeline. See Creating Custom Dataloaders to learn about improving the performance of your dataloader and view examples.
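As a generic starting point, standard PyTorch DataLoader options such as multiple workers and prefetching often help (the values below are illustrative and depend on your host resources):

```python
from torch.utils.data import DataLoader

def get_train_dataloader(dataset, batch_size):
    # Illustrative settings; tune for your host's CPU and memory budget.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=8,            # load samples in parallel worker processes
        prefetch_factor=4,        # batches prefetched per worker
        persistent_workers=True,  # keep workers alive between epochs
        drop_last=True,
    )
```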
Module Not Found
There are two ModuleNotFound errors you may encounter:
Core Python Module Errors
When trying to use certain built-in Python modules like bz2, users may receive errors about missing core modules (bz2, sqlite3). For example:
This happens because the Python installation was compiled from source without all necessary system dependencies.
User-Installed Package Errors
You may encounter a ModuleNotFoundError for Python packages that are installed on your local machine but unavailable in the Cerebras environment.
Our Custom Worker Container Workflow attempts to import your dependencies into Cerebras appliances, with a fallback that mounts packages from your virtual environment.
Workaround
For core Python module errors, install the missing system packages (bzip2-devel, sqlite-devel) and rebuild Python, or use a pre-built Python binary instead.
For user-installed package errors:
- Disable the Custom Worker Container Workflow (see instructions here).
- Install packages in your virtual environment with pip.
- Copy the custom package directory from venv/lib/python3.8/site-packages/<package_name> to an NFS-mountable location. Only copy the custom packages, not the entire virtual environment.
- Add this location to --mount_dirs and its parent to --python_paths when running jobs (see the example below).
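For example, assuming a hypothetical package my_package, an NFS path /nfs/shared/pkgs, and the Model Zoo run.py launch pattern:

```bash
# Copy only the custom package, not the entire virtual environment.
cp -r venv/lib/python3.8/site-packages/my_package /nfs/shared/pkgs/my_package

# The copied location goes to --mount_dirs; its parent goes to --python_paths.
python run.py CSX --params params.yaml --mode train \
    --mount_dirs /nfs/shared/pkgs/my_package \
    --python_paths /nfs/shared/pkgs
```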
Loss Compilation Issues with AutoGen
When creating custom losses, you might encounter compilation failures.
Workaround
Wrap your custom loss class with the @autogen_loss decorator, which enables AutoGen to handle the compilation of these custom losses efficiently.
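A minimal sketch (the import path for autogen_loss is an assumption and varies by Model Zoo version):

```python
import torch.nn as nn

# Hypothetical import path; check your Model Zoo version for the actual location.
from cerebras.modelzoo.losses.autogen_loss import autogen_loss


@autogen_loss
class MyCustomLoss(nn.Module):
    """Custom loss compiled via AutoGen."""

    def forward(self, logits, targets):
        # Illustrative loss body; replace with your own computation.
        return nn.functional.cross_entropy(logits, targets)
```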
Model is Too Large to Fit in Memory
If you encounter the following error, this means the memory requirements are too large to fit on the device:
Workaround
- On transformer models, compile again with the batch size set to 1 using one CS-2 system to determine if the specified maximum sequence length is feasible.
- You can try a smaller batch size per device, or enable batch tiling (only on transformer models) by setting the micro_batch_size parameter in the train_input or eval_input section of your model's YAML file (see working_with_microbatches and the example after this list).
- If you already ran with batch tiling and a specific micro_batch_size value, you can try compiling with a decreased micro_batch_size. The Using “explore” to Search for a Near-Optimal Microbatch Size flow can recommend performant micro batch sizes that will fit in memory.
- On CNN models, where batch tiling isn't supported, try manually decreasing the batch size and/or the image/volume size.
- For more information on working with batch tiling and selecting performant micro_batch_size values, see our tutorial on automatic microbatching.
- The batch_size parameter set in the YAML configuration is the global batch size. This means that the batch size per CS-2 system is computed as the global batch size divided by the number of CS-2s used.
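For instance, a sketch of the relevant section of the params YAML (the values are illustrative):

```yaml
train_input:
  batch_size: 256        # global batch size, divided across the CS-2 systems in use
  micro_batch_size: 16   # enables batch tiling; lower it if compilation fails to fit
```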
Numerical Issues
During low-precision training (POL=1), particularly with large output vocabularies (30,000-60,000 words), the final projection layer (which converts internal representations to words) frequently exhibits accuracy issues.
During the backward pass, the final projection layer accumulates a large number of values (equal to the vocabulary size) for each output, using low-precision 16-bit arithmetic. This extensive accumulation can introduce inaccuracies, hindering convergence. Additionally, the inputs to this layer typically originate from a softmax cross-entropy layer, whose non-normal distribution deviates significantly from the typical normal distributions observed in most layers, further contributing to inaccuracy on the backward pass.
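As a self-contained illustration of why long low-precision accumulations lose accuracy (the values are arbitrary and unrelated to any specific model):

```python
import numpy as np

# Accumulate 50,000 small values, mimicking a reduction over a large vocabulary.
acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(50_000):
    acc16 = np.float16(acc16 + np.float16(1e-4))  # 16-bit accumulation
    acc32 = acc32 + np.float32(1e-4)              # 32-bit accumulation

print(acc16)  # stalls far below the true sum once updates round to zero (~0.25)
print(acc32)  # close to the true sum of 5.0
```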
Workaround
To mitigate potential convergence issues arising from numerical instability in the final projection layer during low-precision training (POL=1), a per-layer setting of POL=0 should be applied to this specific layer.
This ensures the highest numerical precision for the final projection while maintaining the performance advantages of POL=1 throughout the rest of the model. This modification has already been incorporated into the Model Zoo variants of Cerebras large language models.
Throughput Spike After Saving Checkpoints
If you notice a throughput spike after saving checkpoints, this is a known artifact of reporting throughput on the user node, caused by the asynchronous nature of execution on Wafer-Scale Cluster. For more details and understanding of this behavior, please refer to Measure throughput of your model.
Out of Memory Errors and System Resources
View our guide on troubleshooting memory errors and system resource issues for more information.
Vocabulary Size
If you encounter the following error, your vocabulary size may be too large:
Vocabularies of up to one million tokens are supported, but compilation may take up to 90 minutes. Large vocabulary sizes have not been fully tested on models with 2.7 billion parameters or more.
When using extremely small vocabularies (fewer than 4 tokens), compilation errors may occur.