# Common Issues and Workarounds

> Learn how to fix common errors.

## Cannot Load Checkpoints in GPUs

Loading a model trained on a Cerebras cluster onto a GPU fails because the checkpoint formats are incompatible.

Cerebras clusters save checkpoints in HDF5 format, whereas loading on a GPU expects the checkpoint to be in pickle format.

### Workaround

Learn how to convert between checkpoint file formats in our [Convert Cerebras Checkpoints for GPUs](../model-zoo/migration/convert-checkpoints-and-model-configs/work-with-cerebras-checkpoints) guide.

## Custom PyTorch Script Causes Infinite Loop or Multiple Compilation Jobs

When using a custom PyTorch training/eval script, the script gets stuck in an infinite loop, or multiple compilation jobs are launched.

### Workaround

This issue occurs because the script lacks an `if __name__ == "__main__":` guard. During execution, subprocesses may be created (e.g., for weight transfer or surrogate jobs), which can cause the entire module to run again unintentionally.

To prevent this, wrap your script's main logic inside an `if __name__ == "__main__":` block.
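A minimal sketch of the guard follows. The function name `main` and its return value are placeholders for illustration, not part of any Cerebras API:

```python
def main():
    # Build the model, dataloaders, and trainer here, then start the run.
    # Code left at module level (outside this function) also executes
    # whenever a subprocess re-imports this file.
    return "run complete"


if __name__ == "__main__":
    # True only when the file is executed directly (python run_model.py),
    # not when it is imported by a spawned subprocess.
    print(main())
```

With this layout, importing the file only defines `main`; nothing is launched until the file is run as the entry script.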

## Error Parsing Metadata

When compiling or running models, you may see this error message intermittently:

`Error parsing metadata: error=invalid value key=content-type value=text/html`

This error is caused by a bug in [gRPC](https://grpc.io/).

### Workaround

The error itself does not affect the outcome of a run, but you can suppress the error message by setting this environment variable:

```bash theme={null}
$ export GRPC_VERBOSITY=NONE
```

This setting suppresses all log output from gRPC, not only this message. It has not been thoroughly validated.
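If you prefer to set the variable from inside a Python entry script rather than the shell, a sketch along these lines may work. It assumes the assignment runs before gRPC is first imported in the process, since the library generally reads this variable when it initializes:

```python
import os

# Must run before the first `import grpc` (or any package that imports it).
# Child processes started afterwards inherit the same environment.
os.environ["GRPC_VERBOSITY"] = "NONE"
```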

## Error Receiving Activation

When trying to run your own model, you may encounter this error:

```bash theme={null}
cerebras.appliance.errors.ApplianceUnknownError: Ran into error while receiving activation tensor <custom-call ...> for runtime iteration ...
```

This error has many possible causes, but one common issue relates to how the dataloader is structured in your run script.

When running custom models, the dataloader must be in a separate file within the same directory as the main execution or model script. If the dataloader is defined within the run script, the input workers may fail to pickle the input function from the `__main__` module, leading to this error.

### Workaround

Place the dataloader in a separate script rather than defining it within the main training script. Below is an example of an appropriate directory structure:

```bash theme={null}
$ ls user_directory
--> run_model.py (entry script that is used to start the run using "python run_model.py ...")
--> dataloader.py (script containing the dataloaders and input functions)
```
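The reason a separate file helps can be sketched with Python's standard `pickle` module, which stores a module-level function by reference (module name plus function name) rather than by value. The example below uses `json.dumps` as a stand-in for an input function imported from `dataloader.py`; the helper `pickled_reference` is hypothetical, and the exact mechanism used by the input workers is an assumption here:

```python
import json
import pickle


def pickled_reference(fn):
    """Return the (module, name) pair that pickle records for a function."""
    return fn.__module__, fn.__qualname__


# Only the module name and function name go into the payload.
payload = pickle.dumps(json.dumps)
assert b"json" in payload and b"dumps" in payload

# Loading re-imports the module and looks the name up again, so the
# receiving process must be able to import that module.
restored = pickle.loads(payload)
assert restored is json.dumps
```

A function defined in the entry script is recorded under the module name `__main__`. In a worker process, `__main__` refers to a different program, so the lookup cannot succeed. A function defined in `dataloader.py` is recorded under `dataloader`, which the worker can import.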

## Failed Mount Directory During Execution

A training job may fail with the following error:

```bash theme={null}
ERROR:   Uncaught exception:
Traceback (most recent call last):
 [...]
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
       status = StatusCode.INTERNAL
       details = "Please contact the Cerebras Support Team. Poll ingress failed during wsjob initialization: job-operator/wsjob-00000 has failed because total 1 replica(s) failed: [name:"wsjob-000000-worker-0" lastTimestamp:"2023-01-01 00:00:00 +0000 UTC" reason:"FailedMount" message:"Unable to attach or mount volumes: unmounted volumes=[home-volume-111111111 training-data-volume-22222 workdir-volume], unattached volumes=[kube-api-access home-volume-11111111111 training-data-volume-22222 workdir-volume worker-cache-partition-ro-volume worker-cache-dir-volume worker-dev-shm-volume cfg]: timed out waiting for the condition"]"
```

### Workaround

This error is under investigation. In some cases, rerunning the command solves the issue. If you are still encountering this error, contact Cerebras for assistance.

## Automatic Checkpoint Loading Failure

Custom checkpoints that you add to `model_dir` aren't automatically loaded during runs when their names don't match the expected format.

The auto-load feature searches for files named `checkpoint_<step>.mdl` in your `model_dir`, loading the one with the highest `<step>` value. This feature is enabled by default but can be disabled by setting `runconfig.autoload_last_checkpoint` to `False` in your params YAML.

### Workaround

Either:

* Rename your checkpoint to follow the `checkpoint_<step>.mdl` format
* Explicitly specify the checkpoint path using the `--checkpoint_path` flag
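To check which file the naming rule would select, you can approximate the search yourself. This is a sketch of the rule as described above, not the actual auto-load implementation; `find_latest_checkpoint` is a hypothetical helper:

```python
import re
from pathlib import Path

# Matches names such as checkpoint_2000.mdl and captures the step number.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint_(\d+)\.mdl$")


def find_latest_checkpoint(model_dir):
    """Return the checkpoint_<step>.mdl path with the highest step, or None."""
    candidates = []
    for path in Path(model_dir).iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare steps numerically so that 2000 ranks above 100.
    return max(candidates, key=lambda item: item[0])[1]
```

A file named `my_model.mdl` would be ignored by this search, which mirrors why a custom checkpoint with a non-matching name is not picked up.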

## Functionalization Error

When tracing a model for Cerebras hardware, you might encounter the following error:

```bash theme={null}
RuntimeError: false INTERNAL ASSERT FAILED at "aten/src/ATen/RegisterFunctionalization_1.cpp":11608, please report a bug to PyTorch. mutating a non-functional tensor with a functional tensor is not allowed. Please ensure that all of your inputs are wrapped inside of a functionalize() call.
```

This happens because:

* In-place operations aren't allowed in the compute graph
* Cerebras uses "functionalization" to convert in-place operations to non-in-place alternatives
* For this to work, all tensors must be on the same device, specifically the device associated with the Cerebras `backend`

### Workaround

To fix this error, ensure all tensors are on the same `backend` device by creating new tensors directly on the backend device:

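```python theme={null}
import torch
import cerebras.pytorch as cstorch

backend = cstorch.backend("CSX")
...
@cstorch.trace
def training_step(inputs, targets):
    ...
    # Create the tensor directly on the backend device
    new_tensor_1 = torch.tensor([1, 2, 3], device=backend.torch_device)
    # Or create it on the CPU, then move it to the backend device
    new_tensor_2 = torch.tensor([1, 2, 3]).to(backend.torch_device)
    ...
```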

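Alternatively, reuse the device of an input tensor that is already on the backend:

```python theme={null}
@cstorch.trace
def training_step(inputs, targets):
    ...
    # Create the tensor on the same device as the inputs
    new_tensor_1 = torch.tensor([1, 2, 3], device=inputs.device)
    # Or create it on the CPU, then move it to the inputs' device
    new_tensor_2 = torch.tensor([1, 2, 3]).to(inputs.device)
    ...
```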

## Input Starvation

If your dataloader isn't keeping up with your model during a run, you'll encounter the following warning:

```bash theme={null}
WARNING: Input starvation detected
Please check dataloader throughput
```

If the issue persists, you'll encounter an additional error:

```bash theme={null}
ERROR:   Declaring stall due to input starvation, no change in status for 630 secs
```

### Workaround

To fix this issue, you'll need to speed up your data pipeline. See [Creating Custom Dataloaders](../model-zoo/components/dataloaders/creating-custom-dataloaders) to learn about improving the performance of your dataloader and view examples.
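Before tuning, it can help to time the dataloader in isolation and compare the result with the rate at which the model consumes samples. The helper below is a hypothetical sketch in plain Python; `batches` can be any iterable that yields batches, such as a PyTorch `DataLoader`:

```python
import time


def measure_throughput(batches, batch_size, max_batches=100):
    """Iterate over `batches` and return the observed samples per second."""
    start = time.perf_counter()
    count = 0
    for _ in batches:
        count += 1
        if count >= max_batches:
            break
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast iterables.
    return (count * batch_size) / elapsed if elapsed > 0 else float("inf")
```

If the measured rate is below what the run requires, the data pipeline is the likely bottleneck.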

## Module Not Found

There are two kinds of `ModuleNotFoundError` you may encounter:

**Core Python Module Errors**
Importing certain built-in Python modules, such as `bz2` or `sqlite3`, may fail with an error about a missing core module. For example:

```bash theme={null}
Traceback (most recent call last):
import bz2
File "/usr/local/lib/python3.8/bz2.py", line 19, in <module>
    from _bz2 import BZ2Compressor, BZ2Decompressor
ModuleNotFoundError: No module named '_bz2'
```

This happens because the Python installation was compiled from source without all necessary system dependencies.
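To see which core extension modules your interpreter lacks before a job fails, you could run a quick check like the one below. The helper name `missing_modules` is hypothetical; `_bz2` and `_sqlite3` are the C extensions behind the `bz2` and `sqlite3` modules:

```python
import importlib.util


def missing_modules(names):
    """Return the module names that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both extensions were built with this Python.
print(missing_modules(["_bz2", "_sqlite3"]))
```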

**User-Installed Package Errors**
You may encounter a `ModuleNotFoundError` for Python packages that are installed on your local machine but unavailable in the Cerebras environment.

Our [Custom Worker Container Workflow](../fundamentals/import-user-dependencies-in-cerebras) attempts to import your dependencies into Cerebras appliances, with a fallback that mounts packages from your virtual environment.

### Workaround

**For core Python module errors**, install the missing system packages (`bzip2-devel`, `sqlite-devel`) and rebuild Python, or use a pre-built Python binary instead.

**For user-installed package errors**:

1. Disable the Custom Worker Container Workflow (see [instructions here](../fundamentals/import-user-dependencies-in-cerebras#procedure)).
2. Install packages in your virtual environment with pip.
3. Copy the custom package directory from `venv/lib/python3.8/site-packages/<package_name>` to an NFS-mountable location. Only copy the custom packages, not the entire virtual environment.
4. Add this location to `--mount_dirs` and its parent to `--python_paths` when running jobs.

## Loss Compilation Issues with AutoGen

When creating custom losses, you might encounter compilation failures.

### Workaround

Wrap your custom loss class with the `@autogen_loss` decorator, which enables AutoGen to handle the compilation of these custom losses efficiently.

```python theme={null}
import torch.nn as nn
from cerebras.pytorch.nn.modules import autogen_loss

# The decorator must sit directly above the class definition
@autogen_loss
class CustomLoss(nn.Module):
    def __init__(self, ...):
        ...
```

## Model is Too Large to Fit in Memory

The following error means that the model's memory requirements are too large to fit on the device:

```
Model is too large to fit in memory. This can happen because of a large batch size, large input tensor dimensions, or other network parameters. Please refer to the Troubleshooting section in the documentation for potential workarounds
```

### Workaround

* On transformer models, compile again with the batch size set to 1 using one CS-2 system to determine if the specified maximum sequence length is feasible.

* You can try a smaller batch size per device or enable batch tiling (only on transformer models) by setting the `micro_batch_size` parameter in the `train_input` or `eval_input` section of your model’s YAML file (see [automatic microbatching](../model-zoo/tutorials/optimize-performance-with-automatic-microbatching)).

* If you ran with batch tiling with a specific `micro_batch_size` value, you can try compiling with a decreased `micro_batch_size`. The [Using “explore” to Search for a Near-Optimal Microbatch Size](../model-zoo/tutorials/optimize-performance-with-automatic-microbatching#using-explore-to-search-for-a-near-optimal-microbatch-size) flow can recommend performant micro batch sizes that will fit in memory.

* On CNN models where batch tiling isn’t supported, try manually decreasing the batch size and/or the image/volume size.

<Note>
  - For more information on working with batch tiling and selecting performant `micro_batch_size` values, see our tutorial on [automatic microbatching](../model-zoo/tutorials/optimize-performance-with-automatic-microbatching).

  - The `batch_size` parameter set in the YAML configuration is the **global batch size**. This means that the batch size per CS-2 system is computed as the global batch size divided by the number of CS-2s used.
</Note>
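The relationship between the global and per-system batch size can be sketched as simple arithmetic. The function name is hypothetical, and the requirement that the global batch size divide evenly is an assumption made for this illustration:

```python
def per_system_batch_size(global_batch_size, num_systems):
    """Return the batch size each system processes for a given global batch size."""
    if num_systems < 1:
        raise ValueError("num_systems must be at least 1")
    # Assumption for this sketch: the global batch splits evenly across systems.
    if global_batch_size % num_systems != 0:
        raise ValueError("global_batch_size must be divisible by num_systems")
    return global_batch_size // num_systems
```

For example, a global `batch_size` of 1024 on four CS-2 systems gives 256 samples per system, so adding systems lowers the per-system memory demand only if the global batch size stays fixed.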

## Numerical Issues

During low-precision training (POL=1), the final projection layer, which converts internal representations to words, frequently exhibits accuracy issues, particularly with large output vocabularies (30,000-60,000 words).

During the backward pass, the final projection layer accumulates a large number of values (equal to the vocabulary size) for each output, using low-precision 16-bit arithmetic. This extensive accumulation can introduce inaccuracies, hindering convergence. Additionally, the inputs to this layer typically originate from a softmax cross-entropy layer, whose output distribution deviates significantly from the roughly normal distributions observed in most layers, further contributing to inaccuracy on the backward pass.
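The effect of long 16-bit accumulations can be illustrated in plain Python by rounding a running sum to IEEE 754 half precision after every addition. This is a software emulation for intuition only; it does not claim to reproduce how the hardware accumulates, and the value 0.01 and the count of 50,000 terms are arbitrary choices:

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_sum(values):
    """Sum values, rounding the running total to 16 bits after every add."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


values = [0.01] * 50_000   # one term per vocabulary entry
exact = sum(values)        # float64 reference, close to 500
approx = fp16_sum(values)  # stalls once the total dwarfs each term
print(exact, approx)
```

Once the running total is large enough that a single term is smaller than half the spacing between adjacent 16-bit values, further additions round away and the sum stops growing.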

### Workaround

To mitigate convergence issues arising from numerical instability in the final projection layer during low-precision training (POL=1), apply a per-layer setting of `POL=0` to that layer.

This ensures the highest numerical precision for the final projection while maintaining the performance advantages of POL=1 throughout the rest of the model. This modification has already been incorporated into the Model Zoo variants of Cerebras large language models.

## Throughput Spike After Saving Checkpoints

If you notice a throughput spike after saving checkpoints, this is a known artifact of reporting throughput on the user node, caused by the asynchronous nature of execution on the Wafer-Scale Cluster. For details, see [Measure throughput of your model](../fundamentals/measure-throughput-of-your-model).

## Out of Memory Errors and System Resources

View our guide on [troubleshooting memory errors and system resource issues](../support/troubleshooting/out-of-memory-errors-and-system-resources) for more information.

## Vocabulary Size

If you encounter the following error, your vocabulary size may be too large:

```bash theme={null}
RuntimeError: [enforce fail at alloc_cpu.cpp:66] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 9120000000000 bytes
```

Vocabularies of up to one million tokens are supported, but compilation may take up to 90 minutes. Large vocabulary sizes have not been fully tested on models with 2.7 billion parameters or more.

When using extremely small vocabularies (fewer than 4 tokens), compilation errors may occur.
