Out of Memory Errors and System Resources
Learn how to identify when a job requires more resources than the cluster has available.
The cluster enforces limits on memory and CPU requests so that compile and training jobs can run in parallel. These limits can be adjusted to match your requirements.
Identify Queued Jobs
When the cluster’s resources are fully utilized, newly submitted jobs are queued until capacity becomes available. Your Python client will log messages like this:
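The exact wording depends on your client release; a representative (not verbatim) example of a queueing message might look like:

```
INFO:  Poll ingress status: Waiting for job running, current job status: Queueing,
       msg: job is queued and waiting for cluster resources to become available.
```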
You can obtain a full list of running and queued jobs on the cluster with the csctl tool:
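For example, assuming the `csctl get jobs` subcommand available in recent releases (check `csctl --help` for your installed version):

```bash
# List all jobs on the cluster; the state column distinguishes
# queued jobs from those that are currently running.
csctl get jobs
```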
Detect OOM Failures
When a job fails due to an out of memory (OOM) error, your client logs will contain messages like:
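The exact message varies by release and by which component ran out of memory; an illustrative (not verbatim) example:

```
ERROR:  Job failed: the coordinator pod was terminated (OOMKilled) after
        exceeding its memory limit. Increase the memory allocation for this
        component and resubmit the job.
```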
You can also view OOM events in the wsjob dashboard.
Fig. 15 OOM software error in the wsjob dashboard
Identify Resource Capacity Failures
Jobs requesting resources beyond the cluster’s capacity will fail immediately with scheduling errors like:
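The wording below is illustrative only, not a verbatim scheduler message:

```
ERROR:  Failed to schedule job: the requested memory exceeds the capacity of
        the largest node in the cluster. Reduce the request or contact your
        cluster administrator.
```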
Troubleshoot OOM Errors
If your job fails with an OOM error, particularly in the coordinator component, you can increase the memory allocation in the runconfig section of your YAML configuration file:
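A minimal sketch of such an override is shown below; the key names (`compile_crd_memory_gi`, `execute_crd_memory_gi`) are assumptions based on the `*_memory_gi` naming convention and may differ in your release, so check the runconfig reference for the exact names:

```yaml
runconfig:
  # Coordinator memory, in GiB, for compile and execute jobs.
  # Key names are release-dependent; verify against your runconfig reference.
  compile_crd_memory_gi: 100
  execute_crd_memory_gi: 120
```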
For diagnostic purposes, you can temporarily remove memory limits by setting the memory value to -1, then observe the job's peak memory usage in Grafana:
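For example, again assuming the `execute_crd_memory_gi` key from the sketch above:

```yaml
runconfig:
  # -1 removes the limit for diagnosis only. Watch the job's peak memory in
  # Grafana, then set an explicit limit slightly above the observed peak.
  execute_crd_memory_gi: -1
```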
Use unlimited memory settings with caution, as this can impact other users’ jobs running on the same node. A job without limits can potentially consume all available system memory.