Learn how to identify when the resources a job needs exceed the resources available on the cluster.
The cluster enforces limits on memory and CPU requests so that compile and training jobs can run in parallel. These limits can be adjusted to match your requirements.
When the cluster’s resources are fully utilized, any newly submitted jobs will be queued until capacity becomes available. Your Python client code will display messages like this:
INFO: Poll ingress status: Waiting for job running, current job status: Queueing, msg: job queueing, waiting for lock grant. Cluster status: 3 execute job(s) queued before current job, systems in use: 1
INFO: Poll ingress status: Waiting for job running, current job status: Queueing, msg: job queueing, waiting for lock grant. Cluster status: 2 execute job(s) queued before current job, systems in use: 1
You can obtain a full list of running and queued jobs on the cluster with the csctl tool:
csctl get jobs
NAME                AGE  DURATION  PHASE    SYSTEMS                   USER   LABELS              DASHBOARD
wsjob-000000000001  18h  20s       RUNNING  systemCS2_1, systemCS2_2  user2  model=gpt3-tiny     https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000001
wsjob-000000000002  1h   6m25s     QUEUED                             user2  model=neox,team=ml  https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
wsjob-000000000003  10m  2m01s     QUEUED                             user1  model=gpt3-tiny     https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000003
When a job fails due to an out of memory (OOM) error, your client logs will contain messages like:
reason: "OOMKilled"message: "Pod: job-operator.wsjob-kqsejfzmjxefkf9vyruztv-coordinator-0 exited with code 137 The pod was killed due to an out of memory (OOM) condition where the current memory limit is 32Gi."
If your job fails with an OOM error, particularly in the coordinator component, you can increase the memory allocation in the runconfig section of your YAML configuration file.
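For example, you could raise the coordinator's memory limit above the 32 GiB reported in the error message. This is a minimal sketch using the compile_crd_memory_gi parameter that appears in the diagnostic example below; the 64 GiB value is illustrative, so choose a limit that fits your workload and the capacity of the node:

runconfig:
  compile_crd_memory_gi: 64  # illustrative value; set above the limit reported in the OOM message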
For diagnostic purposes, you can temporarily remove memory limits by setting the value to -1 and observe the maximum memory usage in Grafana:
runconfig:
  compile_crd_memory_gi: -1
Use unlimited memory settings with caution, as this can impact other users’ jobs running on the same node. A job without limits can potentially consume all available system memory.
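Once you have observed the peak memory usage in Grafana, restore a bounded limit slightly above that peak rather than leaving the setting at -1. A minimal sketch, assuming an observed peak of roughly 40 GiB (illustrative value):

runconfig:
  compile_crd_memory_gi: 45  # illustrative: observed peak of ~40 GiB plus headroom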