The cluster enforces limits on memory and CPU requests so that compile and training jobs can run in parallel. You can adjust these limits to match your requirements.

Identify Queued Jobs

When the cluster’s resources are fully utilized, newly submitted jobs are queued until capacity becomes available. Your Python client logs messages like the following:

INFO:   Poll ingress status: Waiting for job running, current job status: Queueing, msg: job queueing, waiting for lock grant. Cluster status: 3 execute job(s) queued before current job, systems in use: 1
INFO:   Poll ingress status: Waiting for job running, current job status: Queueing, msg: job queueing, waiting for lock grant. Cluster status: 2 execute job(s) queued before current job, systems in use: 1

You can obtain a full list of running and queued jobs on the cluster with the csctl tool:

csctl get jobs
NAME                AGE  DURATION  PHASE      SYSTEMS                   USER  LABELS             DASHBOARD
wsjob-000000000001  18h  20s       RUNNING    systemCS2_1, systemCS2_2  user2 model=gpt3-tiny    https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000001
wsjob-000000000002   1h  6m25s     QUEUED                               user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
wsjob-000000000003  10m  2m01s     QUEUED                               user1 model=gpt3-tiny    https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000003
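If you want to check queue depth programmatically rather than reading the table by hand, a minimal sketch such as the one below can wrap csctl. It assumes csctl is on your PATH and that the plain-text output keeps PHASE as the fourth column, as in the example above; adjust the parsing if your csctl version formats the table differently.

# queue_check.py -- count running and queued wsjobs by parsing `csctl get jobs`.
# Assumes `csctl` is on PATH and prints a header row followed by one job per line,
# with the PHASE column in the fourth position, as in the example output above.
import subprocess

def job_phases():
    out = subprocess.run(
        ["csctl", "get", "jobs"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    phases = []
    for line in out[1:]:                 # skip the header row
        fields = line.split()
        if len(fields) >= 4:
            phases.append(fields[3])     # PHASE column (RUNNING, QUEUED, ...)
    return phases

if __name__ == "__main__":
    phases = job_phases()
    print(f"running: {phases.count('RUNNING')}, queued: {phases.count('QUEUED')}")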

Detect OOM Failures

When a job fails due to an out-of-memory (OOM) error, your client logs contain messages like the following:

reason: "OOMKilled"
message: "Pod: job-operator.wsjob-kqsejfzmjxefkf9vyruztv-coordinator-0 exited with code 137 The pod was killed due to an out of memory (OOM) condition where the current memory limit is 32Gi."

You can also view OOM events in the wsjob dashboard.

Fig. 15 OOM software error in the wsjob dashboard

Identify Resource Capacity Failures

Jobs that request resources beyond the cluster’s capacity fail immediately with a scheduling error like the following:

reason=SchedulingFailed object=wsjob-cd2ghxfqh7ksoev79rxpvs message='cluster lacks requested capacity: requested 1 node[role:management]{cpu:32, mem:200Gi} but 1 exists with insufficient capacity {cpu:64, mem:128Gi}

Troubleshoot OOM Errors

If your job fails with an OOM error, particularly in the coordinator component, increase the memory allocation in the runconfig section of your YAML configuration file:

runconfig:
  compile_crd_memory_gi: 100
  execute_crd_memory_gi: 120
  wrk_memory_gi: 120
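If you prefer to script the change instead of editing the file by hand, a sketch like the one below can bump these values with PyYAML. The file name params.yaml and the 100/120 GiB targets are illustrative; substitute your own configuration file and whatever limits your workload needs.

# bump_memory.py -- raise coordinator/worker memory limits in a params file.
# "params.yaml" and the GiB values below are illustrative; adjust to your own setup.
import yaml  # PyYAML

NEW_LIMITS_GI = {
    "compile_crd_memory_gi": 100,   # compile coordinator
    "execute_crd_memory_gi": 120,   # execute coordinator
    "wrk_memory_gi": 120,           # worker
}

def bump_memory(path="params.yaml"):
    with open(path) as f:
        params = yaml.safe_load(f) or {}
    runconfig = params.setdefault("runconfig", {})
    runconfig.update(NEW_LIMITS_GI)
    with open(path, "w") as f:
        yaml.safe_dump(params, f, default_flow_style=False, sort_keys=False)

if __name__ == "__main__":
    bump_memory()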

For diagnostic purposes, you can temporarily remove a memory limit by setting its value to -1 and then observe the job’s peak memory usage in Grafana:

runconfig:
  compile_crd_memory_gi: -1

Use unlimited memory settings with caution: a job without limits can consume all available memory on the node and impact other users’ jobs running there.