Determining if your job is queued
If multiple users are submitting jobs to the cluster, the cluster queues any job that it does not yet have the capacity to execute. From a user perspective, your Python client code will log messages similar to the following:
Determining if job failed because of an OOM error
If your job fails with an out of memory (OOM) error, the client will receive a list of events in its log stream containing messages similar to the following:
Fig. 15 OOM software error in the wsjob dashboard
Determining if job failed because the system could not fit the requested memory
Jobs that request resources, such as memory or a number of CS systems, beyond the capacity of the cluster fail immediately with a scheduling error similar to the following:
Troubleshooting OOM errors
If your job fails because of an OOM error and the component that ran out of memory is the coordinator, you can try increasing the amount of memory using the runconfig section of the YAML configuration file, as in the example below: