Learn how to use the csctl CLI tool to manage and monitor jobs.
You can use the csctl tool directly from the terminal of your user node.
Each job is identified by a jobID. The jobID for each job is printed on the terminal after the job starts running on the Cerebras Wafer-Scale cluster. The jobID is also recorded in a file run_meta.json inside the <model_dir>/cerebras_logs folder. All jobIDs that use the same model directory <model_dir> are appended to run_meta.json.

run_meta.json contains two sections: compile_jobs and execute_jobs. Once a training job is submitted and before compilation is done, the compile job is recorded under compile_jobs. For this example you will see:
Once compilation is done and the job starts training, it is recorded under execute_jobs. To correlate a compilation job with its training job, compare the available time of the compilation job with the start time of the training job. For this example, you will see:
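As a sketch of how the recorded jobIDs could be read back from run_meta.json: the section names (compile_jobs, execute_jobs) come from the description above, but the per-entry layout and the "id" field used here are assumptions for illustration, not the documented schema.

```shell
# Create a sample run_meta.json. The two section names match the docs;
# the entry layout (an "id" field per job) is an assumption.
cat > run_meta.json <<'EOF'
{
  "compile_jobs": [{"id": "wsjob-compile-sample"}],
  "execute_jobs": [{"id": "wsjob-execute-sample"}]
}
EOF

# Print every recorded jobID, grouped by section.
python3 - <<'EOF'
import json

with open("run_meta.json") as f:
    meta = json.load(f)

for section in ("compile_jobs", "execute_jobs"):
    for job in meta.get(section, []):
        print(f"{section}: {job['id']}")
EOF
```

Because all runs that share a model directory append to the same file, iterating over both sections lists every job launched from that directory.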
With this jobID, you can query the status of a job in the system. The query supports the following flags:
Flag | Default | Description |
---|---|---|
-o | table | Output format: table, json, yaml |
-d, --debug | 0 | Debug level. A higher debug level prints more fields in the output objects. Only applicable to the json and yaml output formats. |
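A hedged sketch of consuming the json output format downstream: the exact query subcommand and the response fields depend on your csctl version, so the JSON below is a mocked stand-in used purely to show the parsing step.

```shell
# Hypothetical stand-in for the JSON a status query (-o json) might return.
# The field names here ("name", "phase") are assumptions for illustration.
cat > job_status.json <<'EOF'
{"name": "wsjob-sample", "phase": "RUNNING"}
EOF

# Pull out just the phase field from the JSON response.
python3 -c "import json; print(json.load(open('job_status.json'))['phase'])"
# -> RUNNING
```

Selecting json or yaml output makes results like this scriptable, which the default table format is not.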
More details on the jobID are given in Job Tracking. To cancel a job, you can use the cancel command with the jobID.
You can assign labels to a job with the --job_labels flag when you submit your training job. The flag takes a list of equal-sign-separated key-value pairs to serve as job labels.
For example, to assign job labels when training a GPT-2 model using PyTorch, you would use:
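A minimal sketch of the label syntax described above. The flag value shown (model=gpt2,framework=pytorch) is an assumed example, not taken from a real run script.

```shell
# Assumed example value for --job_labels: a comma-separated list of
# equal-sign-separated key=value pairs.
labels="model=gpt2,framework=pytorch"

# Split the list and print each key/value pair, as a sanity check of the syntax.
IFS=',' read -ra pairs <<< "$labels"
for pair in "${pairs[@]}"; do
    key="${pair%%=*}"
    value="${pair#*=}"
    echo "$key -> $value"
done
# model -> gpt2
# framework -> pytorch
```

Labels chosen this way later appear in the Labels column of the job listing, so picking consistent keys (e.g. model, framework) makes jobs easy to filter.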
Field | Description |
---|---|
Name | jobID identification |
Age | Time since job submission |
Duration | How long the job ran |
Phase | One of QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELLED |
Systems | CS-X systems used in this job |
User | User that started this job |
Labels | Custom labels set by the user |
Dashboard | Grafana dashboard link for this job |
You can use grep to extract relevant information about which jobs are queued versus running and how many systems are occupied.
When you grep 'RUNNING', you see a list of jobs that are currently running on the cluster.
For example, as shown below, there is one job running:
When you grep 'QUEUED', you see a list of jobs that are currently queued and waiting for system availability before training starts.
For example, at the same time of the above running job, there is another job currently queued, as shown below:
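The filtering pattern can be sketched against a mocked jobs table. The rows below are invented sample data, not real csctl output; the columns loosely follow the field table above.

```shell
# Mocked `csctl get jobs`-style table (rows invented for illustration).
cat > jobs.txt <<'EOF'
NAME          AGE  DURATION  PHASE    SYSTEMS      USER
wsjob-aaaaaa  10m  9m        RUNNING  systemcs-01  user1
wsjob-bbbbbb  5m   -         QUEUED   -            user2
EOF

# Jobs currently running on the cluster:
grep 'RUNNING' jobs.txt

# Jobs waiting for system availability:
grep 'QUEUED' jobs.txt

# Counting matches gives a quick view of cluster load:
grep -c 'RUNNING' jobs.txt
grep -c 'QUEUED' jobs.txt
```

Since the phase appears as a plain word in each row, a simple grep on the phase name is enough; grep -c turns the same filter into a headcount.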
Flag | Default Value | Description |
---|---|---|
-b, --binaries | False | Include binary debugging artifacts |
-h, --help | | Informative message for log-export |
The CPU and MEM columns are only relevant for nodes, and the system-in-use column is only relevant for CS-X systems. The CPU percentage is scaled so that 100% indicates that all CPU cores are fully utilized.
For example:
Flag | Description |
---|---|
-e, --error-only | Only show nodes/systems in an error state |
-n, --node-only | Only show nodes, omit the system list |
-s, --system-only | Only show CS-X systems, omit the node list |
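What these filters do can be sketched against a mocked cluster listing. The rows and column layout below are invented for illustration; the awk filters only mirror the flags' effect on sample text, they are not the flags' implementation.

```shell
# Mocked `csctl get cluster`-style listing (rows invented for illustration).
cat > cluster.txt <<'EOF'
NAME          TYPE    CPU   MEM   SYSTEM-IN-USE  STATE
worker-01     node    42%   61%   -              ok
worker-02     node    97%   88%   -              error
systemcs-01   system  -     -     yes            ok
EOF

# Roughly what -e, --error-only reports: entries in an error state.
awk '$NF == "error"' cluster.txt

# Roughly what -n, --node-only reports: node rows only.
awk '$2 == "node"' cluster.txt
```

In the mocked data, worker-02 is both a node and in an error state, so it is the one row both filters return.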