CLI for Job Monitoring
Learn how to use the csctl CLI tool to manage and monitor jobs.
The csctl command-line interface (CLI) tool is preinstalled on the user node for efficient interaction with the Cerebras Cluster’s resource manager. This tool is pivotal for managing jobs within the cluster, which adheres to a first-come, first-served queuing system for resource allocation.
Key Features of csctl
- Job Tracking: Inspect the state of submitted jobs, and cancel your own jobs if necessary.
- Job Labelling: Apply labels to a job.
- Queue Tracking: Review which jobs are queued and which jobs are running on the Cerebras cluster.
- Get Configured Volumes: Get a list of configured volumes on the Cerebras cluster. These volumes can be used to stage code and training data.
- Job Priority: Update the priority of a given job.
- Log Export: Export Cerebras cluster logs of a given job to the user node. These logs can be useful when debugging a job failure and working with the Cerebras support team.
- Worker SSD Cache: Query worker SSD cache usage.
- Cluster Status: Query cluster status.
Usage Guidelines
Use the `csctl` tool directly from the terminal of your user node. To get the help message:

```
csctl --help
```
Job Tracking
Each training job submitted to the Cerebras cluster launches two sequential jobs. First, a compilation job is launched. When compilation is completed, an execution job is launched. Each of these is identified by a `jobID`. The `jobID` for each job will be printed on the terminal after the job starts running on the Cerebras Wafer-Scale cluster.
The `jobID` is also recorded in a file `run_meta.json` inside the `<model_dir>/cerebras_logs` folder. All `jobIDs` that use the same model directory `<model_dir>` are appended to `run_meta.json`. `run_meta.json` contains two sections: `compile_jobs` and `execute_jobs`. Once a training job is submitted and before compilation is done, the compile job will be recorded under `compile_jobs`.
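For this example, you will see an entry shaped roughly like the sketch below. The `jobID` and the exact fields are illustrative and vary by release:

```
{
    "compile_jobs": [
        {
            "id": "wsjob-000000000001",
            "log_path": "/cb/workdirs/wsjob-000000000001",
            "start_time": "2023-02-02T01:55:56Z"
        }
    ]
}
```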
After the compilation job has completed and the training job is scheduled, the compile job will report additional log information, and the `jobID` of the training job will be recorded under `execute_jobs`. To correlate a compilation job with its training job, compare the available time of the compilation job with the start time of the training job.
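For this example, you will see something like the sketch below, where the compile job's `available_time` lines up with the execute job's `start_time` (again, the `jobIDs` and fields are illustrative):

```
{
    "compile_jobs": [
        {
            "id": "wsjob-000000000001",
            "log_path": "/cb/workdirs/wsjob-000000000001",
            "start_time": "2023-02-02T01:55:56Z",
            "available_time": "2023-02-02T02:00:48Z"
        }
    ],
    "execute_jobs": [
        {
            "id": "wsjob-000000000002",
            "log_path": "/cb/workdirs/wsjob-000000000002",
            "start_time": "2023-02-02T02:01:10Z"
        }
    ]
}
```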
Using the `jobID`, you can query information about the status of a job in the system using:
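The general form (square brackets denote optional flags):

```
csctl [-d <debug_level>] get job <jobID> [-o <output_format>]
```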
where:
Flag | Default | Description
---|---|---
`-o` | table | Output format: `table`, `json`, `yaml`
`-d`, `--debug` | 0 | Debug level. Choosing a higher debug level prints more fields in the output objects. Only applicable to the `json` and `yaml` output formats.
Compilation and execution jobs are queued and executed sequentially in the Cerebras cluster: the compilation job completes before the execution job is scheduled. Compilation jobs do not require CS-X resources, but they do require some resources on the server nodes. In 1.8, only one compilation job can run concurrently in the cluster. Execution jobs require CS-X resources, so they are queued until sufficient CS-X resources are available. Compilation and execution jobs have different `jobIDs`.
Job Termination
You can terminate any compilation or execution job before completion by providing its `jobID` (see Job Tracking for more details on the `jobID`). To cancel a job, you can use:
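```
csctl cancel job <jobID>
```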
Terminating a job releases all resources and sets the job to a cancelled state.
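An example of cancelling a job, with a placeholder `jobID`; the exact confirmation text may vary by release:

```
$ csctl cancel job wsjob-000000000002
Job wsjob-000000000002 cancelled successfully
```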
In 1.8, this command might cause the client logs to print additional messages. This is expected.
Job Labelling
You can add labels to your jobs to help categorize them better. There are two ways to add labels to your jobs.

One way is to use the flag `--job_labels` when you submit your training job. It takes a list of equal-sign-separated key-value pairs that serve as job labels.
For example, to assign a job label to training a GPT-2 model using PyTorch, you would use:
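A sketch of such a launch, assuming a Model Zoo `run.py` training script; the surrounding flags are illustrative, and multiple labels are separated by commas:

```
python run.py CSX \
    --mode train \
    --params params.yaml \
    --job_labels model=gpt2,framework=pytorch
```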
The other way to add labels to your jobs is through the `csctl` command:
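```
csctl label job <jobID> key=value
```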
You can use this command to remove a label from your job:
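Assuming label removal follows the kubectl-style trailing-dash convention on the key:

```
csctl label job <jobID> key-
```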
Queue Tracking
To obtain a full list of running and queued jobs on the Cerebras cluster, you can use
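```
csctl get jobs
```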
By default, this command produces a table including:
Field | Description
---|---
Name | `jobID` identification
Age | Time since job submission
Duration | How long the job ran
Phase | One of `QUEUED`, `RUNNING`, `SUCCEEDED`, `FAILED`, `CANCELLED`
Systems | CS-X systems used in this job
User | User that started this job
Labels | Labels customized by the user
Dashboard | Grafana dashboard link for this job
For example:
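A representative listing; the job name, system, user, and dashboard link are placeholders:

```
$ csctl get jobs
NAME                AGE  DURATION  PHASE    SYSTEMS    USER   LABELS      DASHBOARD
wsjob-000000000002  10m  8m        RUNNING  systemcs1  user1  model=gpt2  https://grafana.example.com/...
```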
Directly executing the command prints out a long list of current and past jobs. You can use the `-l` option to return only jobs that match a given set of labels:
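For example, filtering on the illustrative label used above:

```
csctl get jobs -l model=gpt2
```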
If you only want to see your own jobs, use the `-m` option.

To also include completed and failed jobs, use the `-a` option:

```
csctl get jobs -a
```

These filter options can be combined. For example, to see your complete job history:
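```
csctl get jobs -m -a
```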
You can also use `grep` to extract which jobs are queued versus running and how many systems are occupied.

When you `grep 'RUNNING'`, you see a list of jobs that are currently running on the cluster.
For example, as shown below, there is one job running:
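An illustrative invocation (the job and system names are placeholders):

```
$ csctl get jobs | grep 'RUNNING'
wsjob-000000000002  10m  8m  RUNNING  systemcs1  user1  model=gpt2
```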
When you `grep 'QUEUED'`, you see a list of jobs that are currently queued and waiting for system availability to start training.
For example, at the same time as the above running job, another job is currently queued, as shown below:
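Again with placeholder values:

```
$ csctl get jobs | grep 'QUEUED'
wsjob-000000000003  5m  0s  QUEUED  user2
```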
Get Configured Volumes
After the Cerebras cluster is installed, the system admin will configure a few volumes to be used in your jobs to access code and training data. To get a list of mounted volumes on the Cerebras cluster, you can use:
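```
csctl get volume
```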
For example:
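A sample listing; the volume name, server, and paths are placeholders, and the exact columns may vary by release:

```
$ csctl get volume
NAME           TYPE  CONTAINERPATH  SERVER    SERVERPATH  READONLY
training-data  nfs   /ml            10.0.0.1  /ml         false
```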
Update the Priority for a Job
To update the priority for a given job, use the following command:
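A sketch assuming the `job set-priority` subcommand form; consult `csctl --help` on your release for the exact syntax:

```
csctl job set-priority <jobID> <priority>
```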
For example, the following updates a job's priority to P2; a confirmation is displayed:
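An illustrative exchange with a placeholder `jobID`; the confirmation wording may differ by release:

```
$ csctl job set-priority wsjob-000000000002 p2
Successfully updated priority for job wsjob-000000000002
```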
Log Export
To download Cerebras cluster logs of a given job to the user node, you can use
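```
csctl log-export <jobID>
```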
with optional flags:
Flag | Default Value | Description
---|---|---
`-b`, `--binaries` | False | Include binary debugging artifacts
`-h`, `--help` | | Show an informative message for log-export
For example:
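With a placeholder `jobID`, including binary artifacts:

```
csctl log-export wsjob-000000000002 -b
```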
Cerebras cluster logs can be useful when debugging a job failure and working with Cerebras support.
Worker SSD Cache
To speed up the processing of large amounts of input data, users can stage their data in the worker nodes' local SSD cache. This cache is shared among different users.
Get Worker Cache Usage
Use this command to obtain the current worker cache usage on each worker node:
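```
csctl get worker-cache
```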
Clear Worker Cache
If the cache is full, use the clear command to delete the contents of all caches on all nodes.
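```
csctl clear worker-cache
```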
Cluster Status
You can check the status and system load of all CS-X systems and all cluster nodes by running
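```
csctl get cluster
```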
In this table, note that the `CPU` and `MEM` columns are only relevant for nodes, and `system-in-use` is only relevant for CS-X systems. The `CPU` percentage is scaled so that 100% indicates that all CPU cores are fully utilized.
For example:
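An illustrative snippet; the exact columns and names depend on your cluster and release:

```
$ csctl get cluster
NAME        TYPE    STATE  CPU  MEM  SYSTEM-IN-USE
systemcs1   system  ok     -    -    true
worker-01   worker  ok     35%  42%  -
```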
You can filter the output with the following options:
Flag | Description
---|---
`-e`, `--error-only` | Only show nodes/systems in an error state
`-n`, `--node-only` | Only show nodes, omit the system list
`-s`, `--system-only` | Only show CS-X systems, omit the node list
Conclusion
By leveraging csctl, users can effectively manage their jobs and resources on the Cerebras Cluster, ensuring optimal use of available computational assets.