Job Monitoring CLI
Learn how to use the csctl CLI tool to manage and monitor jobs.
The `csctl` command-line interface (CLI) tool allows you to manage jobs on the cluster directly from the terminal of your user node.
To learn more about the available commands, run:
```
csctl --help
```
Job IDs
Each training job submitted to the cluster launches two sequential jobs, each with its own `jobID`:
- a compilation job (this runs first)
- an execution job (this runs once the compilation job is done)
Job IDs are a required argument for most `csctl` commands. You can view the job ID in your terminal output as each job runs.
Job IDs are also recorded in the `<model_dir>/cerebras_logs/run_meta.json` file, which contains two sections: `compile_jobs` and `execute_jobs`. The compilation job appears under `compile_jobs`. Once the training job is scheduled, its `jobID` and additional log information appear under `execute_jobs`.
To match a compilation job with its training job, compare the available time of the compilation job with the start time of the training job.
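As a sketch of this correlation, the snippet below pairs each entry under `execute_jobs` with the most recent compile job that became available before the execute job started. The field names used here (`id`, `available_time`, `start_time`) are illustrative assumptions; check the exact keys in your own `run_meta.json`.

```python
import json

def correlate_jobs(run_meta):
    """Pair each execute job with the compile job whose availability
    immediately precedes the execute job's start time.

    Note: the keys 'id', 'available_time', and 'start_time' are
    assumptions for illustration -- verify them against run_meta.json.
    """
    pairs = []
    for exec_job in run_meta.get("execute_jobs", []):
        start = exec_job.get("start_time", "")
        # Compile jobs that became available before this job started.
        candidates = [c for c in run_meta.get("compile_jobs", [])
                      if c.get("available_time", "") <= start]
        if candidates:
            # ISO-8601 timestamps compare correctly as strings.
            compile_job = max(candidates, key=lambda c: c["available_time"])
            pairs.append((compile_job["id"], exec_job["id"]))
    return pairs

# Typical usage:
# run_meta = json.load(open("<model_dir>/cerebras_logs/run_meta.json"))
# print(correlate_jobs(run_meta))
```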
Using the `jobID`, you can query the status of a job in the system:
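A sketch of the query, using a placeholder job ID:

```
csctl get job wsjob-000000000001 -o json  # placeholder jobID -- substitute your own
```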
where:
| Flag | Default | Description |
|---|---|---|
| `-o` | `table` | Output format: `table`, `json`, or `yaml` |
| `-d`, `--debug` | `0` | Debug level. Higher levels print more fields in the output objects. Applies only to the `json` and `yaml` output formats. |
Compilation jobs do not require CS-X resources, but they do require resources on the server nodes. Only one compilation job may run in the cluster at a time. Execution jobs require CS-X resources and are queued until those resources become available. Compilation and execution jobs have different job IDs.
Cancel Jobs
You can cancel any compilation or execution job with its `jobID`:
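For example, with a placeholder job ID:

```
csctl cancel job wsjob-000000000001  # placeholder jobID -- substitute your own
```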
Cancelling a job releases all resources and sets the job to a cancelled state.
In release 1.8, this command may cause the client logs to print an error message. This is expected.
Label Jobs
Label a job with the `label` command:
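For example, with a placeholder job ID and an illustrative `key=value` label:

```
csctl label job wsjob-000000000001 team=nlp  # placeholder jobID and label
```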
Run the same command again to remove a label.
Track Queue
Obtain a full list of running and queued jobs on the cluster:
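```
csctl get jobs
```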
By default, this command produces a table including:
| Field | Description |
|---|---|
| Name | `jobID` of the job |
| Age | Time since job submission |
| Duration | How long the job ran |
| Phase | One of `QUEUED`, `RUNNING`, `SUCCEEDED`, `FAILED`, `CANCELLED` |
| Systems | CS-X systems used by this job |
| User | User who submitted this job |
| Labels | User-defined labels |
| Dashboard | Grafana dashboard link for this job |
Directly executing the command prints a long list of current and past jobs. Use `-l` to return only the jobs that match a given set of labels:
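For example, with an illustrative label:

```
csctl get jobs -l team=nlp  # placeholder label -- substitute your own
```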
If you only want to see your own jobs, use `-m`:
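```
csctl get jobs -m
```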
To also include completed and failed jobs, use `-a`:
These filter options can be combined. For example, to see your complete job history:
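```
csctl get jobs -m -a
```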
You can also use `grep` to see which jobs are queued versus running and how many systems are occupied.
`grep 'RUNNING'` shows the jobs that are currently running on the cluster.
For example:
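```
csctl get jobs | grep 'RUNNING'
```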
`grep 'QUEUED'` shows the jobs that are currently queued.
For example:
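```
csctl get jobs | grep 'QUEUED'
```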
Get Configured Volumes
Get a list of mounted volumes on the cluster:
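```
csctl get volume
```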
Update Job Priority
Update the priority for a given job, where `priority_value` is `p1`, `p2`, or `p3`. For example, you can update a job's priority to P2:
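The exact subcommand syntax below is an assumption (confirm it with `csctl --help`); it sketches updating a placeholder job to priority `p2`:

```
# Assumed syntax -- verify with `csctl --help` before use.
csctl update job wsjob-000000000001 --priority p2
```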
Redundancy Pools
If you’re running a larger job with Auto Job Restart enabled, you can create a redundancy pool to ensure job continuity and system reliability.
The system will utilize the redundacy pool only when:
- Local session systems or nodes become unhealthy
- Local session lacks sufficient resources to complete the job
In those cases, resources from the redundant pool are used to complete the job.
Limitations and Compatibility Requirements
- Jobs must fit within the session’s total system capacity. For example, a session with 2 total systems cannot submit a 3-system job, even if redundancy systems are available.
- You cannot submit jobs directly to the redundant pool. If you need a redundant system for debugging purposes, system administrators can manually move redundancy systems to a debug session.
- The redundancy pool session operates in a restricted mode and is only accessible by jobs from permitted sessions.
Redundancy pool systems are filtered based on system version matching:
- Systems must align with the local session’s cluster version.
- Cluster administrators must maintain version consistency across all session systems.
- If a local session contains a system with version 2.4.1, for example, the redundancy pool will select systems with the same version.
Create Redundancy Pool
To create a redundancy pool, run the following command:
Then enable the pool:
To list all sessions in redundant mode:
Export Logs
To download logs for a specific job:
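```
csctl log-export wsjob-000000000001  # placeholder jobID -- substitute your own
```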
with optional flags:
| Flag | Default | Description |
|---|---|---|
| `-b`, `--binaries` | `False` | Include binary debugging artifacts |
| `-h`, `--help` | | Print an informative message for `log-export` |
For example:
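```
csctl log-export wsjob-000000000001 -b  # placeholder jobID -- substitute your own
```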
These logs are useful when debugging a job failure with Cerebras support.
Worker SSD Cache
To speed up processing of large amounts of input data, users can stage their data in the worker nodes’ local SSD cache. This cache is shared among users.
Get Worker Cache Usage
Use this command to obtain the current worker cache usage on each worker node:
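```
csctl get worker-cache
```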
Clear Worker Cache
If the cache is full, use the clear command to delete the contents of all caches on all nodes.
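```
csctl clear-worker-cache
```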
Cluster Status
Check the status and system load of all CS-X systems and all cluster nodes:
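```
csctl get cluster
```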
In this table, note that the `CPU` and `MEM` columns are only relevant for nodes, and `system-in-use` is only relevant for CS-X systems. The `CPU` percentage is scaled so that 100% indicates that all CPU cores are fully utilized.
You can filter the output with the following options:
| Flag | Description |
|---|---|
| `-e`, `--error-only` | Only show nodes/systems in an error state |
| `-n`, `--node-only` | Only show nodes, omitting the system list |
| `-s`, `--system-only` | Only show CS-X systems, omitting the node list |