The csctl command-line interface (CLI) tool allows you to manage jobs on the cluster directly from the terminal of your user node.
To learn more about the available commands, run:
csctl --help
Cerebras cluster command line tool.
Usage:
csctl [command]
Available Commands:
cancel Cancel job
check-volumes Check volume validity on this usernode
clear-worker-cache Clear the worker cache
config View csctl config files
get Get resources
job Job management commands
label Label resources
log-export Gather and download logs.
types Display resource types
Flags:
--csconfig string config file (default "/opt/cerebras/config_v2")
-d, --debug int higher debug values will display more fields in output objects
-h, --help help for csctl
-n, --namespace string configure csctl to talk to different user namespaces
--version version for csctl
Use "csctl [command] --help" for more information about a command.
Job IDs
Each training job submitted to the cluster launches two sequential jobs, each with its own jobID:
- a compilation job (this runs first)
- an execution job (this runs once the compilation job is done)
Job IDs are a required argument for most csctl commands. You can view the job ID in your terminal as each job is run:
Extracting the model from framework. This might take a few minutes.
WARNING:root:The following model params are unused: precision_opt_level, loss_scaling
2023-02-05 02:00:00,450 INFO: Compiling the model. This may take a few minutes.
2023-02-05 02:00:00,635 INFO: Initiating a new compile wsjob against the cluster server.
2023-02-05 02:00:00,761 INFO: Compile job initiated
...
2023-02-05 02:02:00,899 INFO: Ingress is ready.
2023-02-05 02:02:00,899 INFO: Cluster mgmt job handle: {'job_id': 'wsjob-aaaaaaaaaa000000000', 'service_url': 'cluster-server.cerebras.local:443', 'service_authority': 'wsjob-aaaaaaaaaa000000000-coordinator-0.cluster-server.cerebras.local', 'compile_dir_absolute_path': '/cerebras/cached_compile/cs_0000000000111111'}
2023-02-05 02:02:00,901 INFO: Creating a framework GRPC client: cluster-server.cerebras.local:443
2023-02-05 02:07:00,112 INFO: Compile successfully written to cache directory: cs_000000000011111
2023-02-05 02:07:30,118 INFO: Compile for training completed successfully!
2023-02-05 02:07:30,120 INFO: Initiating a new execute wsjob against the cluster server.
2023-02-05 02:07:30,248 INFO: Execute job initiated
...
2023-02-05 02:08:00,321 INFO: Ingress is ready.
2023-02-05 02:08:00,321 INFO: Cluster mgmt job handle: {'job_id': 'wsjob-bbbbbbbbbbb11111111', 'service_url': 'cluster-server.cerebras.local:443', 'service_authority': 'wsjob-bbbbbbbbbbb11111111-coordinator-0.cluster-server.cerebras.local', 'compile_artifact_dir': '/cerebras/cached_compile/cs_0000000000111111'}
...
Job IDs are also recorded in the <model_dir>/cerebras_logs/run_meta.json file, which contains two sections: compile_jobs and execute_jobs.
For example, the compile job will show under compile_jobs, while the training job and some additional log information will show under execute_jobs:
{
  "compile_jobs": [
    {
      "id": "wsjob-aaaaaaaaaa000000000",
      "log_path": "/cerebras/workdir/wsjob-aaaaaaaaaa000000000",
      "start_time": "2023-02-05T02:00:00Z"
    }
  ]
}
After the training job is scheduled, additional log information and the jobID of the training job will show under execute_jobs.
To correlate a compilation job with its training job, compare the compile job's available_time with the training job's start_time. For this example, you will see:
{
  "compile_jobs": [
    {
      "id": "wsjob-aaaaaaaaaa000000000",
      "log_path": "/cerebras/workdir/wsjob-aaaaaaaaaa000000000",
      "start_time": "2023-02-05T02:00:00Z",
      "cache_compile": {
        "location": "/cerebras/cached_compile/cs_0000000000111111",
        "available_time": "2023-02-05T02:02:00Z"
      }
    }
  ],
  "execute_jobs": [
    {
      "id": "wsjob-bbbbbbbbbbb11111111",
      "log_path": "/cerebras/workdir/wsjob-bbbbbbbbbbb11111111",
      "start_time": "2023-02-05T02:02:00Z"
    }
  ]
}
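If you want to extract these IDs programmatically, a shell one-liner along these lines can read run_meta.json. This is only a sketch, assuming jq is installed and <model_dir> is your model directory:
# print the IDs of the most recent compile and execute jobs
jq -r '.compile_jobs[-1].id, .execute_jobs[-1].id' <model_dir>/cerebras_logs/run_meta.json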
Using the jobID, query information about the status of a job in the system:
csctl [-d int] get job <jobID> [-o json|yaml]
where:
Flag | Default | Description
---|---|---
-o | table | Output format: table, json, yaml
-d, --debug | 0 | Debug level. Choosing a higher level of debug prints more fields in the output objects. Only applicable to json or yaml output format.
csctl -d0 get job wsjob-000000000000 -oyaml
meta:
createTime: "2022-12-07T05:10:16Z"
labels:
label: customed_label
user: user1
name: wsjob-000000000000
type: job
spec:
user:
gid: "1001"
uid: "1000"
volumeMounts:
- mountPath: /data
name: data-volume-000000
subPath: ""
- mountPath: /dev/shm
name: dev-shm
subPath: ""
status:
phase: SUCCEEDED
systems:
- systemCS2_1
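For scripting, the -o json output format described above can be combined with jq to read a single field. This is a sketch, assuming jq is installed and that the JSON output mirrors the YAML structure shown above:
# print only the job phase, e.g. QUEUED, RUNNING, or SUCCEEDED
csctl get job wsjob-000000000000 -o json | jq -r '.status.phase'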
Compilation jobs do not require CS-X resources, but they do require resources on the server nodes. We allow only one concurrent compilation running in the cluster. Execution jobs require CS-X resources and will be queued up until those resources are available. Compilation and execution jobs have different job IDs.
Cancel Jobs
You can cancel any compilation or execution job with its jobID using the cancel command:
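csctl cancel job <jobID>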
Cancelling a job releases all resources and sets the job to a cancelled state.
In 1.8, this command might cause the client logs to print:
cerebras.appliance.errors.ApplianceUnknownError: Received unexpected gRPC error (StatusCode.UNKNOWN) : 'Stream removed' while monitoring Coordinator for Runtime server errors
This is expected.
Label Jobs
Label a job with the label command:
csctl label job wsjob-000000000000 framework=pytorch
Run the same command again to remove a label.
Track Queue
Obtain a full list of running and queued jobs on the cluster:
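csctl get jobs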
By default, this command produces a table including:
Field | Description
---|---
Name | jobID identification
Age | Time since job submission
Duration | How long the job ran
Phase | One of QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELLED
Systems | CS-X systems used in this job
User | User that started this job
Labels | Customized labels by user
Dashboard | Grafana dashboard link for this job
For example:
csctl get jobs
NAME AGE DURATION PHASE SYSTEMS USER LABELS DASHBOARD
wsjob-000000000001 18h 20s RUNNING systemCS2_1, systemCS2_2 user2 model=gpt3-tiny https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000001
wsjob-000000000002 1h 6m25s QUEUED user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
wsjob-000000000003 10m 2m01s QUEUED user1 model=gpt3-tiny https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000003
Running the command directly prints a long list of current and past jobs. Use -l to return only jobs that match a given set of labels:
csctl get jobs -l model=neox,team=ml
NAME AGE DURATION PHASE SYSTEMS USER LABELS DASHBOARD
wsjob-000000000002 1h 6m25s QUEUED user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
If you only want to see your own jobs, use -m:
csctl get jobs -m
NAME AGE DURATION PHASE SYSTEMS USER LABELS DASHBOARD
wsjob-000000000003 10m 2m01s QUEUED user1 model=gpt3-tiny https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000003
To also include completed and failed jobs, use -a:
csctl get jobs -a
NAME AGE DURATION PHASE SYSTEMS USER LABELS DASHBOARD
wsjob-000000000000 43h 6m27s SUCCEEDED systemCS2_1 user1 model=gpt3xl https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000000
wsjob-000000000001 18h 20s RUNNING systemCS2_1, systemCS2_2 user2 model=gpt3-tiny https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000001
wsjob-000000000002 1h 6m25s QUEUED user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
wsjob-000000000003 10m 2m01s QUEUED user1 model=gpt3-tiny https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000003
These filter options can be combined. For example, to see your complete job history:
csctl get jobs -a -m
NAME AGE DURATION PHASE SYSTEMS USER LABELS DASHBOARD
wsjob-000000000000 43h 6m27s SUCCEEDED systemCS2_1 user1 model=gpt3xl https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000000
wsjob-000000000003 10m 2m01s QUEUED user1 model=gpt3-tiny https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000003
You can also use grep to view which jobs are queued versus running and how many systems are occupied.
grep 'RUNNING' shows the jobs that are currently running on the cluster.
For example:
csctl get jobs | grep 'RUNNING'
wsjob-000000000001 18h 20s RUNNING systemCS2_1, systemCS2_2 user2 model=gpt3-tiny https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000001
grep 'QUEUED' shows the jobs that are currently queued.
For example:
csctl get jobs | grep 'QUEUED'
wsjob-000000000002 1h 6m25s QUEUED user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
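You can also count matching jobs instead of listing them, for example:
csctl get jobs | grep -c 'QUEUED'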
Get a list of mounted volumes on the cluster:
csctl get volume
NAME TYPE CONTAINERPATH SERVER SERVERPATH READONLY
training-data-volume nfs /ml 10.10.10.10 /ml false
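The help output above also lists a check-volumes command for verifying volume validity from your user node. As a sketch (shown without arguments; exact flags may differ):
csctl check-volumes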
Update Job Priority
Update the priority for a given job, where priority_value is p1, p2, or p3:
csctl job set-priority <wsjob_id> -n <namespace> <priority_value>
For example:
$ csctl job set-priority wsjob-xxxxxx -n mynamespace p2
This updates the job’s priority to P2.
Redundancy Pools
If you’re running a larger job with Auto Job Restart enabled, you can create a redundancy pool to ensure job continuity and system reliability.
The system will use the redundancy pool only when:
- Local session systems or nodes become unhealthy
- Local session lacks sufficient resources to complete the job
In those cases, resources from the redundant pool are used to complete the job.
Limitations and Compatibility Requirements
- Jobs must fit within the session’s total system capacity. For example, a session with 2 total systems cannot submit a 3-system job, even if redundancy systems are available.
- You cannot submit jobs directly to the redundant pool. If you need a redundant system for debugging purposes, system administrators can manually move redundancy systems to a debug session.
- The redundancy pool session operates in a restricted mode and is only accessible by jobs from permitted sessions.
Redundancy pool systems are filtered based on system version matching:
- Systems must align with the local session’s cluster version.
- Cluster administrators must maintain version consistency across all session systems.
- If a local session contains a system with version 2.4.1, for example, the redundancy pool will select systems with the same version.
Create Redundancy Pool
To create a redundancy pool, run the following command:
csctl session create --redundant
Then enable the pool:
# enable pool-a/pool-b as the redundancy pools for <session-name>
csctl session update <session-name> --redundancy-sessions=<pool-a,pool-b...>
# clear the redundancy pools for <session-name>
csctl session update <session-name> --redundancy-sessions=""
To list all sessions in redundant mode:
# list all sessions in redundant mode only
csctl session list --redundant
NAME SYSTEMS NODEGROUPS CRD-NODES REDUNDANCY LAST-ACTIVE
test1 0 0 0 *redundant >7 days
test2 0 0 0 *redundant >7 days
Export Logs
To download logs for a specific job:
csctl log-export <jobID> [-b]
with optional flags:
Flag | Default Value | Description
---|---|---
-b, --binaries | False | Include binary debugging artifacts
-h, --help | | Informative message for log-export
For example:
csctl log-export wsjob-example-0
Gathering log data within cluster...
Starting a fresh download of log archive.
Downloaded 0.55 MB.
Logs archive: ./wsjob-example-0.zip
These logs are useful when debugging a job failure with Cerebras support.
Worker SSD Cache
To speed up processing of large amounts of input data, users can stage their data in the worker nodes’ local SSD cache. This cache is shared among users.
Get Worker Cache Usage
Use this command to obtain the current worker cache usage on each worker node:
csctl get worker-cache
NODE DISK USAGE
worker-01 57.86%
worker-02 50.84%
worker-03 49.47%
worker-04 63.56%
worker-05 63.56%
worker-06 63.71%
worker-07 63.22%
worker-09 65.80%
Clear Worker Cache
If the cache is full, use the clear command to delete the contents of all caches on all nodes.
csctl clear-worker-cache
Worker caches cleared successfully
Cluster Status
Check the status and system load of all CS-X systems and all cluster nodes:
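csctl get cluster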
In this table, note that the CPU and MEM columns are only relevant for nodes, and SYSTEM-IN-USE is only relevant for CS-X systems. The CPU percentage is scaled so that 100% indicates that all CPU cores are fully utilized.
For example:
csctl get cluster
NAME TYPE CPU MEM SYSTEM-IN-USE JOBID JOBLABELS STATE NOTES
systemf103 system n/a n/a InUse wsjob-jcvs23zpsxopvu9ymd2e5u wsjob-label= ok
systemf116 system n/a n/a InUse wsjob-jcvs23zpsxopvu9ymd2e5u wsjob-label= ok
cs-swx001-sx-sr18 broadcastreduce 22.17% 14.20% n/a n/a ok
cs-wse002-mg-sr01 management 3.23% 9.45% n/a n/a ok
cs-wse005-mx-sr04 memory 13.00% 12.93% n/a n/a ok
You can filter the output with the following options:
Flag | Description
---|---
-e, --error-only | Only show nodes/systems in an error state
-n, --node-only | Only show nodes, omit the system list
-s, --system-only | Only show CS-X systems, omit the node list