Key Features of csctl
- Job Tracking: Inspect the state of submitted jobs, and cancel your own jobs if necessary.
- Job Labelling: Apply labels to a job.
- Queue Tracking: Review which jobs are queued and which jobs are running on the Cerebras cluster.
- Get Configured Volumes: Get a list of configured volumes on the Cerebras cluster. These volumes can be used to stage code and training data.
- Job Priority: Update the priority of a given job.
- Log Export: Export Cerebras cluster logs of a given job to the user node. These logs can be useful when debugging a job failure and working with the Cerebras support team.
- Worker SSD Cache: Query worker SSD cache usage.
- Cluster Status: Query cluster status.
Usage Guidelines
Use the csctl tool directly from the terminal of your user node.
To get the help message:
csctl --help
Job Tracking
Each training job submitted to the Cerebras cluster launches two sequential jobs. First, a compilation job is launched. When compilation is completed, an execution job is launched. Each of these is identified by a jobID. The jobID for each job is printed on the terminal after the job starts running on the Cerebras Wafer-Scale cluster.
The jobID is also recorded in the run_meta.json file inside the <model_dir>/cerebras_logs folder. The jobIDs of all jobs that use the same model directory <model_dir> are appended to run_meta.json. The file contains two sections: compile_jobs and execute_jobs. After a training job is submitted and before compilation is done, the compile job is recorded under compile_jobs. For this example you will see:
After compilation is completed, the execution job is recorded under execute_jobs. To match a compilation job to its training job, compare the available time of the compilation job with the start time of the training job. For this example, you will see:
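As a minimal sketch, the two sections of run_meta.json can be read with a few lines of Python. The helper name load_job_sections is hypothetical, and the sketch relies only on what is described above: a JSON file with compile_jobs and execute_jobs sections. The layout of each entry is not assumed, so entries are returned unchanged; treating a missing section as empty is also an assumption.

```python
import json
from pathlib import Path


def load_job_sections(run_meta_path):
    """Return the compile_jobs and execute_jobs sections of a run_meta.json file.

    Hypothetical helper: entries are passed through untouched because this
    sketch makes no assumption about the fields inside each entry.
    """
    meta = json.loads(Path(run_meta_path).read_text())
    return {
        # A section that is absent from the file is treated as empty (an assumption).
        "compile_jobs": meta.get("compile_jobs", []),
        "execute_jobs": meta.get("execute_jobs", []),
    }
```

Calling load_job_sections on the run_meta.json file under <model_dir>/cerebras_logs returns both sections, so the compile and execute entries can be inspected side by side.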
Using the jobID, you can query the status of a job in the system using:
| Flag | Default | Description |
|---|---|---|
| -o | table | Output Format: table, json, yaml |
| -d, --debug | 0 | Debug level. A higher debug level prints more fields in the output objects. Only applicable to the json and yaml output formats. |
Compilation and execution jobs are queued and executed sequentially in the Cerebras cluster. This means that the compilation job is completed before the execution job is scheduled. Compilation jobs do not require CS-X resources, but they do require some resources on the server nodes. In release 1.8, only one compilation job can run in the cluster at a time. Execution jobs require CS-X resources and are queued until sufficient CS-X resources are available. Compilation and execution jobs have different jobIDs.
Job Termination
You can terminate any compilation or execution job before completion by providing its jobID. See Job Tracking for more details on jobIDs. To cancel a job, you can use:
Job Labelling
You can add labels to your jobs to help categorize them. There are two ways to add labels to your jobs. One way is to use the --job_labels flag when you submit your training job. The flag takes a list of key-value pairs, with each key separated from its value by an equal sign.
For example, to label a job that trains a GPT-2 model using PyTorch, you would use:
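The key=value form of the labels can be sketched in Python as follows. This is an illustration only: the helper name format_job_labels and the label names framework and model are hypothetical, and rejecting keys that contain an equal sign is an assumption. How the pairs are delimited on the command line is not specified here, so the sketch returns them as a list.

```python
def format_job_labels(labels):
    """Turn a mapping of label names to values into key=value strings."""
    pairs = []
    for key, value in labels.items():
        # Assumption: a key cannot contain '=', since '=' separates key from value.
        if "=" in key:
            raise ValueError(f"label key {key!r} must not contain '='")
        pairs.append(f"{key}={value}")
    return pairs


# Hypothetical labels for a PyTorch GPT-2 training job.
print(format_job_labels({"framework": "pytorch", "model": "gpt2"}))
```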
Queue Tracking
To obtain a full list of running and queued jobs on the Cerebras cluster, you can use:
| Field | Description |
|---|---|
| Name | jobID identification |
| Age | Time since job submission |
| Duration | How long the job ran |
| Phase | One of QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELLED |
| Systems | CS-X systems used in this job |
| User | User who started the job |
| Labels | Custom labels set by the user |
| Dashboard | Grafana dashboard link for this job |
Use grep to extract information about which jobs are queued versus running and how many systems are occupied.
When you grep 'RUNNING', you see a list of jobs that are currently running on the cluster.
For example, as shown below, there is one job running:
When you grep 'QUEUED', you see a list of jobs that are currently queued and waiting for system availability to start training.
For example, while the job above is running, another job is queued, as shown below:
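The same filtering can be sketched in Python for use in scripts. This is a rough equivalent of the grep approach, not part of csctl: the function name jobs_in_phase is hypothetical, and the sample listing is invented for illustration with fewer columns than real output. Unlike a plain grep, the sketch matches the phase as a whole column value, so a label that merely contains the word does not match.

```python
def jobs_in_phase(listing, phase):
    """Return the rows of a job listing whose columns include the given phase."""
    return [line for line in listing.splitlines() if phase in line.split()]


# Hypothetical listing; job names, systems, and users are made up, and real
# output also carries the Age, Duration, Labels, and Dashboard columns.
SAMPLE = """\
NAME       PHASE    SYSTEMS   USER
job-aaa    RUNNING  system-1  user1
job-bbb    QUEUED             user2
"""

running = jobs_in_phase(SAMPLE, "RUNNING")
queued = jobs_in_phase(SAMPLE, "QUEUED")
print(len(running), "running,", len(queued), "queued")
```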
Get Configured Volumes
After the Cerebras cluster is installed, the system administrator configures a few volumes that your jobs can use to access code and training data. To get a list of mounted volumes on the Cerebras cluster, you can use:
Update the Priority for a Job
To update the priority for a given job, use the following command:
Log Export
To download Cerebras cluster logs of a given job to the user node, you can use:
| Flag | Default Value | Description |
|---|---|---|
| -b, --binaries | False | Include binary debugging artifacts |
| -h, --help | | Informative message for log-export |
Worker SSD Cache
To speed up the processing of large amounts of input data, users can stage their data in the worker nodes' local SSD cache. This cache is shared among users.
Get Worker Cache Usage
Use this command to obtain the current worker cache usage on each worker node:
Clear Worker Cache
If the cache is full, use the clear command to delete the contents of all caches on all nodes.
Cluster Status
You can check the status and system load of all CS-X systems and all cluster nodes by running:
The CPU and MEM columns are only relevant for nodes, and system-in-use is only relevant for CS-X systems. The CPU percentage is scaled so that 100% indicates that all CPU cores are fully utilized.
For example:
| Flag | Description |
|---|---|
| -e, --error-only | Only show nodes/systems in an error state |
| -n, --node-only | Only show nodes, omit the system list |
| -s, --system-only | Only show CS-X systems, omit the node list |
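The scaling of the CPU column can be illustrated with a short calculation. This sketch only shows what "100% means all cores are fully utilized" implies arithmetically; how the cluster actually measures utilization is not described here, and the function name scaled_cpu_percent is hypothetical.

```python
def scaled_cpu_percent(per_core_percent):
    """Average per-core utilization so that 100% means every core is fully busy."""
    if not per_core_percent:
        raise ValueError("at least one core is required")
    return sum(per_core_percent) / len(per_core_percent)


# Two of four cores fully busy and two idle gives half of the node's capacity.
print(scaled_cpu_percent([100.0, 100.0, 0.0, 0.0]))  # prints 50.0
```

Under this scaling, a node with many cores reports a low percentage when only a few of its cores are busy, even if those cores are saturated.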