Learn how to launch a job on a Cerebras cluster.
Flag | Description |
---|---|
CSX | Specifies that the target device for execution is a Cerebras Cluster. |
--params <path/to/params.yaml> | Path to a YAML file containing model/run configuration options. |

Flag | Description | Default |
---|---|---|
--compile_only | Compiles the model by matching to Cerebras kernels and mapping to hardware. No execution occurs. Compile artifacts are stored in the specified `--compile_dir`. For training with a pre-compiled model, use the same `--compile_dir`. Cannot be used with `--validate_only`. | None |
--validate_only | Performs lightweight compilation to validate model compatibility with Cerebras kernels. Does not map to hardware or execute. Cannot be used with `--compile_only`. | None |
--model_dir <path/to/model> | Directory for storing model checkpoints and TensorBoard events files. | $CWD/model_dir |
--compile_dir <path/to/dir> | Directory for storing compile artifacts in the Cerebras cluster. | None |
--num_csx <1,2,4,8,16> | Number of CS-X systems to use for training. | 1 |
--job_priority | Sets the job priority. Valid inputs are p1, p2, and p3. Learn more about how jobs are prioritized here. | p2 |
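
The constraints in the two tables above can be sketched as an `argparse` parser. This is an illustrative model of the documented flags, not the actual Cerebras launcher: the program name `run-sketch` is hypothetical, the allowed `--num_csx` values are taken from the table's `<1,2,4,8,16>` placeholder, and the boolean flags default to `False` here rather than the table's `None`.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented flags (not the real CLI)."""
    parser = argparse.ArgumentParser(prog="run-sketch")  # hypothetical name

    # Required: the target device and the params file.
    parser.add_argument("target", choices=["CSX"])
    parser.add_argument("--params", required=True)

    # --compile_only and --validate_only cannot be used together.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--compile_only", action="store_true")
    mode.add_argument("--validate_only", action="store_true")

    # Optional flags with the defaults listed in the table.
    parser.add_argument("--model_dir",
                        default=os.path.join(os.getcwd(), "model_dir"))
    parser.add_argument("--compile_dir", default=None)
    parser.add_argument("--num_csx", type=int,
                        choices=[1, 2, 4, 8, 16], default=1)
    parser.add_argument("--job_priority",
                        choices=["p1", "p2", "p3"], default="p2")
    return parser


args = build_parser().parse_args(["CSX", "--params", "params.yaml"])
print(args.num_csx, args.job_priority, args.model_dir)
```

Passing both `--compile_only` and `--validate_only` to this sketch makes `argparse` exit with a usage error, which matches the "cannot be used with" notes in the table.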
`--validate_only` performs a lightweight compilation to check model compatibility with Cerebras kernels, while `cszoo validate` verifies that a model’s configuration meets expected requirements.

Validate the Job (optional)
Run the job with the `--validate_only` flag. This performs a quick compatibility check without executing a full run.

Compile the Model
Compile the model with the `--compile_only` flag. This step typically takes 15-60 minutes.

Execute the Job
Set the number of CS-X systems for the run with the `--num_csx` flag.
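
The validate, compile, and execute steps differ only in the flags they pass, which the following sketch makes explicit by building one argument list per stage. The `python run.py` entry point and the file names are placeholders assumed for illustration; substitute whatever launcher and paths your installation uses. Only flags from the tables above appear.

```python
from typing import List

# Hypothetical entry point; replace with your actual launcher command.
ENTRY_POINT = ["python", "run.py"]


def build_command(stage: str, params: str, compile_dir: str,
                  num_csx: int = 1) -> List[str]:
    """Return the argument list for one stage of the workflow."""
    cmd = ENTRY_POINT + ["CSX", "--params", params]
    if stage == "validate":
        # Lightweight compatibility check; nothing is mapped or executed.
        cmd += ["--validate_only"]
    elif stage == "compile":
        # Full compile; artifacts are written to compile_dir.
        cmd += ["--compile_only", "--compile_dir", compile_dir]
    elif stage == "execute":
        # Reuse the same compile_dir so the pre-compiled model is picked up.
        cmd += ["--compile_dir", compile_dir, "--num_csx", str(num_csx)]
    else:
        raise ValueError(f"unknown stage: {stage!r}")
    return cmd


for stage in ("validate", "compile", "execute"):
    print(" ".join(build_command(stage, "params.yaml", "my_compile_dir",
                                 num_csx=2)))
```

Keeping `--compile_dir` identical between the compile and execute stages follows the table's note about training with a pre-compiled model.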
`LoopTimeRemaining` displays the estimated time remaining in your current operation loop, where a loop is a single training iteration, a single validation dataloader execution, or an eval harness run.

`TimeRemaining` shows the estimated total time remaining for your entire run, whether it’s a complete training session (`fit`) or a validation run (`validate` or `validate_all`).
Until every stage has been observed at least once, `+ ??` is appended to the `TimeRemaining` metric. This initial estimate is optimistic because it doesn’t account for stages that haven’t been measured yet. Once all stages have been observed, the `+ ??` indicator disappears and the estimates become more accurate.
These metrics are displayed consistently across CSX, CPU, and GPU hardware.
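
To make the `+ ??` behavior concrete, here is a minimal sketch of how such an estimate could be formatted. It is not the Cerebras implementation: the function name, the per-stage dictionary, and the `H:MM:SS` layout are all assumptions chosen for illustration. Stages that have not been measured yet are represented as `None` and contribute nothing to the total, which is why the early estimate is optimistic.

```python
from typing import Dict, Optional


def format_time_remaining(stage_estimates: Dict[str, Optional[float]]) -> str:
    """Format a total estimate from per-stage seconds remaining.

    A value of None means the stage has not been observed yet.
    """
    # Only measured stages contribute to the total.
    known_seconds = int(sum(s for s in stage_estimates.values()
                            if s is not None))
    hours, rem = divmod(known_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    text = f"{hours}:{minutes:02d}:{seconds:02d}"

    # Flag the estimate as incomplete while any stage is unmeasured.
    if any(s is None for s in stage_estimates.values()):
        text += " + ??"
    return text


print(format_time_remaining({"train": 3600.0, "validate": None}))
print(format_time_remaining({"train": 3600.0, "validate": 125.0}))
```

The first call covers a run where validation has not happened yet, so the marker is appended; the second has measurements for every stage and prints a plain total.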
The `<model_dir>` directory contains all run results and artifacts, including:

- `<model_dir>/train` or `<model_dir>/eval`
- `<model_dir>/cerebras_logs/latest/run.log` or `<model_dir>/cerebras_logs/<train|eval>/<timestamp>/run.log`
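
As a small convenience, the log locations listed above can be resolved programmatically. The helper names below are hypothetical, and the sketch assumes only the directory layout stated in this section.

```python
from pathlib import Path
from typing import List, Optional


def latest_run_log(model_dir: str) -> Optional[Path]:
    """Return <model_dir>/cerebras_logs/latest/run.log if it exists."""
    log = Path(model_dir) / "cerebras_logs" / "latest" / "run.log"
    return log if log.is_file() else None


def timestamped_run_logs(model_dir: str) -> List[Path]:
    """Return run logs under cerebras_logs/<train|eval>/<timestamp>/."""
    root = Path(model_dir) / "cerebras_logs"
    logs: List[Path] = []
    for mode in ("train", "eval"):
        # One run.log per timestamped subdirectory.
        logs.extend(sorted((root / mode).glob("*/run.log")))
    return logs


log = latest_run_log("model_dir")
print(log if log is not None else "no run.log found yet")
```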