Running jobs on the Cerebras Wafer-Scale cluster is straightforward and similar to running them on a single device. Here’s a comprehensive guide to get you started.
First, activate the `(venv_cerebras_pt)` virtual environment.
Next, prepare your data by running the `prepare_data.py` script, which will download sample data.
In the `configs/` directory, you will find YAML files corresponding to different model sizes. Locate the YAML file for your desired model size and modify the data path accordingly to ensure the training or evaluation process can find your data. You will then launch your job with the `run.py` script. This script is designed to handle the compilation, training, and evaluation of your models on the Cerebras Wafer-Scale cluster.
To launch your job, you’ll need to specify the following flags:
| Flag | Mandatory | Description |
|---|---|---|
| `CSX` | Yes | Specifies that the target device for execution is a Cerebras cluster. |
| `--params <...>` | Yes | Path to a YAML file containing model/run configuration options. |
| `--mode <train,eval,train_and_eval,eval_all>` | Yes | Whether to run `train`, `eval`, `train_and_eval`, or `eval_all`. |
| `--python_paths <...>` | Yes | List of paths to be exported to `PYTHONPATH` when starting the workers on the appliance container. It should include the parent paths to the Cerebras Model Zoo and any Python packages needed by input workers. (Default: pulled from the path defined by the env variable `CEREBRAS_WAFER_SCALE_CLUSTER_DEFAULTS`.) For more information, see Cerebras cluster settings. |
| `--compile_only` | No | Compile the model, including matching to Cerebras kernels and mapping to hardware, without executing on the system. Upon success, compile artifacts are stored inside the Cerebras cluster, under the directory specified in `--compile_dir`. To start training from a pre-compiled model, use the same `--compile_dir` used in the compile-only run. Mutually exclusive with `--validate_only`. (Default: `None`) |
| `--validate_only` | No | Validate that the model can be matched to Cerebras kernels. This is a lightweight compilation; it neither maps to the hardware nor executes on the system. Mutually exclusive with `--compile_only`. (Default: `None`) |
| `--model_dir <...>` | No | Path to store model checkpoints, TensorBoard event files, etc. (Default: `$CWD/model_dir`) |
| `--compile_dir <...>` | No | Path to store the compile artifacts inside the Cerebras cluster. (Default: `None`) |
| `--num_csx <1,2,4,8,16>` | No | Number of CS-X systems to use in training. (Default: `1`) |
To quickly check whether your model is compatible with the Cerebras platform, use the `--validate_only` flag. This flag enables you to check compatibility without the need to execute a full model run. It's especially useful when you're developing or adjusting your models and want to ensure they will work with the platform.
For instance, you might run a command like this:
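A sketch of such a command, assuming a params file at `configs/params.yaml` (hypothetical path; substitute your own):

```shell
# Lightweight validation pass: checks kernel compatibility only, no execution.
python run.py CSX \
  --params configs/params.yaml \
  --mode train \
  --validate_only
```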
Artifacts generated with the `--validate_only` and `--compile_only` flags can be reused in later runs. To achieve this, ensure that you use the same `--compile_dir` path during both the compilation and execution phases. Note that training and evaluation produce different compile artifacts: use `--mode train --compile_only` when compiling for training and `--mode eval --compile_only` when compiling for evaluation, and use a different `--compile_dir` based on whether you're training or evaluating.
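The reuse flow above can be sketched as two invocations sharing a compile directory (both paths hypothetical):

```shell
# Step 1: compile only; artifacts are stored under --compile_dir inside the cluster.
python run.py CSX --params configs/params.yaml --mode train \
  --compile_only --compile_dir /my_model_compile

# Step 2: train, reusing the pre-compiled artifacts via the same --compile_dir.
python run.py CSX --params configs/params.yaml --mode train \
  --compile_dir /my_model_compile --model_dir model_dir
```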
When launching your job, pay particular attention to two flags: `--python_paths` and `--mount_dirs`.

- `--python_paths`: Specify the Python paths needed to execute your job correctly. This should include all necessary scripts and packages.
- `--mount_dirs`: Indicate which directories should be mounted to access required files, datasets, or model weights.
You can provide the `python_paths` and `mount_dirs` arguments in either:

- the `run.py` script: pass them as command-line arguments while executing the script; or
- `params.yaml`: define these parameters within the YAML configuration file.

When running a model from the Cerebras Model Zoo, ensure that the paths specified include the parent directory where the Model Zoo is located. For instance, if your directory structure is `/path/to/parent/modelzoo`, the arguments should be `/path/to/parent/modelzoo/src`.
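With the directory layout above, a launch command might look like the following sketch (the data path is hypothetical; adjust both flags to your setup):

```shell
# Hypothetical invocation showing both path flags together.
python run.py CSX --params configs/params.yaml --mode train \
  --python_paths /path/to/parent/modelzoo/src \
  --mount_dirs /path/to/parent /path/to/data
```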
The `--mode` flag accepts the following values:

- `train`: For training the model.
- `eval`: For evaluating the model on a specific dataset.
- `eval_all`: For evaluating across multiple datasets.
- `train_and_eval`: For both training and evaluating.
To distribute your job across multiple systems, use the `--num_csx` flag to specify the number of CS-X systems. The global batch size divided by the number of CS-X systems will be the effective batch size per device.
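For instance, the per-device batch size works out as follows (both values are hypothetical):

```shell
# Hypothetical global batch size from the params YAML, spread over 4 CS-X systems.
GLOBAL_BATCH_SIZE=1024
NUM_CSX=4                # value passed via --num_csx
echo $((GLOBAL_BATCH_SIZE / NUM_CSX))   # prints 256: effective batch size per CS-X
```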
The model directory, specified with the `--model_dir` flag, contains all results and artifacts from the latest run. These include:
- Checkpoints

  Checkpoints are saved in the `<model_dir>` directory.
- TensorBoard event files

  TensorBoard event files are stored in the `<model_dir>/train` or `<model_dir>/eval` directory, depending on the execution mode, and can be visualized using TensorBoard.
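TensorBoard can be pointed at these directories in the usual way; for a training run, for example (the `model_dir` path is whatever you passed to `--model_dir`):

```shell
# Visualize event files from a training run.
tensorboard --logdir model_dir/train
```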
- Run logs

  Stdout from the run is located under `<model_dir>/cerebras_logs/latest/run.log`. If there are multiple runs, look under the corresponding `<model_dir>/cerebras_logs/<train|eval>/<timestamp>/run.log`.
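To follow a run as it executes, you can stream the latest log with standard tools, per the layout above (the `model_dir` path is whatever you passed to `--model_dir`):

```shell
# Stream stdout from the most recent run.
tail -f model_dir/cerebras_logs/latest/run.log
```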