Prerequisite
Make sure you have set up your installation.
Activate Cerebras virtual environment
Before starting any jobs on the Cerebras Wafer-Scale Cluster, activate the (venv_cerebras_pt) virtual environment on the user node.
Navigate to your model implementation
For demonstration purposes, we will use models from the Cerebras Model Zoo. Once you have cloned the modelzoo, follow these steps to navigate to your model implementation:
1. Choose the Model: Identify the model you want to work with in the repository.
2. Navigate to the Model Directory: In this example, we’re using the GPT-2 model. Navigate to its directory.
Important
When running a model from the Cerebras Model Zoo, ensure that the paths specified include the parent directory where the Model Zoo is located. For instance, if your directory structure is /path/to/parent/modelzoo, the arguments should be /path/to/parent/modelzoo/src.
Prepare your datasets
Each model in the Cerebras Model Zoo comes with scripts to help you prepare your datasets. For general guidance, check the Data processing and dataloaders section. Additionally, you can find dataset examples in the README file of each model.
- For instance, the FC-MNIST model includes a prepare_data.py script that will download sample data.
- For language models, leverage the data processing in the Cerebras Model Zoo. The Training and Fine-Tuning a Large Language Model (LLM) tutorial provides an example.
Note
After preparing your data, update the data path in the configuration file to the absolute path where your data is stored. Inside the configs/ directory, you will find YAML files corresponding to different model sizes. Locate the YAML file for your desired model size and modify the data path so that the training or evaluation process can find your data.
Launch your job
Each model in the Cerebras Model Zoo contains a script called run.py. This script is designed to handle the compilation, training, and evaluation of your models on the Cerebras Wafer-Scale cluster.
To launch your job, you’ll need to specify the following flags:
Flag | Mandatory | Description |
---|---|---|
CSX | Yes | Specifies that the target device for execution is a Cerebras Cluster. |
--params <...> | Yes | Path to a YAML file containing model/run configuration options. |
--mode <train,eval,train_and_eval,eval_all> | Yes | Whether to run train, eval, train_and_eval, or eval_all. |
--python_paths <...> | Yes | List of paths to be exported to PYTHONPATH when starting the workers on the Appliance container. It should include parent paths to Cerebras Model Zoo and Python packages needed by input workers. (Default: Pulled from the path defined by the environment variable CEREBRAS_WAFER_SCALE_CLUSTER_DEFAULTS) For more information, see Cerebras cluster settings. |
--compile_only | No | Compile the model, including matching to Cerebras kernels and mapping to hardware, without executing on the system. Upon success, compile artifacts are stored inside the Cerebras cluster, under the directory specified in --compile_dir. To start training from a pre-compiled model, use the same --compile_dir used in the compile-only run. Mutually exclusive with --validate_only. (Default: None) |
--validate_only | No | Validate that the model can be matched to Cerebras kernels. This is a lightweight compilation: it neither maps to the hardware nor executes on the system. Mutually exclusive with --compile_only. (Default: None) |
--model_dir <...> | No | Path to store model checkpoints, TensorBoard events files, etc. (Default: $CWD/model_dir) |
--compile_dir <...> | No | Path to store the compile artifacts inside the Cerebras cluster. (Default: None) |
--num_csx <1,2,4,8,16> | No | Number of CS-X systems to use in training. (Default: 1) |
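The constraints in the table (allowed modes, allowed CS-X counts, and the mutual exclusion of --compile_only and --validate_only) can be checked before a job is submitted. The following is a minimal sketch, not part of the Model Zoo: build_run_command is a hypothetical helper, invoking the script as `python run.py` is an assumption, and passing several --python_paths entries as separate values is also an assumption. The params path in the usage lines is a placeholder.

```python
from typing import List, Optional

# Values taken from the flag table above.
VALID_MODES = {"train", "eval", "train_and_eval", "eval_all"}
VALID_NUM_CSX = {1, 2, 4, 8, 16}


def build_run_command(
    params: str,
    mode: str,
    python_paths: List[str],
    model_dir: Optional[str] = None,
    compile_dir: Optional[str] = None,
    num_csx: int = 1,
    compile_only: bool = False,
    validate_only: bool = False,
) -> List[str]:
    """Assemble an argument list for run.py (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    if num_csx not in VALID_NUM_CSX:
        raise ValueError(f"unsupported num_csx: {num_csx}")
    if compile_only and validate_only:
        raise ValueError("--compile_only and --validate_only are mutually exclusive")

    # "CSX" is the positional argument that targets the Cerebras cluster.
    # Assumption: the script is started with the `python` interpreter.
    cmd = ["python", "run.py", "CSX", "--params", params, "--mode", mode]
    # Assumption: multiple paths are passed as separate values after the flag.
    cmd += ["--python_paths", *python_paths]
    if model_dir:
        cmd += ["--model_dir", model_dir]
    if compile_dir:
        cmd += ["--compile_dir", compile_dir]
    if num_csx != 1:
        cmd += ["--num_csx", str(num_csx)]
    if compile_only:
        cmd.append("--compile_only")
    if validate_only:
        cmd.append("--validate_only")
    return cmd


# Usage: a validate-only command with placeholder paths.
cmd = build_run_command(
    params="configs/params.yaml",  # hypothetical path
    mode="train",
    python_paths=["/path/to/parent/modelzoo/src"],
    validate_only=True,
)
print(" ".join(cmd))
```

Returning a list rather than a single string keeps each argument separate, so the result can be handed to subprocess.run without shell quoting concerns.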
Validate your job (optional)
If you want to verify that your model implementation is compatible with the Cerebras software platform, you can use the --validate_only flag. This flag enables you to quickly check compatibility without the need to execute a full model run. It’s especially useful when you’re developing or adjusting your models and want to ensure they will work with the platform.
For instance, you might run a command like this:
Compile your job
To generate the executable files for your model on the Cerebras cluster, you can use the --compile_only flag. This step takes more time compared to validation (typically 15 minutes to an hour) as it prepares the model’s computation graph for optimal execution. An example command might look like this:
Note
You can speed up your training or evaluation runs by reusing pre-compiled artifacts obtained through the --validate_only and --compile_only flags. To achieve this, ensure that you use the same --compile_dir path during both the compilation and execution phases. Use one of the following, together with --compile_dir, based on whether you’re training or evaluating:
- --mode train --compile_only
- --mode eval --compile_only
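One way to keep --compile_dir identical across the two phases is to derive the execution command from the compile command instead of typing it twice. The sketch below assumes the command is held as an argument list; execution_command is a hypothetical helper, and the params path, the compile directory name, and the `python run.py` invocation are placeholders rather than values from the Model Zoo.

```python
from typing import List


def execution_command(compile_cmd: List[str]) -> List[str]:
    """Return the run command that reuses a compile-only command's artifacts."""
    if "--compile_dir" not in compile_cmd:
        raise ValueError("set --compile_dir so the run can find the compiled artifacts")
    # Drop only the compile/validate switches. --mode and --compile_dir are
    # left untouched, so the execution phase points at the same artifacts.
    return [arg for arg in compile_cmd if arg not in ("--compile_only", "--validate_only")]


compile_cmd = [
    "python", "run.py", "CSX",
    "--params", "configs/params.yaml",  # hypothetical path
    "--mode", "train",
    "--compile_dir", "my_compile_dir",  # hypothetical directory name
    "--compile_only",
]
run_cmd = execution_command(compile_cmd)
print(" ".join(run_cmd))
```

Because the mode is carried over unchanged, a command compiled with --mode train yields a training run, and one compiled with --mode eval yields an evaluation run.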
Execute your job
To execute your job on the Cerebras Wafer-Scale cluster, follow these steps:
1. Specify the Target Device: Use “CSX” as the first positional argument to target the Cerebras cluster.
2. Set --python_paths and --mount_dirs:
- --python_paths: Specify the Python paths needed to execute your job correctly. This should include all necessary scripts and packages.
- --mount_dirs: Indicate which directories should be mounted to access required files, datasets, or model weights.
Note
You can specify the python_paths and mount_dirs arguments in either of two places:
- run.py script: Provide them as command-line arguments while executing the script.
- params.yaml: Define these parameters within the YAML configuration file.
When running a model from the Cerebras Model Zoo, ensure that the paths specified include the parent directory where the Model Zoo is located. For instance, if your directory structure is /path/to/parent/modelzoo, the arguments should be /path/to/parent/modelzoo/src.
3. Execution Mode: Choose one of the following modes based on your requirements:
- train: For training the model.
- eval: For evaluating the model on a specific dataset.
- eval_all: For evaluating across multiple datasets.
- train_and_eval: For both training and evaluating.
4. Configuration File Path: Provide the path to the relevant configuration file that contains the necessary settings.
Note
- Cerebras only supports using a single CS-X when running in eval mode.
- To scale to multiple CS-X systems, add the --num_csx flag specifying the number of CS-X systems. The global batch size divided by the number of CS-Xs will be the effective batch size per device.
- Once you have submitted your job to execute in the Cerebras Wafer-Scale cluster, you can track the progress or kill your job using the csctl tool. You can also monitor the performance using a Grafana dashboard.
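The batch-size rule in the note above is a plain division. As a small sketch, per_csx_batch_size is a hypothetical helper; the allowed system counts come from the --num_csx row of the flag table, while rejecting a global batch size that does not divide evenly is an assumption made here for illustration, not a documented requirement.

```python
def per_csx_batch_size(global_batch_size: int, num_csx: int = 1) -> int:
    """Effective batch size per CS-X: global batch size divided by num_csx."""
    if num_csx not in (1, 2, 4, 8, 16):
        raise ValueError(f"unsupported num_csx: {num_csx}")
    # Assumption: insist on an even split across systems.
    if global_batch_size % num_csx != 0:
        raise ValueError("global batch size is not divisible by num_csx")
    return global_batch_size // num_csx


print(per_csx_batch_size(1024, 4))  # prints 256
```

For example, a global batch size of 1024 on 4 CS-X systems gives each system 256 samples per step.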
Explore output files and artifacts
The model directory, specified by the --model_dir flag, contains all results and artifacts from the latest run. These include:
- Checkpoints
Checkpoints are saved in the <model_dir> directory.
- TensorBoard event files
TensorBoard event files are stored in the <model_dir> directory. Events files can be visualized using TensorBoard. Here’s an example of how to launch TensorBoard:
<model_dir>/train or <model_dir>/eval directory depending on the execution mode.
- Run logs
Stdout from the run is located under <model_dir>/cerebras_logs/latest/run.log. If there are multiple runs, look under the corresponding <model_dir>/cerebras_logs/<train|eval>/<timestamp>/run.log.
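The two log locations can be resolved with a small helper. This is a sketch under stated assumptions: run_log_path is a hypothetical function, the directory layout is taken from the paths above, and the timestamp is passed through as an opaque string because its format is not specified here.

```python
from pathlib import Path
from typing import Optional


def run_log_path(
    model_dir: str,
    mode: Optional[str] = None,
    timestamp: Optional[str] = None,
) -> Path:
    """Locate run.log for the latest run, or for a specific earlier run."""
    logs = Path(model_dir) / "cerebras_logs"
    if mode is None and timestamp is None:
        # Latest run: <model_dir>/cerebras_logs/latest/run.log
        return logs / "latest" / "run.log"
    if mode not in ("train", "eval") or not timestamp:
        raise ValueError("earlier runs need mode 'train' or 'eval' and a timestamp")
    # Earlier run: <model_dir>/cerebras_logs/<train|eval>/<timestamp>/run.log
    return logs / mode / timestamp / "run.log"


print(run_log_path("model_dir").as_posix())
```

Calling the helper with only the model directory returns the latest log; supplying both a mode and a timestamp selects the log of that particular run.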