1. Set the Global Batch Size (GBS)

In your YAML or Python file, set the num_csx and batch_size parameters:

trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 2
    ...
    callbacks:
      - ScopedTrainFlags:
          csx.performance.micro_batch_size: 2
  fit:
    train_dataloader:
      batch_size: 12
      ...

Make sure batch_size is greater than or equal to num_csx. In this example, the global batch size of 12 is split across the two CS-X systems into a per-box batch size of 6, and each CS-X processes it in micro-batches of size 2.
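
As a quick sanity check on these numbers, here is the same configuration annotated with the implied arithmetic (a sketch; the comments are ours, not compiler output):

trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 2                              # two CS-X systems
    callbacks:
      - ScopedTrainFlags:
          csx.performance.micro_batch_size: 2   # micro-batch size per system
  fit:
    train_dataloader:
      batch_size: 12                            # global batch size
      # per-box batch = batch_size / num_csx = 12 / 2 = 6
      # micro-batches per step per system = 6 / 2 = 3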

2. Choose Your Training Mode

Decide which mode fits your goal:

micro_batch_size: auto
    Finds a reasonable MBS automatically. This compiles faster than explore but may select less optimal values. It is the default when micro_batch_size is not specified.

micro_batch_size: explore
    Searches exhaustively for the MBS that gives the best speed. This takes much longer to compile and works only in compile_only mode. Unlike auto, it evaluates all possible micro-batch sizes, whether or not they evenly divide batch_size / num_csx.

micro_batch_size: <positive_int>
    Recommended when you already know the optimal value (use auto or explore to determine it), as it substantially reduces compilation time. The compiler may slightly adjust the specified value to ensure an even workload distribution across CS-X systems, and will notify you if it does so.

micro_batch_size: none
    Disables microbatching and uses the global batch_size as the micro-batch size. The model may then be too large to fit into device memory, in which case compilation will fail; even if it fits, the chosen batch size may be suboptimal for performance.
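
For instance, once a good value is known, you can pin it explicitly. A minimal sketch, assuming a value of 121 (illustrative; taken from the GPT-3 6.7B row in the table at the end of this page):

trainer:
  init:
    ...
    callbacks:
      - GlobalFlags:
          csx.performance.micro_batch_size: 121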

To search exhaustively instead, set explore:

trainer:
  init:
    ...
    callbacks:
      - GlobalFlags:
          csx.performance.micro_batch_size: "explore"

If you are using explore and have a specific range of acceptable micro-batch sizes in mind, you can define a batch exploration range to limit the search space and get a set of recommended options more quickly. Specify the range by providing one or both of the bounds:

trainer:
  init:
    ...
    callbacks:
      - GlobalFlags:
          csx.performance.micro_batch_size:
            explore:
              min: $min
              max: $max
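
For instance, to restrict the search to micro-batch sizes between 64 and 256 (illustrative bounds; either one may be omitted):

trainer:
  init:
    ...
    callbacks:
      - GlobalFlags:
          csx.performance.micro_batch_size:
            explore:
              min: 64
              max: 256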

3. Launch a Job

Launch a compile_only run:

cszoo fit <params_model.yaml> --compile_only
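
Once compilation succeeds, the same entry point can be used to launch the full run (a sketch; this assumes your training run uses the same params file):

# Validate and compile the model first
cszoo fit <params_model.yaml> --compile_only

# Then launch the actual training run
cszoo fit <params_model.yaml>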

4. Set Optimal MBS

After your initial run (whether using auto or explore), you should:

  • Check what micro_batch_size the system selected (printed in logs).
  • Update your YAML to explicitly set that value for future runs.
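
For example, if the logs reported a selected micro-batch size of 6 (an illustrative value), the step 1 configuration would be updated as follows:

trainer:
  init:
    ...
    callbacks:
      - ScopedTrainFlags:
          csx.performance.micro_batch_size: 6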

The recommended micro-batch size is specific to the current model configuration and may need adjustment if any of the model's performance-affecting parameters change. For instance, switching the model to evaluation mode or modifying the hidden size can change the optimal value. In such cases, rerun explore or auto mode to ensure the micro-batch size is still optimal for the new configuration.

  • Model performance is a function of the micro-batch size used on a Cerebras system. For example, for a given model, a micro-batch size of 2 will perform equally well regardless of the values used for num_csx or the global batch_size, as long as batch_size / num_csx is a multiple of the micro-batch size (see the sketch after this list).
  • The microbatching feature auto-disables for models it does not support, even if micro_batch_size is set. This includes models that use batch normalization or other non-linear computation over the batch dimension.
  • Since the examples above scope the setting to training (for example, via ScopedTrainFlags), the micro-batch size is restored to its previous value after training completes.
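
To make the first bullet above concrete, the following configurations should all deliver comparable per-system performance, because each decomposes into micro-batches of size 2 (the numbers are illustrative):

# Config A: num_csx: 2, batch_size: 12  ->  12 / 2 = 6 per system = 3 micro-batches of 2
# Config B: num_csx: 4, batch_size: 24  ->  24 / 4 = 6 per system = 3 micro-batches of 2
# Config C: num_csx: 2, batch_size: 16  ->  16 / 2 = 8 per system = 4 micro-batches of 2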

Effective Microbatching Examples

Below is a suggested list of micro-batch sizes that have demonstrated good performance, primarily with GPT-3 models. These sizes can also serve as useful estimates for other similar-sized GPT-style models, such as BLOOM and LLaMA.

Model Family    Model Size (Params)    Micro Batch Size (MBS)
GPT-3           1.3B                   253
GPT-3           2.7B                   198
GPT-3           6.7B                   121
GPT-3           13B                    99
GPT-3           20B                    77
GPT-3           30B                    69
GPT-3           39B                    55
GPT-3           65B                    55
GPT-3           82B                    48
GPT-3           175B                   35
T5              3B                     256
T5              11B                    520