When this feature is used, the gradients from each microbatch are accumulated before the weight update, so the model still effectively trains on the full global batch size. For runs that compute no gradients (i.e., non-training runs such as evaluation), microbatching can still be used to improve performance.

You can configure microbatching through the Trainer YAML file or Python code. Both approaches use the GlobalFlags or ScopedTrainFlags callback. Learn more about these callbacks in Performance Flags.
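As a rough illustration, a Python-side configuration might look like the sketch below. The import path, the double-underscore keyword spelling, and the flag key csx.performance.micro_batch_size are assumptions made for this example; consult Performance Flags for the exact names in your release.

```python
# Hypothetical sketch: enabling microbatching through the GlobalFlags callback.
# The import path, keyword spelling, and flag key are assumptions -- check the
# Performance Flags page for the canonical names in your release.
from cerebras.modelzoo.trainer.callbacks import GlobalFlags

callbacks = [
    # Intended to mirror setting csx.performance.micro_batch_size under
    # trainer.init.callbacks -> GlobalFlags in the Trainer YAML file.
    GlobalFlags(csx__performance__micro_batch_size=2),
]
# Pass `callbacks` to the Trainer at construction time. ScopedTrainFlags can be
# used the same way when the flag should apply only within a training scope.
```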

We have two guides depending on your familiarity with microbatching. We recommend reading the rest of this guide before moving on to the beginner or advanced guides:

  • Beginner Guide: Covers how to set Global Batch Size (GBS) and how to use training modes.
  • Advanced Guide: Covers how the platform picks or overrides Micro Batch Size (MBS) and how to optimize it manually.

Before you begin, read Trainer Essentials, which provides a basic overview of how to configure and use the Trainer.

How It Works

Microbatching divides large training batches into smaller portions, allowing models to process batch sizes that exceed available device memory. The Cerebras software stack facilitates automatic microbatching for transformer models without requiring any modifications to the model code. Additionally, the software can automatically determine optimal microbatch sizes.

As illustrated in the figure below, when a batch exceeds memory limits, it is segmented into manageable microbatches that are processed sequentially. The system accumulates gradients across these microbatches before updating the network weights, effectively simulating training with the full batch size. Statistics such as the loss can be combined across microbatches in a similar way.
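The loop below is a minimal, plain-PyTorch sketch of this idea, included only to make the accumulation explicit; it is not the Cerebras API, and on CS-X systems the splitting and accumulation happen automatically with no model changes.

```python
# Conceptual sketch (plain PyTorch, not the Cerebras API): split a batch into
# microbatches, accumulate gradients across them, then apply a single weight
# update that is equivalent to one step on the full batch.
def train_step(model, optimizer, loss_fn, inputs, targets, micro_batch_size):
    optimizer.zero_grad()
    batch_size = inputs.shape[0]
    total_loss = 0.0
    for start in range(0, batch_size, micro_batch_size):
        end = min(start + micro_batch_size, batch_size)
        mb_loss = loss_fn(model(inputs[start:end]), targets[start:end])
        # Scale each microbatch loss so the accumulated gradient matches the
        # mean loss over the full batch.
        scale = (end - start) / batch_size
        (mb_loss * scale).backward()
        total_loss += mb_loss.item() * scale
    optimizer.step()  # one update for the whole global batch
    return total_loss
```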

The Cerebras implementation intelligently distributes workloads even when:

  • The batch_size isn’t divisible by num_csx
  • The per-system batch size isn’t divisible by the micro_batch_size

This means there is no need to change the global batch size when scaling the number of Cerebras CS-X systems up or down. This behavior is controlled via the micro_batch_size parameter in the YAML config file.
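As a rough illustration of the arithmetic, the hypothetical helper below shows one plausible way a global batch could be divided when neither division is exact; the actual scheduling policy is internal to the Cerebras software stack and may differ.

```python
# Illustrative only: one plausible split of a global batch across systems and
# microbatches when nothing divides evenly. The real scheduling is handled
# internally by the Cerebras software stack and may differ in detail.
def split_global_batch(batch_size, num_csx, micro_batch_size):
    base, rem = divmod(batch_size, num_csx)
    per_system = [base + (1 if i < rem else 0) for i in range(num_csx)]
    plans = []
    for samples in per_system:
        full, tail = divmod(samples, micro_batch_size)
        plans.append([micro_batch_size] * full + ([tail] if tail else []))
    return plans


# batch_size=1000 on num_csx=3 with micro_batch_size=150: per-system batches of
# 334/333/333, each processed as microbatches of 150 plus a smaller remainder.
print(split_global_batch(1000, 3, 150))
# [[150, 150, 34], [150, 150, 33], [150, 150, 33]]
```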

Key Parameters

The key parameters are batch_size (the global batch size), micro_batch_size, and num_csx (the number of CS-X systems). Some of these parameters are derived by the system; for example, the software can determine micro_batch_size automatically when you do not set it explicitly.

Limitations

  • Microbatching has been tested primarily with transformer models. The technique is not compatible with models that incorporate batch normalization or layers that perform non-linear computations across the batch dimension.

  • The functionality of Automatic Batch Exploration is confined to transformer models. Attempting to apply it to vision networks, such as CNNs, will result in a runtime error.

  • To avoid long compile times, it is advisable to assign a known, workable value to the micro_batch_size parameter directly instead of leaving it undefined.

  • Enabling Automatic Batch Exploration by setting micro_batch_size to “explore” initiates an exhaustive search that can extend over several hours, although for most GPT models the compile time is typically around one hour (see the sketch after this list).
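For concreteness, the two settings discussed above might look like the following; as in the earlier sketch, the flag key and keyword spelling are assumptions, so check Performance Flags for the exact names.

```python
# Hypothetical sketch contrasting an explicit micro batch size with Automatic
# Batch Exploration; the flag key and keyword spelling are assumptions -- see
# the Performance Flags page for the exact names.
from cerebras.modelzoo.trainer.callbacks import GlobalFlags

# Preferred once a workable value is known: compiles without a search.
explicit_mbs = GlobalFlags(csx__performance__micro_batch_size=2)

# Automatic Batch Exploration: exhaustive search that can run for hours.
explore_mbs = GlobalFlags(csx__performance__micro_batch_size="explore")
```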