This guide describes how to use the Trainer with automatic microbatching. With microbatching, each per-system batch is split into smaller micro-batches that are processed sequentially; the gradients from each microbatch are then accumulated before the weight update, so the model still operates on the total global batch size. Even if there are no gradients (i.e., non-training runs), microbatching is a valid technique that you can still use to improve performance.

By following this guide, you will be comfortable configuring a specific microbatch size and Cerebras' automatic microbatching feature using the Trainer for any model.
On a Cerebras cluster, a global batch of samples can be split along the batch dimension, either across num_csx systems or into micro-batches. The per-system batch size is computed as ⌈batch_size / num_csx⌉ and represents the size of the batch used on each Cerebras system. Note that the batch_size parameter cannot be smaller than the num_csx parameter.

Fig. 10 Tiling and accumulation of gradients along the batch dimension.

The software stack automatically handles the case where the global batch_size parameter is not divisible by num_csx. It also ensures even distribution across microbatching steps, even when the per-system batch size is not divisible by micro_batch_size. Consequently, there is no need to change the global batch size when scaling the number of Cerebras CS-X systems up or down. For example, with batch_size: 64 and num_csx: 2, each system works on a per-system batch of 32 samples; with micro_batch_size: 8, that per-system batch is processed as four micro-batches of 8 samples each.
This behaviour is controlled via the micro_batch_size parameter in the YAML config file, as described below.
Performance flags for microbatching are set via the GlobalFlags callback. In this example, let's use a cluster with two CSX systems to configure a training run with a specific microbatch size of 2:
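Below is a minimal sketch of such a configuration. The csx.performance.micro_batch_size flag and the GlobalFlags callback come from this guide; the surrounding trainer/init/backend/cluster_config key layout is an assumption of this sketch and may differ slightly in your Model Zoo version.

```yaml
trainer:
  init:
    backend:
      backend_type: CSX       # assumed backend key layout
      cluster_config:
        num_csx: 2            # two CS-X systems
    callbacks:
      - GlobalFlags:
          # performance flag controlling the micro-batch size
          csx.performance.micro_batch_size: 2
```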
Alternatively, you can let the compiler pick a reasonable value via the auto setting, or find a near-optimal performant value via the explore setting. Both automatic options, while convenient, will result in a longer compile time; because of this, we recommend using auto or explore to find a good setting and then setting it manually for subsequent experiments.
The supported settings for micro_batch_size are described in detail below:

- "auto": The compiler automatically selects a micro-batch size (a value that evenly divides batch_size / num_csx) if the original batch size per CS-X system does not fit into device memory or the compiler estimates that a lower micro-batch will achieve significantly better samples/second performance. This setting may incur a high compile time due to the search for a satisfactory micro-batch size. Compared to the "explore" setting, "auto" incurs less compile-time penalty but can pick sub-optimal micro-batch values. This is the default value if micro_batch_size is not specified.
- "explore": Runs an exhaustive search for a near-optimal micro-batch size and is only supported in compile_only mode. Note that this is different from the "auto" setting of micro_batch_size, which tries to find a reasonable micro-batch size without too large an increase in compile time. Also, unlike the "auto" setting, "explore" considers all micro-batch sizes and is not restricted by batch_size / num_csx divisibility constraints. You can generally expect higher-quality selection of micro-batch size values with "explore" at the expense of a longer compilation run. See Using "explore" to Search for a Near-Optimal Microbatch Size for more information.
- <positive_int>: Uses the specified micro-batch size. The per-system batch size (⌈batch_size / num_csx⌉) may be adjusted slightly to ensure an approximately even distribution of the global batch across CS-X systems and microbatching steps. A user message will be provided if this adjustment occurs.
- "none": Disables microbatching and uses the batch_size parameter as the micro-batch size. This may result in the model with the given batch size being too large to fit into device memory, in which case compilation will fail. If it does fit, however, the chosen batch size may be suboptimal for performance.
A micro-batch size that is known to perform well can be reused even if you change num_csx or the global batch_size (as long as batch_size / num_csx is a multiple of the micro-batch size).

Some models are not supported when micro_batch_size is set. This includes models using batch normalization, or other kinds of non-linear computation over the batch dimension.

These settings are applied via the GlobalFlags callback as such:
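For instance, a minimal sketch (using the same assumed trainer/init/callbacks layout as the earlier example) that lets the compiler choose automatically:

```yaml
trainer:
  init:
    callbacks:
      - GlobalFlags:
          # one of "auto" (default), "explore", a positive integer, or "none"
          csx.performance.micro_batch_size: "auto"
```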
The microbatching flow requires that the per-system batch size, batch_size / num_csx, be evenly divisible by the micro-batch size. Therefore, if you are setting the micro-batch size explicitly via micro_batch_size: <positive_int>, you must set the global batch size parameter batch_size to be a multiple of micro_batch_size * num_csx.
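As a worked example under the same assumed key layout (with the global batch_size set on the train dataloader), micro_batch_size: 2 and num_csx: 2 require batch_size to be a multiple of 2 * 2 = 4:

```yaml
trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 2                  # 2 CS-X systems
    callbacks:
      - GlobalFlags:
          csx.performance.micro_batch_size: 2
  fit:
    train_dataloader:
      # other dataloader settings omitted
      batch_size: 8                 # 8 / 2 = 4 per system, evenly divisible by 2
```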
If you set micro_batch_size: auto, be aware that the compiler's choice of micro-batch size will be restricted to values that evenly divide batch_size / num_csx, and may be less performant than explicitly setting a micro-batch size based on the explore flow.
Choosing a good micro_batch_size can significantly improve model performance. However, finding a good micro_batch_size can be a cumbersome process. Cerebras' Automatic Batch Exploration (CABE) tool provides a convenient way to select the best-performing micro_batch_size. To use it:

1. Specify the num_csx and batch_size parameters. These parameters are needed to guide the compiler stack as an initial data point, but their values do not impact the micro_batch_size recommended by the flow. This batch size can be the same as the default batch size defined in Model Zoo for the model.
2. Set the csx.performance.micro_batch_size performance flag to "explore" via the GlobalFlags, ScopedTrainFlags, or ScopedValidateFlags callback, depending on the scoping you desire. The example below works with global scoping:
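A minimal sketch of this global-scope setting (same assumed layout as earlier):

```yaml
trainer:
  init:
    callbacks:
      - GlobalFlags:
          # exhaustively search for a near-optimal micro-batch size
          csx.performance.micro_batch_size: "explore"
```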
3. Launch a compile_only run to start exploration (set compile_only under the backend configuration of the Trainer).
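For example, assuming compile_only is passed as a backend argument as described above (the backend_type key is an assumption of this sketch):

```yaml
trainer:
  init:
    backend:
      backend_type: CSX
      compile_only: True    # exploration only needs compilation, not a full run
```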
The estimated performance boost reported for each recommended micro_batch_size is compared relative to the base performance set by the first recommendation. For example, an output line recommending micro_batch_size: 2 with a boost of 1.20 is estimating that this option is likely to provide 1.20x the performance of micro_batch_size: 1, which is used as the baseline for that run of the tool.
After selecting a micro_batch_size and setting batch_size according to Jointly Setting "batch_size" and "micro_batch_size", you may launch either a compile-only run or a full training run as needed.
The batch size recommended by CABE is specific to the current model configuration and may require adjustments if there are any changes to the model’s performance-affecting parameters. For instance, altering the model’s operation to evaluation mode or modifying the hidden size could impact performance. In such scenarios, it’s advisable to rerun CABE to ensure the batch size is optimized for the new configuration.
The table below lists micro-batch sizes that are known to perform well for several common model configurations:

| Model Family | Model Size (Params) | Micro Batch Size (MBS) |
|---|---|---|
| GPT-3 | 1.3B | 253 |
| GPT-3 | 2.7B | 198 |
| GPT-3 | 6.7B | 121 |
| GPT-3 | 13B | 99 |
| GPT-3 | 20B | 77 |
| GPT-3 | 30B | 69 |
| GPT-3 | 39B | 55 |
| GPT-3 | 65B | 55 |
| GPT-3 | 82B | 48 |
| GPT-3 | 175B | 35 |
| T5 | 3B | 256 |
| T5 | 11B | 520 |
If your model matches one of these configurations, we recommend specifying the listed value explicitly via the micro_batch_size parameter instead of leaving it undefined.
Setting micro_batch_size to "explore" initiates an exhaustive search, potentially extending over several hours. However, the typical compile time for most GPT models is expected to be around one hour.