Advanced: Understanding and Manually Controlling MBS
Learn how Micro Batch Size (MBS) works under the hood, how the platform picks or overrides it, and how to optimize it manually.
If you set micro_batch_size explicitly, one of two things happens:
- The value is valid (i.e., it equals the per-system batch size divided by a whole number of micro-batches, rounded up): it is used as is.
- The value is not valid (not supported): the compiler automatically overrides it to the nearest valid value and issues a warning.
The compiler ensures:
- Approximately even distribution of the work across CS-X systems.
- Automatic adjustments if micro_batch_size isn’t optimal or feasible.
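The validity rule can be sketched in a few lines of Python. This is a minimal illustration, not the compiler's implementation: the helper names (ceil_div, supported_mbs_values, is_valid_mbs) are hypothetical, and the sketch assumes the per-system batch size is Ceil(batch_size / num_csx) and that a supported MBS is Ceil(per-system batch size / n) for some whole number of micro-batches n.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (avoids floating-point rounding)."""
    return -(-a // b)


def supported_mbs_values(batch_size: int, num_csx: int) -> list[int]:
    """Hypothetical helper: all values Ceil(per_system / n), largest first."""
    per_system = ceil_div(batch_size, num_csx)
    # n ranges over every possible number of micro-batches: 1..per_system.
    values = {ceil_div(per_system, n) for n in range(1, per_system + 1)}
    return sorted(values, reverse=True)


def is_valid_mbs(micro_batch_size: int, batch_size: int, num_csx: int) -> bool:
    """True if the requested MBS appears in the supported set."""
    return micro_batch_size in supported_mbs_values(batch_size, num_csx)


# Small illustrative input: a batch of 10 on a single system.
print(supported_mbs_values(10, 1))  # [10, 5, 4, 3, 2, 1]
print(is_valid_mbs(4, 10, 1))       # True
print(is_valid_mbs(6, 10, 1))       # False
```

Enumerating the full set is fine for illustration; for large batch sizes you would likely prefer a direct check rather than building the whole list.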
Numeric Examples
The following examples demonstrate how the system determines whether a micro_batch_size is valid and what happens if it isn’t. An invalid value is either automatically overwritten or results in an error.
Use Case 1 - MBS is Valid
If you provide:
- batch_size = 133
- num_csx = 1
- micro_batch_size = 34
The system implicitly derives the following:
- Per-system batch size = Ceil(133/1) = 133
- Valid NumMicroBatches = {1, 2, 3, …, 133}
- Supported MBS values = {133, 133/2, 133/3, 133/4, …, 133/133}
- After dividing & taking Ceil = {133, 67, 45, 34, 27, 23, 19, 17, 15, 14, 13, 12, …, 1}, 34 is found in this list, so it is used as is.
NumMicroBatches = Ceil(133/34) = Ceil(3.912) = 4
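The same conclusion can be reached without listing every supported value. The sketch below, which uses a hypothetical round_trip helper, derives NumMicroBatches from the requested MBS and then recomputes the MBS that this number of micro-batches implies. If the two values match, the requested MBS is in the supported set. This follows from the definitions above; it is not a description of how the compiler performs the check.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (avoids floating-point rounding)."""
    return -(-a // b)


def round_trip(batch_size: int, num_csx: int, micro_batch_size: int) -> tuple[int, int]:
    """Hypothetical helper: return (NumMicroBatches, MBS implied by that count)."""
    per_system = ceil_div(batch_size, num_csx)
    num_micro_batches = ceil_div(per_system, micro_batch_size)
    implied_mbs = ceil_div(per_system, num_micro_batches)
    return num_micro_batches, implied_mbs


num_micro_batches, implied_mbs = round_trip(133, 1, 34)
print(num_micro_batches)   # 4
print(implied_mbs == 34)   # True: 34 round-trips, so it is a supported value
```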
Use Case 2 - MBS is Overwritten
If you provide:
- batch_size = 673
- num_csx = 2
- micro_batch_size = 168
The system implicitly derives the following:
- Per-system batch size = Ceil(673/2) = 337
- Valid NumMicroBatches = {1, 2, 3, …, 337}
- Supported MBS values = {337, 337/2, 337/3, 337/4, …, 337/337}
- After dividing & taking Ceil = {337, 169, 113, 85, 68, 57, 49, 43, 38, 34, 31, 29, 26, 25, 23, 22, 20, 19, 18, 17, …, 1}, 168 is not found in this list.
Because the MBS here is invalid, the compiler overwrites it with the nearest supported value, which is 169, and shows the following warning message:
INFO: The micro batch size is changed to 169 to allow approximately even distribution across boxes and gradient accumulation iterations
NumMicroBatches = Ceil(337/169) = Ceil(1.994) = 2
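The override in this use case can be reproduced with a short sketch. The helper name nearest_supported_mbs is hypothetical, and the tie-breaking rule (prefer the larger value when two supported values are equally close) is an assumption, since the documented behavior only says "nearest".

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (avoids floating-point rounding)."""
    return -(-a // b)


def nearest_supported_mbs(batch_size: int, num_csx: int, requested: int) -> int:
    """Hypothetical helper: pick the supported MBS closest to the request."""
    per_system = ceil_div(batch_size, num_csx)
    supported = {ceil_div(per_system, n) for n in range(1, per_system + 1)}
    # Assumption: on a tie, prefer the larger candidate (hence the -v key).
    return min(supported, key=lambda v: (abs(v - requested), -v))


chosen = nearest_supported_mbs(673, 2, 168)
print(chosen)                 # 169
print(ceil_div(337, chosen))  # 2 micro-batches per system
```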
Use Case 3 - Invalid MBS Error
If you provide:
- batch_size = 240
- num_csx = 2
- micro_batch_size = 121
The system implicitly derives the following:
- Per-system batch size = Ceil(240/2) = 120
- Valid NumMicroBatches = {1, 2, 3, …, 120}
In this case, the given MBS (121) is already larger than the per-system batch size (120), which is the largest supported MBS value (it corresponds to NumMicroBatches = 1), so you will see the following error message:
ERROR: <unknown>:0:error: Minimum microbatch size 121 must be smaller or equal to the per-box batch size 120 where the number of CSX boxes is 2
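If you want to catch this condition before launching a run, a pre-flight check along the following lines may help. It is only a sketch: check_mbs_upper_bound is a hypothetical helper, and the exception text is illustrative rather than the compiler's actual message.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (avoids floating-point rounding)."""
    return -(-a // b)


def check_mbs_upper_bound(batch_size: int, num_csx: int, micro_batch_size: int) -> int:
    """Hypothetical pre-flight check: return the per-system batch size,
    or raise if the requested MBS exceeds it."""
    per_system = ceil_div(batch_size, num_csx)
    if micro_batch_size > per_system:
        # Illustrative wording only; the compiler's message differs.
        raise ValueError(
            f"micro_batch_size {micro_batch_size} exceeds the per-system "
            f"batch size {per_system} (num_csx={num_csx})"
        )
    return per_system


try:
    check_mbs_upper_bound(240, 2, 121)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Lowering micro_batch_size to 120 or less, or reducing num_csx, would let the same check pass.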