micro_batch_size that:
- Is valid (i.e., evenly divides the per-system batch size after rounding): it will be used as is.
- Is not valid (not supported): the compiler will automatically override it to the nearest valid value and issue a warning.
- Approximately even distribution of the work across CS-X systems.
- Automatic adjustments if micro_batch_size isn’t optimal or feasible.
Numeric Examples
The following examples demonstrate how the system determines whether a micro_batch_size is valid and what happens if it isn’t. An invalid value is either automatically overwritten or results in an error.
Use Case 1 - MBS is Valid
If you provide:
- batch_size = 133
- num_csx = 1
- micro_batch_size = 34
Then:
- Per-system batch size = Ceil(133/1) = 133
- Valid NumMicroBatches = {1, 2, 3, …, 133}
- Supported MBS values = {133, 133/2, 133/3, 133/4, …, 133/133}
- After dividing & taking Ceil = {133, 67, 45, 34, 27, 23, 19, 17, 15, 14, 13, 12, …, 1}
- 34 is found in this list, so it is used as is.
NumMicroBatches = Ceil(133/34) = Ceil(3.912) = 4
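The arithmetic in this use case can be reproduced with a few lines of Python. This is a minimal sketch of the rule as described on this page, not the compiler's actual implementation; `ceil_div` and `supported_micro_batch_sizes` are hypothetical helper names introduced only for illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def supported_micro_batch_sizes(batch_size: int, num_csx: int) -> list[int]:
    """Hypothetical helper: supported MBS values, largest first."""
    per_system = ceil_div(batch_size, num_csx)
    # One candidate per valid NumMicroBatches value n in {1, ..., per_system}.
    sizes = {ceil_div(per_system, n) for n in range(1, per_system + 1)}
    return sorted(sizes, reverse=True)


sizes = supported_micro_batch_sizes(batch_size=133, num_csx=1)
print(sizes[:12])         # [133, 67, 45, 34, 27, 23, 19, 17, 15, 14, 13, 12]
print(34 in sizes)        # True, so 34 is used as is
print(ceil_div(133, 34))  # 4 micro-batches
```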
Use Case 2 - MBS is Overwritten
If you provide:
- batch_size = 673
- num_csx = 2
- micro_batch_size = 168
Then:
- Per-system batch size = Ceil(673/2) = 337
- Valid NumMicroBatches = {1, 2, 3, …, 337}
- Supported MBS values = {337, 337/2, 337/3, 337/4, …, 337/337}
- After dividing & taking Ceil = {337, 169, 113, 85, 68, 57, 49, 43, 38, 34, 31, 29, 26, 25, 23, 22, 20, 19, 18, 17, …, 1}
- 168 is not found in this list, so it is overridden to the nearest supported value, 169, and the following message is issued:
INFO: The micro batch size is changed to 169 to allow approximately even distribution across boxes and gradient accumulation iterations
NumMicroBatches = Ceil(337/169) = Ceil(1.994) = 2
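The override in this use case can be sketched as a nearest-value lookup. The page does not define "nearest" precisely, so this sketch assumes it means the smallest absolute difference, with ties going to the larger value; that tie-break and the name `nearest_supported_mbs` are assumptions made for illustration, not documented compiler behavior.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def nearest_supported_mbs(batch_size: int, num_csx: int, micro_batch_size: int) -> int:
    """Hypothetical helper: pick the supported MBS closest to the request."""
    per_system = ceil_div(batch_size, num_csx)
    candidates = {ceil_div(per_system, n) for n in range(1, per_system + 1)}
    # Assumption: "nearest" = smallest absolute difference; ties go to the larger value.
    return min(candidates, key=lambda s: (abs(s - micro_batch_size), -s))


mbs = nearest_supported_mbs(batch_size=673, num_csx=2, micro_batch_size=168)
print(mbs)                 # 169
print(ceil_div(337, mbs))  # 2 micro-batches
```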
Use Case 3 - Invalid MBS Error
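If you provide:
- batch_size = 240
- num_csx = 2
- micro_batch_size = 121
Then:
- Per-system batch size = Ceil(240/2) = 120
- Valid NumMicroBatches = {1, 2, 3, …, 120}
The requested micro_batch_size of 121 is larger than the per-system batch size of 120, so it does not correspond to any valid NumMicroBatches, and you will see the following error message: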
ERROR: <unknown>:0:error: Minimum microbatch size 121 must be smaller or equal to the per-box batch size 120 where the number of CSX boxes is 2
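The condition behind this error can be checked before submitting a run. The sketch below is only an illustration of the bound stated in the error message; `check_micro_batch_size` is a hypothetical helper, and raising `ValueError` is a choice made for this example rather than what the compiler itself does.

```python
def check_micro_batch_size(batch_size: int, num_csx: int, micro_batch_size: int) -> int:
    """Hypothetical pre-flight check; returns the per-system batch size."""
    per_system = -(-batch_size // num_csx)  # Ceil(batch_size / num_csx)
    if micro_batch_size > per_system:
        # Illustrative exception only; the real compiler reports its own error.
        raise ValueError(
            f"micro_batch_size {micro_batch_size} must be smaller than or equal to "
            f"the per-system batch size {per_system} (num_csx={num_csx})"
        )
    return per_system


try:
    check_micro_batch_size(batch_size=240, num_csx=2, micro_batch_size=121)
except ValueError as err:
    print(err)
```

Lowering micro_batch_size to 120 or less, or reducing num_csx, satisfies the check in this example.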