Prerequisites
Make sure you have read through Trainer Overview and Trainer Configuration Overview, which provide a basic overview of how to run Model Zoo models. In this document, you will use the tools and configurations outlined in those pages.

How It Works
MoE layers need larger batches to increase arithmetic intensity, but batch size growth is limited by a memory bottleneck that comes primarily from the attention layers. BTA helps by tiling attention along the batch dimension, which reduces the effective working batch of attention. For smaller sequence lengths, MoE cycles generally dominate end-to-end runtime, so increasing the batch size improves both MoE performance and overall end-to-end performance.
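The idea above can be illustrated with a minimal NumPy sketch (not the actual BTA implementation): attention is computed on a few batch samples at a time, so the large `(batch, seq, seq)` score matrix only ever materializes for a tile, while the final result is identical to the untiled computation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k, v: (batch, seq, d). The (batch, seq, seq) score matrix is the
    # memory bottleneck that batch tiling targets.
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v

def attention_batch_tiled(q, k, v, tile):
    # Process `tile` samples at a time: peak activation memory for the
    # score matrix shrinks by a factor of batch / tile.
    assert q.shape[0] % tile == 0, "tile must be a factor of the batch size"
    outs = [attention(q[i:i + tile], k[i:i + tile], v[i:i + tile])
            for i in range(0, q.shape[0], tile)]
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16, 4)) for _ in range(3))
# Tiled and untiled attention agree to numerical precision.
assert np.allclose(attention(q, k, v), attention_batch_tiled(q, k, v, tile=2))
```

Because attention is independent across batch samples, tiling changes only the peak memory, not the result, which is what lets the overall batch size grow for the MoE layers.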
Performance Impact

Without BTA, MoE models experience significant throughput degradation as sparsity increases:

| Configuration | Throughput Degradation (without BTA) |
|---|---|
| 128 experts | Up to 53% slower (2x slowdown) |
| Low top_k (high sparsity) | Up to 86% slower (7x slowdown) |
Parameters
- `ws_opt_enable_bta`: Set to `true` to enable Batch Tiling on Attention.
- `ws_opt_bta_max_tile`: Optional. Caps the maximum tile size. By default, BTA automatically selects a tile size based on model dimensions. If the automatic tiling is too aggressive or not aggressive enough, set this to a positive integer to cap the tile size. A smaller value reduces per-tile memory usage in attention, allowing a larger overall batch size to improve MoE performance. The value must be a factor of your batch size.
Enabling Batch Tiling on Attention
To enable BTA, set `ws_opt_enable_bta` (and optionally `ws_opt_bta_max_tile`) in your Trainer configuration.
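A minimal sketch of the YAML fragment, using only the two flags described under Parameters; the exact nesting depends on where your Trainer configuration accepts these flags, and the tile value shown is a placeholder:

```yaml
# Placement and value are illustrative; nest these keys where your
# Trainer configuration accepts performance flags.
ws_opt_enable_bta: true
ws_opt_bta_max_tile: 8  # optional; must be a factor of the batch size
```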
Key Considerations
- Memory Trade-off: BTA reduces peak memory usage but may introduce additional compute overhead due to tiling
- Model Compatibility: BTA is supported for transformer-based models with standard attention mechanisms

