The preprocessing section in the train_input and eval_input sections of the YAML configuration file enables on-the-fly data preprocessing during training and/or evaluation. This reduces turnaround time and storage requirements when running experiments on smaller datasets, since no preprocessed copy of the data needs to be written to disk beforehand.

The preprocessing parameters are the same as those used for offline data preprocessing, and the same algorithms and techniques are applied.

For multibox runs, sharding is based on the number of input files in the input directory. The number of input files must be greater than or equal to the product of the number of systems and the number of workers per system; for example, a run with 2 systems and 4 workers per system requires at least 8 input files.

On-the-fly (OTF) preprocessing is currently supported for pretraining and fine-tuning on text-only and multimodal datasets. Below are example configurations for both.

Example Text-Only Configurations

Use the tabs to view examples for text-only pretraining and finetuning:
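As a rough illustration, a text-only pretraining entry might look like the following sketch. This is not a verified schema: the keys under preprocessing (mode, data, processing, huggingface_tokenizer, max_seq_length, and the sample paths and values) are assumptions that mirror the structure of the offline preprocessing configuration, and the exact names may differ by release.

```yaml
# Hypothetical sketch only -- key names mirror the offline preprocessing
# config and are assumptions, not a verified schema.
train_input:
  batch_size: 64
  num_workers: 4                     # workers per system (see sharding note above)
  preprocessing:
    mode: "pretraining"              # or "finetuning"
    data:
      source: "/path/to/raw/text"    # directory of input files; file count must
      type: "local"                  # cover systems x workers for multibox runs
    processing:
      huggingface_tokenizer: "gpt2"  # hypothetical tokenizer choice
      max_seq_length: 2048
      shuffle: true                  # per-worker shuffle; global shuffle is unsupported
```

A fine-tuning variant would typically differ only in the mode value and in dataset-specific keys such as prompt/completion field names.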

Multimodal Example Configurations

Use the tabs to view examples for multimodal pretraining and finetuning:
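For multimodal runs, the preprocessing entry would additionally describe where images live and how they are prepared. The sketch below is again hypothetical: image_dir, resize, and the tokenizer value are illustrative assumptions, not confirmed keys.

```yaml
# Hypothetical multimodal fine-tuning sketch -- key names are assumptions
# patterned on the offline preprocessing config, not a verified schema.
train_input:
  batch_size: 32
  num_workers: 4
  preprocessing:
    mode: "finetuning"
    data:
      source: "/path/to/image_text_pairs"  # e.g., JSONL with text + image refs
      type: "local"
    processing:
      huggingface_tokenizer: "llava-hf/llava-1.5-7b-hf"  # hypothetical choice
      max_seq_length: 4096
      image_dir: "/path/to/images"   # assumed key: root directory for images
      resize: [336, 336]             # assumed key: target image resolution
```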

Global shuffling is not supported with on-the-fly preprocessing; support is planned for a future release.