Our data processing pipeline integrates an innovative online shuffling feature that seamlessly interacts with HDF5 storage. This crucial functionality prevents machine learning models from learning the data order and ensures randomized distribution of data sequences, leading to improved model robustness and performance.
Our novel online shuffling mechanism operates seamlessly with HDF5 storage, eliminating the need for post-processing and its associated memory overhead. By interleaving shuffling with data serialization, it efficiently randomizes data sequences as they are written to files, significantly reducing processing time. To activate this functionality, simply specify the shuffle flag and provide a seed with the shuffle_seed flag within your data pre-processing configuration file.