Learn how to configure your input data for preprocessing—whether you’re working with a single directory of data or organizing large datasets into subsets.
For local data, set `type` to `local` and set `source` to the path of the input directory.

Set `top_level_as_subsets: True` to automatically treat each top-level folder in your input directory as a separate subset. A separate output folder is created for each subset under `output_dir`, containing its preprocessed HDF5 files. Defaults to `False` if not specified.

Set `subsets: [list]` to manually specify which subfolders to preprocess. Only the folders listed in `subsets` are preprocessed, and each subset gets its own output folder under `output_dir`.

For Hugging Face data, set `type` to `huggingface`. Set `source` to the dataset name on the Hugging Face Hub and `split` to the dataset split. These settings correspond to the Hugging Face `load_dataset` API; see the `load_dataset` documentation for details.
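The two local subset options described above (`top_level_as_subsets` and `subsets`) can be pictured with a small sketch. This is only an illustration of the documented behavior, not the preprocessing tool's actual code: the helper name `resolve_subsets` and the folder names are hypothetical, and letting an explicit `subsets` list take precedence over `top_level_as_subsets` is an assumption, since the text does not say what happens when both are set.

```python
from pathlib import Path
import tempfile


def resolve_subsets(source, output_dir, top_level_as_subsets=False, subsets=None):
    """Map each input folder to its output folder (hypothetical helper)."""
    source, output_dir = Path(source), Path(output_dir)
    if subsets:
        # Manual mode: only the listed subfolders are preprocessed.
        # Assumption: an explicit list wins if both options are given.
        names = list(subsets)
    elif top_level_as_subsets:
        # Automatic mode: every top-level folder becomes a subset.
        names = sorted(p.name for p in source.iterdir() if p.is_dir())
    else:
        # Default: the whole input directory is a single dataset.
        return {source: output_dir}
    # Each subset gets its own folder under output_dir.
    return {source / name: output_dir / name for name in names}


# Build a throwaway input directory with three top-level folders.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "input"
    for name in ("code", "web", "books"):
        (root / name).mkdir(parents=True)
    out = Path(tmp) / "output"

    auto = resolve_subsets(root, out, top_level_as_subsets=True)
    manual = resolve_subsets(root, out, subsets=["web"])
    single = resolve_subsets(root, out)

auto_names = [src.name for src in auto]
manual_names = [src.name for src in manual]
print(auto_names, manual_names, len(single))
```

In this sketch the automatic mode picks up all three folders, the manual mode keeps only the one listed, and the default treats the input directory as one dataset with a single output location.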