You can configure local or Hugging Face data as input for preprocessing. In this guide, you'll learn how to define your data source, specify optional parameters such as subsets or splits, and structure your config file for each case.

Local Data

  • Set type to local.

  • Use source to provide the path to the input directory.

For example:

  setup:
    data:
      source: "/input/dir/here"
      type: "local"
    mode: "pretraining"
    output_dir: "./output/dir/here/"
    processes: 1

Preprocess Subdirectories

You can optionally preprocess subdirectories within your input directory as separate datasets. This enables more flexible data management for large-scale pretraining tasks.

There are two supported options:

  • Use top_level_as_subsets: True to automatically treat each top-level folder in your input directory as a separate subset. A separate output folder is created under output_dir for each subset, containing its preprocessed HDF5 files. Defaults to False if not specified.
  • Use subsets: [list] to manually specify which subfolders to preprocess. Only the folders listed in subsets are preprocessed, and each subset gets its own output folder under output_dir.

For example, using top_level_as_subsets:

setup:
    data:
        source: "input_dir"
        type: "local"
        top_level_as_subsets: True
    mode: "pretraining"
    output_dir: "./output_dir"
    processes: 1
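
The two options above can be pictured with a small Python sketch. It is not the pipeline's actual implementation: the helper name resolve_subsets, the rule that an explicit subsets list takes precedence over top_level_as_subsets, and the choice to name each output folder after its subset are all assumptions made for illustration. It only shows which input folders would be picked up and where each subset's output might land.

```python
from pathlib import Path
import tempfile


def resolve_subsets(source, output_dir, top_level_as_subsets=False, subsets=None):
    """Map each input folder to its output folder (illustrative only).

    Hypothetical helper, not part of the preprocessing pipeline.
    Assumptions: an explicit `subsets` list takes precedence over
    `top_level_as_subsets`, and output folders are named after the subset.
    """
    source, output_dir = Path(source), Path(output_dir)
    if subsets:
        # Only the folders listed by the user are preprocessed.
        names = list(subsets)
    elif top_level_as_subsets:
        # Every top-level directory under the input directory becomes a subset.
        names = sorted(p.name for p in source.iterdir() if p.is_dir())
    else:
        # No subset handling: the whole input directory is one dataset.
        return {source: output_dir}
    return {source / name: output_dir / name for name in names}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input_dir"
        # Hypothetical folder names, created only for this demonstration.
        for name in ("subset_a", "subset_b", "subset_c"):
            (src / name).mkdir(parents=True)
        print(resolve_subsets(src, "./output_dir", top_level_as_subsets=True))
        print(resolve_subsets(src, "./output_dir", subsets=["subset_a", "subset_c"]))
```

In the config file, the second call corresponds to replacing top_level_as_subsets: True with a subsets list naming the folders to preprocess.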

Hugging Face Data

  • Set type to huggingface.

  • Use source to specify the dataset name from the Hugging Face hub.

  • Use split to specify the dataset split.

The preprocessing pipeline passes these parameters to the Hugging Face load_dataset API as keyword arguments, so they must conform to the specifications outlined by Hugging Face. Refer to the load_dataset documentation for the accepted parameters.

For example:

setup:
    data:
        source: "stanfordnlp/imdb"
        type: "huggingface"
        split: "test"
        cache_dir: "path/to/cache_dir"
        # ...other parameters accepted by the Hugging Face load_dataset API...
    mode: "pretraining"
    output_dir: "./output/dir/here/"
    processes: 1
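
To make the keyword-argument pass-through concrete, here is a minimal Python sketch of how the data section might be turned into a load_dataset call. The helper name build_load_dataset_kwargs and the mapping of source to load_dataset's path argument are assumptions for illustration; the pipeline's own code may differ. The actual call is left in a comment because it needs the datasets package and network access.

```python
def build_load_dataset_kwargs(data_cfg):
    """Turn the `data` section of the config into load_dataset kwargs.

    Hypothetical helper: assumes `source` maps to load_dataset's `path`
    argument and that `type` is consumed by the pipeline itself.
    """
    if data_cfg.get("type") != "huggingface":
        raise ValueError("expected data.type to be 'huggingface'")
    # Everything except `type` and `source` is forwarded unchanged.
    kwargs = {k: v for k, v in data_cfg.items() if k not in ("type", "source")}
    kwargs["path"] = data_cfg["source"]
    return kwargs


# Same values as the `data` section in the example config.
data_cfg = {
    "source": "stanfordnlp/imdb",
    "type": "huggingface",
    "split": "test",
    "cache_dir": "path/to/cache_dir",
}
kwargs = build_load_dataset_kwargs(data_cfg)

# With the `datasets` package installed, the call would look like:
#   from datasets import load_dataset
#   dataset = load_dataset(**kwargs)
```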

See the example configs for various use cases.