For local data:
  • Set type to local.
  • Use source to provide the path to the input directory.
For Hugging Face data:
  • Set type to huggingface.
  • Use source to specify the dataset name from the Hugging Face hub.
  • Use split to specify the dataset split.
The preprocessing pipeline passes these parameters to the Hugging Face load_dataset API.
Because they are passed as keyword arguments, they must conform to the specifications in the Hugging Face load_dataset documentation.

Config Examples

The following example shows a configuration for local data:
setup:
  data:
    source: "/input/dir/here"
    type: "local"

  mode: "pretraining"
  output_dir: "./output/dir/here/"
  processes: 1
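
A corresponding example for Hugging Face data is sketched below; the dataset name and the split value are illustrative placeholders.
setup:
  data:
    type: "huggingface"
    # source names the dataset on the Hugging Face Hub (placeholder value)
    source: "dataset-name-here"
    # split is forwarded to load_dataset as a keyword argument
    split: "train"

  mode: "pretraining"
  output_dir: "./output/dir/here/"
  processes: 1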

Example configs for various other use cases are available in the documentation.

What’s Next?

Now that you’ve configured your input data, learn how to process it with read hooks.