For local data:

  • Set type to local.

  • Use source to provide the path to the input directory.

For Hugging Face data:

  • Set type to huggingface.

  • Use source to specify the dataset name from the Hugging Face hub.

  • Use split to specify the dataset split.

The preprocessing pipeline passes these parameters to the Hugging Face load_dataset API as keyword arguments, so each one must match a parameter that load_dataset accepts. Refer to the Hugging Face load_dataset documentation for the full list of supported parameters.
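To make the keyword-argument hand-off concrete, here is a minimal sketch of how a Hugging Face data config might be turned into a load_dataset call. It is an illustration under stated assumptions, not the pipeline's actual code: the flat dictionary layout, the helper name build_load_dataset_kwargs, and the rule that source maps to load_dataset's path parameter are all assumptions, and imdb is only a placeholder dataset name.

```python
from typing import Any, Dict


def build_load_dataset_kwargs(data_config: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a Hugging Face data config into load_dataset keyword arguments.

    Hypothetical helper: the real pipeline may map keys differently.
    """
    if data_config.get("type") != "huggingface":
        raise ValueError("this helper only handles type 'huggingface'")
    if "source" not in data_config:
        raise ValueError("'source' (the dataset name on the hub) is required")

    # Assumption: `source` becomes `path`, the first parameter of
    # datasets.load_dataset.
    kwargs: Dict[str, Any] = {"path": data_config["source"]}

    # Every remaining key (for example `split`) is forwarded unchanged,
    # which is why it must be a parameter that load_dataset accepts.
    for key, value in data_config.items():
        if key not in ("type", "source"):
            kwargs[key] = value
    return kwargs


# Placeholder config using the three parameters described above.
config = {"type": "huggingface", "source": "imdb", "split": "train"}
kwargs = build_load_dataset_kwargs(config)
print(kwargs)  # {'path': 'imdb', 'split': 'train'}

# With the `datasets` package installed, the call would then be:
#   from datasets import load_dataset
#   dataset = load_dataset(**kwargs)
```

Because extra keys are forwarded as-is in this sketch, a misspelled or unsupported key would surface as an error from load_dataset itself, which is the practical reason to check parameter names against the Hugging Face documentation.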

Config Examples

View example configs for various use cases here.

What’s Next?

Now that you’ve configured your input data, learn how to process it with read hooks.