Data preprocessing
Input Data Configuration
Learn how to configure local or Hugging Face data as input for preprocessing.
For local data:
- Set type to local.
- Use source to provide the path to the input directory.
For Hugging Face data:
- Set type to huggingface.
- Use source to specify the dataset name from the Hugging Face hub.
- Use split to specify the dataset split.
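As a rough illustration, the two configurations above can be pictured as simple key-value mappings. The sketch below is written in Python and is not the pipeline's actual config format or validation logic. The keys type, source, and split come from the text; the example path, the dataset name, the validate_data_config helper, and the choice to treat split as required for Hugging Face data are all assumptions made for illustration.

```python
# Hypothetical in-memory form of the two input configurations.
# Only the key names (type, source, split) come from the documentation;
# the values below are placeholders.
LOCAL_DATA = {"type": "local", "source": "/path/to/input_dir"}
HF_DATA = {"type": "huggingface", "source": "org/dataset-name", "split": "train"}


def validate_data_config(cfg: dict) -> None:
    """Raise ValueError if the config is missing a field this sketch expects."""
    data_type = cfg.get("type")
    if data_type not in ("local", "huggingface"):
        raise ValueError(f"type must be 'local' or 'huggingface', got {data_type!r}")
    if not cfg.get("source"):
        raise ValueError("source is required")
    # Assumption: this sketch requires split for Hugging Face data.
    if data_type == "huggingface" and not cfg.get("split"):
        raise ValueError("split is required for Hugging Face data in this sketch")


validate_data_config(LOCAL_DATA)
validate_data_config(HF_DATA)
```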
The preprocessing pipeline passes these parameters to the Hugging Face load_dataset API as keyword arguments, so they must conform to the specifications outlined by Hugging Face. Refer to the load_dataset documentation for the accepted parameters.
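One plausible way to picture this forwarding is sketched below in Python. It is not the pipeline's actual code: the helper names split_load_dataset_args and load_huggingface_input are hypothetical, and the mapping of source to the first argument of load_dataset is an assumption. The load_dataset function itself, along with its split and cache_dir parameters, is part of the Hugging Face datasets library.

```python
from typing import Any, Dict, Tuple


def split_load_dataset_args(data_cfg: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Separate the dataset name from the remaining keyword arguments."""
    kwargs = dict(data_cfg)      # copy so the caller's config is not mutated
    kwargs.pop("type", None)     # assumption: type only selects the loader
    path = kwargs.pop("source")  # assumption: source is the dataset name/path
    return path, kwargs


def load_huggingface_input(data_cfg: Dict[str, Any]):
    """Forward the remaining config keys to load_dataset as keyword arguments."""
    # Imported lazily so the helper above works without the datasets package.
    from datasets import load_dataset

    path, kwargs = split_load_dataset_args(data_cfg)
    return load_dataset(path, **kwargs)


# Placeholder dataset name; cache_dir is an extra load_dataset keyword argument.
path, kwargs = split_load_dataset_args(
    {"type": "huggingface", "source": "org/dataset-name",
     "split": "train", "cache_dir": "/tmp/hf"}
)
```

Because every remaining key is forwarded by name, each key in the config has to match a parameter name that load_dataset accepts.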
Config Examples
View example configs for various use cases here.
What’s Next?
Now that you’ve configured your input data, learn how to process it with read hooks.