Configure Data Preprocessing
Learn how to preprocess your data into HDF5 format for pretraining, finetuning, and custom processing tasks.
We’ll walk through configuring the YAML file used to define your preprocessing setup, including how to specify input data, set processing parameters, and initialize tokenizers.
Ready to jump into data preprocessing? Check out our quickstart guide.
The configuration is made up of three main sections:
- `setup`: Environment configuration parameters
- `processing`: Common preprocessing parameters
- `dataset`: Task-specific or token generator parameters

Refer to the sections below for more details, and to the example configuration files for complete examples.
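As a quick orientation, here is a minimal sketch of how the three sections fit together in one YAML file; the values are illustrative placeholders and the read hook path is hypothetical:

```yaml
setup:
    data:
        type: "local"
        source: "/path/to/input/files"       # illustrative path
    output_dir: "./output/"
    mode: "pretraining"
    processes: 1

processing:
    huggingface_tokenizer: "gpt2"            # illustrative tokenizer choice
    max_seq_length: 2048
    read_hook: "my_module:my_read_hook"      # hypothetical module:function path
    read_hook_kwargs:
        text_key: "text"

dataset:
    use_vsl: False
```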
Setup Section
Use the setup section to configure the environment and parameters required for processing tasks. This is where you specify the output directory, the image directory (for multimodal tasks), input data handling, the number of processes, and the preprocessing mode. A short example sketch follows the parameter list below.

- `output_dir`: Specifies the directory where output files will be saved. Defaults to `./output/` if not explicitly set. This directory stores all generated output files.
- `image_dir`: Directory path where image files will be saved for multimodal tasks. Used only when `is_multimodal` is `True` in the `dataset` section.
- `data`: Configures input data handling, including the data source and format. Learn more about using these parameters in our Configure Input Data guide.
  - `source`:
    - For local data, this defines the directory path containing input files and is mandatory.
    - For Hugging Face datasets, this specifies the dataset name from the Hugging Face hub.
  - `type`: Specifies how the input data is accessed:
    - `local`: Reads data from a specified directory. Supported file formats for local data: `.jsonl`, `.json.gz`, `.jsonl.zst`, `.jsonl.zst.tar`, `.parquet`, `.txt`, and `.fasta`.
    - `huggingface`: Loads datasets using Hugging Face's `load_dataset` function with additional parameters.
  - `top_level_as_subsets`: If `True`, all top-level directories in your `source` (if using local data) are processed as separate datasets. Defaults to `False` if not specified.
  - `subsets`: To process only specific subdirectories in your `source` (if using local data), provide a list of those directories (`subsets: ["subset_1", "subset_2", ...]`).
  - `split` (required for Hugging Face datasets): Indicates the dataset split to process (e.g., `train`, `validation`).
  - `kwargs`: Additional parameters passed to the `load_dataset` function when using Hugging Face datasets.
- `mode`: Specifies the processing mode (pretraining, finetuning, or others). Learn more in the Modes section below.
- `processes`: The number of processes to use for the task. Defaults to `1`. If set to `0`, all available CPU cores are used for optimal performance.
- `token_generator` (custom mode only): Specifies which token generator to use. This parameter is split to extract the token generator's name, enabling the system to initialize and use the specified token generator during processing.
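For example, a setup section that pulls a dataset from the Hugging Face hub might look like the following sketch; the dataset name is illustrative, and the nesting of the `data` sub-fields follows the description above:

```yaml
setup:
    data:
        type: "huggingface"
        source: "stas/openwebtext-10k"   # illustrative dataset name from the Hugging Face hub
        split: "train"
        kwargs:
            trust_remote_code: False     # extra arguments are forwarded to load_dataset
    output_dir: "./output/"
    mode: "pretraining"
    processes: 0                         # 0 uses all available CPU cores
```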
Split the Dataset
The setup section also supports powerful options for dividing your dataset into training, validation, and test splits, along with optional context-based splits.
- `data_splits_dir`: Top-level directory where split datasets will be saved.
- `data_splits`: Defines how your data is split across different stages (e.g., `train`, `val`, `test`).
  - `context_splits` (optional): Further divides a split (such as train or val) based on Maximum Sequence Lengths (MSLs).
    - `MSL_List`: List of MSL values to apply.
    - `split_fractions`: Corresponding fractions for each MSL. Must sum to 1.
Example nested in a data split:
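The exact schema is best confirmed against the example configuration files; a sketch using the keys described above, with illustrative MSL values and fractions, might look like:

```yaml
setup:
    data_splits_dir: "./data_splits/"
    data_splits:
        train:
            context_splits:
                MSL_List: [2048, 8192]       # illustrative MSL values
                split_fractions: [0.7, 0.3]  # must sum to 1
```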
Learn more about splitting datasets in Dataset Splitting and Preprocessing.
For optimal processing, it is recommended that all files (except `.txt` files) contain a substantial amount of text within each individual file. Ideally, each file should be sized in the range of gigabytes (GB).
Modes
In the setup section of the configuration file, `mode` specifies the processing approach for the dataset. It determines how dataset parameters are managed and which token generator is initialized:

`pretraining` is used for pretraining tasks. Depending on the dataset configuration, different token generators are initialized:

- If the dataset sets the `is_multimodal` parameter, multimodal pretraining data preprocessing is performed.
- If the training objective is "Fill In the Middle" (FIM), the `FIMTokenGenerator` is initialized.
- If the `use_vsl` parameter is set to `True`, the `VSLPretrainingTokenGenerator` is initialized. Otherwise, the `PretrainingTokenGenerator` is initialized.

`finetuning` is used for finetuning tasks. Depending on the dataset configuration, different token generators are initialized:

- If the dataset sets the `is_multimodal` parameter, multimodal finetuning data preprocessing is performed.
- If the `use_vsl` parameter is set to `True`, the `VSLFinetuningTokenGenerator` is initialized. Otherwise, the `FinetuningTokenGenerator` is initialized.
Other Modes
- `dpo`: Initializes the `DPOTokenGenerator`. Used for tasks that require specific processing under the `dpo` mode.
- `nlg`: Initializes the `NLGTokenGenerator`. Used for natural language generation tasks.
- `custom`: Allows for user-defined processing by plugging in your own custom token generator (see the sketch below).
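For instance, selecting the custom mode and pointing it at your own token generator might look like the following sketch; the module path is hypothetical:

```yaml
setup:
    mode: "custom"
    token_generator: "my_package.my_generators:MyTokenGenerator"   # hypothetical module:class path
```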
Processing Section
Use the processing section to initialize parameters for preprocessing tasks and set up class attributes based on the provided configuration.
Initialization and Configuration Params
- `resume_from_checkpoint`: Boolean flag indicating whether to resume processing from a checkpoint. Defaults to `False`.
- `max_seq_length`: Specifies the maximum sequence length for processing. Defaults to `2048`.
- `min_sequence_len`: Specifies the minimum sequence length of the tokenized doc. Docs shorter than `min_sequence_len` are discarded.
- `fraction_of_RAM_alloted`: Upper limit on the fraction of RAM allocated for processing. Defaults to `0.7` (70% of available RAM).
Data Handling Params
- `read_chunk_size`: The size of chunks to read from the input data, specified in KB. Defaults to 1024 KB (1 MB).
- `write_chunk_size`: The size of chunks to write to the output data, specified in KB. Defaults to 1024 KB (1 MB).
- `write_in_batch`: Boolean flag indicating whether to write data in batches. Defaults to `False`.
- `shuffle`: Boolean flag indicating whether to shuffle the data. Defaults to `False`. If `True`, the shuffle seed is also set.
- `shuffle_seed`: The seed for shuffling data. Defaults to `0` if not specified.
- `token_limit`: Stops the data preprocessing pipeline after the specified number of tokens has been processed.
Read Hooks Params
- `read_hook`: Path to the read hook function. Defaults to `None`. You must provide a `read_hook` for every preprocessing run; see the example in the Read Hook Function section below.
- `read_hook_kwargs`: A dictionary of keyword arguments for the read hook function. Must include the keys used for data processing, following the `*_key` naming convention; see the example in the Read Hook Function section below.
Tokenization Params
- `huggingface_tokenizer`: Specifies the Hugging Face tokenizer to use.
- `custom_tokenizer`: Specifies the custom tokenizer to use.
- `tokenizer_params`: Additional parameters used to initialize the tokenizer. For more details, see the Tokenizer Initialization section below.
- `input_ids_dtype`: dtype of the processed `input_ids`. Defaults to `int32`.
- `input_mask_dtype`: dtype of the processed input loss masks. Defaults to `int32`.
Text Processing Params
- `use_ftfy`: Boolean flag indicating whether to fix text with ftfy.
- `ftfy_normalizer`: The kind of Unicode normalization to apply. Usually, we apply NFC normalization so that letters followed by combining characters become single combined characters. Using `None` applies no normalization while fixing text. Defaults to `NFC`.
- `wikitext_detokenize`: Whether to use the wikitext detokenizer to fix text. Defaults to `False`.
Sequence Control Params
- `short_seq_prob`: Probability of creating sequences that are shorter than the maximum sequence length. Defaults to `0.0`.
Semantic Masks and Weights Params
- `semantic_drop_mask`: Dictionary indicating which semantic regions to drop from the input data before tokenization. Defaults to `{}`.
- `semantic_loss_weight`: Dictionary indicating the loss mask of the different semantic regions post-tokenization. Defaults to `{}`.
- `semantic_attention_mask`: Dictionary indicating the attention mask of the different semantic regions. Defaults to `{}`.
The keys provided in the `read_hook_kwargs` flag should end with `_key`.
The `max_seq_length` specified in the `processing` section of the data config should match `max_position_embeddings` in the `model` section of the model's config. Also make sure the `vocab_size` in the `model` section of the model's config matches the vocab size of the tokenizer used for data preprocessing.
Example Processing Section
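A sketch of a processing section, assuming a hypothetical read hook module and an arbitrary Hugging Face tokenizer; adjust the values to your own data and model:

```yaml
processing:
    huggingface_tokenizer: "gpt2"                   # any Hugging Face tokenizer name
    max_seq_length: 2048
    read_hook: "my_module.my_hooks:my_read_hook"    # hypothetical module:function path
    read_hook_kwargs:
        text_key: "text"
    use_ftfy: True
    ftfy_normalizer: "NFC"
    shuffle: True
    shuffle_seed: 0
    read_chunk_size: 1024
    write_chunk_size: 1024
    resume_from_checkpoint: False
```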
Dataset Section
The following dataset parameters are processed:

- `use_vsl`: A boolean parameter indicating whether to use VSL (variable sequence length) mode.
- `is_multimodal`: A boolean parameter indicating whether the dataset is multimodal.
- `training_objective`: Specifies the training objective, which can be either `fim` or `mlm`. `mlm` is for masked language modeling, which is part of the pretraining token generator.
- `truncate_to_msl`: Specifies whether sequences should be truncated to fit within the MSL. This is applicable only to the `FineTuning` and `VSLFineTuning` modes. (For more details, refer to the section on truncation.)
`use_vsl` is not supported for multimodal tasks.
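As an illustration, a dataset section for a text-only finetuning run with VSL and truncation enabled might look like this sketch:

```yaml
dataset:
    use_vsl: True          # selects the VSLFinetuningTokenGenerator in finetuning mode
    is_multimodal: False   # VSL is not supported for multimodal tasks
    truncate_to_msl: True  # truncate sequences to fit within the MSL
```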
Read Hook Function
A read hook function is a user-defined function that customizes the way data is processed. It is specified through configuration parameters under the `processing` section of the config and is crucial for preprocessing datasets. The parameter fields it takes are `read_hook` and `read_hook_kwargs`.

- Always ensure that a read hook function is provided for your dataset so that the data types are handled appropriately.
- Specify the read hook path in the configuration in the format `module_name:func_name` to ensure the correct function is loaded and utilized.
Example Configuration
Here is how to specify a read hook function in the configuration:
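A sketch of such a configuration; the values bound to `data_keys`, `param1`, and `param2` are placeholders, and the `*_key` entry shown under `data_keys` is an assumption to adapt to your hook's signature:

```yaml
processing:
    read_hook: "my_module.my_submodule:my_custom_hook"
    read_hook_kwargs:
        data_keys:
            text_key: "text"   # placeholder; use whatever *_key entries your hook expects
        param1: "value1"       # placeholder
        param2: "value2"       # placeholder
```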
This configuration will load `my_custom_hook` from `my_module.my_submodule` and bind `data_keys`, `param1`, and `param2` with the respective values.
You must name data-related keys with the `_key` suffix. For example, the key for text can be named `text_key: "text"`; for finetuning it can be named `prompt_key: "prompt"`. This is important because it is how the data processing framework distinguishes data keys in `read_hook_kwargs` from the other parameters in the kwargs. An example of finetuning read hook kwargs is shown below.
Tokenizer Initialization
This section describes how the tokenizer is initialized based on the provided processing parameters. The initialization process handles different types of tokenizers, including Hugging Face tokenizers, GPT-2, NeoX, and custom tokenizers.
Configuration Parameters
- `huggingface_tokenizer`: Specifies the Hugging Face tokenizer to use.
- `custom_tokenizer`: Specifies the custom tokenizer to use. A custom tokenizer is specified the same way as any other custom module, using the `module_name:tokenizer_name` format. `gpt2tokenizer` and `neoxtokenizer` are provided as special-case custom tokenizers for legacy reasons. For more details, see the Custom Tokenizer example below.
- `tokenizer_params`: A dictionary of additional parameters for the tokenizer. These parameters are passed to the tokenizer during initialization.
- `eos_id`: Optional. Specifies the end-of-sequence token ID. Used if the tokenizer does not have an `eos_id`.
- `pad_id`: Optional. Specifies the padding token ID. Used if the tokenizer does not have a `pad_id`.
Initialization Process
- Handling tokenizer types:
  - Hugging Face tokenizer: Initialized using `AutoTokenizer` from Hugging Face.
  - Custom tokenizer: Initialized from the user-provided module and class.
  - GPT-2 and NeoX tokenizers: Kept as custom tokenizers because they require custom `vocab` and `encoder` files for initialization, which are located in ModelZoo. Note that you can still use Hugging Face tokenizers for GPT-2 and NeoX; these custom tokenizers exist for legacy reasons.
- Override IDs:
  - Override the `eos_id` and `pad_id` if they are specified in the processing parameters. Ensure that the `eos_id` and `pad_id` provided in the configuration match the tokenizer's `eos_id` and `pad_id`, if available.
  - For GPT-2 tokenizers, make sure the `pad_id` is set to the same value as the `eos_id`.
Example Configurations
Hugging Face Tokenizer
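A minimal sketch, using an arbitrary tokenizer name from the Hugging Face hub; the entry under `tokenizer_params` is an illustrative assumption and is simply forwarded to the tokenizer during initialization:

```yaml
processing:
    huggingface_tokenizer: "tiiuae/falcon-7b"   # any tokenizer name from the Hugging Face hub
    tokenizer_params:
        use_fast: True                          # illustrative; forwarded to the tokenizer
```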
This configuration will initialize the specified Hugging Face tokenizer with specific parameters.
GPT-2 Tokenizer
GPT-2 and NeoX tokenizers are treated as custom tokenizers because they require specific `vocab` and `encoder` files for initialization. These files must be provided through the `tokenizer_params`.
NeoX Tokenizer
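The NeoX tokenizer follows the same pattern. Here is a sketch, again with assumed key names and illustrative paths:

```yaml
processing:
    custom_tokenizer: "neoxtokenizer"
    tokenizer_params:
        encoder_file: "/path/to/neox-encoder.json"   # assumed key name and illustrative path
```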
Custom Tokenizer
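For a fully custom tokenizer, point `custom_tokenizer` at your own module and class using the `module_name:tokenizer_name` format; the module, class, and parameter below are hypothetical:

```yaml
processing:
    custom_tokenizer: "my_package.my_tokenizers:MyTokenizer"   # hypothetical module:class path
    tokenizer_params:
        vocab_file: "/path/to/vocab.json"   # any kwargs your tokenizer's constructor accepts
```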
Output Files Structure
The output directory will contain a number of `.h5` files, along with the following auxiliary files:
- `data_params.json` stores the parameters used for generating this set of files.
- `checkpoint_*.txt` can be used to resume processing if the run script gets killed for some reason. To use this file, set the `resume_from_checkpoint` flag to `True` in the `processing` section of the configuration file.
Statistics Generated After Preprocessing
After preprocessing has been completed, the following statistics are generated in `data_params.json`:
| Attribute | Description |
|---|---|
| average_bytes_per_sequence | The average number of bytes per sequence after processing |
| average_chars_per_sequence | The average number of characters per sequence after processing |
| discarded_files | The number of files discarded during processing because the resulting number of token IDs was either greater than the MSL or less than the min_sequence_len |
| eos_id | The token ID used to signify the end of a sequence |
| loss_valid_tokens | The number of tokens on which loss is computed |
| n_examples | The total number of examples (sequences) that were processed |
| non_pad_tokens | The total number of non-pad tokens |
| normalized_bytes_count | The total number of bytes after normalization (e.g., UTF-8 encoding) |
| normalized_chars_count | The total number of characters after normalization (e.g., lowercasing, removing special characters) |
| num_masked_tokens | The total number of tokens that were masked (used in tasks like masked language modeling) |
| num_pad_tokens | The total number of padding tokens used to equalize the length of the sequences |
| num_tokens | The total number of tokens |
| pad_id | The token ID used for padding |
| processed_files | The number of files successfully processed after tokenizing |
| raw_bytes_count | The total number of bytes before any processing |
| raw_chars_count | The total number of characters before any processing |
| successful_files | The number of files that were processed without any issues |
| total_raw_docs | The total number of raw docs present in the input data |
| raw_docs_skipped | The number of raw docs skipped due to missing sections in the data |
| vocab_size | The size of the vocabulary used by the tokenizer |
What’s Next?
Now that you’ve mastered the essentials of data preprocessing on Cerebras Systems, dive deeper into configuring your input data with our detailed guide on Input Data Configuration on Cerebras Systems. This guide will help you set up and manage local and Hugging Face data sources effectively, ensuring seamless integration into your preprocessing workflow.
Additionally, explore the various read hooks available for data processing. These read hooks are tailored to handle different types of input data, preparing it for specific machine learning tasks. Understanding and utilizing these read hooks will further enhance your data preprocessing capabilities, leading to better model performance and more accurate results.