We’ll walk through configuring the YAML file used to define your preprocessing setup, including how to specify input data, set processing parameters, and initialize tokenizers.

Ready to jump into data preprocessing? Check out our quickstart guide.

The configuration is made up of three main sections:

  • setup: Environment configuration parameters

  • processing: Common preprocessing parameters

  • dataset: Task-specific or token generator parameters

Refer to the sections below for more details.

Refer to example configuration files here.

Setup Section

Use the setup section to configure the environment and parameters required for processing tasks. This is where you specify the output directory, the image directory (for multimodal tasks), input data handling, the number of processes, and the preprocessing mode. A sample setup section is shown after the list below.

  1. output_dir specifies the directory where output files will be saved.

    • Default: ./output/ if not explicitly set.

    • This directory is essential for storing all generated output files.

  2. image_dir determines the directory path where image files will be saved for multimodal tasks. Used only when is_multimodal is True in the dataset section.

  3. data configures input data handling, including data source and format. Learn more about using these parameters to configure input data in our Configure Input Data guide.

    • source:

      • For local data, this defines the directory path containing input files and is mandatory.

      • For Hugging Face datasets, this specifies the dataset name from the Hugging Face hub.

    • type specifies how the input data is accessed:

      • local: Reads data from a specified directory. Supported file formats for local data: .jsonl, .json.gz, .jsonl.zst, .jsonl.zst.tar, .parquet, .txt, and .fasta.

      • huggingface: Loads datasets using Hugging Face’s load_dataset function with additional parameters.

    • top_level_as_subsets: If True, all top-level directories in your source (if using local data) are processed as separate datasets. Defaults to False if not specified.

    • subsets: To process only specific subdirectories in your source (if using local data), provide a list of those directories (subsets: ["subset_1", "subset_2" ...]).

    • split (required for Hugging Face datasets) indicates the dataset split to process (e.g., train, validation).

    • kwargs specifies additional parameters passed to the load_dataset function when using Hugging Face datasets.

  4. mode specifies the processing mode (pretraining, finetuning, or others). Learn more about modes here.

  5. processes determines the number of processes to use for the task. Default value is 1. If set to 0, it automatically uses all available CPU cores for optimal performance.

  6. token_generator (custom mode only) specifies which token generator to use. The value is split to extract the token generator’s name, enabling the system to initialize and use the specified token generator during processing.
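
Putting these parameters together, a minimal setup section might look like the sketch below. The directory path is a placeholder, and the comments note the common alternatives described above.

setup:
    output_dir: "./output/"
    data:
        type: "local"                      # or "huggingface" to load from the Hugging Face hub
        source: "/path/to/input/files"     # local directory of input files, or a Hugging Face dataset name
    mode: "pretraining"                    # pretraining, finetuning, dpo, nlg, or custom
    processes: 4                           # 0 uses all available CPU cores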

Split the Dataset

The setup section also supports powerful options for dividing your dataset into training, validation, and test splits, along with optional context-based splits.

  1. data_splits_dir: Top-level directory where split datasets will be saved.

  2. data_splits: Use this to define how your data is split across different stages (e.g., train, val, test).

    • split_fraction: Fraction of the data assigned to each split. The fractions across all splits must sum to 1.

data_splits:
  train:
    split_fraction: 0.8
  val:
    split_fraction: 0.2

  3. context_splits: (optional) Further divides a split (such as train or val) based on Maximum Sequence Lengths (MSLs).

    • MSL_List: List of MSL values to apply.

    • split_fractions: Corresponding fractions for each MSL. Must sum to 1.

Example nested in a data split:

data_splits:
  train:
    split_fraction: 0.8
    context_splits:
      MSL_List: [128, 512]
      split_fractions: [0.5, 0.5]
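
Putting the pieces together, a complete split configuration with a top-level output directory might look like the following sketch (the directory path and fractions are placeholders):

data_splits_dir: "./data_splits/"
data_splits:
  train:
    split_fraction: 0.8
    context_splits:
      MSL_List: [128, 512]
      split_fractions: [0.5, 0.5]
  val:
    split_fraction: 0.2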

Learn more about splitting datasets in Dataset Splitting and Preprocessing.

For optimal processing, it is recommended that all files (except .txt files) contain a substantial amount of text; ideally, each individual file should be on the order of gigabytes (GB).

Modes

In the setup section of the configuration file, the mode specifies the processing approach for the dataset. It determines how dataset parameters are managed and which token generator is initialized:

pretraining is used for pretraining tasks. Depending on the dataset configuration, different token generators are initialized:

  • If the is_multimodal parameter is set in the dataset section, multimodal pretraining data preprocessing is performed.

  • If the training objective is “Fill In the Middle” (FIM), it initializes the FIMTokenGenerator.

  • If the use_vsl parameter is set to True, it initializes the VSLPretrainingTokenGenerator. Otherwise, it initializes the PretrainingTokenGenerator.

finetuning is used for finetuning tasks. Depending on the dataset configuration, different token generators are initialized:

  • If the is_multimodal parameter is set in the dataset section, multimodal finetuning data preprocessing is performed.

  • If the use_vsl parameter is set to True, it initializes the VSLFinetuningTokenGenerator. Otherwise, it initializes the FinetuningTokenGenerator.

Other Modes

  • dpo: This mode initializes the DPOTokenGenerator. It is used for Direct Preference Optimization (DPO) tasks.

  • nlg: This mode initializes the NLGTokenGenerator. It is used for natural language generation tasks.

  • custom: This mode allows user-defined processing by plugging in your own custom token generator, specified via the token_generator parameter in the setup section.
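
For the custom mode, the token generator is specified in the setup section. The sketch below assumes the module_name:class_name convention used elsewhere in this config; the module and class names are placeholders.

setup:
    mode: "custom"
    token_generator: "my_package.my_module:MyTokenGenerator"   # hypothetical module and class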

Processing Section

Use the processing section to initialize parameters for preprocessing tasks and set up class attributes based on the provided configuration.

Initialization and Configuration Params

  • resume_from_checkpoint: Boolean flag indicating whether to resume processing from a checkpoint. Defaults to False.

  • max_seq_length: Specifies the maximum sequence length for processing. Defaults to 2048.

  • min_sequence_len: Specifies the minimum sequence length of the tokenized document. Documents with fewer than min_sequence_len tokens are discarded.

  • fraction_of_RAM_alloted: Upper limit on fraction of RAM allocated for processing. Defaults to 0.7 (70% of available RAM).

Data Handling Params

  • read_chunk_size: The size of chunks to read from the input data, specified in KB. Defaults to 1024 KB (1 MB).

  • write_chunk_size: The size of chunks to write to the output data, specified in KB. Defaults to 1024 KB (1 MB).

  • write_in_batch: Boolean flag indicating whether to write data in batches. Defaults to False.

  • shuffle: Boolean flag indicating whether to shuffle the data. Defaults to False. If True, the shuffle seed is also set.

  • shuffle_seed: The seed for shuffling data. Defaults to 0 if not specified.

  • token_limit: Stop the data preprocessing pipeline after a specified number of tokens are processed.
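
For reference, these data handling parameters might appear in the processing section as follows (a sketch with illustrative values only):

processing:
    read_chunk_size: 1024       # KB read from the input data per chunk (1 MB)
    write_chunk_size: 1024      # KB written to the output data per chunk (1 MB)
    write_in_batch: True
    shuffle: True
    shuffle_seed: 0
    token_limit: 1024000        # stop preprocessing after this many tokens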

Read Hooks Params

  • read_hook: Path to the read hook function, specified in module_name:func_name format. See an example here. Defaults to None. A read_hook must be provided for every preprocessing run.

  • read_hook_kwargs: A dictionary of keyword arguments for the read hook function. Must include the keys to be used for data processing with the naming convention *_key. See an example here.

Tokenization Params

  • huggingface_tokenizer: Specifies the Hugging Face tokenizer to use.

  • custom_tokenizer: Specifies the custom tokenizer to use.

  • tokenizer_params: Additional parameters used to initialize the tokenizer. For more details, see Tokenizer Initialization below.

  • input_ids_dtype: dtype of processed input_ids. Defaults to int32.

  • input_mask_dtype: dtype of processed input loss masks. Defaults to int32.

Text Processing Params

  • use_ftfy: Boolean flag indicating whether or not to fix text with ftfy.

  • ftfy_normalizer: Choose what kind of unicode normalization is applied. Usually, we apply NFC normalization, so that letters followed by combining characters become single combined characters. Using None applies no normalization while fixing text. Defaults to NFC.

  • wikitext_detokenize: Use wikitext detokenizer to fix text. Defaults to False.

Sequence Control Params

  • short_seq_prob: Probability of creating sequences which are shorter than the maximum sequence length. Defaults to 0.0.

Semantic Masks and Weights Params

  • semantic_drop_mask: Dictionary which indicates which semantic region to drop from input data before tokenization. Defaults to {}.

  • semantic_loss_weight: Dictionary which indicates the loss mask of the different semantic regions post tokenization. Defaults to {}.

  • semantic_attention_mask: Dictionary which indicates the attention mask of the different semantic regions. Defaults to {}.
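
As a hypothetical sketch, assuming the dictionary keys are semantic region names produced by your read hook (for example, prompt and completion in a finetuning setup), these parameters might look like this:

processing:
    semantic_drop_mask:          # drop no regions before tokenization
        prompt: False
        completion: False
    semantic_loss_weight:        # compute loss only on the completion region
        prompt: 0
        completion: 1
    semantic_attention_mask:     # attend to both regions
        prompt: True
        completion: True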

The keys provided in the read_hook_kwargs flag should end with *_key.

The max_seq_length specified in the processing section of the data config should match max_position_embeddings in the model section of the model’s config. Also make sure that vocab_size in the model section of the model’s config matches the vocabulary size of the tokenizer used for data preprocessing.

Example Processing Section

processing:
    huggingface_tokenizer: "unsloth/Llama-3.3-70B-Instruct"
    max_seq_length: 4096

    write_in_batch: True

    resume_from_checkpoint: False

    read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:chat_read_hook"
    read_hook_kwargs:
        multi_turn_key: "messages"
        multi_turn_content_key: "content"
        multi_turn_role_key: "role"

    shuffle_seed: 0
    shuffle: False
    use_ftfy: True
    ftfy_normalizer: "NFC"
    wikitext_detokenize: False
    token_limit: 1024000

Dataset Section

The dataset section holds task-specific and token generator parameters. The following parameters are processed:

  • use_vsl: A boolean parameter indicating whether to use VSL (variable sequence length) mode.

  • is_multimodal: A boolean parameter indicating whether the dataset is multimodal.

  • training_objective: Specifies the training objective, which can be either fim or mlm. mlm is for masked language modeling, which is handled by the pretraining token generator.

  • truncate_to_msl: Specifies whether sequences should be truncated to fit within the MSL. This is applicable only to the finetuning and VSL finetuning modes. (For more details, refer to the section on truncation here.)

use_vsl is not supported for multimodal tasks.
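
For example, a dataset section for VSL finetuning with truncation enabled might look like this sketch:

dataset:
    use_vsl: True            # pairs with mode: finetuning to use the VSLFinetuningTokenGenerator
    truncate_to_msl: True    # truncate sequences that do not fit within the MSL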

Read Hook Function

A read hook function is a user-defined function that customizes the way data is processed. It is specified through configuration parameters under the processing section of the config and is crucial for preprocessing datasets. The parameter fields it takes are read_hook and read_hook_kwargs.

  • Always ensure that a read hook function is provided for datasets to handle the data types appropriately.

  • Specify the read hook path in the configuration in the format module_name:func_name to ensure the correct function is loaded and utilized.

Example Configuration

Here is how to specify a read hook function in the configuration:

processing:
  read_hook: "my_module.my_submodule:my_custom_hook"
  read_hook_kwargs:
    key1_key: "<key1_name>"
    key2_key: "<key2_name>"
    param1: value1
    param2: value2

This configuration will load my_custom_hook from my_module.my_submodule and bind the data keys (key1_key, key2_key), param1, and param2 to the respective values.

The user must name data-related key parameters with the _key suffix. For example, the key for text can be named text_key: "text"; for finetuning, it can be named prompt_key: "prompt".

This is important because it is how the data processing framework distinguishes data keys in read_hook_kwargs from the other parameters in the kwargs. An example of finetuning read hook kwargs is shown below.

processing:
  read_hook: "my_module.my_submodule:my_custom_hook"
  read_hook_kwargs:
    prompt_key: "prompt"
    completion_key: "completion"
    param1: value1
    param2: value2

Tokenizer Initialization

This section describes how the tokenizer is initialized based on the provided processing parameters. The initialization process handles different types of tokenizers, including Hugging Face, GPT-2, NeoX, and custom tokenizers.

Configuration Parameters

  • huggingface_tokenizer: Specifies the Hugging Face tokenizer to use.

  • custom_tokenizer: Specifies the custom tokenizer to use. A custom tokenizer is specified the same way as any other custom module, using module_name:tokenizer_name. gpt2tokenizer and neoxtokenizer are provided as special-case custom tokenizers for legacy reasons. For more details, refer to the custom tokenizer examples below.

  • tokenizer_params: A dictionary of additional parameters for the tokenizer. These parameters are passed to the tokenizer during initialization.

  • eos_id: Optional. Specifies the end-of-sequence token ID. Used if the tokenizer does not have an eos_id.

  • pad_id: Optional. Specifies the padding token ID. Used if the tokenizer does not have a pad_id.
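
For example, explicit eos_id and pad_id overrides can be provided alongside the tokenizer. This is a sketch with placeholder IDs; the values should match the tokenizer's own eos_id and pad_id, if available.

processing:
    huggingface_tokenizer: "bert-base-uncased"
    eos_id: 102     # placeholder ID
    pad_id: 0       # placeholder ID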

Initialization Process

  1. Handling Tokenizer Types:

    • Hugging Face Tokenizer: Initialized using AutoTokenizer from Hugging Face.

    • Custom Tokenizer: Initialized from the user-provided module and class.

  2. GPT-2 and NeoX Tokenizers:

    • Kept as custom tokenizers because they require custom vocab and encoder files for initialization, which are located in ModelZoo. Note that you can still use Hugging Face tokenizers for GPT-2 and NeoX; these custom tokenizers exist for legacy reasons.

  3. Override IDs:

    • Override the eos_id and pad_id if they are specified in the processing parameters. Ensure that the eos_id and pad_id provided in the configuration match the tokenizer’s eos_id and pad_id, if available.

    • For GPT-2 tokenizers, make sure the pad_id is set to the same value as the eos_id.

Example Configurations

Hugging Face Tokenizer

processing:
  huggingface_tokenizer: "bert-base-uncased"
  tokenizer_params:
    param1: value1
    param2: value2

This configuration will initialize the specified Hugging Face tokenizer with the given parameters.

GPT-2 Tokenizer

GPT-2 and NeoX tokenizers are treated as custom tokenizers because they require specific vocab and encoder files for initialization. These files must be provided through the tokenizer_params.

processing:
  custom_tokenizer: "gpt2tokenizer"
  tokenizer_params:
    vocab_file: "path/to/vocab.json"
    encoder_file: "path/to/merges.txt"

NeoX Tokenizer

processing:
  custom_tokenizer: "neoxtokenizer"
  tokenizer_params:
    encoder_file: "path/to/encoder.json"

Custom Tokenizer

processing: 
  custom_tokenizer: "path.to.module:tokenizer_class"
  tokenizer_params:
    param1: "param1"
    param2: "param2"

Output Files Structure

The output directory will contain a number of .h5 files as shown below:

<path/to/output_dir>
├── checkpoint_process_0.txt
├── checkpoint_process_1.txt
├── data_params.json
├── output_chunk_0_0_0_0.h5
├── output_chunk_1_0_0_0.h5
├── output_chunk_1_0_16_1.h5
├── output_chunk_0_0_28_1.h5
├── output_chunk_0_0_51_2.h5
├── output_chunk_1_0_22_2.h5
├── output_chunk_0_1_0_3.h5
├── ...

  • data_params.json stores the parameters used for generating this set of files.

  • checkpoint_*.txt can be used to resume processing if the run script is killed for any reason. To use this file, set the resume_from_checkpoint flag to True in the processing section of the configuration file, as shown below.
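
For example, to resume an interrupted run, point the same output directory at the existing checkpoint files and enable the flag:

setup:
    output_dir: "./output/"          # same directory as the interrupted run

processing:
    resume_from_checkpoint: True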

Statistics Generated After Preprocessing

After preprocessing has been completed, the following statistics are generated in data_params.json:

  • average_bytes_per_sequence: The average number of bytes per sequence after processing
  • average_chars_per_sequence: The average number of characters per sequence after processing
  • discarded_files: The number of files discarded during processing because the resulting number of token IDs was either greater than the MSL or less than min_sequence_len
  • eos_id: The token ID used to signify the end of a sequence
  • loss_valid_tokens: The number of tokens on which loss is computed
  • n_examples: The total number of examples (sequences) that were processed
  • non_pad_tokens: The number of non-pad tokens
  • normalized_bytes_count: The total number of bytes after normalization (e.g., UTF-8 encoding)
  • normalized_chars_count: The total number of characters after normalization (e.g., lowercasing, removing special characters)
  • num_masked_tokens: The total number of tokens that were masked (used in tasks like masked language modeling)
  • num_pad_tokens: The total number of padding tokens used to equalize the length of the sequences
  • num_tokens: The total number of tokens
  • pad_id: The token ID used as padding
  • processed_files: The number of files successfully processed after tokenizing
  • raw_bytes_count: The total number of bytes before any processing
  • raw_chars_count: The total number of characters before any processing
  • successful_files: The number of files that were successfully processed without any issues
  • total_raw_docs: The total number of raw docs present in the input data
  • raw_docs_skipped: The number of raw docs that were skipped due to missing sections in the data
  • vocab_size: The size of the vocabulary used in the tokenizer

What’s Next?

Now that you’ve mastered the essentials of data preprocessing on Cerebras Systems, dive deeper into configuring your input data with our detailed guide on Input Data Configuration on Cerebras Systems. This guide will help you set up and manage local and Hugging Face data sources effectively, ensuring seamless integration into your preprocessing workflow.

Additionally, explore the various read hooks available for data processing. These read hooks are tailored to handle different types of input data, preparing it for specific machine learning tasks. Understanding and utilizing these read hooks will further enhance your data preprocessing capabilities, leading to better model performance and more accurate results.