We’ll walk through configuring the YAML file used to define your preprocessing setup, including how to specify input data, set processing parameters, and initialize tokenizers.

Ready to jump into data preprocessing? Check out our quickstart guide.

The configuration is made up of three main sections:

  • setup: Environment configuration parameters

  • processing: Common preprocessing parameters

  • dataset: Task-specific or token generator parameters

Refer to the sections below for more details.

Refer to example configuration files here.

Setup Section

Use the setup section to configure the environment and parameters required for processing tasks. This is where you specify the output directory, the image directory (for multimodal tasks), input data handling, the number of processes, and the preprocessing mode. A sample setup section is shown after the list below.

  1. output_dir specifies the directory where output files will be saved.

    • Default: ./output/ if not explicitly set.

    • This directory is essential for storing all generated output files.

  2. image_dir determines the directory path where image files will be saved for multimodal tasks. Used only when is_multimodal is True in the dataset section.

  3. data configures input data handling, including data source and format. Learn more about using these parameters to configure input data in our Configure Input Data guide.

    • source:

      • For local data, this defines the directory path containing input files and is mandatory.

      • For Hugging Face datasets, this specifies the dataset name from the Hugging Face hub.

    • type specifies how the input data is accessed:

      • local: Reads data from a specified directory. Supported file formats for local data: .jsonl, .json.gz, .jsonl.zst, .jsonl.zst.tar, .parquet, .txt, and .fasta.

      • huggingface: Loads datasets using Hugging Face’s load_dataset function with additional parameters.

    • top_level_as_subsets: If True, all top-level directories in your source (if using local data) are processed as separate datasets. Defaults to False if not specified.

    • subsets: To process only specific subdirectories in your source (if using local data), provide a list of those directories (subsets: ["subset_1", "subset_2" ...]).

    • split (required for Hugging Face datasets) indicates the dataset split to process (e.g., train, validation).

    • kwargs specifies additional parameters passed to the load_dataset function when using Hugging Face datasets.

  4. mode specifies the processing mode (pretraining, finetuning, or others). Learn more about modes here.

  5. processes determines the number of processes to use for the task. Default value is 1. If set to 0, it automatically uses all available CPU cores for optimal performance.

  6. token_generator (custom mode only) specifies which token generator to use. The value is split to extract the token generator’s name, enabling the system to initialize and use the specified token generator during processing.
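
Putting these parameters together, a minimal setup section might look like the sketch below. The directory path is a placeholder, and the comments note the common alternatives described above.

setup:
    output_dir: "./output/"
    data:
        type: "local"                      # or "huggingface" to load from the Hugging Face hub
        source: "/path/to/input/files"     # local directory of input files, or a Hugging Face dataset name
    mode: "pretraining"                    # pretraining, finetuning, dpo, nlg, or custom
    processes: 4                           # 0 uses all available CPU cores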

Split the Dataset

The setup section also supports powerful options for dividing your dataset into training, validation, and test splits, along with optional context-based splits.

  1. data_splits_dir: Top-level directory where split datasets will be saved.

  2. data_splits: Use this to define how your data is split across different stages (e.g., train, val, test).

    • split_fraction: Fraction of the data assigned to each split. The fractions across all splits must sum to 1.

data_splits:
  train:
    split_fraction: 0.8
  val:
    split_fraction: 0.2

  3. context_splits: (optional) Further divides a split (such as train or val) based on Maximum Sequence Lengths (MSLs).

    • MSL_List: List of MSL values to apply.

    • split_fractions: Corresponding fractions for each MSL. Must sum to 1.

Example nested in a data split:

data_splits:
  train:
    split_fraction: 0.8
    context_splits:
      MSL_List: [128, 512]
      split_fractions: [0.5, 0.5]
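
Putting the pieces together, a complete split configuration with a top-level output directory might look like the following sketch (the directory path and fractions are placeholders):

data_splits_dir: "./data_splits/"
data_splits:
  train:
    split_fraction: 0.8
    context_splits:
      MSL_List: [128, 512]
      split_fractions: [0.5, 0.5]
  val:
    split_fraction: 0.2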

Learn more about splitting datasets in Dataset Splitting and Preprocessing.

For optimal processing, it is recommended that all files (except .txt files) contain a substantial amount of text; ideally, each individual file should be on the order of gigabytes (GB).

Modes

In the setup section of the configuration file, the mode specifies the processing approach for the dataset. It determines how dataset parameters are managed and which token generator is initialized:

pretraining is used for pretraining tasks. Depending on the dataset configuration, different token generators are initialized:

  • If the is_multimodal parameter is set in the dataset section, multimodal pretraining data preprocessing is performed.

  • If the training objective is “Fill In the Middle” (FIM), it initializes the FIMTokenGenerator.

  • If the use_vsl parameter is set to True, it initializes the VSLPretrainingTokenGenerator. Otherwise, it initializes the PretrainingTokenGenerator.

finetuning is used for finetuning tasks. Depending on the dataset configuration, different token generators are initialized:

  • If the is_multimodal parameter is set in the dataset section, multimodal finetuning data preprocessing is performed.

  • If the use_vsl parameter is set to True, it initializes the VSLFinetuningTokenGenerator. Otherwise, it initializes the FinetuningTokenGenerator.

Other Modes

  • dpo: This mode initializes the DPOTokenGenerator. It is used for Direct Preference Optimization (DPO) tasks.

  • nlg: This mode initializes the NLGTokenGenerator. It is used for natural language generation tasks.

  • custom: This mode allows user-defined processing by plugging in your own custom token generator, specified via the token_generator parameter in the setup section.
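
For the custom mode, the token generator is specified in the setup section. The sketch below assumes the module_name:class_name convention used elsewhere in this config; the module and class names are placeholders.

setup:
    mode: "custom"
    token_generator: "my_package.my_module:MyTokenGenerator"   # hypothetical module and class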

Processing Section

Use the processing section to initialize parameters for preprocessing tasks and set up class attributes based on the provided configuration.

Initialization and Configuration Params

  • resume_from_checkpoint: Boolean flag indicating whether to resume processing from a checkpoint. Defaults to False.

  • max_seq_length: Specifies the maximum sequence length for processing. Defaults to 2048.

  • min_sequence_len: Specifies the minimum sequence length of the tokenized document. Documents with fewer than min_sequence_len tokens are discarded.

  • fraction_of_RAM_alloted: Upper limit on fraction of RAM allocated for processing. Defaults to 0.7 (70% of available RAM).

Data Handling Params

  • read_chunk_size: The size of chunks to read from the input data, specified in KB. Defaults to 1024 KB (1 MB).

  • write_chunk_size: The size of chunks to write to the output data, specified in KB. Defaults to 1024 KB (1 MB).

  • write_in_batch: Boolean flag indicating whether to write data in batches. Defaults to False.

  • shuffle: Boolean flag indicating whether to shuffle the data. Defaults to False. If True, the shuffle seed is also set.

  • shuffle_seed: The seed for shuffling data. Defaults to 0 if not specified.

  • token_limit: Stop the data preprocessing pipeline after a specified number of tokens are processed.
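
For reference, these data handling parameters might appear in the processing section as follows (a sketch with illustrative values only):

processing:
    read_chunk_size: 1024       # KB read from the input data per chunk (1 MB)
    write_chunk_size: 1024      # KB written to the output data per chunk (1 MB)
    write_in_batch: True
    shuffle: True
    shuffle_seed: 0
    token_limit: 1024000        # stop preprocessing after this many tokens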

Read Hooks Params

  • read_hook: Path to the read hook function, specified in module_name:func_name format. See an example here. Defaults to None. A read_hook must be provided for every preprocessing run.

  • read_hook_kwargs: A dictionary of keyword arguments for the read hook function. Must include the keys to be used for data processing with the naming convention *_key. See an example here.

Tokenization Params

  • huggingface_tokenizer: Specifies the Hugging Face tokenizer to use.

  • custom_tokenizer: Specifies the custom tokenizer to use.

  • tokenizer_params: Additional parameters used to initialize the tokenizer. For more details, see Tokenizer Initialization below.

  • input_ids_dtype: dtype of processed input_ids. Defaults to int32.

  • input_mask_dtype: dtype of processed input loss masks. Defaults to int32.

Text Processing Params

  • use_ftfy: Boolean flag indicating whether or not to fix text with ftfy.

  • ftfy_normalizer: Choose what kind of unicode normalization is applied. Usually, we apply NFC normalization, so that letters followed by combining characters become single combined characters. Using None applies no normalization while fixing text. Defaults to NFC.

  • wikitext_detokenize: Use wikitext detokenizer to fix text. Defaults to False.

Sequence Control Params

  • short_seq_prob: Probability of creating sequences which are shorter than the maximum sequence length. Defaults to 0.0.

Semantic Masks and Weights Params

  • semantic_drop_mask: Dictionary which indicates which semantic region to drop from input data before tokenization. Defaults to {}.

  • semantic_loss_weight: Dictionary which indicates the loss mask of the different semantic regions post tokenization. Defaults to {}.

  • semantic_attention_mask: Dictionary which indicates the attention mask of the different semantic regions. Defaults to {}.
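
As a hypothetical sketch, assuming the dictionary keys are semantic region names produced by your read hook (for example, prompt and completion in a finetuning setup), these parameters might look like this:

processing:
    semantic_drop_mask:          # drop no regions before tokenization
        prompt: False
        completion: False
    semantic_loss_weight:        # compute loss only on the completion region
        prompt: 0
        completion: 1
    semantic_attention_mask:     # attend to both regions
        prompt: True
        completion: True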

The keys provided in the read_hook_kwargs flag should end with *_key.

The max_seq_length specified in the processing section of the data config should match max_position_embeddings in the model section of the model’s config. Also make sure that vocab_size in the model section of the model’s config matches the vocabulary size of the tokenizer used for data preprocessing.

Example Processing Section

processing:
    huggingface_tokenizer: "unsloth/Llama-3.3-70B-Instruct"
    max_seq_length: 4096

    write_in_batch: True

    resume_from_checkpoint: False

    read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:chat_read_hook"
    read_hook_kwargs:
        multi_turn_key: "messages"
        multi_turn_content_key: "content"
        multi_turn_role_key: "role"

    shuffle_seed: 0
    shuffle: False
    use_ftfy: True
    ftfy_normalizer: "NFC"
    wikitext_detokenize: False
    token_limit: 1024000

Dataset Section

The dataset section holds task-specific and token generator parameters. The following parameters are processed:

  • use_vsl: A boolean parameter indicating whether to use VSL (variable sequence length) mode.

  • is_multimodal: A boolean parameter indicating whether the dataset is multimodal.

  • training_objective: Specifies the training objective, which can be either fim or mlm. mlm is for masked language modeling, which is handled by the pretraining token generator.

  • truncate_to_msl: Specifies whether sequences should be truncated to fit within the MSL. This is applicable only to the finetuning and VSL finetuning modes. (For more details, refer to the section on truncation here.)

use_vsl is not supported for multimodal tasks.
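
For example, a dataset section for VSL finetuning with truncation enabled might look like this sketch:

dataset:
    use_vsl: True            # pairs with mode: finetuning to use the VSLFinetuningTokenGenerator
    truncate_to_msl: True    # truncate sequences that do not fit within the MSL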

Read Hook Function

A read hook function is a user-defined function that customizes the way data is processed. It is specified through configuration parameters under the processing section of the config and is crucial for preprocessing datasets. The parameter fields it takes are read_hook and read_hook_kwargs.

  • Always ensure that a read hook function is provided for datasets to handle the data types appropriately.

  • Specify the read hook path in the configuration in the format module_name:func_name to ensure the correct function is loaded and utilized.

Example Configuration

Here is how to specify a read hook function in the configuration:

processing:
  read_hook: "my_module.my_submodule:my_custom_hook"
  read_hook_kwargs:
    key1_key: "<key1_name>"
    key2_key: "<key2_name>"
    param1: value1
    param2: value2

This configuration will load my_custom_hook from my_module.my_submodule and bind the data keys (key1_key, key2_key), param1, and param2 to the respective values.

The user must name data-related key parameters with the _key suffix. For example, the key for text can be named text_key: "text"; for finetuning, it can be named prompt_key: "prompt".

This is important because it is how the data processing framework distinguishes data keys in read_hook_kwargs from the other parameters in the kwargs. An example of finetuning read hook kwargs is shown below.

processing:
  read_hook: "my_module.my_submodule:my_custom_hook"
  read_hook_kwargs:
    prompt_key: "prompt"
    completion_key: "completion"
    param1: value1
    param2: value2

Tokenizer Initialization

This section describes how the tokenizer is initialized based on the provided processing parameters. The initialization process handles different types of tokenizers, including Hugging Face, GPT-2, NeoX, and custom tokenizers.

Configuration Parameters

  • huggingface_tokenizer: Specifies the Hugging Face tokenizer to use.

  • custom_tokenizer: Specifies the custom tokenizer to use. A custom tokenizer is specified the same way as any other custom module, using module_name:tokenizer_name. gpt2tokenizer and neoxtokenizer are provided as special-case custom tokenizers for legacy reasons. For more details, refer to the custom tokenizer examples below.

  • tokenizer_params: A dictionary of additional parameters for the tokenizer. These parameters are passed to the tokenizer during initialization.

  • eos_id: Optional. Specifies the end-of-sequence token ID. Used if the tokenizer does not have an eos_id.

  • pad_id: Optional. Specifies the padding token ID. Used if the tokenizer does not have a pad_id.
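
For example, explicit eos_id and pad_id overrides can be provided alongside the tokenizer. This is a sketch with placeholder IDs; the values should match the tokenizer's own eos_id and pad_id, if available.

processing:
    huggingface_tokenizer: "bert-base-uncased"
    eos_id: 102     # placeholder ID
    pad_id: 0       # placeholder ID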

Initialization Process

  1. Handling Tokenizer Types:

    • Hugging Face Tokenizer: Initialized using AutoTokenizer from Hugging Face.

    • Custom Tokenizer: Initialized from the user-provided module and class.

  2. GPT-2 and NeoX Tokenizers:

    • Kept as custom tokenizers because they require custom vocab and encoder files for initialization, which are located in ModelZoo. Note that you can still use Hugging Face tokenizers for GPT-2 and NeoX; these custom tokenizers exist for legacy reasons.

  3. Override IDs:

    • Override the eos_id and pad_id if they are specified in the processing parameters. Ensure that the eos_id and pad_id provided in the configuration match the tokenizer’s eos_id and pad_id, if available.

    • For GPT-2 tokenizers, make sure the pad_id is set to the same value as the eos_id.

Example Configurations

Hugging Face Tokenizer

processing:
  huggingface_tokenizer: "bert-base-uncased"
  tokenizer_params:
    param1: value1
    param2: value2

This configuration will initialize the specified Hugging Face tokenizer with the given parameters.

GPT-2 Tokenizer

GPT-2 and NeoX tokenizers are treated as custom tokenizers because they require specific vocab and encoder files for initialization. These files must be provided through the tokenizer_params.

processing:
  custom_tokenizer: "gpt2tokenizer"
  tokenizer_params:
    vocab_file: "path/to/vocab.json"
    encoder_file: "path/to/merges.txt"

NeoX Tokenizer

processing:
  custom_tokenizer: "neoxtokenizer"
  tokenizer_params:
    encoder_file: "path/to/encoder.json"

Custom Tokenizer

processing: 
  custom_tokenizer: "path.to.module:tokenizer_class"
  tokenizer_params:
    param1: "param1"
    param2: "param2"

Output Files Structure

The output directory will contain a number of .h5 files as shown below:

<path/to/output_dir>
├── checkpoint_process_0.txt
├── checkpoint_process_1.txt
├── data_params.json
├── output_chunk_0_0_0_0.h5
├── output_chunk_1_0_0_0.h5
├── output_chunk_1_0_16_1.h5
├── output_chunk_0_0_28_1.h5
├── output_chunk_0_0_51_2.h5
├── output_chunk_1_0_22_2.h5
├── output_chunk_0_1_0_3.h5
├── ...

  • data_params.json stores the parameters used for generating this set of files.

  • checkpoint_*.txt can be used to resume processing if the run script is killed for any reason. To use this file, set the resume_from_checkpoint flag to True in the processing section of the configuration file, as shown below.
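
For example, to resume an interrupted run, point the same output directory at the existing checkpoint files and enable the flag:

setup:
    output_dir: "./output/"          # same directory as the interrupted run

processing:
    resume_from_checkpoint: True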

Statistics Generated After Preprocessing

After preprocessing has been completed, the following statistics are generated in data_params.json:

  • average_bytes_per_sequence: The average number of bytes per sequence after processing
  • average_chars_per_sequence: The average number of characters per sequence after processing
  • discarded_files: The number of files discarded during processing because the resulting number of token IDs was either greater than the MSL or less than min_sequence_len
  • eos_id: The token ID used to signify the end of a sequence
  • loss_valid_tokens: The number of tokens on which loss is computed
  • n_examples: The total number of examples (sequences) that were processed
  • non_pad_tokens: The number of non-pad tokens
  • normalized_bytes_count: The total number of bytes after normalization (e.g., UTF-8 encoding)
  • normalized_chars_count: The total number of characters after normalization (e.g., lowercasing, removing special characters)
  • num_masked_tokens: The total number of tokens that were masked (used in tasks like masked language modeling)
  • num_pad_tokens: The total number of padding tokens used to equalize the length of the sequences
  • num_tokens: The total number of tokens
  • pad_id: The token ID used as padding
  • processed_files: The number of files successfully processed after tokenizing
  • raw_bytes_count: The total number of bytes before any processing
  • raw_chars_count: The total number of characters before any processing
  • successful_files: The number of files that were successfully processed without any issues
  • total_raw_docs: The total number of raw docs present in the input data
  • raw_docs_skipped: The number of raw docs that were skipped due to missing sections in the data
  • vocab_size: The size of the vocabulary used in the tokenizer

What’s Next?

Now that you’ve mastered the essentials of data preprocessing on Cerebras Systems, dive deeper into configuring your input data with our detailed guide on Input Data Configuration on Cerebras Systems. This guide will help you set up and manage local and Hugging Face data sources effectively, ensuring seamless integration into your preprocessing workflow.

Additionally, explore the various read hooks available for data processing. These read hooks are tailored to handle different types of input data, preparing it for specific machine learning tasks. Understanding and utilizing these read hooks will further enhance your data preprocessing capabilities, leading to better model performance and more accurate results.