- `setup`: Environment configuration parameters
- `processing`: Common preprocessing parameters
- `dataset`: Task-specific or token generator parameters
Setup Section
Use the setup section to configure the environment and parameters required for processing tasks. This is where you specify the output directory, the image directory (for multimodal tasks), input data handling, the number of processes, and the preprocessing mode.
- `output_dir` specifies the directory where output files will be saved. Defaults to `./output/` if not explicitly set. All generated output files are stored in this directory.
- `image_dir` determines the directory path where image files will be saved for multimodal tasks. Used only when `is_multimodal` is `True` in the `dataset` section.
- `data` configures input data handling, including the data source and format. Learn more about using these parameters to configure input data in our Configure Input Data guide.
  - `source`:
    - For local data, this defines the directory path containing input files and is mandatory.
    - For Hugging Face datasets, this specifies the dataset name from the Hugging Face Hub.
  - `type` specifies how the input data is accessed:
    - `local`: Reads data from a specified directory. Supported file formats for local data: `.jsonl`, `.json.gz`, `.jsonl.zst`, `.jsonl.zst.tar`, `.parquet`, `.txt`, and `.fasta`.
    - `huggingface`: Loads datasets using Hugging Face's `load_dataset` function with additional parameters.
  - `top_level_as_subsets`: If `True`, all top-level directories in your `source` (if using local data) are processed as separate datasets. Defaults to `False` if not specified.
  - `subsets`: To process only specific subdirectories in your `source` (if using local data), provide a list of those directories (`subsets: ["subset_1", "subset_2", ...]`).
  - `split` (required for Hugging Face datasets) indicates the dataset split to process (e.g., `train`, `validation`).
  - `kwargs` specifies additional parameters passed to the `load_dataset` function when using Hugging Face datasets.
- `mode` specifies the processing mode (pretraining, finetuning, or others). Learn more about modes here.
- `processes` determines the number of processes to use for the task. Defaults to `1`. If set to `0`, it automatically uses all available CPU cores for optimal performance.
- `token_generator` (custom mode only) specifies which token generator to use. This parameter is split to extract the token generator's name, enabling the system to initialize and use the specified token generator during processing.
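Putting these parameters together, a minimal `setup` section might look like the following sketch. The paths are placeholders, and only parameters described above are used:

```yaml
setup:
  data:
    type: "local"
    source: "/path/to/input_files"   # directory containing .jsonl / .parquet / ... files
  output_dir: "./output/"
  processes: 0                       # 0 = use all available CPU cores
  mode: "pretraining"
```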
Split the Dataset
The setup section also supports powerful options for dividing your dataset into training, validation, and test splits, along with optional context-based splits.

- `data_splits_dir`: Top-level directory where split datasets will be saved.
- `data_splits`: Use this to define how your data is split across different stages (e.g., `train`, `val`, `test`).
- `context_splits` (optional): Further divides a split (like `train` or `val`) based on Maximum Sequence Lengths (MSLs).
  - `MSL_List`: List of MSL values to apply.
  - `split_fractions`: Corresponding fractions for each MSL. Must sum to 1.
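A sketch of a split configuration is shown below. The parameter names come from the descriptions above, but the per-split fraction key and the exact nesting of `context_splits` are assumptions; consult a reference config for the authoritative schema:

```yaml
setup:
  data_splits_dir: "./output/splits/"
  data_splits:
    train:
      split_fraction: 0.9              # assumed form for the per-split fraction
      context_splits:                  # optional: further divide this split by MSL
        MSL_List: [2048, 8192]         # MSL values to apply
        split_fractions: [0.75, 0.25]  # one fraction per MSL; must sum to 1
    val:
      split_fraction: 0.1
```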
For optimal processing, it is recommended that all files (except `.txt` files) contain a substantial amount of text within each individual file. Ideally, each file should be sized in the range of gigabytes (GB).
Enable Multi-Node Setup
Enable distributed preprocessing across multiple nodes.

- `num_nodes` specifies the total number of nodes to be used in a distributed processing setup.
- `slurm` defines the parameters required to submit and manage the job on a SLURM cluster. If `num_nodes` is greater than 1, specify these additional parameters:
  - `queue`: Name of the SLURM partition or queue to submit jobs to.
  - `cores`: Number of CPU cores per task.
  - `memory`: Memory allocation per task. Accepts formats like `4GB`, `4000MB`, or plain integers.
  - `walltime`: Maximum wall clock time per job. Format: `HH:MM:SS`.
  - `enable_job_tracking` (optional): Boolean flag to enable detailed job status monitoring. Defaults to `False`.
  - `log_dir` (optional): Directory for saving SLURM logs. Defaults to `slurm_logs`.
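For example, a multi-node block might look like the following sketch (the queue name and resource values are placeholders):

```yaml
setup:
  num_nodes: 4
  slurm:
    queue: "compute"            # SLURM partition/queue to submit jobs to
    cores: 16                   # CPU cores per task
    memory: "4GB"               # also accepts "4000MB" or a plain integer
    walltime: "02:00:00"        # HH:MM:SS
    enable_job_tracking: True   # optional
    log_dir: "./slurm_logs"     # optional
```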
Modes
In the setup section of the configuration file, the mode specifies the processing approach for the dataset. It determines how dataset parameters are managed and which token generator is initialized.

`pretraining` is used for pretraining tasks. Depending on the dataset configuration, different token generators are initialized:

- If the dataset uses the `is_multimodal` parameter, multimodal pretraining data preprocessing is performed.
- If the training objective is "Fill In the Middle" (FIM), the `FIMTokenGenerator` is initialized.
- If the `use_vsl` parameter is set to `True`, the `VSLPretrainingTokenGenerator` is initialized. Otherwise, the `PretrainingTokenGenerator` is initialized.
`finetuning` is used for finetuning tasks. Depending on the dataset configuration, different token generators are initialized:

- If the dataset uses the `is_multimodal` parameter, multimodal finetuning data preprocessing is performed.
- If the `use_vsl` parameter is set to `True`, the `VSLFinetuningTokenGenerator` is initialized. Otherwise, the `FinetuningTokenGenerator` is initialized.
Other Modes
- `dpo`: This mode initializes the `DPOTokenGenerator`. It is used for tasks that require specific processing under the `dpo` mode.
- `nlg`: This mode initializes the `NLGTokenGenerator`. It is used for natural language generation tasks.
- `custom`: This mode allows for user-defined processing by plugging in your own custom token generator.
Processing Section
Use the processing section to initialize parameters for preprocessing tasks and set up class attributes based on the provided configuration.
Initialization and Configuration Params
- `resume_from_checkpoint`: Boolean flag indicating whether to resume processing from a checkpoint. Defaults to `False`.
- `max_seq_length`: Specifies the maximum sequence length for processing. Defaults to `2048`.
- `min_sequence_len`: Specifies the minimum sequence length of the tokenized doc. Docs with fewer than `min_sequence_len` tokens will be discarded.
- `fraction_of_RAM_alloted`: Upper limit on the fraction of RAM allocated for processing. Defaults to `0.7` (70% of available RAM).
- `read_chunk_size`: The size of chunks to read from the input data, specified in KB. Defaults to 1024 KB (1 MB).
- `write_chunk_size`: The size of chunks to write to the output data, specified in KB. Defaults to 1024 KB (1 MB).
- `write_in_batch`: Boolean flag indicating whether to write data in batches. Defaults to `False`.
- `shuffle`: Boolean flag indicating whether to shuffle the data. Defaults to `False`. If `True`, the shuffle seed is also set.
- `shuffle_seed`: The seed for shuffling data. Defaults to `0` if not specified.
- `token_limit`: Stops the data preprocessing pipeline after the specified number of tokens has been processed.
- `read_hook`: Path to the read hook function. See an example here. Defaults to `None`. The user must provide the `read_hook` for every preprocessing run.
- `read_hook_kwargs`: A dictionary of keyword arguments for the read hook function. Must include the keys to be used for data processing, following the `*_key` naming convention. See an example here.
- `huggingface_tokenizer`: Parameter to provide a Hugging Face tokenizer.
- `custom_tokenizer`: Parameter to provide a custom tokenizer.
- `tokenizer_params`: Parameter to provide additional parameters for initializing the tokenizer. For more details, see the Tokenizer Initialization section.
- `input_ids_dtype`: dtype of processed `input_ids`. Defaults to `int32`.
- `input_mask_dtype`: dtype of processed input loss masks. Defaults to `int32`.
- `use_ftfy`: Boolean flag indicating whether or not to fix text with ftfy.
- `ftfy_normalizer`: Chooses the kind of Unicode normalization applied. Usually, NFC normalization is applied so that letters followed by combining characters become single combined characters. Using `None` applies no normalization while fixing text. Defaults to `NFC`.
- `wikitext_detokenize`: Use the wikitext detokenizer to fix text. Defaults to `False`.
- `short_seq_prob`: Probability of creating sequences which are shorter than the maximum sequence length. Defaults to `0.0`.
- `semantic_drop_mask`: Dictionary indicating which semantic regions to drop from the input data before tokenization. Defaults to `{}`.
- `semantic_loss_weight`: Dictionary indicating the loss mask of the different semantic regions post tokenization. Defaults to `{}`.
- `semantic_attention_mask`: Dictionary indicating the attention mask of the different semantic regions. Defaults to `{}`.
The keys provided in the `read_hook_kwargs` flag should end with `*_key`. The `max_seq_length` specified in the processing section of the data config should match `max_position_embeddings` in the model section of the model's config. Also make sure that `vocab_size` in the model section of the model's config matches the vocab size of the tokenizer used for data preprocessing.
Example Processing Section
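The following is an illustrative sketch of a processing section, assuming a Hugging Face tokenizer and a hypothetical read hook module (`my_hooks:pretraining_read_hook` is a placeholder, not a shipped hook):

```yaml
processing:
  huggingface_tokenizer: "gpt2"
  read_hook: "my_hooks:pretraining_read_hook"   # placeholder module:function path
  read_hook_kwargs:
    text_key: "text"            # data keys must end in `_key`
  max_seq_length: 2048
  shuffle: True
  shuffle_seed: 0
  use_ftfy: True
  ftfy_normalizer: "NFC"
  resume_from_checkpoint: False
```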
Dataset Section
The following dataset parameters are also processed:

- `use_vsl`: A boolean parameter indicating whether to use VSL (variable sequence length) mode.
- `is_multimodal`: A boolean parameter indicating whether the dataset is multimodal.
- `training_objective`: Specifies the training objective, which can be either `fim` or `mlm`. `mlm` is for masked language modeling, which is part of the pretraining token generator.
- `truncate_to_msl`: Specifies whether sequences should be truncated to fit within the MSL. This is applicable only to the `FineTuning` and `VSLFineTuning` modes. (For more details, please refer to the section on truncation here.)
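For example, a dataset section for VSL finetuning with truncation enabled might look like this minimal sketch:

```yaml
dataset:
  use_vsl: True
  truncate_to_msl: True
  is_multimodal: False
```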
`use_vsl` is not supported for multimodal tasks.
Read Hook Function
A read hook function is a user-defined function that customizes the way data is processed. This function is specified through configuration parameters under the processing section of the config and is crucial for preprocessing datasets. The parameter fields it takes are `read_hook` and `read_hook_kwargs`.
Example Configuration
Here is how to specify a read hook function in the configuration. In the sketch below, replace the data keys (`text_key` here), `param1`, and `param2` with the respective values.
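A sketch of such a configuration follows; the module path `my_hooks:custom_read_hook` and the extra parameters are hypothetical:

```yaml
processing:
  read_hook: "my_hooks:custom_read_hook"   # hypothetical module:function path
  read_hook_kwargs:
    text_key: "text"       # data key, must follow the `*_key` naming convention
    param1: "value1"       # additional keyword arguments forwarded to the hook
    param2: "value2"
```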
The user must name data-related keys with the `_key` suffix. For example, the key for text can be named `text_key: "text"`; for finetuning, it can be named `prompt_key: "prompt"`. This is important because it is how the data processing framework distinguishes data keys in `read_hook_kwargs` from the other parameters in the kwargs. An example of finetuning read hook kwargs is shown below.
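A sketch of finetuning read hook kwargs is shown below; `completion_key` is an assumed name for the second data key, included only for illustration:

```yaml
read_hook_kwargs:
  prompt_key: "prompt"
  completion_key: "completion"   # assumed key name for the response field
```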
Tokenizer Initialization
This section describes how the tokenizer is initialized based on the provided processing parameters. The initialization process handles different types of tokenizers, including Hugging Face, GPT-2, NeoX, and custom tokenizers.
Configuration Parameters
- `huggingface_tokenizer`: Specifies the Hugging Face tokenizer to use.
- `custom_tokenizer`: Specifies the custom tokenizer to use. A custom tokenizer is specified in the same way as any other custom module: `module_name:tokenizer_name`. `gpt2tokenizer` and `neoxtokenizer` are provided as special-case custom tokenizers for legacy reasons. For more details about custom tokenizers, refer to the custom tokenizer section.
- `tokenizer_params`: A dictionary of additional parameters for the tokenizer. These parameters are passed to the tokenizer during initialization.
- `eos_id`: Optional. Specifies the end-of-sequence token ID. Used if the tokenizer does not have an `eos_id`.
- `pad_id`: Optional. Specifies the padding token ID. Used if the tokenizer does not have a `pad_id`.
Initialization Process
- Handling Tokenizer Types:
  - Hugging Face Tokenizer: Initialized using `AutoTokenizer` from Hugging Face.
  - Custom Tokenizer: Initialized from the user-provided module and class.
- GPT-2 and NeoX Tokenizers:
  - Kept as custom tokenizers because they require custom `vocab` and `encoder` files for initialization, which are located in ModelZoo. Note that you can still use Hugging Face tokenizers for GPT-2 and NeoX; these custom tokenizers exist for legacy reasons.
- Override IDs:
  - For GPT-2 tokenizers, make sure the `pad_id` is set to the same value as the `eos_id`.
  - Override the `eos_id` and `pad_id` if they are specified in the processing parameters. Ensure that the `eos_id` and `pad_id` provided in the configuration match the tokenizer's `eos_id` and `pad_id`, if available.
Example Configurations
Hugging Face Tokenizer
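For instance, a minimal sketch (the tokenizer name and extra parameters are illustrative):

```yaml
processing:
  huggingface_tokenizer: "gpt2"
  tokenizer_params:
    use_fast: True   # optional kwargs forwarded to the tokenizer (illustrative)
```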
GPT-2 Tokenizer
GPT-2 and NeoX tokenizers are treated as custom tokenizers because they require specific `vocab` and `encoder` files for initialization. These files must be provided through the `tokenizer_params`.
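A sketch is shown below. The file paths are placeholders, and the `tokenizer_params` key names (`vocab_file`, `encoder_file`) are assumptions, so check the ModelZoo reference configs for the exact names:

```yaml
processing:
  custom_tokenizer: "gpt2tokenizer"
  tokenizer_params:
    vocab_file: "/path/to/gpt2/vocab.bpe"        # assumed key name
    encoder_file: "/path/to/gpt2/encoder.json"   # assumed key name
  eos_id: 50256
  pad_id: 50256      # for GPT-2, pad_id should match eos_id
```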
NeoX Tokenizer
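Similarly, a NeoX sketch (the file path is a placeholder and the key name is an assumption):

```yaml
processing:
  custom_tokenizer: "neoxtokenizer"
  tokenizer_params:
    encoder_file: "/path/to/neox/tokenizer.json"   # assumed key name
```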
Custom Tokenizer
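A user-defined tokenizer is referenced with the `module_name:tokenizer_name` form; everything below is a hypothetical sketch:

```yaml
processing:
  custom_tokenizer: "my_package.tokenizers:MyTokenizer"   # hypothetical module and class
  tokenizer_params:
    vocab_file: "/path/to/vocab.txt"   # any kwargs your tokenizer's constructor expects
  eos_id: 0
  pad_id: 0
```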
Output Files Structure
The output directory will contain a number of `.h5` files, as shown below:
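The exact file names depend on the run; an illustrative layout might be:

```
output/
├── checkpoint_0.txt
├── data_params.json
├── output_chunk_0.h5
├── output_chunk_1.h5
└── ...
```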
- `data_params.json` stores the parameters used for generating this set of files.
- `checkpoint_*.txt` can be used for resuming the processing in case the run script gets killed for some reason. To use this file, simply set the `resume_from_checkpoint` flag to `True` in the processing section of the configuration file.
Statistics Generated After Preprocessing
After preprocessing has been completed, the following statistics are generated in `data_params.json`:
| Attribute | Description |
|---|---|
| average_bytes_per_sequence | The average number of bytes per sequence after processing |
| average_chars_per_sequence | The average number of characters per sequence after processing |
| discarded_files | The number of files discarded during processing because the resulting number of token IDs was either greater than the MSL or less than the min_sequence_len |
| eos_id | The token ID used to signify the end of a sequence |
| loss_valid_tokens | The number of tokens on which loss is computed |
| n_examples | The total number of examples (sequences) that were processed |
| non_pad_tokens | The total number of non-pad tokens |
| normalized_bytes_count | The total number of bytes after normalization (e.g., UTF-8 encoding) |
| normalized_chars_count | The total number of characters after normalization (e.g., lowercasing, removing special characters) |
| num_masked_tokens | The total number of tokens that were masked (used in tasks like masked language modeling) |
| num_pad_tokens | The total number of padding tokens used to equalize the length of the sequences |
| num_tokens | The total number of tokens |
| pad_id | The token ID used for padding |
| processed_files | The number of files successfully processed after tokenizing |
| raw_bytes_count | The total number of bytes before any processing |
| raw_chars_count | The total number of characters before any processing |
| successful_files | The number of files that were successfully processed without any issues |
| total_raw_docs | The total number of raw docs present in the input data |
| raw_docs_skipped | The number of raw docs that were skipped due to missing sections in the data |
| vocab_size | The size of the vocabulary used in the tokenizer |