**IMPORTANT**: This preprocessing script for creating HDF5 dataset files is soon going to be officially DEPRECATED.

To ensure optimal performance and compatibility, we strongly recommend transitioning to the script provided in the Data preprocessing document.

Overview

We provide two methods to generate Hierarchical Data Format (HDF5) files (.h5) that you can use in the input pipeline of GPT-style models to implement an efficient data loader.

If you have a PyTorch dataset and need to convert it to HDF5 format, follow the section Converting a PyTorch dataset to HDF5 format.

If you have raw data and want to convert it to an HDF5 dataset, follow the section Generating HDF5 files of this document.

Converting a PyTorch dataset to HDF5 format

If you have a PyTorch dataset for GPT models (from any source such as HuggingFace, Map-Style, or Iterable), you can write the samples of that dataset in HDF5 format for use with the Cerebras-optimized HDF5 DataProcessor. To do so, call the function convert_dataset_to_HDF5(), which is defined in convert_dataset_to_HDF5.py. The following example shows the conversion of the HuggingFace Eli5 dataset to HDF5:

from cerebras.modelzoo.data_preparation.huggingface.HuggingFace_Eli5 import (
    HuggingFace_Eli5,
)
from cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.convert_dataset_to_HDF5 import (
    convert_dataset_to_HDF5,
)

dataset, data_collator = HuggingFace_Eli5(
  split="train", num_workers=8, sequence_length=128
)

convert_dataset_to_HDF5(
    dataset=dataset,
    data_collator=data_collator,
    output_dir="./eli5_hdf5_dataset/",
    num_workers=8,
)

The function convert_dataset_to_HDF5() uses a PyTorch DataLoader to fetch samples from the specified dataset and writes those samples to .h5 files. The following table explains the arguments to the convert_dataset_to_HDF5() function:

Table 1: convert_dataset_to_HDF5 Arguments

| Argument | Default Value | Description |
| --- | --- | --- |
| dataset | N/A | PyTorch dataset to fetch the data from (IterableDataset or Dataset). |
| output_dir | ./hdf5_dataset/ | Directory where HDF5 files will be stored. |
| name | dataset-partition | Name of the dataset; i.e., the prefix to use for HDF5 file names. |
| samples_per_file | 2000 | Number of samples written to each HDF5 file. |
| num_workers | 8 | Number of Python processes to use for generating data. |
| batch_size | 64 | The batch size to use when fetching the data. |
| data_collator | N/A | Merges a list of samples to form a mini-batch of Tensor(s). |
| dtype | i4 | Data type for the HDF5 dataset. |
| compression | gzip | HDF5 compression strategy. |

While the function convert_dataset_to_HDF5() is generic and can be used with all transformer models, note that the PyTorch dataset features dictionary should have the following keys/values for GPT models:

  • input_ids: Input token IDs, padded with 0 to max_sequence_length.

    • Shape: (batch_size, max_sequence_length)

    • Type: torch.int32

  • attention_mask: Mask for padded positions. Has values 0 on the padded positions and 1 elsewhere.

    • Shape: (batch_size, max_sequence_length)

    • Type: torch.int32

  • labels: Labels for language modeling pre-training task, padded with 0 to max_sequence_length.

    • Shape: (batch_size, max_sequence_length)

    • Type: torch.int32

Two extra keys/values, attention_span and position_ids, are needed to train GPT models on variable sequence length (VSL) samples that are packed into fixed-length sequences:

  • attention_span: Specifies the attention span for each attention key to prevent attending to out-of-sample queries, padded with 0 to max_sequence_length.

    • Shape: (batch_size, max_sequence_length)

    • Type: torch.int32

  • position_ids: Token position index relative to the data sample, padded with 0 to max_sequence_length.

    • Shape: (batch_size, max_sequence_length)

    • Type: torch.int32
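As a minimal sketch of how these two VSL features could be derived when several samples are packed into one sequence, the snippet below assumes that position_ids restart at 0 for every packed sample and that attention_span counts down the tokens remaining in the sample. Both conventions and the helper name build_vsl_features are assumptions made for illustration, not a description of the Cerebras implementation.

```python
def build_vsl_features(sample_lengths, max_sequence_length):
    """Illustrative only: derive position_ids and attention_span for packed samples.

    Assumes position_ids restart at 0 per sample and attention_span counts
    down the remaining tokens in the sample (both are assumptions).
    """
    position_ids, attention_span = [], []
    for n in sample_lengths:
        position_ids.extend(range(n))                # 0, 1, ..., n-1
        attention_span.extend(range(n - 1, -1, -1))  # n-1, ..., 1, 0
    pad = max_sequence_length - len(position_ids)
    if pad < 0:
        raise ValueError("packed samples exceed max_sequence_length")
    # Both features are padded with 0 up to max_sequence_length.
    return position_ids + [0] * pad, attention_span + [0] * pad


# Two samples of length 3 and 2 packed into a sequence of length 8.
position_ids, attention_span = build_vsl_features([3, 2], max_sequence_length=8)
```

The point of the sketch is only that both features are computed per packed sample and then padded to the fixed sequence length.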

NOTE:

  1. More information on using HuggingFace datasets can be found in this document: Using HuggingFace datasets for auto-regressive LM.

  2. attention_mask here actually represents the loss mask and is used as the loss mask by our GPT-style models. It can mask out components such as padding tokens or prompt tokens that shouldn’t be included in the loss calculation. For example:

  • If we do autoregressive language modeling with the input in the format [input_ids, padding_tokens] (LMData), attention_mask will look something like [1, 1, …, 1, 0, 0, 0, …, 0], where the 1’s correspond to input_ids and the 0’s to padding_tokens.

  • If we do prompted generation such as instruction tuning with the input in the format [prompt_ids, input_ids, padding_tokens(optional)] (Summarization), attention_mask will look something like [0, 0, …, 0, 1, 1, …, 1, 0, 0, 0, …, 0], where the first chunk of 0’s corresponds to prompt_ids, the 1’s correspond to input_ids, and the second chunk of 0’s corresponds to padding_tokens.
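The two mask layouts described above can be sketched in a few lines of Python. The helper name build_loss_mask and its parameters are hypothetical and exist only to make the layout concrete.

```python
def build_loss_mask(prompt_len, completion_len, max_seq_len):
    """Illustrative loss mask: 0 for prompt and padding positions, 1 elsewhere."""
    total = prompt_len + completion_len
    if total > max_seq_len:
        raise ValueError("sequence is longer than max_seq_len")
    return [0] * prompt_len + [1] * completion_len + [0] * (max_seq_len - total)


# LMData-style layout: no prompt, five input tokens, three padding tokens.
lm_mask = build_loss_mask(prompt_len=0, completion_len=5, max_seq_len=8)

# Summarization-style layout: three prompt tokens, four input tokens, one pad.
summarization_mask = build_loss_mask(prompt_len=3, completion_len=4, max_seq_len=8)
```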

Set up environment

NOTE: Model Zoo environment setup is required for a clean run of the data preprocessing script.

Set up the Model Zoo environment as described in PYTHON-SETUP.md.

Input files format

Except in Customize mode, ensure the input text documents are in one of the supported file formats before using the provided script. The acceptable file formats are '.jsonl', '.json.gz', '.jsonl.zst', '.jsonl.zst.tar', and '.txt'. The data in these files must follow the structure described in the Input data format section.

For optimal processing, we recommend that each file in any of the above formats other than .txt contain a substantial amount of text. The recommended size for each file is on the order of gigabytes.

Conversely, if you are processing smaller files in .txt format, provide a metadata file containing a list of paths to these files to make better use of multi-processing.
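As a rough sketch, a metadata file for many small .txt files could be produced as follows. The one-path-per-line layout and the helper name write_metadata_file are assumptions made for illustration; check the script itself for the exact layout it expects.

```python
import tempfile
from pathlib import Path


def write_metadata_file(input_dir, metadata_path):
    """Write the path of every .txt file under input_dir, one per line (assumed layout)."""
    paths = sorted(str(p) for p in Path(input_dir).rglob("*.txt"))
    Path(metadata_path).write_text("".join(p + "\n" for p in paths))
    return paths


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "raw"
    data_dir.mkdir()
    for name in ("a.txt", "b.txt"):
        (data_dir / name).write_text("some text\n")
    metadata_path = Path(tmp) / "metadata.list"
    listed = write_metadata_file(data_dir, metadata_path)
    metadata_lines = metadata_path.read_text().splitlines()
```

The resulting file would then be passed through the metadata_files argument described in Table 2.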

Input data format

As mentioned above, the preprocessing script accepts two primary types of input files: .jsonl-based or .txt-based. For each type, the input data must follow a specific structure to be accurately converted into HDF5 files.

Format for jsonl files

The raw text and metadata for generation should be represented in the .jsonl-based files as:

{"text": "Any text excerpt from the dataset of interest...\nThe new lines should be represented as a newline character.",
"meta": {"info": "any other metadata dict"}}

For the jsonl files, as shown above, the raw text is extracted from key=text in the input files by default. If your input files do not contain a text key, identify the key that holds the text you need to extract and pass it with the command line argument --jsonl_key=<your key name>.

For example, if your jsonl files have the content such as the following:

{"idea": "Any text excerpt from the dataset of interest, with custom key: 'idea'...\nThe new lines should be represented as a newline character.",
"meta": {"info": "any other metadata dict"}}

then you’d need to pass --jsonl_key=idea in the command line arguments.
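To make the role of the key concrete, here is a minimal sketch of extracting text from JSONL lines by a configurable key. The function iter_text is hypothetical and is not the reader the script uses internally.

```python
import json


def iter_text(lines, jsonl_key="text"):
    """Yield the value stored under jsonl_key for every non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if jsonl_key not in record:
            raise KeyError(f"key {jsonl_key!r} not found; pass the correct key name")
        yield record[jsonl_key]


# One record that stores its text under the custom key "idea".
sample_lines = ['{"idea": "first line\\nsecond line", "meta": {"info": "x"}}']
texts = list(iter_text(sample_lines, jsonl_key="idea"))
```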

Format for txt based files in LMData mode

Always represent raw text for generation in .txt-based files as:

Any text excerpt from the dataset of interest...
The new lines may not be represented as a newline character.

Note that there are no special tags or anchors in the above. If any exist, they are treated as part of the text: the whole file is handled as a single document, which may not represent natural language.

For example, the following text is tokenized in its entirety, tags included:

<DOC>
    <DOCNO>TRC2-2008-01-01-0000</DOCNO>
    <BLOGS08DAY>-13</BLOGS08DAY>
    <CONTENT>
    Example content in the format that may be outside of the
    </DOCS>

Format for PARQUET based files in LMData mode

Column Name = "text": "Any text excerpt from the dataset of interest...\nThe new lines should be represented as a newline character.", Column Name = "abc": "...." etc

For the `parquet` files, as shown above, the raw text is extracted by default from the column named `text` in the input files. If your input files do not contain a column named `text`, identify the column that holds the text you need to extract and pass its name with the command line argument `--jsonl_key=<your key name>`.

For example, if your parquet files have the content shown below and you want to extract the value from the column named `idea`:

```parquet
Column Name = "idea": "Any text excerpt from the dataset of interest, with custom key: 'idea'...\nThe new lines should be represented as a newline character.",
Column Name = "abc": "...."
etc
```

then you’d need to pass --jsonl_key=idea in the command line arguments.

Definition of vocab_file and encoder_file

The script supports two kinds of tokenizers: GPT2Tokenizer and NeoXTokenizer.

Supply the correct vocab_file and encoder_file when using the desired tokenizer.

  • For GPT2Tokenizer, vocab_file=gpt2-vocab.bpe and encoder_file=gpt2-encoder.json

  • For NeoXTokenizer, encoder_file=/neox-encoder.json

These files can be found here.

Note: For the GPT2Tokenizer, we follow the nomenclature used by OpenAI in their implementation, which differs slightly from Hugging Face’s nomenclature, where the vocab_file is called merges_file and the encoder_file is called vocab_file. However, the content of the files is the same. For NeoXTokenizer, we use the same nomenclature to avoid confusion.

Generating HDF5 files

Once you have a text dataset that meets the above requirements, you can generate HDF5 files using the create_hdf5_dataset.py script:

python create_hdf5_dataset.py [mode] [--arguments]

The mode selects the dataset processor. The modes documented here are LMData, Summarization, FIM, LMData_VSL, Summarization_VSL, LlavaPhaseOne, LlavaPhaseTwo, and Customize. All modes share the same setup and processing arguments but differ in their dataset arguments, as detailed below:

Table 2: Setup Arguments

| Argument | Default Value | Description |
| --- | --- | --- |
| params | N/A | Path to YAML config file for setting dataset preprocessing parameters. Optional alternative to providing command line arguments. |
| input_dir | N/A | Directory where raw data is stored. Supports only the formats: ['.jsonl', '.jsonl.zst', '.jsonl.zst.tar', '.txt']. |
| metadata_files | N/A | Path to a text file containing a list of file names corresponding to the raw input documents to be processed and stored; multiple metadata files can be given, separated by commas. |
| output_dir | ./data_dir/ | Directory where HDF5 files will be stored. |
| processes | cpu count | Number of processes to use. |
| module | N/A | Name of the Python file that contains the custom dataset processor (Customize mode only). |
| dataset_processor | N/A | Name of the custom dataset processor (Customize mode only). |

Note: You must provide either the input_dir or the metadata_files argument. If you provide both, only the files referenced in metadata_files are processed.

Table 3: Processing Arguments

| Argument | Default Value | Description |
| --- | --- | --- |
| tokenizer_type | required arg | Type of tokenizer to use for HDF5 dataset generation. Can be one of GPT2Tokenizer or NeoXTokenizer. |
| vocab_file | N/A | Path to the vocabulary file. |
| encoder_file | N/A | Path to the encoder file. |
| max_seq_length | 2048 | Maximum sequence length. |
| short_seq_prob | 0.0 | Probability of creating sequences which are shorter than the maximum sequence length. |
| output_name | examples | Name of the dataset; i.e., the prefix to use for HDF5 file names. |
| files_per_record | 50000 | Text files to write per HDF5 file. |
| write_in_batch | False | Whether to write the samples in batches for the HDF5 format. Setting this to False saves memory but is slightly slower. |
| write_remainder | True | Write the remainder files when data is left over from processing. |
| resume_from_checkpoint | False | Resume record writing from a given checkpoint. |
| display_pbar | True | Display a progress bar while the script runs. |
| seed | 0 | Random seed. |

Table 4: Dataset Arguments (LMData mode)

| Argument | Default Value | Description |
| --- | --- | --- |
| use_ftfy | False | Fix text with ftfy. |
| ftfy_normalizer | NFC | Choose what kind of unicode normalization is applied. Usually, we apply NFC normalization, so that letters followed by combining characters become single combined characters. Using None applies no normalization while fixing text. |
| wikitext_detokenize | False | Use wikitext detokenizer to fix text. |
| jsonl_key | text | The key name in input jsonl files from which the raw text will be extracted for further processing. |
| pack_sequences | True | Concatenate a document smaller than the maximum sequence length with other documents, instead of filling it with padding tokens. |
| min_sequence_len | 10 | Minimum token length; samples shorter than this are skipped. |
| input_ids_dtype | int32 | dtype of processed input_ids. |
| input_mask_dtype | int32 | dtype of processed input loss masks. |
| inverted_mask | False | If False, 0 represents masked positions. If True, 1 represents masked positions. |
| split_text_to_tokenize | False | Whether to split the text into smaller chunks before tokenization. This is helpful for very long documents with tokenizers such as the Llama tokenizer, whose running time is quadratic in the text length. |
| chunk_len_to_split | 2000 | Length of the chunks into which the text is split before tokenization for slower tokenizers. Used together with split_text_to_tokenize; without that flag, this argument is ignored. |
| remove_bos_in_chunks | False | Whether to remove the BOS token from the beginning of the chunks. Set this to True when using split_text_to_tokenize and chunk_len_to_split to avoid having multiple BOS tokens in the middle of the text. Not applicable to all tokenizers. |
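The effect of pack_sequences can be illustrated with a simplified sketch. It assumes documents are already tokenized, ignores details such as end-of-document tokens and min_sequence_len, and uses a hypothetical helper build_sequences; it is not the script's actual packing logic.

```python
def build_sequences(docs, max_seq_length, pack_sequences=True, pad_id=0):
    """Turn tokenized docs into fixed-length sequences, packed or padded."""
    sequences = []
    if pack_sequences:
        buffer = []
        for doc in docs:
            buffer.extend(doc)  # concatenate documents back to back
            while len(buffer) >= max_seq_length:
                sequences.append(buffer[:max_seq_length])
                buffer = buffer[max_seq_length:]
        if buffer:  # pad only the final, partially filled sequence
            sequences.append(buffer + [pad_id] * (max_seq_length - len(buffer)))
    else:
        for doc in docs:
            doc = doc[:max_seq_length]  # simplification: truncate long docs
            sequences.append(doc + [pad_id] * (max_seq_length - len(doc)))
    return sequences


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = build_sequences(docs, max_seq_length=4, pack_sequences=True)
padded = build_sequences(docs, max_seq_length=4, pack_sequences=False)
```

In this toy case packing leaves three padding positions in total, while padding each document separately leaves three as well but spreads them over more sequences as documents get shorter; the benefit grows with the share of short documents.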

Table 5: Dataset Arguments (Summarization mode)

| Argument | Default Value | Description |
| --- | --- | --- |
| use_ftfy | False | Fix text with ftfy. |
| ftfy_normalizer | NFC | Choose what kind of unicode normalization is applied. Usually, we apply NFC normalization, so that letters followed by combining characters become single combined characters. Using None applies no normalization while fixing text. |
| wikitext_detokenize | False | Use wikitext detokenizer to fix text. |
| min_sequence_len | 10 | Minimum token length; samples shorter than this are skipped. |
| sep_token | None | Token added between prompt and completion in preprocessed sequences. If supplied with a non-None value, the tokenizer adds the token to the vocabulary, which changes the vocab size. This may not be advisable when fine-tuning a pre-trained model of a type that does not provision for extra tokens. |
| prompt_key | required arg | JSON key for the prompt. |
| completion_key | required arg | JSON key for the completion. |
| input_ids_dtype | int32 | dtype of processed input_ids. |
| input_mask_dtype | int32 | dtype of processed input loss masks. |
| inverted_mask | False | If False, 0 represents masked positions. If True, 1 represents masked positions. |

Table 6: Dataset Arguments (FIM mode)

| Argument | Default Value | Description |
| --- | --- | --- |
| fim_rate | 0.90 | Float specifying the fraction of the data to which the FIM transformation is applied, instead of leaving it as auto-regressive. |
| spm_rate | 0.50 | Float specifying the fraction of FIM transformations to convert to prefix-suffix-middle (PSM) vs suffix-prefix-middle (SPM) formats. |

The FIM mode is very similar to the LMData mode and uses all the other arguments listed in the LMData table (Table 4). The two additional parameters determine what percentage of samples have the FIM transformation applied, and what percentage of these end up in PSM (prefix, suffix, middle) or SPM format.
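The interplay of fim_rate and spm_rate can be sketched as below. This is an illustration under stated assumptions, not the script's implementation: the sentinel strings are placeholders, the split points are chosen uniformly at random, and spm_rate is assumed to be the fraction of FIM samples emitted in SPM order.

```python
import random


def fim_transform(tokens, fim_rate, spm_rate, rng,
                  sentinels=("<PRE>", "<SUF>", "<MID>")):
    """Illustrative FIM transform; returns (new_tokens, kind)."""
    # Leave the sample auto-regressive with probability 1 - fim_rate.
    if len(tokens) < 2 or rng.random() >= fim_rate:
        return list(tokens), "AR"
    # Pick two cut points to split the sample into prefix, middle, suffix.
    lo, hi = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
    pre, suf, mid = sentinels  # placeholder sentinel tokens (assumption)
    if rng.random() < spm_rate:
        # Suffix first, then prefix, then middle.
        return [pre, suf] + suffix + [mid] + prefix + middle, "SPM"
    # Prefix first, then suffix, then middle.
    return [pre] + prefix + [suf] + suffix + [mid] + middle, "PSM"


tokens = list("abcdef")
unchanged, kind_ar = fim_transform(tokens, 0.0, 0.5, random.Random(0))
psm, kind_psm = fim_transform(tokens, 1.0, 0.0, random.Random(0))
```

With the defaults in Table 6 and this reading of spm_rate, roughly 90% of samples would be transformed and about half of those would use the SPM ordering.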

Note: For CodeLlama, to follow the note here specify the EOT token as the EOS token in the config.

Table 7: Dataset Arguments (LMData_VSL mode)

| Argument | Default Value | Description |
| --- | --- | --- |
| fold_long_doc | True | Fold documents larger than max_seq_length into multiple sequences, instead of dropping them. |
| position_ids_dtype | int32 | dtype of token position ids. |

The LMData_VSL mode inherits all the other arguments from the LMData mode as listed in Table 4.
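A minimal sketch of what folding a long document means is shown below. The helper split_document is hypothetical, and splitting into consecutive non-overlapping chunks is an assumption about how the folding is done.

```python
def split_document(tokens, max_seq_length, fold_long_doc=True):
    """Return the sequences produced from one tokenized document."""
    if len(tokens) <= max_seq_length:
        return [tokens]
    if not fold_long_doc:
        return []  # the long document is dropped
    # Fold: cut the document into consecutive chunks of max_seq_length.
    return [
        tokens[i : i + max_seq_length]
        for i in range(0, len(tokens), max_seq_length)
    ]


long_doc = list(range(10))
folded = split_document(long_doc, max_seq_length=4, fold_long_doc=True)
dropped = split_document(long_doc, max_seq_length=4, fold_long_doc=False)
```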

Table 8: Dataset Arguments (Summarization_VSL mode)

| Argument | Default Value | Description |
| --- | --- | --- |
| position_ids_dtype | int32 | dtype of token position ids. |
| prompt_prefix | None | If specified, this will be added before the prompt in every sequence. Example usage is to add <\|user\|> before the user message in a multi-turn dialogue. |
| completion_prefix | None | Similar to prompt_prefix, but for the completion. Example usage is to add <\|assistant\|> before the model’s response in a multi-turn dialogue. |
| eos_after_prompt | False | Some current chat templates will include an EOS token after the end of the user input in a multi-turn dialogue. If this flag is specified, there will be EOS tokens after all prompts. |
| multi_turn_key | None | If specified, this replaces the prompt_key and completion_key usage. The assumption is that a multi-turn dialogue stores a list of the entries, which can be referenced by this key. |
| multi_turn_content_key | None | If the data stored at multi_turn_key is a list of dictionaries rather than a list of strings (of user and assistant responses), this key accesses the message content within the dictionary. For example, some data stores the dialogue as dictionaries of {"content": ..., "user": ...}, in which case multi_turn_content_key would be content. |

The Summarization_VSL mode inherits all the other arguments from the Summarization mode as listed in Table 5.
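To illustrate how the multi-turn arguments relate to each other, the sketch below pulls prompt/completion pairs out of a record. It assumes that turns alternate between user and assistant, starting with the user; the function extract_turns and the example record are hypothetical.

```python
def extract_turns(record, multi_turn_key, multi_turn_content_key=None,
                  prompt_prefix=None, completion_prefix=None):
    """Return (prompt, completion) pairs from a multi-turn dialogue record."""
    turns = record[multi_turn_key]
    if multi_turn_content_key is not None:
        # Each turn is a dictionary; pick out the message content.
        turns = [turn[multi_turn_content_key] for turn in turns]
    pairs = []
    # Assumption: even positions are user prompts, odd positions are replies.
    for i in range(0, len(turns) - 1, 2):
        prompt = (prompt_prefix or "") + turns[i]
        completion = (completion_prefix or "") + turns[i + 1]
        pairs.append((prompt, completion))
    return pairs


record = {
    "dialogue": [
        {"content": "Hi", "role": "user"},
        {"content": "Hello!", "role": "assistant"},
    ]
}
pairs = extract_turns(
    record,
    multi_turn_key="dialogue",
    multi_turn_content_key="content",
    prompt_prefix="<|user|>",
    completion_prefix="<|assistant|>",
)
```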

Table 9: Dataset Arguments (LlavaPhaseOne mode)

| Argument | Default Value | Description |
| --- | --- | --- |
| eos_after_prompt | False | Some current chat templates will include an EOS token after the end of the user input in a multi-turn dialogue. If this flag is specified, there will be EOS tokens after all prompts. |
| multi_turn_key | None | If specified, this replaces the prompt_key and completion_key usage. The assumption is that a multi-turn dialogue stores a list of the entries, which can be referenced by this key. |
| multi_turn_content_key | None | If the data stored at multi_turn_key is a list of dictionaries rather than a list of strings (of user and assistant responses), this key accesses the message content within the dictionary. For example, some data stores the dialogue as dictionaries of {"content": ..., "user": ...}, in which case multi_turn_content_key would be content. |
| image_key | None | Image key of the LLaVA dataset. For example, a jsonl file might have the image path contained at the key "image". |
| multi_modal_non_image_ex_key | None | Some examples in LLaVA training are text-only and have no images, so that the model does not forget how to answer text-only questions while it is learning multi-modality. These examples will not have the image_key, but will have another key to represent that it is a no-image example. |
| image_token | None | String that represents where in the text the image patches will be inserted. For example, the original LLaVA dataset contained the string "<image>" in the prompt. |
| num_patches | None | Number of patches to represent an image. This is determined by the patch size (in pixels) of the image encoder and the pixel count of the input images. |
| image_dir | None | Absolute path of the image directory. Used along with the relative path under the image_key field to check that images exist, and to throw out examples with no image. |

The LlavaPhaseOne mode inherits all the other arguments from the Summarization mode as listed in Table 8. Note that the LLaVA phase-one training removes the prompt text, and trains on the image + completion pair.

Also note that both preprocessors for LLaVA currently only support tokenizers based on the Llama and Mistral models.
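As a small worked example of how num_patches follows from the image size and the patch size, the sketch below assumes square images, non-overlapping square patches, and no extra class token. The function name is hypothetical, and the right value for your setup depends on the image encoder you use.

```python
def compute_num_patches(image_size, patch_size):
    """Patches per image for a square image split into non-overlapping patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# A 336x336 image with 14x14 patches gives a 24x24 grid of patches.
patches_336 = compute_num_patches(image_size=336, patch_size=14)
patches_224 = compute_num_patches(image_size=224, patch_size=14)
```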

Table 10: Dataset Arguments (LlavaPhaseTwo mode)

| Argument | Default Value | Description |
| --- | --- | --- |
| prompt_prefix | None | If specified, this will be added before the prompt in every sequence. Example usage is to add <\|user\|> before the user message in a multi-turn dialogue. |
| completion_prefix | None | Similar to prompt_prefix, but for the completion. Example usage is to add <\|assistant\|> before the model’s response in a multi-turn dialogue. |
| system_prompt_style | None | Key to obtain the system prompt used for the LLM backbone within LLaVA. The currently supported keys are vicuna_v0 and vicuna_v1. For example, if you are training a LLaVA model based on the Vicuna model, you could specify vicuna_v1. |

The LlavaPhaseTwo mode inherits all the other arguments from LlavaPhaseOne mode as listed in Table 9. LLaVA phase-two training does not remove the prompt text as phase-one does.

Usage of create_hdf5_dataset.py file

You can provide the above arguments either as command line arguments or in a YAML config file:

Command line

python create_hdf5_dataset.py LMData --input_dir /path/to/data --tokenizer_type NeoXTokenizer --encoder_file /path/to/encoder --max_seq_length 4096 --use_ftfy True --pack_sequences False

YAML config file

python create_hdf5_dataset.py LMData --params ./configs/autoregressive_lm_preprocessing.yaml

Sample YAML files for LMData and Summarization are available in the Cerebras Model Zoo.

Note: You can use both at once; when an argument appears in both places, the command-line value overrides the value in the YAML configuration file.
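The precedence rule can be pictured with plain dictionaries. This sketch assumes that command-line arguments the user did not set arrive as None; merge_params is a hypothetical helper, not a function of the script.

```python
def merge_params(yaml_params, cli_args):
    """Combine YAML and CLI settings; explicitly set CLI values win."""
    merged = dict(yaml_params)
    # Only CLI values that were actually provided override the YAML values.
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


yaml_params = {"max_seq_length": 2048, "pack_sequences": True}
cli_args = {"max_seq_length": 4096, "pack_sequences": None}
merged = merge_params(yaml_params, cli_args)
```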

Customize mode steps

  1. Create a Python file, or add your code to ./hdf5_dataset_preprocessors.py.

  2. Import the module HDF5Preprocessor in the file you created as follows:

from modelzoo.data_preparation.nlp.hdf5_preprocessing.hdf5_preprocessor import HDF5Preprocessor

  3. Create a class that inherits from HDF5Preprocessor (e.g., CustomDataset).

  4. Implement __init__, which takes as input a dictionary containing the dataset parameters needed by HDF5Preprocessor.

  5. Implement the methods file_read_generator and preprocessing_generator, following the section Write customized preprocessor.

  6. Run the create_hdf5_dataset.py script.

Write customized preprocessor

You can create customized preprocessors for various datasets or objectives. We provide two reference implementations in hdf5_dataset_preprocessors.py:

  1. LMDataPreprocessor: the preprocessor for autoregressive language modeling tasks

  2. SummarizationPreprocessor: the preprocessor for summarization tasks

They both inherit from the HDF5BasePreprocessor at hdf5_base_preprocessor.py with two functions that can be overridden to customize for various cases:

  1. file_read_generator() takes a file path, reads from the file, and yields the corresponding text documents. You can customize how the file is read based on its format (e.g., csv or zip). Our default preprocessors use the lm_dataformat reader with specific JSON keys.

  2. preprocessing_generator() takes the output of file_read_generator(), performs tokenization and other preprocessing steps, and yields the data samples in np.array format.

For example, in the autoregressive language modeling task, file_read_generator yields an str object, and the preprocessing_generator produces an np array with shape [3, max_sequence_length] with the following three features concatenated on the first dimension:

  1. input_ids: Input token ids, padded with 0’s to max_sequence_length.

  2. input_mask: Loss mask for the sequence. It has 0’s at masked positions, such as prompts or padding tokens, and 1’s elsewhere.

  3. labels: input_ids shifted to the right by one position as the target labels.
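The division of labor between the two generators can be sketched with standalone functions. Everything here is a simplified stand-in: the whitespace "tokenizer", the CSV column name, the use of nested lists instead of an np array, and the next-token convention for labels are assumptions for illustration, and the real methods are defined on a preprocessor class and take a file path.

```python
import csv
import io


def file_read_generator(file_obj, column="text"):
    """Yield one text document per CSV row (custom reading logic goes here)."""
    for row in csv.DictReader(file_obj):
        yield row[column]


def preprocessing_generator(doc, vocab, max_sequence_length, pad_id=0):
    """Yield a [3, max_sequence_length] sample: input_ids, input_mask, labels."""
    # Toy tokenizer: whitespace split, ids assigned on first sight from 1.
    tokens = [vocab.setdefault(word, len(vocab) + 1) for word in doc.split()]
    tokens = tokens[: max_sequence_length + 1]
    if len(tokens) < 2:
        return  # nothing to predict for a single-token document
    input_ids, labels = tokens[:-1], tokens[1:]  # labels are the next tokens
    input_mask = [1] * len(input_ids)
    pad = max_sequence_length - len(input_ids)
    yield [
        input_ids + [pad_id] * pad,
        input_mask + [0] * pad,
        labels + [pad_id] * pad,
    ]


vocab = {}
csv_file = io.StringIO("text\nthe cat sat\n")
samples = [
    sample
    for doc in file_read_generator(csv_file)
    for sample in preprocessing_generator(doc, vocab, max_sequence_length=4)
]
```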

NOTE: To avoid tedious setup of arguments specific to your customized preprocessor, we recommend running with a YAML file config.

Best practices

  • We recommend using the ftfy module to fix the datasets. Enable it with the --use_ftfy argument.

  • The NeoXTokenizer uses the HuggingFace library’s inbuilt tokenizer and handles NFC normalization independently. When using this tokenizer_type, set the --ftfy_normalizer argument to None. For the GPT2Tokenizer, use the default NFC value for the normalizer.

  • To process HDF5 for training, we recommend using multi-processing. We also suggest using several input files, such that the total number of input files is greater than or equal to the number of processes provided by --processes. Note that this requires a high-spec CPU server that can handle the concurrently running processes in RAM as well as the I/O for reads and writes. If the I/O of the server is slow, the processes may appear to hang for a long time.

  • For very large datasets (several files, each on the order of gigabytes), we recommend splitting the data into smaller subsets and writing out each subset separately. You can then mix all HDF5 files in a common folder for use by the data pipeline, or provide the locations of each subset in a list. The overall time to write out HDF5 files depends on the CPU server used.

  • It is better to split the input dataset into multiple files with similar sizes to leverage the full potential of parallel processing.

  • When processing data for CodeGen models, use GPT2Tokenizer along with the updated vocab files, in which the GPT-2 vocabulary is extended by special tokens representing repeated tabs and white spaces.

Output files structure

The output directory will contain many h5 files, as shown below (with two processes):

<path/to/output_dir>
├── checkpoint_0.txt
├── checkpoint_1.txt
├── data_params.json
├── examples_0_0.h5
├── examples_0_1.h5
├── examples_1_0.h5
├── examples_1_1.h5
├── examples_2_0.h5
├── examples_2_1.h5
├── examples_3_0.h5
├── examples_3_1.h5
├── examples_4_0.h5
├── examples_4_1.h5
├── examples_5_0.h5
├── examples_6_0.h5
├── examples_7_0.h5
└── examples_8_0.h5

Here data_params.json is the file that stores the parameters used for generating this set of files. The checkpoint_*.txt files can be used to resume processing if the script is killed for some reason. There is one checkpoint_*.txt file for each process. To resume, rerun the previous command with the additional command line argument --resume_from_checkpoint.
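For a quick sanity check of an output directory, the file names can be grouped per process. The sketch assumes the naming pattern <output_name>_<file index>_<process id>.h5, which is inferred from the listing above rather than documented, and group_by_process is a hypothetical helper.

```python
import re
from collections import defaultdict


def group_by_process(file_names, output_name="examples"):
    """Map each process id to the sorted file indices it wrote (assumed naming)."""
    pattern = re.compile(
        rf"^{re.escape(output_name)}_(?P<index>\d+)_(?P<process>\d+)\.h5$"
    )
    groups = defaultdict(list)
    for name in file_names:
        match = pattern.match(name)
        if match:  # skip checkpoint_*.txt, data_params.json, etc.
            groups[int(match["process"])].append(int(match["index"]))
    return {process: sorted(indices) for process, indices in groups.items()}


# File names mirroring the listing above (two processes).
names = (
    [f"examples_{i}_0.h5" for i in range(9)]
    + [f"examples_{i}_1.h5" for i in range(5)]
    + ["checkpoint_0.txt", "checkpoint_1.txt", "data_params.json"]
)
groups = group_by_process(names)
```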

Example for HuggingFace Eli5 dataset

The following example shows the conversion of the HuggingFace Eli5 dataset to HDF5:

from modelzoo.data_preparation.huggingface.HuggingFace_Eli5 import (
    HuggingFace_Eli5,
)
from modelzoo.data_preparation.nlp.hdf5_preprocessing.convert_dataset_to_HDF5 import (
    convert_dataset_to_HDF5,
)

dataset, data_collator = HuggingFace_Eli5(split="train", num_workers=8)

convert_dataset_to_HDF5(
    dataset=dataset,
    data_collator=data_collator,
    output_dir="./eli5_hdf5_dataset/",
    num_workers=8,
)