Quickstart: Preprocess Your Data

Before You Begin

Before you can pre-train or fine-tune your model, preprocess your text-only or multimodal data into HDF5 format for the Cerebras platform.

You will:

Set up your data directory
Prepare your config file
Run the preprocessing script
Visualize the preprocessed data

Input files should be in one of the following formats:

.jsonl
.json.gz
.jsonl.zst
.jsonl.zst.tar
.parquet
.txt
.fasta
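Before running the preprocessing script, it can help to confirm that every file in your raw data directory uses one of these extensions. The following is a minimal sketch, not part of the Cerebras tooling: the helper names `is_supported` and `find_unsupported` are hypothetical, and the check looks only at file names, not file contents.

```python
from pathlib import Path

# Extensions copied from the list of supported input formats.
SUPPORTED_SUFFIXES = (
    ".jsonl",
    ".json.gz",
    ".jsonl.zst",
    ".jsonl.zst.tar",
    ".parquet",
    ".txt",
    ".fasta",
)


def is_supported(path):
    """Return True if the file name ends with one of the listed extensions."""
    return Path(path).name.endswith(SUPPORTED_SUFFIXES)


def find_unsupported(data_dir):
    """Return a sorted list of files under data_dir with an unlisted extension."""
    return sorted(
        str(p)
        for p in Path(data_dir).rglob("*")
        if p.is_file() and not is_supported(p)
    )
```

An empty list from `find_unsupported("<path/to/dir>")` suggests that every file has a listed extension.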

Set Up Your Data Directory

Place your raw data in a directory. If processing multimodal data, set up an image directory as well. You’ll need to supply these paths in your config file.
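As an illustration of what a raw data directory might contain, the sketch below writes a small text-only `.jsonl` file. It assumes each line is a JSON object whose text lives under a `text` key, matching the `text_key: "text"` setting used in the processing example later on this page. The function name `write_jsonl` and the example path are hypothetical.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write an iterable of dictionaries to path, one JSON object per line."""
    path = Path(path)
    # Create the data directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

For example, `write_jsonl([{"text": "First document."}, {"text": "Second document."}], "raw_data/sample.jsonl")` creates a two-line file in a `raw_data` directory.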

If you’re using a HuggingFace dataset, note the name of the dataset. Learn more about using HuggingFace data here.

When using the Hugging Face CLI to download a dataset, you may encounter the following error:

KeyError: 'tags'

This error is caused by an outdated version of the huggingface_hub package. To resolve it, upgrade the package by running:

pip install --upgrade huggingface_hub==0.26.1

Prepare Your Config File

Your config file contains three sections: setup, processing, and dataset.

View example configs for various use cases here.

Each section is broken down below with examples for different data types.

Setup

The setup section is where you set the path to your data, the mode (pre-training or fine-tuning), and other key flags.

Learn more about additional Setup flags here.

The following example shows a setup section:

setup:
  data:
    source: "<path/to/dir>" # local directory or HuggingFace dataset name
    type: "local" # set to "huggingface" to use HuggingFace data
    split: "test"
    cache_dir: "path/to/cache_dir"
    # other parameters accepted by the HuggingFace load_dataset API...

  mode: "pretraining" # or "finetuning"
  output_dir: "./output/dir/here/"
  processes: 1 # number of CPU cores to use; set to 0 to use the default number of available cores

Processing

The processing section is where you define which tokenizer and read hook you want to use.

There are a variety of read hooks available and each one processes data differently, preparing it for specific machine learning tasks. Learn more about read hooks here.
Similarly, there are a variety of tokenizers available. Learn more about how to initialize different tokenizers here.

processing:
  huggingface_tokenizer: "bert-base-uncased"
  tokenizer_params:
    param1: value1
    param2: value2
  read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks.text_read_hook"
  read_hook_kwargs:
    text_key: "text"
  resume_from_checkpoint: False
  max_seq_length: 2048
  read_chunk_size: 1024
  write_chunk_size: 1024
  shuffle: False
  shuffle_seed: 0
  ftfy_normalizer: NFC
  use_ftfy: False
  wikitext_detokenize: False

Dataset

The dataset section is where you can set additional token generator parameters.

Learn more about additional dataset parameters here.

dataset:
  use_vsl: False
  pack_sequences: True
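Once the three sections are written, a quick sanity check can catch a missing section or a mistyped mode before a long preprocessing run. The sketch below is not part of the preprocessing script. It assumes the config has already been loaded into a Python dictionary (for example with a YAML parser), that all three sections are present as in the examples above, and that `mode` takes one of the two values shown in the setup example. The function name `check_config` is hypothetical.

```python
REQUIRED_SECTIONS = ("setup", "processing", "dataset")
VALID_MODES = ("pretraining", "finetuning")


def check_config(config):
    """Return a list of problems found in a loaded config dictionary."""
    problems = []

    # Each top-level section should be present and be a mapping.
    for section in REQUIRED_SECTIONS:
        if not isinstance(config.get(section), dict):
            problems.append(f"missing or empty section: {section}")

    setup = config.get("setup")
    if not isinstance(setup, dict):
        setup = {}

    # The mode should be one of the two values shown in the setup example.
    if setup.get("mode") not in VALID_MODES:
        problems.append(f"setup.mode should be one of {VALID_MODES}")

    # The data source is the local directory or dataset name.
    data = setup.get("data")
    if not isinstance(data, dict) or not data.get("source"):
        problems.append("setup.data.source is not set")

    return problems
```

An empty list means none of these basic checks failed; it does not guarantee that the preprocessing script will accept the config.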

Run the Preprocessing Script

Run the following command with the path to your config file:

python preprocess_data.py --config /path/to/configuration/file
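If you launch preprocessing from another Python program, the same command can be assembled and run with `subprocess`. This is a minimal sketch that assumes `preprocess_data.py` is in the current working directory and accepts only the `--config` flag shown above. The helper names `build_command` and `run_preprocessing` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_command(config_path, script="preprocess_data.py"):
    """Build the argument list for the preprocessing command."""
    config_path = Path(config_path)
    # Fail early with a clear message if the config file is missing.
    if not config_path.is_file():
        raise FileNotFoundError(f"config file not found: {config_path}")
    # sys.executable is the interpreter running this program.
    return [sys.executable, script, "--config", str(config_path)]


def run_preprocessing(config_path):
    """Run the preprocessing script and raise if it exits with an error."""
    subprocess.run(build_command(config_path), check=True)
```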

Visualize Your Data

TokenFlow visualizes preprocessed data in an organized view, which makes it easier to debug and catch errors in the output data. You can supply other arguments (listed below) if needed.

Run the following command:

python launch_tokenflow.py --output_dir <directory/of/file(s)>

In your terminal, you will see a URL like http://172.31.48.239:5000. Copy and paste it into your browser to launch TokenFlow, a tool for interactively checking whether loss and attention masks were applied correctly.

Arguments

Output

There are four sections in the visualization output. input_strings and label_strings are the tokens decoded from input_ids and labels, respectively. In the string sections, a token is highlighted in green when its loss weight is greater than zero and in red when its attention mask is set to zero. For multimodal datasets, hovering over an image pad token also displays the corresponding image in a popup window.
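The highlighting rules described above can be expressed as a short sketch. This is not TokenFlow's implementation. It assumes the tokens, loss weights, and attention mask are available as parallel lists of equal length, and it reports the green and red conditions independently because the source does not say how a token that meets both is displayed. The function name `highlight_flags` is hypothetical.

```python
def highlight_flags(tokens, loss_weights, attention_mask):
    """Return, for each token, whether the green and red conditions hold."""
    if not (len(tokens) == len(loss_weights) == len(attention_mask)):
        raise ValueError("tokens, loss_weights, and attention_mask must have equal length")

    rows = []
    for token, weight, mask in zip(tokens, loss_weights, attention_mask):
        rows.append(
            {
                "token": token,
                "green": weight > 0,  # loss weight greater than zero
                "red": mask == 0,     # attention mask set to zero
            }
        )
    return rows
```

Comparing these flags against the highlighting in the browser is one way to reason about which tokens contribute to the loss.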

Learn more about visualization and debugging here.

What’s Next?

Depending on your goals, check out our Pretraining or Finetuning quickstart guides.
