Read hooks are critical tools for efficiently handling diverse data sources in machine learning workflows. This guide explores how to configure and utilize read hooks to streamline data preprocessing across different data types and platforms, enabling more robust and adaptable machine learning pipelines.

By mastering read hooks, you’ll gain the ability to:

  • Seamlessly integrate local and HuggingFace data sources

  • Customize data loading for specific tasks

  • Optimize preprocessing efficiency

  • Enhance overall model performance

Learn how we use read hooks to convert from different input formats to our semantic data array.

Fine-Tuning LLaVA Hook

This read hook processes conversation data to format it for fine-tuning LLaVA models. It looks for conversation turns, optional system prompts, and images. It requires keys for conversation data and image paths.
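As an illustrative sketch only — the function name, the "from"/"value" turn fields (common in LLaVA-style data), and the choice to mask loss on user turns are assumptions, not the exact Cerebras implementation — such a hook might look like:

```python
def finetuning_llava_hook(example, conversation_key="conversation", image_key="image"):
    """Convert LLaVA-style conversation data into a semantic data array (sketch)."""
    regions = []
    image_path = example.get(image_key)
    for i, turn in enumerate(example[conversation_key]):
        role = "user" if turn["from"] == "human" else "assistant"
        content = [{"text": turn["value"]}]
        # Attach the image to the first turn so the model sees it up front.
        if i == 0 and image_path is not None:
            content.insert(0, {"image": image_path})
        regions.append({
            "type": role,
            "content": content,
            # Illustrative choice: compute loss only on assistant turns.
            "semantic_loss_weight": [role == "assistant"] * len(content),
        })
    return regions
```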

Pretraining Text Hook

This read hook extracts and processes plain text data for pretraining tasks. It requires a key specifying where to read text from the input data.

Pretraining Image Captions Hook

This read hook prepares data for image captioning pretraining tasks by extracting image paths and captions.
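A minimal sketch of what such a hook might return — the function name, key names, and the decision to compute loss only on the caption are illustrative assumptions:

```python
def pretraining_image_captions_hook(example, image_key="image_path", caption_key="caption"):
    # One region for the image, one for its caption; training only on
    # the caption text is an illustrative choice.
    return [
        {
            "type": "prompt",
            "content": [{"image": example[image_key]}],
            "semantic_loss_weight": [False],
        },
        {
            "type": "completion",
            "content": [{"text": example[caption_key]}],
            "semantic_loss_weight": [True],
        },
    ]
```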

NLG Hook

This read hook processes natural language generation (NLG) data, organizing context and completion information into a structured format. It requires context and completion keys.

Prompt Completion Text Hook

This read hook formats prompt and completion text into a structured list. It requires prompt and completion keys.
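A sketch of the shape such a hook could produce, assuming hypothetical prompt_key/completion_key parameters and the illustrative choice of training only on the completion:

```python
def prompt_completion_text_hook(example, prompt_key="prompt", completion_key="completion"):
    # Map each field to its own semantic region.
    return [
        {
            "type": "prompt",
            "content": [{"text": example[prompt_key]}],
            "semantic_loss_weight": [False],
        },
        {
            "type": "completion",
            "content": [{"text": example[completion_key]}],
            "semantic_loss_weight": [True],
        },
    ]
```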

Chat Hook

This read hook transforms chat data into a semantic data array, distinguishing between user and assistant roles. It assumes the data is in conversation format and requires a key for multi-turn content if the data is not in OpenAI ChatML format.
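Assuming OpenAI ChatML-style input under a hypothetical messages key, a minimal sketch (not the actual implementation) might be:

```python
def chat_hook(example, messages_key="messages"):
    regions = []
    for msg in example[messages_key]:
        regions.append({
            "type": msg["role"],  # "system", "user", or "assistant"
            "content": [{"text": msg["content"]}],
            # Illustrative choice: train only on assistant responses.
            "semantic_loss_weight": [msg["role"] == "assistant"],
        })
    return regions
```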

DPO Hook

This read hook structures data for Direct Preference Optimization (DPO) tasks, organizing prompts, chosen responses, and rejected responses into a semantic data array. It requires keys for prompt, chosen, and rejected data. The implementation can be found here.
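One way this could be sketched — the chosen/rejected dictionary layout, key names, and loss-masking choice are assumptions for illustration, not the actual Cerebras implementation:

```python
def dpo_hook(example, prompt_key="prompt", chosen_key="chosen", rejected_key="rejected"):
    prompt = {
        "type": "prompt",
        "content": [{"text": example[prompt_key]}],
        "semantic_loss_weight": [False],
    }

    def completion(text):
        return {
            "type": "completion",
            "content": [{"text": text}],
            "semantic_loss_weight": [True],
        }

    # Pair the shared prompt region with each response.
    return {
        "chosen": [prompt, completion(example[chosen_key])],
        "rejected": [prompt, completion(example[rejected_key])],
    }
```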

Prompt Completion Chat Hook

This read hook processes prompt and completion data as a single-turn chat and creates a semantic data array. The implementation can be found here.

Fine-Tuning Image Captions Hook

This read hook processes image-caption data for fine-tuning into the semantic data array format. It requires keys for image and caption data. The hook implementation can be found here.

Fine-Tuning LLaVA Hook Prompt Completion

This read hook transforms conversation data for fine-tuning LLaVA, alternating between prompt and completion roles. Requires keys for conversation data and image paths. The hook implementation can be found here.

Custom Read Hook Examples

This section describes examples of custom read hooks for processing data on Cerebras Systems. The return data fields are:

  1. type: Indicates the role in the conversation. Possible values are system, user, assistant, prompt, or completion.

  2. content: A list of dictionaries representing parts of the conversation turn. Each dictionary can contain:

     • text: A segment of text.

     • image: The path to an image (if applicable).

  3. semantic_loss_weight: A list of booleans, one per entry in content, indicating which parts of the content contribute to the loss during training.
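For concreteness, a single turn combining all three fields might look like this (the path and values are purely illustrative):

```python
# A single user turn containing an image part and a text part.
turn = {
    "type": "user",
    "content": [
        {"image": "images/0001.png"},
        {"text": "Describe this picture."},
    ],
    # One boolean per entry in "content"; here neither part is trained on.
    "semantic_loss_weight": [False, False],
}
```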

Ultra Chat Common Words Mask Hook

This hook masks selected words in the sequence and was written for Ultrachat data. The link to the dataset is here. The hook implementation can be found here.

Obelics Hook

This read hook processes OBELICS dataset examples into the semantic data array format. It requires keys for image and text data. The dataset link and description can be found here. The hook implementation can be found here.

Llama3.1 Chat Template Formatted Data Hook

This read hook processes a multi-turn conversation dataset to which the chat template of the meta-llama/Meta-Llama-3-8B-Instruct tokenizer has already been applied, converting it into the semantic data array format. It requires keys for image and text data. The dataset link and description for a sample dataset can be found here. The hook implementation can be found here.

Important Considerations

  • Handling keys to read data: Data keys in read_hook_kwargs must carry the suffix _key to distinguish them from the other parameters in read_hook_kwargs. Keys with the _key suffix are used exclusively to read data from the input, while the remaining parameters configure how the read hook builds the semantic data array.

  • Space Handling: When combining custom regions, the data processor does not add any spaces or separators between the regions; space handling must be managed within the read hooks. When creating custom semantic regions, ensure there is a leading space at the start of each region (except the first) to prevent words from neighboring regions merging.

  • Multimodal Datasets: When working with multimodal datasets, if images are provided as URLs, the hooks should download the images and generate image paths to be used by the multimodal models.

  • Separator Handling With Prompt Completion Read Hook: The token generator adds a separator token between the prompt and completion semantic regions. The tokenizer's sep_token attribute is used as the separator token if present; otherwise <|sep|> is used.
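The _key convention described above can be sketched as a simple split of read_hook_kwargs; the key names and the add_bos parameter are made up for illustration:

```python
read_hook_kwargs = {
    # Data keys carry the "_key" suffix and point at fields of the input.
    "prompt_key": "question",
    "completion_key": "answer",
    # Anything without the suffix is an ordinary hook parameter
    # ("add_bos" is a hypothetical name, not a documented option).
    "add_bos": True,
}

data_keys = {k: v for k, v in read_hook_kwargs.items() if k.endswith("_key")}
params = {k: v for k, v in read_hook_kwargs.items() if not k.endswith("_key")}
```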
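The space-handling rule can be illustrated with a small helper (a sketch, not part of the actual pipeline) that pads every region after the first with a leading space before concatenation:

```python
def join_regions(texts):
    # The data processor concatenates regions verbatim, so every region
    # after the first gets a leading space to keep neighboring words apart.
    padded = [t if i == 0 or t.startswith(" ") else " " + t
              for i, t in enumerate(texts)]
    return "".join(padded)
```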
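The separator fallback described in the last consideration amounts to logic along these lines (a sketch; the helper name is illustrative):

```python
def get_sep_token(tokenizer):
    # Prefer the tokenizer's own sep_token; fall back to the literal
    # "<|sep|>" when the attribute is absent or unset.
    sep = getattr(tokenizer, "sep_token", None)
    return sep if sep is not None else "<|sep|>"
```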

What’s Next?

Now that you’ve configured your input data and understand how to process it using various read hooks, the next step is to set up your token generators. Token generators play a crucial role in the preprocessing pipeline, as they convert raw data into tokenized formats suitable for machine learning models.