Read Hooks
Read hooks are critical tools for efficiently handling diverse data sources in machine learning workflows on Cerebras systems. This guide explains how to configure and use read hooks to streamline data preprocessing across different data types and platforms, enabling more robust and adaptable machine learning pipelines.
By mastering read hooks, you’ll gain the ability to:
- Seamlessly integrate local and HuggingFace data sources
- Customize data loading for specific tasks
- Optimize preprocessing efficiency
- Enhance overall model performance
Learn how we use read hooks to convert different input formats into our semantic data array format.
Fine-tuning LLaVA Hook
This read hook processes conversation data to format it for fine-tuning LLaVA models. It looks for conversation turns, optional system prompts, and images. It requires keys for conversation data and image paths.
Pretraining Text Hook
This read hook extracts and processes plain text data for reading tasks. It requires a key to extract text from input data.
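To make the idea concrete, here is a minimal sketch of what such a hook could look like. The function name, signature, and the `text_key` argument are illustrative assumptions, not the actual ModelZoo implementation:

```python
# Hypothetical sketch of a pretraining text read hook (names and signature
# are illustrative, not the actual Cerebras ModelZoo API).
def pretraining_text_hook(example, text_key="text", **read_hook_kwargs):
    """Wrap raw text in a single-region semantic data array."""
    return [
        {
            "type": "text",
            "content": [{"text": example[text_key]}],
        }
    ]

sda = pretraining_text_hook({"text": "The quick brown fox."})
```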
Pretraining Image Captions Hook
This read hook prepares data for image captioning pretraining tasks by extracting image paths and captions.
NLG Hook
This read hook processes natural language generation (NLG) data, organizing context and completion information into a structured format. It requires context and completion keys.
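A hedged sketch of such a hook, assuming `context_key` and `completion_key` are passed in as keyword arguments (the defaults below are made up for illustration):

```python
# Hypothetical sketch of an NLG read hook; key names arrive via
# read_hook_kwargs in practice, defaults here are illustrative.
def nlg_hook(example, context_key="context", completion_key="completion", **kwargs):
    """Split an NLG example into prompt and completion semantic regions."""
    return [
        {"type": "prompt", "content": [{"text": example[context_key]}]},
        {"type": "completion", "content": [{"text": example[completion_key]}]},
    ]

sda = nlg_hook({
    "context": "name[Alimentum], area[city centre]",
    "completion": "Alimentum is located in the city centre.",
})
```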
Prompt Completion Text Hook
This read hook formats prompt and completion text into a structured list. It requires prompt and completion keys.
Chat Hook
This read hook transforms chat data into a semantic data array, distinguishing between user and assistant roles. Assumes data is in conversation format and requires a key for multi-turn content if the data is not in OpenAI ChatML format.
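For data already in an OpenAI ChatML-style layout, a chat hook can map roles directly to region types. This is a sketch under that assumption; the `messages` key and turn fields are illustrative:

```python
# Hypothetical sketch of a chat read hook for ChatML-style input
# (the "messages" key and role/content turn fields are assumptions).
def chat_hook(example, multi_turn_key="messages", **kwargs):
    """Map each conversation turn to a semantic region keyed by its role."""
    return [
        {"type": turn["role"], "content": [{"text": turn["content"]}]}
        for turn in example[multi_turn_key]
    ]

sda = chat_hook({
    "messages": [
        {"role": "user", "content": "What is a read hook?"},
        {"role": "assistant", "content": "A function mapping raw data to semantic regions."},
    ]
})
```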
DPO Hook
This read hook structures data for Direct Preference Optimization (DPO) tasks, organizing prompts, chosen responses, and rejected responses into semantic data array. Requires keys for prompt, chosen, and rejected data. The implementation can be found here.
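Since DPO compares two responses to the same prompt, one plausible shape is a pair of semantic data arrays sharing the prompt region. The output layout below is an assumption for illustration; the real hook may structure this differently:

```python
# Hypothetical sketch of a DPO read hook. The exact output layout for
# DPO data is an assumption; the real hook may differ.
def dpo_hook(example, prompt_key="prompt", chosen_key="chosen",
             rejected_key="rejected", **kwargs):
    """Build parallel semantic data arrays for chosen and rejected responses."""
    prompt_region = {"type": "prompt", "content": [{"text": example[prompt_key]}]}
    return {
        "chosen": [prompt_region,
                   {"type": "completion", "content": [{"text": example[chosen_key]}]}],
        "rejected": [prompt_region,
                     {"type": "completion", "content": [{"text": example[rejected_key]}]}],
    }

pairs = dpo_hook({"prompt": "Summarize:", "chosen": "A concise summary.",
                  "rejected": "Unrelated text."})
```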
Prompt Completion Chat Hook
This read hook processes prompt and completion data as a single turn chat and creates a semantic data array format. The implementation can be found here.
Fine-Tuning Image Captions Hook
Processes fine-tuning image captions data into a semantic data array format. Requires keys for image and caption data. The hook implementation can be found here.
Fine-Tuning LLaVA Hook Prompt Completion
This read hook transforms conversation data for fine-tuning LLaVA, alternating between prompt and completion roles. Requires keys for conversation data and image paths. The hook implementation can be found here.
Custom Read Hook Examples
This section describes examples of custom read hooks for processing data on Cerebras Systems. The return data fields are:
- `type`: Indicates the role in the conversation. Possible values are `system`, `user`, `assistant`, `prompt`, or `completion`.
- `content`: A list of dictionaries representing parts of the conversation turn. Each dictionary can contain:
  - `text`: A segment of text.
  - `image`: The path to an image (if applicable).
- `semantic_loss_weight`: A list of booleans indicating which parts of the content might be dropped during training for semantic purposes.
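For concreteness, here is an illustrative value a custom multimodal hook might return, showing all three fields together (the file path and text are made up):

```python
# Illustrative return value of a custom read hook (the path and text are
# hypothetical). False entries in semantic_loss_weight mask the matching
# content part out of the training loss.
sda = [
    {
        "type": "user",
        "content": [
            {"text": "Describe this image."},
            {"image": "images/example.jpg"},
        ],
        "semantic_loss_weight": [False, False],
    },
    {
        "type": "assistant",
        "content": [{"text": "A cat sitting on a mat."}],
        "semantic_loss_weight": [True],
    },
]
```

Note that `semantic_loss_weight` has one entry per item in `content`, so the two lists must stay the same length.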
Ultra Chat Common Words Mask Hook
This hook masks selected words in the sequence and was written for Ultrachat data. The link to the dataset is here. The hook implementation can be found here.
Obelics Hook
Processes OBELICS dataset examples into a semantic data array format. Requires keys for image and text data. The dataset link and description can be found here. The hook implementation can be found here.
Llama3.1 Chat Template Formatted Data Hook
Processes a multi-turn conversation dataset to which the chat template of the meta-llama/Meta-Llama-3-8B-Instruct tokenizer has already been applied, converting it into a semantic data array format. Requires keys for image and text data. The dataset link and description for a sample dataset can be found here. The hook implementation can be found here.
Important Considerations
- Handling keys to read data: The `read_hook_kwargs` must have data keys with the suffix `_key` to segregate them from the other parameters of `read_hook_kwargs`. Keys with this suffix are used exclusively to read data from the input, while the remaining parameters are used by the read hooks to create the semantic data array.
- Space handling: When combining custom regions, the data processor does not add any spaces or separators between the regions; space handling must be managed within the read hooks. When creating custom semantic regions, ensure there is a leading space at the start of each region (except the first) to prevent the merging of words from neighboring regions.
- Multimodal datasets: When working with multimodal datasets, if images are provided as URLs, the hooks should download the images and generate image paths to be used by the multimodal models.
- Separator handling with the Prompt Completion read hook: The token generator adds a separator token between the `prompt` and `completion` semantic regions. The tokenizer's `sep_token` attribute is used as the separator token if present; otherwise `<|sep|>` is used.
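The space-handling consideration above can be sketched as follows; the hook name, keys, and region types are illustrative, not part of the actual API:

```python
# Hypothetical hook showing leading-space handling between custom regions.
# Without the inserted space, the last word of one region and the first
# word of the next would be merged when the regions are concatenated.
def two_region_hook(example, first_key="first", second_key="second", **kwargs):
    second = example[second_key]
    if not second.startswith(" "):
        second = " " + second  # every region after the first leads with a space
    return [
        {"type": "prompt", "content": [{"text": example[first_key]}]},
        {"type": "completion", "content": [{"text": second}]},
    ]

sda = two_region_hook({"first": "The capital of France is", "second": "Paris."})
```

Concatenating the two regions now yields "The capital of France is Paris." rather than the merged "…isParis.".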
What’s Next?
Now that you’ve configured your input data and understand how to process it using various read hooks, the next step is to set up your token generators. Token generators play a crucial role in the preprocessing pipeline, as they convert raw data into tokenized formats suitable for machine learning models.