Learn how to create custom read hooks for processing different data formats into semantic data arrays, enabling flexible data preprocessing for AI model training.
In its simplest form, a sample reduces to a pair of `(text, loss_mask_value)`: this is the representation used in `pretraining` mode. We offer this capability in finetuning mode, for both text and multi-modal datasets. The `type` field controls behavior for chat templates (more details here), and can take the values "system", "prompt", "completion", "user", and "assistant". The difference between prompt/completion and user/assistant is whether we apply a chat template: prompt/completion does not apply the template, while user/assistant does.
The `content` field is a list of dictionaries, where the key is the name of the semantic region and the value is the content of that region. Currently, the `image` semantic region is special: its content must be a string giving the path to the image. Region names other than `image` are interpreted as text.
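To make the structure concrete, here is an illustrative sketch of a single multimodal sample as a semantic data array; the region names follow the conventions above, while the file path and wording are invented for illustration:

```python
# Illustrative semantic data array for one multimodal finetuning sample.
# The "image" region holds a path string; any other region name (here
# "text") is interpreted as plain text.
sample = [
    {
        "type": "prompt",
        "content": [
            {"image": "images/example_0001.jpg"},  # path to the image file
            {"text": "Describe what is shown in the image."},
        ],
    },
    {
        "type": "completion",
        "content": [
            {"text": "A dog is catching a frisbee in a park."},
        ],
    },
]
```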
The `semantic_loss_weight` and `semantic_{drop/attention}_mask` fields are optional and have default values determined by the `type` field. If specified, each should be a list with the same number of entries as the `content` list.
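For instance, the prompt entry from the sketch above could spell out the optional fields explicitly; each list lines up one-to-one with the two entries in `content` (the values shown simply restate the defaults for a prompt):

```python
# Illustrative prompt entry with the optional per-region fields made
# explicit. Each list has exactly as many entries as "content" (two here).
entry = {
    "type": "prompt",
    "content": [
        {"image": "images/example_0001.jpg"},
        {"text": "Describe what is shown in the image."},
    ],
    "semantic_loss_weight": [0, 0],        # prompt regions are loss-masked
    "semantic_attention_mask": [1, 1],     # both regions are attended to
    "semantic_drop_mask": [False, False],  # neither region is dropped
}
```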
By default, `completion` and `assistant` regions are not loss-masked, i.e. they have `semantic_loss_weight = 1`. The `system`, `prompt`, and `user` types have a default of `semantic_loss_weight = 0`.
All types have a default of `semantic_attention_mask = 1`, i.e. attention is paid to them.
All types also have a default of `semantic_drop_mask = False`, which means they are not dropped. The popular LLaVA model dropped the text from its prompts in one phase of training, so we introduced this feature to support dropping arbitrary semantic regions according to a desired scheme (more details here).
Now let us translate some of the pedagogical examples from above into real semantic data arrays:
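As an illustrative sketch (the exact wording of the text is invented here), a prompt whose text region is dropped and excluded from the loss, followed by a completion that is trained on, could be written as:

```python
# Illustrative prompt/completion pair. The prompt's text region is dropped
# (semantic_drop_mask = True) and excluded from the loss
# (semantic_loss_weight = 0); the completion is kept and trained on.
sample = [
    {
        "type": "prompt",
        "content": [{"text": "What is the capital of France?"}],
        "semantic_loss_weight": [0],
        "semantic_attention_mask": [1],
        "semantic_drop_mask": [True],
    },
    {
        "type": "completion",
        "content": [{"text": "The capital of France is Paris."}],
        "semantic_loss_weight": [1],
        "semantic_attention_mask": [1],
        "semantic_drop_mask": [False],
    },
]
```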
In this example, `semantic_drop_mask` is set to `True` for the text region of "prompt," indicating that this text portion will be dropped from the dataset and not tokenized. The semantic attention mask determines which regions contribute to the final attention mask passed to the model. A loss weight of 0 for a region means that the label tokens corresponding to that region will not be included in the loss calculation.
Each region in these examples is described by its `type`, its `content`, and an explicit `semantic_loss_weight`.
One of the provided read hooks transforms data in which the chat template for `meta-llama/Meta-Llama-3-8B-Instruct` has already been applied into the semantic data array format. It requires keys for image and text data. The dataset link and description for a sample dataset can be found here. The hooks implementation can be found here.
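As a rough sketch of what such a hook could look like (the function name, the key names, and the assumed input layout are all illustrative, not the actual Model Zoo implementation), a read hook receives one raw example plus its `read_hook_kwargs` and returns a semantic data array:

```python
# Hypothetical read hook: the raw text already has a chat template applied,
# so it is emitted as "prompt"/"completion" regions, which do not re-apply
# a template. Key names such as "image_key" are illustrative assumptions.
def pretemplated_chat_read_hook(example, **read_hook_kwargs):
    image_key = read_hook_kwargs["image_key"]            # e.g. "image_path"
    prompt_key = read_hook_kwargs["prompt_key"]          # e.g. "templated_prompt"
    completion_key = read_hook_kwargs["completion_key"]  # e.g. "response"

    return [
        {
            "type": "prompt",
            "content": [
                {"image": example[image_key]},
                {"text": example[prompt_key]},
            ],
        },
        {
            "type": "completion",
            "content": [{"text": example[completion_key]}],
        },
    ]
```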
The `read_hook_kwargs` property must have data keys with the suffix `_key` to segregate them from the other parameters in `read_hook_kwargs`. These keys are used exclusively to read data from the input, while the remaining parameters, which are not data keys, are used to create the semantic data array in the read hooks.
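A hedged illustration of this convention (the specific names are made up): entries ending in `_key` point at fields of the raw input record, while everything else is an ordinary parameter consumed by the hook itself, and the two kinds can be separated by suffix inside the hook:

```python
# Illustrative read_hook_kwargs. Names ending in "_key" identify fields of
# the raw input record; the remaining entries are ordinary hook parameters.
read_hook_kwargs = {
    "image_key": "image_path",    # data key: where to find the image path
    "caption_key": "caption",     # data key: where to find the caption text
    "system_prompt": "You are a helpful assistant.",  # non-data parameter
}

# Inside a hook, the two kinds of parameters can be split by suffix.
data_keys = {k: v for k, v in read_hook_kwargs.items() if k.endswith("_key")}
other_params = {k: v for k, v in read_hook_kwargs.items() if not k.endswith("_key")}
```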
The hook produces `prompt` and `completion` semantic regions. The tokenizer's `sep_token` attribute is used as the separator token if present; otherwise we use `<|sep|>`.
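A minimal sketch of that separator logic, under the assumption that the hook simply appends the separator to the prompt text (the hook name and key names are again illustrative):

```python
# Hypothetical hook producing "prompt" and "completion" regions joined by a
# separator token: the tokenizer's sep_token when it is set, else "<|sep|>".
def prompt_completion_read_hook(example, tokenizer=None, **read_hook_kwargs):
    prompt_key = read_hook_kwargs["prompt_key"]
    completion_key = read_hook_kwargs["completion_key"]

    sep = getattr(tokenizer, "sep_token", None) or "<|sep|>"

    return [
        {"type": "prompt", "content": [{"text": example[prompt_key] + sep}]},
        {"type": "completion", "content": [{"text": example[completion_key]}]},
    ]
```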