Semantic Regions
Learn how to organize and process training data for AI models using semantic regions and read hooks, making it easier to handle different types of text and control how the model learns.
Semantic regions provide a generalized approach to data preprocessing for machine learning models, particularly for instruction-following datasets. Traditional methods hard-code masking for user and assistant sections, but this approach offers more flexibility.
What is a Semantic Region?
Typically, machine learning models trained on instruction-following datasets mask out user input to focus on generating assistant responses. Existing libraries often use a rigid, hard-coded approach to this masking.
In our data preprocessing, we generalized this idea with the concept of semantic regions. A semantic region is a section of data that shares a common meaning. This allows for more nuanced data preprocessing across different types of datasets.
Why Use Semantic Regions?
Let’s consider the user and assistant sections of a dialogue as two special cases of a semantic region. The user section shows the model an example question, which is distinct from the assistant response that demonstrates the desired output of the model. For pedagogical purposes, we introduce a simple representation to visualize semantic regions.
Below, each tuple contains (text, loss_mask_value):
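For example, a single dialogue turn might be visualized as follows (a minimal sketch; the text and mask values are illustrative):

```python
# Each tuple is (text, loss_mask_value): 0 = excluded from the loss, 1 = included.
dialogue = [
    ("User: What is the capital of France? ", 0),       # user region: loss-masked
    ("Assistant: The capital of France is Paris.", 1),  # assistant region: trained on
]
```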
The above case fits easily into existing frameworks. However, consider a medical question-answering dataset with three key components:
- A medical passage
- A related question
- An answer
In older, hard-coded systems, you would have to:
- Combine the passage and question into a single “user” region
- Lose the ability to learn from the medical passage during training
Our semantic regions approach solves this by allowing granular separation:
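In the tuple notation from above, one possible configuration looks like this (the text is hypothetical; note that the passage now contributes to the loss):

```python
# Three separate semantic regions, each with its own loss-mask value.
medical_example = [
    ("Passage: Aspirin inhibits platelet aggregation. ", 1),        # learn from the passage
    ("Question: What effect does aspirin have on platelets? ", 0),  # question: loss-masked
    ("Answer: It inhibits platelet aggregation.", 1),               # train on the answer
]
```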
Similarly, consider question-answering on Confluence documents. These contain ‘structure tokens’ that represent section headers and metadata (Date, Author, or Last Updated), which are not useful to learn to predict. We can separate the structure tokens into semantic regions that get loss-masked, while still including loss over the useful content in the user section:
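In the same notation, such a document might be split as follows (the content is hypothetical):

```python
# Structure tokens are loss-masked; the useful content still contributes to the loss.
confluence_example = [
    ("## Deployment Guide\n", 0),                              # structure token: header
    ("Author: Jane Doe | Last Updated: 2024-01-15\n", 0),      # structure token: metadata
    ("To deploy the service, run the release pipeline. ", 1),  # useful content
    ("Question: How do I deploy the service? ", 0),
    ("Answer: Run the release pipeline.", 1),
]
```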
We currently do not offer the ability to divide inputs into different semantic regions in pretraining mode. We offer this capability in finetuning mode, for both text and multi-modal datasets.
Semantic Data Arrays
We also introduce a data specification for our processing pipeline, called the semantic data array. Input data can come in a variety of formats, but we require a standard format to correctly parse it into semantic regions so that we can apply the corresponding attributes such as loss mask.
The type field controls behavior for chat templates (more details here), and can take the values "system", "prompt", "completion", "user", and "assistant". The difference between prompt/completion and user/assistant is whether we apply a chat template: prompt/completion does not apply the template, while user/assistant does.
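As a rough sketch of the distinction (the field layout follows the specification below; the content is illustrative):

```python
# With "user"/"assistant", the model's chat template is applied around the text.
chat_entry = {"type": "user", "content": [{"text": "What is 2 + 2?"}]}

# With "prompt"/"completion", the text is used verbatim, with no template applied.
raw_entry = {"type": "prompt", "content": [{"text": "Q: What is 2 + 2?\nA:"}]}
```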
The content field is a list of dictionaries, where each key is the name of a semantic region and the value is the content of that region. Currently, the image semantic region is special, and its content must be a string that represents the path to the image. Region names other than image will be interpreted as text.
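For example, a multi-modal entry might pair an image region with a text region (the path and text are hypothetical):

```python
multimodal_entry = {
    "type": "user",
    "content": [
        {"image": "images/chest_xray_001.png"},           # special region: an image path
        {"question": "Is there evidence of pneumonia?"},  # any other region name is text
    ],
}
```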
The semantic_{loss/drop/attention}_mask fields are optional and have default values according to the type field. If specified, each should be a list with the same number of entries as the content list.
By default, completion and assistant regions are not loss-masked, i.e. they have semantic_loss_mask = 1. The system, prompt, and user types have a default of semantic_loss_mask = 0.
All types have a default of semantic_attention_mask = 1, i.e. they have attention paid to them.
All types also have a default of semantic_drop_mask = False, which means they are not dropped. The popular LLaVA model dropped the text from its prompts in one phase of training, so we introduced this feature to support dropping arbitrary semantic regions according to a desired scheme (more details here).
Now let us represent some of the pedagogical examples from above into real semantic data arrays:
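For instance, the dialogue and medical QA examples might be written as follows (a sketch that follows the specification above; the content is illustrative):

```python
# Dialogue example: the defaults suffice (user is loss-masked, assistant is not).
dialogue_array = [
    {"type": "user", "content": [{"text": "What is the capital of France?"}]},
    {"type": "assistant", "content": [{"text": "The capital of France is Paris."}]},
]

# Medical QA example: override the default loss mask so the passage is trained on.
medical_array = [
    {
        "type": "prompt",
        "content": [
            {"passage": "Aspirin inhibits platelet aggregation."},
            {"question": "What effect does aspirin have on platelets?"},
        ],
        "semantic_loss_mask": [1, 0],  # one entry per region in the content list
    },
    {"type": "completion", "content": [{"answer": "It inhibits platelet aggregation."}]},
]
```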
Read Hooks to Organize Semantic Data Arrays
We use read hooks to convert from different input formats to our semantic data array. Pre-built hooks are provided for standard input formats and masking schemes. But the hooks also allow the user to write code to transform arbitrary inputs into any valid configuration of the semantic data array.
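As a sketch, a custom read hook might look like the following (the hook name and raw-input fields are hypothetical; only the returned semantic data array format follows the specification above):

```python
def qa_read_hook(example: dict) -> list[dict]:
    """Convert a raw {"instruction": ..., "response": ...} record into a
    semantic data array (the input schema here is a made-up example)."""
    return [
        {
            "type": "prompt",
            "content": [{"text": example["instruction"]}],
            "semantic_drop_mask": [True],    # drop the prompt text entirely
            "semantic_attention_mask": [1],
            "semantic_loss_mask": [0],
        },
        {
            "type": "completion",
            "content": [{"text": example["response"]}],
            "semantic_loss_mask": [1],
        },
    ]
```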
In this example, the drop mask is set to True for the text region of “prompt,” indicating that this text portion will be dropped from the dataset and not tokenized. The semantic attention mask determines which regions contribute to the final attention mask passed to the model. A loss mask of 0 for a region means that the label tokens corresponding to that region will not be included in the loss calculation.
Each value in the semantic loss mask should be either 0 or 1.
Learn more about pre-built and custom read hooks here.