Learn about supported Token Generators for data preprocessing.
The token generator is instantiated based on the mode parameter that is specified in the config file (refer to Modes).
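For illustration only, the mode could be set in the setup section of the preprocessing config as sketched below. This is an assumption-laden sketch: "pretraining" is assumed to be a valid mode value, the placement of mode under setup is assumed, and all other fields are omitted.

```yaml
setup:
  # Assumption: "pretraining" selects the PretrainingTokenGenerator;
  # other modes mentioned on this page include "dpo", "nlg", and "custom".
  mode: "pretraining"
```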
Pretraining Token Generator Parameters
These parameters are used by the PretrainingTokenGenerator.
Flag | Default Value | Description |
---|---|---|
pack_sequences | True | Concatenate documents shorter than the maximum sequence length with other documents, instead of filling the remainder of the sequence with padding tokens. |
inverted_mask | False | If False, 0 represents masked positions. If True, 1 represents masked positions. |
seed | 0 | Random seed used for generating short sequences |
short_seq_prob | 0.0 | Probability of creating sequences which are shorter than the maximum sequence length. |
split_text_to_tokenize | False | Whether to split the text into smaller chunks before tokenization. This is helpful for very long documents with tokenizers such as the Llama tokenizer, whose runtime grows quadratically with the text length. |
chunk_len_to_split | 2000 | Length of the text chunks to split the text into before tokenization for slower tokenizers. Could be optionally used with the above flag split_text_to_tokenize. Without the previous flag, this argument will be ignored. |
remove_bos_in_chunks | False | Whether to remove the BOS token from the beginning of the chunks. Set this to True when using split_test_to_tokenize and chunk_len_to_split to avoid having multiple BOS tokens in the middle of the text. Not applicable to all tokenizers. |
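As an illustrative sketch only, a few of these flags could appear in the preprocessing config as shown below. The enclosing section name (dataset) is an assumption and may differ in your config template; the flag names themselves come from the table above.

```yaml
dataset:                        # section name is an assumption
  pack_sequences: True          # pack short documents together instead of padding
  short_seq_prob: 0.0
  seed: 0
  split_text_to_tokenize: False
  chunk_len_to_split: 2000      # ignored unless split_text_to_tokenize is True
```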
MLM Token Generator Parameters
These parameters are used for MLM task processing (i.e., when training_objective is set to mlm). MLM processing also uses the config parameters that are used by PretrainingTokenGenerator, in addition to the ones specified below.
Flag | Default Value | Description |
---|---|---|
mlm_fraction | 0.15 | Fraction of tokens to be masked in MLM tasks. |
mlm_with_gather | False | MLM processing mode. When set to True, the length of the returned labels is equal to mlm_fraction * max_seq_length; otherwise it is equal to max_seq_length. |
ignore_index | -100 | Required when mlm_with_gather is set to False. The presence of the ignore_index value at a position in the labels indicates that this position will not be used for loss calculation. |
excluded_tokens | ['<cls>', '<pad>', '<eos>', '<unk>', '<null_1>', '<mask>'] | Tokens to be excluded when masking. Provided only through the YAML config. |
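For illustration, the MLM-related flags could be configured as sketched below. The section name and the placement of training_objective are assumptions; the flag names and defaults come from the table above.

```yaml
dataset:                          # section name is an assumption
  training_objective: "mlm"       # enables MLM task processing
  mlm_fraction: 0.15
  mlm_with_gather: False
  ignore_index: -100
  excluded_tokens: ["<cls>", "<pad>", "<eos>", "<unk>", "<null_1>", "<mask>"]
```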
VSL Finetuning Token Generator Parameters
These parameters are used by the VSLFineTuningTokenGenerator. VSLFineTuningTokenGenerator also uses the config parameters that are used by FineTuningTokenGenerator, in addition to the ones specified below.
Flag | Default Value | Description |
---|---|---|
use_vsl | True | Generate examples with multiple sequences packed together |
position_ids_dtype | int32 | dtype of token position ids. |
VSL Pretraining Token Generator Parameters
These parameters are used by the VSLPretrainingTokenGenerator. use_vsl needs to be set to True in the train_input or eval_input section of the model config. VSLPretrainingTokenGenerator also uses the config parameters that are used by PretrainingTokenGenerator, in addition to the ones specified below.
Flag | Default Value | Description |
---|---|---|
use_vsl | True | Generate examples with multiple sequences packed together |
fold_long_doc | True | Fold documents larger than max_seq_length into multiple sequences, instead of dropping them. |
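As a sketch, VSL pretraining involves two settings: the packing flags in the preprocessing config (enclosing section name assumed) and use_vsl in the train_input or eval_input section of the model config, as stated above.

```yaml
# Preprocessing config (the "dataset" section name is an assumption)
dataset:
  use_vsl: True
  fold_long_doc: True

# Model config (separate file): use_vsl must also be set here
train_input:
  use_vsl: True
```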
DPO Token Generator Parameters
These parameters are used by the DPOTokenGenerator.
Flag | Default Value | Description |
---|---|---|
max_prompt_length | 512 | If the sequence exceeds max_seq_length, this parameter caps the prompt length to the specified limit. |
response_delimiter | <response> | Sets the separator between the prompt and the response. Users need not set this value for general use cases. |
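For illustration, DPO preprocessing could be configured as sketched below. Only the mode value is grounded in this page; the placement of max_prompt_length under a dataset section is an assumption.

```yaml
setup:
  mode: "dpo"                 # selects the DPOTokenGenerator
dataset:                      # section name is an assumption
  max_prompt_length: 512      # cap the prompt so prompt + response fits in max_seq_length
```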
Multimodal Pretraining Token Generator Parameters
These parameters are used by the MultiModalPretrainingTokenGenerator.
Flag | Default Value | Description |
---|---|---|
max_num_img | 1 | Maximum number of images allowed in one preprocessed sequence. Sequences with more than max_num_img images will be discarded |
num_patches | None | Number of patches to represent an image. This is determined by the patch-size (in pixels) of the image-encoder, and the pixel count of the input images. |
is_multimodal | False | Whether the dataset is multimodal (text plus images) or text only. Set it to True for multimodal tasks. |
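As an illustrative sketch, the multimodal pretraining flags could look like the following. The section name is an assumption, and the num_patches value is only an example (e.g., a 336x336 image with 14x14 pixel patches yields 24 x 24 = 576 patches).

```yaml
dataset:                 # section name is an assumption
  is_multimodal: True
  max_num_img: 1
  num_patches: 576       # example only: (336 / 14) * (336 / 14) = 576
```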
Multimodal Token Generator Parameters
Flag | Default Value | Description |
---|---|---|
max_num_img | 1 | Maximum number of images allowed in one preprocessed sequence. Sequences with more than max_num_img images will be discarded |
num_patches | 1 | Number of patches to represent an image. This is determined by the patch-size (in pixels) of the image-encoder, and the pixel count of the input images. |
is_multimodal | False | Whether the dataset is multimodal (text plus images) or text only. Set it to True for multimodal tasks. |
The following token generators are supported:

- PretrainingTokenGenerator: General-purpose pretraining on large text corpora. When training_objective is set to mlm, it performs MLM task processing. For multimodal pretraining, is_multimodal is set to True.
- FIMTokenGenerator: Designed for fill-in-the-middle (FIM) tasks. Initialized when training_objective is set to fim in the config file.
- VSLPretrainingTokenGenerator: Variable sequence length (VSL) pretraining, where multiple sequences are packed into each example. Initialized when use_vsl is set to True in the config file.
- FineTuningTokenGenerator: General-purpose fine-tuning. For multimodal fine-tuning, is_multimodal is set to True.
- VSLFineTuningTokenGenerator: Variable sequence length fine-tuning. Initialized when use_vsl is set to True in the config file.
- DPOTokenGenerator: Focused on direct preference optimization (DPO) during token generation. Initialized when mode is set to dpo.
- NLGTokenGenerator: Optimized for natural language generation tasks. Initialized when mode is set to nlg.
To write and use a custom token generator:
1. Ensure that the mode param is set to custom, in order to be able to specify your own token generator.
2. Specify the path to the custom token generator class in the config file, in the token_generator param, within the setup section. This would look like:
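The exact config is not reproduced in this excerpt; the following is a minimal sketch in which the module path and class name are placeholders for your own custom token generator, and the placement of mode under setup is an assumption.

```yaml
setup:
  mode: "custom"
  # Format: "<module.path>:<ClassName>" -- both names below are placeholders.
  token_generator: "my_package.my_token_generator:MyCustomTokenGenerator"
```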
Note that the token_generator path should be specified with the class name separated from the module name by a colon (:), so that the custom token generator can be instantiated correctly.
The custom token generator must implement an encode method, which tokenizes and encodes the data according to the user's definition. For more examples of what the encode method looks like, refer to the code of the pre-built token generators present in Model Zoo.
3. The signature of the encode method is given below, where it takes in a semantic_data_array:
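The original signature listing is not included in this excerpt. The sketch below shows roughly what such an encode method could look like; the class name, type annotations, and return structure are assumptions, so refer to the pre-built token generators in Model Zoo for the authoritative signature.

```python
from typing import Any, Dict, List, Tuple


class MyCustomTokenGenerator:
    """Hypothetical custom token generator, for illustration only."""

    def encode(
        self, semantic_data_array: List[Dict[str, Any]]
    ) -> Tuple[Dict[str, Any], Dict[str, int]]:
        # Tokenize and encode the semantic data array into model-ready
        # features (e.g., input ids, labels) plus per-call statistics.
        # The exact contents of both return values are user-defined.
        ...
```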