Token Generators
Learn about supported Token Generators for data preprocessing.
Token generators convert raw data into tokenized formats suitable for machine learning models, ensuring efficient and effective data processing. This guide covers the configuration of pre-built and custom token generators, along with examples and use cases.
Pre-Built Token Generators
Cerebras Model Zoo provides a comprehensive suite of pre-built token generators tailored to support various stages and tasks in the development of LLMs. Which token generator is initialized depends on the mode parameter specified in the config file (refer to Modes).
Flags Supported by Pre-Built Token Generators
Supported Token Generators - Pretraining Mode
- PretrainingTokenGenerator: General-purpose pretraining on large text corpora. When training_objective is set to mlm, it does MLM task processing. For multimodal pretraining, is_multimodal is set to True.
- FIMTokenGenerator: Designed for fill-in-the-middle tasks. Initialized when training_objective is set to fim in the config file.
- VSLPretrainingTokenGenerator: For visual and language pretraining. Initialized when use_vsl is set to True in the config file.
Supported Token Generators - Finetuning Mode
- FinetuningTokenGenerator: General-purpose fine-tuning. For multimodal finetuning, is_multimodal is set to True.
- VSLFinetuningTokenGenerator: Fine-tuning for visual and language tasks. Initialized when use_vsl is set to True in the config file.
Other Supported Token Generators
- DPOTokenGenerator: Focused on direct preference optimization (DPO) during token generation. Initialized when mode is set to dpo.
- NLGTokenGenerator: Optimized for natural language generation tasks. Initialized when mode is set to nlg.
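The selection rules listed above can be summarized in a short Python sketch. This is an illustration only, not the Model Zoo's actual dispatch code: the function name, the mode strings "pretraining" and "finetuning", the placement of all flags in a single dictionary, and the precedence of use_vsl over training_objective are assumptions made for the example.

```python
from typing import Any, Dict


def select_token_generator(setup: Dict[str, Any]) -> str:
    """Return the name of the pre-built token generator implied by the flags.

    Hypothetical helper: it mirrors the rules described in this guide and is
    not part of Cerebras Model Zoo.
    """
    mode = setup.get("mode")
    use_vsl = bool(setup.get("use_vsl", False))

    if mode == "pretraining":  # assumed spelling of the pretraining mode
        if use_vsl:  # assumption: use_vsl takes precedence over the objective
            return "VSLPretrainingTokenGenerator"
        if setup.get("training_objective") == "fim":
            return "FIMTokenGenerator"
        # Also covers training_objective == "mlm" and is_multimodal == True,
        # which are handled inside PretrainingTokenGenerator itself.
        return "PretrainingTokenGenerator"
    if mode == "finetuning":  # assumed spelling of the finetuning mode
        if use_vsl:
            return "VSLFinetuningTokenGenerator"
        return "FinetuningTokenGenerator"
    if mode == "dpo":
        return "DPOTokenGenerator"
    if mode == "nlg":
        return "NLGTokenGenerator"
    raise ValueError(f"No pre-built token generator for mode: {mode!r}")


if __name__ == "__main__":
    print(select_token_generator({"mode": "pretraining", "training_objective": "fim"}))
    print(select_token_generator({"mode": "finetuning", "use_vsl": True}))
```

The custom mode is deliberately left out of this sketch, since in that mode the class is named explicitly in the config file instead of being inferred from flags.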
Custom Token Generators
In addition to pre-built token generators, the Model Zoo allows users to implement custom token generators. This enables arbitrary transformations of the input data before tokenization.
To use custom token generators, ensure the configuration file is properly set up. Follow these steps:
1. Set the mode parameter to custom so that you can specify your own token generator.
2. Specify the path to the custom token generator class in the token_generator parameter, within the setup section of the config file.
The token_generator path must separate the class name from the module name with a colon (:) so that the custom token generator can be instantiated correctly.
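As a rough illustration of the two steps above, the sketch below shows the relevant config keys as a Python dictionary and one way a "module:ClassName" path can be resolved. The module and class names (my_package.my_generators, MyTokenGenerator) are hypothetical, and load_class is not a Model Zoo function. The sketch also assumes a dotted, importable module name; this guide does not say whether Model Zoo expects a module name or a file path, so check the Model Zoo documentation for the exact form.

```python
import importlib
from typing import Any, Dict

# Hypothetical contents of the setup section, written as a Python dict.
# Only the keys mode and token_generator come from this guide; the value of
# token_generator is a made-up example.
config: Dict[str, Any] = {
    "setup": {
        "mode": "custom",
        "token_generator": "my_package.my_generators:MyTokenGenerator",
    }
}


def load_class(path: str) -> type:
    """Resolve a 'module:ClassName' string to the class object it names."""
    module_name, sep, class_name = path.rpartition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"Expected 'module:ClassName', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


if __name__ == "__main__":
    # Demonstrated with a standard-library class so the example runs anywhere.
    print(load_class("collections:OrderedDict"))
```

Splitting on the last colon keeps the module part intact even if it contains dots, and a missing colon fails early with a clear message instead of an obscure import error.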
Class Implementation Guidelines
The custom token generator must adhere to the following guidelines:
1. The constructor’s signature must be as follows:
2. The custom token generator must implement an encode method, which tokenizes and encodes the data according to the user's definition. For more examples of what the encode method looks like, refer to the code of the pre-built token generators in Model Zoo.
3. The signature of the encode method is given below; it takes in a semantic_data_array:
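To make the overall shape of such a class concrete, here is a minimal, self-contained sketch. It is not the required Model Zoo interface. The constructor parameters (params, tokenizer, eos_id, pad_id), the layout of semantic_data_array (a list of dicts with a "content" string), the max_seq_length key, and the returned dictionary are all assumptions made for illustration, and WhitespaceTokenizer is a toy stand-in for a real tokenizer. Consult the pre-built token generators in Model Zoo for the exact signatures and return types.

```python
from typing import Any, Dict, List


class WhitespaceTokenizer:
    """Toy tokenizer (hypothetical): assigns each new word the next integer id."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        # Ids 0 and 1 are left free for the pad and end-of-sequence tokens.
        return [self.vocab.setdefault(w, len(self.vocab) + 2) for w in text.split()]


class MyTokenGenerator:
    """Sketch of a custom token generator; all signatures here are assumed."""

    def __init__(self, params: Dict[str, Any], tokenizer: Any, eos_id: int, pad_id: int) -> None:
        self.tokenizer = tokenizer
        self.eos_id = eos_id
        self.pad_id = pad_id
        self.max_seq_length = params.get("max_seq_length", 8)  # assumed key

    def encode(self, semantic_data_array: List[Dict[str, Any]]) -> Dict[str, List[int]]:
        token_ids: List[int] = []
        for region in semantic_data_array:
            # Example of a custom transformation before tokenization: lowercasing.
            token_ids.extend(self.tokenizer.encode(region["content"].lower()))
        # Truncate to leave room for the end-of-sequence token, then pad.
        token_ids = token_ids[: self.max_seq_length - 1] + [self.eos_id]
        num_pad = self.max_seq_length - len(token_ids)
        return {
            "input_ids": token_ids + [self.pad_id] * num_pad,
            "attention_mask": [1] * len(token_ids) + [0] * num_pad,
        }


if __name__ == "__main__":
    generator = MyTokenGenerator({"max_seq_length": 6}, WhitespaceTokenizer(), eos_id=1, pad_id=0)
    print(generator.encode([{"content": "Hello world"}, {"content": "hello again"}]))
```

The lowercasing step stands in for whatever transformation a project needs; the point of the sketch is only that encode receives the semantic data, applies custom logic, and returns tokenized features.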
Conclusion
Configuring token generators is an important step in the preprocessing pipeline for machine learning tasks on Cerebras Systems. By using the pre-built token generators provided by Cerebras Model Zoo, you can handle the various stages and tasks in the development of large language models. The option to implement custom token generators additionally allows for tailored transformations of input data to meet specific project requirements.
The introduction of on-the-fly data processing further enhances the preprocessing workflow by reducing storage needs and increasing adaptability during training and evaluation. The examples provided for pretraining and fine-tuning configurations illustrate how to set up these processes seamlessly.
Finally, the TokenFlow utility offers an invaluable tool for visualizing and debugging preprocessed data, ensuring data integrity and facilitating error detection. By following the guidelines and leveraging the tools outlined in this guide, you can optimize your preprocessing pipeline, leading to more efficient training and improved performance of your machine learning models on Cerebras Systems.