Training and Fine-Tuning a Large Language Model (LLM)
Estimated time: 30 minutes to 1 hour
This tutorial teaches you how to use the Cerebras Wafer-Scale cluster to train, fine-tune, and evaluate a Large Language Model (LLM) and visualize the results using TensorBoard. The model you will use is the GPT-3 111M with the WikiText-2 dataset. You need access to Cerebras’s Wafer-Scale cluster to train the model and fine-tune it from an existing checkpoint. Later, you will evaluate the models and visualize the results using TensorBoard. Finally, you will learn how to port the models to Hugging Face to generate outputs.
If you want to use a different dataset, you can still use the HDF5 dataset creation tool from Cerebras Model Zoo that is used in step 2. If you are using a larger GPT model, change the YAML configuration used in steps 4 and 5. For examples, see the GPT-3 model configurations in Cerebras Model Zoo GPT and the other GPT-3 configurations.
Prerequisites
You have access to the user node in the Cerebras Wafer-Scale cluster. Contact your system administrator if you face any issues with the system configuration.
Setting up the Cerebras Model Zoo
For convenience, you will set up the data and code inside a parent folder called demo, a directory that you will have to create yourself. In addition, you will define an environment variable called PARENT_CS so that you can return to this parent directory at any time.
1. Create a parent directory, demo, to include all the data, code, and checkpoints. Export an environment variable PARENT_CS with the full path to the parent directory. This environment variable will be helpful when pointing to the absolute path during execution.
2. Inside the demo folder, clone the Cerebras Model Zoo, a repository with reference models and tools to run on the Cerebras Wafer-Scale cluster.
At this point, you should have a folder called modelzoo inside the demo parent directory.
Preparing data for the Cerebras Wafer-Scale cluster
Cerebras Model Zoo contains customized scripts to download and prepare the OWT and Pile datasets. Refer to them if you train GPT-style models with either dataset. Hierarchical Data Format version 5 (HDF5) files provide an efficient way of loading data into the Cerebras Wafer-Scale cluster. The Cerebras Model Zoo contains a create_hdf5_dataset.py script that tokenizes the input documents and creates token IDs, labels, and attention masks. For more information, refer to hdf5_dataset_gpt.
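To make "token IDs, labels, and attention masks" concrete, here is a minimal sketch of how these three arrays relate for one fixed-length sequence in an autoregressive language model. It is not the implementation in create_hdf5_dataset.py: the whitespace tokenizer, the toy vocabulary, and the names build_features, max_seq_length, and pad_id are all assumptions made for this illustration.

```python
def build_features(text, vocab, max_seq_length, pad_id=0):
    """Turn one document into input IDs, next-token labels, and a mask."""
    # Toy tokenizer (assumption): split on whitespace and look up each word.
    token_ids = [vocab[word] for word in text.split()]

    # An autoregressive model predicts token t+1 from the tokens up to t,
    # so the labels are the inputs shifted left by one position.
    token_ids = token_ids[: max_seq_length + 1]
    input_ids = token_ids[:-1]
    labels = token_ids[1:]

    # The mask is 1 on real positions and 0 on padding positions.
    num_real = len(input_ids)
    num_pad = max_seq_length - num_real
    attention_mask = [1] * num_real + [0] * num_pad
    input_ids = input_ids + [pad_id] * num_pad
    labels = labels + [pad_id] * num_pad
    return input_ids, labels, attention_mask


vocab = {"the": 1, "cat": 2, "sat": 3, "down": 4}
ids, labels, mask = build_features("the cat sat down", vocab, max_seq_length=5)
print(ids)     # [1, 2, 3, 0, 0]
print(labels)  # [2, 3, 4, 0, 0]
print(mask)    # [1, 1, 1, 0, 0]
```

A real preprocessing run would use a trained subword tokenizer and pack many documents into sequences, but the relationship between the three arrays stays the same.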
1. Download the WikiText-2 raw dataset: create a data folder and, inside it, download the zip file containing the dataset.
Decompress the zip file.
Remove the zip file.
2. Inside the wikitext-2-raw-v1 folder, you will find three files corresponding to the train, validation, and test splits. Explore these files to understand the raw dataset format. The data is in txt format.
The create_hdf5_dataset.py script expects all the files in the same folder to belong to the same split. Therefore, create three folders (train, test, and valid) and move the corresponding files into each folder. In addition, add a .txt file extension to each file, since each file is in raw format.
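The folder layout described above can be sketched in Python as follows. The helper organize_splits is hypothetical, and the three file names are assumptions about what the extracted archive contains, so check them against your own directory listing before reusing this. The demonstration runs on a scratch directory with empty placeholder files instead of touching real data.

```python
import tempfile
from pathlib import Path


def organize_splits(raw_dir, file_for_split):
    """Move each raw file into its own split folder and append '.txt'."""
    raw_dir = Path(raw_dir)
    moved = {}
    for split, filename in file_for_split.items():
        split_dir = raw_dir / split
        split_dir.mkdir(exist_ok=True)
        target = split_dir / (filename + ".txt")
        (raw_dir / filename).rename(target)
        moved[split] = target
    return moved


# Assumed file names; verify them against your extracted archive.
FILES = {
    "train": "wiki.train.raw",
    "valid": "wiki.valid.raw",
    "test": "wiki.test.raw",
}

# Demonstrate on a scratch directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in FILES.values():
        (Path(tmp) / name).touch()
    result = organize_splits(tmp, FILES)
    layout = sorted(p.relative_to(tmp).as_posix() for p in result.values())
    print(layout)
```

To apply it to the real dataset, pass the path of the extracted folder in place of the temporary directory.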
The create_hdf5_dataset.py script also parses the file types .json, .jsonl.zst, and .jsonl.zst.tar. For more information on the input format, refer to hdf5_dataset_gpt.
Preparing the data in HDF5 format
1. Before you prepare the data in HDF5 format, set up a Python virtual environment with all the dependencies needed for preprocessing. Create this virtual environment in the parent directory demo. For more information on the HDF5 format, click here.
Note that you should now be in the (venv_cerebras_pt) environment.
2. Now that your venv_cerebras_pt virtual environment has been set up and activated, launch the create_hdf5_dataset.py script. This script resides in the modelzoo folder in the Cerebras Model Zoo repository. Launch this script for every data split you intend to use. In this case, you will preprocess the train and valid splits and save the preprocessed data in HDF5 format in the data/wikitext-2-hdf5/ folder.
Prepare the data for the train set:
The create_hdf5_dataset.py script accepts multiple flags. The invocation in this tutorial uses the following flags:
Flag | Description |
---|---|
LMData | For processing language modelling datasets that are in .jsonl or .txt format. |
--params ./configs/autoregressive_lm_preprocessing.yaml | Path to YAML config file for setting dataset preprocessing parameters. Optional alternative for providing command line arguments. |
--input_dir data/wikitext-2-raw/train/ | Folder where all the .txt files are located. In this case, there is only one .txt file. |
--output_dir data/wikitext-2-hdf5/train/ | Destination folder |
--files_per_record 100 | Files per record is set to 100 in comparison to the default setting, which is 50,000. Given that there are 1180 training samples, the number of files per record should be smaller than the total number of samples. After preprocessing, the total number of samples will be floor(#samples/files_per_record) * files_per_record |
Table 2 HDF5 dataset flags
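The formula in the --files_per_record row can be checked with a few lines of Python. This sketch only evaluates the arithmetic stated in the table; the function name samples_after_preprocessing is made up for the example.

```python
def samples_after_preprocessing(num_samples, files_per_record):
    """Apply floor(#samples / files_per_record) * files_per_record."""
    return (num_samples // files_per_record) * files_per_record


# With the tutorial's setting, 1180 samples are rounded down to 1100.
print(samples_after_preprocessing(1180, 100))     # 1100
# With the default of 50,000, the formula leaves no samples at all.
print(samples_after_preprocessing(1180, 50_000))  # 0
```

The second call illustrates why the value must be smaller than the number of samples: by the formula, the default would round the dataset down to zero.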
After running the Python script, you should have an output directory: data/wikitext-2-hdf5/train/. The following list shows the content of the output directory:
3. Prepare the data in the validation set by changing --input_dir to data/wikitext-2-raw/valid/ and --output_dir to data/wikitext-2-hdf5/valid/:
You should have an output directory: data/wikitext-2-hdf5/valid/. The following list shows the content of the output directory:
Note
The create_hdf5_dataset.py script only tokenizes your dataset. Perform any data cleaning and shuffling before preparing the data in HDF5 format; how much is needed depends on the quality of your dataset. Additional resources available in Cerebras Model Zoo can be found in Data processing and dataloaders.
Training GPT-3 111M model from scratch
1. Create a copy of the Cerebras-GPT 111M YAML configuration file, which you will edit to point to the WikiText-2 dataset preprocessed in step 2. Make sure that the venv_cerebras_pt environment is active.
2. Determine the absolute path of the data/wikitext-2-hdf5/train directory. You can obtain it using realpath, or by appending data/wikitext-2-hdf5/train/ to the absolute path in $PARENT_CS.
3. With the absolute path set, modify the custom_config_GPT111M.yaml file.
4. Modify the data_dir flag for training and evaluation inputs:
5. To shorten the runtime, modify the number of max_steps for training, the eval_steps for evaluation, and the checkpoint frequency.
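The edits in steps 4 and 5 amount to changing a handful of keys in the configuration. The sketch below applies them to a nested dictionary that stands in for the parsed YAML file. It assumes that data_dir lives under train_input and eval_input and that max_steps, eval_steps, and checkpoint_steps live under runconfig. The starting values, the step counts, and the helper customize_config are placeholders chosen for illustration, not values taken from the real configuration file.

```python
import copy
import os


def customize_config(config, data_root, max_steps, eval_steps, checkpoint_steps):
    """Return a copy of config that points at the HDF5 data for a short run."""
    updated = copy.deepcopy(config)
    updated["train_input"]["data_dir"] = os.path.join(data_root, "train")
    updated["eval_input"]["data_dir"] = os.path.join(data_root, "valid")
    # Assumption: these three settings live under the runconfig section.
    updated["runconfig"]["max_steps"] = max_steps
    updated["runconfig"]["eval_steps"] = eval_steps
    updated["runconfig"]["checkpoint_steps"] = checkpoint_steps
    return updated


# Stand-in for the parsed YAML file; all values here are placeholders.
base = {
    "train_input": {"data_dir": "<original train path>"},
    "eval_input": {"data_dir": "<original eval path>"},
    "runconfig": {"max_steps": 1000, "eval_steps": 100, "checkpoint_steps": 500},
}
parent = os.environ.get("PARENT_CS", "/path/to/demo")  # hypothetical fallback
data_root = os.path.join(parent, "data", "wikitext-2-hdf5")
custom = customize_config(
    base, data_root, max_steps=10, eval_steps=5, checkpoint_steps=10
)
print(custom["train_input"]["data_dir"])
print(custom["runconfig"])
```

To apply the same changes to custom_config_GPT111M.yaml itself, edit the file by hand or load and dump it with a YAML library of your choice.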
With the new YAML configuration, you are ready to launch the training job.
Optionally, you can use screen, a terminal multiplexer, so that your training does not stop if you lose access to the terminal. To do so, start a new screen session called train_wsc.
screen -S train_wsc
6. Inside this screen session, if not already active, activate the Cerebras virtual environment venv_cerebras_pt.
7. Run the run.py script file associated with the GPT-3 models in Cerebras Model Zoo.
Each model in Cerebras Model Zoo contains a run.py script instrumented to easily launch training and evaluation on the Cerebras Wafer-Scale cluster and other AI accelerators. To learn more about how to launch a training job, visit Launch your job. To view the run.py code for the GPT-3 model, see the GPT-3 model directory in Cerebras Model Zoo.
The following table describes the flags:
Flag | Description |
---|---|
--params custom_config_GPT111M.yaml | Points to the YAML configuration file you customized with the training data paths |
--num_csx=1 | Number of CS-X systems used in the run |
--model_dir train_from_scratch_GPT111M | New directory containing all the checkpoints and logging information |
--mode train | Specifies that you are training the model as opposed to evaluation |
--mount_dirs $PARENT_CS $PARENT_CS/modelzoo | Mounts directories to the Cerebras Wafer-Scale cluster. In this case, all the data and code is in the parent directory demo (with absolute path $PARENT_CS ) |
--python_paths $PARENT_CS/modelzoo | Enables adding the Cerebras Model Zoo to the list of paths to be exported to PYTHONPATH |
--job_labels model=GPT111M | A list of equal-sign-separated key-value pairs used as job labels, which you can query using csctl. |
Table 3 GPT-3 model run.py training flags
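As a rough sketch, the flags in Table 3 can be assembled into a single command line as shown below. Only the flags and values listed in the table are used. The script path "run.py" and the fallback value for PARENT_CS are assumptions, and depending on your Model Zoo release run.py may require further arguments that the table does not list. The code only prints the command and does not launch anything.

```python
import os
import shlex

parent = os.environ.get("PARENT_CS", "/path/to/demo")  # hypothetical fallback
modelzoo = os.path.join(parent, "modelzoo")

# Flags copied from Table 3; the script location "run.py" is an assumption.
command = [
    "python", "run.py",
    "--params", "custom_config_GPT111M.yaml",
    "--num_csx=1",
    "--model_dir", "train_from_scratch_GPT111M",
    "--mode", "train",
    "--mount_dirs", parent, modelzoo,
    "--python_paths", modelzoo,
    "--job_labels", "model=GPT111M",
]

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Keeping the arguments in a list makes it easy to swap a single flag, for example the model directory or the mode, when you reuse the command for later steps.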
Note
Once you have submitted your job to execute in the Cerebras Wafer-Scale cluster, you can track the progress or kill your job using the csctl tool. You can also monitor the performance using a Grafana dashboard.
8. Detach the screen using CTRL+A D and reattach the screen using screen -r train_wsc.
Once the training job is complete, you will find the following inside the train_from_scratch_GPT111M folder:
In this folder, latest_compile and cerebras_logs contain Cerebras-specific information collected while compiling and executing the model, checkpoint_10.mdl is the checkpoint saved after ten steps, train contains the metrics logged during execution and the YAML configuration file, and run_<data>_######.log contains the command line output during execution.
Validating during training
To enable periodic validation during training, set the --mode flag to train_and_eval and use the following additional parameters in the runconfig section to configure the run:
Runconfig Param | Description |
---|---|
eval_frequency | Frequency at which the evaluation loop runs during training |
eval_steps | Number of steps to run in each evaluation loop |
If eval_steps isn’t specified, each evaluation loops over the entire evaluation dataloader, provided that the dataloader has a length. Otherwise, the run errors out.
Important considerations
1. Checkpoint and Evaluation Frequency:
When using train_and_eval mode, make sure that checkpoint_steps is either a multiple of eval_frequency or vice versa, so that the frequency at which evaluations occur lines up with the frequency at which checkpoints are taken.
- Example 1: eval_frequency=5, checkpoint_steps=5. Take a checkpoint every 5 steps and run an evaluation every 5 steps.
- Example 2: eval_frequency=5, checkpoint_steps=10. Take a checkpoint every 10 steps, but perform an evaluation every 5 steps. Note that at steps 5, 15, 25, … there are no checkpoints available, but evaluation results are available.
- Example 3: eval_frequency=10, checkpoint_steps=5. Take a checkpoint every 5 steps, but perform an evaluation every 10 steps.
A checkpoint will always be taken at the end of training if checkpoint_steps is set to a positive number, regardless of whether num_steps is a multiple of checkpoint_steps. Additionally, in train_and_eval mode, an evaluation will always run at the final step, regardless of whether num_steps is a multiple of eval_frequency.
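The interaction between the two frequencies and the final step can be worked through with a short sketch. It lists the steps at which evaluations and checkpoints would occur under the rules described above, assuming that steps are counted from 1 and that checkpoint_steps is positive. The function name schedule is made up for this example, and the code models the rules as stated here, not the trainer's internal logic.

```python
def schedule(num_steps, eval_frequency, checkpoint_steps):
    """Return (evaluation steps, checkpoint steps) as sorted lists."""
    evals = set(range(eval_frequency, num_steps + 1, eval_frequency))
    ckpts = set(range(checkpoint_steps, num_steps + 1, checkpoint_steps))
    # The final step always gets an evaluation and a checkpoint, even when
    # num_steps is not a multiple of either frequency.
    evals.add(num_steps)
    ckpts.add(num_steps)
    return sorted(evals), sorted(ckpts)


evals, ckpts = schedule(num_steps=23, eval_frequency=5, checkpoint_steps=10)
print(evals)  # [5, 10, 15, 20, 23]
print(ckpts)  # [10, 20, 23]
# Steps with evaluation results but no checkpoint to go with them.
print([step for step in evals if step not in ckpts])  # [5, 15]
```

With eval_frequency=5 and checkpoint_steps=10, steps 5 and 15 have evaluation results without a matching checkpoint, which is the situation described in Example 2.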
2. Evaluation Code in the Model:
Ensure that evaluation-specific code (such as eval metrics) is guarded by the PyTorch module’s training flag to distinguish between training and evaluation.
For example:
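The following is a minimal sketch of the guard pattern, written in plain Python so that it runs without PyTorch installed. TinyModel is a hypothetical stand-in that mimics the training attribute and the train() and eval() methods that torch.nn.Module provides. In a real model you would subclass torch.nn.Module and inherit them. The eval_losses list stands in for whatever evaluation metrics your model records.

```python
class TinyModel:
    """Plain-Python stand-in that mimics torch.nn.Module's training flag."""

    def __init__(self):
        self.training = True  # torch.nn.Module also starts in training mode
        self.eval_losses = []

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)

    def forward(self, loss):
        # Evaluation-only bookkeeping is skipped while training.
        if not self.training:
            self.eval_losses.append(loss)
        return loss


model = TinyModel()
model.forward(2.5)         # training step: nothing is recorded
model.eval().forward(2.0)  # evaluation step: the loss is recorded
print(model.eval_losses)   # [2.0]
```

Because the same forward code serves both graphs in train_and_eval mode, the guard keeps evaluation metrics out of the training path.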
Limitations of Train and Eval mode
1. Single Dataloader: Only a single training dataloader and a single evaluation dataloader can be used in the same configuration file. To evaluate with multiple datasets, they must be concatenated into a single dataloader.
2. Eval Harness: Running the Eval Harness as part of train_and_eval mode is not yet supported.
3. Runconfig Limitations: Settings in runconfig (such as precision_opt_level, or POL) cannot be nested per mode, so they apply to both the training and evaluation graphs. Note, however, that micro_batch_size is part of train_input / eval_input. Therefore, it is possible to specify different micro-batch sizes for training and evaluation, provided that the micro_batch_size for evaluation is less than or equal to the micro_batch_size for training.
4. Cluster Configuration: It’s not currently possible to run training and evaluation on separate portions of a cluster within a single job. To do so, you must manually launch separate run.py jobs.
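The micro-batch constraint in item 3 is easy to check before submitting a job. The sketch below assumes the configuration has been loaded into a nested dictionary with micro_batch_size under train_input and eval_input, as described above. The helper check_micro_batch_sizes and the values 64 and 32 are hypothetical and chosen only to exercise the check.

```python
def check_micro_batch_sizes(config):
    """Raise ValueError if the eval micro_batch_size exceeds the train one."""
    train_mbs = config["train_input"]["micro_batch_size"]
    eval_mbs = config["eval_input"]["micro_batch_size"]
    if eval_mbs > train_mbs:
        raise ValueError(
            f"eval micro_batch_size ({eval_mbs}) must be <= "
            f"train micro_batch_size ({train_mbs})"
        )
    return train_mbs, eval_mbs


# Hypothetical values, chosen only to exercise the check.
config = {
    "train_input": {"micro_batch_size": 64},
    "eval_input": {"micro_batch_size": 32},
}
print(check_micro_batch_sizes(config))  # (64, 32)
```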
Fine-tuning using checkpoints from Cerebras-GPT model
1. Download the Cerebras-GPT 111M checkpoint compatible with the Cerebras Wafer-Scale Cluster from Cerebras Model Zoo.
You can find all the Cerebras-GPT checkpoints on the Cerebras-GPT page in Cerebras Model Zoo.
Once you have the checkpoint, you are ready to launch the training job. For this, you can start a new screen session or reuse the screen session used during training from scratch. To reattach the train_wsc screen, use:
screen -r train_wsc
2. Inside this screen session, if not already active, activate the Cerebras virtual environment venv_cerebras_pt.
You will now run the same run.py script used during training from scratch, with new flags.
3. Change the following flags:
Flag | Description |
---|---|
--model_dir finetune_GPT111M | Specifies a different model directory to save checkpoints and logging information |
--checkpoint_path Cerebras-checkpoint/cerebras-gpt-dense-111m-sp-checkpoint_final.mdl | Specifies the path to the Cerebras-GPT 111M checkpoint. You will be using this checkpoint to initialize the model weights |
--load_checkpoint_states="model" | Flag used in conjunction with checkpoint_path to enforce resetting of optimizer states and training steps after loading a given checkpoint. By setting this flag, all the model weights are initialized from the checkpoint provided by checkpoint_path, training starts from step 0, and optimizer states present in the checkpoint are ignored. Useful for fine-tuning runs on different tasks (e.g., classification, Q&A, etc.) where weights from a pre-trained model trained on language modeling (LM) tasks are loaded, or for fine-tuning on a different dataset on the same LM task. |
Table 4 GPT-3 model run.py fine-tuning flags
Once the training job is over, you will find the following inside the finetune_GPT111M folder:
The run_<data>_######.log file contains the command line output generated during execution.
Evaluating the trained models and visualizing the results using TensorBoard
For all models, you can use the run.py script found in the Cerebras Model Zoo for evaluation purposes (i.e., only forward pass). GPT-style models use the data specified in eval_input.data_dir, which points to the data you prepared in step 2. The run.py script provides the following types of evaluation through the --mode flag:
Flag | Description |
---|---|
eval | Evaluates a specific checkpoint. The latest checkpoint will be used if you don’t provide the --checkpoint_path flag |
eval_all | Evaluates all the checkpoints inside a model directory once the model has been trained |
train | Evaluates the model periodically during the training process. |
train_and_eval | Evaluates a model at a fixed frequency during training. This is convenient for identifying issues early in long training runs |
Table 5 GPT-3 model run.py evaluation flags
Note
The train and eval modes require different fabric programming in the CS-X system. Therefore, using train_and_eval mode in the Cerebras Wafer-Scale cluster incurs additional overhead each time training is paused to perform evaluation. When possible, we recommend using the eval_all mode instead.
Given that you have already trained the model in this demo, you will use the eval_all mode. To learn more about the different types of evaluation, visit eval.
Note
When evaluating a model with run.py, the latest saved checkpoint will be used by default. If no checkpoint exists, then weights will be initialized as stated in the YAML file, and the model will be evaluated using these weights. If you want to evaluate a previously trained model, make sure that the checkpoints are available in the model_dir or provide the --checkpoint_path flag.
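To see which checkpoint counts as the latest in a model directory, you can inspect the file names yourself. The sketch below is an illustration only and does not reproduce how run.py selects a checkpoint. It assumes checkpoints are named checkpoint_<step>.mdl, as in checkpoint_10.mdl from this tutorial, and that the latest one is the one with the highest step number. The helper latest_checkpoint and the sample listing are made up for the example.

```python
import re

CHECKPOINT_RE = re.compile(r"^checkpoint_(\d+)\.mdl$")


def latest_checkpoint(filenames):
    """Return the checkpoint file with the highest step number, or None."""
    best_step, best_name = -1, None
    for name in filenames:
        match = CHECKPOINT_RE.match(name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), name
    return best_name


# Hypothetical directory listing; in practice pass os.listdir(model_dir).
files = ["checkpoint_5.mdl", "checkpoint_10.mdl", "cerebras_logs", "train"]
print(latest_checkpoint(files))      # checkpoint_10.mdl
print(latest_checkpoint(["train"]))  # None
```

Comparing the step numbers as integers matters here: a plain string sort would rank checkpoint_5.mdl after checkpoint_10.mdl.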
1. To launch evaluation, reattach the train_wsc screen.
2. Inside this screen session, if not already active, activate the Cerebras virtual environment venv_cerebras_pt.
Evaluating the model trained from scratch
To evaluate the model trained from scratch, run the run.py script associated with the GPT-3 models in the Cerebras Model Zoo. This is the same run.py script used during training from scratch, but you will change the following flag:
Flag | Description |
---|---|
--mode eval_all | Specifies the evaluation mode |
After the validation job is complete, you will find additional files inside the train_from_scratch_GPT111M folder, including an eval folder that contains the metrics logged during the model evaluation and a new run_<data>_######.log with the command line output during model evaluation. Here is an example of the command line output.
Evaluating the model fine-tuned from Cerebras-GPT
To evaluate the fine-tuned model, run the run.py script associated with the GPT-3 models in the Cerebras Model Zoo. This is the same run.py script used to evaluate the model trained from scratch, but you will change the following flag:
Flag | Description |
---|---|
--model_dir finetune_GPT111M | Specifies the model directory that contains the checkpoints from the fine-tuned model |
After the validation job is complete, you will find additional files inside the finetune_GPT111M folder, including an eval folder that contains the metrics logged during the model evaluation and a new run_<data>_######.log with the command line output during model evaluation. Here is an example of the command line output.
Visualizing the results using TensorBoard
Launch TensorBoard to visualize the metrics recorded during training and evaluation of the models. Make sure that the Cerebras virtual environment is active.
Note
When hosting multiple TensorBoard instances concurrently, you may see the error "Port 6006 is in use by another program. Either identify and stop that program, or start the server with a different port." As a workaround, you can launch TensorBoard on a different port with the --port flag.
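One way to pick a port that is not already in use is to ask the operating system for one, as in the sketch below. The helper find_free_port is hypothetical, and the choice of "." as the log directory is an assumption: point --logdir at the folder that holds your model directories. The code only prints the TensorBoard command and does not launch it.

```python
import shlex
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the operating system for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


port = find_free_port()
# Assumption: the current directory contains the model directories.
command = ["tensorboard", "--logdir", ".", "--port", str(port)]
print(shlex.join(command))
```

The port is released before TensorBoard starts, so another process could claim it in the meantime. If that happens, rerun the sketch to get a new port.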
After launching TensorBoard, you will get a URL that you can open to view the results from your local network.
Fig. 8 Metrics logged in the GPT implementation in Cerebras Model Zoo
Fig. 9 Training and validation loss after 10 steps for the model trained from scratch (train_from_scratch_GPT111M) and the model fine-tuned from the Cerebras-GPT checkpoint (finetune_GPT111M)
What’s next?
Now that you have trained and fine-tuned your GPT model in the Cerebras Wafer-Scale cluster, try porting your model to Hugging Face to generate outputs. The Cerebras Model Zoo contains conversion scripts to convert from Cerebras Model Zoo to Hugging Face. For more information on conversion scripts, refer to Convert checkpoints and model configs.
In addition, if you want to fine-tune your LLM on a dataset of instructions, refer to our how-to guide on training with instruction tuning (../how_to_guides/train_with_instruction_tuning).