Pre-train Your First Model
Follow this guide to pre-train your first model on a Cerebras system.
Overview
This tutorial teaches you Cerebras essentials: data preprocessing and training scripts, config files, and checkpoint conversion tools. To learn these concepts hands-on, you’ll pre-train Meta’s Llama 3 8B on 40,000 lines of Shakespeare.
In this quickstart guide, you will:
- Set up your environment
- Preprocess a small dataset
- Pre-train and evaluate a model
- Port your model to Hugging Face
In this tutorial, you will train your model only briefly and on a small dataset. A high-quality model requires a much longer training run and a much larger dataset.
Prerequisites
To begin this guide, you must have:
- Cerebras system access. If you don’t have access, contact Cerebras Support.
- Completed setup and installation.
Step 1: Setup
Set Environment Variables
Start by saving common paths in environment variables for easy access, including:
- The parent directory above Model Zoo
- The location of data preprocessing scripts
- The location of training scripts (in this case, Llama 3)
- The location of scripts for converting checkpoints to Hugging Face
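For example, assuming you cloned the Model Zoo repository under `$HOME` (the variable names here are placeholders, and the directory layout may differ between Model Zoo releases):

```bash
# Parent directory above the Model Zoo repository (adjust to your setup)
export PARENT_DIR="$HOME"

# Location of the data preprocessing scripts
export DATA_PREP_DIR="$PARENT_DIR/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing"

# Location of the Llama 3 training scripts
export TRAINING_DIR="$PARENT_DIR/modelzoo/src/cerebras/modelzoo/models/nlp/llama"

# Location of the checkpoint conversion scripts
export CONVERSION_DIR="$PARENT_DIR/modelzoo/src/cerebras/modelzoo/tools"
```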
Create Model Directory
Create a dedicated folder for assets (data/model configs) and generated files (processed data files, checkpoints, logs, etc.):
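For example, from your working directory:

```bash
mkdir pretraining_tutorial
```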
Copy Training and Eval Configs
Copy sample configs into your folder. You will use these to control Model Zoo scripts for efficient training and evaluation of large models.
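A sketch of what this might look like, using the variables you just set. The source file names below are hypothetical (sample config names vary between Model Zoo releases), as are the `train_config.yaml` and `eval_config.yaml` targets used throughout the rest of this tutorial:

```bash
# Copy a sample Llama 3 8B training config (source file name is hypothetical)
cp $TRAINING_DIR/configs/params_llama3_8b.yaml pretraining_tutorial/train_config.yaml

# Copy a sample evaluation config (source file name is hypothetical)
cp $TRAINING_DIR/configs/params_llama3_8b_eval.yaml pretraining_tutorial/eval_config.yaml
```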
For data preprocessing, we’ll create a config manually.
Create Data Config
- Copy the code block below.
- Create a YAML file with it. Name this file `train_data_config.yaml`.
- Place the file in the `pretraining_tutorial` directory you created earlier.
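A minimal sketch of what `train_data_config.yaml` might contain; the field names follow the Model Zoo data preprocessing schema but may vary between releases, so cross-check against the sample configs in your installation:

```yaml
setup:
  data:
    type: "huggingface"
    source: "karpathy/tiny_shakespeare"
    split: "train"
  mode: "pretraining"
  output_dir: "pretraining_tutorial/train_data/"
  processes: 1

processing:
  huggingface_tokenizer: "baseten/Meta-Llama-3-tokenizer"
  max_seq_length: 8192
  shuffle_seed: 0
```

To preprocess the validation data as well, create an analogous `valid_data_config.yaml` with `split: "validation"` and `output_dir: "pretraining_tutorial/valid_data/"`.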
This config file processes the “train” split of the `karpathy/tiny_shakespeare` dataset from Hugging Face, tokenizing it with `baseten/Meta-Llama-3-tokenizer`.
An example from the “train” split looks as follows (the dataset is raw Shakespeare dialogue):
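```
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.
```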
If you are interested, you can read more about the various parameters and pre-built utilities for preprocessing common data formats. You can also follow end-to-end tutorials for various use cases such as instruction fine-tuning and extending context lengths using position interpolation.
Inspect Model Config (optional)
Take a look at your model config:
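Assuming the `train_config.yaml` name used when you copied the sample configs above:

```bash
cat pretraining_tutorial/train_config.yaml
```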
Here’s what you should see in your terminal:
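The exact contents depend on the sample config shipped with your Model Zoo release; an abridged sketch looks something like this (the architecture values shown are the standard Llama 3 8B settings):

```yaml
trainer:
  init:
    model_dir: "pretraining_tutorial/model"
    model:
      name: "llama"
      hidden_size: 4096
      num_hidden_layers: 32
      num_heads: 32
      vocab_size: 128256
      max_position_embeddings: 8192
      position_embedding_type: "rotary"
    # ... optimizer, scheduler, checkpoint, and logging settings follow
```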
These parameters specify the full architecture of the Llama 3 8B model and help define a Trainer object for training, validation, and logging semantics.
If you are interested, learn more about model configs here, or dive into how to set up flexible training and evaluation. You can also follow end-to-end tutorials for various use cases.
Inspect Evaluation Config (optional)
Take a look at your evaluation config:
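Assuming the `eval_config.yaml` name from the copy step above:

```bash
cat pretraining_tutorial/eval_config.yaml
```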
Here is what you should see in your terminal:
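A sketch of the relevant portion; the Model Zoo wires the Eleuther Eval Harness in as a trainer callback, though the exact field names below are not guaranteed to match your release:

```yaml
trainer:
  init:
    callbacks:
      - EleutherEvalHarness:
          eeh_args:
            tasks: "winogrande"
            num_fewshot: 0
```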
This file lets you evaluate your model on `winogrande`, a multiple-choice (non-generative) eval harness task, on a single CSX system.
If you are interested, you can learn more about validating models using the Eleuther or BigCode Evaluation Harness in our documentation.
Step 2: Preprocess data
Preprocess Training and Validation Data
Use your data configs to preprocess your “train” and “validation” datasets:
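A sketch, using the `$DATA_PREP_DIR` variable from Step 1 (the script location may differ in your release):

```bash
# Preprocess the "train" split
python $DATA_PREP_DIR/preprocess_data.py \
    --config pretraining_tutorial/train_data_config.yaml

# Preprocess the "validation" split
python $DATA_PREP_DIR/preprocess_data.py \
    --config pretraining_tutorial/valid_data_config.yaml
```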
You should then see your preprocessed data in `pretraining_tutorial/train_data/` and `pretraining_tutorial/valid_data/` (see the `output_dir` parameter in your data configs).
Inspect Preprocessed Data (optional)
Once you’ve preprocessed your data, you can visualize the outcome:
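A sketch, assuming the TokenFlow launcher ships alongside the preprocessing scripts (the path and flag are assumptions; check your Model Zoo installation):

```bash
python $DATA_PREP_DIR/tokenflow/launch_tokenflow.py \
    --output_dir pretraining_tutorial/train_data/
```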
In your terminal, you will see a URL like `http://172.31.48.239:5000`. Copy and paste it into your browser to launch TokenFlow, a tool for interactively visualizing whether loss and attention masks were applied correctly.
Step 3: Train and Evaluate Model
Modify Configs
Set `train_dataloader.data_dir` and `val_dataloader.data_dir` in your model config to the absolute paths of your preprocessed data:
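For example (the absolute paths below are placeholders for your own, and exactly where these sections live depends on your config; look for the `train_dataloader` and `val_dataloader` keys):

```yaml
trainer:
  fit:
    train_dataloader:
      data_dir: "/absolute/path/to/pretraining_tutorial/train_data"
    val_dataloader:
      data_dir: "/absolute/path/to/pretraining_tutorial/valid_data"
```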
Submit Training Job
Train your model by passing your updated model config, the locations of important directories, and required Python packages to a run script.
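A sketch of a typical launch, using the variables from Step 1; flags vary between Model Zoo releases, so check `python $TRAINING_DIR/run.py --help` on your system:

```bash
python $TRAINING_DIR/run.py CSX \
    --mode train \
    --params pretraining_tutorial/train_config.yaml \
    --num_csx 1 \
    --model_dir pretraining_tutorial/model \
    --mount_dirs $PARENT_DIR \
    --python_paths $PARENT_DIR/modelzoo/src
```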
Progress will be logged to your terminal while the job runs.
Once training is complete, you will find several artifacts in the `pretraining_tutorial/model` folder (see the `model_dir` parameter in your model config). These include:
- Checkpoints
- TensorBoard event files
- Run logs
- A copy of the model config
Inspect Training Logs (optional)
Monitor your training during the run or visualize TensorBoard event files afterwards:
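For example, with standard TensorBoard:

```bash
tensorboard --logdir pretraining_tutorial/model --bind_all
```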
Step 4: Port Model to Hugging Face
Convert Checkpoint and Configs
Once you train (and evaluate) your model, you can port it to Hugging Face to generate outputs:
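A sketch using the Model Zoo checkpoint converter. The checkpoint file name below is a placeholder, and the `--src-fmt` value depends on your software release (`cs-auto`, where supported, auto-detects the source format; otherwise use your specific release tag):

```bash
python $CONVERSION_DIR/convert_checkpoint.py convert \
    pretraining_tutorial/model/checkpoint_<final_step>.mdl \
    --model llama \
    --src-fmt cs-auto \
    --tgt-fmt hf \
    --config pretraining_tutorial/train_config.yaml \
    --output-dir pretraining_tutorial/to_hf
```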
This will create both Hugging Face config files and a converted checkpoint under `pretraining_tutorial/to_hf`.
Validate Checkpoint and Configs (optional)
You can now generate outputs using Hugging Face:
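A minimal sketch using the standard `transformers` API. It assumes the tokenizer files ended up in the output directory; if not, load `baseten/Meta-Llama-3-tokenizer` from the Hub instead:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "pretraining_tutorial/to_hf"

# Load the converted tokenizer, config, and weights
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

# Generate a short continuation in the style of the training data
inputs = tokenizer("KING HENRY:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```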
As a reminder, you did not train your model for very long in this quickstart. A high-quality model requires a much longer training run and a much larger dataset.
Conclusion
Congratulations! In this tutorial, you followed an end-to-end workflow to pre-train a model on a Cerebras system and learned about essential tools and scripts along the way.
As part of this, you learned how to:
- Set up your environment
- Preprocess a small dataset
- Pre-train and evaluate a model
- Port your model to Hugging Face
What’s Next?
- Learn how to fine-tune your first model.
- Learn more about data preprocessing.
- Learn more about the Cerebras Model Zoo and the different models we support.