This tutorial teaches you about Cerebras essentials like data preprocessing and training scripts, config files, and checkpoint conversion tools. To understand these concepts, you’ll fine-tune Meta’s Llama 3 8B on a small dataset consisting of documents and their summaries.
In this quickstart guide, you will:
Set up your environment
Pre-process a small dataset
Port a trained model from Hugging Face
Fine-tune and evaluate a model
Test your model on downstream tasks
Port your model to Hugging Face
In this tutorial, you will train your model for a short while on a small dataset. A high quality model requires a longer training run, as well as a much larger dataset.
We use cp here to copy configs specifically designed for this tutorial. For general use with Model Zoo models, we recommend using cszoo config pull. See the CLI command reference for details.
2. Inspect Configs
Before moving on, inspect the configuration files you just copied to confirm that the parameters are set as expected.
To view the model config, run:
cat finetuning_tutorial/model_config.yaml
You should see the following content in your terminal:
These parameters specify the full architecture of the Llama 3 8B model and help define a Trainer object for training, validation, and logging semantics.
If you are interested, learn more about model configs here, or dive into how to set up flexible training and evaluation. You can also follow end-to-end tutorials for various use cases.
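If you prefer to inspect specific values programmatically instead of reading the whole file, a short PyYAML sketch like the one below can help. It only assumes the file is valid YAML; the parameter names it searches for (for example, hidden_size) are illustrative and may differ in your config:
# Minimal sketch: print selected parameters from the model config without
# assuming how the YAML is nested. Requires PyYAML (pip install pyyaml).
import yaml

def flatten(node, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf in a nested mapping."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value

with open("finetuning_tutorial/model_config.yaml") as f:
    config = yaml.safe_load(f)

# Illustrative parameter names -- adjust to whatever your config actually uses.
of_interest = {"hidden_size", "num_hidden_layers", "vocab_size", "max_position_embeddings"}
for path, value in flatten(config):
    if path.split(".")[-1] in of_interest:
        print(f"{path}: {value}")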
To view the evaluation config, run:
cat finetuning_tutorial/eeh_config.yaml
You should see the following content in your terminal:
Use your data configs to preprocess your “train” and “validation” datasets:
cszoo data_preprocess run --config finetuning_tutorial/train_data_config.yaml
cszoo data_preprocess run --config finetuning_tutorial/valid_data_config.yaml
You should then see your preprocessed data in finetuning_tutorial/train_data/ and finetuning_tutorial/valid_data/ (see the output_dir parameter in your data configs).
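If you want to sanity-check the output beyond confirming the directories exist, the sketch below opens one shard of the training data. It assumes the preprocessor wrote HDF5 shards into output_dir, which is the Model Zoo pipeline's usual behavior; adjust the glob pattern if your output looks different:
# Minimal sketch: inspect the first preprocessed training shard.
# Assumes HDF5 output; requires h5py (pip install h5py).
import glob
import h5py

shards = sorted(glob.glob("finetuning_tutorial/train_data/*.h5"))
print(f"Found {len(shards)} HDF5 shard(s)")

if shards:
    with h5py.File(shards[0], "r") as f:
        # Print every dataset in the shard along with its shape and dtype.
        def describe(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(name, obj.shape, obj.dtype)
        f.visititems(describe)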
When using the Hugging Face CLI to download a dataset, you may encounter the following error: KeyError: 'tags'
This issue occurs due to an outdated version of the huggingface_hub package. To resolve it, update the package to version 0.26.1 by running:
pip install --upgrade huggingface_hub==0.26.1
An example record from the “train” split looks as follows:
{ "document": "In Wales, councils are responsible for funding..", "summary": "As Chancellor George Osborne announced...", "id": "35821725"}
Once you’ve preprocessed your data, you can visualize the outcome:
In your terminal, you will see a URL like http://172.31.48.239:5000. Copy and paste this into your browser to launch TokenFlow, a tool for interactively visualizing whether loss and attention masks were applied correctly:
4. Download Checkpoint and Configs
Create a dedicated folder for the checkpoint and configuration files you’ll be downloading from Hugging Face.
mkdir finetuning_tutorial/from_hf
You can either fine-tune a model from a local pre-trained checkpoint or (as in this tutorial) from Hugging Face.
First, download the checkpoint and configuration files from Hugging Face using the commands below. For the purposes of this tutorial, we’ll be using McGill’s Llama-3-8B-Web, a fine-tuned Meta-Llama-3-8B-Instruct model.
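If you would rather script the download than run the commands by hand, a minimal alternative using huggingface_hub’s snapshot_download is sketched below. It assumes the repo ID is McGill-NLP/Llama-3-8B-Web (verify this on the Hugging Face Hub), and note that converting the downloaded checkpoint to Cerebras format is a separate step:
# Minimal sketch: download the Hugging Face checkpoint and config files into
# the folder created above. The repo ID is an assumption; verify it on the Hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="McGill-NLP/Llama-3-8B-Web",
    local_dir="finetuning_tutorial/from_hf",
)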
Your finetuning_tutorial/from_hf folder should now contain:
pytorch_model_to_cs-2.3.mdl: The converted model checkpoint.
config_to_cs-2.3.yaml: The converted configuration file.
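Before moving on, you can quickly confirm that the converted files listed above are in place, for example with a small check like this:
# Quick sanity check that the converted checkpoint and config exist.
import os

for name in ("pytorch_model_to_cs-2.3.mdl", "config_to_cs-2.3.yaml"):
    path = os.path.join("finetuning_tutorial/from_hf", name)
    print(f"{path}: {'found' if os.path.exists(path) else 'MISSING'}")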
As a final step, you would normally point ckpt_path in your finetuning_tutorial/model_config.yaml to the location of this converted checkpoint. You do not need to do this in this quickstart, since it has already been done for you.
6. Train and Evaluate Model
Set train_dataloader.data_dir and val_dataloader.data_dir in your model config to the absolute paths of your preprocessed data:
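If you prefer to script this edit rather than open the file by hand, the sketch below rewrites every data_dir it finds under a train_dataloader or val_dataloader section. It makes no assumption about how deeply those sections are nested, but re-serializing the YAML drops comments, so review the result (the same approach works for ckpt_path):
# Minimal sketch: point the dataloaders' data_dir at the preprocessed data.
# Requires PyYAML. Note: re-serializing the YAML drops comments.
import os
import yaml

CONFIG = "finetuning_tutorial/model_config.yaml"
NEW_DIRS = {
    "train_dataloader": os.path.abspath("finetuning_tutorial/train_data"),
    "val_dataloader": os.path.abspath("finetuning_tutorial/valid_data"),
}

def set_data_dirs(node):
    """Recursively find train/val dataloader sections and set their data_dir."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in NEW_DIRS:
                targets = value if isinstance(value, list) else [value]
                for target in targets:
                    if isinstance(target, dict):
                        target["data_dir"] = NEW_DIRS[key]
            else:
                set_data_dirs(value)
    elif isinstance(node, list):
        for item in node:
            set_data_dirs(item)

with open(CONFIG) as f:
    config = yaml.safe_load(f)

set_data_dirs(config)

with open(CONFIG, "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)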
Train your model by passing your updated model config (and, if needed, the locations of important directories and any extra Python packages) to the run script. Click here for more information.
cszoo fit finetuning_tutorial/model_config.yaml
You should then see something like this in your terminal:
Once training is complete, you will find several artifacts, such as checkpoints and logs, in the finetuning_tutorial/model folder (see the model_dir parameter in your model config).
Converting your trained checkpoint back to Hugging Face format (the “Port your model to Hugging Face” step) creates both Hugging Face config files and a converted checkpoint under finetuning_tutorial/to_hf.
9. Validate Checkpoint and Configs (Optional)
You can now generate outputs using Hugging Face:
pip install 'transformers[torch]'
python
Python 3.8.16 (default, Mar 18 2024, 18:27:40)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("baseten/Meta-Llama-3-tokenizer")
>>> config = AutoConfig.from_pretrained("finetuning_tutorial/to_hf/model_config_to_hf.json")
>>> model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="finetuning_tutorial/to_hf/checkpoint_0_to_hf.bin", config=config)
>>> text = "Generative AI is "
>>> pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
>>> generated_text = pipe(text, max_length=50, do_sample=False, no_repeat_ngram_size=2, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.eos_token_id)[0]
>>> print(generated_text['generated_text'])
>>> exit()
As a reminder, in this quickstart, you did not train your model for very long. A high quality model requires a longer training run, as well as a much larger dataset.
Congratulations! In this tutorial, you followed an end-to-end workflow to fine-tune a model on a Cerebras system and learned about essential tools and scripts along the way.