This tutorial introduces you to Cerebras essentials, including data preprocessing, training scripts, configuration files, and checkpoint conversion tools. You’ll learn these concepts by pretraining Meta’s Llama 3 8B on 40,000 lines of Shakespeare.
In this quickstart, you will:
Set up your environment
Preprocess a small dataset
Pretrain and evaluate a model
Convert your model checkpoint for Hugging Face
In this tutorial, you will train your model only briefly on a small dataset. A high-quality model requires a longer training run and a much larger dataset.
We use cp here to copy configs specifically designed for this tutorial. For general use with Model Zoo models, we recommend using cszoo config pull. See the CLI command reference for details.
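As a rough sketch, the copy step amounts to something like the following; the source paths are placeholders for wherever the tutorial configs live in your Model Zoo checkout:

mkdir -p pretraining_tutorial
cp <modelzoo_path>/model_config.yaml pretraining_tutorial/model_config.yaml
cp <modelzoo_path>/train_data_config.yaml pretraining_tutorial/train_data_config.yaml
cp <modelzoo_path>/valid_data_config.yaml pretraining_tutorial/valid_data_config.yaml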
2. Inspect Configs
Before moving on, inspect the configuration files you just copied to confirm that the parameters are set as expected.
To view the model config, run:
cat pretraining_tutorial/model_config.yaml
You should see the following content in your terminal:
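The exact contents depend on your Model Zoo release. As a rough sketch of the structure and of the fields referenced later in this tutorial (the nesting below is assumed for illustration, not copied from the real file):

trainer:
  init:
    model_dir: pretraining_tutorial/model   # where training artifacts are written
    model:
      # Llama 3 8B architecture and training settings
      ...
  fit:
    train_dataloader:
      data_dir: ...   # set to your preprocessed training data (updated below)
    val_dataloader:
      data_dir: ...   # set to your preprocessed validation data (updated below)

3. Preprocess Data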
Use your data configs to preprocess your “train” and “validation” datasets:
cszoo data_preprocess run --config pretraining_tutorial/train_data_config.yaml
cszoo data_preprocess run --config pretraining_tutorial/valid_data_config.yaml
You should then see your preprocessed data in pretraining_tutorial/train_data/ and pretraining_tutorial/valid_data/ (see the output_dir parameter in your data configs).
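The relevant parameter in each data config is output_dir, which controls where the preprocessed files are written; its exact nesting within the file depends on your Model Zoo release:

# In train_data_config.yaml
output_dir: pretraining_tutorial/train_data
# In valid_data_config.yaml
output_dir: pretraining_tutorial/valid_data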
When using the Hugging Face CLI to download a dataset, you may encounter the following error: KeyError: 'tags'
This issue occurs due to an outdated version of the huggingface_hub package. To resolve it, update the package by running:
pip install --upgrade huggingface_hub==0.26.1
An example record from the “train” dataset looks as follows:
{ "text": "First Citizen:\nBefore we proceed any further, hear me "}
In your terminal, you will see a URL like http://172.31.48.239:5000. Copy and paste it into your browser to launch TokenFlow, a tool for interactively visualizing whether loss and attention masks were applied correctly.
4. Train and Evaluate Model
Update train_dataloader.data_dir and val_dataloader.data_dir in your model config to use the absolute paths of your preprocessed data:
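In the YAML, the two updated entries look something like this; the paths are placeholders for your own absolute paths, and any surrounding keys in your file stay unchanged:

train_dataloader:
  data_dir: /absolute/path/to/pretraining_tutorial/train_data
val_dataloader:
  data_dir: /absolute/path/to/pretraining_tutorial/valid_data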
Now you’re ready to launch training. Use the cszoo fit command to submit a job, passing in your updated model config. This command automatically uses the locations and packages defined in your config. See the CLI command reference for more information.
cszoo fit pretraining_tutorial/model_config.yaml --mgmt_namespace <namespace>
You should then see something like this in your terminal:
Once training is complete, you will find several artifacts in the pretraining_tutorial/model folder (see the model_dir parameter in your model config), including your saved checkpoints and training logs.
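5. Convert Checkpoint for Hugging Face

To use the trained model outside the Cerebras stack, convert your final checkpoint and model config to Hugging Face format with the Model Zoo checkpoint conversion tool, writing the results to pretraining_tutorial/to_hf.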
This will create both Hugging Face config files and a converted checkpoint under pretraining_tutorial/to_hf.
6. Validate Checkpoint and Configs
You can now generate outputs using Hugging Face:
pip install 'transformers[torch]'
python
Python 3.8.16 (default, Mar 18 2024, 18:27:40) [GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("baseten/Meta-Llama-3-tokenizer")
>>> config = AutoConfig.from_pretrained("pretraining_tutorial/to_hf/model_config_to_hf.json")
>>> model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="pretraining_tutorial/to_hf/checkpoint_0_to_hf.bin", config=config)
>>> text = "Generative AI is "
>>> pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
>>> generated_text = pipe(text, max_length=50, do_sample=False, no_repeat_ngram_size=2, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.eos_token_id)[0]
>>> print(generated_text['generated_text'])
>>> exit()
As a reminder, in this quickstart you did not train your model for very long. A high-quality model requires a longer training run and a much larger dataset.
Congratulations! In this tutorial, you followed an end-to-end workflow to pretrain a model on a Cerebras system and learn about essential tools and scripts.