To convert model implementations between Model Zoo and other external code repositories, such as Hugging Face, use the Model Zoo CLI’s checkpoint converter tool.

Use the tool to:

  • Convert a pretrained checkpoint from an external repo to Cerebras format for continued training on CS systems.

  • Export a model trained in the Cerebras ecosystem to Hugging Face for inference elsewhere.

  • Update older Cerebras checkpoints to work with newer Model Zoo releases.

Prerequisites

Make sure you’ve completed setup and installation.

Commands

 $ cszoo checkpoint --help

usage: cszoo checkpoint [-h]
                        {info,delete,copy,move,diff,convert,convert-config,list-converters,list}
                        ...

positional arguments:
  {info,delete,copy,move,diff,convert,convert-config,list-converters,list}
    info                Gives a high level summary of a checkpoint.
    delete              Delete a checkpoint
    copy                Copy a checkpoint
    move                Move a checkpoint.
    diff                Diff two checkpoints.
    convert             Convert a checkpoint between CS and HuggingFace or across CS releases.
    convert-config      Convert a config between CS and HuggingFace or across CS releases.
    list-converters     List available converters.

Support and Limitations

  • Currently, the convert argument supports conversion between Cerebras Model Zoo and Hugging Face implementations. If you have a separate, custom checkpoint format you would like to convert to, contact Cerebras for assistance.

  • To update checkpoints that are multiple versions behind, run the conversion incrementally through each intermediate release (e.g., 1.9 -> 2.0 -> 2.1 -> 2.2). A list of supported converters can be found on this page, below. A checkpoint from a previous release must first be upgraded to be compatible with the target release before conversion. See Upgrading Checkpoints From Previous Versions for more information.

  • If you plan to convert a Cerebras Model Zoo model to another repository, we strongly recommend running a config conversion first using the convert-config argument before starting training. This helps you verify whether your model can be successfully adapted to another format. Keep in mind that other repositories may not support the same level of flexibility as the Cerebras Model Zoo and may not accommodate certain model modifications. For instance, Hugging Face’s NLP models have fixed positional embedding implementations, which prevent adding ALiBi to a LLaMA model.
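The incremental upgrade path described above can be sketched as follows. This is an illustrative helper, not part of the cszoo CLI; the release list is an example, not an exhaustive list of supported versions.

```python
# Hypothetical illustration of the incremental upgrade path; the
# release numbers are examples only.
RELEASES = ["1.9", "2.0", "2.1", "2.2"]

def upgrade_steps(src: str, tgt: str) -> list[tuple[str, str]]:
    """Return the (src-fmt, tgt-fmt) pairs to pass to successive
    `cszoo checkpoint convert` invocations."""
    i, j = RELEASES.index(src), RELEASES.index(tgt)
    if i >= j:
        raise ValueError("target release must be newer than source")
    return [(RELEASES[k], RELEASES[k + 1]) for k in range(i, j)]

print(upgrade_steps("1.9", "2.2"))
# [('1.9', '2.0'), ('2.0', '2.1'), ('2.1', '2.2')]
```

Each returned pair corresponds to one `cszoo checkpoint convert --src-fmt cs-<a> --tgt-fmt cs-<b>` invocation.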

Models Supported

The following is a list of models supported by the tool:

bert, bert-sequence-classifier, bert-token-classifier,
bert-summarization, bert-q&a, bloom,
bloom-headless, btlm, btlm-headless,
codegen, codegen-headless, code-llama,
code-llama-headless, dpo, dpr,
falcon, falcon-headless, flan-ul2,
gpt2, gpt2-headless, gpt2 with muP,
gptj, gptj-headless, gpt-neox,
gpt-neox-headless, jais, llama,
llama-headless, llamaV2, llamaV2-headless,
llava, mpt, mpt-headless,
mistral, mistral-headless, octocoder,
octocoder-headless, roberta, santacoder,
santacoder-headless, sqlcoder, sqlcoder-headless,
starcoder, starcoder-headless, t5,
transformer, ul2, wizardcoder,
wizardcoder-headless, wizardlm, wizardlm-headless

Usage

  • Before using the converter, run the list-converters argument and read its output notes. The output specifies which model classes are being converted from and to, and highlights any important caveats.

  • Checkpoints that do not require model changes between Cerebras releases still need to be converted to the target release, since checkpoint metadata conversions and config changes are required as well.

  1. List all models/conversions that we support:
cszoo checkpoint list-converters
  2. To convert a config file only, use the following command:
cszoo checkpoint convert-config \
    --model <model name> \
    --src-fmt <format of input config> \
    --tgt-fmt <format of output config> \
    --output-dir <location to save output config> \
    <config file path>
  3. To convert a checkpoint and its corresponding config, use the following command:
cszoo checkpoint convert \
    --model <model name> \
    --src-fmt <format of input checkpoint> \
    --tgt-fmt <format of output checkpoint> \
    --output-dir <location to save output checkpoint> \
    <input checkpoint file path> \
    --config <input config file path>

To learn more about usage and optional parameters for a particular subcommand, pass the -h flag.

For example:

cszoo checkpoint convert -h

--src-fmt and --tgt-fmt can be automatically inferred for CS checkpoints. Use --src-fmt cs-auto to detect the Cerebras version from the checkpoint (works on version 2.1+ checkpoints). Use --tgt-fmt cs-current to specify checkpoint conversion to the current release version.

Examples

Convert an Eleuther AI GPT-J 6B checkpoint with a model card to Cerebras Model Zoo

Eleuther’s final GPT-J checkpoint can be accessed on Hugging Face at EleutherAI/gpt-j-6B. Rather than manually entering the values from the model architecture table into a config file and writing a script to convert their checkpoint, we can auto-generate these with a single command.

First, we need to download the config and checkpoint files from the model card locally:

mkdir opensource_checkpoints
wget -P opensource_checkpoints https://huggingface.co/EleutherAI/gpt-j-6B/raw/main/config.json
wget -P opensource_checkpoints https://huggingface.co/EleutherAI/gpt-j-6B/resolve/main/pytorch_model.bin

Use the appropriate https link when downloading files from Hugging Face model card pages. Use the path that contains …/raw/… for config files. Use the path that contains …/resolve/… for checkpoint files.
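The raw-vs-resolve convention above can be sketched as follows. This is an illustrative helper, not part of the tool; the `.json`-based heuristic is a simplification (in general, small text files live behind `/raw/` and large LFS-tracked binaries behind `/resolve/`).

```python
# Sketch of the Hugging Face URL convention described above.
BASE = "https://huggingface.co"

def hf_file_url(repo: str, filename: str, revision: str = "main") -> str:
    # Simplifying assumption: config files (.json) use /raw/,
    # everything else (large binaries) uses /resolve/.
    kind = "raw" if filename.endswith(".json") else "resolve"
    return f"{BASE}/{repo}/{kind}/{revision}/{filename}"

print(hf_file_url("EleutherAI/gpt-j-6B", "config.json"))
# https://huggingface.co/EleutherAI/gpt-j-6B/raw/main/config.json
print(hf_file_url("EleutherAI/gpt-j-6B", "pytorch_model.bin"))
# https://huggingface.co/EleutherAI/gpt-j-6B/resolve/main/pytorch_model.bin
```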

Hugging Face configs contain the architecture property, which specifies the class with which the checkpoint was generated. According to config.json, the HF checkpoint is from the GPTJForCausalLM class. Using this information, we can use the checkpoint converter tool’s list command to find the appropriate converter. In this case, we want to use the gptj model, with a source format of hf, and a target format of cs-2.0.
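Inspecting the architecture property looks like this. The config contents below are a truncated, hypothetical excerpt; real HF configs carry many more fields, but "architectures" is the one we need.

```python
import json

# Hypothetical minimal config.json contents for illustration.
config_text = '{"architectures": ["GPTJForCausalLM"], "n_embd": 4096}'
config = json.loads(config_text)

# The first entry names the class the checkpoint was saved from.
arch = config["architectures"][0]
print(arch)  # GPTJForCausalLM
```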

Now to convert the config & checkpoint, run the following command:

cszoo checkpoint convert \
    --model gptj \
    --src-fmt hf \
    --tgt-fmt cs-2.0 \
    --output-dir opensource_checkpoints/ \
    opensource_checkpoints/pytorch_model.bin \
    --config opensource_checkpoints/config.json

This produces two files:

  • opensource_checkpoints/pytorch_model_to_cs-2.0.mdl

  • opensource_checkpoints/config_to_cs-2.0.yaml

The output YAML config file contains the auto-generated model parameters from the Eleuther implementation. Before you can train/eval the model on the Cerebras cluster, add the train_input, eval_input, optimizer, and runconfig parameters to the YAML. Examples for these parameters can be found in the configs/ folder for each model within Model Zoo. In this case, we can copy the missing information from modelzoo/models/nlp/gptj/configs/params_gptj_6B.yaml into opensource_checkpoints/config_to_cs-2.0.yaml. Make sure you modify the dataset paths under train_input and eval_input if they are stored elsewhere.
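The merge described above can be sketched with plain dicts; in practice you would load both YAML files with a YAML library. All section contents below are placeholders, not real parameter values.

```python
# Sketch: fill in the sections the converter does not generate, taking
# them from the reference config shipped with Model Zoo.
converted = {"model": {"hidden_size": 4096}}   # from config_to_cs-2.0.yaml
reference = {                                  # from params_gptj_6B.yaml
    "model": {"hidden_size": 4096},
    "train_input": {"data_dir": "/path/to/train/data"},
    "eval_input": {"data_dir": "/path/to/eval/data"},
    "optimizer": {"learning_rate": 1.0e-5},
    "runconfig": {"max_steps": 10000},
}

# Copy only the sections that are missing; keep the converted model params.
for section in ("train_input", "eval_input", "optimizer", "runconfig"):
    converted.setdefault(section, reference[section])

print(sorted(converted))
# ['eval_input', 'model', 'optimizer', 'runconfig', 'train_input']
```

Remember to point the data_dir entries at your actual dataset locations afterwards.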

Convert a Hugging Face model without a model card to Cerebras Model Zoo

Not all pretrained checkpoints on Hugging Face have corresponding model card web pages. You can still download these checkpoints and configs to convert them into a Model Zoo compatible format.

For example, Hugging Face has a model card for BertForMaskedLM accessible through the name bert-base-uncased. However, it doesn’t have a webpage for BertForPreTraining, which we’re interested in.

We can manually get the config and checkpoint for this model as follows:

>>> from transformers import BertForPreTraining
>>> model = BertForPreTraining.from_pretrained("bert-base-uncased")
>>> model.save_pretrained("bert_checkpoint")

This saves two files: bert_checkpoint/config.json and bert_checkpoint/pytorch_model.bin

Now that you have downloaded the required files, you can convert the checkpoints. Use the --model bert flag since the Hugging Face checkpoint is from the BertForPreTraining class. If you want to use another checkpoint from a different variant (such as a finetuning model), see the other bert- model converters.

The final conversion command is:

cszoo checkpoint convert \
    --model bert \
    --src-fmt hf \
    --tgt-fmt cs-2.0 \
    bert_checkpoint/pytorch_model.bin \
    --config bert_checkpoint/config.json

Convert a Cerebras Model Zoo GPT-2 checkpoint to Hugging Face

Suppose you just finished training GPT-2 on CS and want to run the model within the Hugging Face ecosystem. In this example, the configuration file is saved at model_dir/train/params_train.yaml and the checkpoint (corresponding to step 10k) is at model_dir/checkpoint_10000.mdl.

To convert the checkpoint to Hugging Face format, run the following command:

cszoo checkpoint convert \
    --model gpt2 \
    --src-fmt cs-2.0 \
    --tgt-fmt hf \
    model_dir/checkpoint_10000.mdl \
    --config model_dir/train/params_train.yaml

Since the --output-dir flag is omitted, the two output files are saved to the same directories as the original files:

  • model_dir/train/params_train_to_hf.json

  • model_dir/checkpoint_10000_to_hf.bin
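The output naming convention shown above can be sketched as follows. This is an illustrative helper (not the tool's actual code): the stem gains a "_to_<tgt-fmt>" suffix and the extension switches to the target format's native one (.bin/.json for hf, .mdl/.yaml for cs).

```python
from pathlib import Path

# Sketch of the converter's output naming convention.
def converted_name(path: str, tgt_fmt: str, new_suffix: str) -> str:
    p = Path(path)
    return str(p.with_name(f"{p.stem}_to_{tgt_fmt}{new_suffix}"))

print(converted_name("model_dir/checkpoint_10000.mdl", "hf", ".bin"))
# model_dir/checkpoint_10000_to_hf.bin
print(converted_name("model_dir/train/params_train.yaml", "hf", ".json"))
# model_dir/train/params_train_to_hf.json
```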

YAML and Model Config Updates

As our Model Zoo implementations evolve over time, the changes may sometimes break out-of-the-box compatibility when moving to a new release. To ensure that you can continue using your old checkpoints, we offer converters that allow you to “upgrade” configs and checkpoints when necessary. The section below covers conversions that are required when moving to a particular release. If a converter doesn’t exist, no explicit conversion is necessary.

Release 2.2.0

In order to continue using your checkpoints & configs from release 2.1 in release 2.2.0+, you’ll need to upgrade them. See the example from the Release 2.1.0 section for additional details.

Release 2.1.0

We made many updates to our model implementations and runner API. In order to continue using your checkpoints & configs from release 2.0 in release 2.1.0, upgrade them using the following command:

cszoo checkpoint convert \
    --model <model type> \
    --src-fmt cs-2.0 \
    --tgt-fmt cs-2.1 \
    --config <config file path> \
    <checkpoint path>

In the command above, --model should be the name of the model that you were training (for example gpt2).

Release 2.0.2

Upgrading Checkpoints From Previous Versions

For pre-2.0 checkpoints, dataloader state files need to be converted to the dataloader checkpoint format used by the new map and iterable dataloaders introduced in Model Zoo release 2.0. This allows training jobs moving from a pre-2.0 release to release 2.0+ to restart their dataloaders deterministically.

The dataloader state conversion will automatically be done during checkpoint conversion if this is set in the config:

cerebras:
  save_iter_state_path: <path-to-directory-containing-dataloader-state-files>

save_iter_state_path is the path to the directory containing data step file data_iter_checkpoint_state_file_global and worker checkpoint files of the format data_iter_state_file_worker_*_step_*.txt.
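The worker file naming scheme above can be parsed as follows. This is an illustrative sketch, not part of the tool; the file name is an example.

```python
import re

# Parse worker id and step out of dataloader state file names of the
# form data_iter_state_file_worker_<id>_step_<step>.txt.
pattern = re.compile(r"data_iter_state_file_worker_(\d+)_step_(\d+)\.txt")

name = "data_iter_state_file_worker_3_step_1500.txt"  # example name
m = pattern.fullmatch(name)
assert m is not None
worker, step = int(m.group(1)), int(m.group(2))
print(worker, step)  # 3 1500
```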

Streaming Conversion

In prior releases, conversion required both the input and output checkpoints to be held in memory, so large models needed a prohibitively large amount of memory to convert. Release 2.0.0 introduced streaming conversion, which significantly reduces peak memory usage by performing the conversion incrementally: one shard at a time for pickled checkpoints, and one tensor at a time for Cerebras H5 checkpoints. Streaming conversion is enabled by default; you don't need to change any command line arguments. Thanks to this feature, you can now convert massive checkpoints (e.g., LLaMA 70B) on a small machine (~10 GB of RAM).
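The tensor-at-a-time idea can be illustrated with a toy sketch. The dicts below stand in for H5/pickled checkpoint files, and convert_tensor is a placeholder; the actual converter logic is more involved.

```python
# Toy illustration of streaming conversion: peak memory is roughly one
# tensor, because each tensor is converted and released before the next.
def convert_tensor(name, tensor):
    return tensor  # placeholder; real converters rename/reshape here

def streaming_convert(src: dict, dst: dict) -> None:
    for name in list(src):
        # pop() releases the source tensor as soon as it is written out
        dst[name] = convert_tensor(name, src.pop(name))

src = {"wte.weight": [1, 2], "ln_f.bias": [0.0]}
dst = {}
streaming_convert(src, dst)
print(len(src), len(dst))  # 0 2
```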

Upgrading LLaMA, Transformer, T5

To make it easier to control the type of normalization layer used by Model Zoo models, we have replaced the use_rms_norm and use_biasless_norm flags in the model configs with a single norm_type parameter. To continue using rel 1.9 checkpoints in rel 2.0, you'll need to update the config to reflect this change. You can do this automatically using the config converter tool as follows:

cszoo checkpoint convert-config \
    --model <model type> \
    --src-fmt cs-1.9 \
    --tgt-fmt cs-2.0 \
    <config file path>

In the command above, --model should be either llama, t5, or transformer, depending on which model you’re using (other models use the same configs as in 1.9, and as a result do not need to be upgraded). The config file path should point to the train/params_train.yaml file within your model directory.
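Conceptually, the flag migration the converter performs looks like the sketch below. The exact norm_type strings are assumptions for illustration, not the converter's guaranteed output.

```python
# Hedged sketch of migrating the 1.9 norm flags to the 2.0 norm_type
# parameter; the string values are assumed, not verified.
def migrate_norm_flags(model_cfg: dict) -> dict:
    cfg = dict(model_cfg)
    use_rms = cfg.pop("use_rms_norm", False)
    use_biasless = cfg.pop("use_biasless_norm", False)
    if use_rms:
        cfg["norm_type"] = "rmsnorm"
    elif use_biasless:
        cfg["norm_type"] = "biasless-layernorm"
    else:
        cfg["norm_type"] = "layernorm"
    return cfg

print(migrate_norm_flags({"use_rms_norm": True}))
# {'norm_type': 'rmsnorm'}
```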

Release 1.9.1

All configs and checkpoints from release 1.8.0 can continue to be used in release 1.9.1 without any conversion.

Release 1.8.0

T5 / Vanilla Transformer

As described in the release notes, the behavior of the use_pre_encoder_decoder_layer_norm flag has been flipped. To continue using rel 1.7 checkpoints in rel 1.8, you’ll need to update the config to reflect this change. You can do this automatically using the config converter tool as follows:

cszoo checkpoint convert-config \
    --model <model type> \
    --src-fmt cs-1.7 \
    --tgt-fmt cs-1.8 \
    <config file path>

In the command above, --model should be either t5 or transformer depending on which model you are using. The config file path should point to the train/params_train.yaml file within your model directory.
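Because the flag's behavior was flipped, preserving the old behavior amounts to negating the stored value, roughly as sketched below. This is an illustrative simplification of what the converter does, not its actual code.

```python
# Hedged sketch: the 1.8 meaning of use_pre_encoder_decoder_layer_norm
# is the negation of the 1.7 one, so the stored value is inverted to
# preserve the model's behavior.
def upgrade_17_to_18(model_cfg: dict) -> dict:
    cfg = dict(model_cfg)
    if "use_pre_encoder_decoder_layer_norm" in cfg:
        cfg["use_pre_encoder_decoder_layer_norm"] = (
            not cfg["use_pre_encoder_decoder_layer_norm"]
        )
    return cfg

print(upgrade_17_to_18({"use_pre_encoder_decoder_layer_norm": True}))
# {'use_pre_encoder_decoder_layer_norm': False}
```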

BERT

As described in the release notes, we expanded the BERT model configurations to expose two additional parameters: pooler_nonlinearity and mlm_nonlinearity. Due to a change in the default value of the mlm_nonlinearity parameter, you will need to update the config when using a rel 1.7 checkpoint in rel 1.8. You can do this automatically using the config converter tool as follows:

cszoo checkpoint convert-config \
    --model bert \
    --src-fmt cs-1.7 \
    --tgt-fmt cs-1.8 \
    <config file path>

The config file path should point to the train/params_train.yaml file within your model directory.