Cannot Load Cerebras Checkpoints on GPUs
You trained a model on a Cerebras cluster and would like to reload it for continued GPU training or inference.
However, when the device is set to anything other than the Cerebras Wafer-Scale cluster, the model is loaded through pickle, whereas checkpoints written by the Cerebras cluster are in HDF5 format.
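One quick way to confirm which format a given checkpoint file is in is to look at its leading bytes. The following is a minimal sketch, not part of the Cerebras tooling: the function name sniff_checkpoint_format is hypothetical, and it assumes the HDF5 signature sits at offset 0 (the common case, when the file has no user block) and that pickle-based checkpoints were written either as a raw pickle stream or as the zip archive that newer versions of torch.save produce.

```python
import os
import pickle
import tempfile

HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # 8-byte HDF5 signature, assumed at offset 0
ZIP_MAGIC = b"PK\x03\x04"          # zip archive, the container newer torch.save uses
PICKLE_PROTO = b"\x80"             # PROTO opcode that opens a pickle stream (protocol 2+)


def sniff_checkpoint_format(path):
    """Guess 'hdf5', 'torch-zip', 'pickle', or 'unknown' from the first 8 bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head == HDF5_MAGIC:
        return "hdf5"
    if head.startswith(ZIP_MAGIC):
        return "torch-zip"
    if head.startswith(PICKLE_PROTO):
        return "pickle"
    return "unknown"


if __name__ == "__main__":
    # Demo on a throwaway file: a plain pickle stream stands in for a converted checkpoint.
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.pkl")
        with open(demo, "wb") as f:
            pickle.dump({"step": 200}, f)
        print(sniff_checkpoint_format(demo))
```

A result of "hdf5" on the GPU side suggests the checkpoint still needs to be converted before a pickle-based loader can read it.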
Workaround
With extremely large models, reading, writing, and manipulating checkpoints becomes a bottleneck. For that reason, Cerebras moved to an HDF5-based file format for storing checkpoints. Cerebras provides scripts to convert between checkpoint file formats, as explained in Work with Cerebras checkpoints.
Here is an example:
- Train a GPT2 small model from the Cerebras Model Zoo on a Cerebras cluster for 200 steps.
- Convert the checkpoint collected during training on the Cerebras cluster (CS-2) to .pkl format. This step requires the Cerebras virtual environment created in Setup and installation.
- Copy the checkpoint saved in .pkl format to the GPU setup.
- Check out modelzoo in the GPU environment and install the GPU dependencies for PyTorch, as explained in modelzoo and gpu-requirements.
- Adjust the gpt2_small params to set train_input.batch_size=4.
- Resume training using the converted checkpoint.
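The dotted name train_input.batch_size in the batch-size step reads as the batch_size key inside the train_input section of the params. The sketch below illustrates that override on an in-memory dictionary. It assumes the params load into nested dictionaries; the helper set_dotted and the starting value 128 are hypothetical and only serve the illustration.

```python
import copy


def set_dotted(params, dotted_key, value):
    """Return a copy of params with the nested key 'a.b.c' set to value."""
    updated = copy.deepcopy(params)  # leave the caller's dictionary untouched
    node = updated
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        # Walk down one level, creating the section if it does not exist yet.
        node = node.setdefault(name, {})
    node[leaf] = value
    return updated


# Hypothetical stand-in for the gpt2_small params; the real file has many more keys.
params = {"train_input": {"batch_size": 128}}  # 128 is illustrative only
gpu_params = set_dotted(params, "train_input.batch_size", 4)
print(gpu_params["train_input"]["batch_size"])
```

Returning a modified copy keeps the original params available for comparison, which can help when checking what changed between the Cerebras run and the GPU run.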