Work with Cerebras Checkpoints
PyTorch Checkpoint Format
Our large-model-optimized checkpoint format is based on the standard HDF5 file format. At a high level, when saving a checkpoint, the Cerebras stack takes a PyTorch state dictionary, flattens it, and stores it in an HDF5 file. For example, the following state dict:
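The sketch below is purely illustrative; the key names are hypothetical stand-ins for real model or optimizer entries.

```python
state_dict = {
    "a": {
        "b": 0.1,
        "c": 0.001,
    },
    "d": 0.1,
}
```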
would be flattened and stored in the HDF5 file as follows:
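A sketch of the resulting flattened mapping, assuming nested keys are joined with dots (the actual HDF5 dataset layout may also carry metadata not shown here):

```python
{
    "a.b": 0.1,
    "a.c": 0.001,
    "d": 0.1,
}
```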
A model/optimizer state dict can be saved in the new checkpoint format using the cbtorch.save method. For example:
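A minimal saving sketch. The `cerebras.framework.torch` import path, the stand-in model/optimizer, and the checkpoint path are assumptions for illustration:

```python
import torch
import cerebras.framework.torch as cbtorch  # assumed import path for cbtorch

# Stand-in model and optimizer purely for illustration
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}

# Write the combined state dict in the HDF5-based checkpoint format
cbtorch.save(state_dict, "path/to/checkpoint")  # hypothetical path
```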
A checkpoint saved this way can be loaded using the cbtorch.load method. For example:
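A matching load sketch under the same assumptions (import path, stand-in model/optimizer, and checkpoint path are hypothetical):

```python
import torch
import cerebras.framework.torch as cbtorch  # assumed import path for cbtorch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Read the HDF5-based checkpoint back into a nested state dict
state_dict = cbtorch.load("path/to/checkpoint")  # hypothetical path

model.load_state_dict(state_dict["model"])
optimizer.load_state_dict(state_dict["optimizer"])
```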
If you’re using the run.py scripts provided in ModelZoo, the configuration and setup mentioned earlier are already handled automatically by the built-in runners.
Converting Checkpoint Formats
If cbtorch.load alone is not sufficient for loading the checkpoint into memory, the checkpoint can be converted to the pickle-based format that PyTorch uses as follows:
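A minimal conversion sketch under the same import-path assumption; torch.save then writes the standard pickle-based checkpoint that any PyTorch tooling can read:

```python
import torch
import cerebras.framework.torch as cbtorch  # assumed import path for cbtorch

# Load the HDF5-based checkpoint into memory as a regular state dict,
# then re-save it with torch.save in PyTorch's pickle-based format.
state_dict = cbtorch.load("path/to/checkpoint")  # hypothetical source path
torch.save(state_dict, "path/to/checkpoint.pt")  # standard PyTorch checkpoint
```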