S3 Checkpointing
Save, load, and manage checkpoints using S3-compatible storage.
You can save and load model checkpoints using any S3-compatible storage. This lets you:
- Persist training progress remotely
- Share checkpoints across environments
- Manage checkpoints easily with CLI tools
This is ideal for large-scale or distributed training where local storage is limited or ephemeral.
Setup
Before saving checkpoints to S3, you need to configure credentials so your environment can authenticate with your S3-compatible service.
You can do this either by using an AWS-style config file or by setting environment variables; choose whichever fits your setup or deployment environment.
Option 1: AWS Config File (`~/.aws/config`)
Use this method when you want to define multiple named profiles and keep credentials stored locally in a file.
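A minimal sketch of the two files, assuming a named profile and an S3-compatible service with a custom endpoint (the profile name, endpoint, region, and keys are placeholders):

```ini
# ~/.aws/config
[profile my-profile]
region = us-east-1
# endpoint_url inside a profile requires a recent AWS SDK/CLI; the URL is a placeholder.
endpoint_url = https://s3.example.com

# ~/.aws/credentials
[my-profile]
aws_access_key_id = <YOUR_ACCESS_KEY_ID>
aws_secret_access_key = <YOUR_SECRET_ACCESS_KEY>
```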
To use a specific profile:
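For example, from Python. The attribute path below is an assumption inferred from the `csx.storage.s3.profile` setting name and may differ in your release:

```python
import cerebras.pytorch as cstorch

# Hypothetical attribute path: select the named AWS profile used for S3 checkpoint I/O.
cstorch.backends.csx.storage.s3.profile = "my-profile"
```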
Or in YAML:
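One possible placement, assuming your ModelZoo release exposes a `GlobalFlags` callback for `csx.*` settings (both the callback name and the nesting shown here are assumptions):

```yaml
trainer:
  init:
    callbacks:
      - GlobalFlags:
          csx.storage.s3.profile: my-profile
```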
If `csx.storage.s3.profile` is not specified or is set to `None`, the system will default to the `[default]` profile.
Option 2: Environment Variables
If you’re running in containers or ephemeral environments, exporting credentials directly as environment variables may be more convenient.
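For example (values are placeholders; `AWS_ENDPOINT_URL` is honored by recent AWS SDK versions and is only needed for non-AWS, S3-compatible endpoints):

```bash
export AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
export AWS_DEFAULT_REGION=us-east-1
# Only needed for S3-compatible services with a custom endpoint:
export AWS_ENDPOINT_URL=https://s3.example.com
```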
If the system can’t find valid credentials, it will raise an error.
For other methods of configuring endpoints and credentials, see the AWS documentation.
Use the S3 Checkpoint API
Once credentials are in place, you can save and load checkpoints simply by using an S3 path.
This integration works directly with cstorch.save and cstorch.load, so you don’t need to change your workflow — just update the path.
With Cerebras PyTorch
Specify an s3:// path in your calls to save and load model state dictionaries.
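A minimal sketch; the bucket and object names are placeholders, and `model` and `optimizer` are assumed to already exist in your training script:

```python
import cerebras.pytorch as cstorch

# Save model and optimizer state directly to S3-compatible storage.
state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}
cstorch.save(state_dict, "s3://my-bucket/checkpoints/checkpoint_1000.mdl")

# Load it back later; only the path differs from a local checkpoint.
loaded = cstorch.load("s3://my-bucket/checkpoints/checkpoint_1000.mdl")
model.load_state_dict(loaded["model"])
optimizer.load_state_dict(loaded["optimizer"])
```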
Upload Checkpoints to S3
Use the `cszoo checkpoint copy` command to upload checkpoints to S3.
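For example, to copy a local checkpoint into a bucket (the argument order and paths shown are assumptions; check the command's help output for the exact syntax in your release):

```bash
cszoo checkpoint copy model_dir/checkpoint_1000.mdl s3://my-bucket/checkpoints/
```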
Your S3 path must:
- Start with s3://
- Include the bucket name
- Include the object path

If the bucket doesn’t exist and you have permission to create it, the system creates it automatically. If the bucket exists but your credentials do not grant access to it, the system raises a permission error.
With ModelZoo Trainer
You can configure the ModelZoo Trainer to save directly to S3 by setting checkpoint_root.
This makes checkpointing seamless — just point to an S3 path, and the trainer handles the rest.
If you don’t specify `checkpoint_root`, the system saves to the local filesystem by default.
Or configure it in YAML:
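A sketch, assuming `checkpoint_root` sits under the trainer's checkpoint settings (the exact placement and the bucket name are assumptions; check your ModelZoo release):

```yaml
trainer:
  init:
    model_dir: ./model_dir
    checkpoint:
      steps: 1000
      checkpoint_root: s3://my-bucket/checkpoints/
```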
This setup will save checkpoints as:
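Assuming the default `checkpoint_<step>.mdl` naming scheme (the exact pattern may differ in your release):

```
s3://my-bucket/checkpoints/checkpoint_1000.mdl
```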
Load Checkpoints from S3 with ModelZoo Trainer
To resume training from a checkpoint stored in S3, just specify the checkpoint path using ckpt_path.
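A sketch in YAML, assuming `ckpt_path` belongs in the trainer's `fit` section and reusing the placeholder bucket from above:

```yaml
trainer:
  fit:
    ckpt_path: s3://my-bucket/checkpoints/checkpoint_1000.mdl
```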
If your checkpoint was sourced externally and contains tensors with a different dtype than expected (for example, bfloat16 instead of float32), loading it may trigger inefficient type casting that slows performance. This issue does not affect checkpoints saved during a Cerebras training run.