S3 Checkpointing
Save, load, and manage checkpoints using S3-compatible storage.
You can save and load model checkpoints using any S3-compatible storage. This lets you:
- Persist training progress remotely
- Share checkpoints across environments
- Manage checkpoints easily with CLI tools
This is ideal for large-scale or distributed training where local storage is limited or ephemeral.
Setup
Before saving checkpoints to S3, you need to configure credentials so your environment can authenticate with your S3-compatible service.
You can do this either by using an AWS-style config file or by setting environment variables; choose whichever fits your setup or deployment environment.
Option 1: AWS Config File (`~/.aws/config`)
Use this method when you want to define multiple named profiles and keep credentials stored locally in a file.
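A minimal sketch of the two files, assuming a named profile and an S3-compatible service with a custom endpoint (the profile name, endpoint, region, and keys are placeholders):

```ini
# ~/.aws/config
[profile my-profile]
region = us-east-1
# endpoint_url inside a profile requires a recent AWS SDK/CLI; the URL is a placeholder.
endpoint_url = https://s3.example.com

# ~/.aws/credentials
[my-profile]
aws_access_key_id = <YOUR_ACCESS_KEY_ID>
aws_secret_access_key = <YOUR_SECRET_ACCESS_KEY>
```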
To use a specific profile:
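For example, from Python. The attribute path below is an assumption inferred from the `csx.storage.s3.profile` setting name and may differ in your release:

```python
import cerebras.pytorch as cstorch

# Hypothetical attribute path: select the named AWS profile used for S3 checkpoint I/O.
cstorch.backends.csx.storage.s3.profile = "my-profile"
```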
Or in YAML:
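One possible placement, assuming your ModelZoo release exposes a `GlobalFlags` callback for `csx.*` settings (both the callback name and the nesting shown here are assumptions):

```yaml
trainer:
  init:
    callbacks:
      - GlobalFlags:
          csx.storage.s3.profile: my-profile
```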
If `csx.storage.s3.profile` is not specified or is set to `None`, the system will default to the `[default]` profile.
Option 2: Environment Variables
If you’re running in containers or ephemeral environments, exporting credentials directly as environment variables may be more convenient.
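For example (values are placeholders; `AWS_ENDPOINT_URL` is honored by recent AWS SDK versions and is only needed for non-AWS, S3-compatible endpoints):

```bash
export AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
export AWS_DEFAULT_REGION=us-east-1
# Only needed for S3-compatible services with a custom endpoint:
export AWS_ENDPOINT_URL=https://s3.example.com
```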
If the system can’t find valid credentials, it will raise an error.
For other methods of configuring endpoints and credentials, see the AWS documentation.
Use the S3 Checkpoint API
Once credentials are in place, you can save and load checkpoints simply by using an S3 path.
This integration works directly with cstorch.save and cstorch.load, so you don’t need to change your workflow — just update the path.
With Cerebras PyTorch
Specify an s3:// path in your calls to save and load model state dictionaries.
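A minimal sketch; the bucket and object names are placeholders, and `model` and `optimizer` are assumed to already exist in your training script:

```python
import cerebras.pytorch as cstorch

# Save model and optimizer state directly to S3-compatible storage.
state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}
cstorch.save(state_dict, "s3://my-bucket/checkpoints/checkpoint_1000.mdl")

# Load it back later; only the path differs from a local checkpoint.
loaded = cstorch.load("s3://my-bucket/checkpoints/checkpoint_1000.mdl")
model.load_state_dict(loaded["model"])
optimizer.load_state_dict(loaded["optimizer"])
```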
Upload Checkpoints to S3
Use the `cszoo checkpoint copy` command to upload checkpoints to S3.
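For example, to copy a local checkpoint into a bucket (the argument order and paths shown are assumptions; check the command's help output for the exact syntax in your release):

```bash
cszoo checkpoint copy model_dir/checkpoint_1000.mdl s3://my-bucket/checkpoints/
```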
Your S3 path must:
- Start with s3://
- Include the bucket name
- Include the object path

If the bucket doesn’t exist and you have permission to create it, the system creates it automatically. If the bucket exists but your credentials do not grant access to it, the system raises a permission error.
With ModelZoo Trainer
You can configure the ModelZoo Trainer to save directly to S3 by setting checkpoint_root.
This makes checkpointing seamless — just point to an S3 path, and the trainer handles the rest.
If you don’t specify `checkpoint_root`, the system saves to the local filesystem by default.
Or configure it in YAML:
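A sketch, assuming `checkpoint_root` sits under the trainer's checkpoint settings (the exact placement and the bucket name are assumptions; check your ModelZoo release):

```yaml
trainer:
  init:
    model_dir: ./model_dir
    checkpoint:
      steps: 1000
      checkpoint_root: s3://my-bucket/checkpoints/
```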
This setup will save checkpoints as:
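Assuming the default `checkpoint_<step>.mdl` naming scheme (the exact pattern may differ in your release):

```
s3://my-bucket/checkpoints/checkpoint_1000.mdl
```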
Load Checkpoints from S3 with ModelZoo Trainer
To resume training from a checkpoint stored in S3, just specify the checkpoint path using ckpt_path.
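A sketch in YAML, assuming `ckpt_path` belongs in the trainer's `fit` section and reusing the placeholder bucket from above:

```yaml
trainer:
  fit:
    ckpt_path: s3://my-bucket/checkpoints/checkpoint_1000.mdl
```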
If your checkpoint was sourced externally and contains tensors with a different dtype than expected (for example, bfloat16 instead of float32), loading it may trigger inefficient type casting that slows performance. This issue does not affect checkpoints saved during a Cerebras training run.