Automatic Job Restart enables long-running jobs to recover automatically from intermittent failures, eliminating the need for constant monitoring. The feature tracks job progress using checkpoints, monitors restart attempts, automatically flags faulty systems that may have caused a failure, and restarts the job from the last saved checkpoint.

Restart Behavior

When Automatic Job Restart is enabled:

  • The system resets the failure count when a run makes progress (i.e., captures a new checkpoint).

  • For multiphase trainer configurations, each trainer config runs sequentially with autorestart; subsequent configs run only if the previous one succeeds (see the sketch below).
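
For illustration, a multiphase setup with autorestart enabled for each phase might look like the following sketch. The list-of-trainers layout and the phase contents shown here are assumptions about your surrounding config; only the autorestart blocks come from the configuration described on this page.

trainer:
- trainer:
    init:
      # Phase 1 configuration (assumed contents)
      autorestart:
        max_num_restarts: 3
- trainer:
    init:
      # Phase 2 configuration; runs only if Phase 1 succeeds
      autorestart:
        max_num_restarts: 3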

Prerequisites

Before using Automatic Job Restart, ensure that Checkpointing is configured. Restarts will commence from the last-saved checkpoint.
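
As a minimal sketch, periodic checkpointing in the Trainer config might look like the following; the checkpoint block and its steps interval shown here are assumptions, so match them to your actual checkpointing setup.

trainer:
  init:
    # Save a checkpoint at a regular step interval so restarts
    # always have a recent point to resume from
    checkpoint:
      steps: 100  # assumed key: checkpoint interval in steps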

Configuration

Add the autorestart parameter to your Trainer config file:

trainer:
  init:
    # Your existing configuration
    autorestart:
      max_num_restarts: 3  # Required: Maximum number of automatic restarts
      min_num_csx: 1       # Optional: Minimum number of CSX systems (defaults to 1)

When you launch a training or eval job with max_num_restarts specified in the autorestart config, the run will automatically restart whenever a recoverable failure is encountered. Learn more about launching jobs here.

Parameters

  • max_num_restarts: The maximum number of automatic restarts allowed without any progress (new checkpoints) before the run is considered failed. This parameter is required.
  • (Optional) min_num_csx: The minimum number of CSX systems with which to perform a restart. Defaults to 1 if not specified. If a faulty system is detected in the cluster, this feature automatically removes it from the usable pool and, unless a replacement system is found, restarts with the remaining systems. This config sets a lower bound on the number of systems from the usable pool with which a restart can be performed (see the sketch below).
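
As a sketch, the following config requests four CSX systems but allows restarts to proceed with as few as two healthy systems. The backend and cluster_config keys used to request systems are assumptions about the surrounding Trainer config; only the autorestart block is the subject of this page.

trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 4        # assumed keys: request four systems for the job
    autorestart:
      max_num_restarts: 3
      min_num_csx: 2      # allow restarts with as few as two healthy systems

With this setting, if up to two systems are flagged as faulty and no replacements are found, the job restarts on the remaining two; if the usable pool drops below min_num_csx, an automatic restart can no longer proceed.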

If a redundant pool is required to ensure that a job has adequate resources in case of a faulty system, learn how to create one here.

Log Files and Monitoring

Restart logs are stored in /model_dir/<timestamped_dir>_restartable/run.log.

For multiphase trainer configs, each phase will have a separate timestamped restartable directory.

Limitations

  • Validation runs can only restart from scratch in the event of a failure.

  • During combined training and eval runs, if a failure occurs during validation, the restart resumes from the next training loop.

  • If a dataloader loads tokens out of bounds of the model’s vocab_size, the system will exhaust max_num_restarts before failing.

  • The system cannot automatically restart runs that are deadlocked at the system level.

Non-Recoverable Failures

The following errors will not trigger automatic restarts, regardless of the max_num_restarts value you’ve specified:

  • Compilation or lowering failures
  • Invalid user configurations
  • Failed assertions in Model Zoo