Automatic Job Restart enables long-running jobs to recover automatically from intermittent failures, eliminating the need for constant monitoring. The feature tracks job progress using checkpoints, monitors restart attempts, automatically flags faulty systems that may have caused a failure, and restarts the job from the last saved checkpoint.

Restart Behavior

When Automatic Job Restart is enabled:

  • The system resets the failure count when a run makes progress (i.e., captures a new checkpoint).

  • For multiphase trainer configurations, each trainer config runs sequentially with autorestart; subsequent configs run only if the previous one succeeds (see the sketch below).
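
For illustration, a multiphase setup with autorestart enabled for each phase might look like the following sketch. The list-of-trainers layout and the phase contents shown here are assumptions about your surrounding config; only the autorestart blocks come from the configuration described on this page.

trainer:
- trainer:
    init:
      # Phase 1 configuration (assumed contents)
      autorestart:
        max_num_restarts: 3
- trainer:
    init:
      # Phase 2 configuration; runs only if Phase 1 succeeds
      autorestart:
        max_num_restarts: 3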

Prerequisites

Before using Automatic Job Restart, ensure that Checkpointing is configured. Restarts will commence from the last-saved checkpoint.
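
As a minimal sketch, periodic checkpointing in the Trainer config might look like the following; the checkpoint block and its steps interval shown here are assumptions, so match them to your actual checkpointing setup.

trainer:
  init:
    # Save a checkpoint at a regular step interval so restarts
    # always have a recent point to resume from
    checkpoint:
      steps: 100  # assumed key: checkpoint interval in steps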

Configuration

Add the autorestart parameter to your Trainer config file:

trainer:
  init:
    # Your existing configuration
    autorestart:
      max_num_restarts: 3  # Required: Maximum number of automatic restarts
      min_num_csx: 1       # Optional: Minimum number of CSX systems (defaults to 1)

When you launch a training or eval job with max_num_restarts specified in the autorestart config, the run will automatically restart whenever a recoverable failure is encountered. Learn more about launching jobs here.

Parameters

  • max_num_restarts: The maximum number of automatic restarts allowed without any progress (new checkpoints) before the run is considered failed. This parameter is required.
  • (Optional) min_num_csx: The minimum number of CSX systems with which to perform a restart. Defaults to 1 if not specified. If a faulty system is detected in the cluster, this feature automatically removes it from the usable pool and, unless a replacement system is found, restarts with the remaining systems. This config sets a lower bound on the number of systems from the usable pool with which a restart can be performed (see the sketch below).
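
As a sketch, the following config requests four CSX systems but allows restarts to proceed with as few as two healthy systems. The backend and cluster_config keys used to request systems are assumptions about the surrounding Trainer config; only the autorestart block is the subject of this page.

trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 4        # assumed keys: request four systems for the job
    autorestart:
      max_num_restarts: 3
      min_num_csx: 2      # allow restarts with as few as two healthy systems

With this setting, if up to two systems are flagged as faulty and no replacements are found, the job restarts on the remaining two; if the usable pool drops below min_num_csx, an automatic restart can no longer proceed.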

If a redundant pool is required to ensure that a job has adequate resources in case of a faulty system, learn how to create one here.

Log Files and Monitoring

Restart logs are stored in /model_dir/<timestamped_dir>_restartable/run.log.

For multiphase trainer configs, each phase will have a separate timestamped restartable directory.

Limitations

  • Validation runs can only restart from scratch in the event of a failure.

  • During combined training and eval runs, if a failure occurs during validation, the restart resumes from the next training loop.

  • If a dataloader loads tokens out of bounds of the model’s vocab_size, the system will exhaust max_num_restarts before failing.

  • The system cannot automatically restart runs that are deadlocked at the system level.

Non-Recoverable Failures

The following errors will not trigger automatic restarts, regardless of the max_num_restarts value you’ve specified:

  • Compilation or lowering failures
  • Invalid user configurations
  • Failed assertions in Model Zoo