Automatic Job Restart
Learn how to configure automatic job restart in your Trainer config.
Automatic Job Restart enables long-running jobs to automatically recover from intermittent failures, eliminating the need for constant monitoring. This feature tracks job progress using checkpoints while monitoring restart attempts. It automatically flags faulty systems that may have caused failures and restarts from the last saved checkpoint.
Restart Behavior
When Automatic Job Restart is enabled:
- The system resets the failure count when a run makes progress (i.e., captures a new checkpoint).
- For multiphase trainer configurations, each trainer config runs sequentially with autorestart. Subsequent configs only run if the previous one succeeds.
Prerequisites
Before using Automatic Job Restart, ensure that Checkpointing is configured. Restarts will commence from the last-saved checkpoint.
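For example, here is a minimal checkpointing sketch. It assumes the checkpoint interval is set via a checkpoint block with a steps entry under trainer.init; the exact nesting and interval may differ in your Model Zoo release, so treat it as illustrative.

```yaml
trainer:
  init:
    # Assumed layout: save a checkpoint every 1000 steps so that a
    # restart always has a recent checkpoint to resume from.
    checkpoint:
      steps: 1000
```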
Configuration
Add the autorestart parameter to your Trainer config file:
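A minimal sketch of what this might look like is shown below. The autorestart and max_num_restarts keys come from the parameters documented on this page; the trainer/init nesting and the restart count are illustrative assumptions.

```yaml
trainer:
  init:
    # ... model, optimizer, checkpoint, and other settings ...
    autorestart:
      # Required: maximum restarts allowed without a new checkpoint
      max_num_restarts: 5
```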
When you launch a training or eval job, the run automatically restarts after any recoverable failure, provided max_num_restarts is specified in the autorestart config. Learn more about launching jobs here.
Parameters
- max_num_restarts (required): The maximum number of automatic restarts allowed without any progress (new checkpoints) before the run is considered failed.
- min_num_csx (optional): The minimum number of CSX systems with which to perform a restart. Defaults to 1 if not specified. In the event of a faulty system in the cluster, this feature automatically removes that system from the usable pool and, unless a replacement system is found, restarts with the remaining systems. This config establishes a lower bound on the number of systems from the usable pool with which to perform a restart.
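As a sketch under the same assumed nesting as above (the values are illustrative), the optional lower bound sits alongside the required restart budget:

```yaml
trainer:
  init:
    autorestart:
      max_num_restarts: 5   # required: restart budget without new checkpoints
      min_num_csx: 2        # optional: never restart with fewer than 2 CSX systems
```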
If a redundant pool is required to ensure that a job has adequate resources in case of a faulty system, learn how to create one here.
Log Files and Monitoring
Restart logs are stored in /model_dir/<timestamped_dir>_restartable/run.log.
For multiphase trainer configs, each phase will have a separate timestamped restartable directory.
Limitations
- Validation runs will only restart from scratch in the event of failures.
- During training and eval runs, if a failure occurs during validation, restart will resume from the next training loop.
- If a dataloader loads tokens out of bounds of the model’s vocab_size, the system will exhaust max_num_restarts before failing.
- The system cannot automatically restart runs that are deadlocked at the system level.
Non-Recoverable Failures
The following errors will not trigger automatic restarts, regardless of the max_num_restarts value you’ve specified:
- Compilation or lowering failures
- Invalid user configurations
- Failed assertions in Model Zoo