This page covers how to configure the Trainer with a Checkpoint object. By the end, you should have a cursory understanding of how to use the Checkpoint class in conjunction with the Trainer class.
Checkpointing is handled through the Checkpoint core callback. It lets you control the cadence at which checkpoints are saved, the naming convention for saved checkpoints, and various other useful behaviors. For details on all available options, see the Checkpoint class documentation.
An example of a checkpoint configuration is shown here:
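The following is a minimal sketch, assuming the checkpoint options live under trainer.init in the Trainer's YAML schema; save_initial_checkpoint is an assumed option name:

```yaml
trainer:
  init:
    model_dir: ./model_dir
    checkpoint:
      # Save a checkpoint every 100 training steps.
      steps: 100
      # Also save a checkpoint at step 0, before training begins
      # (assumed option name).
      save_initial_checkpoint: True
```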
The autoload_last_checkpoint option can be used to automatically load the most recent checkpoint from model_dir. For illustration, suppose model_dir contains the following checkpoints:
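```
checkpoint_0.mdl
checkpoint_10000.mdl
checkpoint_20000.mdl
```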
By setting autoload_last_checkpoint as in the example below, the run will automatically load from the checkpoint with the largest step value, in this case "checkpoint_20000.mdl".
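A minimal sketch under the same assumed trainer.init.checkpoint layout:

```yaml
trainer:
  init:
    model_dir: ./model_dir
    checkpoint:
      steps: 100
      # Scan model_dir and resume from the checkpoint with the
      # highest step value ("checkpoint_20000.mdl" in this example).
      autoload_last_checkpoint: True
```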
The disable_strict_checkpoint_loading option can be used to loosen the validation performed when loading a checkpoint. If True, the model will not raise an error if the checkpoint contains keys that are not present in the model.
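For example, under the same assumed layout:

```yaml
trainer:
  init:
    checkpoint:
      steps: 100
      # Tolerate extra keys in the checkpoint that the model
      # does not define, instead of raising an error.
      disable_strict_checkpoint_loading: True
```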
Fine-grained control over which states are saved into a checkpoint is available through the SaveCheckpointState callback, which allows us to periodically save an alternative, stripped-down checkpoint that contains only the "model" state.
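The sketch below assumes the callback is listed under trainer.init.callbacks and that k and checkpoint_states are its parameter names:

```yaml
trainer:
  init:
    checkpoint:
      steps: 100
    callbacks:
      # Every 5th checkpoint (i.e. every 500 steps here), also save
      # an alternative checkpoint containing only the model state.
      - SaveCheckpointState:
          k: 5
          checkpoint_states: "model"
```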
Note that k in SaveCheckpointState refers to taking an alternative checkpoint every k checkpoint steps, not every k training steps.

Fine-grained control over which states are loaded from a checkpoint is available through the LoadCheckpointStates callback, which allows us to load only the "model" state from any checkpoint, e.g. to fine-tune from pretrained weights without restoring optimizer state.
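A sketch assuming load_checkpoint_states is the callback's parameter name and that the checkpoint path is passed via ckpt_path at fit time:

```yaml
trainer:
  init:
    callbacks:
      # Restore only the model weights; any optimizer or other
      # states present in the checkpoint are ignored.
      - LoadCheckpointStates:
          load_checkpoint_states: "model"
  fit:
    # Path to the checkpoint to load from (illustrative).
    ckpt_path: ./model_dir/checkpoint_20000.mdl
```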
Checkpoints can quickly consume a large amount of storage space. To manage this, use the KeepNCheckpoints callback, which allows us to:

- Constrain the amount of storage space checkpoints take up while still retaining recent restart points in case a run is interrupted.
- Keep long-term checkpoints at a larger cadence for validation purposes, since checkpoints generated by SaveCheckpointState are ignored by KeepNCheckpoints (see Selective Checkpoint State Saving for more details).
In the example below, only the 5 most recent checkpoints will be retained.
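A minimal sketch, assuming n is the callback's parameter name:

```yaml
trainer:
  init:
    checkpoint:
      steps: 100
    callbacks:
      # Retain only the 5 most recent checkpoints saved by the
      # Checkpoint callback; older ones are deleted.
      - KeepNCheckpoints:
          n: 5
```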
To learn more, check out the guides on configuring a Trainer instance using a YAML configuration file, on using the Trainer in some core workflows, and on extending the capabilities of the Trainer class.