Loop

This page covers the two LoopCallback subclasses, TrainingLoop and ValidationLoop, and shows how to use them to configure the Trainer's training and validation loop.

Prerequisites

Make sure to have read through Trainer Overview and Trainer Configuration Overview which provide the basic overview of how to run Model Zoo models. In this document, you will be using the tools and configurations outlined in those pages.

Configure the Loop

The loop argument lets you manage the training and/or validation loop. The Trainer takes a LoopCallback subclass that configures loop options such as the number of steps or epochs to run and how often to run validation. A LoopCallback cannot be instantiated directly; use TrainingLoop or ValidationLoop instead.

Configure for training

The TrainingLoop callback is used to configure the Trainer to run a fit task. Most loop arguments are expressed in steps, where a step is a single batch of training or validation data.

Arguments

num_steps: The total number of steps to train for.
max_steps: The maximum number of global steps to train for. num_steps supersedes this.
num_epochs: The number of epochs to train for. Mutually exclusive with num_steps.
steps_per_epoch: The number of steps to train for in each epoch.
eval_frequency: The frequency at which validation is performed. See LoopCallback for more details on options.
eval_steps: The number of validation steps to perform.
grad_accum_steps: The number of steps to accumulate gradients before performing an optimizer step. Only relevant for "CPU" and "GPU" runs.

If you plan on running any kind of training (calling fit), you must use a TrainingLoop. If you plan on running only validation, you may use a ValidationLoop.
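
The interaction between num_steps, max_steps, and num_epochs described above can be sketched in plain Python. This is a toy model, not the library's code: the function name resolve_train_steps and the start_step parameter are hypothetical, and it assumes that max_steps caps the global step count and that steps_per_epoch must be given explicitly when num_epochs is used.

```python
def resolve_train_steps(num_steps=None, max_steps=None, num_epochs=None,
                        steps_per_epoch=None, start_step=0):
    """Toy resolution of how many training steps to run (not the library's code)."""
    if num_steps is not None and num_epochs is not None:
        raise ValueError("num_steps and num_epochs are mutually exclusive")
    if num_steps is not None:
        # num_steps supersedes max_steps when both are given.
        return num_steps
    if num_epochs is not None:
        if steps_per_epoch is None:
            # Assumption: this sketch requires an explicit epoch length.
            raise ValueError("steps_per_epoch is required with num_epochs in this sketch")
        return num_epochs * steps_per_epoch
    if max_steps is not None:
        # Assumption: max_steps caps the global step count, so a run that
        # resumes at start_step only has the remainder left to train.
        return max(0, max_steps - start_step)
    raise ValueError("one of num_steps, num_epochs, or max_steps must be set")
```

For instance, under these assumptions, two epochs of 500 steps each resolve to the same 1000 steps as num_steps: 1000.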

In the example below, we configure the Trainer to run for 1000 steps and run validation for 50 steps every 100 training steps.

trainer:
  init:
    ...
    loop:
      num_steps: 1000
      eval_steps: 50
      eval_frequency: 100
    ...
  fit:
    ...
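
To make the schedule implied by this configuration concrete, the sketch below lists the training steps after which validation would run. It assumes validation fires after every eval_frequency-th training step and that a missing or zero frequency disables it; the real Trainer may apply additional rules (for example, at the end of training). The helper validation_points is hypothetical.

```python
def validation_points(num_steps, eval_frequency):
    """Return the training steps after which validation would run (toy model)."""
    if not eval_frequency:
        # Assumption: a missing or zero frequency disables validation.
        return []
    return list(range(eval_frequency, num_steps + 1, eval_frequency))


# Values taken from the configuration above.
points = validation_points(num_steps=1000, eval_frequency=100)
# Each validation run processes eval_steps = 50 batches.
total_eval_batches = len(points) * 50
```

Under these assumptions, the run performs ten validation passes, one after every 100 training steps.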

Configure for Validation

The ValidationLoop callback is used to configure the Trainer to run a validate or validate_all task.

Arguments

eval_steps: The number of validation steps to perform.
hook: The base name of the validation hooks to run. Used to extend validation functionality by implementing custom validation callbacks. See EleutherEvalHarnessLoop for an example. Defaults to "validate".

ValidationLoop can only be used if you plan on running only validation tasks (calling validate or validate_all). Otherwise, use TrainingLoop.

In the example below, we configure the Trainer to run validation for 100 steps. We do not need to set any training related options such as num_steps or eval_frequency since we are only running validation.

trainer:
  init:
    ...
    loop:
      eval_steps: 100
    ...
  validate:
    ...

TrainingLoop supports both training and validation because it instantiates a ValidationLoop on initialization.
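
The relationship between the three classes can be illustrated with stand-in classes. The Toy-prefixed names below are hypothetical and do not reproduce the library's implementation; the sketch only assumes that the base class is abstract and that the training loop builds its own validation loop when constructed.

```python
from abc import ABC, abstractmethod


class ToyLoopCallback(ABC):
    """Stand-in base class; abstract, so it cannot be instantiated directly."""

    @abstractmethod
    def describe(self) -> str:
        ...


class ToyValidationLoop(ToyLoopCallback):
    def __init__(self, eval_steps=None, hook="validate"):
        self.eval_steps = eval_steps
        self.hook = hook

    def describe(self) -> str:
        return f"validate for {self.eval_steps} steps via '{self.hook}' hooks"


class ToyTrainingLoop(ToyLoopCallback):
    def __init__(self, num_steps, eval_frequency=None, eval_steps=None):
        self.num_steps = num_steps
        self.eval_frequency = eval_frequency
        # Building the validation loop here is what lets one object
        # serve both training and validation in this sketch.
        self.val_loop = ToyValidationLoop(eval_steps=eval_steps)

    def describe(self) -> str:
        return f"train for {self.num_steps} steps; {self.val_loop.describe()}"


loop = ToyTrainingLoop(num_steps=1000, eval_frequency=100, eval_steps=50)
```

Calling ToyLoopCallback() directly raises a TypeError, mirroring the rule that a LoopCallback cannot be instantiated on its own.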

Every time validation runs, the validation dataloaders are restarted from scratch. Training behaves differently: it resumes from where it left off in the training dataloader.
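
The difference between the two dataloaders can be shown with ordinary Python iterators. This is a minimal sketch rather than the Trainer's actual loop: the function run_toy_loop is hypothetical, and plain sequences stand in for dataloaders.

```python
from itertools import islice


def run_toy_loop(train_data, val_data, num_steps, eval_frequency, eval_steps):
    """Interleave training and validation, logging which batch each step sees."""
    # Created once, so training resumes where it left off after each validation.
    # The training data must hold at least num_steps batches in this sketch.
    train_iter = iter(train_data)
    log = []
    for step in range(1, num_steps + 1):
        log.append(("train", next(train_iter)))
        if step % eval_frequency == 0:
            # A fresh iterator each time: validation restarts from scratch.
            for batch in islice(iter(val_data), eval_steps):
                log.append(("val", batch))
    return log


log = run_toy_loop(
    train_data=range(100),
    val_data=["a", "b", "c"],
    num_steps=4,
    eval_frequency=2,
    eval_steps=2,
)
```

In the resulting log, training batches advance 0, 1, 2, 3 across the validation interruptions, while each validation pass starts again at "a".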

Conclusion

That covers how to configure the Trainer for training and/or validation. You should now understand how to use a LoopCallback subclass to configure training loop parameters such as number of steps and validation frequency.
