
# Trainer Essentials

> Learn how to use the Model Zoo Trainer to simplify large-scale model training on the Cerebras Wafer-Scale Cluster.

The Model Zoo Trainer helps streamline the process of training large AI models by taking care of complex setup steps, like manual parallelism. It provides a fast, scalable way to train and validate models.

While you don’t need to use the Trainer to run the reference models included in the Model Zoo, it’s recommended—especially if you want to take advantage of Cerebras hardware performance.

This page introduces the [`Trainer`](https://training-api.cerebras.ai/en/latest/wsc/Model-zoo/api/index.html "cerebras.modelzoo.Trainer") class and gives you a basic understanding of how to use it.

## Prerequisites

Please ensure that you have installed the Cerebras Model Zoo package by going through the installation guide.

Optionally, you can also read through the [basic Cerebras PyTorch guide](../cs-torch/writing-a-custom-training-loop) first to understand the API that underpins the `Trainer` class.

## Basic Usage

The [`Trainer`](https://training-api.cerebras.ai/en/latest/wsc/Model-zoo/api/index.html "cerebras.modelzoo.Trainer") class can be imported and used as follows:

```python theme={null}
import torch

import cerebras.pytorch as cstorch
from cerebras.modelzoo import Trainer

# Any torch.nn.Module
model: torch.nn.Module = torch.nn.Linear(10, 10)
# Any Cerebras compliant optimizer
optimizer: cstorch.optim.Optimizer = cstorch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9
)

trainer = Trainer(
    device="CSX",  # The device to run on
    model_dir="./model_dir",  # The directory at which to store artifacts
    model=model,
    optimizer=optimizer,
)
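# train_dataloader and val_dataloader are assumed to be
# cerebras.pytorch.utils.data.DataLoader instances created elsewhere.
# Train the model over a single epoch of the train dataloader
# and then run validation over a single epoch of the val dataloader
trainer.fit(train_dataloader, val_dataloader)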
```

As the above example shows, at a minimum the `Trainer` class takes the following arguments:

* `device`: The device to run training/validation on.

* `model_dir`: The directory in which to store model-related artifacts (e.g. model checkpoints).

* `model`: The [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module "(in PyTorch v2.4)") instance to train/validate.

* `optimizer`: Optionally, a [`cerebras.pytorch.optim.Optimizer`](../cs-torch/cerebras-pytorch-api/cerebras-pytorch-optim "cerebras.pytorch.optim.Optimizer") instance used to optimize the model weights during the training phase.

<Info>
  Learn more about these parameters in our [Trainer Configuration guide](../../../rel-2.5.0/model-zoo/trainer-configuration-overview).
</Info>

At a minimum, the call to [`fit`](https://training-api.cerebras.ai/en/latest/wsc/Model-zoo/api/index.html?highlight=fit#cerebras.modelzoo.Trainer.fit "cerebras.modelzoo.Trainer.fit") takes in the following:

* `train_dataloader`: The [`cerebras.pytorch.utils.data.DataLoader`](../cs-torch/cerebras-pytorch-api/cerebras-pytorch-optim#cerebras.pytorch.utils.data.DataLoader "cerebras.pytorch.utils.data.DataLoader") instance to use during training.

* `val_dataloader`: Optionally, a [`cerebras.pytorch.utils.data.DataLoader`](../cs-torch/cerebras-pytorch-api/cerebras-pytorch-optim#cerebras.pytorch.utils.data.DataLoader "cerebras.pytorch.utils.data.DataLoader") instance can be passed in to run validation during and/or at the end of training.

The default behaviour of this minimally configured run is to train the model over a single epoch of the train dataloader and then run validation over a single epoch of the val dataloader.

There you have it! With this small sample of code, you can begin training your very first model using the Cerebras Model Zoo Trainer!

You can pause here to try it out for yourself, or continue reading to learn how to configure the `Trainer` more finely to fit your needs.

## Configuring the Training Loop

As mentioned above, if both a `train_dataloader` and `val_dataloader` are provided to the [`fit`](https://training-api.cerebras.ai/en/latest/wsc/Model-zoo/api/index.html?highlight=fit#cerebras.modelzoo.Trainer.fit "cerebras.modelzoo.Trainer.fit") call, the default behaviour is to run a single epoch of training followed by a single epoch of validation.

This behaviour can be configured by passing in a [`TrainingLoop`](../model-zoo/components/trainer-components/loop) instance to the Trainer as follows:

```python theme={null}
from cerebras.modelzoo.trainer.callbacks import TrainingLoop

trainer = Trainer(
    ...,
    loop=TrainingLoop(
        num_steps=1000,
        eval_steps=100,
        eval_frequency=100,
    ),
)
trainer.fit(train_dataloader, val_dataloader)
```

In the above example:

* `num_steps` is the total number of batches to train for. If `num_steps` exceeds the number of batches available in the train dataloader, the dataloader is automatically repeated until `num_steps` batches have been run.

* `eval_steps` is the number of batches to run each time validation is run. As with training, if `eval_steps` exceeds the number of batches available in the val dataloader, the dataloader is automatically repeated. However, validation is typically not run for more than a single epoch, so it is advised to set `eval_steps` to no more than the length of the validation dataloader. Otherwise, the validation metrics may be incorrect.

* `eval_frequency` is how often validation is run during training. In the above example, validation is run every 100 steps of training. That is, over the 1000 steps of training, validation is run 10 times. Whenever `eval_frequency` is greater than zero, validation is also always run at the end of training, regardless of its exact value.
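
The schedule these three settings produce can be sketched in plain Python. This is only an illustration of the behaviour described above, not the `Trainer`'s actual implementation: `validation_steps` and `repeat_batches` are hypothetical helper names, and treating a non-positive `eval_frequency` as "no validation" is an assumption.

```python
from itertools import cycle, islice
from typing import Iterable, Iterator, List


def validation_steps(num_steps: int, eval_frequency: int) -> List[int]:
    """Return the training steps after which validation would run."""
    if num_steps <= 0 or eval_frequency <= 0:
        # Assumption: a non-positive frequency disables validation entirely.
        return []
    steps = list(range(eval_frequency, num_steps + 1, eval_frequency))
    # Validation also runs at the end of training, even when num_steps
    # is not a multiple of eval_frequency.
    if not steps or steps[-1] != num_steps:
        steps.append(num_steps)
    return steps


def repeat_batches(batches: Iterable, num_steps: int) -> Iterator:
    """Yield exactly num_steps batches, restarting the iterable when exhausted."""
    return islice(cycle(batches), num_steps)


if __name__ == "__main__":
    print(validation_steps(1000, 100))
    print(list(repeat_batches(["a", "b", "c"], 5)))
```

With `num_steps=1000` and `eval_frequency=100`, the first call lists ten steps (100, 200, ..., 1000). The second call shows a three-batch dataloader being repeated to supply five batches.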

## Checkpointing

The `Trainer` can be further configured to save checkpoints at regular intervals by passing in a [`Checkpoint`](../model-zoo/components/trainer-components/checkpointing) instance as follows:

```python theme={null}
from cerebras.modelzoo.trainer.callbacks import Checkpoint

trainer = Trainer(
    ...,
    model_dir="./model_dir",
    checkpoint=Checkpoint(steps=100),
)
trainer.fit(train_dataloader, val_dataloader)
```

In the above example, a checkpoint is saved every 100 steps of training. A checkpoint is also saved at the end of training regardless of whether or not `num_steps` is a multiple of the checkpoint steps.

The checkpoints are saved in the `model_dir` directory that was passed to the `Trainer`.

```
model_dir/
├── checkpoint_100.mdl
├── checkpoint_200.mdl
├── checkpoint_300.mdl
└── ...
```
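
As a rough sketch of which checkpoint files to expect for a given run, the rule above (one checkpoint every `steps` training steps, plus one at the end of training) can be written out in plain Python. The helper names `checkpoint_steps` and `checkpoint_names` are hypothetical and are not part of the Model Zoo API. The file naming simply mirrors the directory listing above.

```python
from typing import List


def checkpoint_steps(num_steps: int, every: int) -> List[int]:
    """Return the training steps at which a checkpoint would be saved."""
    if num_steps <= 0 or every <= 0:
        raise ValueError("num_steps and every must be positive in this sketch")
    steps = list(range(every, num_steps + 1, every))
    # A checkpoint is also saved at the end of training, whether or not
    # num_steps is a multiple of the checkpoint interval.
    if not steps or steps[-1] != num_steps:
        steps.append(num_steps)
    return steps


def checkpoint_names(num_steps: int, every: int) -> List[str]:
    """Return file names in the checkpoint_<step>.mdl style shown above."""
    return [f"checkpoint_{step}.mdl" for step in checkpoint_steps(num_steps, every)]


if __name__ == "__main__":
    print(checkpoint_names(250, 100))
```

For a 250-step run with `Checkpoint(steps=100)`, this sketch yields `checkpoint_100.mdl`, `checkpoint_200.mdl`, and a final `checkpoint_250.mdl`.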

<Info>
  Each checkpoint is meant for resuming training from the same point in the future. As such, it contains the model weights, optimizer state, and any other state that is necessary to resume training. Please see [Selective Checkpoint State Saving](../model-zoo/components/trainer-components/checkpointing) for examples of how to configure what state is saved into the checkpoint.
</Info>

A saved checkpoint can be loaded later by passing the `ckpt_path` argument to `fit`. For example:

```python theme={null}
trainer = Trainer(...)
trainer.fit(
    train_dataloader,
    val_dataloader,
    ckpt_path="/path/to/checkpoint",
)
```

The above code will load the checkpoint at that path before starting training.

<Info>
  If a `ckpt_path` is not provided, but a checkpoint is found inside the `model_dir`, then [`Trainer`](https://training-api.cerebras.ai/en/latest/wsc/Model-zoo/api/index.html "cerebras.modelzoo.Trainer") will automatically load the latest checkpoint found in the `model_dir`.
</Info>
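
To make the fallback concrete, here is a minimal sketch of how a "latest checkpoint" lookup could work. It is not the `Trainer`'s actual logic: `find_latest_checkpoint` and `resolve_checkpoint` are hypothetical helpers, and the sketch assumes that "latest" means the highest step number in a `checkpoint_<step>.mdl` file name.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoints follow the checkpoint_<step>.mdl naming shown earlier.
_CKPT_PATTERN = re.compile(r"^checkpoint_(\d+)\.mdl$")


def find_latest_checkpoint(model_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if absent."""
    directory = Path(model_dir)
    if not directory.is_dir():
        return None
    best_step = -1
    best_path = None
    for path in directory.iterdir():
        match = _CKPT_PATTERN.match(path.name)
        # Compare steps numerically so that step 1000 ranks above step 200.
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = path
    return best_path


def resolve_checkpoint(model_dir: str, ckpt_path: Optional[str] = None) -> Optional[Path]:
    """Prefer an explicit ckpt_path; otherwise fall back to the latest in model_dir."""
    if ckpt_path is not None:
        return Path(ckpt_path)
    return find_latest_checkpoint(model_dir)
```

Comparing step numbers as integers matters here: a plain alphabetical sort would rank `checkpoint_200.mdl` after `checkpoint_1000.mdl`.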

To learn more about how to configure checkpointing behavior using the `Trainer`, see [Checkpointing](../model-zoo/components/trainer-components/checkpointing).

## What’s next?

To learn about how to specify a schedule for learning rates, please see [Optimizer and Scheduler](../model-zoo/components/trainer-components/optimizer-and-scheduler).

To learn about how you can configure a `Trainer` instance using a YAML configuration file, you can check out:

* Trainer YAML Overview

To learn more about how you can use the `Trainer` in some core workflows, you can check out:

* [Pretraining with Upstream Validation](../model-zoo/core-workflows/pretraining-with-upstream-validation)

To learn more about how you can extend the capabilities of the `Trainer` class, you can check out:

* [Defer Weight Initialization](../model-zoo/components/trainer-components/defer-weight-initialization)

* [Numeric Precision](../model-zoo/components/trainer-components/numeric-precision)

* [Train a model with weight sparsity](../model-zoo/tutorials/train-a-model-with-weight-sparsity)

* [Checkpointing](../model-zoo/components/trainer-components/checkpointing)

* [Customizing the Trainer with Callbacks](../model-zoo/components/trainer-components/customizing-the-trainer-with-callbacks)

* [Logging](../model-zoo/components/trainer-components/logging)

* [Performance Flags](../model-zoo/components/trainer-components/performance-flags)

To learn more about what the `Trainer` class outputs during the run, you can check out:

* [Model Directory](../model-zoo/components/trainer-components/model-directory)
