# DINOv2

> Self-supervised vision model that learns general-purpose visual features without labeled data, excelling in diverse image and pixel-level tasks

## Model Description

DINOv2 is a self-supervised vision transformer model by Meta that learns high-quality image representations without needing labeled data. It builds on the success of DINO by introducing architectural and training enhancements that deliver state-of-the-art performance across various computer vision tasks, including classification.

<Frame caption="DINOv2 data processing pipeline, from Oquab et al 2023.">
  <img src="https://mintcdn.com/cerebras-training/v-8ckzus28Y4flPh/rel-2.5.0/images/model-zoo/dinov2-data-pipeline.png?fit=max&auto=format&n=v-8ckzus28Y4flPh&q=85&s=120a042b2fce0f92d50a3b505a12f0b3" width="1320" height="434" data-path="rel-2.5.0/images/model-zoo/dinov2-data-pipeline.png" />
</Frame>

## Code Structure

The code for this model is located in the [`dino`](https://github.com/Cerebras/modelzoo/tree/rel-2.4.2/src/cerebras/modelzoo/models/vision/dino) directory within ModelZoo. Here’s how it's organized:

* [`configs/`](https://github.com/Cerebras/modelzoo/tree/rel-2.4.2/src/cerebras/modelzoo/models/vision/dino/configs): Contains YAML configuration files.

* [`scripts/`](https://github.com/Cerebras/modelzoo/tree/rel-2.4.2/src/cerebras/modelzoo/models/vision/dino/scripts): Contains scripts for various workflows, including checkpoint conversion and image resizing.

* [`model.py`](https://github.com/Cerebras/modelzoo/blob/rel-2.4.2/src/cerebras/modelzoo/models/vision/dino/model.py): The implementation of the DINOv2 model.

* [`DinoImageDataProcessor.py`](https://github.com/Cerebras/modelzoo/blob/rel-2.4.2/src/cerebras/modelzoo/models/vision/dino/DinoImageDataProcessor.py): Data processor for DINOv2.

## Available Configurations

| Configuration                                                                                                                                                                             | Description                                       |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------- |
| [`params_dinov2_large_224_bs1024.yaml`](https://github.com/Cerebras/modelzoo/blob/rel-2.4.2/src/cerebras/modelzoo/models/vision/dino/configs/params_dinov2_large_224_bs1024.yaml)         | Config for pretraining, batch size 1024           |
| [`params_dinov2_large_eval_linear.yaml`](https://github.com/Cerebras/modelzoo/blob/rel-2.4.2/src/cerebras/modelzoo/models/vision/dino/configs/params_dinov2_large_eval_linear.yaml)       | Config for finetuning with downstream evaluation. |
| [`params_dinov2_large_patch14_img224.yaml`](https://github.com/Cerebras/modelzoo/blob/rel-2.4.2/src/cerebras/modelzoo/models/vision/dino/configs/params_dinov2_large_patch14_img224.yaml) | Reference implementation config of DINOv2.        |

## Model Input Tensor Specifications

The tables below outline the expected input tensor formats for pretraining and fine-tuning. These formats are based on the configurations listed in the [Available Configurations](#available-configurations) section above. If you are using a custom configuration, you can check the tensor specifications by running:

```bash theme={null}
cszoo data_processor benchmark <path/to/config> --num_epochs 1 --steps_per_epoch 10
```

<Tabs>
  <Tab title="Pretraining">
    | **Input Name**   | **Shape**                     | **Data Type**   | **Description**                                                             |
    | ---------------- | ----------------------------- | --------------- | --------------------------------------------------------------------------- |
    | `collated_masks` | (batch\_size, 2, 256)         | `torch.bool`    | Boolean mask indicating which patches are masked during training.           |
    | `global_view`    | (batch\_size, 2, 3, 224, 224) | `torch.float32` | Global image views (2 samples per batch, 3-channel images of size 224x224). |
    | `local_view`     | (batch\_size, 8, 3, 98, 98)   | `torch.float32` | Local image views (8 samples per batch, 3-channel images of size 98x98).    |
  </Tab>

  <Tab title="Finetuning">
    | **Input Name** | **Shape**                  | **Data Type**   | **Description**                                                                                                                                                           |
    | -------------- | -------------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | `images`       | (batch\_size, 3, 224, 224) | `torch.float32` | Preprocessed images. For training, images are augmented (e.g., random resized crop, horizontal flip) and normalized; for evaluation, they are resized and center-cropped. |
    | `labels`       | (batch\_size,)             | `torch.int32`   | Ground truth labels corresponding to each input image.                                                                                                                    |
  </Tab>
</Tabs>
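The `256` in the `collated_masks` shape corresponds to the number of patches in one global view. The sketch below shows the arithmetic, assuming a patch size of 14 (inferred from the `patch14` in the reference config name, not stated in the tables); `patches_per_view` is a hypothetical helper, not part of ModelZoo.

```python
def patches_per_view(image_size, patch_size=14):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError(
            f"image size {image_size} is not divisible by patch size {patch_size}"
        )
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Global views are 224x224: 16 patches per side, matching the mask length of 256.
global_patches = patches_per_view(224)

# Local views are 98x98: 7 patches per side.
local_patches = patches_per_view(98)

print(global_patches, local_patches)  # prints "256 49"
```

If you change the image size in a custom config, the same calculation gives the mask length to expect from the data processor benchmark.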

## About this Implementation

This implementation of DINOv2 uses the `generic_image_encoders` architecture as its backbone. You can find the model architecture details in its [directory](https://github.com/Cerebras/modelzoo/tree/rel-2.4.2/src/cerebras/modelzoo/models/vision/generic_image_encoders).

### Differences Between Our Implementation and Meta's

Meta’s version includes the KoLeo loss. This implementation omits it and includes only `DinoDistillationLoss` and `iBOTPatchLoss`, the latter of which was introduced in DINOv2.
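For orientation, a DINO-style distillation objective can be sketched as a cross-entropy between a centered, sharpened teacher distribution and the student distribution. The following is a minimal pure-Python illustration based on the DINO paper, not the ModelZoo `DinoDistillationLoss` code. The function names and the temperature defaults are assumptions chosen for the example.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def dino_style_loss(student_logits, teacher_logits, center,
                    student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student) for a single pair of views.

    The teacher logits are centered and sharpened with a low temperature.
    In training, no gradient would flow through the teacher branch.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


center = [0.0, 0.0, 0.0]
teacher = [2.0, 0.0, 0.0]

# A student that agrees with the teacher gets a much lower loss
# than one that puts its mass on a different prototype.
agreeing = dino_style_loss([2.0, 0.0, 0.0], teacher, center)
disagreeing = dino_style_loss([0.0, 2.0, 0.0], teacher, center)
```

The patch-level iBOT objective applies the same kind of cross-entropy to masked patch tokens rather than to the class token.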

Pretrained models from Meta and Hugging Face only include the backbone, meaning they cannot be used for continuous pretraining and are limited to downstream tasks. In contrast, our implementation provides everything needed for continuous pretraining.

## Workflow

In this workflow, we demonstrate how to get started with DINOv2, including pretraining, continuous pretraining, and finetuning.

<Note>
  This workflow utilizes the ModelZoo CLI. For a list of all commands, please visit our [CLI page](https://training-docs.cerebras.ai/model-zoo/cli-overview).
</Note>

<Steps>
  <Step title="Prerequisites and Setup">
    Before getting started, ensure that you've gone through our [setup and installation guide](https://training-docs.cerebras.ai/getting-started/setup-and-installation).

    Next, create a dedicated folder for assets (configs, data) and generated files (processed data files, checkpoints, logs, etc.):

    ```bash theme={null}
    mkdir dinov2
    ```

    <Tabs>
      <Tab title="Pretraining">
        Copy the sample model config for pretraining into your folder.

        ```bash theme={null}
        cszoo config pull dinov2_large_224_bs1024 -o dinov2
        ```
      </Tab>

      <Tab title="Continuous Pretraining">
        Copy the sample model config for pretraining into your folder.

        ```bash theme={null}
        cszoo config pull dinov2_large_224_bs1024 -o dinov2
        ```
      </Tab>

      <Tab title="Finetuning">
        Copy the sample model config for finetuning into your folder.

        ```bash theme={null}
        cszoo config pull dinov2_large_eval_linear -o dinov2
        ```
      </Tab>
    </Tabs>
  </Step>

  <Step title="Data Preparation">
    Our implementation of DINOv2 supports all torchvision datasets. In our internal testing, we used ImageNet1K. To get started, set the dataset path to where your torchvision dataset is stored, ensuring it conforms to the torchvision standard. For more information on how to prepare datasets using `torchvision`, please visit our guide [here](https://github.com/Cerebras/monolith/tree/feacbef3c42fc6095c574deac7ed70d6c72e8ad2/src/models/src/cerebras/modelzoo/data/vision/classification/data).

    Once completed, your dataset directory should look as follows:

    ```
     root_directory
     │-- meta.bin
     │-- train/
     │   │-- n01440764
     │   │   │-- n01440764_10026.JPEG
     │   │   │-- ...
     │   │-- n01443537
     │   │   │-- ...
     │   │-- ...
     │-- val/
     │   │-- n01440764
     │   │   │-- ILSVRC2012_val_00000946.JPEG
     │   │   │-- ...
     │   │-- n01443537
     │   │   │-- ...
     │   │-- ...
    ```
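
    Before launching a run, it can help to confirm that the directory matches this layout. The following is a minimal sketch using only the Python standard library; `check_imagefolder_layout` is a hypothetical helper, not part of ModelZoo or torchvision, and it only checks that `train/` and `val/` exist and that each class folder contains at least one file.

    ```python
    from pathlib import Path


    def check_imagefolder_layout(root):
        """Return a list of problems found under `root`; an empty list means none were found."""
        root = Path(root)
        problems = []
        for split in ("train", "val"):
            split_dir = root / split
            if not split_dir.is_dir():
                problems.append(f"missing split directory: {split}/")
                continue
            # Each class (e.g. n01440764) is expected to be a subdirectory of the split.
            class_dirs = sorted(d for d in split_dir.iterdir() if d.is_dir())
            if not class_dirs:
                problems.append(f"{split}/ contains no class subdirectories")
            for class_dir in class_dirs:
                if not any(f.is_file() for f in class_dir.iterdir()):
                    problems.append(f"{split}/{class_dir.name} contains no files")
        return problems
    ```

    Call it with your dataset path, for example `check_imagefolder_layout("<path/to/data>")`, and resolve any reported problems before updating the config.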

    <Warning>
      This implementation does not support on-demand downloading, so make sure to download the dataset beforehand.
    </Warning>

    Once your data directory is ready, modify the `root` parameter under `dataset` in the model config to point to the desired dataset location.
  </Step>

  <Step title="Running the Model">
    <Tabs>
      <Tab title="Pretraining">
        Run the pretraining process using the provided configuration.

        <Tabs>
          <Tab title="CLI">
            ```bash theme={null}
            cszoo fit dinov2/params_dinov2_large_224_bs1024.yaml \
              --mgmt_namespace=<namespace>
            ```
          </Tab>

          <Tab title="run.py">
            ```bash theme={null}
            python src/cerebras/modelzoo/models/vision/dino/run.py CSX \
              --mode train \
              --params dinov2/params_dinov2_large_224_bs1024.yaml \
              --mount_dirs <path/to/source> \
              --mgmt_namespace <namespace> \
              --python_paths <path/to/source>
            ```
          </Tab>
        </Tabs>
      </Tab>

      <Tab title="Continuous Pretraining">
        In addition to pretraining from scratch, you can continue training from an existing DINOv2 checkpoint. You can use your own pretrained checkpoint, one of Meta's checkpoints (after converting it to a Cerebras-compatible format), or the checkpoint we provide.

        In this workflow we will use our provided pretrained checkpoint. Download it before getting started.

        ```bash theme={null}
        wget -P dinov2 https://cerebras-public.s3.us-west-2.amazonaws.com/DINOv2/DINOv2Pretraining_ViTL_img224.mdl
        ```

        Once the checkpoint has finished downloading, run the model with the new checkpoint.

        <Tabs>
          <Tab title="CLI">
            ```bash theme={null}
            cszoo fit dinov2/params_dinov2_large_224_bs1024.yaml \
              --checkpoint_path dinov2/DINOv2Pretraining_ViTL_img224.mdl \
              --load_checkpoint model \
              --mgmt_namespace=<namespace>
            ```
          </Tab>

          <Tab title="run.py">
            ```bash theme={null}
            python src/cerebras/modelzoo/models/vision/dino/run.py CSX \
              --checkpoint_path dinov2/DINOv2Pretraining_ViTL_img224.mdl \
              --mode train \
              --params dinov2/params_dinov2_large_224_bs1024.yaml \
              --mount_dirs <path/to/source> \
              --mgmt_namespace <namespace> \
              --python_paths <path/to/source>
            ```
          </Tab>
        </Tabs>
      </Tab>

      <Tab title="Finetuning">
        To begin finetuning, update the `data_dir` parameter in your configuration file. You can find this parameter under `fit` > `train_dataloader` and `val_dataloader`. Set it to the directory containing the data you want to use for fine-tuning.

        You can finetune DINOv2 using your own pretrained checkpoint or the checkpoint we provide.

        To download our pretrained checkpoint:

        ```bash theme={null}
        wget -P dinov2 https://cerebras-public.s3.us-west-2.amazonaws.com/DINOv2/DINOv2Pretraining_ViTL_img224.mdl
        ```

        Next, convert the pre-trained DINOv2 checkpoint into a ViT-compatible classification format. Since DINOv2 is a self-supervised model, it does not include a classification head by default. The conversion process extracts the ViT backbone and attaches the required classification head.

        To perform this conversion, run the `convert_dinov2_to_vit.py` script as follows:

        ```bash theme={null}
        python convert_dinov2_to_vit.py \
          --input_config dinov2/params_dinov2_large_eval_linear.yaml \
          --output_config dinov2/finetuning_params_vit_classification.yaml \
          --dataset_path <path/to/data> \
          --input_ckpt dinov2/DINOv2Pretraining_ViTL_img224.mdl \
          --output_ckpt dinov2/finetuning_ViTClassification_DINOv2_ViTL_img224.mdl
        ```

        Once the conversion script has finished running, open the generated output configuration, `finetuning_params_vit_classification.yaml`, and set the `backend` to CSX in the `init` section as follows:

        ```yaml theme={null}
        trainer:
          init:
            backend:
              backend_type: CSX
              cluster_config:
                num_csx: 1
        ```

        <Warning>
          The classification head is randomly initialized after the conversion, so expect low classification accuracy at the beginning of the training run.
        </Warning>

        ### Running Finetuning

        <Tabs>
          <Tab title="CLI">
            Once the conversion script has finished running, use the output configuration and checkpoint to train the classification head.

            ```bash theme={null}
            cszoo fit dinov2/finetuning_params_vit_classification.yaml \
              --checkpoint_path dinov2/finetuning_ViTClassification_DINOv2_ViTL_img224.mdl \
              --load_checkpoint model \
              --mgmt_namespace=<namespace>
            ```
          </Tab>

          <Tab title="run.py">
            Once the conversion script has finished running, use the output configuration and checkpoint to train the classification head.

            ```bash theme={null}
            # --mount_dirs and --python_paths are optional.
            python src/cerebras/modelzoo/models/vision/dino/run.py CSX \
              --checkpoint_path dinov2/finetuning_ViTClassification_DINOv2_ViTL_img224.mdl \
              --mode train \
              --params dinov2/finetuning_params_vit_classification.yaml \
              --mount_dirs <path/to/source> \
              --mgmt_namespace <namespace> \
              --python_paths <path/to/source>
            ```
          </Tab>

        </Tabs>

        We also provide a finetuned ViT classification checkpoint that you can download and use:

        ```bash theme={null}
        wget -P dinov2 https://cerebras-public.s3.us-west-2.amazonaws.com/DINOv2/ViTClassification_DINOv2_ViTL_img224.mdl
        ```
      </Tab>
    </Tabs>
  </Step>
</Steps>

## Advanced Use Cases

In addition to the workflows outlined above, we provide a number of scripts for more advanced and experimental use cases.

### Adjusting Image Size

<Warning>
  Our DINOv2 implementation has been tested only with an image size of 224. Other image sizes may lead to unexpected behavior, so treat this as an experimental feature.
</Warning>

You can continue training from an existing DINOv2 checkpoint while adjusting parameters such as image size. For this purpose, we provide the `change_image_size.py` script, which modifies the checkpoint and config.

```bash theme={null}
python change_image_size.py \
  --input_config dinov2/params_dinov2_large_patch14.yaml \
  --input_ckpt <path_to_old_checkpoint> \
  --output_config dinov2/params_dinov2_continuous_pretraining.yaml \
  --output_ckpt dinov2/dinov2_continuous_pretraining_chkpt.mdl \
  --global_size 518 \
  --local_size 224
```

The script generates new configuration and checkpoint files. Use these files to start your training run as follows:

```bash theme={null}
cszoo fit dinov2/params_dinov2_continuous_pretraining.yaml \
  --checkpoint_path dinov2/dinov2_continuous_pretraining_chkpt.mdl \
  --load_checkpoint model
```

### Configuring Per-Layer Learning Rate Schedulers

As part of our DINOv2 model offering, we provide a [script](https://github.com/Cerebras/monolith/tree/rel-2.4.2/src/models/src/cerebras/modelzoo/models/vision/dino/scripts) for generating a config file that includes learning rate schedulers, following the approach used in Meta's original implementation of the model. Users can modify the learning rate settings to experiment with different schedules, but we recommend adhering to Meta’s specifications for optimal results. For a detailed explanation of Meta's training methods for DINOv2, please refer to the [paper](https://arxiv.org/pdf/2304.07193).

This script should be used with the [reference implementation config](https://github.com/Cerebras/modelzoo/blob/rel-2.4.2/src/cerebras/modelzoo/models/vision/dino/configs/params_dinov2_large_patch14_img224.yaml) that is provided. It will output a configuration similar to [`params_dinov2_large_224_bs1024_cszoov2.yaml`](https://github.com/Cerebras/monolith/blob/rel-2.4.2/src/models/src/cerebras/modelzoo/models/vision/dino/configs/mz/params_dinov2_large_224_bs1024_cszoov2.yaml).

**To run the script:**

<Note>
  To use Meta’s predefined learning rate schedulers without modifications, specify only the `--input_file_name` and `--output_file_name` flags.
</Note>

```bash theme={null}
python src/cerebras/modelzoo/models/vision/dino/scripts/create_dinov2_config_with_schedulers.py \
  --input_file_name <path/to/your_input_config.yaml> \
  --output_file_name <desired_output_config.yaml> \
  --base_lr <base learning rate (e.g., 0.0005)> \
  --batch_size <batch size for training (e.g., 64)> \
  --total_iters <total number of iterations (e.g., 100000)> \
  --lr_decay_rate <learning rate decay factor (e.g., 0.1)> \
  --patch_emb_multiplier <patch embedding multiplier (e.g., 1.0)>
```
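
To build intuition for how flags such as `--base_lr`, `--batch_size`, `--lr_decay_rate`, and `--patch_emb_multiplier` could interact, here is a minimal sketch of layer-wise learning rate decay. It assumes two conventions associated with Meta's public DINOv2 training code: the base rate is scaled by the square root of `batch_size / 1024`, and each layer's rate is multiplied by `lr_decay_rate ** (num_layers + 1 - layer_id)`. The Cerebras script may compute its schedulers differently, and all function names and example values here are hypothetical.

```python
import math


def scaled_base_lr(base_lr, batch_size, reference_batch_size=1024):
    """Square-root scaling of the base learning rate with batch size (assumed convention)."""
    return base_lr * math.sqrt(batch_size / reference_batch_size)


def layer_lr_multiplier(layer_id, num_layers, lr_decay_rate):
    """Layers closer to the input (smaller layer_id) get a smaller multiplier."""
    return lr_decay_rate ** (num_layers + 1 - layer_id)


def layer_learning_rates(base_lr, batch_size, num_layers,
                         lr_decay_rate, patch_emb_multiplier):
    """Peak learning rate per parameter group, before any warmup or decay schedule."""
    lr = scaled_base_lr(base_lr, batch_size)
    rates = {
        # The patch embedding is treated as layer 0 and gets an extra multiplier.
        "patch_embed": lr
        * layer_lr_multiplier(0, num_layers, lr_decay_rate)
        * patch_emb_multiplier
    }
    for block in range(num_layers):
        rates[f"block_{block}"] = lr * layer_lr_multiplier(
            block + 1, num_layers, lr_decay_rate
        )
    # Everything after the last block (layer_id = num_layers + 1) keeps the full rate.
    rates["head"] = lr
    return rates


# Illustrative values only: a 24-block ViT-L with a decay rate of 0.9.
rates = layer_learning_rates(
    base_lr=0.004, batch_size=1024, num_layers=24,
    lr_decay_rate=0.9, patch_emb_multiplier=0.2,
)
```

With these illustrative values, the learning rate grows monotonically from the patch embedding to the head, which is the intended effect of layer-wise decay.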

## References

* [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/pdf/2304.07193)

* [Emerging Properties in Self-Supervised Vision Transformers](https://arxiv.org/pdf/2104.14294)
