# ViT
Implementation of Vision Transformers (ViT) for image classification on ImageNet-1K.
## Model Description
The Vision Transformer (ViT) architecture applies transformer-based modeling, originally developed for NLP, to sequences of image patches for visual tasks. Instead of using convolutional layers, ViT treats an image as a sequence of non-overlapping patches, embeds them, and feeds them into a standard transformer encoder.
This implementation supports ViT models of various sizes trained on ImageNet-1K and provides flexible configuration options for patch sizes, model depth, and hidden dimensions. The transformer layers operate over patch embeddings with added positional information, enabling strong performance in image classification tasks when pretrained on large datasets.
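To make the data flow concrete, here is a minimal, self-contained PyTorch sketch of the ViT computation (patchify, embed, add positional information, encode, classify). The module and parameter names below are illustrative defaults in the ViT-Base range, not the classes used in this repository:

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT sketch: patchify -> embed -> transformer encoder -> classify."""
    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12,
                 heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution maps each P x P patch to a vector.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                   # images: (B, 3, H, W)
        x = self.patch_embed(images)             # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                      # standard transformer encoder
        return self.head(x[:, 0])                # classify from the [CLS] token

logits = MiniViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```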
## Code Structure
The code for this model is located in the `vit` directory within ModelZoo. Here’s how it’s organized:
- `configs/`: Contains YAML configuration files for different ViT variants.
- `model.py`: Entry point that initializes and builds the model components used for training and evaluation.
- `ViTModel.py`: Core implementation of the ViT architecture, including patch embedding, transformer encoder blocks, and classification head.
- `ViTClassificationModel.py`: Wraps `ViTModel` for classification tasks, managing preprocessing, logits generation, and loss computation (see the sketch after this list).
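The split between the backbone and the classification wrapper can be pictured roughly as follows. This is a hypothetical sketch based only on the descriptions above; the class names, signatures, and return values of the actual `ViTModel` and `ViTClassificationModel` in this repository may differ:

```python
import torch.nn as nn

class ViTClassificationSketch(nn.Module):
    """Hypothetical wrapper mirroring the ViTModel / ViTClassificationModel split."""
    def __init__(self, backbone: nn.Module, hidden_size: int, num_classes: int = 1000):
        super().__init__()
        self.backbone = backbone                      # plays the role of ViTModel
        self.head = nn.Linear(hidden_size, num_classes)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, images, labels=None):
        features = self.backbone(images)              # (B, num_tokens, hidden_size)
        logits = self.head(features[:, 0])            # classify from the [CLS] token
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return logits, loss
```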
## Available Configurations
| Configuration | Description |
|---|---|
| `params_vit_base_patch_16_imagenet_1k.yaml` | ViT-Base model with 16×16 patch size, trained on ImageNet-1K. |
| `params_vit_huge_patch_16_imagenet_1k.yaml` | ViT-Huge model with 16×16 patch size, trained on ImageNet-1K. |
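For reference, the canonical ViT-Base and ViT-Huge hyperparameters from the original ViT paper are summarized below as plain Python dictionaries. These values are illustrative only; the key names and exact settings in the repository's YAML configuration files may differ:

```python
# Canonical hyperparameters for the two variants (per the original ViT paper).
# Key names are illustrative, not the YAML schema used by these config files.
VIT_VARIANTS = {
    "vit_base_patch_16": {
        "patch_size": 16,
        "hidden_size": 768,
        "num_layers": 12,
        "num_heads": 12,
        "mlp_dim": 3072,
        "num_classes": 1000,   # ImageNet-1K
    },
    "vit_huge_patch_16": {
        "patch_size": 16,
        "hidden_size": 1280,
        "num_layers": 32,
        "num_heads": 16,
        "mlp_dim": 5120,
        "num_classes": 1000,   # ImageNet-1K
    },
}
```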