Spot The Mask

This project implements the base Vision Transformer (ViT) architecture proposed in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. ViT introduces a transformer-based approach to image classification, departing from traditional convolutional methods: by treating an image as a sequence of fixed-size patches, it applies self-attention directly to visual data and achieves state-of-the-art performance in computer vision. This implementation stays true to the original base architecture while letting users modify key hyperparameters for experimentation and exploration.
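The core idea of treating an image as a sequence of patches can be sketched in a few lines. The helper below (`image_to_patches` is a hypothetical name, not taken from this repository) shows how a 224x224 RGB image becomes 196 patch tokens of dimension 768 when using the paper's 16x16 patch size; a real ViT would then linearly project each token and prepend a class token before the transformer encoder.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an image of shape (H, W, C) into a sequence of flattened patches.

    Each patch becomes one "token" of length patch_size * patch_size * C,
    mirroring the input pipeline described in the ViT paper.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
             .transpose(0, 2, 1, 3, 4)          # group the two patch-grid axes
             .reshape(-1, patch_size * patch_size * c)
    )
    return patches

# A 224x224 RGB image yields (224/16)^2 = 196 tokens of dimension 16*16*3 = 768.
img = np.zeros((224, 224, 3))
tokens = image_to_patches(img)
print(tokens.shape)  # (196, 768)
```

With patch size as a tunable hyperparameter, the same routine illustrates the trade-off the paper discusses: smaller patches give longer token sequences (finer detail, more compute), larger patches give shorter ones.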