This project implements the base Vision Transformer (ViT) architecture as proposed in *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale* by Dosovitskiy et al. The ViT model introduces a transformer-based approach to image classification, departing from traditional convolutional methods. By treating images as sequences of patches, ViT leverages self-attention to achieve state-of-the-art performance in computer vision. This implementation stays faithful to the original ViT-Base architecture while exposing key hyperparameters so users can experiment and explore.
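
To make the patch-based idea concrete, here is a minimal sketch of a ViT in PyTorch. It is illustrative only, not this project's actual code: the class name `MiniViT` and its defaults are assumptions, chosen to match the ViT-Base configuration from the paper (16x16 patches, 768-dim embeddings, 12 layers, 12 heads).

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT sketch (illustrative; names and defaults are assumptions)."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # Split the image into non-overlapping patches and linearly project
        # each one; a strided convolution does both in a single op.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

        # Learnable [CLS] token and position embeddings, as in the paper.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)  # classification head

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.patch_embed(x)                  # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D) patch sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend [CLS] token
        x = x + self.pos_embed                   # add position information
        x = self.encoder(x)                      # self-attention over patches
        return self.head(self.norm(x[:, 0]))     # classify from [CLS] token

# Usage: logits for a binary task such as mask / no-mask classification.
logits = MiniViT(num_classes=2)(torch.randn(1, 3, 224, 224))
```

The strided convolution is a common shortcut: it is mathematically equivalent to slicing the image into patches and applying a shared linear projection, which is how the paper describes the patch embedding.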