Implementation of the Vision Transformer (ViT) Paper
"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" paper, introduces a groundbreaking approach to image classification by leveraging the power of transformer architectures. Authored by Alexey Dosovitskiy et al., the paper challenges the conventional convolutional neural network (CNN) paradigm that has long dominated the field of computer vision. ViT extends the success of transformers from natural language processing to image analysis, demonstrating remarkable performance on various image recognition tasks. This introduction will provide an overview of the key concepts and contributions of the Vision Transformer paper, setting the stage for a deeper exploration of its innovative methodology and experimental results. This implementation will focus on the base architecure of ViT paper but will enable adjustment on the hyperparameters to the architecture.
- Data Preparations
- Architecture Implementation
- Training
- Visualizing Results
This implementation focuses on the base architecture (ViT-Base), with a hidden size of 768 and 12 transformer encoder layers.

1. Data Preparation
In the Vision Transformer (ViT) paper, the authors propose a novel approach to handling image data by transforming it into a sequence of fixed-size patches before feeding it into a transformer model. This is a departure from traditional convolutional neural networks (CNNs), where the input images are processed using convolutional layers. This is the first step that should be considered before implementing the architecture.
The specific image transformation steps performed before training the Vision Transformer include:
- Patch Extraction: The input image is divided into non-overlapping patches. Each patch is then treated as a token, forming a sequence of image patches.
- Linear Projection: Each patch is linearly projected into a high-dimensional space, typically referred to as the embedding space. This projection allows the model to capture meaningful representations of image content.
- Positional Encoding: To provide the transformer model with information about the spatial relationships between the patches, positional encodings are added. This helps the model understand the sequential order of the patches in the input sequence.
By converting the image into a sequence of patches and using a transformer architecture, the Vision Transformer effectively captures long-range dependencies and relationships within the image. This innovative approach has shown impressive results in image classification tasks, challenging the conventional wisdom that CNNs are the sole architecture suitable for computer vision applications.

After importing the dataset and transforming the images into tensors, the main part of the data preparation is patch extraction, followed by linear projection, and then adding a class token and positional encodings to the patch sequence. The following class transforms an image tensor into the input shape that the transformer needs as its input.
This corresponds to the first equation (Eq. 1) in the paper, z_0 = [x_class; x_p^1 E; x_p^2 E; ... ; x_p^N E] + E_pos. It is arguably the most important part of the architecture; that is not to say the other three equations are less essential, but everything that follows operates on this embedded patch sequence.
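As a reference, here is a minimal PyTorch sketch of such a patch-embedding module. The class and parameter names (PatchEmbedding, img_size, patch_size, embed_dim) are illustrative rather than the exact ones used in this implementation, and the defaults follow the ViT-Base/16 configuration (16x16 patches, hidden size 768).

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turns an image into a sequence of patch embeddings with a class token
    and learnable positional embeddings (Eq. 1 of the ViT paper)."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A Conv2d with kernel_size == stride == patch_size performs the patch
        # extraction and the linear projection in a single operation.
        self.projection = nn.Conv2d(in_channels, embed_dim,
                                    kernel_size=patch_size, stride=patch_size)
        # Learnable [class] token prepended to the patch sequence.
        self.class_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        # Learnable positional embeddings for the class token + all patches.
        self.position_embedding = nn.Parameter(
            torch.randn(1, self.num_patches + 1, embed_dim))

    def forward(self, x):
        batch_size = x.shape[0]
        # (B, C, H, W) -> (B, D, H/P, W/P) -> (B, D, N) -> (B, N, D)
        x = self.projection(x).flatten(2).transpose(1, 2)
        cls = self.class_token.expand(batch_size, -1, -1)   # (B, 1, D)
        x = torch.cat([cls, x], dim=1)                      # (B, N+1, D)
        return x + self.position_embedding
```

For a 224x224 RGB image with 16x16 patches, this produces 196 patch tokens plus one class token, i.e. an output of shape (batch_size, 197, 768).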

2. Architecture Implementation
This part is divided into two main blocks: the multi-head self-attention block and the MLP block.

2.1 Multi-head Self-Attention Block
In the Vision Transformer (ViT) paper, the "Multi-head Self-Attention Block" is a fundamental building block of the transformer architecture used for image recognition. The ViT model stacks multiple such attention blocks to capture rich contextual information from the input sequence, in this case a sequence of image patches. Let's break down the key components of the Multi-head Self-Attention Block based on the ViT paper (a code sketch follows the list below):
- Self-Attention Mechanism:
  - The core of the attention block is the self-attention mechanism, which allows the model to weigh different elements of the input sequence differently based on their relevance to each other.
  - Self-attention calculates attention scores for each element in the sequence relative to every other element, enabling the model to focus on different parts of the input sequence during processing.
- Multi-head Attention:
  - The ViT paper uses multiple attention heads within a single attention block. Each attention head learns a different set of attention weights, capturing diverse relationships in the input.
  - The outputs from these multiple attention heads are concatenated and linearly projected to create a final set of representations.
- Parameterized Linear Projections:
  - The attention block includes linear projections to transform the input sequence into query, key, and value representations. These projections are learned during the training process.
  - Multiple sets of projections are employed, one per attention head, allowing the model to capture various aspects of the input information.
- Normalization and Feedforward Layer:
  - Normalization layers, such as layer normalization, are applied to the concatenated outputs of the attention heads.
  - A feedforward neural network is then applied to further process the information, introducing non-linearity and capturing complex patterns.
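As a reference, here is a minimal PyTorch sketch of an MSA block along these lines, built on nn.MultiheadAttention with a layer norm applied before the attention. The class name and defaults (12 heads, hidden size 768, matching ViT-Base) are illustrative, not the exact ones used in this implementation.

```python
import torch.nn as nn

class MultiheadSelfAttentionBlock(nn.Module):
    """Layer norm followed by multi-head self-attention
    (the MSA part of a ViT encoder layer)."""
    def __init__(self, embed_dim=768, num_heads=12, attn_dropout=0.0):
        super().__init__()
        # ViT applies layer normalization before the attention (pre-norm).
        self.layer_norm = nn.LayerNorm(embed_dim)
        self.attention = nn.MultiheadAttention(embed_dim, num_heads,
                                               dropout=attn_dropout,
                                               batch_first=True)

    def forward(self, x):
        x_norm = self.layer_norm(x)
        # Query, key and value all come from the same (normalized) sequence,
        # which is what makes this *self*-attention.
        attn_output, _ = self.attention(x_norm, x_norm, x_norm,
                                        need_weights=False)
        return attn_output
```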
2.2 Multi-Layer Perceptron Block
In the Vision Transformer (ViT) paper, the "MLP (Multi-Layer Perceptron) Block" is another crucial component of the transformer architecture. The MLP Block is employed after the multi-head self-attention mechanism in each transformer layer, contributing to the model's ability to capture and process hierarchical features from the input sequence of image patches. Let's delve into the key characteristics of the MLP Block based on the ViT paper:
- Position-wise Feedforward Networks:
  - The MLP Block consists of one or more position-wise feedforward networks. These networks operate independently on each position in the sequence, allowing the model to capture local patterns and non-linear relationships within the data.
  - Position-wise feedforward networks typically consist of fully connected layers with a non-linear activation function (commonly ReLU) in between, but in this architecture, according to the paper, the MLP contains two layers with a GELU non-linearity (section 3.1).
- Parameterized Linear Projections:
  - Similar to the multi-head self-attention block, the MLP Block includes linear projections to transform the input features. These projections are learned during the training process.
  - The linear projections are typically followed by activation functions, introducing non-linearity to the model.
- Normalization:
  - Normalization layers, such as layer normalization, are applied to the output of the MLP Block. This helps stabilize the training process and improve the model's generalization.
- Skip Connection and Residual Connection:
  - To facilitate the flow of information through the model, skip connections (also known as residual connections) are employed. These connections allow the output of the MLP Block to be added to its input, aiding gradient flow during training.
Residual Connection: connects the output of an earlier layer to the input of a later layer, skipping the layers in between, so that the intermediate layers only need to learn a residual on top of the identity.
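A minimal PyTorch sketch of such an MLP block, assuming the ViT-Base sizes (hidden size 768, MLP size 3072); the class and parameter names are illustrative:

```python
import torch.nn as nn

class MLPBlock(nn.Module):
    """Layer norm followed by the two-layer feedforward network with a
    GELU non-linearity described in section 3.1 of the paper."""
    def __init__(self, embed_dim=768, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),   # expand to the MLP hidden size
            nn.GELU(),                       # GELU non-linearity (section 3.1)
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, embed_dim),   # project back to the embedding size
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.mlp(self.layer_norm(x))
```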
In summary, the Multi-head Self Attention Block in the ViT paper incorporates multiple attention heads to enable the model to attend to different aspects of the input image patches simultaneously. This parallel processing enhances the model's ability to capture diverse and hierarchical features in the data, contributing to its success in image recognition tasks. The role of the MLP Block is to capture and process local features within the input sequence. While the multi-head self-attention mechanism focuses on capturing global contextual information, the MLP Block helps the model capture and leverage detailed local patterns. The combination of these components contributes to the success of the Vision Transformer in image recognition tasks.
Transformer Encoder Block
The transformer encoder block combines the Multi-head Self-Attention (MSA) block and the Multi-Layer Perceptron (MLP) block into one block, with a residual connection around each.
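A minimal sketch of the encoder block, reusing the MultiheadSelfAttentionBlock and MLPBlock sketches above and wrapping each in a residual (skip) connection, mirroring Eq. 2 and Eq. 3 of the paper:

```python
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One encoder layer: MSA block + MLP block, each wrapped in a
    residual connection."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.msa_block = MultiheadSelfAttentionBlock(embed_dim, num_heads)
        self.mlp_block = MLPBlock(embed_dim, mlp_dim, dropout)

    def forward(self, x):
        x = x + self.msa_block(x)   # residual connection around the attention
        x = x + self.mlp_block(x)   # residual connection around the MLP
        return x
```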
Vision Transformer
After the first three parts of the architecture, it is time to put everything together, ending with the classifier head that maps the image representation to its label. The following class, VisionTransformer, puts everything together.
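A minimal sketch of how such a VisionTransformer class could look, reusing the PatchEmbedding and TransformerEncoderBlock sketches above. The defaults follow ViT-Base; the number of classes is illustrative and depends on the target dataset.

```python
import torch.nn as nn

class VisionTransformer(nn.Module):
    """Full ViT: patch embedding, a stack of encoder blocks, and a
    classification head applied to the [class] token."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3,
                 num_classes=10, embed_dim=768, depth=12, num_heads=12,
                 mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.patch_embedding = PatchEmbedding(img_size, patch_size,
                                              in_channels, embed_dim)
        # Stack of `depth` identical transformer encoder blocks.
        self.encoder = nn.Sequential(*[
            TransformerEncoderBlock(embed_dim, num_heads, mlp_dim, dropout)
            for _ in range(depth)
        ])
        # Classifier head: layer norm + linear layer on the class token.
        self.classifier = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, num_classes),
        )

    def forward(self, x):
        x = self.patch_embedding(x)       # (B, N+1, D)
        x = self.encoder(x)               # (B, N+1, D)
        return self.classifier(x[:, 0])   # classify using the class token
```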
3. Training Vision Transformer
After implementing the architecture, it is time to train the model and see how it performs. Keep in mind that when training on a small dataset, the test accuracy will be low, since the ViT architecture performs well only when trained on large datasets.
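A minimal sketch of a standard supervised training loop for the model above; the optimizer, learning rate, and other hyperparameters here are illustrative and do not reproduce the paper's training schedule.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, dataloader, optimizer, device):
    """Run one epoch of supervised training with cross-entropy loss."""
    model.train()
    loss_fn = nn.CrossEntropyLoss()
    for images, labels in dataloader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(images)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()

# Example setup (hyperparameters are illustrative, not the paper's).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = VisionTransformer(num_classes=10).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.1)
```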

The test accuracy can be improved by using transfer learning: taking the same architecture pretrained on a larger dataset and fine-tuning it on the target data. In the following, we discuss ways to improve the model.
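As one possible route, a pretrained ViT-Base/16 can be loaded from torchvision (assuming a torchvision version that ships vit_b_16 and ViT_B_16_Weights), frozen, and given a new classification head for the target dataset:

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load a ViT-Base/16 pretrained on ImageNet and freeze its backbone.
pretrained_vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
for param in pretrained_vit.parameters():
    param.requires_grad = False

# Replace the classification head so it matches the target dataset
# (10 classes here is just an example).
pretrained_vit.heads.head = nn.Linear(
    pretrained_vit.heads.head.in_features, 10)
```

Only the new head is then trained on the small dataset, which usually yields a much higher test accuracy than training the full architecture from scratch.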