An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

cs.CV cs.AI cs.LG · Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby · Oct 22, 2020
AI Review
Plain-language introduction

This paper introduces the Vision Transformer (ViT), which applies a standard Transformer encoder directly to sequences of image patches for image classification. The core insight is that convolutional inductive biases (locality and translation equivariance) are not required when models are pre-trained at sufficient scale, specifically on datasets containing 14M to 300M images. When transferred to downstream benchmarks, ViT matches or exceeds state-of-the-art CNNs while requiring substantially fewer computational resources to pre-train.
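
To make "sequences of image patches" concrete, here is a minimal sketch of the patch-splitting step, assuming NumPy and a channels-last image whose sides are divisible by the patch size. The function name `image_to_patch_sequence` and the 224×224 input are illustrative assumptions, not the authors' code; in the paper each flattened patch is then passed through a learned linear projection before entering the encoder, which this sketch omits.

```python
import numpy as np


def image_to_patch_sequence(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # Expose the patch grid: (grid_h, P, grid_w, P, C).
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    # Bring the two grid axes together: (grid_h, grid_w, P, P, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one vector, one row per patch.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)


# Hypothetical example input: a blank 224x224 RGB image.
image = np.zeros((224, 224, 3), dtype=np.float32)
sequence = image_to_patch_sequence(image, patch_size=16)
print(sequence.shape)  # (196, 768)
```

With 16×16 patches, a 224×224 image becomes 196 tokens of dimension 768, which is the sense in which an image is "worth 16x16 words".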

Critical review
Verdict
Bottom line

The paper presents a compelling demonstration that pure attention architectures can replace CNNs for image classification, provided one has access to massive pre-training datasets. The evidence shows ViT-H/14 achieves 88.55% on ImageNet, matching Noisy Student while using roughly 5× less compute. However, the approach is fundamentally data-hungry: without pre-training on 14M+ images, ViT falls a few percentage points short of comparably sized ResNets, as 'Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data' (Section 1).

“Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.”
paper · Section 1
What holds up

The scaling analysis is rigorous and the efficiency claims are well-supported. The controlled study (Section 4.4) shows ViT uses 'approximately $2-4\times$ less compute to attain the same performance (average over 5 datasets)' compared to ResNets. The attention visualizations suggest that the model compensates for missing inductive biases by learning spatial relationships from data: 'some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model' (Section 4.5). Additionally, the hybrid experiments indicate that convolutional local feature processing helps mainly at smaller model sizes, with the advantage vanishing for larger models.

“approximately $2-4\times$ less compute to attain the same performance (average over 5 datasets)”
paper · Section 4.4
“some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model”
paper · Section 4.5
Main concerns

The primary limitation is extreme data dependence, which makes the method impractical for most practitioners. When trained on ImageNet alone (1.3M images), 'ViT-Large models underperform compared to ViT-Base models, despite (moderate) regularization. Only with JFT-300M, do we see the full benefit of larger models' (Section 4.3). The reliance on JFT-300M, a private in-house dataset with 303M images, severely limits reproducibility. Furthermore, self-supervised pre-training yields only 79.9% accuracy on ImageNet, a 'significant improvement ... but still 4% behind supervised pre-training' (Section 4.6), suggesting ViT's benefits are currently tied to large labeled datasets. The evaluation is also limited to classification.

“ViT-Large models underperform compared to ViT-Base models, despite (moderate) regularization. Only with JFT-300M, do we see the full benefit of larger models”
paper · Section 4.3
“significant improvement ... but still 4% behind supervised pre-training”
paper · Section 4.6
Evidence and comparison

The comparisons to the state of the art are fair, but the conclusions depend heavily on pre-training dataset size. The authors compare against Big Transfer (BiT) and Noisy Student, noting that ViT pre-trained on JFT-300M outperforms BiT-L trained on the same data. They appropriately caveat that 'pre-training efficiency may be affected not only by the architecture choice, but also other parameters, such as training schedule, optimizer, weight decay, etc.' (Section 4.2). The scaling study (Section 4.4) is well-controlled, though the Noisy Student comparison mixes supervised and semi-supervised methods.

“pre-training efficiency may be affected not only by the architecture choice, but also other parameters, such as training schedule, optimizer, weight decay, etc.”
paper · Section 4.2
Reproducibility

While code and pre-trained models are released, full reproduction of the best results is blocked by the reliance on JFT-300M, an internal Google dataset. The paper provides detailed hyperparameters (Adam with $\beta_1=0.9$, batch size 4096, weight decay 0.1, and specific learning rates per variant; Appendix B.1) and training durations (e.g., ViT-L/16 requires 0.68k TPUv3-core-days). However, the computational requirements remain prohibitive for most academic labs. The masked patch prediction self-supervision setup is described but achieves inferior results, leaving a methodology gap for those without massive labeled data.

“All models are trained with a batch size of 4096 and learning rate warmup of 10k steps”
paper · Appendix B.1
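
As a rough illustration of how the reported hyperparameters fit together, the sketch below collects them in a config object and derives a per-step learning rate. The batch size, $\beta_1$, weight decay, and 10k warmup steps are the values cited in this review; `base_lr`, `total_steps`, the linear shape of the warmup and decay, and the names `PretrainConfig` and `learning_rate` are placeholder assumptions for illustration, not the authors' settings or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    batch_size: int = 4096       # cited in the review (Appendix B.1)
    beta1: float = 0.9           # Adam beta_1, cited in the review
    weight_decay: float = 0.1    # cited in the review
    warmup_steps: int = 10_000   # from the quoted Appendix B.1 sentence
    base_lr: float = 1e-3        # ASSUMPTION: placeholder peak learning rate
    total_steps: int = 100_000   # ASSUMPTION: placeholder training length


def learning_rate(step, cfg):
    """Linear warmup to base_lr, then linear decay to zero (assumed shape)."""
    if step < cfg.warmup_steps:
        return cfg.base_lr * step / cfg.warmup_steps
    remaining = max(cfg.total_steps - step, 0)
    return cfg.base_lr * remaining / (cfg.total_steps - cfg.warmup_steps)


cfg = PretrainConfig()
for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, learning_rate(step, cfg))
```

The schedule rises from zero to the peak rate over the warmup window and then falls back to zero by the final step; swapping in the per-variant learning rates from the paper's appendix only requires changing `base_lr`.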
Abstract

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
