An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
This paper introduces the Vision Transformer (ViT), which applies a standard Transformer encoder directly to sequences of image patches for image classification. The core insight is that convolutional inductive biases (locality and translation equivariance) become unnecessary when models are pre-trained at sufficient scale, specifically on datasets containing 14M to 300M images. When transferred to downstream benchmarks, ViT matches or exceeds state-of-the-art CNNs while requiring substantially fewer computational resources to pre-train.
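To make the "sequence of image patches" concrete, here is a minimal pure-Python sketch of how an image becomes a token sequence. The function names are hypothetical, and the 224×224 RGB input is an illustrative assumption rather than a setting taken from this review. In the paper, each flattened patch is then linearly projected and a learnable class token is prepended, so the encoder sees one more token than the patch count computed here.

```python
def patch_sequence_shape(height, width, channels, patch_size):
    """Return (num_patches, flattened_patch_dim) for a non-overlapping patch grid."""
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    num_patches = (height // patch_size) * (width // patch_size)
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim


def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened patches, in row-major order."""
    height, width = len(image), len(image[0])
    channels = len(image[0][0])
    patch_sequence_shape(height, width, channels, patch_size)  # validates divisibility
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: rows first, then pixels in the row, then channels.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


if __name__ == "__main__":
    # Assumed 224x224 RGB input; 16 and 14 are the patch sizes in "ViT-L/16" and "ViT-H/14".
    for p in (16, 14):
        n, d = patch_sequence_shape(224, 224, 3, p)
        print(f"patch size {p}: {n} patches, {d} values per patch")
```

Under these assumptions a patch size of 16 gives 196 patches and a patch size of 14 gives 256. Because self-attention cost grows quadratically with sequence length, smaller patches are the more expensive choice.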
The paper presents a compelling demonstration that pure attention architectures can replace CNNs for image classification, provided one has access to massive pre-training datasets. The evidence shows ViT-H/14 achieves 88.55% on ImageNet, matching Noisy Student while using roughly 5× less pre-training compute. However, the approach is fundamentally data-hungry. Without pre-training on 14M+ images, ViT underperforms ResNets of comparable size, because 'Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data' (Section 1).
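The "roughly 5×" figure can be checked with simple arithmetic. The sketch below assumes the pre-training costs reported in the paper's Table 2, in TPUv3-core-days. The values are transcribed from memory of that table and should be verified against the paper before being relied on. The dictionary and function names are hypothetical.

```python
# Pre-training cost in TPUv3-core-days, as recalled from Table 2 of the paper.
# These values are an assumption of this sketch; verify them against the paper.
TPU_V3_CORE_DAYS = {
    "ViT-H/14 (JFT-300M)": 2500,
    "ViT-L/16 (JFT-300M)": 680,
    "BiT-L (ResNet152x4)": 9900,
    "Noisy Student (EfficientNet-L2)": 12300,
}


def compute_ratio(baseline, model, costs=TPU_V3_CORE_DAYS):
    """Return how many times more pre-training compute `baseline` used than `model`."""
    return costs[baseline] / costs[model]


if __name__ == "__main__":
    for baseline in ("Noisy Student (EfficientNet-L2)", "BiT-L (ResNet152x4)"):
        for model in ("ViT-H/14 (JFT-300M)", "ViT-L/16 (JFT-300M)"):
            ratio = compute_ratio(baseline, model)
            print(f"{baseline} vs {model}: {ratio:.1f}x")
```

With these figures, Noisy Student costs about 4.9 times as much as ViT-H/14, which is consistent with the rounded claim. The ratio is larger for ViT-L/16. These are costs for single training runs and do not account for differences in hardware utilization or tuning effort.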
The scaling analysis is rigorous and the efficiency claims are well-supported. The controlled study (Section 4.4) shows ViT uses 'approximately $2-4\times$ less compute to attain the same performance (average over 5 datasets)' compared to ResNets. The attention analysis indicates that the model compensates for missing inductive biases by learning spatial relationships from data: 'some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model' (Section 4.5). Additionally, the hybrid experiments indicate that convolutional local feature processing helps only at small compute budgets, with the advantage vanishing for larger models.
The primary limitation is an extreme data dependence that puts the best results out of reach for most practitioners. When pre-trained on ImageNet alone (1.3M images), 'ViT-Large models underperform compared to ViT-Base models, despite (moderate) regularization. Only with JFT-300M, do we see the full benefit of larger models' (Section 4.3). The reliance on JFT-300M, a private in-house dataset with 303M images, severely limits reproducibility. Furthermore, with self-supervised pre-training the smaller ViT-B/16 model reaches only 79.9% accuracy on ImageNet, a 'significant improvement ... but still 4% behind supervised pre-training' (Section 4.6), suggesting that ViT's benefits are currently tied to large labeled datasets. The scope is also limited to classification.
The comparisons to the state of the art are fair, but the conclusions depend heavily on pre-training dataset size. The authors compare against Big Transfer (BiT) and Noisy Student, noting that ViT pre-trained on JFT-300M outperforms BiT-L pre-trained on the same data. They appropriately caveat that 'pre-training efficiency may be affected not only by the architecture choice, but also other parameters, such as training schedule, optimizer, weight decay, etc.' (Section 4.2). The scaling study (Section 4.4) is well-controlled, though the Noisy Student comparison sets a supervised method against a semi-supervised one.
While code and pre-trained models are released, full reproduction of the best results is blocked by the reliance on JFT-300M, an internal Google dataset. The paper provides detailed hyperparameters (Adam with $\beta_1=0.9$, batch size 4096, weight decay 0.1, and specific learning rates per variant; see Appendix B.1) and training costs (e.g., ViT-L/16 requires 0.68k TPUv3-core-days). However, the computational requirements remain prohibitive for most academic labs. The masked patch prediction self-supervision setup is described but achieves inferior results, leaving a methodology gap for those without massive labeled data.
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.