
Accelerating Vision Transformers With Adaptive Patch Sizes

1Carnegie Mellon University, 2KAIST, 3General Robotics
ICLR 2026 Submission

*Indicates Equal Contribution


Abstract

We propose Adaptive Patch Transformers (APT), a method that accelerates vision transformers (ViTs) by using multiple patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patches to more homogeneous image regions and smaller patches to more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance. It can be applied to a previously fine-tuned ViT and converges in as little as one epoch, enabling training on high-resolution images with minimal compute budgets. It also significantly reduces training and inference time with no performance degradation on high-resolution dense visual tasks, achieving up to 30% faster training and inference on visual question answering, object detection, and semantic segmentation. We will release all code and trained models.
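
As a rough illustration of the core idea, the sketch below partitions an image into a mix of large and small patches using a simple per-region variance score. This is not the released APT code: the function name, the variance heuristic, and the threshold are assumptions made purely for illustration.

import torch

def adaptive_patch_grid(image, small=16, large=32, var_threshold=0.01):
    """Assign a mix of large and small patches to one image.

    A coarse cell keeps a single `large` patch if its pixel variance is low
    (a homogeneous region); otherwise it is subdivided into `small` patches.
    Returns (top, left, size) tuples, i.e. fewer tokens than a uniform
    `small`-patch grid whenever the image contains flat regions.
    """
    _, H, W = image.shape
    patches = []
    for top in range(0, H, large):
        for left in range(0, W, large):
            cell = image[:, top:top + large, left:left + large]
            if cell.var() < var_threshold:          # homogeneous -> one coarse token
                patches.append((top, left, large))
            else:                                   # detailed -> finer tokens
                for dt in range(0, large, small):
                    for dl in range(0, large, small):
                        patches.append((top + dt, left + dl, small))
    return patches

# Toy usage: a flat top region lets its coarse cells stay as single tokens.
img = torch.rand(3, 224, 224)
img[:, :96, :] = 0.5
grid = adaptive_patch_grid(img)
print(len(grid), "tokens vs.", (224 // 16) ** 2, "for a uniform 16x16 grid")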

Method Overview

An overview of our method.

Zero conv

An overview of the zero convolution (zero conv).
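
The sketch below illustrates one plausible reading of the zero conv in the figure above: a zero-initialized convolution whose output is added residually, so the pretrained features pass through unchanged at initialization and the new branch is only learned during fine-tuning. The module and tensor names are illustrative assumptions, not the released code.

import torch
import torch.nn as nn

class ZeroConvBranch(nn.Module):
    """Residual branch gated by a zero-initialized 1x1 convolution."""

    def __init__(self, channels):
        super().__init__()
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)  # branch output starts at exactly zero
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, pretrained_features, new_features):
        # At step 0 this returns the pretrained features untouched; gradients
        # still flow into zero_conv, so the new branch is learned gradually.
        return pretrained_features + self.zero_conv(new_features)

# Identity at initialization: the fine-tuned ViT starts from its original behavior.
branch = ZeroConvBranch(channels=768)
x = torch.randn(1, 768, 14, 14)
y = torch.randn(1, 768, 14, 14)
assert torch.allclose(branch(x, y), x)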

Examples

Results

Table 1

Full Fine-Tuning on ImageNet. APT significantly reduces the wall-clock time to fine-tune a pre-trained backbone on ImageNet with no degradation in accuracy. We use the MAE training recipe for all cases. Note that ViT-B is trained for 2× more epochs than ViT-L.

Table 2

1-epoch Fine-Tuning on ImageNet. APT consistently achieves large speedups while matching, and sometimes exceeding, the original network’s performance after just one additional epoch of fine-tuning. Compared to random masking alone or resizing alone, APT offers the best trade-off between speed and accuracy.

Figure 4

Accuracy vs. Throughput under different compute budgets. Comparison between APT and layer-level merging methods on ViT-L and ViT-H. For a fairer evaluation, we also include their re-implemented Advanced (Adv) versions with FlashAttention, shown with a dashed line. APT consistently outperforms the baselines in both throughput and accuracy across all compute budgets.

Table 3

Transfer to VQA. APT delivers a significant throughput increase while matching or exceeding the baseline’s performance.

Table 4,5

Transfer to Object Detection and Semantic Segmentation. APT scales to high-resolution dense image tasks and supports window attention. It also handles fine-grained pixel-level tasks without compromising visual acuity.

BibTeX

@inproceedings{choudhury2026apt,
  title={Accelerating Vision Transformers With Adaptive Patch Sizes},
  author={Rohan Choudhury and Jung Eun Kim and Jinhyung Park and Eunho Yang and L{\'a}szl{\'o} A. Jeni and Kris M. Kitani},
  booktitle={International Conference on Learning Representations},
  year={2026},
  url={/apt/}
}