
Accelerating Vision Transformers With Adaptive Patch Sizes

Rohan Choudhury, JungEun Kim, Jinhyung Park, Eunho Yang, László A. Jeni, Kris M. Kitani
1Carnegie Mellon University, 2KAIST, 3General Robotics
*Indicates Equal Contribution

Standard ViT patchification compared to APT (ours) run frame-by-frame on a video. We dynamically allocate different patch sizes, significantly reducing the number of input tokens while incurring no loss on downstream tasks.

Abstract

We propose Adaptive Patch Transformers (APT), a method to accelerate vision transformers (ViTs) by using multiple patch sizes within the same image. APT reduces the total number of input tokens by assigning larger patch sizes to more homogeneous image regions and smaller patches to more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance. It can be applied to an already fine-tuned ViT and converges in as little as one epoch, enabling training on high-resolution images with minimal compute budgets. On high-resolution dense visual tasks, it achieves up to 30% faster training and inference on visual question answering, object detection, and semantic segmentation with no performance degradation. We will release all code and trained models.

Method Overview
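
At a high level, APT allocates coarse patches to homogeneous regions and fine patches to complex ones, as described in the abstract. The sketch below illustrates that idea with a quadtree-style split driven by pixel variance; the function names, the variance criterion, and the threshold value are illustrative assumptions, not the released APT implementation.

```python
# Minimal sketch of the adaptive-patch idea: start from a coarse grid and
# recursively split any patch whose content looks "complex" (here, high
# pixel variance) into four smaller patches. All names and the variance
# criterion are illustrative assumptions.
import numpy as np

def adaptive_patches(image, max_patch=64, min_patch=16, var_threshold=5.0):
    """Return a list of (y, x, size) patches covering the image.

    image: (H, W, C) float array with H and W divisible by max_patch.
    Homogeneous regions keep the large patch; complex regions are split
    down to min_patch.
    """
    patches = []

    def visit(y, x, size):
        block = image[y:y + size, x:x + size]
        # Split if the block can still be halved and its content is complex.
        if size > min_patch and block.var() > var_threshold:
            half = size // 2
            for dy in (0, half):
                for dx in (0, half):
                    visit(y + dy, x + dx, half)
        else:
            patches.append((y, x, size))

    h, w = image.shape[:2]
    for y in range(0, h, max_patch):
        for x in range(0, w, max_patch):
            visit(y, x, max_patch)
    return patches

# Example: a 512x512 image with standard 16 px patches yields 1024 tokens;
# adaptive patches typically produce far fewer on homogeneous images.
img = np.random.rand(512, 512, 3).astype(np.float32)
print(len(adaptive_patches(img)), "tokens vs", (512 // 16) ** 2, "baseline")
```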

Interactive Demo

Interactive demo: choose the input resolution (e.g., 512 px) and, for S patch scales, S−1 thresholds ordered from the largest scale to smaller ones; the page then reports the baseline ViT token count alongside the APT token count for the demo image.
Object Detection Examples (EVA02-L-1536)

Semantic Segmentation Examples (EVA02-L-640)

BibTeX

@article{choudhury2025apt,
  title={Accelerating Vision Transformers with Adaptive Patch Sizes},
  author={Choudhury, Rohan and Kim, JungEun and Park, Jinhyung and Yang, Eunho and Jeni, L{\'a}szl{\'o} A. and Kitani, Kris M.},
  journal={arXiv preprint arXiv:2510.18091},
  year={2025},
  url={https://arxiv.org/abs/2510.18091},
  doi={10.48550/arXiv.2510.18091}
}