Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but it is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To exploit this, we propose SkipSR, a simple framework for accelerating video SR: identify low-detail regions directly from the low-resolution input, skip computation on them entirely, and super-resolve only the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. On standard SR benchmarks, our method reduces end-to-end latency by up to 60% relative to prior models on 720p videos, with no perceptible loss in quality.
In this work, we focus on accelerating diffusion-based super-resolution, which is particularly expensive due to the large size of its inputs and outputs, but crucial to video generation pipelines. Most prior works speed up diffusion transformers by reducing the number of steps or by modifying the attention mechanism. We instead find that we can skip computation altogether for certain "simple" patches, accelerating computation significantly. We identify these patches with a lightweight CNN that predicts a binary mask, which is then used to route them entirely around the transformer. The un-skipped patches that are refined retain their relative positions through a mask-aware rotary positional encoding. Extensive experiments and ablations demonstrate that our method preserves visual quality while significantly accelerating end-to-end generation time. Our method is illustrated in the figure below, and several examples, along with the predicted mask, are visualized at the bottom of the page.
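To make the routing concrete, below is a minimal PyTorch sketch of the idea. It is an illustration under stated assumptions, not the released implementation: the names `MaskPredictor`, `rope_2d`, and `skip_sr_step` are hypothetical, the bypass for skipped patches is a plain identity (a real pipeline would use a cheap upsampling path), and the transformer is assumed to accept variable-length token sequences.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of SkipSR-style routing. All names and the identity
# bypass are illustrative assumptions, not the paper's released code.

class MaskPredictor(nn.Module):
    """Lightweight CNN that emits one keep/skip logit per latent patch,
    computed directly from the low-resolution frame."""
    def __init__(self, in_ch: int = 3, patch: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            # One score per (patch x patch) region of the input.
            nn.Conv2d(64, 1, kernel_size=patch // 4, stride=patch // 4),
        )

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        # lr: (B, C, H, W) -> logits: (B, H // patch, W // patch)
        return self.net(lr).squeeze(1)


def rope_2d(coords: torch.Tensor, dim: int, base: float = 10000.0):
    """Rotary angles derived from each token's ORIGINAL (row, col) grid
    position, so gathered tokens keep their relative geometry
    (a mask-aware rotary positional encoding). Requires dim % 4 == 0."""
    half = dim // 4
    freqs = base ** (-torch.arange(half, device=coords.device).float() / half)
    ang = torch.cat([coords[:, :1] * freqs, coords[:, 1:] * freqs], dim=-1)
    return ang.cos(), ang.sin()  # each (N, dim // 2)


def apply_rope(x, cos, sin):
    """Standard rotary application over consecutive feature pairs."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), -1).flatten(-2)


def skip_sr_step(lr, tokens, grid_hw, predictor, transformer, thresh=0.0):
    """Route 'simple' patches around the transformer; refine only the rest.

    tokens:  (B, Hp * Wp, D) patch tokens of the diffusion latent
    grid_hw: (Hp, Wp) patch-grid shape matching the predictor's output
    """
    B, N, D = tokens.shape
    hp, wp = grid_hw
    keep = predictor(lr).flatten(1) > thresh      # (B, N) binary skip mask

    out = tokens.clone()                          # bypass = identity here;
                                                  # in practice a cheap upsample
    for b in range(B):                            # variable-length gather per sample
        idx = keep[b].nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        coords = torch.stack((idx // wp, idx % wp), dim=-1).float()
        cos, sin = rope_2d(coords, D)
        refined = transformer(apply_rope(tokens[b, idx], cos, sin).unsqueeze(0))
        out[b, idx] = refined.squeeze(0)          # scatter refined patches back
    return out
```

Because skipped patches bypass every transformer block rather than a subset of layers, the compute saving scales directly with the fraction of the frame the mask marks as low-detail.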