Don't Look Twice: Faster Video Transformers with Run-Length Tokenization

¹Carnegie Mellon University ²Fujitsu Research *Equal Contribution

Inspired by video compressors, RLT efficiently identifies patches in the input video that are repeated over time and prunes them from the input. It then uses a duration encoding to tell the transformer which patches were removed. In all the videos on this page, the lighter-colored patches represent the patches that are masked out. These correspond semantically to parts of the video that are not moving and are repeated almost exactly over time.

Abstract

We present Run-Length Tokenization (RLT), a simple and efficient approach to speed up video transformers by removing redundant tokens from the input. Existing methods prune tokens progressively, incurring significant overhead and yielding no speedup during training. Other approaches are content-agnostic: they reduce the number of tokens by a constant factor and thus require tuning per dataset and video for optimal performance. Our insight is that redundant patches can be identified efficiently before running the model: RLT finds and removes all tokens that are repeated over time prior to model inference, replacing each run with a single token and a positional encoding that represents its new length. This approach is both content-aware, requiring no tuning across datasets, and fast, incurring negligible overhead. RLT can speed up pre-trained transformers without any additional training, increasing throughput by 40% with only a 0.1% drop in accuracy on action recognition. It also yields a large speedup in training, reducing the wall-clock time to fine-tune a video transformer by more than 40% while matching the baseline model's performance. These benefits extend to video-language tasks: RLT matches baseline performance on Epic Kitchens-100 multi-instance retrieval while reducing training time by 30%. On Kinetics-400, Something-Something-v2, and UCF101, RLT reduces the total token count by 30%, and on longer videos or at higher FPS settings it can reduce the token count by up to 80%.

Method Overview

RLT Overview

RLT works by first comparing consecutive input frames to find redundant patches. We deem two patches redundant if the mean L1 norm of their pixel difference is below an input threshold. For every patch in each frame, we compare it with the patch at the same spatial location in the following frame, and remove the patches that are similar enough. Once those are removed, we compute the "run-length": the number of consecutive patches at a spatial location that were removed. This represents the 'duration' of the token, and is expressed to the transformer via the length embedding. A sketch of this step appears below, followed by a toy example:
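A minimal sketch of this pruning step, assuming the video has already been split into non-overlapping patches; the function name `run_length_tokenize`, the tensor layout, and the default threshold are illustrative choices, not the released implementation:

```python
import torch

def run_length_tokenize(patches: torch.Tensor, threshold: float = 0.05):
    """Return a keep-mask over patches and the run length of each patch.

    patches: (T, N, P) tensor of frames x patches-per-frame x pixels-per-patch,
    with pixel values scaled to [0, 1]. The threshold is the tunable input
    described above; 0.05 is only a placeholder default.
    """
    T, N, _ = patches.shape
    # Mean L1 difference between each patch and the patch at the same
    # spatial location in the previous frame.
    diff = (patches[1:] - patches[:-1]).abs().mean(dim=-1)       # (T-1, N)
    repeated = diff < threshold            # True: frame t+1 repeats frame t
    # Always keep frame 0; later patches survive only if they changed enough.
    keep = torch.cat([torch.ones(1, N, dtype=torch.bool), ~repeated], dim=0)

    # Run length = 1 + number of consecutive repeats that follow each patch,
    # computed with a backward scan over time.
    lengths = torch.ones(T, N, dtype=torch.long)
    for t in range(T - 2, -1, -1):
        lengths[t] += repeated[t].long() * lengths[t + 1]
    return keep, lengths    # the model sees patches[keep] and lengths[keep]
```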

RLT Toy Example

In this example, each token consists of two consecutive patches at the same spatial location. Tokens repeated over time are replaced with a single token at the beginning of the run, and the duration of that token is encoded via the length embedding. For example, the green tokens on the top row all have a duration of 3. This lets RLT remove 6 tokens from the input while compensating for the loss of information. An added benefit of RLT is that it does not require running the model to determine the new length of the input. We can therefore avoid padding examples, and can instead use block-diagonal attention masks to efficiently run attention over large batches of examples with no overhead. This lets us make the most of the token reduction, whereas other methods typically show lower theoretical GFLOPs but no wall-clock speedup in practice. The sketch below shows how the length embedding and the block-diagonal mask fit together.
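A minimal sketch of both pieces, assuming a learned `nn.Embedding` over durations and the hypothetical helpers `add_duration` and `block_diagonal_mask`; the released implementation may organize this differently:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, max_run = 64, 16
length_emb = nn.Embedding(max_run + 1, dim)   # one learned vector per duration

def add_duration(tokens: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """tokens: (n, dim) surviving tokens; lengths: (n,) run lengths from RLT."""
    return tokens + length_emb(lengths.clamp(max=max_run))

def block_diagonal_mask(seq_lens: list[int]) -> torch.Tensor:
    """Boolean mask that is True only where attention is allowed, so tokens
    from different examples in the packed batch never attend to each other."""
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in seq_lens:
        mask[start:start + n, start:start + n] = True
        start += n
    return mask

# Three videos pruned to 7, 5, and 9 tokens pack into one 21-token sequence
# with no padding; the mask keeps their attention computations separate.
q = k = v = torch.randn(1, 8, 21, dim)        # (batch, heads, tokens, head_dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=block_diagonal_mask([7, 5, 9]))
```

Because the sequence lengths are known before the forward pass, this packing can be set up once per batch rather than recomputed inside the model.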

Sample Visualizations

Below are sample visualizations of RLT on standard action recognition datasets. RLT consistently removes repeated tokens that correspond to regions with no motion, while retaining all tokens that provide new information. Videos with large amounts of motion thus contain more tokens than videos with minimal motion.

Kinetics-400

Something-Something v2