Zero-Shot Video Question Answering with Procedural Programs

Robotics Institute, Carnegie Mellon University

ProViQ generates a program from an input query, then executes it to find the answer to the question in the video.

Abstract

We propose to answer zero-shot questions about videos by generating short procedural programs that derive a final answer by solving a sequence of visual subtasks. We present Procedural Video Query (ProViQ), which uses a large language model to generate such programs from an input question and an API of visual modules in the prompt, then executes them to obtain the output. Recent similar procedural approaches have proven successful for image question answering, but videos remain challenging: we provide ProViQ with modules intended for video understanding, allowing it to generalize to a wide variety of videos. This code generation framework also enables ProViQ to perform video tasks beyond question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, and multimodal video question-answering datasets.
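The generate-then-execute loop described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than ProViQ's actual interface: the module names `find` and `simple_query`, the `execute_command` entry point, the prompt wording, and the `stub_llm` function that stands in for a real language model call. A "video" here is just a list of dictionaries so that the example runs without any vision models.

```python
# Hypothetical API description placed in the prompt. The real module set
# used by ProViQ is not listed on this page; these two are illustrative.
API_DOC = '''
def find(frames, object_name): ...      # frames that contain the object
def simple_query(frame, question): ...  # answer a question about one frame
'''


def build_prompt(question):
    """Combine the API description and the question into one prompt."""
    return (
        f"{API_DOC}\n"
        f"# Question: {question}\n"
        "# Write execute_command(video) using only the API above.\n"
    )


def stub_llm(prompt):
    """Stand-in for a real LLM call: always returns the same program."""
    return (
        "def execute_command(video):\n"
        "    frames = find(video, 'dog')\n"
        "    if not frames:\n"
        "        return 'unknown'\n"
        "    return simple_query(frames[0], 'What is the dog doing?')\n"
    )


# Toy visual modules operating on dict-based frames (assumed data layout).
def find(frames, object_name):
    return [f for f in frames if object_name in f["objects"]]


def simple_query(frame, question):
    # A real module would run a vision-language model on the frame.
    return frame["caption"]


def answer(video, question, llm=stub_llm):
    """Generate a program for the question, then execute it on the video."""
    program = llm(build_prompt(question))
    namespace = {"find": find, "simple_query": simple_query}
    exec(program, namespace)  # defines execute_command in the namespace
    return namespace["execute_command"](video)


video = [
    {"objects": ["cat"], "caption": "a cat sleeping"},
    {"objects": ["dog"], "caption": "a dog catching a frisbee"},
]
result = answer(video, "What is the dog doing?")
print(result)
```

Because the generated program is executed with `exec`, a real system would likely want to run it in a sandbox and restrict the namespace to the intended modules, as the sketch does.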


Sample Results

We present some sample video results of ProViQ on a variety of video question answering benchmarks.

Sample results from the IVQA dataset.

Sample results from the MSR-VTT QA dataset.

Sample results from the MSVD-QA dataset.

Sample results from the ActivityNet-QA dataset.

Sample results from the TGIF-QA dataset.

Sample results from the TVQA dataset.


Long Video Summarization

An overview of our summarization module. One advantage of our approach is that we can design modules like this for specific tasks: in this case, we use video captioning models and LLMs to summarize long videos, which leads to a large improvement on the EgoSchema question-answering benchmark.
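One way such a module could be structured is sketched below: split the video into clips, caption each clip, and ask an LLM to condense the captions. This is an assumption-laden illustration, not the module's actual implementation. The function names, the fixed clip length, the prompt text, and both stubs (`stub_captioner`, `stub_llm`) are hypothetical, and frames are represented as strings so the example runs on its own.

```python
def split_into_clips(frames, clip_len):
    """Split a frame sequence into consecutive clips of at most clip_len frames."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


def summarize_video(frames, captioner, llm, clip_len=4):
    """Caption each clip, then ask the LLM to summarize the captions."""
    clips = split_into_clips(frames, clip_len)
    captions = [f"[clip {i}] {captioner(clip)}" for i, clip in enumerate(clips)]
    prompt = (
        "Summarize the video described by these clip captions:\n"
        + "\n".join(captions)
    )
    return llm(prompt)


def stub_captioner(clip):
    """Stand-in for a video captioning model: describes the first frame only."""
    return f"person handles {clip[0]}"


def stub_llm(prompt):
    """Stand-in for an LLM: joins the clip captions found in the prompt."""
    lines = [line for line in prompt.splitlines() if line.startswith("[clip")]
    body = "; ".join(line.split("] ", 1)[1] for line in lines)
    return f"{len(lines)} clips: {body}"


# Frames are plain strings here purely so the sketch is self-contained.
frames = ["knife", "knife", "onion", "onion", "pan", "pan", "pan", "plate", "plate"]
summary = summarize_video(frames, stub_captioner, stub_llm, clip_len=4)
print(summary)
```

The resulting text summary could then be passed to a question-answering step in place of the raw frames, which is what makes the approach practical for long videos.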


Related Work

Our project was inspired by several other related papers, which we highly encourage reading as well.

BibTeX


        @article{choudhury2023zero,
          title={Zero-Shot Video Question Answering with Procedural Programs},
          author={Choudhury, Rohan and Niinuma, Koichiro and Kitani, Kris M. and Jeni, Laszlo A.},
          journal={arXiv preprint arXiv:2312.00937},
          year={2023}
        }