Video Adapter

Efficient Adaptation of Text-to-Video Foundation Models

Probabilistic Adaptation of Text-to-Video Models

Mengjiao Yang*, Yilun Du*, Bo Dai, Dale Schuurmans, Joshua B. Tenenbaum, Pieter Abbeel

1 Google DeepMind 2 UC Berkeley 3 MIT 4 University of Alberta

*indicates equal contribution.

Abstract

Large text-to-video models trained on internet-scale data have demonstrated exceptional capabilities in generating high-fidelity videos from arbitrary textual descriptions. However, adapting these models to tasks with limited domain-specific data, such as animation or robotics videos, poses a significant computational challenge, since finetuning a large pretrained model can be prohibitively expensive. Inspired by how a small modifiable component (e.g., prompts, prefix-tuning) can adapt a large language model to new tasks without requiring access to the model weights, we investigate how to adapt a large pretrained text-to-video model to a variety of downstream domains and tasks without finetuning. To this end, we propose Video Adapter, which leverages the score function of a large pretrained video diffusion model as a probabilistic prior to guide the generation of a task-specific small video model. Our experiments show that Video Adapter can incorporate the broad knowledge and preserve the high fidelity of a large pretrained video model in a task-specific small video model that uses as few as 1.25% of the pretrained model's parameters. Video Adapter generates high-quality yet specialized videos on a variety of tasks, including animation, egocentric video modeling, and modeling of simulated and real-world robotics data.

Video Adapter Framework

Adaptation through Score Composition

Video Adapter requires training only a small domain-specific text-to-video model, with orders of magnitude fewer parameters than a large video model pretrained on internet-scale data. During sampling, Video Adapter composes the scores of the pretrained and domain-specific video models, achieving high-quality and flexible video synthesis.
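
For concreteness, the sketch below shows what this composition can look like at sampling time, assuming a DDPM-style ancestral sampler and a simple weighted sum of the two models' noise predictions (a product-of-experts-style combination). The function name, noise schedule, and weighting scheme are illustrative assumptions rather than the paper's exact formulation; eps_pretrained and eps_small stand in for the two models.

	import torch
	
	@torch.no_grad()
	def sample_with_prior(eps_pretrained, eps_small, text, shape,
	                      num_steps=50, prior_weight=0.5, device="cpu"):
	    """DDPM-style ancestral sampling that composes the noise predictions
	    (scaled scores) of a large pretrained model and a small domain-specific
	    model. prior_weight controls the strength of the pretrained prior."""
	    # A linear beta schedule; the real schedule is model-specific.
	    betas = torch.linspace(1e-4, 2e-2, num_steps, device=device)
	    alphas = 1.0 - betas
	    alphas_bar = torch.cumprod(alphas, dim=0)
	
	    x = torch.randn(shape, device=device)  # start from pure noise
	    for t in reversed(range(num_steps)):
	        # Both models predict the noise added at step t.
	        eps_p = eps_pretrained(x, t, text)  # broad, high-fidelity prior
	        eps_s = eps_small(x, t, text)       # small, domain-specialized
	        # A weighted sum of scores corresponds to a geometric product of
	        # the two densities (product-of-experts-style composition).
	        eps = prior_weight * eps_p + (1.0 - prior_weight) * eps_s
	
	        # Standard DDPM posterior step; add noise except at the last step.
	        coef = betas[t] / torch.sqrt(1.0 - alphas_bar[t])
	        mean = (x - coef * eps) / torch.sqrt(alphas[t])
	        x = mean + (torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else 0.0)
	    return x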

Video Adapter for Animation and Robotics

We can train a small video model on a particular animation style (Detective Conan); the pretrained prior maintains this style while changing the background. We can also train task-specific small edge-to-sim and edge-to-real models on robotic videos, where the pretrained prior can modify the visual style of the generated videos as a form of domain randomization (see the usage sketch below).
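
As a hypothetical usage of sample_with_prior from the sketch above, domain randomization amounts to holding the small domain model fixed while varying the prompt that steers the pretrained prior; the prompts, tensor shape, and weight below are illustrative.

	# Hypothetical usage: the small domain model stays fixed while the prompt
	# steers the pretrained prior, restyling the video as domain randomization.
	for prompt in ["a robot arm pushing objects, photorealistic",
	               "a robot arm pushing objects, watercolor style"]:
	    video = sample_with_prior(eps_pretrained, eps_small, prompt,
	                              shape=(1, 16, 3, 64, 64),  # (batch, frames, C, H, W)
	                              prior_weight=0.7)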

Citation

	@article{yang2023probabilistic,
	  title={Probabilistic Adaptation of Text-to-Video Models},
	  author={Yang, Mengjiao and Du, Yilun and Dai, Bo and
	          Schuurmans, Dale and Tenenbaum, Joshua B and Abbeel, Pieter},
	  journal={arXiv preprint arXiv:2306.01872},
	  year={2023}
	}