Reference
Line art
Result
By changing the dress colors in the reference image, LongAnimation can generate a video of the girls wearing the new colors. This indicates that our method can generate long-term color-consistent videos with a high degree of freedom, simply by recoloring the reference image. The frame rate is 30 fps.
(a) Existing studies achieve local color consistency by fusing the overlaps of adjacent video segments, but suffer from low long-term color consistency. (b) Our dynamic global-local paradigm dynamically extracts the color features of global historical segments as global memory and those of the latest generated segment as local memory, achieving high long-term color consistency. All segments are generated from the same reference image.
Animation colorization is a crucial part of real-world animation industry production. Colorizing long animations incurs high labor costs, so automated long-animation colorization based on video generation models has significant research value. Existing studies are limited to short-term colorization. They adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information and therefore fails to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting globally color-consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework that mainly consists of a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long-video understanding model to dynamically compress global historical features and adaptively fuse them with the features of the current generation. To refine color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth the transitions between video segments. Extensive experiments on both short-term (14-frame) and long-term (average 500-frame) animations show the effectiveness of LongAnimation in maintaining short- and long-term color consistency for the open-domain animation colorization task.
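The global-local memory idea above can be illustrated with a toy sketch. The function below is a simplified stand-in, not the released implementation: it "compresses" global memory by selecting the historical segment features most relevant to the current generation, concatenates them with local memory from the latest segment, and fuses everything by relevance-weighted averaging (all names and shapes here are illustrative assumptions).

```python
import numpy as np

def dynamic_global_local_memory(history_feats, local_feats, query, k=4):
    """Toy sketch of a dynamic global-local memory fusion.

    history_feats : (N, D) color features of all previously generated segments
    local_feats   : (M, D) color features of the latest generated segment
    query         : (D,)   feature summarizing the segment being generated
    k             : number of historical entries kept as compressed global memory
    """
    # Relevance of each historical segment feature to the current query.
    scores = history_feats @ query
    # "Compress" global memory: keep only the k most relevant entries.
    top = np.argsort(scores)[-k:]
    global_mem = history_feats[top]                       # (k, D)
    # Fuse global and local memory with softmax relevance weights.
    mem = np.concatenate([global_mem, local_feats], axis=0)
    w = np.exp(mem @ query)
    w /= w.sum()
    return w @ mem                                        # (D,) fused color feature
```

In the actual framework this role is played by a long-video understanding model and learned attention; the sketch only conveys the dynamic select-then-fuse structure.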
As shown in the following video, our method has better long-term color consistency compared to previous methods. LVCD* denotes LVCD guided by grayscale sketches, while the other methods all use binarized sketches.
In the video below, our method maintains better long-term color consistency for the girl's dress and the leaves, while other methods fail to achieve this.
In the video below, our method maintains better long-term color consistency.
In the video below, our method maintains better long-term color consistency.
Given a segmented foreground reference image and line sketches, LongAnimation can generate a long-term dynamic background for the foreground based on the provided prompt. Previous methods cannot achieve this.
A boy and a girl are sitting on a sandy beach.
A boy and a girl are sitting in the forest.
A boy and a girl are sitting in the park.
To demonstrate the effectiveness of essential components, we conduct extensive ablation experiments.
First, we evaluate the effectiveness of each module of LongAnimation. Compared to using only SketchDiT, our Dynamic Global-Local Memory (DGLM) mechanism significantly enhances the long-term color consistency of the animation (e.g., it prevents the little girl's hair color from constantly changing), while the color consistency reward (CCR) further refines certain details (e.g., the girl's hairband).
Next, we evaluate the effectiveness of Local Color Fusion (LCF) for LongAnimation. As shown in the following video, without LCF there are rapid color changes between stitched frames. Applying LCF from the early denoising steps (t_st=50) causes abnormal brightness changes, while applying it from the later denoising steps (t_st=20) achieves smoother transitions.
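The timestep-gated fusion described above can be sketched as follows. This is an illustrative toy, not the released implementation: the denoising step is a placeholder, and `t_start_lcf` plays the role of t_st, so fusion of the overlapping frames with the previous segment only kicks in once the step index drops to t_st or below.

```python
import numpy as np

def denoise_with_lcf(latents, prev_overlap, t_start_lcf=20, num_steps=50, alpha=0.5):
    """Toy sketch of Local Color Fusion gated by denoising step.

    latents      : (F, D) latent frames of the current segment
    prev_overlap : (n, D) latents of the overlapping frames from the
                   previously generated segment (n < F)
    t_start_lcf  : step at which LCF starts (t_st); 50 = from the beginning,
                   20 = only in the later denoising steps
    """
    for t in range(num_steps, 0, -1):          # t = num_steps ... 1
        latents = latents - 0.01 * latents     # placeholder for one denoising step
        if t <= t_start_lcf:
            n = prev_overlap.shape[0]
            # Blend the overlap frames toward the previous segment's frames.
            latents[:n] = alpha * latents[:n] + (1 - alpha) * prev_overlap
    return latents
```

Gating on `t <= t_start_lcf` matches the observation above: fusing too early (t_st=50) perturbs the coarse structure being formed, while fusing late (t_st=20) only adjusts colors, yielding smoother transitions.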