Weijia Wu1, Mingyu Liu2, Zeyu Zhu1, Xi Xia1, Haoen Feng1, Wen Wang2, Kevin Qinghong Lin1, Chunhua Shen2, Mike Zheng Shou1†
1Show Lab, National University of Singapore
2Zhejiang University
Abstract
Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but primarily focus on short, single-scene videos. These models struggle with generating long videos that involve multiple scenes, coherent narratives, and consistent characters. Furthermore, there is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges through three unique contributions: (1) movie-length videos featuring rich, coherent storylines and multi-scene narratives; (2) consistency of character appearance and audio across scenes; and (3) a hierarchical data structure containing high-level movie information and detailed shot-level descriptions. Experiments demonstrate that MovieBench brings new insights and challenges, such as maintaining character ID consistency across multiple scenes for various characters. The dataset will be publicly released and continuously maintained, aiming to advance the field of long video generation. Data can be found at: MovieBench.
1 Introduction
Video generation has seen rapid advancements in recent years, driven by improvements in generative models [6, 4, 16, 39], data scale [26, 54, 37], and computational power. Early successes in this domain, based primarily on the diffusion process, such as Stable Video Diffusion [4], Video LDM [5], and I2VGen-XL [67], demonstrated impressive results in generating high-quality short videos. Recently, spatial-temporal transformer models, exemplified by Sora [6], have shown stronger performance by capturing both spatial and temporal dependencies within video sequences. However, these approaches have mostly been applied to short videos (text-to-video paradigms), typically limited to single scenes without intricate storylines or character development, as shown in Figure 1 (a).
Despite recent advancements, generating long videos that maintain character consistency, cover multiple scenes, and follow a rich narrative remains an unsolved problem. Existing models fail to address the challenges of long video generation, including the need to maintain character identity and ensure logical progression through multiple scenes. Moreover, a major bottleneck lies in the limitations of current benchmarks, which still focus on the analysis, training, and evaluation of short videos. Datasets like WebVid-10M [3], Panda-70M [12], and HD-VILA-100M [62] primarily consist of short video clips lasting only a few to tens of seconds (Table 1). While recent efforts like MiraData [26] have begun exploring longer video generation, the majority of the videos provided are still under one and a half minutes in length. More importantly, these benchmarks lack the crucial annotations required for long video generation, such as character ID information and the contextual relationships between video clips. This absence of character consistency tracking and logical progression across multiple scenes further restricts the development of models capable of handling the complexities of long-form narrative generation. As a result, the field is hindered by the absence of appropriate datasets tailored for the movie/long video generation task.
To address these gaps, we introduce the MovieBench dataset, specifically designed for movie-level long video generation (script-to-movie generation paradigm), as shown in Figure 1 (b). MovieBench provides three hierarchical levels of annotations: movie-level, scene-level, and shot-level. At the movie level, the annotations focus on high-level narrative structures, such as the script synopsis and a comprehensive character bank. The character bank includes the name, portrait images, and audio samples of each character, which can support tasks like custom audio generation and ensure character appearance and audio consistency across multiple scenes. Scene-level annotations encapsulate the progression of the story by detailing all the shots and events within a particular scene. Scene categories help maintain consistency in background, foreground, style, and character outfits across multi-view videos within the same scene. Finally, shot-level annotations capture specific moments, typically focusing on short sequences like close-ups or camera movements, usually lasting only a few seconds. Shot-level annotations include the characters present, plot, camera motion, background descriptions, and time-aligned subtitles and audio information for each video segment. These annotations emphasize specific details of the generated video, including the characters involved, their locations, dialogues, and actions, ensuring accurate alignment of this information. MovieBench consists of full-length movies with an average duration of 45.6 minutes (Table 1). MovieBench seeks to advance research in long video generation, filling a major gap in current benchmarks.
To summarize, the contributions of this paper are:
- •
We introduce MovieBench, the first benchmark designed for movie-level long video generation. It establishes a new paradigm for creating coherent narratives and enabling multi-scene progression.
- •
MovieBench provides three annotation levels: movie-level for scripts and character banks, scene-level for shot sequences and narratives, and shot-level for details like close-ups, plot, and camera movements.
- •
MovieBench provides character consistency and coherent, character-driven narrative development. Character banks include movie-level profiles with headshots, names, and audio samples, along with shot-level character sets, providing global and local consistency.
- •
Experiments demonstrate that MovieBench brings some new insights and challenges, such as maintaining character ID consistency, multi-view character ID coherence, and synchronized video generation with audio.
Dataset | Subtitle | Character | | Avg Text Len / Avg Video Len | | | Total Video Len | Text | Res.
| | Portrait | Audio | Movie-Level | Scene-Level | Shot-Level | | |
HD-VG-130M[51] | ✗ | ✗ | ✗ | - | - | 9.6w / 5.1s | 184Khr | Generated | 720p
WebVid-10M[3] | ✗ | ✗ | ✗ | - | - | 12.0w / 18.0s | 52Khr | Alt-Text | 360p
YouCook2[73] | ✓ | ✗ | ✗ | - | - | 8.8w / 19.6s | 176hr | Manual | -
MSR-VTT[61] | ✗ | ✗ | ✗ | - | - | 9.3w / 15.0s | 40hr | Manual | 240p
VATEX[53] | ✗ | ✗ | ✗ | - | - | 15.2w / 10.0s | 115hr | Manual | -
Panda-70M[12] | ✗ | ✗ | ✗ | - | - | 13.2w / 8.5s | 167Khr | Generated | 720p
HD-VILA-100M[62] | ✗ | ✗ | ✗ | - | - | 17.6w / 11.7s | 760.3Khr | ASR | 720p
InternVid[54] | ✗ | ✗ | ✗ | - | - | 32.5w / 13.4s | 371.5Khr | Generated | 720p
MiraData[26] | ✗ | ✗ | ✗ | - | - | 318.0w / 72.1s | 16Khr | Generated | 1080p
MovieBench (Ours) | ✓ | ✓ | ✓ | 43.4Kw / 45.6m | 263.6w / 15.4s | 66.3w / 4.09s | 69.2hr | Generated | 1080p
2 Related Works
2.1 Video-Text Datasets
Numerous video-text datasets have been developed, initially focusing primarily on video understanding [18, 23, 73, 53, 58], where research progress has been ahead of video generation [26, 12]. MSR-VTT [61], TextVR [60], ActivityNet [7], BOVText [56], How2 [45], and VALUE [29] paved the way for advancing tasks like video retrieval, captioning, video text spotting, and video question answering. These datasets, while extensive in their scope and impact, are typically framed around short-form video content with an emphasis on understanding rather than generation.
More recent works, such as MAD [48], AutoAD [18], AutoAD II [19], and AutoAD III [20], focus on movie-level understanding and description, capturing the complexity of long-form video content. These datasets enhance the understanding of movies by offering comprehensive narrative, character, and scene descriptions. However, while they excel at video retrieval and description tasks, their focus remains largely on understanding rather than generation, where significant challenges still exist, for example, how to generate multiple scenes, coherent narratives, and consistent characters.
2.2 Video Generation
The field of video generation has experienced rapid advancements, both in terms of models [21, 22, 28, 59, 70] and datasets [12, 3], particularly in the generation of short video clips from textual descriptions. On the model front, numerous outstanding works have emerged, such as diffusion-based models like SVD [4], VDM [21], and Sora [6], as well as autoregressive models like VideoGPT [63], CogVideo [22], and VideoPoet [27]. On the data front, notable datasets like WebVid-10M [3], Panda-70M [12], HD-VILA-100M [62], and InternVid [54] have been instrumental in establishing large-scale video-text corpora. WebVid-10M [3] has made a substantial contribution to video generation tasks by offering a rich collection of video-text pairs, enabling models to generate short, descriptive video clips. Similarly, Panda-70M [12] and HD-VILA-100M [62] expand on these efforts by incorporating diverse data, high-quality videos, and more complex textual descriptions for generating visually and semantically rich video segments. MiraData [26], meanwhile, focuses on enabling fine-grained video understanding and generation, incorporating dense annotations to improve both accuracy and diversity in generated clips.
However, these works are primarily designed for generating short videos and do not meet the requirements of movie-level generation. Movie-level generation is more complex, requiring the creation of longer video sequences while maintaining a coherent storyline, character consistency, and audio continuity. Our MovieBench aims to address these challenges by introducing a hierarchical dataset specifically designed for movie-level generation, providing a framework for generating complex storylines, character arcs, and consistent audio-visual elements.
3 MovieBench Dataset
Dataset | Aesthetic Score | Inception Score
InternVid[54] | 9.09 | 11.68
MiraData[26] | 9.70 | 6.27
MovieBench | 20.67 | 12.34
3.1 Data Collection
As the movie source for MovieBench, we utilize movies from LSMDC [43], which includes notable films such as 'Harry Potter and the Prisoner of Azkaban'. Using movies from LSMDC provides two key advantages: 1) Pre-existing Manual Annotations: MAD [48] provides manually annotated movie audio descriptions for the movie clips of LSMDC. These annotations serve as valuable references to further enhance the accuracy of the generated annotations, as shown in Figure 3. Note that movie audio descriptions cannot be directly used as plot annotations, as they lack character consistency and narrative coherence. 2) Open-Source Video Data: The video data in LSMDC is publicly accessible, which allows us to avoid copyright risks and ensures long-term availability. Due to limited human annotation resources, the current release covers a modest number of movies, divided into a training set and a test set. Table 2 demonstrates that MovieBench offers advantages in both video quality and aesthetics.
3.2 Movie Level Elements
3.2.1 Script Synopsis
The script synopsis plays a crucial role in offering a quick understanding of the high-level narrative structure of a movie. For each movie in MovieBench, we collect the corresponding script synopsis from a publicly available source, i.e., IMDb (https://www.imdb.com/). Each synopsis captures the core elements of the film while offering sufficient detail to guide video generation tasks. Script synopses can be used to generate scene- and shot-level annotations with LLMs (e.g., GPT-4o), enhancing the efficiency of script-to-movie generation.
3.2.2 Character Bank
Movie generation needs to maintain character consistency throughout the entire movie: when generating different scenes and shots of the same character, we usually require their face ID to remain unchanged. Therefore, we build a character bank. We scraped the cast list from IMDb for each movie to verify the characters and their corresponding actors in the character bank. Given a long-form movie $M$, the character bank for this movie can be written as $\mathcal{B} = \{(c_i, a_i, P_i, A_i)\}_{i=1}^{N}$, where $N$ denotes the number of characters, $c_i$ and $a_i$ are the $i$-th character and actor names in the movie, respectively, and $P_i$ and $A_i$ denote the portrait images and audio samples of the character in the movie.
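For concreteness, a minimal sketch of how one entry $(c_i, a_i, P_i, A_i)$ of the character bank could be represented is shown below; the field names are illustrative rather than part of the released annotation format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CharacterEntry:
    """One entry (c_i, a_i, P_i, A_i) of the character bank of a movie."""
    character_name: str                                       # c_i, e.g. "Harry Potter"
    actor_name: str                                           # a_i, the corresponding actor
    portrait_paths: List[str] = field(default_factory=list)   # P_i: cropped still images
    audio_paths: List[str] = field(default_factory=list)      # A_i: clean speech samples

@dataclass
class CharacterBank:
    """Character bank B = {(c_i, a_i, P_i, A_i)} for a single movie M."""
    movie_id: str
    characters: List[CharacterEntry] = field(default_factory=list)

    def names(self) -> List[str]:
        return [c.character_name for c in self.characters]
```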
Portrait Images of Characters. With the character names and stills of each character from IMDb, we used an object detector (i.e., GroundingDINO [34]) to detect all individuals in each still image and isolate the characters, ensuring that each picture contains only a single person. For each character, we then selected the corresponding still images and removed those not depicting the intended character. To this end, two annotators with a background in computer vision were invited to filter out images that were either incorrect or of low quality. After the stills selection process, we also invited a professional quality inspector to conduct a quality check on the selected photos; any non-compliant photos were returned for re-annotation. Figure 4 shows the frequency of character appearances.
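The cropping step can be sketched as follows. The paper uses GroundingDINO [34]; the sketch below substitutes a standard COCO-pretrained detector from torchvision purely for illustration, producing the single-person candidate crops that are then manually filtered.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Stand-in person detector; the paper uses GroundingDINO [34] for this step.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
to_tensor = transforms.ToTensor()

def crop_single_persons(still_path: str, score_thr: float = 0.8):
    """Return one crop per detected person so every candidate portrait shows a single person."""
    image = Image.open(still_path).convert("RGB")
    with torch.no_grad():
        pred = detector([to_tensor(image)])[0]
    crops = []
    for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
        if label.item() == 1 and score.item() >= score_thr:   # COCO label 1 = person
            x1, y1, x2, y2 = (int(v) for v in box.tolist())
            crops.append(image.crop((x1, y1, x2, y2)))
    return crops   # candidates are then filtered per character by the annotators
```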
After completing the data cleaning process, a quality inspector was invited to conduct a quality check. Randomly sampled movies were evaluated on two key aspects: portrait quality and portrait-name relevance. The inspector rated each portrait on a scale of 1 to 5, with 5 as the highest. Both the average portrait quality and portrait-name relevance scores were high, demonstrating the quality of the collected character bank. Detailed results can be found in the supplementary materials.
Audio Samples of Characters. To collect audio samples for each character, we developed a structured process: 1) Face Detection and Recognition: a face detector [66] and recognizer [47] are used to detect and recognize all faces in each frame by matching them against the portrait images $P_i$. 2) Speaker Detection and Duration Identification: next, a speaker detector [31] is employed to determine which character is speaking, along with the duration. 3) Audio Extraction: based on the identified durations, we extract audio segments corresponding to each character. 4) Quality Verification: finally, an annotator reviews the audio of each character to confirm that it indeed belongs to the intended character; any mismatched audio segments are removed. Ultimately, a noise-free audio sample is collected for each character. These annotations can be used in tasks such as audio customization and audio-driven video generation.
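A hedged sketch of steps 1-3 is given below. DeepFace [47] is used for the portrait matching; the active-speaker detector is left abstract because it is implementation-specific, and the ffmpeg command for extracting a speaking interval is only one possible choice.

```python
import subprocess
from deepface import DeepFace   # face recognition used for matching against portraits [47]

def face_matches_character(face_crop_path: str, portrait_path: str) -> bool:
    """Step 1: verify whether a detected face crop matches a character portrait P_i."""
    result = DeepFace.verify(img1_path=face_crop_path, img2_path=portrait_path,
                             enforce_detection=False)
    return bool(result["verified"])

def detect_speaking_intervals(video_path: str):
    """Step 2 (abstract): an active-speaker detector such as [31] returns
    (character_name, start_sec, end_sec) tuples for each speaking turn."""
    raise NotImplementedError

def extract_audio_segment(video_path: str, start: float, end: float, out_wav: str) -> None:
    """Step 3: cut the identified speaking interval out of the movie's audio track."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ss", f"{start:.2f}", "-to", f"{end:.2f}",
         "-vn", "-ac", "1", "-ar", "16000", out_wav],
        check=True,
    )
```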
3.3 Scene Breakdown
The LSMDC movies are pre-segmented at the shot level, enabling us to classify each shot for scene breakdown. Given shot-level video clips $\{v_1, v_2, \dots, v_T\}$, a VLM (e.g., GPT-4o) is used to obtain the corresponding scene breakdown results $\{s_1, s_2, \dots, s_T\}$. Since adjacent video clips are likely to belong to the same scene, we begin by classifying the first video clip and then progressively classify the subsequent clips in chronological order. For non-initial clips, the previous clip and its scene label are provided as additional input, allowing the model to determine whether the current clip belongs to the same scene; if not, a new scene label is generated accordingly. For the $i$-th video clip $v_i$ (with $i > 1$), the scene breakdown result can be formulated as:

$$s_i = \mathrm{VLM}\big(v_i, (v_{i-1}, s_{i-1})\big). \tag{1}$$
This iterative approach accurately identifies scene boundaries while ensuring consistency across related shots. Figure 5 shows the distribution of scene counts across different movies.
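A minimal sketch of this iterative procedure is shown below; `classify_scene_with_vlm` stands in for a call to a VLM such as GPT-4o and is an assumed helper, not part of a released API.

```python
from typing import Callable, List, Optional, Tuple

def breakdown_scenes(
    clips: List[str],
    classify_scene_with_vlm: Callable[[str, Optional[Tuple[str, str]]], str],
) -> List[str]:
    """Assign a scene label s_i to every shot clip v_i following Eq. (1):
    each non-initial clip is classified together with the previous clip and its label."""
    scene_labels: List[str] = []
    for i, clip in enumerate(clips):
        context = None if i == 0 else (clips[i - 1], scene_labels[i - 1])
        # The VLM either reuses the previous label (same scene) or emits a new one.
        scene_labels.append(classify_scene_with_vlm(clip, context))
    return scene_labels
```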
Description | Model | Completeness (Score 1-5) | Hallucination (Score 1-5)
Plot | MiniCPMV | 4.12 | 3.14
Plot | Gemini-1.5-pro | 4.58 | 1.76
Plot | GPT-4o | 4.78 | 1.74
Background | MiniCPMV | 4.36 | 2.52
Background | Gemini-1.5-pro | 4.64 | 1.17
Background | GPT-4o | 4.81 | 1.24
Camera | MiniCPMV | 4.32 | 2.48
Camera | Gemini-1.5-pro | 4.64 | 1.58
Camera | GPT-4o | 4.88 | 1.17
Style | MiniCPMV | 4.60 | 1.61
Style | Gemini-1.5-pro | 4.88 | 1.11
Style | GPT-4o | 4.95 | 1.17
3.4 Shot-Level Temporal Annotations
3.4.1 Annotation Generation
Based on movie-related works [23, 1], the following annotations are indispensable when generating a shot-level video: 1) appearing characters; 2) plot; 3) scene/background; 4) shooting style; 5) camera motion. Previous works have validated that vision-language models (VLMs) [11, 57, 26] can generate accurate descriptive annotations. Inspired by MovieSeq [32], a VLM (e.g., GPT-4o) is used to generate the relevant annotations. As shown in Figure 3, to enable the VLM to better understand and summarize each video clip, four types of information are provided as interleaved multimodal input: 1) video frames, which carry the visual information; 2) the character bank, which includes images and names of characters (Sec. 3.2.2); 3) subtitles; and 4) the movie audio description, which contains manually crafted visual descriptions. Using the interleaved multimodal sequence described above, we generate a corresponding interleaved instruction and input it into the VLM, as sketched below. The detailed prompt strategy and interleaved sequence can be found in the supplementary materials.
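A sketch of how the four input types could be interleaved into a single multimodal request is given below; the message schema follows the common vision-chat format, and the instruction text is illustrative since the actual prompt is provided in the supplementary materials.

```python
import base64
from typing import Dict, List

def image_part(path: str) -> Dict:
    """Encode a video frame or character portrait as an inline image message part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def build_interleaved_input(frames: List[str], character_bank: Dict[str, str],
                            subtitles: str, audio_description: str) -> List[Dict]:
    """Interleave frames, character portraits/names, subtitles, and the manual
    audio description into one user message for the VLM."""
    content: List[Dict] = [{"type": "text", "text": "Video frames of the shot:"}]
    content += [image_part(p) for p in frames]
    for name, portrait_path in character_bank.items():
        content.append({"type": "text", "text": f"Character: {name}"})
        content.append(image_part(portrait_path))
    content.append({"type": "text", "text": f"Subtitles: {subtitles}"})
    content.append({"type": "text", "text": f"Audio description: {audio_description}"})
    content.append({"type": "text",
                    "text": "Summarize the appearing characters, plot, background, "
                            "shooting style, and camera motion of this shot."})
    return [{"role": "user", "content": content}]
```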
3.4.2 Quality Evaluation and Correction
The generated annotations (e.g., plot descriptions) are not always accurate and may suffer from hallucinations [2] or incomplete descriptions. Accordingly, we randomly sampled shot-level video clips and their annotations for quality evaluation. As shown in Table 3, we evaluated the quality of description-based annotations in terms of completeness and hallucination. For appearing characters, we evaluate performance using the character ID consistency metrics, i.e., precision, recall, and F-score (see §3.5), as shown in Table 4.
To further improve the accuracy of the generated annotations, we also enlisted two annotators with backgrounds in computer vision to refine the annotations for the test set. The correction and refinement guidelines can be found in the supplementary materials. Due to limited annotator resources and the relatively high accuracy of the GPT-4o-generated annotations, we only performed manual corrections on the test set.
Model | Precision/% | Recall/% | F-score/%
MiniCPMV | 23 | 66 | 34
Gemini-1.5-pro | 90 | 76 | 82
GPT-4o | 90 | 97 | 93
3.4.3 Audio and Subtitles
To obtain character-specific subtitles and audio aligned with video timing, the following approach is implemented: 1) Speaker Diarization and Segmentation: a speaker diarization tool [38] is utilized to divide the continuous audio into distinct segments, each representing an independent speaker. 2) Matching via Audio Embeddings: for each segmented audio clip, an audio embedding is extracted with Pyannote [38]; cosine similarity matching is then used to identify the character by comparing this embedding with the embeddings of the character audio bank $A_i$. 3) Subtitle Generation: once character-specific audio segments are identified, Whisper [41] is used to transcribe the speech into subtitles.
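The three steps can be sketched as below, assuming pyannote [38] checkpoints for diarization and embedding and Whisper [41] for transcription; `cut_wav` is a hypothetical helper (e.g., an ffmpeg wrapper) that writes the audio between two timestamps to a temporary file, and the checkpoint names are assumptions, not the exact models used.

```python
import numpy as np
import whisper                                           # transcription [41]
from pyannote.audio import Inference, Model, Pipeline    # diarization and embeddings [38]

# Loading these checkpoints may require a Hugging Face access token.
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
embedder = Inference(Model.from_pretrained("pyannote/embedding"), window="whole")
asr = whisper.load_model("large-v3")

def cosine(a, b) -> float:
    a, b = np.ravel(a), np.ravel(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_and_transcribe(shot_wav: str, audio_bank: dict) -> list:
    """audio_bank maps character name -> reference wav A_i from the character bank."""
    bank_embs = {name: embedder(path) for name, path in audio_bank.items()}
    results = []
    for turn, _, _ in diarizer(shot_wav).itertracks(yield_label=True):
        seg_wav = cut_wav(shot_wav, turn.start, turn.end)   # hypothetical helper
        seg_emb = embedder(seg_wav)
        speaker = max(bank_embs, key=lambda n: cosine(seg_emb, bank_embs[n]))
        subtitle = asr.transcribe(seg_wav)["text"].strip()
        results.append({"character": speaker, "start": turn.start,
                        "end": turn.end, "subtitle": subtitle})
    return results
```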
3.5 Metrics
The common metrics include: CLIP Score [72], which measures the correlation between the generated video and the text description; Aesthetic Score, which quantifies the overall visual appeal; Fréchet Image Distance (FID) [46], which assesses how closely the distribution of generated frames resembles that of real frames; and Inception Score, which measures how distinguishable the generated videos are in terms of content.
However, metrics like CLIP Score cannot assess storyline coherence or character consistency. Therefore, we introduce three new metrics: character precision, recall, and F-score. These metrics evaluate the consistency of characters in the generated videos by comparing them to the character list of the script. Specifically, we first use DeepFace [47] to detect and recognize characters in the generated videos, obtaining a generated character set $\hat{C}_j$ for the $j$-th video shot, $j = 1, \dots, K$, where $K$ is the total number of shots. We then compare $\hat{C}_j$ with the ground-truth character set $C_j$: the false positives (FP) are $|\hat{C}_j \setminus C_j|$, the false negatives (FN) are $|C_j \setminus \hat{C}_j|$, and the true positives (TP) are $|\hat{C}_j \cap C_j|$. Finally, precision ($\mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$), recall ($\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$), and F-score ($2\mathrm{TP}/(2\mathrm{TP}+\mathrm{FP}+\mathrm{FN})$) can be calculated, providing a more precise evaluation of character consistency.
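A minimal sketch of the metric computation, aggregated over the shots of a movie, is given below; the recognition step (e.g., DeepFace) is assumed to have already produced the predicted character sets.

```python
from typing import List, Set

def character_consistency(pred_sets: List[Set[str]], gt_sets: List[Set[str]]):
    """Character precision / recall / F-score over all shots of a movie.
    pred_sets[j]: characters recognized in generated shot j (e.g. via DeepFace [47]);
    gt_sets[j]:   ground-truth character set of shot j from the annotations."""
    tp = sum(len(p & g) for p, g in zip(pred_sets, gt_sets))   # correctly generated characters
    fp = sum(len(p - g) for p, g in zip(pred_sets, gt_sets))   # hallucinated characters
    fn = sum(len(g - p) for p, g in zip(pred_sets, gt_sets))   # missing characters
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f_score

# Example: the shot should show Harry and Ron, but only Harry was generated.
print(character_consistency([{"Harry"}], [{"Harry", "Ron"}]))   # ≈ (1.0, 0.5, 0.67)
```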
4 Experiments
4.1 Text to Keyframe/StoryBoard Generation
Text-to-keyframe/storyboard generation refers to the creation of coherent, extended visual sequences based on the plot, where character identity consistency is maintained. Baseline: As shown in Table 5, StoryGen [33], StoryDiffusion [74], and AutoStory [52], three commonly used keyframe generation models, are evaluated on MovieBench. Since the training code of StoryDiffusion [74] is not publicly available, we test the official pretrained weights directly on the test set. For the LoRA-based [44] AutoStory [52], we train a unique LoRA model for each character and generate results conditioned on the plot. Analysis: Given the plot of each shot-level video, the model generates a keyframe, as illustrated in Figure 7. For evaluating reference-based metrics (e.g., FID, CLIP Score), we extract the middle frame from the original video as the ground truth, as sketched below. Table 5 shows that generated keyframes still struggle to maintain character consistency, making it challenging to meet real-world application requirements.
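A sketch of the ground-truth extraction and a CLIP-based text-image score is shown below; the paper uses the clip-score package [72], while this sketch computes an equivalent cosine similarity with the Hugging Face CLIP model, so exact numbers may differ.

```python
import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def middle_frame(video_path: str) -> Image.Image:
    """Grab the temporally central frame of a shot as the keyframe ground truth."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, n_frames // 2)
    ok, frame = cap.read()
    cap.release()
    assert ok, f"could not read {video_path}"
    return Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

def clip_score(image: Image.Image, plot: str) -> float:
    """Scaled cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[plot], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return 100.0 * float((img * txt).sum())
```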
Method | CLIP | Inception | Aesthetic | FID | FVD | VBench Metrics[25]/% | | | | Character Consistency/% | |
| | | | | | Sub Cons. | Bg Cons. | M Smth. | Dyn. | Prec. | Rec. | F-score
Text to Keyframe/StoryBoard Generation | ||||||||||||
StoryGen | 20.37 | 7.01 | 22.46 | 16.17 | - | - | - | - | - | 77.00 | 1.40 | 2.80 |
StoryDiffusion | 21.56 | 9.13 | 26.08 | 11.84 | - | - | - | - | - | 78.17 | 37.26 | 50.47 |
AutoStory | 20.16 | 7.14 | 23.87 | 18.68 | - | - | - | - | - | 77.61 | 41.81 | 54.34 |
Text to Video Generation | ||||||||||||
DreamVideo | 22.39 | 11.63 | 19.13 | 7.99 | 853.36 | 88.05 | 92.97 | 96.19 | 69.47 | 8.07 | 2.43 | 3.74 |
Magic-Me | 21.52 | 10.81 | 20.87 | 8.63 | 789.12 | 96.90 | 96.31 | 98.24 | 15.97 | 41.30 | 5.80 | 10.17 |
Image to Video Generation | ||||||||||||
I2VGen-XL | 22.39 | 8.63 | 8.12 | 1.77 | 512.47 | 76.18 | 85.38 | 96.30 | 79.16 | 19.41 | 10.82 | 13.90 |
SVD | 22.28 | 7.36 | 11.45 | 1.25 | 190.48 | 92.97 | 94.48 | 98.17 | 84.67 | 20.49 | 12.54 | 15.56 |
CogVideoX | 22.43 | 7.54 | 14.16 | 1.23 | 228.80 | 90.37 | 93.78 | 98.60 | 50.42 | 24.80 | 15.63 | 19.17 |
Model | FID | FVD | Lip-sync |
Edtalk[49] | 2.39 | 504.49 | 2.41 |
Hallo2[15] | 1.68 | 475.13 | 1.66 |
4.2 Identity-Customized Long Video Generation
Identity-customized long video generation involves creating long videos that consistently feature specific characters, aligning with the plot. Baseline: DreamVideo [55] and Magic-Me [35], two identity-customized video generation models, were used to evaluate our dataset. During training, only the character bank is used to learn the features of each character. Analysis: Table 5 demonstrates that the generated characters struggle significantly with consistency. Several factors contribute to this: 1) video generation is significantly more challenging than image generation, and current models [67] often fail to produce coherent, high-quality videos; 2) both methods can maintain consistency for only one character per video, limiting their ability to handle multiple characters simultaneously.
4.3 Image-conditioned Video Generation
Image-conditioned video generation involves producing subsequent frames based on an initial frame. Baseline: I2VGen-XL [67], SVD [4], and CogVideoX [64] are used as baselines, with the first frame of the real video as the conditioning input. Analysis: Because real images are used as conditions, the generated videos are closer to real videos, which results in better FID and FVD scores. However, the character consistency (F-score) remains unsatisfactory, for two main reasons: 1) the first frame may not include all the characters that appear in the shot; 2) movies often feature shots from various angles, as shown in Figure 8 (c), and current models cannot effectively identify and maintain the consistency of the characters.
4.4 Talking Human with Audio Generation
Talking human with audio generation refers to the task of creating video scenes featuring a specific character talking, based on provided subtitles or audio. Baseline: Edtalk [49] and Hallo2 [15], two widely used audio-driven video generation models, were applied to validate our dataset. Analysis: Current audio-driven talking generators still primarily focus on talking-head animation; creating full-body or multi-person talking scenes remains a significant challenge. As shown in Figure 6, ensuring audio and visual consistency with synchronized lip movements for multiple characters is highly challenging.
4.5 Ablation and Analysis
Emerging Challenge in Multi-Character Consistency. Table 5 shows that existing models struggle with multi-character consistency, with the best method reaching a character F-score of only 54.34%. Figure 8 (a) presents visual comparisons that highlight significant limitations of current models for multi-character generation. Additionally, certain models, such as StoryDiffusion, Magic-Me, and DreamVideo, can only maintain consistency for a single character at a time and struggle with multi-character consistency. AutoStory [52] employs Mix-of-Show [17] with pose-guided techniques to manage multiple character IDs; however, in real-world applications, pose sequences are difficult to obtain.
Emerging Challenge in Character-Based Plot Following. MovieBench introduces a new challenge in character-based plot following, particularly in defining character relationships and accurately generating character interactions. Figure 8 (b) illustrates the difficulties in following these plots and understanding character interactions. For instance, methods like AutoStory, which rely on explicit constraints, often fail to generate distinct human poses and struggle to interpret actions like 'ride'.
Emerging Challenge in Diverse-View Generation. Figure 8 (c) highlights another challenge: generating scenes and characters from various views while maintaining consistency in appearance. Current models can only produce close-up, frontal shots of characters and struggle with other angles or wide shots. However, diverse views are essential in movies to convey different moods and purposes.
Emerging Challenge in Synchronized Video Generation with Audio. No existing method or dataset has explored the generation of video synchronized with audio in long videos, as shown in Figure 8 (d). Some works, such as PersonaTalk [65] and Hallo2 [15], have explored and advanced the task of talking-head generation. However, these tasks primarily focus on single-person talking-head generation and basic scene creation, leaving a substantial gap when it comes to realistic multi-character dialogue in movie contexts. MovieBench facilitates the exploration of complex talking scenes involving multiple characters.
5 Conclusion
This paper presents MovieBench, a hierarchical movie-level dataset specifically designed for long video generation. We reevaluated existing generation models and identified new insights and challenges in the field, including the need to maintain consistency across multiple character IDs, ensure multi-view character coherence, and align generated results with complex storylines. By facilitating the exploration of complex character interactions and rich storylines, MovieBench aims to advance research in long video generation and inspire future developments in the field.
Supplementary Material
6 Customized Audio Generation
Customized audio generation involves creating customized soundtracks for specific characters and emotional cues. We conduct experiments on movies from the test set, splitting the audio of each character into two parts: half as the test set and half as reference audio for evaluation. Following prior works [13, 10], we evaluate performance on a cross-sentence task, where the model synthesizes a reading of a reference text in the style of a given speech prompt.
Method | MCD | WER(%) | SIM-o
YourTTS[9] | 8.41 | 0.26 | 0.97 |
xTTS[10] | 8.31 | 0.28 | 0.98 |
VALL-E-X[68] | 4.48 | 1.25 | 0.97 |
F5-TTS[13] | 3.12 | 0.20 | 0.98 |
6.1 Metric
Following prior work, three common metrics, namely Word Error Rate (WER), Speaker Similarity (SIM-o), and Mel Cepstral Distortion (MCD), are used to evaluate our dataset. For WER, Whisper-large-v3 [41] is used to transcribe the audio to text, after which the word error rate is calculated at the text level. For SIM-o, a WavLM-large-based speaker verification model [14] is used to extract speaker embeddings, enabling cosine similarity calculation between synthesized and ground-truth speech. For MCD, an open-source PyTorch implementation (https://github.com/chenqi008/pymcd) is used to evaluate the similarity between synthesized and real audio. For evaluation, each audio file is converted to a single-channel, 16-bit PCM WAV with a sample rate of 22050 Hz.
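A hedged sketch of the WER and SIM-o computations is shown below; Whisper and the jiwer package are used for WER, while the speaker embedder is passed in as a callable because the WavLM-large verification model [14] is loaded in an implementation-specific way. MCD is computed with the pymcd package linked above.

```python
import numpy as np
import whisper          # ASR used for WER [41]
from jiwer import wer   # standard word-error-rate implementation

asr = whisper.load_model("large-v3")

def word_error_rate(ref_text: str, synth_wav: str) -> float:
    """WER: transcribe the synthesized speech and compare with the reference text."""
    hyp = asr.transcribe(synth_wav)["text"].strip().lower()
    return wer(ref_text.strip().lower(), hyp)

def sim_o(gt_wav: str, synth_wav: str, embed) -> float:
    """SIM-o: cosine similarity between speaker embeddings of real and synthesized speech;
    `embed` is any speaker-verification embedder (a WavLM-large model in the paper [14])."""
    a, b = np.ravel(embed(gt_wav)), np.ravel(embed(synth_wav))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```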
Movie | Portrait Quality | Name Relevance |
AS Good As It Gets | 4.56 | 5.00 |
Clerks | 4.20 | 4.92 |
Halloween | 4.00 | 4.89 |
The Hustler | 4.80 | 4.98 |
Chasing Amy | 4.42 | 4.78 |
The Help | 4.30 | 5.00 |
No Reservations | 4.86 | 4.93 |
An Education | 4.70 | 4.85 |
Harry Potter and the Chamber of Secrets | 4.73 | 5.00
Seven Pounds | 4.71 | 4.87 |
6.2 Baseline and Analysis
Four audio customization methods, namely YourTTS [9], xTTS [10], VALL-E-X [68], and F5-TTS [13], were evaluated on MovieBench. We performed direct zero-shot testing without any additional training, with F5-TTS achieving the best performance, as shown in Table 7. Notably, each evaluation was conducted individually for each character. However, the real challenge lies in scenes with multi-character interactions, as seen in movies: generating audio that matches the tone and voice of each character while remaining consistent with the visuals is difficult, especially when distinct voices must be maintained across different audio tracks.
7 Quality Evaluation and Correction for Shot-Level Annotation
Correction for Description-based Annotations. The main text mentions that we required two annotators to manually correct the shot-level annotations in the test set. The specific correction rules are as follows:
- •
Check and Refine Descriptions: reviewing the descriptions of characters, style, plot, background, and camera motion, correcting any inaccuracies.
- •
Character Set Adjustments: Characters not belonging to the Character Bank were removed from the video clip’s character set, and any missing characters were added.
- •
Style Matching: Ensure that the style element accurately reflected the video clip’s content, avoiding subjective interpretations.
- •
Plot Alignment: The plot was verified to align with the main content of the video, with any hallucinated or irrelevant information removed.
- •
Grammatical Accuracy: Descriptions were refined to ensure grammatical correctness.
- •
Objectivity: The descriptions were made more objective, avoiding subjective terms or speculative phrases such as ”I think” or ”it might be.”
Two annotators were instructed to progressively refine the character set, style, and plot based on the above rules. The refinement interface is shown in Figure 10. The interface provides character photos, names, key frames from the original video, and shot-level annotation details (such as the plot and the appearing character set). Annotators use this information to assess the accuracy of the annotations and identify areas for improvement. The refinement process took the two annotators approximately one week.
Quality Evaluation for the Shot-Level Appearing Character Set. As described in the main text, the character photos in our character bank were manually selected by two annotators. After completing the data annotation, we conducted a quality assessment focusing on portrait quality and name-portrait relevance. Table 8 shows the relevant experimental results. Portrait-name relevance scores significantly higher than portrait quality, mainly because manually selected images are generally consistent with their names, leaving little room for error in relevance, whereas image quality is harder to guarantee, as not all candidate images are of consistently high quality.
8 Possible Directions? Single-Stage or Two-Stage
Movie/long video generation is typically not done in one go; instead, it is divided into multiple shot clips for individual generation. Currently, there are two main approaches for this task: one-stage and two-stage methods.
One Stage. Currently, there are no fully realized one-stage solutions for this task. Most open-source one-stage models [71, 67] focus on text-to-video generation, lacking the ability to maintain character consistency and connect storylines across different video clips. DreamVideo [55] and Magic-Me [35], two commonly used customizable video generation models, are utilized in our paper. The typical workflow involves first creating a script with character-specific details for each shot, generating the corresponding video clips for each shot individually, and then stitching these clips together to produce a cohesive long-form video.
Metric | Better | Description |
Portrait Quality | higher | Quality assessment for character portraits, involving human raters scoring the image quality on a scale of 1 to 5, with 5 being the highest score.
Portrait-Name Relevance | higher | Portrait-name relevance score for each character name and portrait pair, rated by humans on a scale of 1 to 5, with 5 being the highest score.
Completeness | higher | Descriptive completeness score, assessing the extent to which textual annotations (e.g., Plot, Background Description) reflect the completeness of the video content.
Hallucination | lower | Fantasy score, assessing the degree of hallucination in textual descriptions of video content (e.g., Plot, Background Description).
CLIP Score | higher | The evaluation of semantic alignment between the plot and the generated outputs.
Aesthetic Score | higher | The evaluation of the aesthetic quality of an image, extracting visual features with CLIP and comparing them with a pre-trained aesthetic model to quantify the score.
Fréchet Image Distance (FID) | lower | The evaluation of the quality of generated images by comparing the feature distributions of real and generated images.
Inception Score | higher | The evaluation of the quality and diversity of generated images using an Inception network.
FP | lower | The total number of false positives. Formula: $\mathrm{FP} = |\hat{C} \setminus C|$.
FN | lower | The total number of false negatives. Formula: $\mathrm{FN} = |C \setminus \hat{C}|$.
TP | higher | The total number of true positives. Formula: $\mathrm{TP} = |\hat{C} \cap C|$.
Recall | higher | Ratio of correct detections & recognitions to the total number of GTs. Formula: $\mathrm{Recall} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$.
Precision | higher | Ratio of correct detections & recognitions to the total number of predicted detections & recognitions. Formula: $\mathrm{Precision} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$.
F-score | higher | [42]. The ratio of correctly identified detections & recognitions over the average number of ground-truth and computed detections & recognitions. Formula: $\mathrm{F\text{-}score} = 2\mathrm{TP} / (2\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$.
Subject Consistency | higher | DINO[8] is used to assess whether the appearance remains consistent throughout the entire video. |
Background Consistency | higher | CLIP feature similarity[40] is used to evaluate the temporal consistency of the background scenes. |
Motion Smoothness | higher | A video frame interpolation model [30] is used to evaluate the smoothness of generated motions.
Dynamic Degree | higher | Optical flow estimation[50] is used to evaluate the degree of dynamics in synthesized videos. |
Two Stages. Directly generating long-form videos is highly challenging. Therefore, the two-stage strategy has become a more practical solution: 1) First, keyframe/storyboard generation models [52, 74, 69, 57] are used to generate a keyframe for every shot-level video. Figure 11 provides additional visualizations of both successful and challenging cases for AutoStory [52] and StoryDiffusion [74]. AutoStory [52] excels in maintaining consistency across multiple characters but struggles with certain background compatibility, while StoryDiffusion [74] performs well in generating natural interactions between characters and backgrounds but has difficulty maintaining consistency across multiple characters. 2) With the keyframes, image-conditioned video generation models (e.g., SVD) are employed to expand the keyframes into full video clips, and the resulting clips are stitched together to form a coherent sequence, as sketched below. While this method addresses some issues of video continuity and narrative progression, it still faces difficulties in maintaining a smooth flow across clips and ensuring consistent character representation throughout the film. Moreover, for certain shots with strong temporal dependencies, relying solely on keyframes is insufficient: Figure 12 shows an example where a single keyframe clearly cannot capture the sequence of Harry's appearance.
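As an illustration of stage 2, the sketch below expands one generated keyframe into a shot-level clip with SVD via the diffusers library; the file names are placeholders, and the sampling parameters are library defaults rather than the settings used in our experiments.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

# Stage 2: expand a stage-1 keyframe into a short video clip with SVD.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

keyframe = load_image("shot_0003_keyframe.png").resize((1024, 576))  # stage-1 output (placeholder path)
frames = pipe(keyframe, decode_chunk_size=8).frames[0]
export_to_video(frames, "shot_0003.mp4", fps=7)
# Clips generated per shot are then concatenated in script order to form the long video.
```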
9 Metric Formulation
As shown in Table 9, we summarize and formulate the evaluation metrics relevant to the tasks discussed in this paper. 'Portrait Quality' and 'Portrait-Name Relevance' assess the accuracy of the character bank annotations, specifically evaluating the precision of manual image selection and labeling. 'Completeness' and 'Hallucination' measure the accuracy of description-based annotations (e.g., plot and background descriptions), focusing on the completeness of details and on hallucinations in the VLM descriptions. The CLIP Score, Aesthetic Score, Fréchet Image Distance, and Inception Score evaluate the quality of generated images/videos and their alignment with text descriptions. Additionally, this paper introduces new metrics—character precision, recall, and F-score—to assess character consistency. 'Subject Consistency', 'Background Consistency', 'Motion Smoothness', and 'Dynamic Degree' are recently proposed metrics from VBench [25], aimed at evaluating various aspects of generated video.
10 Limitation
Hallucination of Plots from GPT-4o. Although GPT-4o demonstrates high accuracy and rarely makes errors, its generated plot descriptions can still present issues such as hallucination. Figure 13 provides a clear example: in this video, Harry walks downstairs, yet there is no evidence to conclude that Harry is engaged in contemplation or reflection. However, the GPT-4o summary confidently suggests this, introducing a potential misinterpretation. Such hallucinations can reduce data reliability, misleading model training and potentially causing unstable convergence when this data is used.
References
- Azzarelli etal. [2024]Adrian Azzarelli, Nantheera Anantrasirichai, and DavidR Bull.Reviewing intelligent cinematography: Ai research for camera-based video production.arXiv preprint arXiv:2405.05039, 2024.
- Bai etal. [2024]Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and MikeZheng Shou.Hallucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930, 2024.
- Bain etal. [2021]Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman.Frozen in time: A joint video and image encoder for end-to-end retrieval.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
- Blattmann etal. [2023a]Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, etal.Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023a.
- Blattmann etal. [2023b]Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, SeungWook Kim, Sanja Fidler, and Karsten Kreis.Align your latents: High-resolution video synthesis with latent diffusion models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023b.
- Brooks etal. [2024]Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh.Video generation models as world simulators.2024.
- CabaHeilbron etal. [2015]Fabian CabaHeilbron, Victor Escorcia, Bernard Ghanem, and Juan CarlosNiebles.Activitynet: A large-scale video benchmark for human activity understanding.In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
- Caron etal. [2021]Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin.Emerging properties in self-supervised vision transformers.In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
- Casanova etal. [2022]Edresson Casanova, Julian Weber, ChristopherD Shulby, ArnaldoCandido Junior, Eren Gölge, and MoacirA Ponti.Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone.In International Conference on Machine Learning, pages 2709–2720. PMLR, 2022.
- Casanova etal. [2024]Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, etal.Xtts: a massively multilingual zero-shot text-to-speech model.arXiv preprint arXiv:2406.04904, 2024.
- Chen etal. [2023]Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, etal.Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023.
- Chen etal. [2024a]Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, ByungEun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, etal.Panda-70m: Captioning 70m videos with multiple cross-modality teachers.arXiv preprint arXiv:2402.19479, 2024a.
- Chen etal. [2024b]Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen.F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching.arXiv preprint arXiv:2410.06885, 2024b.
- Chen etal. [2022]Zhengyang Chen, Sanyuan Chen, Yu Wu, Yao Qian, Chengyi Wang, Shujie Liu, Yanmin Qian, and Michael Zeng.Large-scale self-supervised speech representation learning for automatic speaker verification.In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6147–6151. IEEE, 2022.
- Cui etal. [2024]Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, and Jingdong Wang.Hallo2: Long-duration and high-resolution audio-driven portrait image animation.arXiv preprint arXiv:2410.07718, 2024.
- Esser etal. [2023]Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis.Structure and content-guided video synthesis with diffusion models.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
- Gu etal. [2024]Yuchao Gu, Xintao Wang, JayZhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, etal.Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models.Advances in Neural Information Processing Systems, 36, 2024.
- Han etal. [2023a]Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, and Andrew Zisserman.Autoad: Movie description in context.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18930–18940, 2023a.
- Han etal. [2023b]Tengda Han, Max Bain, Arsha Nagrani, Gul Varol, Weidi Xie, and Andrew Zisserman.Autoad ii: The sequel-who, when, and what in movie audio description.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13645–13655, 2023b.
- Han etal. [2024]Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, and Andrew Zisserman.Autoad iii: The prequel-back to the pixels.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18164–18174, 2024.
- Ho etal. [2022]Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and DavidJ Fleet.Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022.
- Hong etal. [2022]Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang.Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022.
- Huang etal. [2020]Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin.Movienet: A holistic dataset for movie understanding.In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 709–727. Springer, 2020.
- Huang etal. [2023]Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, etal.Vbench: Comprehensive benchmark suite for video generative models.arXiv preprint arXiv:2311.17982, 2023.
- Huang etal. [2024]Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, etal.Vbench: Comprehensive benchmark suite for video generative models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
- Ju etal. [2024]Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan.Miradata: A large-scale video dataset with long durations and structured captions.arXiv preprint arXiv:2407.06358, 2024.
- Kondratyuk etal. [2023a]Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, etal.Videopoet: A large language model for zero-shot video generation.arXiv preprint arXiv:2312.14125, 2023a.
- Kondratyuk etal. [2023b]Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, etal.Videopoet: A large language model for zero-shot video generation.arXiv preprint arXiv:2312.14125, 2023b.
- [29]Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, XinEric Wang, WilliamYang Wang, etal.Value: A multi-task benchmark for video-and-language understanding evaluation.In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
- Li etal. [2023]Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, and Ming-Ming Cheng.Amt: All-pairs multi-field transforms for efficient frame interpolation.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9801–9810, 2023.
- Liao etal. [2023]Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, and Liangyin Chen.A light weight model for active speaker detection.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22932–22941, 2023.
- Lin etal. [2024]KevinQinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, and MikeZheng Shou.Learning video context as interleaved multimodal sequences.arXiv preprint arXiv:2407.21757, 2024.
- Liu etal. [2024]Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yanfeng Wang, and Weidi Xie.Intelligent grimm-open-ended visual storytelling via latent diffusion models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6190–6200, 2024.
- Liu etal. [2023]Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, etal.Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023.
- Ma etal. [2024]Ze Ma, Daquan Zhou, Chun-Hsiao Yeh, Xue-She Wang, Xiuyu Li, Huanrui Yang, Zhen Dong, Kurt Keutzer, and Jiashi Feng.Magic-me: Identity-specific video customized diffusion.arXiv preprint arXiv:2402.09368, 2024.
- Miech etal. [2019]Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic.Howto100m: Learning a text-video embedding by watching hundred million narrated video clips.In Proc. IEEE Int. Conf. Comp. Vis., 2019.
- Nan etal. [2024]Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai.Openvid-1m: A large-scale high-quality dataset for text-to-video generation.arXiv preprint arXiv:2407.02371, 2024.
- Plaquet and Bredin [2023]Alexis Plaquet and Hervé Bredin.Powerset multi-class cross entropy loss for neural speaker diarization.In Proc. INTERSPEECH 2023, 2023.
- Polyak etal. [2024]Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, etal.Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024.
- Radford etal. [2021]Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etal.Learning transferable visual models from natural language supervision.In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Radford etal. [2023]Alec Radford, JongWook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.Robust speech recognition via large-scale weak supervision.In International conference on machine learning, pages 28492–28518. PMLR, 2023.
- Ristani etal. [2016]Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi.Performance measures and a data set for multi-target, multi-camera tracking.In Workshops of ECCV, pages 17–35, 2016.
- Rohrbach etal. [2015]Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele.A dataset for movie description.In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
- Ryu [2023]Simo Ryu.Low-rank adaptation for fast text-to-image diffusion fine-tuning.Low-rank adaptation for fast text-to-image diffusion fine-tuning, 2023.
- Sanabria etal. [2018]Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze.How2: a large-scale dataset for multimodal language understanding.arXiv preprint arXiv:1811.00347, 2018.
- Seitzer [2020]Maximilian Seitzer.pytorch-fid: FID Score for PyTorch.https://github.com/mseitzer/pytorch-fid, 2020.Version 0.3.0.
- Serengil and Ozpinar [2020]SefikIlkin Serengil and Alper Ozpinar.Lightface: A hybrid deep face recognition framework.In 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), pages 23–27. IEEE, 2020.
- Soldan etal. [2022]Mattia Soldan, Alejandro Pardo, JuanLeón Alcázar, Fabian Caba, Chen Zhao, Silvio Giancola, and Bernard Ghanem.Mad: A scalable dataset for language grounding in videos from movie audio descriptions.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5026–5035, 2022.
- Tan etal. [2025]Shuai Tan, Bin Ji, Mengxiao Bi, and Ye Pan.Edtalk: Efficient disentanglement for emotional talking head synthesis.In European Conference on Computer Vision, pages 398–416. Springer, 2025.
- Teed and Deng [2020]Zachary Teed and Jia Deng.Raft: Recurrent all-pairs field transforms for optical flow.In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.
- Wang etal. [2023a]Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, and Jiaying Liu.Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation.arXiv preprint arXiv:2305.10874, 2023a.
- Wang etal. [2023b]Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, and Chunhua Shen.Autostory: Generating diverse storytelling images with minimal human effort.arXiv preprint arXiv:2311.11243, 2023b.
- Wang etal. [2019]Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and WilliamYang Wang.Vatex: A large-scale, high-quality multilingual dataset for video-and-language research.In Proc. IEEE Int. Conf. Comp. Vis., 2019.
- Wang etal. [2023c]Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, etal.Internvid: A large-scale video-text dataset for multimodal understanding and generation.arXiv preprint arXiv:2307.06942, 2023c.
- Wei etal. [2024]Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, and Hongming Shan.Dreamvideo: Composing your dream videos with customized subject and motion.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6537–6549, 2024.
- Wu etal. [2021]Weijia Wu, Debing Zhang, Yuanqiang Cai, Sibo Wang, Sibo Wang, Jiahong Li, Zhuang Li, Yejun Tang, and Hong Zhou.A bilingual, openworld video text dataset and end-to-end video text spotter with transformer.In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
- Wu etal. [2023]Weijia Wu, Zhuang Li, Yefei He, MikeZheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang.Paragraph-to-image generation with information-enriched diffusion model.arXiv preprint arXiv:2311.14284, 2023.
- Wu etal. [2024]Weijia Wu, Yuanqiang Cai, Chunhua Shen, Debing Zhang, Ying Fu, Hong Zhou, and Ping Luo.End-to-end video text spotting with transformer.International Journal of Computer Vision, 132(9):4019–4035, 2024.
- Wu etal. [2025a]Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, DavidJunhao Zhang, MikeZheng Shou, Yan Li, Tingting Gao, and Di Zhang.Draganything: Motion control for anything using entity representation.In European Conference on Computer Vision, pages 331–348. Springer, 2025a.
- Wu etal. [2025b]Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Hong Zhou, MikeZheng Shou, and Xiang Bai.A large cross-modal video retrieval dataset with reading comprehension.Pattern Recognition, 157:110818, 2025b.
- Xu etal. [2016]Jun Xu, Tao Mei, Ting Yao, and Yong Rui.Msr-vtt: A large video description dataset for bridging video and language.In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
- Xue etal. [2022]Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo.Advancing high-resolution video-language representation with large-scale video transcriptions.In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2022.
- Yan etal. [2021]Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas.Videogpt: Video generation using vq-vae and transformers.arXiv preprint arXiv:2104.10157, 2021.
- Yang etal. [2024]Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, etal.Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024.
- Zhang etal. [2024]Longhao Zhang, Shuang Liang, Zhipeng Ge, and Tianshu Hu.Personatalk: Bring attention to your persona in visual dubbing.arXiv preprint arXiv:2409.05379, 2024.
- Zhang etal. [2017]Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and StanZ Li.S3fd: Single shot scale-invariant face detector.In Proceedings of the IEEE international conference on computer vision, pages 192–201, 2017.
- Zhang etal. [2023a]Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou.I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models.arXiv preprint arXiv:2311.04145, 2023a.
- Zhang etal. [2023b]Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, etal.Speak foreign languages with your own voice: Cross-lingual neural codec language modeling.arXiv preprint arXiv:2303.03926, 2023b.
- Zhao etal. [2024]Canyu Zhao, Mingyu Liu, Wen Wang, Jianlong Yuan, Hao Chen, Bo Zhang, and Chunhua Shen.Moviedreamer: Hierarchical generation for coherent long visual sequence.arXiv preprint arXiv:2407.16655, 2024.
- Zhao etal. [2025]Rui Zhao, Yuchao Gu, JayZhangjie Wu, DavidJunhao Zhang, Jia-Wei Liu, Weijia Wu, Jussi Keppo, and MikeZheng Shou.Motiondirector: Motion customization of text-to-video diffusion models.In European Conference on Computer Vision, pages 273–290. Springer, 2025.
- Zheng etal. [2024]Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You.Open-sora: Democratizing efficient video production for all.https://github.com/hpcaitech/Open-Sora, 2024.
- Zhengwentai [2023]SUN Zhengwentai.clip-score: CLIP Score for PyTorch.https://github.com/taited/clip-score, 2023.Version 0.1.1.
- Zhou etal. [2018]Luowei Zhou, Chenliang Xu, and Jason Corso.Towards automatic learning of procedures from web instructional videos.In Proc. AAAI Conf. Artificial Intell., 2018.
- Zhou etal. [2024]Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou.Storydiffusion: Consistent self-attention for long-range image and video generation.arXiv preprint arXiv:2405.01434, 2024.