CinePile Leaderboard
CinePile is a long video understanding dataset & benchmark. The leaderboard presents the evaluation results of various VLMs on the CinePile benchmark.
Currently, CinePile Leaderboard covers 24 different VLMs.
This leaderboard was last updated: 20th October 2024.
Main Evaluation Results
- Average Accuracy: The average accuracy on all question categories (normalized to 0 - 100, the higher the better).
- Average Rank: The average rank based on the average accuracy (the lower the better).
Model | Params (B) | Average Accuracy | Average Rank |
---|---|---|---|
Gemini 1.5 Flash-001 | 25.5 | 60.12 | 10 |
CinePile Data Construction
In May 2024, we launched CinePile, a long video QA dataset with about 300,000 training samples and 5,000 test samples. It's designed for question diversity, covering areas like temporal understanding, plot analysis, character dynamics, setting, and themes, pushing models to tackle diverse aspects of video comprehension. CinePile also focuses on question difficulty—humans outperform the best commercial vision models by 25% and open-source ones by 65%. Check out a sample scene and Q&A pairs from CinePile in the figure below.
CinePile stands out from earlier datasets by overcoming issues like small scale or focusing too much on simple perceptual questions (e.g., "What color is the car?"). Instead, it uses raw video from YouTube clips and detailed audio descriptions from platforms like AudioVault, which are designed for visually impaired audiences. These descriptions offer rich context beyond basic visuals, helping us create more complex questions. CinePile includes about 9,400 movie clips from various genres and eras, with both dialogue and visual descriptions (from audio transcriptions), which we call "scene-text-annotation.
The large size of CinePile is enabled by our novel pipeline (visualized in Figure 2) for automated question generation and verification using large language models. To automate question creation, we first built question templates by leveraging datasets like MovieQA and TVQA. We clustered the questions in the embedding space of a textual similarity model (WhereIsAI/UAE-Large-V1) and then prompted GPT-4 with 10 random examples from each cluster to generate a question template and a prototypical question for each. Since templates aren’t always relevant to every movie clip, we use Gemini to pick the best ones for each scene. Next, we feed a language model the scene’s text, selected template names (e.g., "Physical Possession"), sample questions, and a system prompt to create scene-specific questions. A well-designed prompt helps the model focus on the whole scene, generating deeper questions and avoiding superficial ones. We found that providing prototypical examples and including timestamps for the different dialogues and visual descriptions prevents GPT-4 from hallucinating and leads to more plausible multiple-choice question (MCQ) distractors. Additionally, asking the model to provide a rationale for its answers improves the quality of the questions. Using this approach, we generate about 32 questions per video.
While our process typically generates well-formed, answerable questions, some turn out to be trivial or based on basic concepts that don’t require watching the clip. To fix this, we used several large language models (LLMs) to filter or flag these issues.
-
Degeneracy: A question is "degenerate" if its answer is obvious from the question itself, like "What is the color of the pink house?" These made up a small portion of our dataset. Since manual review wasn’t feasible at our scale, we used three LLMs—Gemini, GPT-3.5, and Phi-1.5—to automate this. If all three got it right without context, the question was likely degenerate and excluded from the evaluation set.
-
Adversarial Refinement: The goal is to tweak the questions or answers so a Deaf-Blind LLM can't easily pick the right answer. To achieve this, we use the Deaf-Blind LLM not only to provide an answer but also to generate a rationale explaining why it selected that answer based solely on the question. These rationales help explicitly verbalize the implicit cues or biases embedded in the question. Once we have the rationale, we input it into our question-generation model, instructing it to modify the question and/or answer choices to remove any implicit clues. This loop repeats up to five times per question until the Deaf-Blind LLM's performance drops to random chance.
-
Vision Reliance: Some MCQs could be answered just from dialogue, without relying on any visual information. Using the Gemini model, we checked whether it could answer using only dialogue. If it did, we scored the question as 0 for visual reliance, otherwise 1.
-
Hardness: To measure hardness, we checked if a model could answer the question even with full context (i.e., both visual descriptions and subtitles).