CinePile Leaderboard

CinePile is a long video understanding dataset & benchmark. The leaderboard presents the evaluation results of various VLMs on the CinePile benchmark.

Currently, CinePile Leaderboard covers 24 different VLMs.

This leaderboard was last updated: 20th October 2024.

Model	Params (B)	Average Accuracy	Average Rank
Gemini 1.5 Flash-001	25.5	60.12	10

Model

Params (B)

Average Accuracy

Average Rank

Gemini 1.5 Flash-001

25.5

60.12

Model	Params (B)	Average Accuracy	Average Rank
Gemini 1.5 Pro-001	-	60.12	1
Gemini 1.5 Flash-001	-	58.75	2
GPT-40	-	56.06	3
GPT-4 Vision	-	55.35	4
LLaVA-OV	7	49.34	5
LLaVA-OV Chat	7	49.28	6
MiniCPM-V 2.6	8	46.91	7
Claude 3 Opus	-	45.6	8
VideoLLaMA2	7	44.57	9
InternVL2	26	43.86	10
LongVA DPO	7	42.78	11
Intern VL-V1.5	25.5	41.69	12
LongVA	7	41.04	13
InternVL2	4	39.89	14
mPLUG-Owl3	8	38.27	15
LLaVA-OV	0.5	33.82	16
InternVL2	8	32.28	17
InternVL2	2	30.34	18
VideoChat2	7	29.27	19
Video LLaVa	7	25.72	20
CogVLM2	19	17.16	21
InternVL2	1	15.97	22
Video-ChatGPT	7	15.08	23
mPLUG-Owl	7.2	13.93	24

Model

Params (B)

Average Accuracy

Average Rank

Gemini 1.5 Pro-001

60.12

Gemini 1.5 Flash-001

58.75

GPT-40

56.06

GPT-4 Vision

55.35

LLaVA-OV

49.34

LLaVA-OV Chat

49.28

MiniCPM-V 2.6

46.91

Claude 3 Opus

45.6

VideoLLaMA2

44.57

InternVL2

43.86

LongVA DPO

42.78

Intern VL-V1.5

25.5

41.69

LongVA

41.04

InternVL2

39.89

mPLUG-Owl3

38.27

LLaVA-OV

0.5

33.82

InternVL2

32.28

InternVL2

30.34

VideoChat2

29.27

Video LLaVa

25.72

CogVLM2

17.16

InternVL2

15.97

Video-ChatGPT

15.08

mPLUG-Owl

7.2

13.93

CinePile Data Construction

In May 2024, we launched CinePile, a long video QA dataset with about 300,000 training samples and 5,000 test samples. It's designed for question diversity, covering areas like temporal understanding, plot analysis, character dynamics, setting, and themes, pushing models to tackle diverse aspects of video comprehension. CinePile also focuses on question difficulty—humans outperform the best commercial vision models by 25% and open-source ones by 65%. Check out a sample scene and Q&A pairs from CinePile in the figure below.

Sample Scene

CinePile stands out from earlier datasets by overcoming issues like small scale or focusing too much on simple perceptual questions (e.g., "What color is the car?"). Instead, it uses raw video from YouTube clips and detailed audio descriptions from platforms like AudioVault, which are designed for visually impaired audiences. These descriptions offer rich context beyond basic visuals, helping us create more complex questions. CinePile includes about 9,400 movie clips from various genres and eras, with both dialogue and visual descriptions (from audio transcriptions), which we call "scene-text-annotation.

Adversarial Refinement Pipeline

The large size of CinePile is enabled by our novel pipeline (visualized in Figure 2) for automated question generation and verification using large language models. To automate question creation, we first built question templates by leveraging datasets like MovieQA and TVQA. We clustered the questions in the embedding space of a textual similarity model (WhereIsAI/UAE-Large-V1) and then prompted GPT-4 with 10 random examples from each cluster to generate a question template and a prototypical question for each. Since templates aren’t always relevant to every movie clip, we use Gemini to pick the best ones for each scene. Next, we feed a language model the scene’s text, selected template names (e.g., "Physical Possession"), sample questions, and a system prompt to create scene-specific questions. A well-designed prompt helps the model focus on the whole scene, generating deeper questions and avoiding superficial ones. We found that providing prototypical examples and including timestamps for the different dialogues and visual descriptions prevents GPT-4 from hallucinating and leads to more plausible multiple-choice question (MCQ) distractors. Additionally, asking the model to provide a rationale for its answers improves the quality of the questions. Using this approach, we generate about 32 questions per video.

While our process typically generates well-formed, answerable questions, some turn out to be trivial or based on basic concepts that don’t require watching the clip. To fix this, we used several large language models (LLMs) to filter or flag these issues.

Degeneracy: A question is "degenerate" if its answer is obvious from the question itself, like "What is the color of the pink house?" These made up a small portion of our dataset. Since manual review wasn’t feasible at our scale, we used three LLMs—Gemini, GPT-3.5, and Phi-1.5—to automate this. If all three got it right without context, the question was likely degenerate and excluded from the evaluation set.
Adversarial Refinement: The goal is to tweak the questions or answers so a Deaf-Blind LLM can't easily pick the right answer. To achieve this, we use the Deaf-Blind LLM not only to provide an answer but also to generate a rationale explaining why it selected that answer based solely on the question. These rationales help explicitly verbalize the implicit cues or biases embedded in the question. Once we have the rationale, we input it into our question-generation model, instructing it to modify the question and/or answer choices to remove any implicit clues. This loop repeats up to five times per question until the Deaf-Blind LLM's performance drops to random chance.
Vision Reliance: Some MCQs could be answered just from dialogue, without relying on any visual information. Using the Gemini model, we checked whether it could answer using only dialogue. If it did, we scored the question as 0 for visual reliance, otherwise 1.
Hardness: To measure hardness, we checked if a model could answer the question even with full context (i.e., both visual descriptions and subtitles).

CinePile Leaderboard

CinePile is a long video understanding dataset & benchmark. The leaderboard presents the evaluation results of various VLMs on the CinePile benchmark.

Currently, CinePile Leaderboard covers 24 different VLMs.

Main Evaluation Results

CinePile Data Construction