Evaluation results of Gemini-2.5-Pro

Thanks for the interesting work!

I tested ``gemini-2.5-pro-preview-05-06`` on the released Video-Holmes-test benchmark and got an overall average score of **62.3**, which seems to differ from your reported Gemini-2.5-Pro result (**51.3**). I just sent the **raw video (with audio)** to Gemini's API and used your provided prompt:
```
Based on the given video, reason and answer the single-choice question. Provide your reasoning between the <think> and </think> tags, and then give your final answer between the <answer> and </answer> tags. The question is: {question}. The options are: {options}. Your answer:
```
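For reference, here is a minimal sketch of how the prompt above can be filled in and the `<answer>` tag scored. It is not the exact script behind either number. The helper names (`build_prompt`, `extract_answer`, `accuracy`) are hypothetical, the options are assumed to be labeled with single capital letters, and the Gemini API call itself is left out. Differences in this parsing step (for example, how responses without an `<answer>` tag are counted) are one place where two evaluations could diverge.

```python
import re

# Prompt text copied from the issue; {question} and {options} are filled per sample.
PROMPT_TEMPLATE = (
    "Based on the given video, reason and answer the single-choice question. "
    "Provide your reasoning between the <think> and </think> tags, and then give "
    "your final answer between the <answer> and </answer> tags. "
    "The question is: {question}. The options are: {options}. Your answer:"
)


def build_prompt(question, options):
    """Fill the template with one question and its option string."""
    return PROMPT_TEMPLATE.format(question=question, options=options)


def extract_answer(response):
    """Return the option letter inside <answer>...</answer>, or None.

    Assumption: options are labeled with single capital letters (A, B, C, ...).
    """
    tagged = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if tagged is None:
        return None  # no answer tag: counted as wrong in accuracy() below
    letter = re.match(r"\s*\(?([A-Z])\b", tagged.group(1))
    return letter.group(1) if letter else None


def accuracy(responses, gold):
    """Percentage of responses whose extracted letter matches the gold label."""
    correct = sum(extract_answer(r) == g for r, g in zip(responses, gold))
    return 100.0 * correct / len(gold)
```

Under this sketch, a response with missing or malformed tags is scored as incorrect; a more lenient parser would report a higher score on the same outputs.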

What could be the reasons for the differences in test results?
