Thanks for the interesting work!
I tested gemini-2.5-pro-preview-05-06 on the released Video-Holmes-test benchmark and got an overall average score of 62.3, which seems to differ from your reported Gemini-2.5-Pro result (51.3). I just passed the raw video (with audio) to Gemini's API and used your provided prompt:
Based on the given video, reason and answer the single-choice question. Provide your reasoning between the <think> and </think> tags, and then give your final answer between the <answer> and </answer> tags. The question is: {question}. The options are: {options}. Your answer:
What could be the reasons for the differences in test results?
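One place where scores can diverge is how the template is filled and how the final answer is pulled out of the response. Below is a minimal sketch of that step, not the benchmark's actual evaluation script. The helper names `build_prompt` and `extract_answer` are hypothetical, and it assumes that options arrive as a letter-keyed dict and that the answer is a single option letter.

```python
import re

# Prompt text copied from the issue; {question} and {options} are filled per sample.
PROMPT_TEMPLATE = (
    "Based on the given video, reason and answer the single-choice question. "
    "Provide your reasoning between the <think> and </think> tags, and then give "
    "your final answer between the <answer> and </answer> tags. "
    "The question is: {question}. The options are: {options}. Your answer:"
)


def build_prompt(question, options):
    """Fill the template. Assumption: options is a dict like {"A": "...", "B": "..."}."""
    options_text = " ".join(f"{key}. {text}" for key, text in options.items())
    return PROMPT_TEMPLATE.format(question=question, options=options_text)


def extract_answer(response):
    """Return the option letter inside <answer>...</answer>, or None if it is missing."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        return None
    # Accept "B", "B. text", or "(b)"; reject content that starts with a longer word.
    letter = re.match(r"\(?([A-Za-z])\)?(?![A-Za-z])", match.group(1).strip())
    return letter.group(1).upper() if letter else None
```

Whether a response with no parsable `<answer>` tag is counted as wrong or skipped changes the average, so it is worth comparing that choice between the two runs.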