Create a mini-dataset to compare against private VLM scores

Given the massive investment in AI models, there is some question about how an open-source model competes against things like the agentic object detector workflow from LandingAI, Gemini, and all the others. I think we could have a tiny, cherry-picked (difficult) test set to compare against, from both a zero-shot and a with-data perspective. That way it wouldn't be too expensive to run and review.