
Scaling Open Vocabulary Object Detection: OWLv2 fine-tuning #160

Open
@rbavery


Search before asking

  • I have searched the Multimodal Maestro issues and found no similar feature requests.

Description

https://huggingface.co/docs/transformers/en/model_doc/owlv2

Be able to fine-tune OWLv2 for grounded object detection using JSONL annotations referencing 3-channel imagery. N-channel imagery would be extra dope. Ideally with high bit-depth TIFF support, since my imagery comes as .tif files. I see Pillow in the requirements, so high bit-depth TIFF support might not be possible today without more work to change how imagery is loaded.
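To make the request concrete, here is a minimal sketch of what I have in mind. The JSONL schema (`image`/`prefix`/`boxes` keys) is hypothetical, not what maestro currently expects, and `stretch_to_uint8` is an illustrative helper: since Pillow struggles with many 16-bit multi-band GeoTIFFs, the raw array would in practice come from a reader like tifffile or rasterio (an assumption, not a tested path), and then get percentile-stretched down to uint8 RGB before the OWLv2 processor sees it.

```python
import json
import numpy as np

# Hypothetical JSONL record, one image per line with normalized box prompts.
# The schema is illustrative only; maestro's actual format may differ.
record = json.loads(
    '{"image": "scene_001.tif", '
    '"prefix": "find all storage tanks", '
    '"boxes": [[0.41, 0.22, 0.47, 0.29]]}'
)

def stretch_to_uint8(band, low_pct=2, high_pct=98):
    """Percentile-stretch a high bit-depth band (e.g. uint16) to uint8.

    In practice `band` would be loaded with tifffile or rasterio
    (assumption), since Pillow cannot open many 16-bit GeoTIFFs.
    """
    lo, hi = np.percentile(band, [low_pct, high_pct])
    scaled = np.clip((band.astype(np.float64) - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
    return (scaled * 255).astype(np.uint8)

# Stand-in for one 16-bit single-band satellite tile.
band16 = np.random.randint(0, 65535, size=(256, 256), dtype=np.uint16)

# Fake a 3-channel image by repeating the band; N-channel support would
# need changes further down the loading pipeline.
rgb = np.stack([stretch_to_uint8(band16)] * 3, axis=-1)
print(rgb.shape, rgb.dtype, record["image"])
```

The point is that only the loading/rescaling step needs to change: once the imagery is uint8 3-channel, the rest of the fine-tuning pipeline could stay as-is.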

Use case

I've played around with OWLv2 a bit and compared it to GroundingDINO and Qwen 2.5, and it seems to do a better job of producing bounding boxes on hard images with small objects (satellite images), whereas the other models produce nothing. This makes me think it is potentially a better candidate for fine-tuning. But I'm definitely not certain and have more testing to do.

Additional

In the geospatial computer vision domain we are in the very earliest days of applying VLMs to solve actual problems on massive imagery corpora. There have been some cool experiments recently that have inspired me to try fine-tuning VLMs to test their limits on remotely sensed imagery using modest-sized datasets.

Can't commit to a PR right now (but might be able to in the future).

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!

Metadata

Assignees: no one assigned
Labels: enhancement (New feature or request)
