Recipes for shrinking, optimizing, and customizing cutting-edge vision and multimodal AI models.
NOTE: GitHub has refused to render notebooks for a long time now, so smol-vision now lives here. I still update this repository, but the notebooks are inconvenient to read on GitHub.
Latest examples 👇🏻
- Fine-tune Gemma-3n for all modalities (audio-text-image)
- Any-to-Any (Video) RAG with OmniEmbed and Qwen
- Fine-tune ColPali for Multimodal RAG
Note: The script and notebook have been updated to fix a few issues related to QLoRA!
| | Notebook | Description |
|---|---|---|
Quantization/ONNX | Faster and Smaller Zero-shot Object Detection with Optimum | Quantize the state-of-the-art zero-shot object detection model OWLv2 using Optimum ONNXRuntime tools. |
VLM Fine-tuning | Fine-tune PaliGemma | Fine-tune state-of-the-art vision language backbone PaliGemma using transformers. |
Intro to Optimum/ORT | Optimizing DETR with 🤗 Optimum | A soft introduction to exporting vision models to ONNX and quantizing them. |
Model Shrinking | Knowledge Distillation for Computer Vision | Knowledge distillation for image classification. |
Quantization | Fit in vision models using Quanto | Fit vision models onto smaller hardware using Quanto. |
Speed-up | Faster foundation models with torch.compile | Improving latency for foundation models using torch.compile |
VLM Fine-tuning | Fine-tune Florence-2 | Fine-tune Florence-2 on the DocVQA dataset |
VLM Fine-tuning | QLoRA/Fine-tune IDEFICS3 or SmolVLM on VQAv2 | QLoRA/Full Fine-tune IDEFICS3 or SmolVLM on the VQAv2 dataset |
VLM Fine-tuning (Script) | QLoRA Fine-tune IDEFICS3 on VQAv2 | QLoRA/Full Fine-tune IDEFICS3 or SmolVLM on the VQAv2 dataset |
Multimodal RAG | Multimodal RAG using ColPali and Qwen2-VL | Learn to retrieve documents and build a RAG pipeline without heavy document processing, using ColPali through Byaldi for retrieval and Qwen2-VL for generation. |
Multimodal Retriever Fine-tuning | Fine-tune ColPali for Multimodal RAG | Learn to apply contrastive fine-tuning on ColPali to customize it for your own multimodal document RAG use case |
Any-to-Any Fine-tuning | Fine-tune Gemma-3n for all modalities (audio-text-image) | Fine-tune the Gemma-3n model to handle any modality: audio, text, and image. |
Any-to-Any RAG | Any-to-Any (Video) RAG with OmniEmbed and Qwen | Do retrieval and generation across modalities (including video) using OmniEmbed and Qwen. |
Speed-up/Memory Optimization | Vision language model serving using TGI (SOON) | Explore speed-ups and memory improvements for vision-language model serving with text-generation inference |
Quantization/Optimum/ORT | All levels of quantization and graph optimizations for Image Segmentation using Optimum (SOON) | End-to-end model optimization using Optimum |
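
For readers new to the "Model Shrinking" entry above, here is a minimal sketch of the idea behind knowledge distillation: a small student model is trained to match the softened output distribution of a larger teacher. This is not code from the notebook. The function names, the temperature value, and the example logits are all assumptions made for illustration, and it uses only the Python standard library so it runs without a deep learning framework.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is a common convention that keeps gradient magnitudes
    comparable across temperatures; treat it as an assumption here.
    """
    log_teacher = log_softmax(teacher_logits, temperature)
    log_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_teacher, log_student)
    )
    return kl * temperature ** 2


# Hypothetical logits for a 3-class image classifier.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # prefers a different class

print("identical:", distillation_loss(teacher, teacher))
print("close:    ", distillation_loss(close_student, teacher))
print("far:      ", distillation_loss(far_student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two disagree. In practice this term is usually combined with an ordinary cross-entropy loss on the ground-truth labels.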