Smol Vision 🐣

Recipes for shrinking, optimizing, and customizing cutting-edge vision and multimodal AI models.

NOTE: GitHub has refused to render notebooks for a long time now, so smol-vision now lives here. I still update this repository, but the notebooks are inconvenient to read on GitHub.

Latest examples 👇🏻

Note: The script and notebook have been updated to fix a few issues related to QLoRA!

| | Notebook | Description |
|---|---|---|
| Quantization/ONNX | Faster and Smaller Zero-shot Object Detection with Optimum | Quantize the state-of-the-art zero-shot object detection model OWLv2 using Optimum ONNXRuntime tools. |
| VLM Fine-tuning | Fine-tune PaliGemma | Fine-tune state-of-the-art vision language backbone PaliGemma using transformers. |
| Intro to Optimum/ORT | Optimizing DETR with 🤗 Optimum | A soft introduction to exporting vision models to ONNX and quantizing them. |
| Model Shrinking | Knowledge Distillation for Computer Vision | Knowledge distillation for image classification. |
| Quantization | Fit in vision models using Quanto | Fit in vision models to smaller hardware using quanto |
| Speed-up | Faster foundation models with torch.compile | Improving latency for foundation models using `torch.compile` |
| VLM Fine-tuning | Fine-tune Florence-2 | Fine-tune Florence-2 on DocVQA dataset |
| VLM Fine-tuning | QLoRA/Fine-tune IDEFICS3 or SmolVLM on VQAv2 | QLoRA/Full Fine-tune IDEFICS3 or SmolVLM on VQAv2 dataset |
| VLM Fine-tuning (Script) | QLoRA Fine-tune IDEFICS3 on VQAv2 | QLoRA/Full Fine-tune IDEFICS3 or SmolVLM on VQAv2 dataset |
| Multimodal RAG | Multimodal RAG using ColPali and Qwen2-VL | Learn to retrieve documents and pipeline to RAG without hefty document processing using ColPali through Byaldi and do the generation with Qwen2-VL |
| Multimodal Retriever Fine-tuning | Fine-tune ColPali for Multimodal RAG | Learn to apply contrastive fine-tuning on ColPali to customize it for your own multimodal document RAG use case |
| Any-to-Any Fine-tuning | Fine-tune Gemma-3n for all modalities (audio-text-image) | Fine-tune Gemma-3n model to handle any modality: audio, text, and image. |
| Any-to-Any RAG | Any-to-Any (Video) RAG with OmniEmbed and Qwen | Do retrieval and generation across modalities (including video) using OmniEmbed and Qwen. |
| Speed-up/Memory Optimization | Vision language model serving using TGI (SOON) | Explore speed-ups and memory improvements for vision-language model serving with text-generation inference |
| Quantization/Optimum/ORT | All levels of quantization and graph optimizations for Image Segmentation using Optimum (SOON) | End-to-end model optimization using Optimum |
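Several of the recipes above rely on quantization to fit models onto smaller hardware. As a rough illustration of the underlying idea only, here is a minimal pure-Python sketch of symmetric per-tensor int8 quantization. It is not how Optimum, ONNX Runtime, or Quanto implement it, and the function names `quantize_int8` and `dequantize_int8` are hypothetical helpers made up for this example.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to the int8 range [-127, 127].

    Hypothetical helper for illustration; real libraries work on tensors
    and often use per-channel scales.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Guard against an all-zero (or empty) input, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Map int8 values back to approximate float values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize_int8(q, scale)
    # Each restored value differs from the original by at most scale / 2.
    print(q, scale)
    print(restored)
```

Each weight is stored as one byte plus a shared float scale, which is where the memory saving comes from; the price is a rounding error of at most half a quantization step per value.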


merveenoyan/smol-vision

Folders and files

Latest commit

History

Repository files navigation

Smol Vision 🐣

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages