- [2025.06.03] 🤗 We release UniWorld-V1, a unified framework for understanding, generation, and editing. All data, models, training code, and evaluation code are open-sourced. Check our report for more details, and watch 👀 this repository for the latest updates.
💡 We also have other image editing projects that may interest you ✨.
ImgEdit: A Unified Image Editing Dataset and Benchmark
Yang Ye, Xianyi He, et al.
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Yuwei Niu, Munan Ning, et al.
Open-Sora Plan: Open-Source Large Video Generation Model
Bin Lin, Yunyang Ge, Xinhua Cheng, et al.
UniWorld-V1 shows excellent performance across 20+ tasks.
- We fully open-source the models, data, and training and evaluation code to facilitate rapid community exploration of unified architectures.
- We curate 10+ CV downstream tasks, including canny, depth, sketch, MLSD, segmentation, and more.
- We annotate 286K long-caption samples using Qwen2-VL-72B. We use GPT-4o to filter ImgEdit, resulting in 724K high-quality editing samples (all with a short edge ≥ 1024 pixels). Additionally, we organize and filter existing open-source datasets. The details can be found here.
- Unlike prior approaches that use VAE-encoded reference images for low-level control, we advocate using contrastive visual encoders as control signals for reference images.
- For such encoders, we observe that as resolution increases, global features approach saturation and model capacity shifts toward preserving fine details, which is crucial for maintaining fidelity in non-edited regions.
- We find that multimodal features encoded by VLMs can interpret instructions while retaining image priors. Due to causal attention, the `<instruction><image>` ordering is particularly important (a brief sketch follows below).
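Below is a minimal illustration of this ordering. It is a sketch only, using the generic Qwen2.5-VL chat-template API from `transformers` rather than the exact UniWorld-V1 code path; the instruction text and image path are placeholders.

```python
# Sketch: with causal attention, putting the instruction *before* the image
# lets every image token attend to the instruction, so the VLM's visual
# features become instruction-aware. Paths and text below are placeholders.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        # <instruction> first ...
        {"type": "text", "text": "Replace the red car with a blue bicycle."},
        # ... then <image>, so image tokens can attend back to the instruction.
        {"type": "image", "image": "path/to/reference.jpg"},
    ],
}]

# Render the chat template; downstream processing feeds this to the VLM.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)
```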
1. Set up environment
git clone https://github.com/PKU-YuanGroup/UniWorld-V1
cd UniWorld-V1
conda create -n univa python=3.10 -y
conda activate univa
pip install -r requirements.txt
pip install flash_attn --no-build-isolation
2. Download pretrained checkpoints
huggingface-cli download --resume-download LanguageBind/UniWorld-V1 --local-dir ${MODEL_PATH}
huggingface-cli download --resume-download black-forest-labs/FLUX.1-dev --local-dir ${FLUX_PATH}
huggingface-cli download --resume-download google/siglip2-so400m-patch16-512 --local-dir ${SIGLIP_PATH}
3. Run with the CLI
MODEL_PATH="path/to/model"
FLUX_PATH="path/to/flux"
SIGLIP_PATH="path/to/siglip"
CUDA_VISIBLE_DEVICES=0 python -m univa.serve.cli \
--model_path ${MODEL_PATH} \
--flux_path ${FLUX_PATH} \
--siglip_path ${SIGLIP_PATH}
4. Run with Gradio. We highly recommend trying out our web demo with the following command.
python app.py --model_path ${MODEL_PATH} --flux_path ${FLUX_PATH} --siglip_path ${SIGLIP_PATH}
For a 24 GB VRAM GPU on Linux, use NF4 quantization. Many thanks to @gluttony-10 for the contribution! Then run the following command:
python app.py --model_path ${MODEL_PATH} --flux_path ${FLUX_PATH} --siglip_path ${SIGLIP_PATH} --nf4
Alternatively, download wikeeyang/UniWorld-V1-NF4 to ${MODEL_PATH} and diffusers/FLUX.1-dev-bnb-4bit to ${FLUX_PATH} instead.
For a 24 GB VRAM GPU on Windows, use NF4 quantization with offloading; this uses only about 20 GB of VRAM. Then run the following command:
python app.py --model_path ${MODEL_PATH} --flux_path ${FLUX_PATH} --siglip_path ${SIGLIP_PATH} --nf4 --offload
To use the Chinese-language interface, run with `--zh`.
5. Run with ComfyUI
Many thanks to @judian17 for the contribution! ComfyUI-UniWorld-jd17 is a ComfyUI implementation provided by the open-source community. Note that it requires transformers version 4.50.0.
Download the data from LanguageBind/UniWorld-V1. The dataset consists of two parts: source images and annotation JSON files.
Prepare a `data.txt` file in the following format:
- The first column is the root path to the images.
- The second column is the corresponding annotation JSON file.
- The third column indicates whether to enable the region-weighting strategy. We recommend setting it to `true` for editing data and `false` for others.
data/BLIP3o-60k,json/blip3o_t2i_58859.json,false
data/coco2017_caption_canny-236k,coco2017_canny_236574.json,false
data/imgedit,json/imgedit/laion_add_part0_edit.json,true
For reference, we have prepared a `data.txt` file for ImgEdit:
`data.txt` for ImgEdit
data/imgedit/action/action,json/imgedit/pandam_action_edit.json,true
data/imgedit/action/action_part2,json/imgedit/pandam2_action_edit.json,true
data/imgedit/action/action_part3,json/imgedit/pandam3_action_edit.json,true
data/imgedit/action/action_part4,json/imgedit/pandam4_action_edit.json,true
data/imgedit/add/add_part0,json/imgedit/laion_add_part0_edit.json,true
data/imgedit/add/add_part1,json/imgedit/laion_add_part1_edit.json,true
data/imgedit/add/add_part4,json/imgedit/results_add_laion_part4_edit.json,true
data/imgedit/add/add_part5,json/imgedit/results_add_laion_part5_edit.json,true
data/imgedit/adjust/adjust_part0,json/imgedit/results_adjust_canny_laion_part0_edit.json,true
data/imgedit/adjust/adjust_part2,json/imgedit/results_adjust_canny_laion_part2_edit.json,true
data/imgedit/adjust/adjust_part3,json/imgedit/results_adjust_canny_laion_part3_edit.json,true
data/imgedit/adjust/adjust_part4,json/imgedit/laion_adjust_canny_part4_edit.json,true
data/imgedit/background/background_part0,json/imgedit/results_background_laion_part0_edit.json,true
data/imgedit/background/background_part2,json/imgedit/results_background_laion_part2_edit.json,true
data/imgedit/background/background_part3,json/imgedit/laion_background_part3_edit.json,true
data/imgedit/background/background_part5,json/imgedit/laion_background_part5_edit.json,true
data/imgedit/background/background_part7,json/imgedit/laion_background_part7_edit.json,true
data/imgedit/compose/compose_part0,json/imgedit/results_compose_part0_edit.json,false
data/imgedit/compose/compose_part2,json/imgedit/results_compose_part2_edit.json,false
data/imgedit/compose/compose_part6,json/imgedit/results_compose_part6_fix_edit.json,false
data/imgedit/refine_replace/refine_replace_part1,json/imgedit/results_extract_ref_part1_refimg_edit.json,true
data/imgedit/remove/remove_part0,json/imgedit/laion_remove_part0_edit.json,true
data/imgedit/remove/remove_part1,json/imgedit/results_remove_laion_part1_edit.json,true
data/imgedit/remove/remove_part4,json/imgedit/results_remove_laion_part4_edit.json,true
data/imgedit/remove/remove_part5,json/imgedit/results_remove_laion_part5_edit.json,true
data/imgedit/replace/replace_part0,json/imgedit/laion_replace_part0_edit.json,true
data/imgedit/replace/replace_part1,json/imgedit/laion_replace_part1_edit.json,true
data/imgedit/replace/replace_part4,json/imgedit/results_replace_laion_part4_edit.json,true
data/imgedit/replace/replace_part5,json/imgedit/results_replace_laion_part5_edit.json,true
data/imgedit/transfer/transfer,json/imgedit/results_style_transfer_edit.json,false
data/imgedit/transfer/transfer_part0,json/imgedit/results_style_transfer_part0_cap36472_edit.json,false
We provide a simple verification tool to check whether the paths in your `data.txt` are set correctly.
python univa/serve/check_data.py
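For reference, the kind of check this performs looks roughly like the following. This is an illustrative sketch only, not the actual `check_data.py`; it simply verifies that each image root and annotation JSON listed in `data.txt` exists and that the third column is a boolean.

```python
# Illustrative sketch of validating data.txt entries (not the real script).
import os

def check_data_txt(path="data.txt"):
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            image_root, json_path, region_weight = line.split(",")
            if not os.path.isdir(image_root):
                print(f"line {line_no}: image root not found: {image_root}")
            if not os.path.isfile(json_path):
                print(f"line {line_no}: annotation JSON not found: {json_path}")
            if region_weight.strip().lower() not in {"true", "false"}:
                print(f"line {line_no}: third column should be true/false")

if __name__ == "__main__":
    check_data_txt()
```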
Text-to-Image Generation
- BLIP3o-60k: We add text-to-image instructions to half of the data. [108 GB storage usage.]
- OSP1024-286k: Sourced from internal data of the Open-Sora Plan, with captions generated using Qwen2-VL-72B. Images have an aspect ratio between 3:4 and 4:3, aesthetic score ≥ 6, and a short side ≥ 1024 pixels. [326 GB storage usage.]
Image Editing
- imgedit-724k: Data is filtered using GPT-4o, retaining approximately half. [2.8 TB storage usage.]
- OmniEdit-368k: For image editing data, samples with edited regions smaller than 1/100 were filtered out; images have a short side ≥ 1024 pixels. [204 GB storage usage.]
- SEED-Data-Edit-Part1-Openimages-65k: For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
- SEED-Data-Edit-Part2-3-12k: For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
- PromptfixData-18k: For image restoration data and some editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [9 GB storage usage.]
- StyleBooth-11k: For transfer style data, images have a short side ≥ 1024 pixels. [4 GB storage usage.]
- Ghibli-36k: For transfer style data, images have a short side ≥ 1024 pixels. Warning: This data has not been quality filtered. [170 GB storage usage.]
Extract & Try-on
- viton_hd-23k: Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
- deepfashion-27k: Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
- shop_product-23k: Sourced from internal data of the Open-Sora Plan, focusing on product extraction and virtual try-on, with images having a short side ≥ 1024 pixels. [12 GB storage usage.]
Image Perception
- coco2017_caption_canny-236k: img->canny & canny->img [25 GB storage usage.]
- coco2017_caption_depth-236k: img->depth & depth->img [8 GB storage usage.]
- coco2017_caption_hed-236k: img->hed & hed->img [13 GB storage usage.]
- coco2017_caption_mlsd-236k: img->mlsd & mlsd->img [ GB storage usage.]
- coco2017_caption_normal-236k: img->normal & normal->img [10 GB storage usage.]
- coco2017_caption_openpose-62k: img->pose & pose->img [2 GB storage usage.]
- coco2017_caption_sketch-236k: img->sketch & sketch->img [15 GB storage usage.]
- unsplash_canny-20k: img->canny & canny->img [2 GB storage usage.]
- open_pose-40k: img->pose & pose->img [4 GB storage usage.]
- mscoco-controlnet-canny-less-colors-236k: img->canny & canny->img [13 GB storage usage.]
- coco2017_seg_box-448k: img->detection & img->segmentation (mask), instances with regions smaller than 1/100 were filtered out. We visualise masks on the original image as gt-image. [39 GB storage usage.]
- viton_hd-11k: img->pose [1 GB storage usage.]
- deepfashion-13k: img->pose [1 GB storage usage.]
Download black-forest-labs/FLUX.1-dev to `$FLUX_PATH`.
Download Qwen/Qwen2.5-VL-7B-Instruct to `$QWENVL_PATH`. We also support other sizes of Qwen2.5-VL.
SAVE_PATH="path/to/save/UniWorld-Qwen2.5-VL-7B-Instruct-FLUX.1-dev-fp32"
python scripts/make_univa_qwen2p5vl_weight.py \
--origin_flux_ckpt_path $FLUX_PATH \
--origin_qwenvl_ckpt_path $QWENVL_PATH \
--save_path ${SAVE_PATH}
You need to set `pretrained_lvlm_name_or_path` to `${SAVE_PATH}` in `flux_qwen2p5vl_7b_vlm_stage1_512.yaml`.
We recommend using `optimizer: prodigy` with `learning_rate: 1.0` in `flux_qwen2p5vl_7b_vlm_stage1_512.yaml` (see the short sketch after the stage-1 command below).
Training with 512×512 images (batch size 1) consumes about 74 GB on one node (8 GPUs).
Setting `ema_pretrained_lvlm_name_or_path: null` can save memory if you want to train at a higher resolution (e.g., 1024×1024) or with a larger batch size.
# stage 1
# if use prodigy, pip install prodigy
bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage1_512.sh
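As a side note on why `learning_rate: 1.0` is reasonable for Prodigy: the optimizer estimates its step size adaptively, so the nominal learning rate is typically left at 1.0. Below is a standalone usage sketch, assuming the `prodigyopt` package; the actual training scripts wire the optimizer up through the YAML config instead.

```python
# Minimal standalone sketch of the Prodigy optimizer (pip install prodigyopt).
import torch
from prodigyopt import Prodigy

model = torch.nn.Linear(16, 16)
# Prodigy adapts the effective step size itself, so lr stays at 1.0.
optimizer = Prodigy(model.parameters(), lr=1.0, weight_decay=0.01)

for _ in range(10):
    x = torch.randn(4, 16)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```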
Download flux-redux-siglipv2-512.bin and set `pretrained_siglip_mlp_path` in `flux_qwen2p5vl_7b_vlm_stage2_512.yaml` to its path. The weights are sourced from ostris/Flex.1-alpha-Redux; we simply re-organize them.
Download google/siglip2-so400m-patch16-512 and set `pretrained_siglip_name_or_path` in `flux_qwen2p5vl_7b_vlm_stage2_512.yaml` to its path.
You also need to specify `pretrained_mlp2_path`, which is trained in stage 1.
Training with 512×512 images (batch size 1) consumes about 78 GB on one node (8 GPUs).
Setting `ema_pretrained_lvlm_name_or_path: null` can save memory if you want to train at a higher resolution (e.g., 1024×1024) or with a larger batch size. Using more nodes can also save memory, because we use ZeRO-2 for the main model in stage 2.
# stage 2
bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage2_512.sh
GenEval
cd univa/eval/geneval
# follow the instruction in univa/eval/geneval/README.md
WISE
cd univa/eval/wise
# follow the instruction in univa/eval/wise/README.md
GenAI-Bench
cd univa/eval/genai
# follow the instruction in univa/eval/genai/README.md
DPG-Bench
cd univa/eval/dpgbench
# follow the instruction in univa/eval/dpgbench/README.md
ImgEdit
We have updated the results here to use `gpt-4.1` instead of `gpt-4o-2024-08-06`. See here for more details.
cd univa/eval/imgedit
# follow the instruction in univa/eval/imgedit/README.md
GEdit
We discuss the scores related to GEdit-Bench here.
cd univa/eval/gdit
# follow the instruction in univa/eval/gdit/README.md
We greatly appreciate your contributions to the UniWorld-V1 open-source community and your help in making it even better!
For more details, please refer to the Contribution Guidelines.
- ImgEdit: ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs.
- Open-Sora Plan: An open‑source text-to-image/video foundation model, which provides a lot of caption data.
- SEED-Data-Edit: A hybrid dataset for instruction-guided image editing.
- Qwen2.5-VL: The new flagship vision-language model of Qwen.
- FLUX.1-Redux-dev: Given an input image, FLUX.1 Redux can reproduce the image with slight variation, allowing a given image to be refined.
- SigLIP 2: New multilingual vision-language encoders.
- Step1X-Edit: A state-of-the-art image editing model.
- BLIP3-o: A unified multimodal model that combines the reasoning and instruction following strength of autoregressive models with the generative power of diffusion models.
- BAGEL: An open‑source multimodal foundation model with 7B active parameters (14B total) trained on large‑scale interleaved multimodal data.
- Visual Encoder: #5 #15 #18
- Data Setup: #17
- Editing Evaluation: #6 #16
- Training Process and Analysis: #3 #9 #14 #28
- See LICENSE for details. The FLUX weights fall under the FLUX.1 [dev] Non-Commercial License.
@article{lin2025uniworld,
title={UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation},
author={Lin, Bin and Li, Zongjian and Cheng, Xinhua and Niu, Yuwei and Ye, Yang and He, Xianyi and Yuan, Shenghai and Yu, Wangbo and Wang, Shaodong and Ge, Yunyang and others},
journal={arXiv preprint arXiv:2506.03147},
year={2025}
}
@article{ye2025imgedit,
title={ImgEdit: A Unified Image Editing Dataset and Benchmark},
author={Ye, Yang and He, Xianyi and Li, Zongjian and Lin, Bin and Yuan, Shenghai and Yan, Zhiyuan and Hou, Bohan and Yuan, Li},
journal={arXiv preprint arXiv:2505.20275},
year={2025}
}
@article{niu2025wise,
title={WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation},
author={Niu, Yuwei and Ning, Munan and Zheng, Mengren and Lin, Bin and Jin, Peng and Liao, Jiaqi and Ning, Kunpeng and Zhu, Bin and Yuan, Li},
journal={arXiv preprint arXiv:2503.07265},
year={2025}
}
@article{yan2025gpt,
title={GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT-4o in Image Generation},
author={Yan, Zhiyuan and Ye, Junyan and Li, Weijia and Huang, Zilong and Yuan, Shenghai and He, Xiangyang and Lin, Kaiqing and He, Jun and He, Conghui and Yuan, Li},
journal={arXiv preprint arXiv:2504.02782},
year={2025}
}
@article{lin2024open,
title={Open-Sora Plan: Open-Source Large Video Generation Model},
author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
journal={arXiv preprint arXiv:2412.00131},
year={2024}
}