UniWorld-V1


📣 News

  • [2025.06.03] 🤗 We release UniWorld-V1, a unified framework for understanding, generation, and editing. All data, models, training code, and evaluation code are open-sourced. Check our report for more details, and feel free to watch 👀 this repository for the latest updates.


💡 We also have other image editing projects that may interest you ✨.

ImgEdit: A Unified Image Editing Dataset and Benchmark
Yang Ye, Xianyi He, et al.

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Yuwei Niu, Munan Ning, et al.

Open-Sora Plan: Open-Source Large Video Generation Model
Bin Lin, Yunyang Ge, Xinhua Cheng, et al.

😍 Gallery

UniWorld-V1 shows excellent performance in 20+ tasks.


😮 Highlights

1. All Resources Fully Open-Sourced

  • We fully open-source the models, data, training code, and evaluation code to facilitate rapid community exploration of unified architectures.

  • We curate 10+ CV downstream tasks, including canny edge, depth, sketch, MLSD, segmentation, and more.

  • We annotate 286K long-caption samples using Qwen2-VL-72B. We use GPT-4o to filter ImgEdit, resulting in 724K high-quality editing samples (all with a short edge ≥ 1024 pixels). Additionally, we organize and filter existing open-sourced datasets. The details can be found here.

2. Contrastive Semantic Encoders as Reference Control Signals

  • Unlike prior approaches that use VAE-encoded reference images for low-level control, we advocate using contrastive visual encoders as control signals for reference images.

  • For such encoders, we observe that as resolution increases, global features approach saturation and model capacity shifts toward preserving fine details, which is crucial for maintaining fidelity in non-edited regions.

3. Image Priors via VLM Encoding Without Learnable Tokens

  • We find that multimodal features encoded by VLMs can interpret instructions while retaining image priors. Because attention is causal, the <instruction><image> format is particularly important: placing the instruction first lets the image tokens attend to it.
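
As an illustration only (this is not the repository's actual preprocessing code), the sketch below shows how the <instruction><image> ordering can be expressed with the Qwen2.5-VL chat template; the instruction text and image path are placeholders.

# Minimal sketch (assumption: not UniWorld-V1's actual preprocessing code).
# It only illustrates that the instruction precedes the image in the chat template.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            # Instruction first: under causal attention, the later image tokens
            # can attend to these instruction tokens.
            {"type": "text", "text": "Replace the red car with a blue bicycle."},  # placeholder instruction
            {"type": "image", "image": "path/to/reference.jpg"},                   # placeholder image path
        ],
    }
]

# Render the prompt text; the image placeholder appears after the instruction.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)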

🔥 Quick Start

1. Set up the environment

git clone https://github.com/PKU-YuanGroup/UniWorld-V1
cd UniWorld-V1
conda create -n univa python=3.10 -y
conda activate univa
pip install -r requirements.txt
pip install flash_attn --no-build-isolation

2. Download pretrained checkpoints

huggingface-cli download --resume-download LanguageBind/UniWorld-V1 --local-dir ${MODEL_PATH}
huggingface-cli download --resume-download black-forest-labs/FLUX.1-dev --local-dir ${FLUX_PATH}
huggingface-cli download --resume-download google/siglip2-so400m-patch16-512 --local-dir ${SIGLIP_PATH}
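
If you prefer to download from Python instead of the CLI, here is a minimal sketch using huggingface_hub; the local_dir values are placeholder paths, and note that black-forest-labs/FLUX.1-dev is a gated repository, so you may need to log in (huggingface-cli login) and accept its license first.

# Minimal sketch: Python alternative to the huggingface-cli commands above.
# The local_dir values are placeholders; point them wherever you keep checkpoints.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="LanguageBind/UniWorld-V1", local_dir="checkpoints/UniWorld-V1")
snapshot_download(repo_id="black-forest-labs/FLUX.1-dev", local_dir="checkpoints/FLUX.1-dev")  # gated: requires license acceptance and HF login
snapshot_download(repo_id="google/siglip2-so400m-patch16-512", local_dir="checkpoints/siglip2-so400m-patch16-512")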

3. Run with the CLI

MODEL_PATH="path/to/model"
FLUX_PATH="path/to/flux"
SIGLIP_PATH="path/to/siglip"
CUDA_VISIBLE_DEVICES=0 python -m univa.serve.cli \
    --model_path ${MODEL_PATH} \
    --flux_path ${FLUX_PATH} \
    --siglip_path ${SIGLIP_PATH}

4. Run with Gradio. We highly recommend trying our web demo via the following command.

python app.py --model_path ${MODEL_PATH} --flux_path ${FLUX_PATH} --siglip_path ${SIGLIP_PATH}

For a 24 GB VRAM GPU on Linux, use NF4 quantization (many thanks to @gluttony-10 for the contribution!). Then you can run the following command:

python app.py --model_path ${MODEL_PATH} --flux_path ${FLUX_PATH} --siglip_path ${SIGLIP_PATH} --nf4

Alternatively, download wikeeyang/UniWorld-V1-NF4 to ${MODEL_PATH} and diffusers/FLUX.1-dev-bnb-4bit to ${FLUX_PATH} instead.

For a 24 GB VRAM GPU on Windows, use NF4 quantization with offloading, which uses only about 20 GB of VRAM. Then you can run the following command:

python app.py --model_path ${MODEL_PATH} --flux_path ${FLUX_PATH} --siglip_path ${SIGLIP_PATH} --nf4 --offload

To use Chinese, run with --zh.

5. Run with ComfyUI

Many thanks to @judian17 for the contribution! ComfyUI-UniWorld-jd17 is a ComfyUI implementation provided by the open-source community. Please note that it requires transformers version 4.50.0.

🗝️ Training

Data preparation

Download the data from LanguageBind/UniWorld-V1. The dataset consists of two parts: source images and annotation JSON files.

Prepare a data.txt file in the following format:

  1. The first column is the root path to the images.

  2. The second column is the corresponding annotation JSON file.

  3. The third column indicates whether to enable the region-weighting strategy. We recommend setting it to True for edited data and False for others.

data/BLIP3o-60k,json/blip3o_t2i_58859.json,false
data/coco2017_caption_canny-236k,coco2017_canny_236574.json,false
data/imgedit,json/imgedit/laion_add_part0_edit.json,true
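
For a quick offline sanity check of this format (separate from the verification script mentioned further below), here is a minimal parsing sketch; the data.txt file name and the warning messages are illustrative assumptions rather than repository code.

# Minimal sketch: parse data.txt lines of the form
#   <image_root>,<annotation_json>,<enable_region_weighting>
# and warn about missing paths. Not the repo's official checker
# (use univa/serve/check_data.py for that).
import os

def parse_data_txt(path="data.txt"):
    entries = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            image_root, json_file, region_weighting = line.split(",")
            if not os.path.isdir(image_root):
                print(f"[line {line_no}] missing image root: {image_root}")
            if not os.path.isfile(json_file):
                print(f"[line {line_no}] missing annotation file: {json_file}")
            entries.append((image_root, json_file, region_weighting.lower() == "true"))
    return entries

if __name__ == "__main__":
    for image_root, json_file, use_region_weighting in parse_data_txt():
        print(image_root, json_file, use_region_weighting)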

We have prepared a data.txt file for ImgEdit for your reference.

`data.txt` for ImgEdit

data/imgedit/action/action,json/imgedit/pandam_action_edit.json,true
data/imgedit/action/action_part2,json/imgedit/pandam2_action_edit.json,true
data/imgedit/action/action_part3,json/imgedit/pandam3_action_edit.json,true
data/imgedit/action/action_part4,json/imgedit/pandam4_action_edit.json,true
data/imgedit/add/add_part0,json/imgedit/laion_add_part0_edit.json,true
data/imgedit/add/add_part1,json/imgedit/laion_add_part1_edit.json,true
data/imgedit/add/add_part4,json/imgedit/results_add_laion_part4_edit.json,true
data/imgedit/add/add_part5,json/imgedit/results_add_laion_part5_edit.json,true
data/imgedit/adjust/adjust_part0,json/imgedit/results_adjust_canny_laion_part0_edit.json,true
data/imgedit/adjust/adjust_part2,json/imgedit/results_adjust_canny_laion_part2_edit.json,true
data/imgedit/adjust/adjust_part3,json/imgedit/results_adjust_canny_laion_part3_edit.json,true
data/imgedit/adjust/adjust_part4,json/imgedit/laion_adjust_canny_part4_edit.json,true
data/imgedit/background/background_part0,json/imgedit/results_background_laion_part0_edit.json,true
data/imgedit/background/background_part2,json/imgedit/results_background_laion_part2_edit.json,true
data/imgedit/background/background_part3,json/imgedit/laion_background_part3_edit.json,true
data/imgedit/background/background_part5,json/imgedit/laion_background_part5_edit.json,true
data/imgedit/background/background_part7,json/imgedit/laion_background_part7_edit.json,true
data/imgedit/compose/compose_part0,json/imgedit/results_compose_part0_edit.json,false
data/imgedit/compose/compose_part2,json/imgedit/results_compose_part2_edit.json,false
data/imgedit/compose/compose_part6,json/imgedit/results_compose_part6_fix_edit.json,false
data/imgedit/refine_replace/refine_replace_part1,json/imgedit/results_extract_ref_part1_refimg_edit.json,true
data/imgedit/remove/remove_part0,json/imgedit/laion_remove_part0_edit.json,true
data/imgedit/remove/remove_part1,json/imgedit/results_remove_laion_part1_edit.json,true
data/imgedit/remove/remove_part4,json/imgedit/results_remove_laion_part4_edit.json,true
data/imgedit/remove/remove_part5,json/imgedit/results_remove_laion_part5_edit.json,true
data/imgedit/replace/replace_part0,json/imgedit/laion_replace_part0_edit.json,true
data/imgedit/replace/replace_part1,json/imgedit/laion_replace_part1_edit.json,true
data/imgedit/replace/replace_part4,json/imgedit/results_replace_laion_part4_edit.json,true
data/imgedit/replace/replace_part5,json/imgedit/results_replace_laion_part5_edit.json,true
data/imgedit/transfer/transfer,json/imgedit/results_style_transfer_edit.json,false
data/imgedit/transfer/transfer_part0,json/imgedit/results_style_transfer_part0_cap36472_edit.json,false

We provide a simple verification script to check whether the paths in your data.txt are set correctly.

python univa/serve/check_data.py

Data details

Text-to-Image Generation

  • BLIP3o-60k: We add text-to-image instructions to half of the data. [108 GB storage usage.]
  • OSP1024-286k: Sourced from internal data of the Open-Sora Plan, with captions generated using Qwen2-VL-72B. Images have an aspect ratio between 3:4 and 4:3, aesthetic score ≥ 6, and a short side ≥ 1024 pixels. [326 GB storage usage.]

Image Editing

  • imgedit-724k: Data is filtered using GPT-4o, retaining approximately half. [2.8 TB storage usage.]
  • OmniEdit-368k: For image editing data, samples with edited regions smaller than 1/100 were filtered out; images have a short side ≥ 1024 pixels. [204 GB storage usage.]
  • SEED-Data-Edit-Part1-Openimages-65k: For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
  • SEED-Data-Edit-Part2-3-12k: For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
  • PromptfixData-18k: For image restoration data and some editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [9 GB storage usage.]
  • StyleBooth-11k: For transfer style data, images have a short side ≥ 1024 pixels. [4 GB storage usage.]
  • Ghibli-36k: For transfer style data, images have a short side ≥ 1024 pixels. Warning: This data has not been quality filtered. [170 GB storage usage.]

Extract & Try-on

  • viton_hd-23k: Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
  • deepfashion-27k: Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
  • shop_product-23k: Sourced from internal data of the Open-Sora Plan, focusing on product extraction and virtual try-on, with images having a short side ≥ 1024 pixels. [12 GB storage usage.]

Image Perception

Training

Prepare pretrained weights

Download black-forest-labs/FLUX.1-dev to $FLUX_PATH. Download Qwen/Qwen2.5-VL-7B-Instruct to $QWENVL_PATH. We also support other sizes of Qwen2.5-VL.

SAVE_PATH="path/to/save/UniWorld-Qwen2.5-VL-7B-Instruct-FLUX.1-dev-fp32"
python scripts/make_univa_qwen2p5vl_weight.py \
    --origin_flux_ckpt_path $FLUX_PATH \
    --origin_qwenvl_ckpt_path $QWENVL_PATH \
    --save_path ${SAVE_PATH}

Stage 1

You need to set pretrained_lvlm_name_or_path to ${SAVE_PATH} in flux_qwen2p5vl_7b_vlm_stage1_512.yaml.

We recommend using optimizer: prodigy with learning_rate: 1.0 in flux_qwen2p5vl_7b_vlm_stage1_512.yaml.

For training with 512×512 images (batch size 1), it consumes about 74 GB on one node (8 GPUs).

Setting ema_pretrained_lvlm_name_or_path: null can save memory if you want to train at a higher resolution (e.g., 1024×1024) or with a larger batch size.

# stage 1
# if using prodigy: pip install prodigy
bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage1_512.sh

Stage 2

Download flux-redux-siglipv2-512.bin and set its path to pretrained_siglip_mlp_path in flux_qwen2p5vl_7b_vlm_stage2_512.yaml. The weights are sourced from ostris/Flex.1-alpha-Redux; we simply re-organized them.

Download google/siglip2-so400m-patch16-512 and set its path to pretrained_siglip_name_or_path in flux_qwen2p5vl_7b_vlm_stage2_512.yaml.

You also need to specify pretrained_mlp2_path, which is trained in Stage 1.

For training with 512×512 images (batch size 1), it consumes about 78 GB on one node (8 GPUs).

Setting ema_pretrained_lvlm_name_or_path: null can save memory if you want to train at a higher resolution (e.g., 1024×1024) or with a larger batch size. Using more nodes can also save memory because we use ZeRO-2 for the main model in Stage 2.

# stage 2
bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage2_512.sh

⚡️ Evaluation

Text-to-Image Generation

GenEval

cd univa/eval/geneval
# follow the instruction in univa/eval/geneval/README.md

WISE

cd univa/eval/wise
# follow the instruction in univa/eval/wise/README.md

GenAI-Bench

cd univa/eval/genai
# follow the instruction in univa/eval/genai/README.md

DPG-Bench

cd univa/eval/dpgbench
# follow the instruction in univa/eval/dpgbench/README.md

Image Editing

ImgEdit

We have updated the results to use gpt-4.1 instead of gpt-4o-2024-08-06 here. See here for more details.

cd univa/eval/imgedit
# follow the instruction in univa/eval/imgedit/README.md

GEdit

We discuss the scores related to GEdit-Bench here.

cd univa/eval/gdit
# follow the instruction in univa/eval/gdit/README.md

📊 Benchmarks

💡 How to Contribute

We greatly appreciate your contributions to the UniWorld-V1 open-source community and your help in making it even better!

For more details, please refer to the Contribution Guidelines.

👍 Acknowledgement and Related Work

  • ImgEdit: ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs.
  • Open-Sora Plan: An open-source text-to-image/video foundation model, which provides a large amount of caption data.
  • SEED-Data-Edit: A hybrid dataset for instruction-guided image editing.
  • Qwen2.5-VL: The new flagship vision-language model of Qwen.
  • FLUX.1-Redux-dev: Given an input image, FLUX.1 Redux can reproduce the image with slight variation, which can be used to refine a given image.
  • SigLIP 2: New multilingual vision-language encoders.
  • Step1X-Edit: A state-of-the-art image editing model.
  • BLIP3-o: A unified multimodal model that combines the reasoning and instruction following strength of autoregressive models with the generative power of diffusion models.
  • BAGEL: An open‑source multimodal foundation model with 7B active parameters (14B total) trained on large‑scale interleaved multimodal data.

🧐 FAQ

  1. Visual Encoder: #5 #15 #18
  2. Data Setup: #17
  3. Editing Evaluation: #6 #16
  4. Training Process and Analysis: #3 #9 #14 #28

🔒 License

✏️ Citing

@article{lin2025uniworld,
  title={UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation},
  author={Lin, Bin and Li, Zongjian and Cheng, Xinhua and Niu, Yuwei and Ye, Yang and He, Xianyi and Yuan, Shenghai and Yu, Wangbo and Wang, Shaodong and Ge, Yunyang and others},
  journal={arXiv preprint arXiv:2506.03147},
  year={2025}
}
@article{ye2025imgedit,
  title={ImgEdit: A Unified Image Editing Dataset and Benchmark},
  author={Ye, Yang and He, Xianyi and Li, Zongjian and Lin, Bin and Yuan, Shenghai and Yan, Zhiyuan and Hou, Bohan and Yuan, Li},
  journal={arXiv preprint arXiv:2505.20275},
  year={2025}
}
@article{niu2025wise,
  title={Wise: A world knowledge-informed semantic evaluation for text-to-image generation},
  author={Niu, Yuwei and Ning, Munan and Zheng, Mengren and Lin, Bin and Jin, Peng and Liao, Jiaqi and Ning, Kunpeng and Zhu, Bin and Yuan, Li},
  journal={arXiv preprint arXiv:2503.07265},
  year={2025}
}
@article{yan2025gpt,
  title={Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation},
  author={Yan, Zhiyuan and Ye, Junyan and Li, Weijia and Huang, Zilong and Yuan, Shenghai and He, Xiangyang and Lin, Kaiqing and He, Jun and He, Conghui and Yuan, Li},
  journal={arXiv preprint arXiv:2504.02782},
  year={2025}
}
@article{lin2024open,
  title={Open-Sora Plan: Open-Source Large Video Generation Model},
  author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
  journal={arXiv preprint arXiv:2412.00131},
  year={2024}
}

🤝 Community contributors

✨ Star History

Star History Chart