
Commit ecb5984

Release training scripts and dataset.
1 parent fda0665 commit ecb5984

9 files changed: +126, -95 lines


README.md

Lines changed: 42 additions & 95 deletions
@@ -159,7 +159,9 @@ python -m llava.serve.cli \

## Train

-LLaVA training consists of two stages: (1) feature alignment stage: use approximately 600K filtered CC3M to connect a *frozen pretrained* vision encoder to a *frozen LLM*; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following to teach the model to follow multimodal instructions.
+*Below is the latest training configuration for LLaVA v1.5. For legacy models, please refer to the README of [this](https://github.com/haotian-liu/LLaVA/tree/v1.0.1) version for now. We'll add them in a separate doc later.*
+
+LLaVA training consists of two stages: (1) feature alignment stage: use approximately 600K filtered CC3M to connect a *frozen pretrained* vision encoder to a *frozen LLM*; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data (with VQA data from academic-oriented tasks) to teach the model to follow multimodal instructions.

LLaVA is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
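
As a quick sanity check when changing the GPU count, the effective global batch size can be recomputed directly; a minimal sketch with illustrative numbers, targeting the 128 global batch size used for finetuning:

```python
# Illustrative check: moving from 8 GPUs to 4 while doubling gradient
# accumulation keeps the effective global batch size at 128.
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_gpus = 4

global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
assert global_batch_size == 128, f"unexpected global batch size: {global_batch_size}"
print(global_batch_size)
```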

@@ -170,126 +172,75 @@ We use a similar set of hyperparameters as Vicuna in finetuning. Both hyperpara

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | ---: | ---: | ---: | ---: | ---: |
-| LLaVA-13B | 256 | 1e-3 | 1 | 2048 | 0 |
+| LLaVA-v1.5-13B | 256 | 1e-3 | 1 | 2048 | 0 |

2. Finetuning

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | ---: | ---: | ---: | ---: | ---: |
-| LLaVA-13B | 128 | 2e-5 | 1 | 2048 | 0 |
-
-### Prepare Vicuna checkpoints
+| LLaVA-v1.5-13B | 128 | 2e-5 | 1 | 2048 | 0 |

-Before you start, prepare our base model Vicuna, which is an instruction-tuned chatbot. Please download its weights [here](https://github.com/lm-sys/FastChat#model-weights).
+### Download Vicuna checkpoints (automatically)

-Vicuna has two versions: v0 and v1, the main difference between them is the prompt of format. We support both. To ensure the best performance, you need to specify the correct prompt version corresponding to the weights you download: `v0` for `v0` weights, and `v1` for all Vicuna `v1.x` models.
+Our base model Vicuna v1.5, which is an instruction-tuned chatbot, will be downloaded automatically when you run our provided training scripts. No action is needed.

### Pretrain (feature alignment)

-Please download the subset of the CC3M dataset we use in the paper [here](https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K).
-
-Pretrain takes around 4 hours for LLaVA-13B on 8x A100 (80G). It takes around 2 hours for 7B checkpoints.
+Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain).
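
One possible way to pull this subset into the layout assumed by `scripts/v1_5/pretrain.sh` is via `huggingface_hub`; a sketch, with the local directory chosen to mirror the script's `--data_path` and `--image_folder`:

```python
# Sketch: fetch the LLaVA-Pretrain dataset with huggingface_hub.
# The local_dir below is an assumption that mirrors scripts/v1_5/pretrain.sh.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="liuhaotian/LLaVA-Pretrain",
    repo_type="dataset",
    local_dir="./playground/data/LLaVA-Pretrain",
)
# If the images are shipped as an archive inside the dataset repo, unpack them so
# that ./playground/data/LLaVA-Pretrain/images contains the image files.
```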

-We recommend training with DeepSpeed as it can save a lot of GPU RAM. We provide training script with DeepSpeed [here](https://github.com/haotian-liu/LLaVA/blob/main/scripts/pretrain.sh).
+Pretraining takes around 5.5 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 3.5 hours for LLaVA-v1.5-7B.

-You may run this with a single A100 GPU with the following code. Please note that the `per_device_train_batch_size` * `gradient_accumulation_steps` should be equal to 128 to keep the global batch size the same.
-
-<details>
-<summary>Pretrain: LLaVA-13B, 1x A100 (80G). Time: ~33 hours.</summary>
-
-```Shell
-python llava/train/train_mem.py \
-    --model_name_or_path ./checkpoints/vicuna-13b \
-    --version [v0 or v1] \
-    --data_path /path/to/cc3m_595k.json \
-    --image_folder /path/to/cc3m_595k_images \
-    --vision_tower openai/clip-vit-large-patch14 \
-    --tune_mm_mlp_adapter True \
-    --mm_vision_select_layer -2 \
-    --mm_use_im_start_end False \
-    --mm_use_im_patch_token False \
-    --bf16 True \
-    --output_dir ./checkpoints/llava-13b-pretrain \
-    --num_train_epochs 1 \
-    --per_device_train_batch_size 16 \
-    --per_device_eval_batch_size 4 \
-    --gradient_accumulation_steps 8 \
-    --evaluation_strategy "no" \
-    --save_strategy "steps" \
-    --save_steps 2400 \
-    --save_total_limit 1 \
-    --learning_rate 2e-3 \
-    --weight_decay 0. \
-    --warmup_ratio 0.03 \
-    --lr_scheduler_type "cosine" \
-    --logging_steps 1 \
-    --tf32 True \
-    --model_max_length 2048 \
-    --gradient_checkpointing True \
-    --lazy_preprocess True \
-    --report_to wandb
-```
-</details>
+Training script with DeepSpeed ZeRO-2: [`pretrain.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/pretrain.sh).

+`--mm_projector_type mlp2x_gelu` is the only new option for pretraining LLaVA-v1.5; it selects the two-layer MLP vision-language connector.
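
Conceptually, this connector maps the vision encoder's patch features into the LLM embedding space through two linear layers with a GELU in between; a minimal PyTorch sketch, with illustrative dimensions rather than the exact module from the codebase:

```python
import torch
import torch.nn as nn

# Minimal sketch of a two-layer MLP (GELU) vision-language connector.
# Dimensions are illustrative: ViT-L/14 features (1024) -> a 5120-dim LLM hidden size.
vision_hidden_size, llm_hidden_size = 1024, 5120
mm_projector = nn.Sequential(
    nn.Linear(vision_hidden_size, llm_hidden_size),
    nn.GELU(),
    nn.Linear(llm_hidden_size, llm_hidden_size),
)

patch_features = torch.randn(1, 576, vision_hidden_size)  # 24x24 patch tokens at 336px, patch size 14
image_embeds = mm_projector(patch_features)               # -> (1, 576, llm_hidden_size)
print(image_embeds.shape)
```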

### Visual Instruction Tuning

1. Prepare data

-Please download the annotation of our instruction tuning data [llava_instruct_158k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_150k.json), and download the COCO train2017 images [here](https://cocodataset.org/#download).
-
-2. Start training!
-
-You may download our pretrained projectors in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). It is not recommended to use legacy projectors, as they may be trained with a different version of the codebase, and if any option is off, the model will not function/train as we expected.
-
-When we initially released our paper, we used a full 3-epoch schedule on the LLaVA-Instruct-158K dataset. The scripts are provided [here](https://github.com/haotian-liu/LLaVA/blob/main/scripts/finetune_full_schedule.sh).
-
-In our later exploration, we introduced LLaVA-Lightning, as we find that a much faster 1-epoch schedule on LLaVA-Instruct-80K can achieve fast convergence and good performance. With LLaVA Lightning, we are able to train, validate, and release LLaVA-LLaMA-2 checkpoints preview on the same day as LLaMA-2 release. If you are interested to learn more about LLaVA Lightning, please continue to the following section.
-
-### Lightning
-
-LLaVA-Lightning can be trained on 8x A100 GPUs in just 3 hours, including both pretraining and finetuning. When using spot instances, it costs just ~$40.
-
-For LLaVA Lightning, we create two distilled subset to ensure both a broad concept coverage, and the efficiency in training. Furthermore, we only perform instruction tuning for 1 epoch, in contrast to 3 epochs in the paper. We find such schedule is effective and can achieve fast convergence and good performance.
-
-For pretraining, we create a concept-balanced subset of LAION-CC-SBU. It consists of 558K images. Download data [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/tree/main).
+Please download the annotation of the final mixture of our instruction tuning data [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json), and download the images from the constituting datasets:

-For instruction tuning, we create a subset of LLaVA-Instruct-150K. It consists of 80K image-instruction pairs, consisting of 40K conversation and 40K complex reasoning data, with non-overlapping images. Download `llava_instruct_80k.json` [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_80k.json).
+- COCO: [train2017](http://images.cocodataset.org/zips/train2017.zip)
+- GQA: [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
+- OCR-VQA: [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing)
+- TextVQA: [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
+- VisualGenome: [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)

-#### Hyperparameters
+After downloading all of them, organize the data as follows in `./playground/data` (a download-and-unpack sketch follows the tree):

-1. Pretraining ([script](https://github.com/haotian-liu/LLaVA/blob/main/scripts/pretrain.sh))
-
-| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
-| --- | ---: | ---: | ---: | ---: | ---: |
-| LLaVA-Lightning | 128 | 2e-3 | 1 | 2048 | 0 |
-
-2. Visual Instruction Tuning ([script](https://github.com/haotian-liu/LLaVA/blob/main/scripts/finetune.sh))
+```
+├── coco
+│   └── train2017
+├── gqa
+│   └── images
+├── ocr_vqa
+│   └── images
+├── textvqa
+│   └── train_images
+└── vg
+    ├── VG_100K
+    └── VG_100K_2
+```
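
A sketch of fetching and unpacking the directly linked archives into that layout; OCR-VQA ships through its own download script and is omitted here, and the extracted folder names are assumptions to verify against the tree above:

```python
# Sketch: download and unpack the web-hosted image archives into ./playground/data.
import urllib.request
import zipfile
from pathlib import Path

ARCHIVES = {
    "coco": ["http://images.cocodataset.org/zips/train2017.zip"],
    "gqa": ["https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip"],
    "textvqa": ["https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip"],
    "vg": [
        "https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip",
        "https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip",
    ],
}

root = Path("playground/data")
for subdir, urls in ARCHIVES.items():
    dest = root / subdir
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        archive = dest / Path(url).name
        if not archive.exists():
            urllib.request.urlretrieve(url, archive)  # large files; no resume handling
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)  # check the extracted folder names against the tree above
```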

-| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
-| --- | ---: | ---: | ---: | ---: | ---: |
-| LLaVA-Lightning | 128 | 2e-5 | 1 | 2048 | 0 |
+2. Start training!

-#### LLaVA-MPT-7b
-Thanks to LLaVA-Lightning, we are able to train a checkpoint based on MPT-7B-Chat on 8x A100 GPUs in just 3 hours, including both pretraining and finetuning.
+You may download our pretrained projectors in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). It is not recommended to use legacy projectors, as they may be trained with a different version of the codebase, and if any option is off, the model will not function/train as we expected.

-**NOTE**: This is a research preview of the LLaVA-Lightning based on MPT-7B-chat checkpoint. The usage of the model should comply with MPT-7B-chat license and agreements.
+Visual instruction tuning takes around 20 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 10 hours for LLaVA-v1.5-7B on 8x A100 (40G).

-1. Usage
+Training script with DeepSpeed ZeRO-3: [`finetune.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune.sh).

-You do not need to download our checkpoint, it will directly load from our Hugging Face model: [`liuhaotian/LLaVA-Lightning-MPT-7B-preview`](https://huggingface.co/liuhaotian/LLaVA-Lightning-MPT-7B-preview).
+New options to note:

-```Shell
-python -m llava.serve.controller --host 0.0.0.0 --port 10000
-python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/LLaVA-Lightning-MPT-7B-preview
-python -m llava.serve.gradio_web_server --controller http://localhost:10000
-```
+- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
+- `--image_aspect_ratio pad`: it slightly reduces hallucination (see the padding sketch below).
+- `--group_by_modality_length True`: this should only be used when your instruction tuning dataset contains both language (e.g. ShareGPT) and multimodal (e.g. LLaVA-Instruct). It makes the training sampler only sample a single modality (either image or language) during training, which we observe to speed up training by ~25%, and does not affect the final outcome.
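
For intuition, the `pad` option amounts to placing each image on a square canvas before CLIP preprocessing instead of distorting or cropping it; a rough PIL sketch of the idea, not the project's exact implementation (the fill color is an assumption approximating CLIP's mean pixel value):

```python
from PIL import Image

def expand_to_square(img: Image.Image, fill=(122, 116, 104)) -> Image.Image:
    """Pad an image to a square canvas; a sketch, not LLaVA's exact code."""
    w, h = img.size
    if w == h:
        return img
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img.convert("RGB"), ((side - w) // 2, (side - h) // 2))
    return canvas

# Example: a 640x480 image becomes 640x640 with centered padding.
padded = expand_to_square(Image.new("RGB", (640, 480), (255, 0, 0)))
print(padded.size)  # (640, 640)
```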

-2. Training
+## Evaluation

-We use the same set of training dataset, and the hyperparameters as other *Lightning* checkpoints.
+In LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not evaluate using beam search, so that the inference process is consistent with the real-time outputs of the chat demo.
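
In Hugging Face `transformers` terms, greedy decoding simply means generating with sampling and beam search disabled; a generic illustration, with `gpt2` standing in for the model under evaluation rather than the project's evaluation harness:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Generic greedy-decoding call; "gpt2" is only a stand-in model for illustration.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Describe the image in one sentence:", return_tensors="pt")
out = model.generate(**inputs, do_sample=False, num_beams=1, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```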

-## Evaluation
+Detailed evaluation scripts coming soon.

### GPT-assisted Evaluation

@@ -327,10 +278,6 @@ OPENAI_API_KEY="sk-***********************************" python llava/eval/eval_g
python summarize_gpt_review.py
```

-## ScienceQA
-
-Please check out the documentation [here](https://github.com/haotian-liu/LLaVA/blob/main/docs/ScienceQA.md).
-
## Citation

If you find LLaVA useful for your research and applications, please cite using this BibTeX:

scripts/finetune.sh

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
#!/bin/bash

+# IMPORTANT: this is the training script for the original LLaVA, NOT FOR LLaVA V1.5!
+
# Uncomment and set the following variables correspondingly to run this script:

################## VICUNA ##################

scripts/finetune_full_schedule.sh

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
#!/bin/bash

+# IMPORTANT: this is the training script for the original LLaVA, NOT FOR LLaVA V1.5!
+
# Uncomment and set the following variables correspondingly to run this script:

################## VICUNA ##################

scripts/finetune_lora.sh

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
#!/bin/bash

+# IMPORTANT: this is the training script for the original LLaVA, NOT FOR LLaVA V1.5!
+
# Uncomment and set the following variables correspondingly to run this script:

################## VICUNA ##################

scripts/finetune_qlora.sh

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
#!/bin/bash

+# IMPORTANT: this is the training script for the original LLaVA, NOT FOR LLaVA V1.5!
+
# Uncomment and set the following variables correspondingly to run this script:

################## VICUNA ##################

scripts/finetune_sqa.sh

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
#!/bin/bash

+# IMPORTANT: this is the training script for the original LLaVA, NOT FOR LLaVA V1.5!
+
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-13b-v1.3 \

scripts/pretrain.sh

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
#!/bin/bash

+# IMPORTANT: this is the training script for the original LLaVA, NOT FOR LLaVA V1.5!
+
# Uncomment and set the following variables correspondingly to run this script:

# MODEL_VERSION=vicuna-v1-3-7b

scripts/v1_5/finetune.sh

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+#!/bin/bash
+
+deepspeed llava/train/train_mem.py \
+    --deepspeed ./scripts/zero3.json \
+    --model_name_or_path lmsys/vicuna-13b-v1.5 \
+    --version v1 \
+    --data_path ./playground/data/llava_v1_5_mix665k.json \
+    --image_folder ./playground/data \
+    --vision_tower openai/clip-vit-large-patch14-336 \
+    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-13b-pretrain/mm_projector.bin \
+    --mm_projector_type mlp2x_gelu \
+    --mm_vision_select_layer -2 \
+    --mm_use_im_start_end False \
+    --mm_use_im_patch_token False \
+    --image_aspect_ratio pad \
+    --group_by_modality_length True \
+    --bf16 True \
+    --output_dir ./checkpoints/llava-v1.5-13b \
+    --num_train_epochs 1 \
+    --per_device_train_batch_size 16 \
+    --per_device_eval_batch_size 4 \
+    --gradient_accumulation_steps 1 \
+    --evaluation_strategy "no" \
+    --save_strategy "steps" \
+    --save_steps 50000 \
+    --save_total_limit 1 \
+    --learning_rate 2e-5 \
+    --weight_decay 0. \
+    --warmup_ratio 0.03 \
+    --lr_scheduler_type "cosine" \
+    --logging_steps 1 \
+    --tf32 True \
+    --model_max_length 2048 \
+    --gradient_checkpointing True \
+    --dataloader_num_workers 4 \
+    --lazy_preprocess True \
+    --report_to wandb

scripts/v1_5/pretrain.sh

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+#!/bin/bash
+
+deepspeed llava/train/train_mem.py \
+    --deepspeed ./scripts/zero2.json \
+    --model_name_or_path lmsys/vicuna-13b-v1.5 \
+    --version plain \
+    --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
+    --image_folder ./playground/data/LLaVA-Pretrain/images \
+    --vision_tower openai/clip-vit-large-patch14-336 \
+    --mm_projector_type mlp2x_gelu \
+    --tune_mm_mlp_adapter True \
+    --mm_vision_select_layer -2 \
+    --mm_use_im_start_end False \
+    --mm_use_im_patch_token False \
+    --bf16 True \
+    --output_dir ./checkpoints/llava-v1.5-13b-pretrain \
+    --num_train_epochs 1 \
+    --per_device_train_batch_size 32 \
+    --per_device_eval_batch_size 4 \
+    --gradient_accumulation_steps 1 \
+    --evaluation_strategy "no" \
+    --save_strategy "steps" \
+    --save_steps 24000 \
+    --save_total_limit 1 \
+    --learning_rate 1e-3 \
+    --weight_decay 0. \
+    --warmup_ratio 0.03 \
+    --lr_scheduler_type "cosine" \
+    --logging_steps 1 \
+    --tf32 True \
+    --model_max_length 2048 \
+    --gradient_checkpointing True \
+    --dataloader_num_workers 4 \
+    --lazy_preprocess True \
+    --report_to wandb
