## Train

*Below is the latest training configuration for LLaVA v1.5. For legacy models, please refer to the README of [this](https://github.com/haotian-liu/LLaVA/tree/v1.0.1) version for now. We'll add them in a separate doc later.*

LLaVA training consists of two stages: (1) feature alignment stage: use our 558K subset of the LAION-CC-SBU dataset to connect a *frozen pretrained* vision encoder to a *frozen LLM*; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data, together with VQA data from academic-oriented tasks, to teach the model to follow multimodal instructions.

LLaVA is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
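
To train on fewer GPUs while keeping the global batch size fixed, only the two flags above need to change. Below is a minimal sketch of the arithmetic, assuming a hypothetical 4-GPU machine; the values are illustrative, not a released configuration:

```bash
# Illustrative only: keep the finetuning global batch size at 128 on fewer GPUs.
NUM_GPUS=4            # hypothetical 4-GPU machine
PER_DEVICE_BATCH=16   # per-GPU batch size
GRAD_ACCUM=$((128 / (PER_DEVICE_BATCH * NUM_GPUS)))                   # -> 2
echo "global batch = $((PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS))"   # -> 128
# Then pass to the training script:
#   --per_device_train_batch_size $PER_DEVICE_BATCH --gradient_accumulation_steps $GRAD_ACCUM
```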

### Hyperparameters

We use a similar set of hyperparameters as Vicuna in finetuning. Both hyperparameters used in pretraining and finetuning are provided below.

1. Pretraining

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | ---: | ---: | ---: | ---: | ---: |
| LLaVA-v1.5-13B | 256 | 1e-3 | 1 | 2048 | 0 |
2. Finetuning
| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | ---: | ---: | ---: | ---: | ---: |
| LLaVA-v1.5-13B | 128 | 2e-5 | 1 | 2048 | 0 |
### Download Vicuna checkpoints (automatically)
Our base model Vicuna v1.5, which is an instruction-tuned chatbot, will be downloaded automatically when you run our provided training scripts. No action is needed.
### Pretrain (feature alignment)
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain).

Pretraining takes around 5.5 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 3.5 hours for LLaVA-v1.5-7B.

Training script with DeepSpeed ZeRO-2: [`pretrain.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/pretrain.sh).

`--mm_projector_type mlp2x_gelu` is the only new option for pretraining LLaVA-v1.5; it selects the two-layer MLP vision-language connector.
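
For orientation, a condensed sketch of the ZeRO-2 pretraining launch is shown below. It is not a substitute for the linked `pretrain.sh`: several flags from the full script are omitted, and the data/output paths are placeholders you must adapt.

```bash
# Condensed sketch of the ZeRO-2 pretraining launch (see scripts/v1_5/pretrain.sh for the full flag list).
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-13b-v1.5 \
    --version plain \
    --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
    --image_folder ./playground/data/LLaVA-Pretrain/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-3 \
    --model_max_length 2048 \
    --output_dir ./checkpoints/llava-v1.5-13b-pretrain
```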
### Visual Instruction Tuning
1. Prepare data

Please download the annotation of the final mixture of our instruction tuning data, [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json), and download the images from the constituting datasets (COCO train2017, GQA, OCR-VQA, TextVQA, and Visual Genome); a possible folder layout is sketched below.
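
Once downloaded, the images need to sit where the finetuning script's `--data_path` and `--image_folder` flags point. The skeleton below is only an assumption about the folder names; verify them against the paths used in `finetune.sh`.

```bash
# Illustrative skeleton under ./playground/data (folder names are assumptions;
# check them against the --data_path / --image_folder values in finetune.sh).
mkdir -p ./playground/data/coco/train2017 \
         ./playground/data/gqa/images \
         ./playground/data/ocr_vqa/images \
         ./playground/data/textvqa/train_images \
         ./playground/data/vg/VG_100K \
         ./playground/data/vg/VG_100K_2
# Place llava_v1_5_mix665k.json directly under ./playground/data.
```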
2. Start training!
You may download our pretrained projectors in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). It is not recommended to use legacy projectors, as they may be trained with a different version of the codebase, and if any option is off, the model will not function or train as expected.
Visual instruction tuning takes around 20 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 10 hours for LLaVA-v1.5-7B on 8x A100 (40G).
Training script with DeepSpeed ZeRO-3: [`finetune.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune.sh).

New options to note (a condensed sketch of the launch command follows this list):
- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
- `--image_aspect_ratio pad`: this pads non-square images to squares instead of cropping them, which slightly reduces hallucination.
- `--group_by_modality_length True`: this should only be used when your instruction tuning dataset contains both language data (e.g. ShareGPT) and multimodal data (e.g. LLaVA-Instruct). It makes the training sampler only sample a single modality (either image or language) during training, which we observe speeds up training by ~25% and does not affect the final outcome.
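
As noted above, here is a condensed sketch of the ZeRO-3 finetuning launch. It is not a substitute for the linked `finetune.sh`: some flags from the full script are omitted, and the checkpoint/data/output paths are placeholders.

```bash
# Condensed sketch of the ZeRO-3 finetuning launch (see scripts/v1_5/finetune.sh for the full flag list).
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path lmsys/vicuna-13b-v1.5 \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-13b-pretrain/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --learning_rate 2e-5 \
    --model_max_length 2048 \
    --output_dir ./checkpoints/llava-v1.5-13b
```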
## Evaluation
In LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not evaluate using beam search, so that the inference process stays consistent with the real-time outputs of the chat demo.
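
For concreteness, below is a hedged example of what greedy decoding looks like at answer-generation time, assuming the `llava.eval.model_vqa_loader` entry point used by the eval scripts under `scripts/v1_5/eval`; the file paths are placeholders.

```bash
# Illustrative only: generate answers with greedy decoding (temperature 0, no beam search).
python -m llava.eval.model_vqa_loader \
    --model-path liuhaotian/llava-v1.5-13b \
    --question-file path/to/questions.jsonl \
    --image-folder path/to/images \
    --answers-file path/to/answers.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1
```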