Commit 181a2f8

Release v1.0.0

1 parent 7ace501

Note: this is a large commit, so only a subset of the 70 changed files is shown below.

70 files changed, with 2,699 additions and 2,846 deletions.

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ dist
 *.log
 *.log.*
 *.json
+*.jsonl

 # Data
 !**/alpaca-data-conversation.json

README.md

Lines changed: 9 additions & 17 deletions
@@ -14,11 +14,12 @@


 ## Release
-- [6/26] 🔥 [CVPR 2023 Tutorial](https://vlp-tutorial.github.io/) on **Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4**! Please check out [[Slides](https://datarelease.blob.core.windows.net/tutorial/vision_foundation_models_2023/slides/Chunyuan_cvpr2023_tutorial_lmm.pdf)] [[Notes](https://arxiv.org/abs/2306.14895)] [[YouTube](https://youtu.be/mkI7EPD1vp8)] [[Bilibili](https://www.bilibili.com/video/BV1Ng4y1T7v3/)].
-- [6/11] 🔥 We released the preview for the most-requested feature: DeepSpeed and LoRA support! Please see the documentation [here](./docs/LoRA.md).
-- [6/1] 🔥 We released **LLaVA-Med: Large Language and Vision Assistant for Biomedicine**, a step towards building biomedical-domain large language and vision models with GPT-4-level capabilities. Check out the [paper](https://arxiv.org/abs/2306.00890) and [page](https://github.com/microsoft/LLaVA-Med).
-- [5/13] 🔥 Interested in quantifying the emergent **zero-shot OCR** performance of LLaVA and other open-source LMMs? Please check out the paper ["On the Hidden Mystery of OCR in Large Multimodal Models"](https://arxiv.org/abs/2305.07895), where LLaVA consistently outperforms MiniGPT-4 on 17 out of 18 datasets, despite being trained with an order of magnitude less training data.
-- [5/6] 🔥 We are releasing [LLaVA-Lightning-MPT-7B-preview](https://huggingface.co/liuhaotian/LLaVA-Lightning-MPT-7B-preview), based on MPT-7B-Chat! See [here](#LLaVA-MPT-7b) for more details.
+- [7/19] 🔥 We release a major upgrade, including support for LLaMA-2, LoRA training, 4-/8-bit inference, higher resolution (336x336), and a lot more. We also support and verify training with RTX 3090 and RTX A6000. Check out [LLaVA-from-LLaMA-2](https://github.com/haotian-liu/LLaVA/blob/main/docs/LLaVA_from_LLaMA2.md), [release notes](https://github.com/haotian-liu/LLaVA/blob/main/docs/Release_Notes.md#7192023), and our [model zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)!
+- [6/26] [CVPR 2023 Tutorial](https://vlp-tutorial.github.io/) on **Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4**! Please check out [[Slides](https://datarelease.blob.core.windows.net/tutorial/vision_foundation_models_2023/slides/Chunyuan_cvpr2023_tutorial_lmm.pdf)] [[Notes](https://arxiv.org/abs/2306.14895)] [[YouTube](https://youtu.be/mkI7EPD1vp8)] [[Bilibili](https://www.bilibili.com/video/BV1Ng4y1T7v3/)].
+- [6/11] We released the preview for the most-requested feature: DeepSpeed and LoRA support! Please see the documentation [here](./docs/LoRA.md).
+- [6/1] We released **LLaVA-Med: Large Language and Vision Assistant for Biomedicine**, a step towards building biomedical-domain large language and vision models with GPT-4-level capabilities. Check out the [paper](https://arxiv.org/abs/2306.00890) and [page](https://github.com/microsoft/LLaVA-Med).
+- [5/13] Interested in quantifying the emergent **zero-shot OCR** performance of LLaVA and other open-source LMMs? Please check out the paper ["On the Hidden Mystery of OCR in Large Multimodal Models"](https://arxiv.org/abs/2305.07895), where LLaVA consistently outperforms MiniGPT-4 on 17 out of 18 datasets, despite being trained with an order of magnitude less training data.
+- [5/6] We are releasing [LLaVA-Lightning-MPT-7B-preview](https://huggingface.co/liuhaotian/LLaVA-Lightning-MPT-7B-preview), based on MPT-7B-Chat! See [here](#LLaVA-MPT-7b) for more details.
 - [5/2] 🔥 We are releasing LLaVA-Lightning! Train a lite, multimodal GPT-4 with just $40 in 3 hours! See [here](#train-llava-lightning) for more details.
 - [5/2] We upgrade the LLaVA package to v0.1 to support Vicuna v0 and v1 checkpoints; please upgrade following the instructions [here](#install).
 - [4/30] Our checkpoint with Vicuna-7b-v0 has been released [here](#llava-7b)! This checkpoint is more accessible and device friendly. Stay tuned for a major upgrade next week!
@@ -60,7 +61,7 @@ pip install -e .
 3. Install additional packages for training cases
 ```
 pip install ninja
-pip install flash-attn==1.0.2
+pip install flash-attn --no-build-isolation
 ```

 ### Upgrade to latest code base
@@ -359,11 +360,6 @@ For pretraining, we create a concept-balanced subset of LAION-CC-SBU. It consist

 For instruction tuning, we create a subset of LLaVA-Instruct-150K. It consists of 80K image-instruction pairs: 40K conversation and 40K complex reasoning samples, with non-overlapping images. Download `llava_instruct_80k.json` [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_80k.json).

-
-```Shell
-bash ./scripts/train_lightning.sh {v0,v1}
-```
-
 #### Hyperparameters

 1. Pretraining
@@ -403,10 +399,6 @@ python -m llava.serve.gradio_web_server --controller http://localhost:10000

 We use the same training dataset and hyperparameters as the other Lightning checkpoints.

-```Shell
-bash ./scripts/train_lightning_mpt.sh
-```
-
 ### ScienceQA
 **NOTE**: Because the ScienceQA experiments were done earlier, the current checkpoints are trained *without* `<im_start>` and `<im_end>` tokens. Here we provide our training scripts for the current checkpoints.

@@ -569,8 +561,8 @@ python -m llava.eval.model_vqa_science \
     --question-file /path/to/ScienceQA/data/scienceqa/llava_test.json \
     --image-folder /path/to/ScienceQA/data/scienceqa/images/test \
     --answers-file vqa/results/ScienceQA/test_llava-13b.jsonl \
-    --answer-prompter
-    --conv-mode simple
+    --answer-prompter \
+    --conv-mode llava_v0
 ```

 (b) Evaluate the generated responses
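The `--answers-file` above uses the JSON Lines format (one JSON object per line), which is also why `*.jsonl` is now ignored in `.gitignore`. Below is a minimal Python sketch for loading such a file before step (b); the exact field names inside each record depend on the evaluation script and are not assumed here:

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Hypothetical usage with the answers file produced by the command above:
answers_path = Path("vqa/results/ScienceQA/test_llava-13b.jsonl")
if answers_path.exists():
    answers = load_jsonl(answers_path)
    print(f"Loaded {len(answers)} generated responses")
```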

docs/LLaVA_Bench.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# LLaVA-Bench

**-Introduction-** Large commercial multimodal chatbots have been released this week, including

- [Multimodal Bing-Chat by Microsoft](https://blogs.bing.com/search/july-2023/Bing-Chat-Enterprise-announced,-multimodal-Visual-Search-rolling-out-to-Bing-Chat) (July 18, 2023)
- [Multimodal Bard by Google](https://bard.google.com/).

These chatbots are presumably supported by proprietary large multimodal models (LMMs). Compared with open-source LMMs such as LLaVA, proprietary LMMs represent the scaling upper bound of current SoTA techniques. Both share the goal of developing multimodal chatbots that follow human intents to complete various daily-life visual tasks in the wild. While how to evaluate multimodal chat ability remains underexplored, such evaluation provides useful feedback for studying open-source LMMs against commercial multimodal chatbots. In addition to the *LLaVA-Bench (COCO)* dataset we used to develop the early versions of LLaVA, we are releasing *LLaVA-Bench (In-the-Wild)* to the community for public use.

## LLaVA-Bench (In-the-Wild)

To evaluate the model's capability on more challenging tasks and its generalizability to novel domains, we collect a diverse set of 24 images with 60 questions in total, including indoor and outdoor scenes, memes, paintings, sketches, etc., and associate each image with a highly detailed, manually curated description and a proper selection of questions. This design also assesses the model's robustness to different prompts. In this release, we categorize questions into three categories: conversation (simple QA), detailed description, and complex reasoning. We continue to expand and improve the diversity of LLaVA-Bench (In-the-Wild). We manually query Bing-Chat and Bard to get their responses.

### Results

The score is measured by comparing against a reference answer generated by text-only GPT-4, which is given the question along with the ground-truth image annotations as context. A text-only GPT-4 evaluator then rates both answers. We query GPT-4 by putting the reference answer first, followed by the answer generated by the candidate model. We upload images at their original resolution to Bard and Bing-Chat to obtain their results.

| Approach | Conversation | Detail | Reasoning | Overall |
|----------------|--------------|--------|-----------|---------|
| Bard-0718 | 83.7 | 69.7 | 78.7 | 77.8 |
| Bing-Chat-0629 | 59.6 | 52.2 | 90.1 | 71.5 |
| LLaVA-13B-v1-336px-0719 (beam=1) | 64.3 | 55.9 | 81.7 | 70.1 |
| LLaVA-13B-v1-336px-0719 (beam=5) | 68.4 | 59.9 | 84.3 | 73.5 |

Note that Bard sometimes refuses to answer questions about images containing humans, and Bing-Chat blurs human faces in the images. We also provide the benchmark scores for the subset without humans.

| Approach | Conversation | Detail | Reasoning | Overall |
|----------------|--------------|--------|-----------|---------|
| Bard-0718 | 94.9 | 74.3 | 84.3 | 84.6 |
| Bing-Chat-0629 | 55.8 | 53.6 | 93.5 | 72.6 |
| LLaVA-13B-v1-336px-0719 (beam=1) | 62.2 | 56.4 | 82.2 | 70.0 |
| LLaVA-13B-v1-336px-0719 (beam=5) | 65.6 | 61.7 | 85.0 | 73.6 |
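The relative-scoring scheme described above can be illustrated with a small aggregation sketch. This is not the repository's evaluation code; it assumes the GPT-4 judge returns one numeric rating for the reference answer and one for the candidate answer per question, and that the reported number is the candidate-to-reference ratio, expressed as a percentage and averaged per category:

```python
from collections import defaultdict

def aggregate_relative_scores(ratings):
    """ratings: list of (category, reference_rating, candidate_rating) tuples,
    e.g. ("conversation", 9.0, 6.5). Returns per-category and overall scores,
    each computed as 100 * candidate / reference, averaged over questions."""
    per_category = defaultdict(list)
    for category, ref, cand in ratings:
        per_category[category].append(100.0 * cand / ref)
    scores = {c: sum(v) / len(v) for c, v in per_category.items()}
    all_scores = [s for v in per_category.values() for s in v]
    scores["overall"] = sum(all_scores) / len(all_scores)
    return scores

# Toy usage with made-up ratings (three questions, one per category):
demo = [("conversation", 9.0, 6.0), ("detail", 8.0, 4.5), ("reasoning", 9.0, 7.5)]
print(aggregate_relative_scores(demo))
```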

docs/LLaVA_from_LLaMA2.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
# LLaVA (based on Llama 2 LLM, Preview)

*NOTE: This is a technical preview. We are still running a hyperparameter search, and will release the final model soon. If you'd like to contribute to this, please contact us.*

:llama: **-Introduction-** [Llama 2 is an open-source LLM released by Meta AI](https://about.fb.com/news/2023/07/llama-2/) today (July 18, 2023). Compared with its early version [Llama 1](https://ai.meta.com/blog/large-language-model-llama-meta-ai/), Llama 2 offers ***stronger language performance***, a ***longer context window***, and, importantly, is ***commercially usable***! While Llama 2 is changing the LLM market landscape in the language space, its multimodal ability remains unknown. We quickly developed a LLaVA variant based on the latest Llama 2 checkpoints, and release it to the community for public use.

You need to apply for and download the latest Llama 2 checkpoints to start your own training (apply [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)).

## Training

Please check out [`pretrain.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/pretrain.sh), [`finetune.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/finetune.sh), and [`finetune_lora.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/finetune_lora.sh).

## LLaVA (based on Llama 2): What is different?

:volcano: How is the new LLaVA based on Llama 2 different from the Llama 1 version? The differences in the training process are:

- **Pre-training**. The pre-trained base LLM is changed from Llama 1 to Llama 2.
- **Language instruction-tuning**. The previous LLaVA model starts from Vicuna, which is instruction-tuned on ShareGPT data on top of Llama 1; the new LLaVA model starts from Llama 2 Chat, which is instruction-tuned on dialogue data on top of Llama 2.
- **Multimodal instruction-tuning**. The same LLaVA-Lightning process is applied.

### Results

- Llama 2 is better at following role-playing instructions.
- Llama 2 fails to follow translation instructions.

<p align="center">
  <img src="../images/llava_example_cmp.png" width="100%">
</p>

docs/LoRA.md

Lines changed: 9 additions & 5 deletions
@@ -13,16 +13,20 @@ Please execute each of the command below one by one (after the previous one has
 python -m llava.serve.controller --host 0.0.0.0 --port 10000
 ```

-#### Launch a model worker
+#### Launch a gradio web server
 ```Shell
-python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-vicuna-7b-v1.1-lcs_558k-instruct_80k_3e-lora-preview-alpha --model-base /path/to/vicuna-v1.1
+python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
 ```
-Wait until the process finishes loading the model and you see "Uvicorn running on ...".
+You just launched the Gradio web interface. Now you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list; do not worry, as we have not launched any model worker yet. The list will be updated automatically when you launch a model worker.

-#### Launch a gradio web server.
+#### Launch a model worker
 ```Shell
-python -m llava.serve.gradio_web_server --controller http://localhost:10000
+python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-vicuna-7b-v1.1-lcs_558k-instruct_80k_3e-lora-preview-alpha --model-base /path/to/vicuna-v1.1
 ```
+Wait until the process finishes loading the model and you see "Uvicorn running on ...". Now refresh your Gradio web UI, and you will see the model you just launched in the model list.
+
+You can launch as many workers as you want, and compare different model checkpoints in the same Gradio interface. Keep `--controller` the same, and change `--port` and `--worker` to a different port number for each worker.

 ## Training


docs/MODEL_ZOO.md

Lines changed: 11 additions & 2 deletions
@@ -1,3 +1,12 @@
-## Model Zoo
+# Model Zoo
+
+We will add more model checkpoints to the model zoo very soon. Stay tuned!
+
+If you are interested in including any other details in the Model Zoo, please open an issue :)
+
+| Base LLM | Vision Encoder | Pretrain Data | Pretraining schedule | Finetuning Data | Finetuning schedule | LLaVA-Bench-Conv | LLaVA-Bench-Detail | LLaVA-Bench-Complex | LLaVA-Bench-Overall | Download |
+|----------|----------------|---------------|----------------------|-----------------|---------------------|------------------|--------------------|---------------------|---------------------|----------|
+| Vicuna-13B-v1.3 | CLIP-L-336px | LCS-558K | 1e | LLaVA-Instruct-80K | proj-1e, lora-1e | 64.3 | 55.9 | 81.7 | 70.1 | [LoRA](https://huggingface.co/liuhaotian/llava-v1-0719-336px-lora-vicuna-13b-v1.3) [LoRA-Merged](https://huggingface.co/liuhaotian/llava-v1-0719-336px-lora-merge-vicuna-13b-v1.3) |
+| LLaMA-2-13B-Chat | CLIP-L | LCS-558K | 1e | LLaVA-Instruct-80K | full-finetune-1e | 56.7 | 58.6 | 80.0 | 67.9 | [preview](https://huggingface.co/liuhaotian/llava-llama-2-13b-chat-lightning-preview) |
+| LLaMA-2-7B-Chat | CLIP-L | LCS-558K | 1e | LLaVA-Instruct-80K | lora-1e | 51.2 | 58.9 | 71.6 | 62.8 | [preview](https://huggingface.co/liuhaotian/llava-llama-2-7b-chat-lightning-lora-preview) |

-Coming soon.

docs/Release_Notes.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# Release Notes

We document release notes here for each update.

### 7/19/2023

The first major update since our initial release!

- Added **LLaMA-2** support
- **Full LoRA support**. To make model training more accessible, we release a set of model weights based on LoRA, which supports training on academic resources (e.g., 4x A6000s or 8x 3090s, **without the need for CPU offloading**)
- A more versatile design for training large multimodal models, including swapping in different language models and vision encoders, with more coming soon
- Support for higher-resolution input using CLIP-ViT-L-336px as the vision encoder, for more detailed visual understanding
- Ablated and cleaned up some design choices to make training simpler and smoother
- Full DeepSpeed support
- Improved model checkpoint saving during the pretraining stage to save disk space
- Improved WebUI interface
- Improved support for inference with multiple GPUs
- Support for inference with 4-bit and 8-bit quantization
- Support for interactive CLI inference

We train all models in this release using LLaVA-LCS-558K for pretraining and LLaVA-Instruct-80K for instruction tuning, to maintain an efficient and affordable training budget. **The full training (including both pretraining and finetuning) can be completed within 6 hours on 8x 3090s.**

*We hope this release further benefits the community and makes large multimodal models more accessible.*

#### Detailed Changes (7/19/2023)

- Tokenization. We remove the dependency on the additional tokens (`<IM_START>`, `<IM_END>`, `<IM_PATCH>`), so that during the pretraining stage the tokenizer does not change at all and we only update the linear projector weights (see the sketch after this list).
- Prompt.
  - Pretraining. We simplified the pretraining prompts by removing additional instructions like `Describe the image details`, which we find allows zero-shot inference and can slightly improve training speed.
  - We keep the train/test prompt consistent, which we find slightly improves the model's performance during inference.
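To make the point about updating only the linear projector concrete, here is a minimal, hypothetical PyTorch sketch of projector-only pretraining. The module names (`vision_tower`, `mm_projector`, `language_model`) and layer sizes are illustrative assumptions, not the repository's actual classes:

```python
import torch.nn as nn

class TinyLLaVAStub(nn.Module):
    """Toy stand-in for the real architecture; names and sizes are illustrative only."""
    def __init__(self, vision_dim=1024, hidden_dim=4096):
        super().__init__()
        self.vision_tower = nn.Linear(768, vision_dim)            # placeholder vision encoder
        self.mm_projector = nn.Linear(vision_dim, hidden_dim)     # the linear projector
        self.language_model = nn.Linear(hidden_dim, hidden_dim)   # placeholder LLM

model = TinyLLaVAStub()

# Freeze everything, then unfreeze only the projector: with no new tokens added,
# the tokenizer and embeddings stay untouched during pretraining.
for p in model.parameters():
    p.requires_grad = False
for p in model.mm_projector.parameters():
    p.requires_grad = True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['mm_projector.weight', 'mm_projector.bias']
```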

images/llava_example_cmp.png

317 KB

llava/constants.py

Lines changed: 8 additions & 0 deletions
@@ -2,3 +2,11 @@
 WORKER_HEART_BEAT_INTERVAL = 15

 LOGDIR = "."
+
+# Model Constants
+IGNORE_INDEX = -100
+IMAGE_TOKEN_INDEX = -200
+DEFAULT_IMAGE_TOKEN = "<image>"
+DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
+DEFAULT_IM_START_TOKEN = "<im_start>"
+DEFAULT_IM_END_TOKEN = "<im_end>"
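As background on how constants like these are commonly used (a hypothetical sketch, not code from this commit): `IMAGE_TOKEN_INDEX` can act as an out-of-vocabulary placeholder spliced into `input_ids` wherever `DEFAULT_IMAGE_TOKEN` appears in a prompt, and `IGNORE_INDEX` is the conventional value for masking tokens out of a cross-entropy loss:

```python
IGNORE_INDEX = -100
IMAGE_TOKEN_INDEX = -200
DEFAULT_IMAGE_TOKEN = "<image>"

def insert_image_placeholder(prompt, tokenize):
    """Tokenize text around DEFAULT_IMAGE_TOKEN and splice in IMAGE_TOKEN_INDEX.
    `tokenize` is any callable mapping a string to a list of token ids."""
    input_ids = []
    chunks = prompt.split(DEFAULT_IMAGE_TOKEN)
    for i, chunk in enumerate(chunks):
        input_ids.extend(tokenize(chunk))
        if i < len(chunks) - 1:            # one placeholder per <image> occurrence
            input_ids.append(IMAGE_TOKEN_INDEX)
    return input_ids

def toy_tokenize(text):
    return [hash(word) % 1000 for word in text.split()]  # stand-in for a real tokenizer

ids = insert_image_placeholder("USER: <image> describe the scene", toy_tokenize)

# Prompt tokens can be excluded from the loss by setting their labels to IGNORE_INDEX.
labels = [IGNORE_INDEX] * len(ids)
print(ids, labels)
```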
