Commit 181a2f8

Release v1.0.0

1 parent 7ace501

Note: this is a large commit, so only a subset of the 70 changed files is shown below.

70 files changed, with 2,699 additions and 2,846 deletions.

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ dist
 *.log
 *.log.*
 *.json
+*.jsonl

 # Data
 !**/alpaca-data-conversation.json

README.md

Lines changed: 9 additions & 17 deletions
@@ -14,11 +14,12 @@


 ## Release
-- [6/26] 🔥 [CVPR 2023 Tutorial](https://vlp-tutorial.github.io/) on **Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4**! Please check out [[Slides](https://datarelease.blob.core.windows.net/tutorial/vision_foundation_models_2023/slides/Chunyuan_cvpr2023_tutorial_lmm.pdf)] [[Notes](https://arxiv.org/abs/2306.14895)] [[YouTube](https://youtu.be/mkI7EPD1vp8)] [[Bilibili](https://www.bilibili.com/video/BV1Ng4y1T7v3/)].
-- [6/11] 🔥 We released the preview for the most-requested feature: DeepSpeed and LoRA support! Please see the documentation [here](./docs/LoRA.md).
-- [6/1] 🔥 We released **LLaVA-Med: Large Language and Vision Assistant for Biomedicine**, a step towards building biomedical-domain large language and vision models with GPT-4-level capabilities. Check out the [paper](https://arxiv.org/abs/2306.00890) and [page](https://github.com/microsoft/LLaVA-Med).
-- [5/13] 🔥 Interested in quantifying the emergent **zero-shot OCR** performance of LLaVA and other open-source LMMs? Please check out the paper ["On the Hidden Mystery of OCR in Large Multimodal Models"](https://arxiv.org/abs/2305.07895), where LLaVA consistently outperforms MiniGPT-4 on 17 out of 18 datasets, despite being trained with an order of magnitude less training data.
-- [5/6] 🔥 We are releasing [LLaVA-Lightning-MPT-7B-preview](https://huggingface.co/liuhaotian/LLaVA-Lightning-MPT-7B-preview), based on MPT-7B-Chat! See [here](#LLaVA-MPT-7b) for more details.
+- [7/19] 🔥 We release a major upgrade, including support for LLaMA-2, LoRA training, 4-/8-bit inference, higher resolution (336x336), and a lot more. We also support and verify training with RTX 3090 and RTX A6000. Check out [LLaVA-from-LLaMA-2](https://github.com/haotian-liu/LLaVA/blob/main/docs/LLaVA_from_LLaMA2.md), [release notes](https://github.com/haotian-liu/LLaVA/blob/main/docs/Release_Notes.md#7192023), and our [model zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)!
+- [6/26] [CVPR 2023 Tutorial](https://vlp-tutorial.github.io/) on **Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4**! Please check out [[Slides](https://datarelease.blob.core.windows.net/tutorial/vision_foundation_models_2023/slides/Chunyuan_cvpr2023_tutorial_lmm.pdf)] [[Notes](https://arxiv.org/abs/2306.14895)] [[YouTube](https://youtu.be/mkI7EPD1vp8)] [[Bilibili](https://www.bilibili.com/video/BV1Ng4y1T7v3/)].
+- [6/11] We released the preview for the most-requested feature: DeepSpeed and LoRA support! Please see the documentation [here](./docs/LoRA.md).
+- [6/1] We released **LLaVA-Med: Large Language and Vision Assistant for Biomedicine**, a step towards building biomedical-domain large language and vision models with GPT-4-level capabilities. Check out the [paper](https://arxiv.org/abs/2306.00890) and [page](https://github.com/microsoft/LLaVA-Med).
+- [5/13] Interested in quantifying the emergent **zero-shot OCR** performance of LLaVA and other open-source LMMs? Please check out the paper ["On the Hidden Mystery of OCR in Large Multimodal Models"](https://arxiv.org/abs/2305.07895), where LLaVA consistently outperforms MiniGPT-4 on 17 out of 18 datasets, despite being trained with an order of magnitude less training data.
+- [5/6] We are releasing [LLaVA-Lightning-MPT-7B-preview](https://huggingface.co/liuhaotian/LLaVA-Lightning-MPT-7B-preview), based on MPT-7B-Chat! See [here](#LLaVA-MPT-7b) for more details.
 - [5/2] 🔥 We are releasing LLaVA-Lightning! Train a lite, multimodal GPT-4 with just $40 in 3 hours! See [here](#train-llava-lightning) for more details.
 - [5/2] We upgrade the LLaVA package to v0.1 to support Vicuna v0 and v1 checkpoints; please upgrade following the instructions [here](#install).
 - [4/30] Our checkpoint with Vicuna-7b-v0 has been released [here](#llava-7b)! This checkpoint is more accessible and device friendly. Stay tuned for a major upgrade next week!
@@ -60,7 +61,7 @@ pip install -e .
 3. Install additional packages for training cases
 ```
 pip install ninja
-pip install flash-attn==1.0.2
+pip install flash-attn --no-build-isolation
 ```

 ### Upgrade to latest code base
@@ -359,11 +360,6 @@ For pretraining, we create a concept-balanced subset of LAION-CC-SBU. It consist

 For instruction tuning, we create a subset of LLaVA-Instruct-150K. It consists of 80K image-instruction pairs: 40K conversation and 40K complex reasoning samples, with non-overlapping images. Download `llava_instruct_80k.json` [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_80k.json).

-
-```Shell
-bash ./scripts/train_lightning.sh {v0,v1}
-```
-
 #### Hyperparameters

 1. Pretraining
@@ -403,10 +399,6 @@ python -m llava.serve.gradio_web_server --controller http://localhost:10000

 We use the same training dataset and hyperparameters as the other Lightning checkpoints.

-```Shell
-bash ./scripts/train_lightning_mpt.sh
-```
-
 ### ScienceQA
 **NOTE**: Because the ScienceQA experiments were done earlier, the current checkpoints are trained *without* `<im_start>` and `<im_end>` tokens. Here we provide our training scripts for the current checkpoints.

@@ -569,8 +561,8 @@ python -m llava.eval.model_vqa_science \
     --question-file /path/to/ScienceQA/data/scienceqa/llava_test.json \
     --image-folder /path/to/ScienceQA/data/scienceqa/images/test \
     --answers-file vqa/results/ScienceQA/test_llava-13b.jsonl \
-    --answer-prompter
-    --conv-mode simple
+    --answer-prompter \
+    --conv-mode llava_v0
 ```

 (b) Evaluate the generated responses
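The `--answers-file` above uses the JSON Lines format (one JSON object per line), which is also why `*.jsonl` is now ignored in `.gitignore`. Below is a minimal Python sketch for loading such a file before step (b); the exact field names inside each record depend on the evaluation script and are not assumed here:

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Hypothetical usage with the answers file produced by the command above:
answers_path = Path("vqa/results/ScienceQA/test_llava-13b.jsonl")
if answers_path.exists():
    answers = load_jsonl(answers_path)
    print(f"Loaded {len(answers)} generated responses")
```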

docs/LLaVA_Bench.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# LLaVA-Bench

**-Introduction-** Large commercial multimodal chatbots have been released this week, including

- [Multimodal Bing-Chat by Microsoft](https://blogs.bing.com/search/july-2023/Bing-Chat-Enterprise-announced,-multimodal-Visual-Search-rolling-out-to-Bing-Chat) (July 18, 2023)
- [Multimodal Bard by Google](https://bard.google.com/).

These chatbots are presumably supported by proprietary large multimodal models (LMMs). Compared with open-source LMMs such as LLaVA, proprietary LMMs represent the scaling upper bound of current SoTA techniques. Both share the goal of developing multimodal chatbots that follow human intents to complete various daily-life visual tasks in the wild. While how to evaluate multimodal chat ability remains underexplored, such evaluation provides useful feedback for studying open-source LMMs against commercial multimodal chatbots. In addition to the *LLaVA-Bench (COCO)* dataset we used to develop the early versions of LLaVA, we are releasing *LLaVA-Bench (In-the-Wild)* to the community for public use.

## LLaVA-Bench (In-the-Wild)

To evaluate the model's capability on more challenging tasks and its generalizability to novel domains, we collect a diverse set of 24 images with 60 questions in total, including indoor and outdoor scenes, memes, paintings, sketches, etc., and associate each image with a highly detailed, manually curated description and a proper selection of questions. This design also assesses the model's robustness to different prompts. In this release, we categorize questions into three categories: conversation (simple QA), detailed description, and complex reasoning. We continue to expand and improve the diversity of LLaVA-Bench (In-the-Wild). We manually query Bing-Chat and Bard to get their responses.

### Results

The score is measured by comparing against a reference answer generated by text-only GPT-4, which is given the question along with the ground-truth image annotations as context. A text-only GPT-4 evaluator then rates both answers. We query GPT-4 by putting the reference answer first, followed by the answer generated by the candidate model. We upload images at their original resolution to Bard and Bing-Chat to obtain their results.

| Approach | Conversation | Detail | Reasoning | Overall |
|----------------|--------------|--------|-----------|---------|
| Bard-0718 | 83.7 | 69.7 | 78.7 | 77.8 |
| Bing-Chat-0629 | 59.6 | 52.2 | 90.1 | 71.5 |
| LLaVA-13B-v1-336px-0719 (beam=1) | 64.3 | 55.9 | 81.7 | 70.1 |
| LLaVA-13B-v1-336px-0719 (beam=5) | 68.4 | 59.9 | 84.3 | 73.5 |

Note that Bard sometimes refuses to answer questions about images containing humans, and Bing-Chat blurs human faces in the images. We also provide the benchmark scores for the subset without humans.

| Approach | Conversation | Detail | Reasoning | Overall |
|----------------|--------------|--------|-----------|---------|
| Bard-0718 | 94.9 | 74.3 | 84.3 | 84.6 |
| Bing-Chat-0629 | 55.8 | 53.6 | 93.5 | 72.6 |
| LLaVA-13B-v1-336px-0719 (beam=1) | 62.2 | 56.4 | 82.2 | 70.0 |
| LLaVA-13B-v1-336px-0719 (beam=5) | 65.6 | 61.7 | 85.0 | 73.6 |
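The relative-scoring scheme described above can be illustrated with a small aggregation sketch. This is not the repository's evaluation code; it assumes the GPT-4 judge returns one numeric rating for the reference answer and one for the candidate answer per question, and that the reported number is the candidate-to-reference ratio, expressed as a percentage and averaged per category:

```python
from collections import defaultdict

def aggregate_relative_scores(ratings):
    """ratings: list of (category, reference_rating, candidate_rating) tuples,
    e.g. ("conversation", 9.0, 6.5). Returns per-category and overall scores,
    each computed as 100 * candidate / reference, averaged over questions."""
    per_category = defaultdict(list)
    for category, ref, cand in ratings:
        per_category[category].append(100.0 * cand / ref)
    scores = {c: sum(v) / len(v) for c, v in per_category.items()}
    all_scores = [s for v in per_category.values() for s in v]
    scores["overall"] = sum(all_scores) / len(all_scores)
    return scores

# Toy usage with made-up ratings (three questions, one per category):
demo = [("conversation", 9.0, 6.0), ("detail", 8.0, 4.5), ("reasoning", 9.0, 7.5)]
print(aggregate_relative_scores(demo))
```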

docs/LLaVA_from_LLaMA2.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
# LLaVA (based on Llama 2 LLM, Preview)

*NOTE: This is a technical preview. We are still running a hyperparameter search, and will release the final model soon. If you'd like to contribute to this, please contact us.*

:llama: **-Introduction-** [Llama 2 is an open-source LLM released by Meta AI](https://about.fb.com/news/2023/07/llama-2/) today (July 18, 2023). Compared with its early version [Llama 1](https://ai.meta.com/blog/large-language-model-llama-meta-ai/), Llama 2 offers ***stronger language performance***, a ***longer context window***, and, importantly, is ***commercially usable***! While Llama 2 is changing the LLM market landscape in the language space, its multimodal ability remains unknown. We quickly developed a LLaVA variant based on the latest Llama 2 checkpoints, and release it to the community for public use.

You need to apply for and download the latest Llama 2 checkpoints to start your own training (apply [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)).

## Training

Please check out [`pretrain.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/pretrain.sh), [`finetune.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/finetune.sh), and [`finetune_lora.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/finetune_lora.sh).

## LLaVA (based on Llama 2): What is different?

:volcano: How is the new LLaVA based on Llama 2 different from the Llama 1 version? The differences in the training process are:

- **Pre-training**. The pre-trained base LLM is changed from Llama 1 to Llama 2.
- **Language instruction-tuning**. The previous LLaVA model starts from Vicuna, which is instruction-tuned on ShareGPT data on top of Llama 1; the new LLaVA model starts from Llama 2 Chat, which is instruction-tuned on dialogue data on top of Llama 2.
- **Multimodal instruction-tuning**. The same LLaVA-Lightning process is applied.

### Results

- Llama 2 is better at following role-playing instructions.
- Llama 2 fails to follow translation instructions.

<p align="center">
  <img src="../images/llava_example_cmp.png" width="100%">
</p>

docs/LoRA.md

Lines changed: 9 additions & 5 deletions
@@ -13,16 +13,20 @@ Please execute each of the command below one by one (after the previous one has
 python -m llava.serve.controller --host 0.0.0.0 --port 10000
 ```

-#### Launch a model worker
+#### Launch a gradio web server
 ```Shell
-python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-vicuna-7b-v1.1-lcs_558k-instruct_80k_3e-lora-preview-alpha --model-base /path/to/vicuna-v1.1
+python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
 ```
-Wait until the process finishes loading the model and you see "Uvicorn running on ...".
+You just launched the Gradio web interface. Now you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list; do not worry, as we have not launched any model worker yet. The list will be updated automatically when you launch a model worker.

-#### Launch a gradio web server.
+#### Launch a model worker
 ```Shell
-python -m llava.serve.gradio_web_server --controller http://localhost:10000
+python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-vicuna-7b-v1.1-lcs_558k-instruct_80k_3e-lora-preview-alpha --model-base /path/to/vicuna-v1.1
 ```
+Wait until the process finishes loading the model and you see "Uvicorn running on ...". Now refresh your Gradio web UI, and you will see the model you just launched in the model list.
+
+You can launch as many workers as you want, and compare different model checkpoints in the same Gradio interface. Keep `--controller` the same, and change `--port` and `--worker` to a different port number for each worker.

 ## Training


docs/MODEL_ZOO.md

Lines changed: 11 additions & 2 deletions
@@ -1,3 +1,12 @@
-## Model Zoo
+# Model Zoo
+
+We will add more model checkpoints to the model zoo very soon. Stay tuned!
+
+If you are interested in including any other details in the Model Zoo, please open an issue :)
+
+| Base LLM | Vision Encoder | Pretrain Data | Pretraining schedule | Finetuning Data | Finetuning schedule | LLaVA-Bench-Conv | LLaVA-Bench-Detail | LLaVA-Bench-Complex | LLaVA-Bench-Overall | Download |
+|----------|----------------|---------------|----------------------|-----------------|---------------------|------------------|--------------------|---------------------|---------------------|----------|
+| Vicuna-13B-v1.3 | CLIP-L-336px | LCS-558K | 1e | LLaVA-Instruct-80K | proj-1e, lora-1e | 64.3 | 55.9 | 81.7 | 70.1 | [LoRA](https://huggingface.co/liuhaotian/llava-v1-0719-336px-lora-vicuna-13b-v1.3) [LoRA-Merged](https://huggingface.co/liuhaotian/llava-v1-0719-336px-lora-merge-vicuna-13b-v1.3) |
+| LLaMA-2-13B-Chat | CLIP-L | LCS-558K | 1e | LLaVA-Instruct-80K | full-finetune-1e | 56.7 | 58.6 | 80.0 | 67.9 | [preview](https://huggingface.co/liuhaotian/llava-llama-2-13b-chat-lightning-preview) |
+| LLaMA-2-7B-Chat | CLIP-L | LCS-558K | 1e | LLaVA-Instruct-80K | lora-1e | 51.2 | 58.9 | 71.6 | 62.8 | [preview](https://huggingface.co/liuhaotian/llava-llama-2-7b-chat-lightning-lora-preview) |

-Coming soon.

docs/Release_Notes.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# Release Notes

We document release notes here for each update.

### 7/19/2023

The first major update since our initial release!

- Added **LLaMA-2** support
- **Full LoRA support**. To make model training more accessible, we release a set of model weights based on LoRA, which supports training on academic resources (e.g., 4x A6000s or 8x 3090s, **without the need for CPU offloading**)
- A more versatile design for training large multimodal models, including swapping in different language models and vision encoders, with more coming soon
- Support for higher-resolution input using CLIP-ViT-L-336px as the vision encoder, for more detailed visual understanding
- Ablated and cleaned up some design choices to make training simpler and smoother
- Full DeepSpeed support
- Improved model checkpoint saving during the pretraining stage to save disk space
- Improved WebUI interface
- Improved support for inference with multiple GPUs
- Support for inference with 4-bit and 8-bit quantization
- Support for interactive CLI inference

We train all models in this release using LLaVA-LCS-558K for pretraining and LLaVA-Instruct-80K for instruction tuning, to maintain an efficient and affordable training budget. **The full training (including both pretraining and finetuning) can be completed within 6 hours on 8x 3090s.**

*We hope this release further benefits the community and makes large multimodal models more accessible.*

#### Detailed Changes (7/19/2023)

- Tokenization. We remove the dependency on the additional tokens (`<IM_START>`, `<IM_END>`, `<IM_PATCH>`), so that during the pretraining stage the tokenizer does not change at all and we only update the linear projector weights (see the sketch after this list).
- Prompt.
  - Pretraining. We simplified the pretraining prompts by removing additional instructions like `Describe the image details`, which we find allows zero-shot inference and can slightly improve training speed.
  - We keep the train/test prompt consistent, which we find slightly improves the model's performance during inference.
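To make the point about updating only the linear projector concrete, here is a minimal, hypothetical PyTorch sketch of projector-only pretraining. The module names (`vision_tower`, `mm_projector`, `language_model`) and layer sizes are illustrative assumptions, not the repository's actual classes:

```python
import torch.nn as nn

class TinyLLaVAStub(nn.Module):
    """Toy stand-in for the real architecture; names and sizes are illustrative only."""
    def __init__(self, vision_dim=1024, hidden_dim=4096):
        super().__init__()
        self.vision_tower = nn.Linear(768, vision_dim)            # placeholder vision encoder
        self.mm_projector = nn.Linear(vision_dim, hidden_dim)     # the linear projector
        self.language_model = nn.Linear(hidden_dim, hidden_dim)   # placeholder LLM

model = TinyLLaVAStub()

# Freeze everything, then unfreeze only the projector: with no new tokens added,
# the tokenizer and embeddings stay untouched during pretraining.
for p in model.parameters():
    p.requires_grad = False
for p in model.mm_projector.parameters():
    p.requires_grad = True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['mm_projector.weight', 'mm_projector.bias']
```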

images/llava_example_cmp.png

317 KB

llava/constants.py

Lines changed: 8 additions & 0 deletions
@@ -2,3 +2,11 @@
 WORKER_HEART_BEAT_INTERVAL = 15

 LOGDIR = "."
+
+# Model Constants
+IGNORE_INDEX = -100
+IMAGE_TOKEN_INDEX = -200
+DEFAULT_IMAGE_TOKEN = "<image>"
+DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
+DEFAULT_IM_START_TOKEN = "<im_start>"
+DEFAULT_IM_END_TOKEN = "<im_end>"
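As background on how constants like these are commonly used (a hypothetical sketch, not code from this commit): `IMAGE_TOKEN_INDEX` can act as an out-of-vocabulary placeholder spliced into `input_ids` wherever `DEFAULT_IMAGE_TOKEN` appears in a prompt, and `IGNORE_INDEX` is the conventional value for masking tokens out of a cross-entropy loss:

```python
IGNORE_INDEX = -100
IMAGE_TOKEN_INDEX = -200
DEFAULT_IMAGE_TOKEN = "<image>"

def insert_image_placeholder(prompt, tokenize):
    """Tokenize text around DEFAULT_IMAGE_TOKEN and splice in IMAGE_TOKEN_INDEX.
    `tokenize` is any callable mapping a string to a list of token ids."""
    input_ids = []
    chunks = prompt.split(DEFAULT_IMAGE_TOKEN)
    for i, chunk in enumerate(chunks):
        input_ids.extend(tokenize(chunk))
        if i < len(chunks) - 1:            # one placeholder per <image> occurrence
            input_ids.append(IMAGE_TOKEN_INDEX)
    return input_ids

def toy_tokenize(text):
    return [hash(word) % 1000 for word in text.split()]  # stand-in for a real tokenizer

ids = insert_image_placeholder("USER: <image> describe the scene", toy_tokenize)

# Prompt tokens can be excluded from the loss by setting their labels to IGNORE_INDEX.
labels = [IGNORE_INDEX] * len(ids)
print(ids, labels)
```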
