
Commit 9ccab83

Update scripts and ScienceQA instructions

1 parent 053c284

9 files changed: +55 −97 lines

docs/ScienceQA.md

Lines changed: 12 additions & 87 deletions
@@ -5,115 +5,40 @@
 2. Generate ScienceQA dataset for LLaVA conversation-style format.

 ```Shell
-python scripts/convert_sqa_to_llava \
+python scripts/convert_sqa_to_llava.py \
     convert_to_llava \
     --base-dir /path/to/ScienceQA/data/scienceqa \
+    --prompt-format "QCM-LEA" \
     --split {train,val,minival,test,minitest}
 ```

 #### Training
-**NOTE**: Due to that ScienceQA experiments were done earlier, the current checkpoints are trained *without* `<im_start>` and `<im_end>` tokens. Here we provide our training scripts for the current checkpoints.

-<details>
-<summary>1. Pretraining</summary>
+1. Pretraining

-```Shell
-torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
-    llava/train/train_mem.py \
-    --model_name_or_path ./checkpoints/llama-vicuna-13b \
-    --data_path /path/to/cc3m_595k.json \
-    --image_folder /path/to/cc3m_595k \
-    --vision_tower openai/clip-vit-large-patch14 \
-    --tune_mm_mlp_adapter True \
-    --mm_vision_select_layer -2 \
-    --bf16 True \
-    --output_dir ./checkpoints/llava-13b-pretrain-no_im_start_end_token \
-    --num_train_epochs 1 \
-    --per_device_train_batch_size 16 \
-    --per_device_eval_batch_size 4 \
-    --gradient_accumulation_steps 1 \
-    --evaluation_strategy "no" \
-    --save_strategy "steps" \
-    --save_steps 2400 \
-    --save_total_limit 1 \
-    --learning_rate 2e-3 \
-    --weight_decay 0. \
-    --warmup_ratio 0.03 \
-    --lr_scheduler_type "cosine" \
-    --logging_steps 1 \
-    --tf32 True \
-    --model_max_length 2048 \
-    --gradient_checkpointing True \
-    --lazy_preprocess True \
-    --report_to wandb
-```
-</details>
+You can download our pretrained projector weights from our [Model Zoo](), or train your own projector weights using [`pretrain.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/pretrain.sh).

-<details>
-<summary>2. Finetuning</summary>
+2. Finetuning

-You may download our pretrained `llava-13b-v0-pretrain-no_im_start_end_token.bin` [here](https://huggingface.co/liuhaotian/LLaVA-13b-pretrain-projector-v0/blob/main/LLaVA-13b-pretrain-projector-v0-CC3M-595K-original_caption-no_im_token.bin).
-
-```Shell
-torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
-    llava/train/train_mem.py \
-    --model_name_or_path /path/to/llama-vicuna-13b \
-    --data_path /path/to/scienceqa/llava_train_QCM-LEPA.json \
-    --image_folder /path/to/scienceqa/images/train \
-    --vision_tower openai/clip-vit-large-patch14 \
-    --pretrain_mm_mlp_adapter ./checkpoints/llava-13b-pretrain-no_im_start_end_token/mm_projector.bin \
-    --mm_vision_select_layer -2 \
-    --bf16 True \
-    --output_dir ./checkpoints/llava-13b-pretrain-no_im_start_end_token-finetune_scienceqa \
-    --num_train_epochs 12 \
-    --per_device_train_batch_size 4 \
-    --per_device_eval_batch_size 4 \
-    --gradient_accumulation_steps 1 \
-    --evaluation_strategy "no" \
-    --save_strategy "steps" \
-    --save_steps 5000 \
-    --save_total_limit 3 \
-    --learning_rate 2e-5 \
-    --weight_decay 0. \
-    --warmup_ratio 0.03 \
-    --lr_scheduler_type "cosine" \
-    --logging_steps 1 \
-    --tf32 True \
-    --fsdp "full_shard auto_wrap" \
-    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
-    --model_max_length 2048 \
-    --gradient_checkpointing True \
-    --lazy_preprocess True \
-    --report_to wandb
-```
-</details>
+See [`finetune_sqa.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/finetune_sqa.sh).

 #### Evaluation
-
-1. Download our pretrained LLaVA-13B (delta) weights for ScienceQA dataset [here](https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0-science_qa). Convert the delta weights to actual weights.
-
-```Shell
-python -m llava.model.apply_delta \
-    --base /path/to/llama-13b \
-    --target /path/to/LLaVA-13b-v0-science_qa \
-    --delta liuhaotian/LLaVA-13b-delta-v0-science_qa
 ```

-2. [Option 1] Multiple-GPU inference
+1. Multiple-GPU inference
 You may evaluate this with multiple GPUs, and concatenate the generated jsonl files. Please refer to our script for [batch evaluation](https://github.com/haotian-liu/LLaVA/blob/main/scripts/sqa_eval_batch.sh) and [results gathering](https://github.com/haotian-liu/LLaVA/blob/main/scripts/sqa_eval_gather.sh).

-3. [Option 2] Single-GPU inference
+2. Single-GPU inference

 (a) Generate LLaVA responses on ScienceQA dataset

 ```Shell
 python -m llava.eval.model_vqa_science \
-    --model-path /path/to/LLaVA-13b-v0-science_qa \
-    --question-file /path/to/ScienceQA/data/scienceqa/llava_test.json \
+    --model-path liuhaotian/llava-lcs558k-scienceqa-vicuna-13b-v1.3 \
+    --question-file /path/to/ScienceQA/data/scienceqa/llava_test_QCM-LEA.json \
     --image-folder /path/to/ScienceQA/data/scienceqa/images/test \
     --answers-file vqa/results/ScienceQA/test_llava-13b.jsonl \
-    --answer-prompter \
-    --conv-mode llava_v0
+    --conv-mode llava_v1
 ```

 (b) Evaluate the generated responses

@@ -126,4 +51,4 @@ python eval_science_qa.py \
     --output-result vqa/results/ScienceQA/test_llava-13b_result.json \
 ```

-For reference, we attach our prediction file [`test_llava-13b_result.json`](https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/table/results/test_sqa_llava_13b_v0.json) for comparison when reproducing our results, as well as for further analysis in detail.
+For reference, we attach our prediction file [`test_sqa_llava_13b_v0.json`](https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/table/results/test_sqa_llava_13b_v0.json) for comparison when reproducing our results, as well as for further analysis in detail.
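
The hunk for step (b) shows only the tail of the `eval_science_qa.py` command. For orientation, a reconstructed full invocation consistent with the paths above might look like this; every flag except `--output-result` is our assumption, not text from the diff:

```Shell
# Reconstructed sketch of step (b). Only --output-result appears in the
# hunk above; the remaining flag names and paths are assumptions.
python eval_science_qa.py \
    --base-dir /path/to/ScienceQA/data/scienceqa \
    --result-file vqa/results/ScienceQA/test_llava-13b.jsonl \
    --output-file vqa/results/ScienceQA/test_llava-13b_output.json \
    --output-result vqa/results/ScienceQA/test_llava-13b_result.json
```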

scripts/convert_sqa_to_llava.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 from convert_sqa_to_llava_base_prompt import build_prompt_chatbot


-def convert_to_llava(base_dir, split, prompt_format="QCM-LEPA"):
+def convert_to_llava(base_dir, split, prompt_format="QCM-LEA"):
     split_indices = json.load(open(os.path.join(base_dir, "pid_splits.json")))[split]
     problems = json.load(open(os.path.join(base_dir, "problems.json")))
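
This change makes the script's default prompt format match the updated docs. A usage sketch, mirroring the conversion command above (we read `QCM-LEA` as question/context/choices in, lecture/explanation/answer out; the `train` split is just an example):

```Shell
# Regenerate the LLaVA-format training split with the new default
# QCM-LEA prompt format; other splits work the same way.
python scripts/convert_sqa_to_llava.py \
    convert_to_llava \
    --base-dir /path/to/ScienceQA/data/scienceqa \
    --prompt-format "QCM-LEA" \
    --split train
```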

scripts/finetune.sh

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 ################## LLaMA-2 ##################

 deepspeed llava/train/train_mem.py \
-    --deepspeed /path/to/deepspeed.json \
+    --deepspeed ./scripts/zero2.json \
     --model_name_or_path ./checkpoints/$MODEL_VERSION \
     --version $PROMPT_VERSION \
     --data_path ./playground/data/llava_instruct_80k.json \
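
This and the following scripts now point `--deepspeed` at the checked-in `./scripts/zero2.json` instead of a placeholder path. For readers without the repo at hand, a minimal ZeRO stage-2 config along these lines would be accepted by the HuggingFace Trainer's DeepSpeed integration; this is a sketch written to a hypothetical `my_zero2.json`, not the repo's actual file:

```Shell
# Sketch of a minimal ZeRO stage-2 DeepSpeed config (NOT the repo's
# zero2.json). "auto" lets the HuggingFace Trainer fill in each value
# from its own command-line arguments.
cat > my_zero2.json <<'EOF'
{
    "fp16": {"enabled": "auto"},
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "contiguous_gradients": true
    }
}
EOF
```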

scripts/finetune_full_schedule.sh

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 ################## LLaMA-2 ##################

 deepspeed llava/train/train_mem.py \
-    --deepspeed /path/to/deepspeed.json \
+    --deepspeed ./scripts/zero2.json \
     --model_name_or_path ./checkpoints/$MODEL_VERSION \
     --version $PROMPT_VERSION \
     --data_path ./playground/data/llava_instruct_158k.json \

scripts/finetune_lora.sh

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 ################## LLaMA-2 ##################

 deepspeed llava/train/train_mem.py \
-    --deepspeed /path/to/deepspeed.json \
+    --deepspeed ./scripts/zero2.json \
     --lora_enable True \
     --model_name_or_path ./checkpoints/$MODEL_VERSION \
     --version $PROMPT_VERSION \

scripts/finetune_qlora.sh

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 ################## LLaMA-2 ##################

 deepspeed llava/train/train_mem.py \
-    --deepspeed /path/to/deepspeed_zero2.json \
+    --deepspeed ./scripts/zero2.json \
     --lora_enable True \
     --bits 4 \
     --model_name_or_path ./checkpoints/$MODEL_VERSION \

scripts/finetune_sqa.sh

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+#!/bin/bash
+
+deepspeed llava/train/train_mem.py \
+    --deepspeed ./scripts/zero2.json \
+    --model_name_or_path lmsys/vicuna-13b-v1.3 \
+    --version $PROMPT_VERSION \
+    --data_path /Data/ScienceQA/data/scienceqa/llava_train_QCM-LEA.json \
+    --image_folder /Data/ScienceQA/data/scienceqa/images/train \
+    --vision_tower openai/clip-vit-large-patch14 \
+    --pretrain_mm_mlp_adapter ./checkpoints/huggingface/liuhaotian/llava-pretrain-vicuna-13b-v1.3/mm_projector.bin \
+    --mm_vision_select_layer -2 \
+    --mm_use_im_start_end False \
+    --mm_use_im_patch_token False \
+    --bf16 True \
+    --output_dir ./checkpoints/llava-vicuna-13b-v1.3-pretrain_lcs558k_plain-ScienceQA_QCM_LEA-12e \
+    --num_train_epochs 12 \
+    --per_device_train_batch_size 16 \
+    --per_device_eval_batch_size 4 \
+    --gradient_accumulation_steps 1 \
+    --evaluation_strategy "no" \
+    --save_strategy "steps" \
+    --save_steps 50000 \
+    --save_total_limit 1 \
+    --learning_rate 2e-5 \
+    --weight_decay 0. \
+    --warmup_ratio 0.03 \
+    --lr_scheduler_type "cosine" \
+    --logging_steps 1 \
+    --tf32 True \
+    --model_max_length 2048 \
+    --gradient_checkpointing True \
+    --dataloader_num_workers 4 \
+    --lazy_preprocess True \
+    --report_to wandb
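
Note that the new script reads `$PROMPT_VERSION` without defining it, so the value has to come from the environment. A hypothetical launch (the `v1` value is our assumption, chosen to match the `llava_v1` conv-mode used at evaluation time):

```Shell
# Hypothetical launch of the new script; PROMPT_VERSION is not set
# inside finetune_sqa.sh, so it must be supplied by the caller.
PROMPT_VERSION=v1 bash scripts/finetune_sqa.sh
```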

scripts/pretrain.sh

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ PROMPT_VERSION=plain
 ########### DO NOT CHANGE ###########

 deepspeed llava/train/train_mem.py \
-    --deepspeed /path/to/deepspeed.json \
+    --deepspeed ./scripts/zero2.json \
     --model_name_or_path ./checkpoints/$MODEL_VERSION \
     --version $PROMPT_VERSION \
     --data_path /path/to/pretrain_data.json \

scripts/sqa_eval_batch.sh

Lines changed: 3 additions & 4 deletions
@@ -3,12 +3,11 @@
 CHUNKS=8
 for IDX in {0..7}; do
     CUDA_VISIBLE_DEVICES=$IDX python -m llava.eval.model_vqa_science \
-        --model-path ./checkpoints/LLaVA-13b-v0-science_qa \
-        --question-file ~/haotian/datasets/ScienceQA/data/scienceqa/llava_test_QCM-LEPA.json \
+        --model-path liuhaotian/llava-lcs558k-scienceqa-vicuna-13b-v1.3 \
+        --question-file ~/haotian/datasets/ScienceQA/data/scienceqa/llava_test_QCM-LEA.json \
         --image-folder ~/haotian/datasets/ScienceQA/data/scienceqa/images/test \
         --answers-file ./test_llava-13b-chunk${CHUNKS}_${IDX}.jsonl \
         --num-chunks $CHUNKS \
         --chunk-idx $IDX \
-        --answer-prompter \
-        --conv-mode llava_v0 &
+        --conv-mode llava_v1 &
 done
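
The ScienceQA docs say to concatenate the per-chunk jsonl files before scoring. A minimal gathering step consistent with the file names this loop produces might look like the following; it is a sketch, not the repo's `sqa_eval_gather.sh`:

```Shell
# Sketch of the gathering step (not the actual sqa_eval_gather.sh):
# concatenate the per-chunk answer files emitted by the loop above
# into one jsonl file for eval_science_qa.py.
CHUNKS=8
output=./test_llava-13b.jsonl
> "$output"
for IDX in $(seq 0 $((CHUNKS - 1))); do
    cat "./test_llava-13b-chunk${CHUNKS}_${IDX}.jsonl" >> "$output"
done
```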
