<div align="center">
  <img src="./assets/dolphin.png" width="300">
</div>

<div align="center">
  <a href="https://arxiv.org/abs/2505.14059">
    <img src="https://img.shields.io/badge/Paper-Arxiv-red">
  </a>
  <a href="https://huggingface.co/ByteDance/Dolphin">
    <img src="https://img.shields.io/badge/HuggingFace-Dolphin-yellow">
  </a>
  <!-- <a href="https://link/of/demo">
    <img src="https://img.shields.io/badge/Demo-Coming_Soon-blue">
  </a> -->
  <a href="https://github.com/bytedance/Dolphin">
    <img src="https://img.shields.io/badge/Code-Github-green">
  </a>
  <a href="https://opensource.org/licenses/MIT">
    <img src="https://img.shields.io/badge/License-MIT-lightgray">
  </a>
  <br>
</div>

<br>

<div align="center">
  <img src="./assets/demo.gif" width="800">
</div>

# Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin (**Do**cument Image **P**arsing via **H**eterogeneous Anchor Prompt**in**g) is a novel multimodal document image parsing model following an analyze-then-parse paradigm. This repository contains the demo code and pre-trained models for Dolphin.

## 📑 Overview

Document image parsing is challenging because elements such as text paragraphs, figures, formulas, and tables are complexly intertwined. Dolphin addresses this challenge with a two-stage approach:

1. **🔍 Stage 1**: Comprehensive page-level layout analysis that generates an element sequence in natural reading order
2. **🧩 Stage 2**: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts (see the sketch at the end of this section)

<div align="center">
  <img src="./assets/framework.png" width="680">
</div>

Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.
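
The snippet below is a minimal, illustrative sketch of this analyze-then-parse flow. The `model.generate` calls, prompt strings, and layout data structure are simplified placeholders rather than the actual interfaces of this repository; `demo_page.py` contains the real implementation.

```python
# Illustrative sketch only: model.generate, the prompt strings, and the layout
# structure are hypothetical placeholders, not this repository's API.
def parse_page(model, page_image):
    # Stage 1: page-level layout analysis returns elements in natural reading order,
    # e.g. [("text", bbox), ("table", bbox), ("formula", bbox), ...]
    layout = model.generate(page_image, prompt="Parse the reading order of this document.")

    # Stage 2: each element crop is parsed with a type-specific (anchor) prompt.
    # The real implementation decodes element crops in parallel batches.
    results = []
    for element_type, bbox in layout:
        crop = page_image.crop(bbox)  # PIL-style crop of the element region
        results.append(model.generate(crop, prompt=f"Parse the {element_type} in the image."))
    return results
```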

## 🚀 Demo

<!-- Try our demo on [GitHub](https://github.com/ByteDance/Dolphin). -->
The demo is coming soon. Please stay tuned! 🔥

## 📅 Changelog
- 🔥 **2025.05.20** The pretrained model and inference code of Dolphin are released.

## 🛠️ Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/ByteDance/Dolphin.git
   cd Dolphin
   ```

2. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Download the pre-trained models using one of the following options (a quick load check for Option B is sketched after this list):

   **Option A: Original Model Format (config-based)**
   Download from [Baidu Yun](https://pan.baidu.com/s/1EbjjTN_lUinCq7tX7hhtfQ?pwd=wb43) or [Google Drive](https://drive.google.com/drive/folders/1PQJ3UutepXvunizZEw-uGaQ0BCzf-mie?usp=sharing) and put them in the `./checkpoints` folder.

   **Option B: Hugging Face Model Format**
   ```bash
   # Download the model from Hugging Face Hub
   git lfs install
   git clone https://huggingface.co/ByteDance/Dolphin ./hf_model
   # Or use the Hugging Face CLI
   huggingface-cli download ByteDance/Dolphin --local-dir ./hf_model
   ```
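
After downloading, you can optionally sanity-check that the Hugging Face checkpoint loads. The snippet below assumes the checkpoint follows a Donut-style vision-encoder-decoder layout loadable with standard `transformers` classes; if it does not load this way, `demo_page_hf.py` shows the loading code the demos actually use.

```python
# Optional sanity check for the checkpoint downloaded to ./hf_model.
# Assumes a Donut-style vision-encoder-decoder layout (see demo_page_hf.py for
# the loading code used by the demos).
from transformers import AutoProcessor, VisionEncoderDecoderModel

processor = AutoProcessor.from_pretrained("./hf_model")
model = VisionEncoderDecoderModel.from_pretrained("./hf_model").eval()
num_params = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Loaded {type(model).__name__} with {num_params:.1f}M parameters")
```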

## ⚡ Inference

Dolphin provides two inference frameworks with support for two parsing granularities:
- **Page-level Parsing**: Parse the entire document image into structured JSON and Markdown formats
- **Element-level Parsing**: Parse individual document elements (text, table, formula)

### 📄 Page-level Parsing

#### Using Original Framework (config-based)

```bash
# Process a single document image
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results

# Process all document images in a directory
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs --save_dir ./results

# Process with custom batch size for parallel element decoding
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs --save_dir ./results --max_batch_size 8
```

#### Using Hugging Face Framework

```bash
# Process a single document image
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results

# Process all document images in a directory
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results

# Process with custom batch size for parallel element decoding
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results --max_batch_size 16
```
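
If you prefer calling the model from Python instead of through `demo_page_hf.py`, the sketch below shows how a single stage-1 (layout analysis) generation pass might look with a standard Donut-style `transformers` setup. The prompt text, tokenization, and decoding arguments are assumptions for illustration; the demo scripts define the exact prompts and post-processing Dolphin expects.

```python
# Rough programmatic sketch under the assumptions stated above; the demo scripts
# remain the reference for prompt formatting and post-processing.
import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

processor = AutoProcessor.from_pretrained("./hf_model")
model = VisionEncoderDecoderModel.from_pretrained("./hf_model").eval()

image = Image.open("./demo/page_imgs/page_1.jpeg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Hypothetical stage-1 prompt; the real prompt text/format lives in the demo code.
prompt_ids = processor.tokenizer(
    "Parse the reading order of this document.",
    add_special_tokens=False,
    return_tensors="pt",
).input_ids

with torch.no_grad():
    outputs = model.generate(
        pixel_values=pixel_values,
        decoder_input_ids=prompt_ids,
        max_length=2048,
    )
print(processor.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```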

### 🧩 Element-level Parsing

#### Using Original Framework (config-based)

```bash
# Process a single table image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/table_1.jpeg --element_type table

# Process a single formula image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/line_formula.jpeg --element_type formula

# Process a single text paragraph image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/para_1.jpg --element_type text
```

#### Using Hugging Face Framework

```bash
# Process a single table image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/table_1.jpeg --element_type table

# Process a single formula image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/line_formula.jpeg --element_type formula

# Process a single text paragraph image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/para_1.jpg --element_type text
```

## 🌟 Key Features

- 🔄 Two-stage analyze-then-parse approach based on a single VLM
- 📊 Promising performance on document parsing tasks
- 🔍 Natural reading order element sequence generation
- 🧩 Heterogeneous anchor prompting for different document elements
- ⏱️ Efficient parallel parsing mechanism
- 🤗 Support for Hugging Face Transformers for easier integration

## 💖 Acknowledgement

We would like to acknowledge the following open-source projects that provided inspiration and reference for this work:
- [Donut](https://github.com/clovaai/donut/)
- [Nougat](https://github.com/facebookresearch/nougat)
- [GOT](https://github.com/Ucas-HaoranWei/GOT-OCR2.0)
- [MinerU](https://github.com/opendatalab/MinerU/tree/master)
- [Swin](https://github.com/microsoft/Swin-Transformer)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)

## 📝 Citation

If you find this code useful for your research, please use the following BibTeX entry.

```bibtex
@inproceedings{dolphin2025,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and Tang, Jingqun and Liu, Hao and Huang, Can},
  year={2025},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)}
}
```