Commit d9af04a

[init] initial commit
0 parents  commit d9af04a

31 files changed (+2750, -0 lines changed)

.gitignore

Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
coverage.xml
*.mo
*.pot

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
.idea/
*.iml

# VS Code
.vscode/*
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json

# macOS
.DS_Store

# Windows
Thumbs.db
ehthumbs.db
Desktop.ini

fusion_result.json
kernel_meta/

.pre-commit-config.yaml

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
repos:
  # 1. isort - automatically sort Python imports
  - repo: https://github.com/pycqa/isort
    rev: 6.0.1  # pin a fixed version
    hooks:
      - id: isort
        name: isort (python)
        args: [--profile=black]  # Black-compatible configuration
        language: python

  # 2. Black - automatically format Python code
  - repo: https://github.com/psf/black
    rev: 25.1.0  # pin a fixed version
    hooks:
      - id: black
        language: python

  # 3. flake8 - Python static checks
  - repo: https://github.com/pycqa/flake8
    rev: 7.2.0
    hooks:
      - id: flake8
        args: [--max-line-length=120, --ignore=E203]  # set the maximum line length to 120
        additional_dependencies: [flake8-bugbear==24.12.12]  # optional: stricter checks

  # 4. pre-commit-hooks - general-purpose Git hooks
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace  # remove trailing whitespace
      - id: end-of-file-fixer  # ensure files end with a newline
      - id: check-yaml  # validate YAML syntax
      - id: check-added-large-files  # block large files from being committed
        args: ["--maxkb=512"]
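To illustrate what the `--maxkb=512` threshold on `check-added-large-files` means in practice, here is a minimal Python sketch that flags files larger than 512 KB. It is not the hook's actual implementation: the function name `oversized_files` and the constant `MAX_KB` are made up for this example, and it checks whatever paths it is given rather than only files staged for commit.

```python
from pathlib import Path

# Mirrors the --maxkb=512 argument in the configuration above.
MAX_KB = 512


def oversized_files(paths, max_kb=MAX_KB):
    """Return the given paths whose size on disk exceeds max_kb kilobytes.

    Hypothetical helper for illustration only; the real hook inspects
    files being added to the Git index.
    """
    limit_bytes = max_kb * 1024
    flagged = []
    for path in paths:
        candidate = Path(path)
        # Skip directories and missing paths; only regular files have a size to check.
        if candidate.is_file() and candidate.stat().st_size > limit_bytes:
            flagged.append(path)
    return flagged
```

Calling `oversized_files(["assets/demo.gif", "README.md"])` from the repository root would return only the entries above the limit, in the order they were passed.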

LICENSE

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
MIT License

Copyright 2025 ByteDance Ltd. and/or its affiliates

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
<div align="center">
<img src="./assets/dolphin.png" width="300">
</div>

<div align="center">
<a href="https://arxiv.org/abs/2505.14059">
<img src="https://img.shields.io/badge/Paper-Arxiv-red">
</a>
<a href="https://huggingface.co/ByteDance/Dolphin">
<img src="https://img.shields.io/badge/HuggingFace-Dolphin-yellow">
</a>
<!-- <a href="https://link/of/demo">
<img src="https://img.shields.io/badge/Demo-Coming_Soon-blue">
</a> -->
<a href="https://github.com/bytedance/Dolphin">
<img src="https://img.shields.io/badge/Code-Github-green">
</a>
<a href="https://opensource.org/licenses/MIT">
<img src="https://img.shields.io/badge/License-MIT-lightgray">
</a>
<br>
</div>

<br>

<div align="center">
<img src="./assets/demo.gif" width="800">
</div>

# Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin (**Do**cument Image **P**arsing via **H**eterogeneous Anchor Prompt**in**g) is a novel multimodal document image parsing model that follows an analyze-then-parse paradigm. This repository contains the demo code and pre-trained models for Dolphin.

## 📑 Overview

Document image parsing is challenging because page elements such as text paragraphs, figures, formulas, and tables are complexly intertwined. Dolphin addresses these challenges through a two-stage approach:

1. **🔍 Stage 1**: Comprehensive page-level layout analysis that generates a sequence of elements in natural reading order
2. **🧩 Stage 2**: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts

<div align="center">
<img src="./assets/framework.png" width="680">
</div>

Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.
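
The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration of the analyze-then-parse idea under stated assumptions, not Dolphin's actual code: the `Element` structure, the `PROMPTS` placeholders, and the `analyze_layout` / `parse_batch` callables are all hypothetical stand-ins for the model calls, and only three element kinds are covered.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence, Tuple


@dataclass
class Element:
    """One layout element produced by stage 1 (hypothetical structure)."""
    kind: str                        # "text", "table", or "formula" in this sketch
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) anchor region on the page
    reading_order: int               # position in natural reading order


# Placeholder task-specific prompts; the real prompt strings are not given here.
PROMPTS = {
    "text": "<prompt for text paragraphs>",
    "table": "<prompt for tables>",
    "formula": "<prompt for formulas>",
}


def batched(items: Sequence[Element], size: int) -> Iterator[Sequence[Element]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_page(
    page,
    analyze_layout: Callable[[object], List[Element]],
    parse_batch: Callable[[object, Sequence[Element], List[str]], List[str]],
    max_batch_size: int = 8,
) -> List[str]:
    """Stage 1: find elements in reading order. Stage 2: parse them batch by batch."""
    elements = sorted(analyze_layout(page), key=lambda e: e.reading_order)
    results: List[str] = []
    for batch in batched(elements, max_batch_size):
        # Each element gets the prompt that matches its kind.
        prompts = [PROMPTS[e.kind] for e in batch]
        results.extend(parse_batch(page, batch, prompts))
    return results
```

Grouping elements into batches is what lets stage 2 decode several elements at once; the batch size plays the same role as a `max_batch_size` setting.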

## 🚀 Demo

<!-- Try our demo on [GitHub](https://github.com/ByteDance/Dolphin). -->
A demo is coming soon. Please stay tuned! 🔥


## 📅 Changelog
- 🔥 **2025.05.20** The pre-trained model and inference code of Dolphin are released.

## 🛠️ Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/ByteDance/Dolphin.git
   cd Dolphin
   ```

2. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Download the pre-trained models using one of the following options:

   **Option A: Original Model Format (config-based)**
   Download from [Baidu Yun](https://pan.baidu.com/s/1EbjjTN_lUinCq7tX7hhtfQ?pwd=wb43) or [Google Drive](https://drive.google.com/drive/folders/1PQJ3UutepXvunizZEw-uGaQ0BCzf-mie?usp=sharing) and put them in the `./checkpoints` folder.

   **Option B: Hugging Face Model Format**
   ```bash
   # Download the model from Hugging Face Hub
   git lfs install
   git clone https://huggingface.co/ByteDance/Dolphin ./hf_model
   # Or use the Hugging Face CLI
   huggingface-cli download ByteDance/Dolphin --local-dir ./hf_model
   ```

## ⚡ Inference

Dolphin provides two inference frameworks with support for two parsing granularities:
- **Page-level Parsing**: Parse the entire document image into structured JSON and Markdown output
- **Element-level Parsing**: Parse individual document elements (text, table, formula)

### 📄 Page-level Parsing

#### Using Original Framework (config-based)

```bash
# Process a single document image
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results

# Process all document images in a directory
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs --save_dir ./results

# Process with custom batch size for parallel element decoding
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs --save_dir ./results --max_batch_size 8
```

#### Using Hugging Face Framework

```bash
# Process a single document image
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results

# Process all document images in a directory
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results

# Process with custom batch size for parallel element decoding
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results --max_batch_size 16
```

### 🧩 Element-level Parsing

#### Using Original Framework (config-based)

```bash
# Process a single table image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/table_1.jpeg --element_type table

# Process a single formula image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/line_formula.jpeg --element_type formula

# Process a single text paragraph image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/para_1.jpg --element_type text
```

#### Using Hugging Face Framework

```bash
# Process a single table image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/table_1.jpeg --element_type table

# Process a single formula image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/line_formula.jpeg --element_type formula

# Process a single text paragraph image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/para_1.jpg --element_type text
```
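
If you want to drive the Hugging Face demo scripts from Python rather than the shell, one possible approach is to assemble the same command lines programmatically. The sketch below uses only the script names and flags shown in the commands above; the helper names `page_command` and `element_command` and the `ELEMENT_TYPES` tuple are assumptions made for this example and are not part of the repository.

```python
import sys
from typing import List, Optional

# Element types named in the element-level parsing section above.
ELEMENT_TYPES = ("text", "table", "formula")


def page_command(
    input_path: str,
    save_dir: str = "./results",
    model_path: str = "./hf_model",
    max_batch_size: Optional[int] = None,
) -> List[str]:
    """Build the argv list for page-level parsing with demo_page_hf.py."""
    cmd = [
        sys.executable, "demo_page_hf.py",
        "--model_path", model_path,
        "--input_path", input_path,
        "--save_dir", save_dir,
    ]
    # Only pass the flag when the caller asks for a custom batch size.
    if max_batch_size is not None:
        cmd += ["--max_batch_size", str(max_batch_size)]
    return cmd


def element_command(
    input_path: str,
    element_type: str,
    model_path: str = "./hf_model",
) -> List[str]:
    """Build the argv list for element-level parsing with demo_element_hf.py."""
    if element_type not in ELEMENT_TYPES:
        raise ValueError(f"element_type must be one of {ELEMENT_TYPES}, got {element_type!r}")
    return [
        sys.executable, "demo_element_hf.py",
        "--model_path", model_path,
        "--input_path", input_path,
        "--element_type", element_type,
    ]
```

Either list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Building an argument list instead of a shell string avoids quoting problems with paths that contain spaces.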

## 🌟 Key Features

- 🔄 Two-stage analyze-then-parse approach based on a single VLM
- 📊 Promising performance on document parsing tasks
- 🔍 Natural reading order element sequence generation
- 🧩 Heterogeneous anchor prompting for different document elements
- ⏱️ Efficient parallel parsing mechanism
- 🤗 Support for Hugging Face Transformers for easier integration

## 💖 Acknowledgement

We would like to acknowledge the following open-source projects that provided inspiration and reference for this work:
- [Donut](https://github.com/clovaai/donut/)
- [Nougat](https://github.com/facebookresearch/nougat)
- [GOT](https://github.com/Ucas-HaoranWei/GOT-OCR2.0)
- [MinerU](https://github.com/opendatalab/MinerU/tree/master)
- [Swin](https://github.com/microsoft/Swin-Transformer)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)

## 📝 Citation

If you find this code useful for your research, please cite it using the following BibTeX entry.

```bibtex
@inproceedings{dolphin2025,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and Tang, Jingqun and Liu, Hao and Huang, Can},
  year={2025},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)}
}
```

assets/demo.gif

3.08 MB

assets/dolphin.png

81.3 KB

assets/framework.png

1.91 MB
