<div align="center">
  <img src="./assets/dolphin.png" width="300">
</div>

<div align="center">
  <a href="https://arxiv.org/abs/2505.14059">
    <img src="https://img.shields.io/badge/Paper-Arxiv-red">
  </a>
  <a href="https://huggingface.co/ByteDance/Dolphin">
    <img src="https://img.shields.io/badge/HuggingFace-Dolphin-yellow">
  </a>
  <!-- <a href="https://link/of/demo">
    <img src="https://img.shields.io/badge/Demo-Coming_Soon-blue">
  </a> -->
  <a href="https://github.com/bytedance/Dolphin">
    <img src="https://img.shields.io/badge/Code-Github-green">
  </a>
  <a href="https://opensource.org/licenses/MIT">
    <img src="https://img.shields.io/badge/License-MIT-lightgray">
  </a>
  <br>
</div>

<br>

<div align="center">
  <img src="./assets/demo.gif" width="800">
</div>

# Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin (**Do**cument Image **P**arsing via **H**eterogeneous Anchor Prompt**in**g) is a novel multimodal document image parsing model following an analyze-then-parse paradigm. This repository contains the demo code and pre-trained models for Dolphin.

## 📑 Overview

Document image parsing is challenging because elements such as text paragraphs, figures, formulas, and tables are complexly intertwined. Dolphin addresses this challenge with a two-stage approach:

1. **🔍 Stage 1**: Comprehensive page-level layout analysis that generates an element sequence in natural reading order
2. **🧩 Stage 2**: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts (see the sketch at the end of this section)

<div align="center">
  <img src="./assets/framework.png" width="680">
</div>

Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.
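
The snippet below is a minimal, illustrative sketch of this analyze-then-parse flow. The `model.generate` calls, prompt strings, and layout data structure are simplified placeholders rather than the actual interfaces of this repository; `demo_page.py` contains the real implementation.

```python
# Illustrative sketch only: model.generate, the prompt strings, and the layout
# structure are hypothetical placeholders, not this repository's API.
def parse_page(model, page_image):
    # Stage 1: page-level layout analysis returns elements in natural reading order,
    # e.g. [("text", bbox), ("table", bbox), ("formula", bbox), ...]
    layout = model.generate(page_image, prompt="Parse the reading order of this document.")

    # Stage 2: each element crop is parsed with a type-specific (anchor) prompt.
    # The real implementation decodes element crops in parallel batches.
    results = []
    for element_type, bbox in layout:
        crop = page_image.crop(bbox)  # PIL-style crop of the element region
        results.append(model.generate(crop, prompt=f"Parse the {element_type} in the image."))
    return results
```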

## 🚀 Demo

<!-- Try our demo on [GitHub](https://github.com/ByteDance/Dolphin). -->
The demo is coming soon. Please stay tuned! 🔥

## 📅 Changelog
- 🔥 **2025.05.20** The pretrained model and inference code of Dolphin are released.

## 🛠️ Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/ByteDance/Dolphin.git
   cd Dolphin
   ```

2. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Download the pre-trained models using one of the following options (a quick load check for Option B is sketched after this list):

   **Option A: Original Model Format (config-based)**
   Download from [Baidu Yun](https://pan.baidu.com/s/1EbjjTN_lUinCq7tX7hhtfQ?pwd=wb43) or [Google Drive](https://drive.google.com/drive/folders/1PQJ3UutepXvunizZEw-uGaQ0BCzf-mie?usp=sharing) and put them in the `./checkpoints` folder.

   **Option B: Hugging Face Model Format**
   ```bash
   # Download the model from Hugging Face Hub
   git lfs install
   git clone https://huggingface.co/ByteDance/Dolphin ./hf_model
   # Or use the Hugging Face CLI
   huggingface-cli download ByteDance/Dolphin --local-dir ./hf_model
   ```
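
After downloading, you can optionally sanity-check that the Hugging Face checkpoint loads. The snippet below assumes the checkpoint follows a Donut-style vision-encoder-decoder layout loadable with standard `transformers` classes; if it does not load this way, `demo_page_hf.py` shows the loading code the demos actually use.

```python
# Optional sanity check for the checkpoint downloaded to ./hf_model.
# Assumes a Donut-style vision-encoder-decoder layout (see demo_page_hf.py for
# the loading code used by the demos).
from transformers import AutoProcessor, VisionEncoderDecoderModel

processor = AutoProcessor.from_pretrained("./hf_model")
model = VisionEncoderDecoderModel.from_pretrained("./hf_model").eval()
num_params = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Loaded {type(model).__name__} with {num_params:.1f}M parameters")
```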

## ⚡ Inference

Dolphin provides two inference frameworks with support for two parsing granularities:
- **Page-level Parsing**: Parse the entire document image into structured JSON and Markdown formats
- **Element-level Parsing**: Parse individual document elements (text, table, formula)

### 📄 Page-level Parsing

#### Using Original Framework (config-based)

```bash
# Process a single document image
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results

# Process all document images in a directory
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs --save_dir ./results

# Process with custom batch size for parallel element decoding
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs --save_dir ./results --max_batch_size 8
```

#### Using Hugging Face Framework

```bash
# Process a single document image
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results

# Process all document images in a directory
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results

# Process with custom batch size for parallel element decoding
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results --max_batch_size 16
```
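
If you prefer calling the model from Python instead of through `demo_page_hf.py`, the sketch below shows how a single stage-1 (layout analysis) generation pass might look with a standard Donut-style `transformers` setup. The prompt text, tokenization, and decoding arguments are assumptions for illustration; the demo scripts define the exact prompts and post-processing Dolphin expects.

```python
# Rough programmatic sketch under the assumptions stated above; the demo scripts
# remain the reference for prompt formatting and post-processing.
import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

processor = AutoProcessor.from_pretrained("./hf_model")
model = VisionEncoderDecoderModel.from_pretrained("./hf_model").eval()

image = Image.open("./demo/page_imgs/page_1.jpeg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Hypothetical stage-1 prompt; the real prompt text/format lives in the demo code.
prompt_ids = processor.tokenizer(
    "Parse the reading order of this document.",
    add_special_tokens=False,
    return_tensors="pt",
).input_ids

with torch.no_grad():
    outputs = model.generate(
        pixel_values=pixel_values,
        decoder_input_ids=prompt_ids,
        max_length=2048,
    )
print(processor.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```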

### 🧩 Element-level Parsing

#### Using Original Framework (config-based)

```bash
# Process a single table image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/table_1.jpeg --element_type table

# Process a single formula image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/line_formula.jpeg --element_type formula

# Process a single text paragraph image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/para_1.jpg --element_type text
```

#### Using Hugging Face Framework

```bash
# Process a single table image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/table_1.jpeg --element_type table

# Process a single formula image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/line_formula.jpeg --element_type formula

# Process a single text paragraph image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/para_1.jpg --element_type text
```

## 🌟 Key Features

- 🔄 Two-stage analyze-then-parse approach based on a single VLM
- 📊 Promising performance on document parsing tasks
- 🔍 Natural reading order element sequence generation
- 🧩 Heterogeneous anchor prompting for different document elements
- ⏱️ Efficient parallel parsing mechanism
- 🤗 Support for Hugging Face Transformers for easier integration

## 💖 Acknowledgement

We would like to acknowledge the following open-source projects that provided inspiration and reference for this work:
- [Donut](https://github.com/clovaai/donut/)
- [Nougat](https://github.com/facebookresearch/nougat)
- [GOT](https://github.com/Ucas-HaoranWei/GOT-OCR2.0)
- [MinerU](https://github.com/opendatalab/MinerU/tree/master)
- [Swin](https://github.com/microsoft/Swin-Transformer)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)

## 📝 Citation

If you find this code useful for your research, please use the following BibTeX entry.

```bibtex
@inproceedings{dolphin2025,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and Tang, Jingqun and Liu, Hao and Huang, Can},
  year={2025},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)}
}
```