Description
Requested feature
Add an optional nanonets/Nanonets-OCR-s backend to Docling’s ingestion pipeline so that any PDF or image passed to Docling is converted directly into richly tagged Markdown.
Docling already excels at parsing diverse document formats and preparing them for Gen-AI workflows, but it still relies on conventional OCR engines that output plain text with limited structural hints.(github.com, docling-project.github.io, research.ibm.com) Integrating Nanonets-OCR-s would let Docling emit Markdown that preserves:
- LaTeX maths — inline `$…$` and block `$$…$$` equations are reproduced verbatim, so downstream LLMs can render or reason over formulas without heuristic post-processing.(huggingface.co, nanonets.com, news.ycombinator.com)
- Image semantics — embedded pictures, charts and logos are replaced by `<img>` tags containing concise, model-generated alt-descriptions, dramatically improving RAG pipelines that mix text and vision.(huggingface.co, reddit.com)
- Signature, watermark & checkbox tags — legal or business docs arrive with `<signature>`, `<watermark>` and ☐/☑/☒ markers already in place, enabling precise filtering or redaction rules.(huggingface.co, huggingface.co)
- HTML/Markdown tables — complex multi-row or nested tables are emitted twice (HTML + MD), matching Docling’s existing dual-format philosophy and sparing users from brittle table-reconstruction code.(nanonets.com, huggingface.co)
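To illustrate how the tagged output above could be consumed downstream (this is a sketch, not existing Docling code — the sample Markdown is hypothetical output written in the tag style the model card documents, assuming image descriptions appear between paired `<img>` tags):

```python
import re

# Hypothetical snippet in the semantic-tag style Nanonets-OCR-s is documented to emit.
sample = """Quarterly report ☑ approved ☐ rejected
<img>Bar chart comparing 2023 and 2024 revenue by region</img>
<watermark>CONFIDENTIAL</watermark>
Inline math such as $E = mc^2$ survives verbatim."""

# Pull out model-generated image descriptions, e.g. for a RAG index.
image_alts = re.findall(r"<img>(.*?)</img>", sample, flags=re.DOTALL)

# Map checkbox glyphs to booleans (☑/☒ = checked, ☐ = unchecked).
checkboxes = [(m.group(0), m.group(0) in "☑☒") for m in re.finditer(r"[☐☑☒]", sample)]
```

Because the tags are explicit, a few lines of standard-library regex replace what would otherwise be vision-model heuristics.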
Why this matters for Docling users
- Greatly reduced post-processing: Docling’s current post-OCR “structure inference” heuristics (e.g., detecting table borders or maths blocks) can be removed or simplified, cutting pipeline runtime and maintenance.(github.com, medium.com)
- LLM-ready output out-of-the-box: Rich Markdown with semantic tags feeds directly into retrieval-augmented generation, few-shot tutoring or embeddings without extra parsing layers.(huggingface.co, news.ycombinator.com)
- Consistent developer experience: Nanonets-OCR-s ships via 🤗 Transformers and vLLM, both technologies Docling already uses, so the integration is mostly wiring up model invocation plus a feature flag.(huggingface.co, nanonets.com)
- Future-proofing: The small model runs comfortably on consumer GPUs; when larger variants appear, Docling can expose an `--ocr-model` switch and scale up without API changes.(nanonets.com, huggingface.co)
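A hedged sketch of what the proposed feature-flag wiring could look like. Every name here (`OcrBackend`, `run_ocr`, the stub return values) is hypothetical and not existing Docling API; the backend calls themselves are stubbed out:

```python
from enum import Enum

class OcrBackend(Enum):
    TESSERACT = "tesseract"   # current conventional-OCR default
    NANONETS = "nanonets"     # proposed semantic-Markdown backend

def run_ocr(page_image: bytes, backend: OcrBackend = OcrBackend.TESSERACT) -> str:
    """Dispatch a rendered page image to the selected OCR backend (stubbed)."""
    if backend is OcrBackend.NANONETS:
        # Would invoke nanonets/Nanonets-OCR-s via Transformers or vLLM
        # and return richly tagged Markdown for the page.
        return "<stub: nanonets markdown>"
    # Would invoke the existing Tesseract path and return plain text.
    return "<stub: tesseract text>"
```

The default preserves today’s behaviour, so the new backend stays strictly opt-in.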
Alternatives
| Option | Strengths | Shortcomings vs. Nanonets-OCR-s |
| --- | --- | --- |
| LayoutLMv3 + custom post-processing (arxiv.org, arxiv.org) | Multimodal pre-training; good form understanding | Outputs token positions, not Markdown; LaTeX & checkbox logic must be handcrafted |
| Amazon Textract (docs.aws.amazon.com, aws.amazon.com, aws.amazon.com) | Reliable table & form extraction; managed scaling | Closed source, pay-per-page, no LaTeX, limited watermark/signature tagging |
| Google Document AI (cloud.google.com, cloud.google.com, cloud.google.com) | Wide language support; layout detection | Quotas & hard system limits; no Markdown output; LaTeX not preserved |
| Tesseract + heuristics (status quo) | Zero cost; already in Docling | Plain text only; fragile table detection; no semantic tags |
All alternatives either lack semantic Markdown, require substantial bespoke logic (e.g., to detect `$…$` math or image captions), or impose commercial costs and vendor lock-in. Nanonets-OCR-s uniquely combines open-source licensing, lightweight resource needs, and out-of-the-box semantic tagging, making it the most practical upgrade path for Docling today.
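To make the “bespoke logic” point concrete, here is a minimal sketch (illustrative only, not code from any of the listed systems) of the kind of heuristic a plain-text OCR pipeline needs just to recover inline `$…$` math — deliberately naive and fragile, which is exactly what a backend emitting LaTeX verbatim would make unnecessary:

```python
import re

# Naive heuristic: treat $...$ spans with no internal newline or nested $
# as inline math. Real pipelines need many more special cases than this.
INLINE_MATH = re.compile(r"\$(?!\$)([^$\n]+)\$")

def find_inline_math(text: str) -> list[str]:
    """Return the bodies of inline $...$ math spans found in OCR text."""
    return INLINE_MATH.findall(text)
```

Even this simple pattern misfires on currency amounts and block `$$…$$` equations, hinting at the maintenance burden such heuristics carry.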