Description
Requested feature
Add an optional nanonets/Nanonets-OCR-s backend to Docling’s ingestion pipeline so that any PDF or image passed to Docling is converted directly into richly tagged Markdown.
Docling already excels at parsing diverse document formats and preparing them for Gen-AI workflows, but it still relies on conventional OCR engines that output plain text with limited structural hints.(github.com, docling-project.github.io, research.ibm.com) Integrating Nanonets-OCR-s would let Docling emit Markdown that preserves:
- LaTeX maths — inline `$…$` and block `$$…$$` equations are reproduced verbatim, so downstream LLMs can render or reason over formulas without heuristic post-processing.(huggingface.co, nanonets.com, news.ycombinator.com)
- Image semantics — embedded pictures, charts and logos are replaced by `<img>` tags containing concise, model-generated alt-descriptions, dramatically improving RAG pipelines that mix text and vision.(huggingface.co, reddit.com)
- Signature, watermark & checkbox tags — legal or business docs arrive with `<signature>`, `<watermark>` and ☐/☑/☒ markers already in place, enabling precise filtering or redaction rules.(huggingface.co, huggingface.co)
- HTML/Markdown tables — complex multi-row or nested tables are emitted twice (HTML + MD), matching Docling’s existing dual-format philosophy and sparing users from brittle table-reconstruction code.(nanonets.com, huggingface.co)
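To illustrate how the tagged output above could be consumed downstream (this is a sketch, not existing Docling code — the sample Markdown is hypothetical output written in the tag style the model card documents, assuming image descriptions appear between paired `<img>` tags):

```python
import re

# Hypothetical snippet in the semantic-tag style Nanonets-OCR-s is documented to emit.
sample = """Quarterly report ☑ approved ☐ rejected
<img>Bar chart comparing 2023 and 2024 revenue by region</img>
<watermark>CONFIDENTIAL</watermark>
Inline math such as $E = mc^2$ survives verbatim."""

# Pull out model-generated image descriptions, e.g. for a RAG index.
image_alts = re.findall(r"<img>(.*?)</img>", sample, flags=re.DOTALL)

# Map checkbox glyphs to booleans (☑/☒ = checked, ☐ = unchecked).
checkboxes = [(m.group(0), m.group(0) in "☑☒") for m in re.finditer(r"[☐☑☒]", sample)]
```

Because the tags are explicit, a few lines of standard-library regex replace what would otherwise be vision-model heuristics.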
Why this matters for Docling users
- Greatly reduced post-processing: Docling’s current post-OCR “structure inference” heuristics (e.g., detecting table borders or maths blocks) can be removed or simplified, cutting pipeline runtime and maintenance.(github.com, medium.com)
- LLM-ready output out-of-the-box: Rich Markdown with semantic tags feeds directly into retrieval-augmented generation, few-shot tutoring or embeddings without extra parsing layers.(huggingface.co, news.ycombinator.com)
- Consistent developer experience: Nanonets-OCR-s ships via 🤗 Transformers and vLLM, both technologies Docling already uses, so the integration is mostly wiring up model invocation plus a feature flag.(huggingface.co, nanonets.com)
- Future-proofing: The small model runs comfortably on consumer GPUs; when larger variants appear, Docling can expose an `--ocr-model` switch and scale up without API changes.(nanonets.com, huggingface.co)
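A hedged sketch of what the proposed feature-flag wiring could look like. Every name here (`OcrBackend`, `run_ocr`, the stub return values) is hypothetical and not existing Docling API; the backend calls themselves are stubbed out:

```python
from enum import Enum

class OcrBackend(Enum):
    TESSERACT = "tesseract"   # current conventional-OCR default
    NANONETS = "nanonets"     # proposed semantic-Markdown backend

def run_ocr(page_image: bytes, backend: OcrBackend = OcrBackend.TESSERACT) -> str:
    """Dispatch a rendered page image to the selected OCR backend (stubbed)."""
    if backend is OcrBackend.NANONETS:
        # Would invoke nanonets/Nanonets-OCR-s via Transformers or vLLM
        # and return richly tagged Markdown for the page.
        return "<stub: nanonets markdown>"
    # Would invoke the existing Tesseract path and return plain text.
    return "<stub: tesseract text>"
```

The default preserves today’s behaviour, so the new backend stays strictly opt-in.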
Alternatives
| Option | Strengths | Shortcomings vs. Nanonets-OCR-s |
| --- | --- | --- |
| LayoutLMv3 + custom post-processing (arxiv.org, arxiv.org) | Multimodal pre-training; good form understanding | Outputs token positions, not Markdown; LaTeX & checkbox logic must be handcrafted |
| Amazon Textract (docs.aws.amazon.com, aws.amazon.com, aws.amazon.com) | Reliable table & form extraction; managed scaling | Closed source, pay-per-page, no LaTeX, limited watermark/signature tagging |
| Google Document AI (cloud.google.com, cloud.google.com, cloud.google.com) | Wide language support; layout detection | Quotas & hard system limits; no Markdown output; LaTeX not preserved |
| Tesseract + heuristics (status quo) | Zero cost; already in Docling | Plain text only; fragile table detection; no semantic tags |
All alternatives either lack semantic Markdown, require substantial bespoke logic (e.g., to detect `$…$` math or image captions), or impose commercial costs and vendor lock-in. Nanonets-OCR-s uniquely combines open-source licensing, lightweight resource needs, and out-of-the-box semantic tagging, making it the most practical upgrade path for Docling today.
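To make the “bespoke logic” point concrete, here is a minimal sketch (illustrative only, not code from any of the listed systems) of the kind of heuristic a plain-text OCR pipeline needs just to recover inline `$…$` math — deliberately naive and fragile, which is exactly what a backend emitting LaTeX verbatim would make unnecessary:

```python
import re

# Naive heuristic: treat $...$ spans with no internal newline or nested $
# as inline math. Real pipelines need many more special cases than this.
INLINE_MATH = re.compile(r"\$(?!\$)([^$\n]+)\$")

def find_inline_math(text: str) -> list[str]:
    """Return the bodies of inline $...$ math spans found in OCR text."""
    return INLINE_MATH.findall(text)
```

Even this simple pattern misfires on currency amounts and block `$$…$$` equations, hinting at the maintenance burden such heuristics carry.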