
Add an optional nanonets/Nanonets-OCR-s backend to Docling’s ingestion pipeline so that any PDF or image passed to Docling is converted directly into richly-tagged Markdown. #1799

Open

@Soliver84

Description

Requested feature

Add an optional nanonets/Nanonets-OCR-s backend to Docling’s ingestion pipeline so that any PDF or image passed to Docling is converted directly into richly-tagged Markdown.
Docling already excels at parsing diverse document formats and preparing them for Gen-AI workflows, but it still relies on conventional OCR engines that output plain text with limited structural hints.(github.com, docling-project.github.io, research.ibm.com) Integrating Nanonets-OCR-s would let Docling emit Markdown that preserves:

  • LaTeX maths — inline $…$ and block $$…$$ equations are reproduced verbatim, so downstream LLMs can render or reason over formulas without heuristic post-processing.(huggingface.co, nanonets.com, news.ycombinator.com)

  • Image semantics — embedded pictures, charts and logos are replaced by <img> tags containing concise, model-generated alt-descriptions, dramatically improving RAG pipelines that mix text and vision.(huggingface.co, reddit.com)

  • Signature, watermark & checkbox tags — legal or business documents arrive with <signature>, <watermark> and ☐/☑/☒ markers already in place, enabling precise filtering or redaction rules.(huggingface.co)

  • HTML/Markdown tables — complex multi-row or nested tables are emitted twice (HTML + MD), matching Docling’s existing dual-format philosophy and sparing users from brittle table-reconstruction code.(nanonets.com, huggingface.co)
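
The signature, watermark and checkbox tags above lend themselves to simple downstream filtering. A minimal sketch of what a redaction rule could look like — the tag names follow the model's output format as described above, but the helper functions themselves are hypothetical and not part of Docling:

```python
import re

def redact_semantic_tags(markdown: str) -> str:
    """Strip <watermark> spans and redact <signature> spans.

    Hypothetical post-processing helper; the tag names mirror those
    Nanonets-OCR-s is said to emit, the behaviour is illustrative.
    """
    # Drop watermark text entirely (e.g. "<watermark>DRAFT</watermark>").
    markdown = re.sub(r"<watermark>.*?</watermark>", "", markdown, flags=re.S)
    # Replace signature contents with a placeholder.
    markdown = re.sub(r"<signature>.*?</signature>", "[signature redacted]",
                      markdown, flags=re.S)
    return markdown

def checked_boxes(markdown: str) -> int:
    """Count checkbox markers rendered as checked (☑ or ☒)."""
    return sum(markdown.count(c) for c in "☑☒")

doc = "Agreed: ☑  Declined: ☐ <signature>J. Doe</signature>"
print(redact_semantic_tags(doc))
print(checked_boxes(doc))
```

Because the tags are explicit in the Markdown, rules like these stay trivial regexes instead of layout heuristics.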

Why this matters for Docling users

  1. Greatly reduced post-processing: Docling’s current post-OCR “structure inference” heuristics (e.g., detecting table borders or maths blocks) can be removed or simplified, cutting pipeline runtime and maintenance.(github.com, medium.com)

  2. LLM-ready output out-of-the-box: Rich Markdown with semantic tags feeds directly into retrieval-augmented generation, few-shot tutoring or embeddings without extra parsing layers.(huggingface.co, news.ycombinator.com)

  3. Consistent developer experience: Nanonets-OCR-s ships via 🤗 Transformers and vLLM, both technologies Docling already uses, so the integration is largely a matter of wiring up model invocation behind a feature flag.(huggingface.co, nanonets.com)

  4. Future-proofing: The small model runs comfortably on consumer GPUs; when larger variants appear, Docling can expose a --ocr-model switch and scale up without API changes.(nanonets.com, huggingface.co)
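
The --ocr-model switch mentioned in point 4 could be sketched as below. The flag name and the list of choices are illustrative, not an existing Docling CLI option:

```python
import argparse

# Hypothetical sketch of a backend-selection flag; defaults and choices
# are assumptions for illustration, not Docling's actual CLI surface.
parser = argparse.ArgumentParser(prog="docling")
parser.add_argument("source", help="PDF or image to convert")
parser.add_argument(
    "--ocr-model",
    default="tesseract",
    choices=["tesseract", "easyocr", "nanonets/Nanonets-OCR-s"],
    help="OCR backend; VLM backends emit richly-tagged Markdown",
)

args = parser.parse_args(["scan.pdf", "--ocr-model", "nanonets/Nanonets-OCR-s"])
print(args.ocr_model)
```

Keeping the backend behind one flag means larger model variants can be added later as new choices without any API changes.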


Alternatives

| Option | Strengths | Shortcomings vs. Nanonets-OCR-s |
| --- | --- | --- |
| LayoutLMv3 + custom post-processing (arxiv.org) | Multimodal pre-training; good form understanding | Outputs token positions, not Markdown; LaTeX & checkbox logic must be handcrafted |
| Amazon Textract (docs.aws.amazon.com, aws.amazon.com) | Reliable table & form extraction; managed scaling | Closed source, pay-per-page, no LaTeX, limited watermark/signature tagging |
| Google Document AI (cloud.google.com) | Wide language support; layout detection | Quotas & hard system limits; no Markdown output; LaTeX not preserved |
| Tesseract + heuristics (status quo) | Zero cost; already in Docling | Plain text only; fragile table detection; no semantic tags |

All alternatives either lack semantic Markdown, require substantial bespoke logic (e.g., to detect $…$ math or image captions), or impose commercial costs and vendor lock-in. Nanonets-OCR-s uniquely combines open-source licensing, lightweight resource needs, and out-of-the-box semantic tagging, making it the most practical upgrade path for Docling today.
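
The dual-format table output also makes table reconstruction mechanical rather than heuristic. A minimal sketch, assuming the flat `<table><tr><td>` shape a recogniser typically emits (the class and function names are hypothetical; a real converter would also emit the `| --- |` header-separator row):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect cell text from a simple, non-nested HTML table."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

def to_markdown(html: str) -> str:
    # Re-serialise the HTML variant of a dual HTML+MD table as Markdown rows.
    p = TableRows()
    p.feed(html)
    return "\n".join("| " + " | ".join(r) + " |" for r in p.rows)

print(to_markdown(
    "<table><tr><th>A</th><th>B</th></tr><tr><td>1</td><td>2</td></tr></table>"
))
```

When the OCR backend already emits well-formed HTML alongside Markdown, this kind of conversion is a few lines of stdlib code instead of brittle border-detection logic.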


Labels: enhancement (New feature or request)
Issue #1799 · docling-project/docling