No forward slash at the end of singleton tags in parsed HTML output #354

Open
@15mbp

In html.py, the lines

head_parts = ["<head>", '<meta charset="UTF-8">']
and
'<meta name="generator" content="Docling HTML Serializer">'
emit the singleton (void) tags <meta charset="UTF-8"> and <meta name="generator" content="Docling HTML Serializer"> without a forward slash at the end.

This causes interoperability problems with XML-based parsers: for example, xml.dom.minidom raises an ExpatError when parsing the output.
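The failure can be reproduced without Docling. The snippet below is a minimal sketch that assumes a hand-written HTML string (not the actual serializer output) containing the same unclosed <meta> tag; the helper name parses_as_xml is made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hand-written stand-in for the serializer output (an assumption, not real Docling output).
unclosed = '<html><head><meta charset="UTF-8"></head><body></body></html>'
closed = unclosed.replace('<meta charset="UTF-8">', '<meta charset="UTF-8"/>')


def parses_as_xml(markup: str) -> bool:
    """Return True if minidom can parse the markup, False if it raises ExpatError."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


print(parses_as_xml(unclosed))  # the unclosed <meta> leaves </head> mismatched
print(parses_as_xml(closed))    # the self-closed variant is well-formed XML
```

The unclosed variant fails because an XML parser treats <meta ...> as an opening tag and then meets </head> before any matching </meta>.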
As a workaround, one has to either save the HTML output and edit the file manually (tedious for large volumes) or call Python's str.replace() on the return value of export_to_html(), as shown below (hacky):

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)
html_doc = result.document.export_to_html()  # returns the document as an HTML string

# str.replace() returns a new string, so the result must be assigned back.
html_doc = html_doc.replace('<meta charset="UTF-8">', '<meta charset="UTF-8"/>')
html_doc = html_doc.replace('<meta name="generator" content="Docling HTML Serializer">', '<meta name="generator" content="Docling HTML Serializer"/>')
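If the exact tag text ever changes, the literal replacements above stop matching. A slightly more general workaround is sketched below, assuming only <meta> tags need closing; self_close_meta_tags is a hypothetical helper, not part of Docling, and the sample string is hand-written rather than real serializer output.

```python
import re

# Matches any <meta ...> tag, whether or not it is already self-closed.
# Limitation: an attribute value containing a literal ">" would confuse this pattern.
_META_TAG = re.compile(r"<(meta\b[^<>]*?)\s*/?>", re.IGNORECASE)


def self_close_meta_tags(html: str) -> str:
    """Rewrite every <meta ...> tag as <meta .../> (hypothetical helper)."""
    return _META_TAG.sub(r"<\1/>", html)


# Hand-written sample containing the two tags named in this issue.
sample = (
    '<head><meta charset="UTF-8">'
    '<meta name="generator" content="Docling HTML Serializer"></head>'
)
fixed = self_close_meta_tags(sample)
print(fixed)
```

Because the pattern also matches tags that are already closed and normalizes them, applying the helper twice gives the same result as applying it once. Other void elements (for example <br> or <img>) would need the same treatment, and this sketch does not check whether the rest of the output is well-formed XML.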

I am willing to submit a pull request with the fix to html.py, provided it would be merged.
