Open
Description
I have noticed that in html.py, the lines
<meta charset="UTF-8">
and <meta name="generator" content="Docling HTML Serializer">
lack a forward slash at the end, i.e. the tags are not self-closed.
This creates interoperability issues with strict XML-based parsers, e.g. xml.dom.minidom
raises an ExpatError when parsing the output.
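For reference, here is a minimal reproduction sketch of the parsing failure using only the standard library. `SAMPLE` is a hand-written stand-in for the generated head section, not actual Docling output, and `try_parse` is a hypothetical helper introduced just for this illustration.

```python
from typing import Optional
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hand-written stand-in for the serializer's <head>; NOT actual Docling output.
SAMPLE = (
    "<html><head>"
    '<meta charset="UTF-8">'
    '<meta name="generator" content="Docling HTML Serializer">'
    "</head><body><p>Hello</p></body></html>"
)


def try_parse(html: str) -> Optional[str]:
    """Return the expat error message, or None if the markup parses as XML."""
    try:
        minidom.parseString(html)
    except ExpatError as exc:
        return str(exc)
    return None


# The unclosed <meta> tags make the markup ill-formed as XML,
# so this prints an error message rather than None.
print(try_parse(SAMPLE))
```

The same markup parses without error once both `<meta>` tags end in `/>`.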
As a workaround, one has to either save the HTML output and edit the file manually (tedious for large volumes) or call the Python str.replace()
method on the return value of export_to_html(),
as shown below (hacky):
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)
html_doc = result.document.export_to_html() # returns the document as an HTML string
# str.replace() returns a new string, so the result must be assigned back
html_doc = html_doc.replace('<meta charset="UTF-8">', '<meta charset="UTF-8"/>')
html_doc = html_doc.replace('<meta name="generator" content="Docling HTML Serializer">', '<meta name="generator" content="Docling HTML Serializer"/>')
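A slightly less brittle variant of the same workaround, as a sketch: instead of matching the two exact strings, a regular expression self-closes every `<meta ...>` tag. `self_close_meta` is a hypothetical helper (not part of Docling), and the sample string is hand-written for illustration. It assumes no `>` character appears inside a `<meta>` attribute value.

```python
import re
from xml.dom import minidom

# Matches <meta ...>, <meta .../> and <meta ... />, but not e.g. <metadata>.
_META_TAG = re.compile(r"<meta\b([^>]*?)\s*/?>")


def self_close_meta(html: str) -> str:
    """Rewrite every <meta ...> tag so that it ends in '/>'.

    Assumption: no '>' character occurs inside a <meta> attribute value.
    """
    return _META_TAG.sub(r"<meta\1/>", html)


# Hand-written stand-in for the serializer's output; NOT actual Docling output.
sample = (
    "<html><head>"
    '<meta charset="UTF-8">'
    '<meta name="generator" content="Docling HTML Serializer">'
    "</head><body><p>Hello</p></body></html>"
)

fixed = self_close_meta(sample)
# Raises xml.parsers.expat.ExpatError if the markup is still ill-formed.
dom = minidom.parseString(fixed)
print(len(dom.getElementsByTagName("meta")))
```

Tags that are already self-closed are left unchanged, so the helper can be applied repeatedly. This only addresses the `<meta>` tags discussed here; other constructs that are valid HTML but not valid XML would still need separate handling.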
I am willing to submit a pull request with the fix to html.py myself,
if it would be reviewed and merged.