Description
Describe the bug
I'm trying to read a PDF file that contains both bold and normal text. The normal text is read correctly, but every character of the bold text is repeated.
For example, BOLD TEXT is read as BBOOLLDD TTEEXXTT.
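To make the symptom concrete, here is a minimal sketch of how the doubled pattern from the example above could be detected and collapsed after the fact. The helper names `collapse_doubled_word` and `collapse_doubled_chars` are hypothetical and not part of unstructured. The sketch assumes words are separated by single spaces and that spaces themselves are not doubled, as in the example. It is a workaround for inspection only, not a fix, and it would also collapse a genuine word made entirely of pairs (such as "oo").

```python
def collapse_doubled_word(word: str) -> str:
    """Collapse a word like 'BBOOLLDD' to 'BOLD'.

    The word is only changed when it consists entirely of adjacent
    identical pairs; otherwise it is returned unchanged.
    """
    if len(word) == 0 or len(word) % 2 != 0:
        return word
    if all(word[i] == word[i + 1] for i in range(0, len(word), 2)):
        return word[::2]  # keep one character from each pair
    return word


def collapse_doubled_chars(text: str) -> str:
    """Apply collapse_doubled_word to each space-separated word."""
    return " ".join(collapse_doubled_word(w) for w in text.split(" "))


print(collapse_doubled_chars("BBOOLLDD TTEEXXTT"))
print(collapse_doubled_chars("normal text"))
```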
To Reproduce
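import json

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Client setup was not part of the original snippet; the key is a placeholder.
s = UnstructuredClient(api_key_auth="YOUR_API_KEY")

filename = "example_files/creatinine.pdf"  # cannot share this file because it contains confidential information
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy='hi_res',
    pdf_infer_table_structure=True,
    languages=["eng"],
)

try:
    resp = s.general.partition(req)
    print(json.dumps(resp.elements[19], indent=2))
except SDKError as e:
    print(e)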
Expected behavior
The output of the above code should be as follows:
{
  "type": "NarrativeText",
  "element_id": "681ea37fceaad7479d246b8ccc52ec2d",
  "text": ">60",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 2,
    "parent_id": "e72be637f803a9bf4509b64448ff1133",
    "filename": "creatinine.pdf"
  }
}
But since the text >60 is BOLD in the pdf, the output looks like this:
{
  "type": "NarrativeText",
  "element_id": "681ea37fceaad7479d246b8ccc52ec2d",
  "text": ">60>60",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 2,
    "parent_id": "e72be637f803a9bf4509b64448ff1133",
    "filename": "creatinine.pdf"
  }
}
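In this element the whole string appears twice in a row (">60>60") rather than each character being doubled. As a rough sketch, assuming that is the only form the duplication takes here, a check like the following could flag or collapse such values while the bug is open. The function name `collapse_repeated_half` is hypothetical and not part of unstructured, and it would also shorten text that legitimately repeats itself (for example "byebye").

```python
def collapse_repeated_half(text: str) -> str:
    """Return the first half of text if the second half is an exact copy.

    Strings of odd length, empty strings, and strings whose halves
    differ are returned unchanged.
    """
    half, remainder = divmod(len(text), 2)
    if remainder == 0 and half > 0 and text[:half] == text[half:]:
        return text[:half]
    return text


print(collapse_repeated_half(">60>60"))
print(collapse_repeated_half(">60"))
```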
Screenshots
Here's a screenshot from the pdf showing >60 in bold