bug/Bold characters get repeated while extracting #3864

Open
@gauri-nagavkar

Describe the bug
I'm trying to read a PDF file that contains both bold and normal text. The normal text is read correctly, but every character of the bold text is repeated.

For example, BOLD TEXT is read as BBOOLLDD TTEEXXTT.

To Reproduce

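import json

# Assumed setup, not shown in the original snippet: SDK imports and client creation.
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(api_key_auth="YOUR_API_KEY")  # placeholder key

filename = "example_files/creatinine.pdf"  # cannot share this file because it contains confidential information
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    pdf_infer_table_structure=True,
    languages=["eng"],
)
try:
    resp = s.general.partition(req)
    # Element 19 is the one holding the bold ">60" value in this file.
    print(json.dumps(resp.elements[19], indent=2))
except SDKError as e:
    print(e)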

Expected behavior
The output of the above code should be as follows:

{
  "type": "NarrativeText",
  "element_id": "681ea37fceaad7479d246b8ccc52ec2d",
  "text": ">60",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 2,
    "parent_id": "e72be637f803a9bf4509b64448ff1133",
    "filename": "creatinine.pdf"
  }
}

But because the text >60 is bold in the PDF, the actual output looks like this:

{
  "type": "NarrativeText",
  "element_id": "681ea37fceaad7479d246b8ccc52ec2d",
  "text": ">60>60",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 2,
    "parent_id": "e72be637f803a9bf4509b64448ff1133",
    "filename": "creatinine.pdf"
  }
}
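As a stopgap while the extraction itself is unfixed, both symptoms described here (per-character doubling such as BBOOLLDD TTEEXXTT, and whole-string doubling such as >60>60) could be masked in post-processing. The following is only a rough sketch: the function names are hypothetical and not part of unstructured or its client, and the heuristic cannot tell a duplicated string from a genuine one such as "haha" or "1111", so it should only be applied to elements already known to be affected.

```python
def _collapse_repeated_halves(text: str) -> str:
    """Return the first half if the string is two identical halves (">60>60" -> ">60")."""
    half, remainder = divmod(len(text), 2)
    if remainder == 0 and half > 0 and text[:half] == text[half:]:
        return text[:half]
    return text


def _collapse_doubled_chars(word: str) -> str:
    """Return every second character if each character appears twice in a row."""
    if len(word) < 2 or len(word) % 2 != 0:
        return word
    all_pairs_match = all(word[i] == word[i + 1] for i in range(0, len(word), 2))
    return word[::2] if all_pairs_match else word


def dedupe_bold_text(text: str) -> str:
    """Hypothetical workaround: undo the duplication seen on bold text.

    First checks for a whole-string repeat; if none is found, checks each
    space-separated word for per-character doubling.
    """
    collapsed = _collapse_repeated_halves(text)
    if collapsed != text:
        return collapsed
    return " ".join(_collapse_doubled_chars(word) for word in text.split(" "))


if __name__ == "__main__":
    print(dedupe_bold_text("BBOOLLDD TTEEXXTT"))
    print(dedupe_bold_text(">60>60"))
    print(dedupe_bold_text("Normal text"))
```

With the response from the snippet above, this could be applied to a single element's text, for example `dedupe_bold_text(resp.elements[19]["text"])`, assuming the elements are plain dictionaries as the `json.dumps` call suggests.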

Screenshots
Here's a screenshot from the PDF showing >60 in bold:
image

Here's a screenshot of the code and the output:
image


Labels: bug (Something isn't working)
