HTML parsing issue in partition_html #3856

Open
@meetFarmanUllah

Describe the bug
I am trying to parse Markdown files for chunking. I first tried partition_md, but because of several open issues with it I could not parse my Markdown file directly, so I rendered the file to HTML with markdown-it and then passed the result to partition_html. The problem is that a `<strong>` tag inside a `<p>` tag is classified as a Title by partition_html, which breaks chunking with chunk_by_title because every bold paragraph starts a new section.
To Reproduce
```python
import json

from unstructured.partition.html import partition_html

# Markup assumed from the description: a <strong> tag inside a <p> tag.
text = "<p><strong>Example:</strong></p>"

elements = partition_html(text=text)
element_dict = [el.to_dict() for el in elements]
print(json.dumps(element_dict, indent=2))
```

Expected behavior
The element should be parsed as NarrativeText, not as Title.
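
The expectation above can be written as a small check on the `to_dict()` output. This is only a sketch: `find_titles` is a hypothetical helper, not part of unstructured, and the two sample lists are illustrative stand-ins for the reported and the expected result, not captured output. It assumes only that each dict has `"type"` and `"text"` keys.

```python
def find_titles(element_dicts):
    """Return the text of every element whose "type" is "Title"."""
    return [d["text"] for d in element_dicts if d.get("type") == "Title"]


# Illustrative inputs mirroring the behavior described in this issue
# (assumed shapes, not real partition_html output).
reported = [{"type": "Title", "text": "Example:"}]
expected = [{"type": "NarrativeText", "text": "Example:"}]

print(find_titles(reported))  # ['Example:']
print(find_titles(expected))  # []
```

With the reproduction snippet, `find_titles(element_dict)` should return an empty list once the bold paragraph is no longer treated as a title.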

Environment Info
Name: unstructured
Version: 0.16.11
Python 3.11.9
