
Fuzzing reveals a number of parse errors #568

Open
@leonardr

I'm the lead developer of Beautiful Soup, which has html5lib as an optional dependency. Over the past couple of years I've gotten a number of notifications from Google's oss-fuzz project about unhandled exceptions that actually turned out to be problems in html5lib. There wasn't much I could do with these errors, but now that it looks like html5lib maintenance is picking up, I can pass them on to you. (Sorry. 😿)

I've incorporated the fuzz reports into the Beautiful Soup test suite, and the test cases themselves are here, but here's a general picture of the problems I see. In each case, I believe that just parsing the bad markup is enough to trigger the error.
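For anyone who wants to try these quickly, here is a rough sketch of a reproduction harness. It is not taken from the Beautiful Soup test suite: the names `FUZZ_CASES`, `run_cases` and `report_with_html5lib` are made up for illustration. The markups are copied from the reports in this issue; testcase 6241471367348224 is left out because its markup is shown with a literal ñ and the underlying byte encoding is not stated. The errors were observed through Beautiful Soup, so whether calling `html5lib.parse` directly raises the same exceptions is an assumption to verify.

```python
# Markups copied from the fuzz reports in this issue, keyed by testcase id.
# (Hypothetical container name; 6241471367348224 omitted, see note above.)
FUZZ_CASES = {
    "4999465949331456": b')<a><math><TR><a><mI><a><p><a>',
    "5843991618256896": b'-<math><sElect><mi><sElect><sElect>',
    "6600557255327744": (
        b'\t<TABLE><<!>;<!><<!>.<lec><th>i><a><mat\x00\x01<mi\x00a><math>>'
        b'<th><mI>chardeta\xff\xff\xff\xff<><th><mI><||||||||A<select><>'
        b'qu?\xbemath><th><mie>qu'
    ),
}


def run_cases(parse, cases):
    """Call parse(markup) for each case.

    Returns {case_id: exception class name, or None if parsing succeeded}.
    """
    results = {}
    for case_id, markup in cases.items():
        try:
            parse(markup)
        except Exception as exc:  # e.g. AssertionError, AttributeError
            results[case_id] = type(exc).__name__
        else:
            results[case_id] = None
    return results


def report_with_html5lib():
    """Print the outcome of each case using html5lib's default tree builder."""
    import html5lib  # imported lazily so run_cases works without it

    for case_id, outcome in run_cases(html5lib.parse, FUZZ_CASES).items():
        print(case_id, outcome or "parsed without error")
```

Calling `report_with_html5lib()` prints one line per testcase. To go through Beautiful Soup instead, as in the original reports, something like `run_cases(lambda m: BeautifulSoup(m, "html5lib"), FUZZ_CASES)` should work once `bs4` is imported.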

clusterfuzz-testcase-minimized-bs4_fuzzer-4999465949331456

Markup: b')<a><math><TR><a><mI><a><p><a>'

Error:

self = <html>, node = <p>, refNode = None

    def insertBefore(self, node, refNode):
>       index = self.element.index(refNode.element)
E       AttributeError: 'NoneType' object has no attribute 'element'

clusterfuzz-testcase-minimized-bs4_fuzzer-5843991618256896

Markup: b'-<math><sElect><mi><sElect><sElect>'

Error:

    def resetInsertionMode(self):
    ...
            # Check for conditions that should only happen in the innerHTML
            # case
            if nodeName in ("select", "colgroup", "head", "html"):
>               assert self.innerHTML
E               AssertionError

clusterfuzz-testcase-minimized-bs4_fuzzer-6241471367348224

Markup: b'ñ<table><svg><html>'

Error:

self = <html5lib.html5parser.getPhases.<locals>.InTablePhase object at 0x7f8f405ad440>

    def processEOF(self):
        if self.tree.openElements[-1].name != "html":
            self.parser.parseError("eof-in-table")
        else:
>           assert self.parser.innerHTML
E           AssertionError

clusterfuzz-testcase-minimized-bs4_fuzzer-6600557255327744

Markup: b'\t<TABLE><<!>;<!><<!>.<lec><th>i><a><mat\x00\x01<mi\x00a><math>><th><mI>chardeta\xff\xff\xff\xff<><th><mI><||||||||A<select><>qu?\xbemath><th><mie>qu'

Error:

self = <html5lib.html5parser.getPhases.<locals>.InTableBodyPhase object at 0x7f8f4184ce00>

    def clearStackToTableBodyContext(self):
        while self.tree.openElements[-1].name not in ("tbody", "tfoot",
                                                      "thead", "html"):
            # self.parser.parseError("unexpected-implied-end-tag-in-table",
            #  {"name": self.tree.openElements[-1].name})
            self.tree.openElements.pop()
        if self.tree.openElements[-1].name == "html":
>           assert self.parser.innerHTML
E           AssertionError

I was also recently sent a report of the problem that has already been filed here as issue #557.
