Sanitizing filter broken in 0.90 #72

Closed
@gsnedders

http://code.google.com/p/html5lib/issues/detail?id=162

Reported by gdr@garethrees.org, Oct 10, 2010

DESCRIPTION

Consider the following interaction with html5lib 0.90:

    >>> from html5lib import html5parser, serializer, treebuilders, treewalkers
    >>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
    >>> dom = p.parse("""<body onload="sucker()">""") 
    >>> s = serializer.htmlserializer.HTMLSerializer(sanitize = True)
    >>> ''.join(s.serialize(treewalkers.getTreeWalker('dom')(dom)))
    u'<body onload=sucker()>'

This is clearly incorrect: the onload attribute should have been removed by the sanitizer during the serialization.

ANALYSIS

The problem is that there are two sanitizers: a tokenizing sanitizer in html5lib.sanitizer, and a sanitizing filter in html5lib.filters.sanitizer. To avoid duplicating code, both sanitizers inherit from the class HTMLSanitizerMixin and both call that class's sanitize_token method.

Unfortunately, the format of tokens differs between tokenization and filtering. During tokenization, a token looks like this:

    >>> from html5lib import tokenizer
    >>> next(iter(tokenizer.HTMLTokenizer("""<body onload="sucker()">""")))
    {'selfClosing': False, 'data': [[u'onload', u'sucker()']], 'type': 3, 'name': u'body', 'selfClosingAcknowledged': False}

But during filtering, tokens look like this:

    >>> list(iter(treewalkers.getTreeWalker('dom')(dom)))[3]
    {'namespace': u'http://www.w3.org/1999/xhtml', 'type': 'StartTag', 'name': u'body', 'data': [(u'onload', u'sucker()')]}

When the sanitizing filter passes its token to the sanitize_token method of HTMLSanitizerMixin, nothing happens, because sanitize_token is expecting 'type' to be an integer.
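
The mismatch can be illustrated without html5lib at all. The following is only a minimal sketch: the function body, the START_TAG constant, and the DISALLOWED set are hypothetical stand-ins, not the library's actual code. It assumes a sanitizer that, like the one described above, only acts when 'type' is the integer code used by the tokenizer.

```python
# Hypothetical stand-ins; not html5lib's real implementation.
START_TAG = 3             # integer code, as in 'type': 3 from the tokenizer output
DISALLOWED = {"onload"}   # illustrative set of attributes to strip


def sanitize_token(token):
    """Strip disallowed attributes, but only when 'type' is the integer code."""
    if token["type"] == START_TAG:
        token = dict(token)  # copy so the caller's token is left untouched
        token["data"] = [(name, value) for name, value in token["data"]
                         if name not in DISALLOWED]
    # Any other 'type' value, including the string 'StartTag', falls through.
    return token


# Token shaped like the tokenizer's output: integer type.
tokenizer_token = {"type": 3, "name": "body",
                   "data": [["onload", "sucker()"]]}
# Token shaped like the tree walker's output: string type.
walker_token = {"type": "StartTag", "name": "body",
                "data": [("onload", "sucker()")]}

cleaned_tokenizer = sanitize_token(tokenizer_token)
cleaned_walker = sanitize_token(walker_token)
```

Under these assumptions the tokenizer-style token loses its onload attribute, while the tree-walker-style token comes back unchanged, which matches the behaviour reported above.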

OBSERVATION

Having two very similar but subtly different data formats for the same data type is dangerous: how many other incompatibilities are there?

WORKAROUND

I am working around this problem as follows. When I need to sanitize a DOM tree, instead of applying the sanitizing filter I do this:

  1. Serialize the DOM to HTML without sanitization.
  2. Re-parse the HTML from step 1, using the sanitizing tokenizer.
