http://code.google.com/p/html5lib/issues/detail?id=162
Reported by gdr@garethrees.org, Oct 10, 2010
DESCRIPTION
Consider the following interaction with html5lib 0.90:
>>> from html5lib import html5parser, serializer, treebuilders, treewalkers
>>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
>>> dom = p.parse("""<body onload="sucker()">""")
>>> s = serializer.htmlserializer.HTMLSerializer(sanitize = True)
>>> ''.join(s.serialize(treewalkers.getTreeWalker('dom')(dom)))
u'<body onload=sucker()>'
This is clearly incorrect: the onload attribute should have been removed by the sanitizer during serialization.
ANALYSIS
The problem is that there are two sanitizers: a tokenizing sanitizer in html5lib.sanitizer, and a sanitizing filter in html5lib.filters.sanitizer. To avoid duplicating code, both inherit from the class HTMLSanitizerMixin and call that class's sanitize_token method.
Unfortunately, the format of tokens differs between tokenization and filtering. During tokenization, a token looks like this:
>>> from html5lib import tokenizer
>>> next(iter(tokenizer.HTMLTokenizer("""<body onload="sucker()">""")))
{'selfClosing': False, 'data': [[u'onload', u'sucker()']], 'type': 3, 'name': u'body', 'selfClosingAcknowledged': False}
But during filtering, tokens look like this:
>>> list(iter(treewalkers.getTreeWalker('dom')(dom)))[3]
{'namespace': u'http://www.w3.org/1999/xhtml', 'type': 'StartTag', 'name': u'body', 'data': [(u'onload', u'sucker()')]}
When the sanitizing filter passes its token to the sanitize_token method of HTMLSanitizerMixin, nothing happens: sanitize_token expects 'type' to be an integer, so the string 'StartTag' never matches and the token passes through unmodified.
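The mismatch can be seen in a small self-contained sketch. This adapter function and its type-code table are hypothetical, not part of html5lib; the only numeric code taken from the source is 3 for a start tag, which appears in the tokenizer dump above. It converts a tree-walker token (string 'type', tuple attribute pairs) into the shape the tokenizer produces (integer 'type', list-of-list attribute pairs):

```python
# Hypothetical adapter between the two token formats shown above.
# 'StartTag' -> 3 matches the tokenizer dump; any other codes would
# have to be looked up in html5lib's constants, so only this one is
# included here.
TOKEN_TYPE_CODES = {'StartTag': 3}

def filter_token_to_tokenizer_token(token):
    """Return a copy of a tree-walker token in tokenizer format."""
    converted = dict(token)
    type_name = converted['type']
    if type_name in TOKEN_TYPE_CODES:
        converted['type'] = TOKEN_TYPE_CODES[type_name]
    if isinstance(converted.get('data'), list):
        # Tokenizer tokens store attributes as [name, value] lists,
        # not (name, value) tuples.
        converted['data'] = [list(pair) for pair in converted['data']]
    return converted

walker_token = {'namespace': u'http://www.w3.org/1999/xhtml',
                'type': 'StartTag',
                'name': u'body',
                'data': [(u'onload', u'sucker()')]}
print(filter_token_to_tokenizer_token(walker_token))
```

A sanitize_token written against the integer format would then work on the converted token; a real fix inside html5lib would instead make sanitize_token accept both formats.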
OBSERVATION
Having two very similar but subtly different data formats for the same data type is dangerous: how many other incompatibilities are there?
WORKAROUND
When I need to apply a sanitizing filter to a DOM tree, I work around the problem as follows:
1. Serialize the DOM to HTML without sanitization.
2. Re-parse the HTML from step 1, using the sanitizing tokenizer.
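The two steps above can be sketched like this (a sketch against the 0.90-era API; it assumes the sanitizing tokenizer is sanitizer.HTMLSanitizer passed via HTMLParser's tokenizer argument, and the helper name sanitize_dom is made up for illustration):

```python
from html5lib import html5parser, serializer, sanitizer, treebuilders, treewalkers

def sanitize_dom(dom):
    # Step 1: serialize the DOM to HTML *without* sanitization
    # (the sanitize=True path is the one that silently fails).
    walker = treewalkers.getTreeWalker('dom')
    s = serializer.htmlserializer.HTMLSerializer()
    html = ''.join(s.serialize(walker(dom)))
    # Step 2: re-parse the HTML through the tokenizing sanitizer,
    # which does strip attributes such as onload.
    p = html5parser.HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
                               tree=treebuilders.getTreeBuilder('dom'))
    return p.parse(html)
```

This round-trip is wasteful (an extra serialize and parse per tree), but it routes the tokens through the code path where sanitize_token actually recognizes the token format.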