http://code.google.com/p/html5lib/issues/detail?id=162
Reported by gdr@garethrees.org, Oct 10, 2010
DESCRIPTION
Consider the following interaction with html5lib 0.90:
>>> from html5lib import html5parser, serializer, treebuilders, treewalkers
>>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
>>> dom = p.parse("""<body onload="sucker()">""")
>>> s = serializer.htmlserializer.HTMLSerializer(sanitize = True)
>>> ''.join(s.serialize(treewalkers.getTreeWalker('dom')(dom)))
u'<body onload=sucker()>'
This is clearly incorrect: the onload attribute should have been removed by the sanitizer during serialization.
ANALYSIS
The problem is that there are two sanitizers: a tokenizing sanitizer in html5lib.sanitizer, and a sanitizing filter in html5lib.filters.sanitizer. To avoid duplicating code, both inherit from the class HTMLSanitizerMixin and call that class's sanitize_token method.
Unfortunately, the format of tokens differs between tokenization and filtering. During tokenization, a token looks like this:
>>> from html5lib import tokenizer
>>> next(iter(tokenizer.HTMLTokenizer("""<body onload="sucker()">""")))
{'selfClosing': False, 'data': [[u'onload', u'sucker()']], 'type': 3, 'name': u'body', 'selfClosingAcknowledged': False}
But during filtering, tokens look like this:
>>> list(iter(treewalkers.getTreeWalker('dom')(dom)))[3]
{'namespace': u'http://www.w3.org/1999/xhtml', 'type': 'StartTag', 'name': u'body', 'data': [(u'onload', u'sucker()')]}
When the sanitizing filter passes its token to the sanitize_token method of HTMLSanitizerMixin, nothing happens: sanitize_token expects 'type' to be an integer, so the string 'StartTag' never matches and the token passes through unmodified.
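The mismatch can be seen in a small self-contained sketch. This adapter function and its type-code table are hypothetical, not part of html5lib; the only numeric code taken from the source is 3 for a start tag, which appears in the tokenizer dump above. It converts a tree-walker token (string 'type', tuple attribute pairs) into the shape the tokenizer produces (integer 'type', list-of-list attribute pairs):

```python
# Hypothetical adapter between the two token formats shown above.
# 'StartTag' -> 3 matches the tokenizer dump; any other codes would
# have to be looked up in html5lib's constants, so only this one is
# included here.
TOKEN_TYPE_CODES = {'StartTag': 3}

def filter_token_to_tokenizer_token(token):
    """Return a copy of a tree-walker token in tokenizer format."""
    converted = dict(token)
    type_name = converted['type']
    if type_name in TOKEN_TYPE_CODES:
        converted['type'] = TOKEN_TYPE_CODES[type_name]
    if isinstance(converted.get('data'), list):
        # Tokenizer tokens store attributes as [name, value] lists,
        # not (name, value) tuples.
        converted['data'] = [list(pair) for pair in converted['data']]
    return converted

walker_token = {'namespace': u'http://www.w3.org/1999/xhtml',
                'type': 'StartTag',
                'name': u'body',
                'data': [(u'onload', u'sucker()')]}
print(filter_token_to_tokenizer_token(walker_token))
```

A sanitize_token written against the integer format would then work on the converted token; a real fix inside html5lib would instead make sanitize_token accept both formats.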
OBSERVATION
Having two very similar but subtly different data formats for the same data type is dangerous: how many other incompatibilities are there?
WORKAROUND
When I need to apply a sanitizing filter to a DOM tree, I work around the problem as follows:
1. Serialize the DOM to HTML without sanitization.
2. Re-parse the HTML from step 1, using the sanitizing tokenizer.
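The two steps above can be sketched like this (a sketch against the 0.90-era API; it assumes the sanitizing tokenizer is sanitizer.HTMLSanitizer passed via HTMLParser's tokenizer argument, and the helper name sanitize_dom is made up for illustration):

```python
from html5lib import html5parser, serializer, sanitizer, treebuilders, treewalkers

def sanitize_dom(dom):
    # Step 1: serialize the DOM to HTML *without* sanitization
    # (the sanitize=True path is the one that silently fails).
    walker = treewalkers.getTreeWalker('dom')
    s = serializer.htmlserializer.HTMLSerializer()
    html = ''.join(s.serialize(walker(dom)))
    # Step 2: re-parse the HTML through the tokenizing sanitizer,
    # which does strip attributes such as onload.
    p = html5parser.HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
                               tree=treebuilders.getTreeBuilder('dom'))
    return p.parse(html)
```

This round-trip is wasteful (an extra serialize and parse per tree), but it routes the tokens through the code path where sanitize_token actually recognizes the token format.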