
xml.etree.ElementTree: file source must be binary for non-UTF-8 encodings #99064

Open
@coproc

Documentation

The documentation for xml.etree.ElementTree.parse describes the first argument, source, as follows: "... source is a filename or file object containing XML data. ..."

For file objects, however, this does not work in every case one might expect. A hint like the following is missing:

For file objects containing XML data with non-ASCII and non-UTF-8 encoding (e.g. ISO 8859-1), the file must have been opened in binary mode.

Otherwise (if the file is opened in text mode, regardless of the specified encoding), non-ASCII characters are not read correctly (see this question on Stack Overflow and the attached files in test_parseXml.zip for reproducing the problem).

Here is an excerpt of the attached test code:

import xml.etree.ElementTree  # a bare "import xml" does not load the submodule

# ok: file opened in binary mode
with open('test_ISO-8859-1.xml', 'rb') as fileInBinary:
    root = xml.etree.ElementTree.parse(fileInBinary).getroot()
print(root.attrib['attributeWithUmlauts'])

# garbage: file opened in text mode, even with the matching encoding
with open('test_ISO-8859-1.xml', 'r', encoding='ISO-8859-1') as fileInAscii:
    root = xml.etree.ElementTree.parse(fileInAscii).getroot()
print(root.attrib['attributeWithUmlauts'])

giving the following output:

äöü
äöü
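
For readers without the attached zip, here is a minimal self-contained sketch of the binary-mode case. The sample document, the temporary file, and the helper read_attribute are assumptions made for illustration and are not taken from the attachment. It covers only the two variants that should be safe: a file object opened in binary mode, and a plain filename, which parse() opens in binary mode itself.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document; the attribute name follows the excerpt above.
XML_TEXT = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<root attributeWithUmlauts="äöü"/>\n'
)


def read_attribute(source):
    """Parse source (a filename or a binary file object) and return the attribute."""
    return ET.parse(source).getroot().attrib['attributeWithUmlauts']


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test_ISO-8859-1.xml')

    # Write the bytes in the encoding named by the XML declaration.
    with open(path, 'wb') as f:
        f.write(XML_TEXT.encode('iso-8859-1'))

    # Binary mode: the parser sees the raw bytes and honors the declaration.
    with open(path, 'rb') as file_in_binary:
        from_binary = read_attribute(file_in_binary)

    # A filename works too: parse() opens the file itself, in binary mode.
    from_filename = read_attribute(path)

print(from_binary == from_filename == 'äöü')  # True
```

The text-mode variant is left out on purpose, since its result is exactly what this report is about.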

Labels: docs, topic-XML
