Be careful with libxml2 in Python

An unpleasant surprise from the libxml2 bindings for Python: you have to take care of encoding conversion yourself.

test.xml:

<text>Grüß</text>

test.py:

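import libxml2

doc = libxml2.parseFile('test.xml')
node = doc.getRootElement().children  # the text node inside <text>
s = node.getContent()                 # a byte string, not a unicode string
print repr(s)
doc.freeDoc()                         # release the document's memory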

The code outputs:

'Gr\xc3\xbc\xc3\x9f'

Instead of a unicode string we get a sequence of 6 bytes: the 4 characters of the text encoded in UTF-8, where ü and ß take two bytes each. To get a proper unicode string, decode it:

s = unicode(s, 'UTF-8')
print repr(s)

This gives the expected result:

u'Gr\xfc\xdf'

Correspondingly, to add a unicode string to a libxml2 XML tree, encode it to UTF-8 first: s.encode('UTF-8').
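
Both conversions can be wrapped in a pair of small helpers so the encoding is handled in one place. This is only a sketch: from_libxml2 and to_libxml2 are hypothetical names, not part of the libxml2 bindings, and the code assumes the bindings hand back UTF-8 byte strings as in the example above. It uses no libxml2 calls, so it runs on its own.

```python
# Hypothetical helpers (not part of libxml2) for converting at the boundary.
# Assumption: libxml2 returns and expects UTF-8 encoded byte strings.

def from_libxml2(raw):
    """Decode a UTF-8 byte string, e.g. from getContent(), to unicode."""
    return raw.decode('UTF-8')


def to_libxml2(text):
    """Encode a unicode string to the UTF-8 bytes passed to libxml2."""
    return text.encode('UTF-8')


raw = b'Gr\xc3\xbc\xc3\x9f'        # the 6 bytes from the example above
text = from_libxml2(raw)           # 4 characters: G, r, u-umlaut, sharp s

assert len(raw) == 6 and len(text) == 4
assert to_libxml2(text) == raw     # the round trip is lossless
```

Keeping the conversion at the boundary means the rest of the program only ever sees unicode strings.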

Categories: python

Updated: