Be careful with libxml2 in Python

An unpleasant surprise in the libxml2 bindings for Python: you have to take care of encoding conversion yourself.



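import libxml2

doc = libxml2.parseFile('test.xml')
# First child node of the root element
node = doc.getRootElement().children
# getContent() returns a byte string (str), not a unicode object
s = node.getContent()
print repr(s)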

The code outputs:


Instead of a unicode string, we get a plain byte string (str): a sequence of 6 bytes that holds the text in UTF-8 encoding. To get a proper unicode string, decode it explicitly:

s = unicode(s, 'UTF-8')
print repr(s)

This gives the expected result:


Conversely, to add a unicode string to a libxml2 XML tree, encode it to UTF-8 first: s.encode('UTF-8').
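The two conversions can be wrapped in a pair of small helpers so that they are not scattered through the code. This is only a sketch: the names from_utf8 and to_utf8 are made up for illustration, and the 6-byte sample is a hypothetical stand-in, not the actual content of test.xml. It does not import libxml2, so it runs on its own.

```python
def from_utf8(raw):
    """Decode a UTF-8 byte string (e.g. what getContent() returned above)."""
    if isinstance(raw, bytes):      # bytes is an alias of str in Python 2
        return raw.decode('UTF-8')
    return raw                      # already a text string, leave untouched


def to_utf8(text):
    """Encode a text string to UTF-8 bytes before handing it to libxml2."""
    if isinstance(text, bytes):
        return text                 # already encoded, leave untouched
    return text.encode('UTF-8')


# Hypothetical sample: three Cyrillic letters, 6 bytes in UTF-8
raw = b'\xd1\x82\xd0\xb5\xd1\x81'
text = from_utf8(raw)

assert len(raw) == 6
assert len(text) == 3
assert to_utf8(text) == raw         # the round trip is lossless
```

With these in place, reading becomes from_utf8(node.getContent()), and writing would look something like node.setContent(to_utf8(text)), assuming the node is one where replacing the content is appropriate.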
