Be careful with libxml2 in Python
An unpleasant surprise from the libxml2 bindings for Python: you have to handle encoding conversion yourself.
test.xml:
<text>Grüß</text>
test.py:
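import libxml2

doc = libxml2.parseFile('test.xml')
# The first child of the root element is the text node.
node = doc.getRootElement().children
# getContent() returns a plain byte string (str), not a unicode object.
s = node.getContent()
print repr(s)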
The code outputs:
'Gr\xc3\xbc\xc3\x9f'
Instead of a unicode string, getContent() returns a byte string of 6 bytes: the text encoded as UTF-8. To get a proper unicode string, decode it explicitly:
s = unicode(s, 'UTF-8')
print repr(s)
This gives the expected result:
u'Gr\xfc\xdf'
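The difference between the two values is easy to check without libxml2 at all. The following is only a minimal sketch: the byte literal stands in for what getContent() returned above, and the names raw, text, byte_count and char_count are made up for illustration.

```python
# Stand-in for the byte string that node.getContent() returned above
# (assumption: libxml2 itself is not imported in this sketch).
raw = b'Gr\xc3\xbc\xc3\x9f'

# Decoding interprets the bytes as UTF-8 and yields a unicode string.
text = raw.decode('UTF-8')

# Each of the two non-ASCII letters takes two bytes in UTF-8,
# so the byte string is longer than the decoded text.
byte_count = len(raw)
char_count = len(text)
```

Here byte_count is 6 and char_count is 4, which is why length checks and slicing on the undecoded value give misleading results.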
Correspondingly, to add a unicode string to a libxml2 XML tree, encode it as UTF-8 first:
s.encode('UTF-8')
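One way to avoid forgetting the conversion is to keep both directions in a pair of small helpers and call them at every boundary with libxml2. This is a sketch under assumptions, not part of the bindings: from_libxml2 and to_libxml2 are hypothetical names, and the example round-trips a string in memory instead of touching a real document.

```python
def from_libxml2(raw):
    """Decode a UTF-8 byte string (e.g. from getContent()) into unicode text."""
    return raw.decode('UTF-8')


def to_libxml2(text):
    """Encode unicode text as UTF-8 bytes before passing it to libxml2."""
    return text.encode('UTF-8')


# Round trip: unicode -> UTF-8 bytes -> unicode.
original = u'Gr\xfc\xdf'
encoded = to_libxml2(original)
round_tripped = from_libxml2(encoded)
```

In real code the encoded value would be what gets handed to a libxml2 call such as node.setContent(), and from_libxml2 would wrap every value read back out of the tree.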