xml.etree.ElementTree and processing instructions

The Python standard library module xml.etree.ElementTree is convenient for working with a simple subset of XML. Unfortunately for me, this subset does not include processing instructions, so a workaround is required.

The source XML:

<a>Is <?aaaa bbbb?> supported?</a>

After a round trip (read and immediately write) through ElementTree, one gets:

<a>Is  supported?</a>
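The round trip can be reproduced in a few lines. This is a minimal sketch that assumes Python 3 and parses from a string instead of a file; the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

source = "<a>Is <?aaaa bbbb?> supported?</a>"

# Parse with the default settings, then serialize straight back to text.
root = ET.fromstring(source)
result = ET.tostring(root, encoding="unicode")

# The processing instruction is missing from the serialized result.
print(result)
```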

My workaround is to tune the internals of the library and force handling of processing instructions. I represent them as elements with the special name *PI* and attributes target and data.

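import xml.etree.ElementTree

# Called for each processing instruction: record it as an empty
# fake element named '*PI*' that carries the target and the data.
def pi_handler(obj, target, data):
    obj.start('*PI*', {'target': target, 'data': data})
    obj.end('*PI*')

# Patch the library class so that every TreeBuilder uses the handler.
xml.etree.ElementTree.TreeBuilder.pi = pi_handler

tree = xml.etree.ElementTree.ElementTree()
tree.parse("x.xml")
xml.etree.ElementTree.dump(tree)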

This code creates the following XML.

<a>Is <*PI* data="bbbb" target="aaaa" /> supported?</a>
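On Python 3, TreeBuilder is usually backed by a C accelerator, so assigning to TreeBuilder.pi may fail or be ignored. A sketch of the same idea without patching the library is to subclass TreeBuilder and pass an instance to XMLParser. The names FakePITreeBuilder and parse_with_fake_pis are hypothetical, and the example assumes Python 3.

```python
import xml.etree.ElementTree as ET


class FakePITreeBuilder(ET.TreeBuilder):
    """Hypothetical builder that stores each PI as a fake '*PI*' element."""

    def pi(self, target, data=None):
        # Open and immediately close an empty element holding the PI parts.
        self.start('*PI*', {'target': target, 'data': data or ''})
        return self.end('*PI*')


def parse_with_fake_pis(xml_text):
    # Hand the custom builder to the parser instead of patching the library.
    parser = ET.XMLParser(target=FakePITreeBuilder())
    parser.feed(xml_text)
    return parser.close()


root = parse_with_fake_pis("<a>Is <?aaaa bbbb?> supported?</a>")
fake_pi = root[0]
print(fake_pi.tag, fake_pi.attrib)
```

The fake element is reached by index here, because '*' has a special meaning in the path syntax used by find() and findall().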

Notes:

* The resulting XML is not well-formed, because *PI* is not a valid element name, but this does not matter as long as you do not need to serialize the tree.
* ElementTree actually supports ProcessingInstruction nodes in memory; it just does not create them when reading.
* A better approach would be to create ProcessingInstruction nodes in pi_handler instead of fake elements, but that requires more code and a deeper understanding of the library.
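For readers on Python 3.8 or later, the last note can be sketched with the insert_pis argument of TreeBuilder, which keeps real ProcessingInstruction nodes in the tree. This is not the code from this post, only an illustration under that version assumption; the variable names are mine.

```python
import xml.etree.ElementTree as ET

source = "<a>Is <?aaaa bbbb?> supported?</a>"

# insert_pis=True asks the builder to keep processing instructions
# in the tree (available in Python 3.8 and later).
builder = ET.TreeBuilder(insert_pis=True)
parser = ET.XMLParser(target=builder)
parser.feed(source)
root = parser.close()

pi_node = root[0]
# Processing instruction nodes use the ProcessingInstruction factory as tag.
is_pi = pi_node.tag is ET.ProcessingInstruction

result = ET.tostring(root, encoding="unicode")
print(is_pi, result)
```

Because the nodes are real processing instructions, the tree can be serialized again without producing invalid element names.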

25 July 2012, update

The code above does not work with the ElementTree shipped with the older Python 2.5. Therefore, I have written an alternative version:

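# Keep the original constructor so that it can still be called.
xetxtb_saved_init = xml.etree.ElementTree.XMLTreeBuilder.__init__

def xetxtb_new_init(self, *ls, **kw):
    # Forward each processing instruction to the start/end element
    # handlers as a fake '*PI*' element with a flat attribute list.
    def new_pi(target, data):
        self._parser.StartElementHandler('*PI*', ['target', target, 'data', data])
        self._parser.EndElementHandler('*PI*')
    xetxtb_saved_init(self, *ls, **kw)
    self._parser.ProcessingInstructionHandler = new_pi

# Install the wrapper in place of the original constructor.
xml.etree.ElementTree.XMLTreeBuilder.__init__ = xetxtb_new_init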
