python libxml2: save XML as HTML

HTML is the main output format for XML transformations, and every XSLT processor, including libxslt/libxml2, supports it. But if you transform a libxml2 tree by hand, you are in trouble: the tree can be saved only as XML, not as HTML. A workaround is needed. Mine is not elegant, but it works.

By the way, the plain C library does provide the desired functionality. I think the options parameter of the xmlSaveToXXX functions should include the flag XML_SAVE_AS_HTML ("force HTML serialization on XML doc"). Another thing to try would be to change the type of the document node from XML_DOCUMENT_NODE to XML_HTML_DOCUMENT_NODE.

Unfortunately, the Python bindings can't access this functionality. Fortunately, based on the "node type" idea, the following does work:

* Create an empty HTML document
* Move nodes from the XML tree into the new HTML tree
* Save the HTML tree

Proof-of-concept code:


import libxml2

doc = libxml2.parseDoc('''
<html>
  <head>
    <title>I'm a title</title>
    <link rel="stylesheet" type="text/css" href="style/style.css"></link>
  </head>
  <body>
    <h1>Test</h1>
    <img src="#none" width="32" height="32"/>
    <p>Test</p>
  </body>
</html>
''')
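node = doc.getRootElement()
print(node.serialize())

# Create an empty HTML document. Nodes attached to it belong to an
# HTML document, which is what makes serialize() pick HTML output.
html_doc = libxml2.htmlParseDoc('<html></html>', None)
html_root = html_doc.getRootElement()

# Move every child of the XML root under the HTML root.
while node.children:
    kid = node.children
    kid.unlinkNode()
    html_root.addChild(kid)

print('------------------')
print(html_root.serialize())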

The output, first as an XML tree, then as an HTML tree:


<html>
  <head>
    <title>I'm a title</title>
    <link rel="stylesheet" type="text/css" href="style/style.css"/>
  </head>
  <body>
    <h1>Test</h1>
    <img src="#none" width="32" height="32"/>
    <p>Test</p>
  </body>
</html>
------------------
<html>
  <head>
    <title>I'm a title</title>
    <link rel="stylesheet" type="text/css" href="style/style.css">
  </head>
  <body>
    <h1>Test</h1>
    <img src="#none" width="32" height="32">
    <p>Test</p>
  </body>
</html>

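Note the difference in how the link and img elements end: the XML output closes them with "/>", while the HTML output writes them as bare start tags with no closing.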
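
The same XML-versus-HTML difference can be reproduced without libxml2. The sketch below is only an illustration of the serialization rules, not part of the workaround above. It assumes Python 3 and uses the standard library's xml.etree.ElementTree, whose tostring() accepts a method argument, on a cut-down copy of the sample document. The name SOURCE is made up for this example.

```python
import xml.etree.ElementTree as ET

# A cut-down version of the sample document; both empty-element
# spellings (<link></link> and <img/>) are present on purpose.
SOURCE = (
    '<html><head>'
    '<link rel="stylesheet" href="style/style.css"></link>'
    '</head><body>'
    '<img src="#none" width="32" height="32"/>'
    '<p>Test</p>'
    '</body></html>'
)

root = ET.fromstring(SOURCE)

# Same tree, two serialization methods.
as_xml = ET.tostring(root, encoding="unicode", method="xml")
as_html = ET.tostring(root, encoding="unicode", method="html")

print(as_xml)
print("------------------")
print(as_html)
```

In the XML output, link and img are written as self-closing tags. In the HTML output they appear as start tags with no end tag, while p keeps its closing tag. The exact whitespace and attribute layout differ from what libxml2 prints, so treat this as a comparison of the rules rather than of the byte-for-byte output.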
