Perl-XML at uucode.com

This page contains libraries and code examples for handling XML in Perl. I think that Perl is not very friendly for XML processing. I do not insist on it; I just miss a standard DOM shared by all modules, and the functionality of Balise, a powerful SGML/XML scripting language. Some of these features are implemented in the DOM utilities library.

I give no guarantee for the code, because most of it was written under time pressure. The code should be considered alpha quality. Anyway, my alpha quality is usually good.

This code is in the public domain. It means that you can use the code for any purpose without any restrictions. The best you can do is to convert the code into modules or integrate it into other modules. At the least, I hope that you find my code useful.

Russian encodings pack for XML::Parser

The standard Russian encodings are windows-1251, cp866 (DOS) and koi8-r (UNIX). All others are not standard. Note for latin-1 users: iso-8859-5 is a standard in ISO, not in Russia. If you wish to add support for our encodings to your product, please add 'windows-1251' and 'koi8-r'. If you add only 'iso-8859-5', your product will not support the Russian language.

XML::Parser supports only 'iso-8859-5'. The language pack adds support for all standard Russian encodings. Contents of the package enc.zip:

windows-1251.enc, ibm866.enc, koi8-r.enc
These files should be copied into the folder that holds the other .enc files of the installed XML::Parser.
cp1251.txt, cp866.txt, koi8-r.txt
Text files that map the encodings to Unicode.
windows-1251.xml, ibm866.xml, koi8-r.xml
XML descriptions of the encodings. They are the input data from which the XML::Encoding module creates the .enc files.

The next step after installing the encodings is to update your Perl code. You should specify the desired encoding for the XML parser. See the documentation for more information.

The last step is the most important. Please think: why do you use 8-bit files instead of UTF-8?
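
For files that already exist in one of these encodings, conversion to UTF-8 is short. Here is a minimal sketch, assuming Perl 5.8 or later with the core Encode module (Encode is not part of the packages described on this page, and the helper name to_utf8 is hypothetical):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode a byte string from an 8-bit encoding
# (for example 'cp1251', 'koi8-r' or 'cp866') into UTF-8 bytes.
sub to_utf8 {
    my ($bytes, $encoding) = @_;
    my $chars = decode($encoding, $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);          # characters -> UTF-8 bytes
}

# A two-letter Russian word stored as the windows-1251 bytes E4 E0.
my $cp1251 = "\xE4\xE0";
my $utf8   = to_utf8($cp1251, 'cp1251');
printf "%d bytes in, %d bytes out\n", length $cp1251, length $utf8;
```

Each Cyrillic letter takes one byte in the 8-bit encodings and two bytes in UTF-8, so the output string is longer than the input.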

Include functionality for XML parsing

This code demonstrates how to build a DOM tree from a number of files. Files are included on the fly, during XML parsing. The DOM tree builder (XML::Handler::BuildDOM) does not know about the inclusion.

I use the module XML::Filter::SAXT. This module is an analog of the cat and tee unix commands for SAX events. It simplifies building chains of responsibility for SAX handlers. For example, the parser generates SAX events. The first handler in the chain renames para tags to p, the second handler drops some tags, the third handler can join a series of character events into one event, and the next handler can emulate a push/pull namespace-aware parser.
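
The idea of such a chain can be sketched in plain Perl. This is not the XML::Filter::SAXT API: events are simple hashes, filters are closures, and the names make_chain, rename_para and drop_font are hypothetical. It assumes Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Each filter receives one event and a callback $next that forwards
# events to the rest of the chain.

# Filter 1: rename <para> to <p>.
sub rename_para {
    my ($event, $next) = @_;
    my %copy = %$event;
    $copy{name} = 'p' if ($copy{name} // '') eq 'para';
    $next->(\%copy);
}

# Filter 2: drop the start and end events of <font>, keep its content.
sub drop_font {
    my ($event, $next) = @_;
    return if ($event->{name} // '') eq 'font';
    $next->($event);
}

# Build a chain: returns a sub that feeds an event through all filters,
# in the given order, and finally into $sink.
sub make_chain {
    my ($sink, @filters) = @_;
    my $next = $sink;
    for my $filter (reverse @filters) {
        my $rest = $next;
        $next = sub { $filter->($_[0], $rest) };
    }
    return $next;
}

my @seen;
my $chain = make_chain(sub { push @seen, $_[0] }, \&rename_para, \&drop_font);

$chain->($_) for (
    { type => 'start', name => 'para' },
    { type => 'start', name => 'font' },
    { type => 'chars', data => 'Hello' },
    { type => 'end',   name => 'font' },
    { type => 'end',   name => 'para' },
);

print join(' ', map { "$_->{type}:" . ($_->{name} // $_->{data}) } @seen), "\n";
```

The sink at the end of the chain plays the role that a handler such as XML::Handler::BuildDOM plays in the real code: it sees only the filtered stream and knows nothing about the filters before it.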

My code (include-test.zip) is built on top of XML::Filter::SAXT. I filter the tags <include src="filename"/>. When I meet this tag, I suspend the handling of the current document and create a new instance of my code for the given file. This file may also contain include tags. All SAX events are joined into one stream. This stream is the input of the next SAX handler in the chain. In my code this handler is XML::Handler::BuildDOM.
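
The inclusion logic can be illustrated without a real parser. The following is a sketch only, not the code from include-test.zip: files are replaced by in-memory lists of events, the function name emit_with_includes is hypothetical, and the guard against circular includes is my addition for the example. It assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Stand-in for files on disk: file name => list of SAX-like events.
my %files = (
    'main.xml' => [
        { type => 'start', name => 'doc' },
        { type => 'start', name => 'include', attr => { src => 'part.xml' } },
        { type => 'end',   name => 'include' },
        { type => 'end',   name => 'doc' },
    ],
    'part.xml' => [
        { type => 'start', name => 'p' },
        { type => 'chars', data => 'included text' },
        { type => 'end',   name => 'p' },
    ],
);

# Send all events of $file to $sink, replacing every <include src="..."/>
# by the events of the named file. Included files may include others.
sub emit_with_includes {
    my ($files, $file, $sink, $active) = @_;
    $active ||= {};
    die "circular include of $file\n" if $active->{$file};
    die "no such file: $file\n" unless $files->{$file};
    local $active->{$file} = 1;    # cleared again when this call returns
    for my $event (@{ $files->{$file} }) {
        if (($event->{name} // '') eq 'include') {
            # Expand on the start event; swallow the matching end event.
            emit_with_includes($files, $event->{attr}{src}, $sink, $active)
                if $event->{type} eq 'start';
            next;
        }
        $sink->($event);
    }
}

my @stream;
emit_with_includes(\%files, 'main.xml', sub { push @stream, $_[0] });
print join(' ', map { "$_->{type}:" . ($_->{name} // $_->{data}) } @stream), "\n";
```

The sink receives one continuous stream in which the include element never appears, which is why the next handler does not need to know about inclusion.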

Build DOM tree in 8-bit instead of UTF8

This code (8bit-dom.tar.gz) is another example of extending XML::Filter::SAXT. I redefined the characters handler. The updated handler asks the XML parser for the original 8-bit string and passes this string to the characters handler of the next handler in the chain. After the updated SAX events are processed by XML::Handler::BuildDOM, you get an 8-bit DOM instead of a UTF-8 one.

Important note: you MUST NOT build an 8-bit DOM tree unless you know exactly why you are doing it. Perl XML programmers assume that XML::DOM character data is in UTF-8. If you disregard this assumption, you plant a time bomb in your project.
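
To see why this is a time bomb, consider what happens when code that expects UTF-8 receives 8-bit data. Here is a small sketch, assuming the core Encode module; the helper name is_valid_utf8 is hypothetical and is not part of 8bit-dom.tar.gz:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical check: is this byte string well-formed UTF-8?
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $utf8_text   = "\xD0\xB4\xD0\xB0";   # a Russian word as UTF-8 bytes
my $cp1251_text = "\xE4\xE0";           # the same word as windows-1251 bytes

print "UTF-8 data:        ", is_valid_utf8($utf8_text),   "\n";
print "windows-1251 data: ", is_valid_utf8($cp1251_text), "\n";
```

Any module that decodes or re-encodes the character data of an 8-bit DOM as UTF-8 will hit the second case.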

DOM utilities

I found that the DOM standard is not friendly for programming. As always, it is only my opinion. This library (domutil.tar.gz) helps me to concentrate on programming, not on reading the XML::DOM documentation. Here is a list of functions:

ancestor, children, child, search_elem_nodes
Extract nodes with given name(s) and/or attributes.
hasAttr, getAttr
Procedural-style access to the attributes of a node.
scanSubTree
Scan a subtree. Generate events at the start and end of each element. Handlers can stop the scanning or deny the scanning of a subtree.
content
Generate a string from the text content of a tree.
from_8bitDOM_to_string
The same as XML::DOM::toString (convert a DOM tree to an XML string), but for an 8-bit DOM instead of a UTF-8 DOM.

You can also read the online documentation. This documentation is autogenerated from POD format.
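
To give a feeling for this style of traversal, here is a toy sketch on plain Perl data instead of XML::DOM nodes. It is loosely modeled on the descriptions of scanSubTree and content above and is not the library's actual interface: the names scan_subtree and text_content, the tree layout, and the return values 'skip' and 'stop' are all assumptions made for the example. It assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Toy tree: an element is { name => ..., children => [...] },
# a text node is a plain string.
my $tree = {
    name     => 'doc',
    children => [
        { name => 'title',   children => ['DOM '] },
        { name => 'comment', children => ['draft '] },
        { name => 'p',       children => ['utilities'] },
    ],
};

# Walk the subtree, calling $on_start and $on_end for every element.
# $on_start may return 'skip' (do not descend into this element) or
# 'stop' (abort the whole scan). Returns false if the scan was stopped.
sub scan_subtree {
    my ($node, $on_start, $on_end) = @_;
    return 1 unless ref $node;                 # text nodes produce no events
    my $verdict = $on_start->($node) // '';
    return 0 if $verdict eq 'stop';
    if ($verdict ne 'skip') {
        for my $child (@{ $node->{children} }) {
            scan_subtree($child, $on_start, $on_end) or return 0;
        }
    }
    $on_end->($node);
    return 1;
}

# Concatenate the text content of a subtree.
sub text_content {
    my ($node) = @_;
    return $node unless ref $node;
    return join '', map { text_content($_) } @{ $node->{children} };
}

# Collect element names, without descending into <comment>.
my @names;
scan_subtree(
    $tree,
    sub { push @names, $_[0]{name}; $_[0]{name} eq 'comment' ? 'skip' : undef },
    sub { },
);
print "@names\n";
print text_content($tree), "\n";
```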

Pyxie modules

I'm the author of the PyxieSax modules. These modules use a plain-text notation for representing XML data. This notation is a subset of the Nsgmls output format. After Sean McGrath published an article at XML.com about his PYX notation, I added PYX support to my module and uploaded it to CPAN.
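
In PYX every line starts with a one-character type code: ( for a start tag, A for an attribute of the preceding start tag, - for character data, ) for an end tag and ? for a processing instruction. Here is a minimal reader for this notation, as a sketch only. It is not the code of the PyxieSax modules, the function name parse_pyx is hypothetical, and it handles only the \n escape in character data. It assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Turn PYX text into a list of SAX-like events.
sub parse_pyx {
    my ($text) = @_;
    my (@events, $last_start);
    for my $line (split /\n/, $text) {
        next if $line eq '';
        my ($kind, $rest) = (substr($line, 0, 1), substr($line, 1));
        if ($kind eq 'A') {
            # Attribute lines directly follow the start tag they belong to.
            die "attribute without start tag: $line\n" unless $last_start;
            my ($name, $value) = split / /, $rest, 2;
            $last_start->{attr}{$name} = $value // '';
            next;
        }
        $last_start = undef;
        if ($kind eq '(') {
            $last_start = { type => 'start', name => $rest, attr => {} };
            push @events, $last_start;
        }
        elsif ($kind eq ')') {
            push @events, { type => 'end', name => $rest };
        }
        elsif ($kind eq '-') {
            (my $data = $rest) =~ s/\\n/\n/g;   # simplification: only \n is unescaped
            push @events, { type => 'chars', data => $data };
        }
        elsif ($kind eq '?') {
            push @events, { type => 'pi', data => $rest };
        }
    }
    return @events;
}

# PYX form of <p align="center">Hello</p>
my @events = parse_pyx("(p\nAalign center\n-Hello\n)p\n");
print scalar(@events), " events, align=$events[0]{attr}{align}\n";
```

Because the notation is line-oriented, reading it needs only a line split and a look at the first character, with no tag or entity parsing at all.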

See my XmlConnect page for more information about the idea and its implementation. By the way: XML::Parser works faster than PyxParser, but PyxParser is far faster than a pure-Perl regexp parser.


http://uucode.com/xml/perl/index.html
Oleg A. Paraschenko <olpa uucode com>