This page contains libraries and code examples for handling XML in Perl. I think that Perl is not very friendly for XML processing. I do not insist on this; I just want a standard DOM across all modules, plus the functionality of Balise, a powerful SGML/XML scripting language. Some of these features are implemented in the DOM utilities library.
I give no guarantee for this code, because most of it was written under time pressure. Treat it as alpha quality. That said, my alpha quality is usually good.
This code is in the public domain, which means you can use it for any purpose without restriction. The best thing you can do is convert the code into modules or integrate it into other modules. At the very least, I hope you find my code useful.
The standard Russian encodings are windows-1251, cp866 (DOS) and koi8-r (UNIX). All others are non-standard. Note for latin-1 users: iso-8859-5 is an ISO standard, but it is not the standard in Russia. If you wish to add support for our encodings to your product, please add 'windows-1251' and 'koi8-r'. If you add only 'iso-8859-5', your product will not support the Russian language.
XML::Parser supports only 'iso-8859-5'. The language pack adds support for all the standard Russian encodings. Contents of the package enc.zip:
The next step after installing the encodings is to update your Perl code: specify the desired encoding for the XML parser. See the documentation for more information.
The last step is the most important one. Please think about why you use 8-bit files instead of UTF-8.
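For comparison, converting an 8-bit string to UTF-8 takes only a few lines with Perl's core Encode module. This is a minimal sketch and not part of the packages on this page; the helper name to_utf8 and the sample bytes are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert bytes in a legacy 8-bit encoding to UTF-8 bytes.
sub to_utf8 {
    my ($bytes, $from) = @_;
    my $text = decode($from, $bytes);   # bytes -> Perl character string
    return encode('UTF-8', $text);      # characters -> UTF-8 bytes
}

# The Russian word for "hello", as windows-1251 bytes (sample data).
my $cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";
my $utf8   = to_utf8($cp1251, 'cp1251');

printf "%d bytes in, %d bytes out\n", length $cp1251, length $utf8;
```

Encode also knows 'koi8-r' and 'cp866', so the same helper should cover all three standard Russian encodings.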
This code demonstrates how to build a DOM tree from a number of files. The files are included on the fly, during XML parsing. The DOM tree builder (XML::Handler::BuildDOM) knows nothing about the inclusion.
I use the module XML::Filter::SAXT. It is an analog of the Unix cat and tee commands for SAX events, and it simplifies building chains of responsibility out of SAX handlers. For example, the parser generates SAX events; the first handler in the chain renames para tags to p, the second drops some tags, the third joins a series of character events into one event, and the next emulates a push/pull namespace-aware parser.
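The chain idea can be sketched in plain Perl without any CPAN modules. This is only an illustration, not the XML::Filter::SAXT API: the class names PassThrough, RenamePara, JoinText and Recorder are invented here, the events are fired by hand instead of by a parser, and only two of the links described above (renaming para to p, joining character events) are shown.

```perl
use strict;
use warnings;

# Base link: forwards every event unchanged to the next handler.
package PassThrough {
    sub new {
        my ($class, %args) = @_;
        return bless { Handler => $args{Handler} }, $class;
    }
    sub start_element { $_[0]{Handler}->start_element($_[1]) }
    sub end_element   { $_[0]{Handler}->end_element($_[1]) }
    sub characters    { $_[0]{Handler}->characters($_[1]) }
}

# Link 1: rename <para> to <p>.
package RenamePara {
    our @ISA = ('PassThrough');
    sub _rename {
        my ($el) = @_;
        return +{ %$el, Name => ($el->{Name} eq 'para' ? 'p' : $el->{Name}) };
    }
    sub start_element { $_[0]->SUPER::start_element(_rename($_[1])) }
    sub end_element   { $_[0]->SUPER::end_element(_rename($_[1])) }
}

# Link 2: join consecutive character events into one.
package JoinText {
    our @ISA = ('PassThrough');
    sub characters { $_[0]{Buffer} .= $_[1]{Data} }   # buffer, do not forward yet
    sub _flush {
        my ($self) = @_;
        return unless defined $self->{Buffer};
        $self->SUPER::characters({ Data => delete $self->{Buffer} });
    }
    sub start_element { $_[0]->_flush; $_[0]->SUPER::start_element($_[1]) }
    sub end_element   { $_[0]->_flush; $_[0]->SUPER::end_element($_[1]) }
}

# End of the chain: records what it receives.
package Recorder {
    sub new { my ($class) = @_; return bless { Log => [] }, $class }
    sub start_element { push @{ $_[0]{Log} }, "<$_[1]{Name}>" }
    sub end_element   { push @{ $_[0]{Log} }, "</$_[1]{Name}>" }
    sub characters    { push @{ $_[0]{Log} }, $_[1]{Data} }
}

my $sink  = Recorder->new;
my $chain = RenamePara->new(Handler => JoinText->new(Handler => $sink));

# Play the role of the parser: two character events inside one element.
$chain->start_element({ Name => 'para' });
$chain->characters({ Data => 'Hello, ' });
$chain->characters({ Data => 'world' });
$chain->end_element({ Name => 'para' });

print join('', @{ $sink->{Log} }), "\n";   # <p>Hello, world</p>
```

Each link knows only the next handler, so links can be added, removed or reordered without touching the others.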
My code (include-test.zip) is built on top of XML::Filter::SAXT. I filter the tags <include src="filename"/>. When I meet this tag, I suspend handling of the current document and create a new instance of my code for the given file. That file may itself contain include tags. All SAX events are joined into one stream, which becomes the input of the next SAX handler in the chain. In my code this handler is XML::Handler::BuildDOM.
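The control flow of the inclusion can be sketched as follows. This is not the code from include-test.zip: to stay self-contained it replaces real files and a real parser with a hypothetical table of pre-recorded events (%FILES), and the names IncludeFilter, parse_file and Recorder are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for files on disk: each "file" is a list of SAX-like events.
my %FILES = (
    'main.xml' => [
        [ start_element => { Name => 'doc' } ],
        [ start_element => { Name => 'include', Attributes => { src => 'part.xml' } } ],
        [ end_element   => { Name => 'include' } ],
        [ end_element   => { Name => 'doc' } ],
    ],
    'part.xml' => [
        [ start_element => { Name => 'p' } ],
        [ characters    => { Data => 'included text' } ],
        [ end_element   => { Name => 'p' } ],
    ],
);

package IncludeFilter {
    sub new {
        my ($class, %args) = @_;
        return bless { Handler => $args{Handler}, Files => $args{Files} }, $class;
    }
    # Replay one "file" through this filter (a real version would run a parser here).
    sub parse_file {
        my ($self, $name) = @_;
        my $events = $self->{Files}{$name} or die "no such file: $name";
        for my $event (@$events) {
            my ($method, $data) = @$event;
            $self->$method($data);
        }
    }
    sub start_element {
        my ($self, $el) = @_;
        if ($el->{Name} eq 'include') {
            # Suspend the current document; a fresh instance handles the included
            # file and feeds the same downstream handler.
            my $child = ref($self)->new(%$self);
            $child->parse_file($el->{Attributes}{src});
            return;
        }
        $self->{Handler}->start_element($el);
    }
    sub end_element {
        my ($self, $el) = @_;
        return if $el->{Name} eq 'include';   # the include tag itself is swallowed
        $self->{Handler}->end_element($el);
    }
    sub characters { $_[0]{Handler}->characters($_[1]) }
}

# Downstream handler; it never sees an include tag.
package Recorder {
    sub new { my ($class) = @_; return bless { Out => '' }, $class }
    sub start_element { $_[0]{Out} .= "<$_[1]{Name}>" }
    sub end_element   { $_[0]{Out} .= "</$_[1]{Name}>" }
    sub characters    { $_[0]{Out} .= $_[1]{Data} }
}

my $recorder = Recorder->new;
IncludeFilter->new(Handler => $recorder, Files => \%FILES)->parse_file('main.xml');
print $recorder->{Out}, "\n";   # <doc><p>included text</p></doc>
```

Because the child is another instance of the same filter, an included file that itself contains include tags is expanded by the same mechanism.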
This code (8bit-dom.tar.gz) is another example of extending XML::Filter::SAXT. I redefined the characters handler: the updated handler asks the XML parser for the original 8-bit string and passes that string to the characters handler of the next handler in the chain. After the updated SAX events have been processed by XML::Handler::BuildDOM, you get an 8-bit DOM instead of a UTF-8 one.
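A rough sketch of such a characters filter is shown below. Note the swapped technique: instead of asking the parser for the original string, as 8bit-dom.tar.gz does, this version re-encodes the character data with the core Encode module, which gives the same bytes only when the source file was in that encoding. The class names EightBitText and TextSink are assumptions for the example.

```perl
use strict;
use warnings;
use Encode ();

# Filter that replaces character data with bytes in a chosen 8-bit encoding.
package EightBitText {
    sub new {
        my ($class, %args) = @_;
        return bless { Handler => $args{Handler}, Encoding => $args{Encoding} }, $class;
    }
    sub characters {
        my ($self, $ch) = @_;
        # Re-encode the character string; the next handler now receives bytes.
        my $bytes = Encode::encode($self->{Encoding}, $ch->{Data});
        $self->{Handler}->characters({ %$ch, Data => $bytes });
    }
    # Element events pass through untouched.
    sub start_element { $_[0]{Handler}->start_element($_[1]) }
    sub end_element   { $_[0]{Handler}->end_element($_[1]) }
}

# Stand-in for the DOM builder: just collects the text it is given.
package TextSink {
    sub new { my ($class) = @_; return bless { Text => '' }, $class }
    sub characters    { $_[0]{Text} .= $_[1]{Data} }
    sub start_element { }
    sub end_element   { }
}

my $text_sink = TextSink->new;
my $to_8bit   = EightBitText->new(Handler => $text_sink, Encoding => 'cp1251');

# The Russian word for "hello" as a Perl character string (sample data).
$to_8bit->characters({ Data => "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}" });

print join(' ', map { sprintf '%02X', ord } split //, $text_sink->{Text}), "\n";
```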
Important note: you MUST NOT build an 8-bit DOM tree unless you know exactly why you are doing it. Perl XML programmers assume that XML::DOM character data is in UTF-8. If you disregard this assumption, you plant a time bomb in your project.
I found that the DOM standard is not friendly for programming. As always, this is only my opinion. This library (domutil.tar.gz) lets me concentrate on programming rather than on reading the XML::DOM documentation. Here is a list of functions:
I'm the author of the PyxieSax modules. These modules use a plain-text notation for representing XML data. The notation is a subset of the Nsgmls output format. After Sean McGrath published an article on XML.com about his PYX notation, I added PYX support to my module and uploaded it to CPAN.
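To give a feeling for the notation: in PYX every line is one event, and its first character tells the type ('(' start tag, 'A' attribute, '-' character data, ')' end tag, '?' processing instruction). The reader below is a minimal sketch written for this page, not the PyxieSax code; the function name parse_pyx and the event layout are assumptions, and of the escapes only \n is handled.

```perl
use strict;
use warnings;

# Turn PYX text into a list of events: [type, name-or-data, attributes].
sub parse_pyx {
    my ($pyx) = @_;
    my @events;
    for my $line (split /\n/, $pyx) {
        next if $line eq '';
        my ($type, $rest) = (substr($line, 0, 1), substr($line, 1));
        if ($type eq '(') {
            push @events, [ start => $rest, {} ];
        }
        elsif ($type eq 'A') {
            # In PYX the attribute lines follow their start tag.
            my ($name, $value) = split / /, $rest, 2;
            $events[-1][2]{$name} = $value;
        }
        elsif ($type eq '-') {
            $rest =~ s/\\n/\n/g;   # simplification: only the \n escape is decoded
            push @events, [ text => $rest ];
        }
        elsif ($type eq ')') {
            push @events, [ end => $rest ];
        }
        elsif ($type eq '?') {
            push @events, [ pi => $rest ];
        }
        else {
            die "unknown PYX line: $line";
        }
    }
    return @events;
}

# PYX for: <para id="intro">Hello, world</para>
my $sample = <<'PYX';
(para
Aid intro
-Hello, world
)para
PYX

for my $event (parse_pyx($sample)) {
    print "$event->[0]: $event->[1]\n";
}
```

Since each line is handled with substr and a few string comparisons, a reader like this needs no real XML tokenizer.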
See my XmlConnect page for more information about the idea and its implementation. By the way, XML::Parser is faster than PyxParser, but PyxParser is much faster than a pure-Perl regexp parser.