XmlConnect

Abstract: A simple plain-text representation of XML is developed. This notation is well suited to interchanging XML data between applications, especially applications that are not XML-aware. The notation is a subset of the Nsgmls output format. SAX modules for Perl and Java have been developed. Tools and ideas for further development are described, and usage examples and references are given.

What is XmlConnect

The goal of the XmlConnect project is to transfer XML data between applications with minimal overhead. The modules use a subset of the Nsgmls output format [SPOUT] (Esis strings) and are compatible with PYX notation [SEAN].

Simple example:

XML version:

...
<?markid L000012?>
<para lang="en" align="center">
Hello, World!</para>
...

Plain-text notation (ESIS strings):

...
?markid L000012
Alang CDATA en
Aalign CDATA center
(para
-\n\012Hello, World!
)para
...

Download

PyxieSax.0.32.tar.gz (Perl)
The package contains two modules: XML::Parser::PyxParser and XML::Handler::PyxWriter. The first module reads an input Esis stream and generates Perl SAX events. The second module handles Perl SAX events and writes Esis strings to the output.
PyxieSax.java.2.0.alpha.tar.gz (Java)
The package contains two modules: com.olpa.xml.PyxieSax.PyxParser and com.olpa.xml.PyxieSax.PyxWriter. The first module reads an input Esis stream and generates Java SAX events. The second module handles Java SAX events and writes Esis strings to the output. The code is derived from SAXToPYX.java and PYXToSAX.java by Shawn Silverman [SHAWN].
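To illustrate the writer direction (SAX events in, Esis strings out), here is a minimal Python sketch. It is not the code of either package. The names EsisWriter and xml_to_esis are hypothetical, every attribute is written with type CDATA (an assumption that holds only when no DTD declares other types), and only backslash and newline are escaped.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


def escape(text):
    """Escape backslash and newline for an Esis line (backslash first)."""
    return text.replace("\\", "\\\\").replace("\n", "\\n")


class EsisWriter(ContentHandler):
    """Hypothetical handler: turns SAX events into Esis strings."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        # Attribute lines come before the element line, as in Nsgmls output.
        for key in attrs.getNames():
            # Assumption: without a DTD every attribute is reported as CDATA.
            self.out.write(f"A{key} CDATA {escape(attrs.getValue(key))}\n")
        self.out.write(f"({name}\n")

    def endElement(self, name):
        self.out.write(f"){name}\n")

    def characters(self, content):
        # The parser may deliver one text run in several calls;
        # each call becomes its own data line in this sketch.
        if content:
            self.out.write(f"-{escape(content)}\n")

    def processingInstruction(self, target, data):
        self.out.write(f"?{target} {data}\n")


def xml_to_esis(xml_text):
    """Parse an XML string and return its Esis representation."""
    out = io.StringIO()
    xml.sax.parseString(xml_text.encode("utf-8"), EsisWriter(out))
    return out.getvalue()


print(xml_to_esis('<para lang="en">Hello, World!</para>'), end="")
```

The reader direction works the same way in reverse: each Esis line maps to one handler call.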

XmlConnect tools

The packages contain some tests and examples that can be useful as tools:

n2p
Converts Esis strings to PYX notation.
p2n
Converts PYX notation to Esis strings.
n2c
Converts Esis strings to canonical XML.
p2c
Converts PYX notation to canonical XML.
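A tool in the spirit of n2c can be approximated by replaying Esis lines as SAX events into an XML writer. The Python sketch below is an illustration only, not the packaged tool: esis_to_sax and esis_to_xml are hypothetical names, the output is ordinary XML rather than strictly canonical XML, and only the escapes \\, \n and \nnn are decoded (\012, the record start, is dropped).

```python
import io
import re
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

_ESCAPE = re.compile(r"\\(\\|n|[0-7]{3})")


def unescape(data):
    """Decode the escapes used on Esis data and attribute lines."""
    def repl(match):
        token = match.group(1)
        if token == "\\":
            return "\\"
        if token == "n":
            return "\n"
        if token == "012":
            return ""  # record start: ignored
        return chr(int(token, 8))
    return _ESCAPE.sub(repl, data)


def esis_to_sax(lines, handler):
    """Replay Esis lines as SAX events on the given content handler."""
    attrs = {}
    handler.startDocument()
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            parts = rest.split(" ", 2)
            if len(parts) == 3:  # skips lines such as "Aname IMPLIED"
                attrs[parts[0]] = unescape(parts[2])
        elif code == "(":
            # Attributes seen so far belong to this start tag.
            handler.startElement(rest, AttributesImpl(attrs))
            attrs = {}
        elif code == ")":
            handler.endElement(rest)
        elif code == "-":
            handler.characters(unescape(rest))
        elif code == "?":
            target, _, data = rest.partition(" ")
            handler.processingInstruction(target, data)
        # Any other line type is outside this sketch and is skipped.
    handler.endDocument()


def esis_to_xml(lines):
    """Return XML text for a list of Esis lines."""
    out = io.StringIO()
    esis_to_sax(lines, XMLGenerator(out, encoding="utf-8"))
    return out.getvalue()
```

Because esis_to_sax only needs a content handler, the same function could feed any other SAX consumer instead of XMLGenerator.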

XML data interchange

If applications are going to interchange XML (for example, a client asks a database for data), one program must convert its in-memory representation of the XML to some format, and the second program must convert this format back to an in-memory representation. If the applications are written in the same programming language, the simplest way is to use a serialization library. But Bruce Martin [BRUCE] says:

"Surprisingly, the performance results presented below indicate that a textual representation of XML is a far more efficient representation than a serialized DOM representation. Also, the time required to externalize a DOM representation and reparse the textual form is cheaper than the direct Java serialization and deserialization of the DOM."

One reason why serialization is not very effective for XML interchange is that serialization is designed for general-purpose use. An XML tree is a specific memory structure, and the interchange of this structure can be optimized: instead of the tree itself, we can transfer pieces of information from which the original tree can be restored. Another approach to optimizing the transfer of XML data is to minimize parsing time (Minimal/Common/Binary XML). If we combine the interchange of an in-memory XML tree with the interchange of preparsed XML, we get interchange by SAX events.

While searching for the best method of transferring SAX events, I remembered the program Nsgmls [NSGMLS] by James Clark, which is well known to SGML/XML developers. It parses SGML or XML documents and prints their structure as ESIS (Element Structure Information Set) strings. The set of these strings is not the same as the set of SAX events, but many of the strings are not significant for XML interchange, so we can select a subset of them. This subset is referred to below as Esis strings or Esis notation.

Here are some basic considerations on what functionality is required for general XML exchange based on Esis notation.

TODO 1

One of the big problems with XML is that good XML storage has not yet been implemented. The basic set of requirements is still under development: storing, extracting and indexing XML, transactions and locking, user and programming interfaces, a GPL license. But many years ago, before the word XML was invented, scientists had already developed storage systems for tree data.

If we take such a system and define a simple representation of XML in terms of that system, we get a real XML database. Esis notation is an ideal candidate for such an attempt. If someone is working on this, please notify me.

By the way, do you need free XML storage with advanced search functionality? Look at freeWAIS-sf [WAIS]. We have been using freeWAIS-sf for several years as a real XML database (it can index XML). The only problems are the lack of documentation and that updating the storage online seems to be impossible.

TODO 2

Sometimes I work with invalid SGML documents. There are a lot of reasons why documents can be wrong:

In many cases the markup can be guessed and the document can be parsed. This can be done by a standalone program, which should save the result of parsing. The result of restoring the markup may still not be valid SGML or XML. Example:

...
<a name="para1"/>
<para>Here is a <b>sample text</b> <linefeed>
on two lines.</para>
...

This example is the result of inserting SGML text with an empty linefeed tag into an XML document. The result is neither SGML (there is no such DTD) nor XML (<linefeed> should be <linefeed/>), so we have to work with the result using non-SGML/XML tools. Esis notation is a very good input format for such tools.

Someday I will write a non-conformant, non-validating SGML and XML parser, all in one. This parser will understand most markup styles (SGML, XML, hand-written HTML) and generate Esis strings.

Differences from and common ground with Pyxie

Pyxie [PYXIE] is a Python library for processing XML, developed by Sean McGrath. I was working on my modules when Pyxie was announced. Pyxie used nearly the same method of representing XML (PYX notation) and became popular, so I added support for its format to my modules. Esis strings and PYX notation differ in only two ways:

Pyxie is a Python library, optimized for XML processing in Python. My modules add support for PYX notation to Perl and Java. However, the default format of the modules is Esis strings, because I believe that compatibility with Nsgmls is a very important feature.
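The following Python sketch shows what a conversion in the style of n2p might look like. It rests on assumptions about the two formats and is not taken from the modules. It assumes that Nsgmls writes attribute lines as "Aname TYPE value" before the element line, while PYX writes "Aname value" after it. It also assumes that the record-start escape \012 has no place in PYX. The function name esis_to_pyx is hypothetical.

```python
import re

# Matches an escaped backslash or \n (both kept) or \012 (dropped).
_RS = re.compile(r"(\\\\|\\n)|\\012")


def esis_to_pyx(lines):
    """Convert Esis lines to PYX-style lines under the stated assumptions."""
    out, pending = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("A"):
            parts = line[1:].split(" ", 2)
            if len(parts) == 3:  # drop the type token; skip IMPLIED lines
                pending.append(f"A{parts[0]} {parts[2]}")
        elif line.startswith("("):
            out.append(line)
            out.extend(pending)  # attributes follow the start line
            pending = []
        elif line.startswith("-"):
            out.append(_RS.sub(lambda m: m.group(1) or "", line))
        elif line.startswith((")", "?")):
            out.append(line)
    return out


esis = ["?markid L000012", "Alang CDATA en", "Aalign CDATA center",
        "(para", "-\\n\\012Hello, World!", ")para"]
print("\n".join(esis_to_pyx(esis)))
```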

Examples of usage

Some examples come from my own experience; others are taken from Pyxie articles and discussions. These examples are oriented toward text processing. Here x2p stands for any program that converts XML or SGML to Esis notation; I usually use Nsgmls for this. I have created a code archive (pyx_usage.tar.gz) with examples of using Esis or PYX notation.

Counting number of tags

To count the para tags in a document, use

$ x2p document | grep -i '^(para$' | wc -l
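When counts for every element are wanted at once, the same idea can be sketched in Python. The function name count_start_tags is hypothetical, and the input is assumed to be the Esis lines produced by x2p. Names are lowercased to mirror the case-insensitive grep above.

```python
from collections import Counter


def count_start_tags(lines):
    """Count start-tag lines per element name, ignoring case."""
    return Counter(
        line.rstrip("\n")[1:].lower()
        for line in lines
        if line.startswith("(")
    )


sample = ["(doc", "(para", "-text", ")para", "(PARA", "-more", ")PARA", ")doc"]
counts = count_start_tags(sample)
print(counts["para"])
```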

Getting list of referenced system entities

Sometimes I have to get the list of images associated with a document. Here is a method:

$ x2p document | grep '^f<OSFILE>' | sed 's/^f<OSFILE>//' | sort -u
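The same filter can be written in Python when the list is needed inside a larger script. This is a sketch only: system_files is a hypothetical name, and the f<OSFILE> prefix is taken directly from the command above.

```python
def system_files(lines, prefix="f<OSFILE>"):
    """Return the sorted, de-duplicated file names from 'f<OSFILE>' lines."""
    return sorted({
        line.rstrip("\n")[len(prefix):]
        for line in lines
        if line.startswith(prefix)
    })


sample = ["f<OSFILE>img/b.png", "(para", "f<OSFILE>img/a.png", "f<OSFILE>img/b.png"]
print(system_files(sample))
```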

Check of changes

Sometimes I make mass changes in structured documents. Usually I make the changes with tools that are optimized for this kind of change. These tools usually know nothing about the structure of the document and can make false changes. To detect false changes, we can convert the original document to Esis notation and make the same changes there, taking the structure into account. We can then compare this reference (etalon) with the Esis representation of the modified structured document.

A typical example is renaming attributes. Suppose we renamed all attributes a1 to a2 in file file1 and saved the result in file file2. We did it in an ordinary text editor by replacing a1=" with a2=". Now we should check that it was done correctly.

$ # Generate Esis notation of the modified structured document.
$ x2p file2 | grep -v ' IMPLIED$' >t-esis2
$ # Make the etalon (reference) Esis.
$ x2p file1 | grep -v ' IMPLIED$' >t-esis1
$ sed 's/^Aa1 /Aa2 /i' t-esis1 >t-etalon
$ # Check that we replaced all.
$ grep -i '^Aa1 ' t-esis2
$ # Check for false changes.
$ diff -i t-etalon t-esis2
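The same check can be sketched in Python for use inside a test script. The function names are hypothetical, and the inputs are assumed to be the Esis lines of file1 and file2. Unlike diff -i, this sketch compares lines exactly; only the attribute name is matched without regard to case.

```python
import difflib


def drop_implied(lines):
    """Strip newlines and drop attribute lines that end in ' IMPLIED'."""
    stripped = (line.rstrip("\n") for line in lines)
    return [line for line in stripped if not line.endswith(" IMPLIED")]


def is_attr(line, name):
    """True if the line is an attribute line for the given name."""
    head = line[1:len(name) + 2]
    return line.startswith("A") and head.lower() == (name + " ").lower()


def check_rename(esis_before, esis_after, old, new):
    """Return (leftover old-name lines, diff of reference vs. actual)."""
    before = drop_implied(esis_before)
    after = drop_implied(esis_after)
    # Build the reference by renaming only real attribute lines.
    reference = [
        f"A{new} " + line[len(old) + 2:] if is_attr(line, old) else line
        for line in before
    ]
    leftovers = [line for line in after if is_attr(line, old)]
    diff = list(difflib.unified_diff(reference, after, lineterm=""))
    return leftovers, diff


# The text editor also changed 'a1="' inside character data: a false change.
esis1 = ["Aa1 CDATA x", "(p", '-see a1="y"', ")p"]
esis2 = ["Aa2 CDATA x", "(p", '-see a2="y"', ")p"]
leftovers, diff = check_rename(esis1, esis2, "a1", "a2")
print(leftovers)
print("\n".join(diff))
```

An empty leftovers list together with an empty diff means the rename touched attributes only.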

A similar technique can be applied to more complicated tasks. You can check that two documents are equal without regard to whitespace. If a structured document is being localized, you can check that the translator did not accidentally change the structure.
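A whitespace-insensitive comparison could be sketched in Python as follows. The function names are hypothetical. The sketch treats the \n and \012 escapes in data lines as whitespace, merges consecutive data lines, collapses runs of whitespace, and drops data that is whitespace only. All other lines are compared exactly.

```python
import re

# An escaped backslash is kept; \n and \012 are treated as whitespace.
_ESC = re.compile(r"\\\\|\\n|\\012")


def normalize(lines):
    """Return Esis lines with whitespace in character data normalized."""
    out, buf = [], []

    def flush():
        text = " ".join("".join(buf).split())
        if text:
            out.append("-" + text)
        buf.clear()

    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("-"):
            buf.append(_ESC.sub(
                lambda m: m.group(0) if m.group(0) == "\\\\" else " ",
                line[1:]))
        elif line:
            flush()
            out.append(line)
    flush()
    return out


def same_ignoring_whitespace(esis_a, esis_b):
    """True if two Esis streams differ only in whitespace within data."""
    return normalize(esis_a) == normalize(esis_b)


a = ["(para", "-\\n\\012Hello,   World!", ")para"]
b = ["(para", "-Hello, World!\\n", ")para"]
print(same_ignoring_whitespace(a, b))
```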

Sometimes I use WinDiff to examine differences between documents. In some cases it is more convenient to check the delta in Esis format than in SGML/XML format.

Spellchecking

I found this example on the Perl-XML mailing list.

From: Matt Sergeant
Subject: Using PYX and GNU diction
Date: Tue, 09 May 2000 15:05:55 -0700

This is very cool, I just had to share it...

Get diction from: http://www.gnu.org/software/diction/diction.html

And pyx, from the XML::PYX distribution

Then type:

pyx <xmlfile> | grep ^- | perl -pe 's/^-//; s/\\n\n//;' | diction

You can substitute "diction" for "style" too.

You'll have to install diction to see what it does - it's just way too cool to even describe... now I have to go re-write all my articles and texts...

Should be a great boon for XML based authors.

See also

[NSGMLS] Nsgmls
Nsgmls parses and validates the SGML document and prints a simple text representation of its Element Structure Information Set.
[SPOUT] http://www.jclark.com/sp/sgmlsout.htm
Nsgmls output format.
[BRUCE] Bruce Martin "Build distributed applications with Java and XML", JavaWorld
The article explores applications that communicate and process XML between computers. It contains benchmarks for three methods: XML as text, and serialized representation via Java RMI and via CORBA-IIOP.
[SEAN] Sean McGrath "Pyxie", xml.com
The article is an overview of PYX notation with examples of usage. Pyxie's popularity started with this article.
Edd Dumbill "Pyxie Perfect", xml.com
Announcement of Perl (XML::PYX) and Java PYX support.
Edd Dumbill "Keep it Simple...", xml.com
Edd Dumbill's philosophical look at Pyxie.
[PYXIE] http://www.pyxie.org/
Pyxie project homepage.
[SHAWN] http://members.home.net/sfs/xml/
Original Java code (SAXToPYX.java and PYXToSAX.java) by Shawn Silverman.
[WAIS] freeWAIS-sf
FreeWAIS-sf is a system for indexing structured text documents. It can index XML files and can be used as an XML database.


http://uucode.com/xc/index.html
Oleg A. Paraschenko <olpa uucode com>