The goal of the XmlConnect project is to transfer XML data between applications with minimal overhead. The modules use a subset of the Nsgmls output format [SPOUT] (Esis strings) and are compatible with PYX [SEAN] notation.
XML version:

    ...
    <?markid L000012?>
    <para lang="en" align="center">
    Hello, World!</para>
    ...

Plain-text notation (ESIS strings):

    ...
    ?markid L000012
    Alang CDATA en
    Aalign CDATA center
    (para
    -\n\012Hello, World!
    )para
    ...
Packages contain some tests/examples that can be useful:
If applications are going to interchange XML (for example, a client asks a database for data), one program must convert its in-memory representation of XML to some format, and the second program must convert this format back into memory. If the applications are written in the same programming language, the simplest way is to use serialization libraries. But Bruce Martin [BRUCE] says: "Surprisingly, the performance results presented below indicate that a textual representation of XML is a far more efficient representation than a serialized DOM representation. Also, the time required to externalize a DOM representation and reparse the textual form is cheaper than the direct Java serialization and deserialization of the DOM."
One reason why serialization is not very effective for XML interchange is that serialization is designed for general-purpose use. An XML tree is a specific memory structure, and interchange of this structure can be optimized. Instead of the tree itself, we can transfer portions of information from which the original tree can be restored. Another approach to optimizing the transfer of XML data is to minimize parsing time (Minimal/Common/Binary XMLs). If we combine interchange of the in-memory XML tree with interchange of preparsed XML, we get interchange by SAX events.
While searching for the best method of transferring SAX events, I remembered the program Nsgmls [NSGMLS] by James Clark. This program is well known to SGML/XML developers. It parses SGML or XML documents and prints their structure as ESIS (element structure information set) strings. The set of these strings is not equal to the set of SAX events, but many of the strings are not significant for XML interchange, so we can select a subset of them. This subset is referred to below as Esis strings or Esis notation.
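To make the link between SAX events and Esis strings concrete, here is a minimal sketch in Python using the standard xml.sax module. It is not the code of the XmlConnect modules. The names EsisWriter and xml_to_esis are hypothetical, every attribute is reported as CDATA, and a newline in character data is written simply as \n, which is a simplification of what Nsgmls prints.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def escape(text):
    """Keep each string on one line: escape backslash first, then newline."""
    return text.replace("\\", "\\\\").replace("\n", "\\n")


class EsisWriter(ContentHandler):
    """Collects one Esis-style string per SAX event (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._text = []  # pending character data; SAX may deliver it in pieces

    def _flush_text(self):
        # Emit all buffered character data as a single "-" string.
        if self._text:
            self.lines.append("-" + escape("".join(self._text)))
            self._text = []

    def processingInstruction(self, target, data):
        self._flush_text()
        self.lines.append("?%s %s" % (target, data))

    def startElement(self, name, attrs):
        self._flush_text()
        # Attribute strings come before the "(" string, as in the example above.
        # Assumption: every attribute is reported with the type CDATA.
        for attr_name in attrs.getNames():
            value = escape(attrs.getValue(attr_name))
            self.lines.append("A%s CDATA %s" % (attr_name, value))
        self.lines.append("(" + name)

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.lines.append(")" + name)


def xml_to_esis(xml_text):
    """Parse an XML string and return the list of Esis-style strings."""
    handler = EsisWriter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lines
```

Calling xml_to_esis on the small para document shown at the top of this page should give strings close to the Esis column of that example, apart from the simplified newline encoding.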
Here are the basic considerations on what functionality is required for general XML exchange based on Esis notation.
One of the big problems with XML is that good XML storage has not yet been implemented. The basic set of requirements is still being worked out: storing, extracting, and indexing XML; transactions and locking; user and programming interfaces; a GPL license. But many years ago, before the word XML was invented, scientists had already developed storage systems for tree data.
If we take such a system and define a simple representation of XML in terms of this system, we get a real XML database. Esis notation is ideal for such an attempt. If someone is working on it, please notify me.
By the way, do you need free XML storage with advanced search functionality? Look at freeWAIS-sf [WAIS]. We have been using freeWAIS for several years as a real XML database (it can index XML). The only problems are the lack of documentation and that online updating of the storage seems to be impossible.
Sometimes I work with invalid SGML documents. There are a lot of reasons why documents can be wrong:
In many cases the markup can be guessed and the document can be parsed. This can be done by a standalone program, which should save the result of parsing. The result of restoring the markup may still not be valid SGML or XML. Example:
... <a name="para1"/> <para>Here is a <b>sample text</b> <linefeed> on two lines.</para> ...
This example is the result of inserting SGML text with the empty tag linefeed into an XML document. The result is neither SGML (no such DTD) nor XML (<linefeed> should be <linefeed/>). So we have to work with the result using non-SGML/XML tools. Esis notation is a very good input format for such tools.
Someday I will write a non-conformant, non-validating SGML and XML parser, all in one. This parser will understand most markup styles (SGML, XML, hand-written HTML) and generate Esis strings.
Pyxie [PYXIE] is a Python library for processing XML, developed by Sean McGrath. I was working on my modules when Pyxie was announced. Pyxie used nearly the same method of representing XML (PYX notation) and became popular, so I added support for its format to my modules. Esis strings and PYX notation differ in format in only two ways:
Pyxie is a Python library, optimized for XML processing in Python. My modules add support for PYX notation to Perl and Java. But the default format of the modules is Esis strings, because I believe that compatibility with Nsgmls is a very important feature.
Some examples come from my own experience; others are taken from Pyxie articles and discussions. These examples are oriented toward text processing. Here x2p stands for any program that converts XML or SGML to Esis notation; I usually use Nsgmls for this. I created a code archive (pyx_usage.tar.gz) with examples of using Esis or PYX notation.
To count the para tags in a document, use:
$ x2p document | grep -i '^(para$' | wc -l
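The same count can be sketched in Python for cases where the Esis strings are already in memory. The function name count_elements is hypothetical. Like grep -i, it ignores case and matches the whole string, so (paragraph is not counted.

```python
def count_elements(esis_lines, name):
    """Count start-tag strings "(name" in a sequence of Esis strings."""
    target = "(" + name.lower()
    return sum(
        1 for line in esis_lines
        if line.rstrip("\n").lower() == target  # whole-line, case-insensitive
    )
```

For example, count_elements(open("document.esis"), "para") would count the para elements in a file of Esis strings; the file name is only an illustration.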
Sometimes I have to get the list of images associated with a document. Here is a method:
$ x2p document | grep '^f<OSFILE>' | sed 's/^f<OSFILE>//' | sort -u
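A rough Python equivalent of this pipeline, assuming (as in the command above) that file references appear as strings starting with f<OSFILE>. The function name referenced_files is hypothetical.

```python
def referenced_files(esis_lines, prefix="f<OSFILE>"):
    """Return the sorted, de-duplicated file names found in "f" strings."""
    names = {
        line.rstrip("\n")[len(prefix):]  # drop the prefix, keep the file name
        for line in esis_lines
        if line.startswith(prefix)
    }
    return sorted(names)  # same effect as "sort -u"
```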
Sometimes I make mass changes in structured documents. I usually make them with tools that are optimized for such changes. These tools usually know nothing about the structure of the document and can make false changes. To detect false changes, we can convert the original document to Esis notation and make the same changes there, taking the structure into account. After that we can compare this reference copy (the etalon) with the Esis representation of the modified structured document.
A typical example is renaming attributes. For example, we renamed all attributes a1 to a2 in the file file1 and saved the result in the file file2. We did it in an ordinary text editor by replacing the text a1=" with a2=". Now we should check that it was done correctly.
$ # Generate Esis notation of the modified structured document.
$ x2p file2 | grep -v ' IMPLIED$' >t-esis2
$ # Make the etalon Esis.
$ x2p file1 | grep -v ' IMPLIED$' >t-esis1
$ sed 's/^Aa1 /Aa2 /i' t-esis1 >t-etalon
$ # Check that we replaced all.
$ grep -i '^Aa1 ' t-esis2
$ # Check for false changes.
$ diff -i t-etalon t-esis2
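The same check can be sketched in Python with the standard difflib module. The function names are hypothetical, the input is two lists of Esis strings without trailing newlines, and unlike diff -i the comparison here is case-sensitive.

```python
import difflib
import re


def drop_implied(lines):
    """Remove attribute strings whose value is IMPLIED (like grep -v)."""
    return [line for line in lines if not line.endswith(" IMPLIED")]


def rename_attribute(lines, old, new):
    """Rename attribute `old` to `new`, touching only "A" strings."""
    pattern = re.compile("^A" + re.escape(old) + " ", re.IGNORECASE)
    return [pattern.sub(lambda m: "A" + new + " ", line) for line in lines]


def check_rename(original, modified, old, new):
    """Return (leftovers, diff); both are empty if the rename was clean."""
    etalon = rename_attribute(drop_implied(original), old, new)
    actual = drop_implied(modified)
    # Attributes that still carry the old name in the modified document.
    leftovers = [
        line for line in actual
        if line.lower().startswith("a" + old.lower() + " ")
    ]
    # Any difference from the etalon is a false (or missing) change.
    diff = list(difflib.unified_diff(etalon, actual, "etalon", "modified", lineterm=""))
    return leftovers, diff
```

If the text editor also replaced a1=" inside character data, the diff lists the affected data strings, while a clean rename gives two empty lists.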
A similar technique can be applied to more complicated tasks. You can check that two documents are equal without regard to whitespace. If a structured document is localized, you can check that the translator did not accidentally change the structure.
Sometimes I use WinDiff to examine differences between documents. In some cases it is more convenient to check the delta in Esis format rather than in SGML/XML format.
I found this example on the Perl-XML mailing list.
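As an illustration of the whitespace check, here is a small Python sketch. It assumes that whitespace in data strings is written either literally or as the escapes \n, \t and \012, and it treats data strings that contain only whitespace as insignificant. The function names are hypothetical, and adjacent data strings are not merged.

```python
import re

# A backslash followed by another backslash, "n", "t" or "012".
_ESCAPE = re.compile(r"\\(\\|n|t|012)")


def normalize(lines):
    """Collapse whitespace in data ("-") strings; drop whitespace-only ones."""
    result = []
    for line in lines:
        if line.startswith("-"):
            # Turn whitespace escapes into spaces, keep escaped backslashes.
            text = _ESCAPE.sub(
                lambda m: m.group(0) if m.group(1) == "\\" else " ", line[1:]
            )
            text = " ".join(text.split())
            if not text:
                continue
            line = "-" + text
        result.append(line)
    return result


def same_ignoring_whitespace(lines_a, lines_b):
    """True if two Esis documents differ only in whitespace inside data."""
    return normalize(lines_a) == normalize(lines_b)
```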
From: Matt Sergeant
Subject: Using PYX and GNU diction
Date: Tue, 09 May 2000 15:05:55 -0700
This is very cool, I just had to share it...
Get diction from: http://www.gnu.org/software/diction/diction.html
And pyx, from the XML::PYX distribution
Then type:
pyx <xmlfile> | grep ^- | perl -pe 's/^-//; s/\\n\n//;' | diction
You can substitute "diction" for "style" too.
You'll have to install diction to see what it does - it's just way too cool to even describe... now I have to go re-write all my articles and texts...
Should be a great boon for XML based authors.
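The text-extraction step of that pipeline (everything before diction) can also be sketched in Python. This is an approximation of the grep and perl stages, not a copy of them: it keeps the data strings, strips the leading "-", and decodes the escapes \n, \t and \\, which are assumed here to be the only ones in use. The function name pyx_text is hypothetical.

```python
import re

# Escapes assumed to occur in PYX data strings.
_UNESCAPE = {"n": "\n", "t": "\t", "\\": "\\"}


def pyx_text(pyx_lines):
    """Return the character data of a PYX stream as one plain-text string."""
    parts = []
    for line in pyx_lines:
        line = line.rstrip("\n")
        if line.startswith("-"):
            # Decode known escapes; leave unknown ones untouched.
            parts.append(
                re.sub(
                    r"\\(.)",
                    lambda m: _UNESCAPE.get(m.group(1), m.group(0)),
                    line[1:],
                )
            )
    return "".join(parts)
```

The result can then be written to a file or piped to a tool such as diction.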