joining entities

XML tools are good when the input data is XML, and awful when the data is only XML-like. As a result, instead of using "xmllint --noent", I had to write my own entity substitutor, "entity.py".

First, it accepts DTD file names such as "C:\Program Files\...", which xmllint does not.

Then the tool reads and drops the XML declaration, remembering the declared encoding, and tries to convert the data to UTF-8. If the conversion fails, the tool doesn't stop; it just prints an error message and returns an empty string.
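A minimal sketch of this step, not the actual code of entity.py. The function name `read_text`, the UTF-8 default, and printing to stderr are assumptions made for illustration. In Python 3 "converting to UTF-8" amounts to decoding the bytes into a `str`. A byte order mark is not handled here.

```python
import re
import sys

# XML declaration at the start of the data (bytes pattern).
XML_DECL = re.compile(rb'^\s*<\?xml\b[^>]*\?>', re.DOTALL)
# encoding="..." pseudo-attribute inside the declaration.
ENCODING = re.compile(rb'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def read_text(raw, default_encoding="utf-8"):
    """Drop the XML declaration and decode the rest; return '' on failure."""
    encoding = default_encoding
    decl = XML_DECL.match(raw)
    if decl:
        found = ENCODING.search(decl.group(0))
        if found:
            encoding = found.group(1).decode("ascii")
        raw = raw[decl.end():]          # drop the declaration itself
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError) as exc:
        # Don't stop: report the problem and hand back an empty string.
        print(f"cannot decode input as {encoding}: {exc}", file=sys.stderr)
        return ""
```

`LookupError` covers the case where the declaration names an encoding Python does not know.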

The next step is to read the DOCTYPE declaration and parse both the DTD and the internal subset, collecting the entity definitions.
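Collecting the definitions could look roughly like the sketch below. The regular expression and the name `collect_entities` are my own illustration, not taken from entity.py. It handles only internal and SYSTEM general entities; parameter entities and PUBLIC identifiers are deliberately skipped.

```python
import re

# <!ENTITY name "value">  or  <!ENTITY name SYSTEM "file">
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+(?P<name>[^\s%"\']+)\s+'
    r'(?P<system>SYSTEM\s+)?'
    r'(?P<quote>["\'])(?P<value>.*?)(?P=quote)'
    r'[^>]*>',
    re.DOTALL,
)

def collect_entities(dtd_text):
    """Return {name: (is_system, value)} for the general entities found."""
    entities = {}
    for m in ENTITY_DECL.finditer(dtd_text):
        is_system = m.group("system") is not None
        # In XML the first declaration of a name is binding; later ones are ignored.
        entities.setdefault(m.group("name"), (is_system, m.group("value")))
    return entities
```

Because the first declaration wins, running this over the internal subset before the external DTD (and merging in that order) matches how XML processors give the internal subset priority.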

Entities are substituted with regexps. Fortunately, the data has no CDATA sections, where references would have to be left untouched, so this is not hard. If an entity is a system entity, the tool recursively includes the corresponding file.
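A hedged sketch of the substitution, assuming the definitions are held as a dict `{name: (is_system, value)}`. The function `expand`, the depth limit, and reading included files as UTF-8 are assumptions for the example, not the real tool's behaviour.

```python
import os
import re

# &name; references; character references such as &#38; do not match.
ENTITY_REF = re.compile(r'&([A-Za-z_:][\w.:-]*);')

def expand(text, entities, base_dir=".", depth=0, max_depth=20):
    """Replace &name; references using entities = {name: (is_system, value)}."""
    if depth > max_depth:
        raise RecursionError("entity nesting too deep (possible cycle)")

    def replace(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)       # leave &amp;, &lt; and unknown names alone
        is_system, value = entities[name]
        if is_system:
            # System entity: include the file, then expand its content too.
            path = os.path.join(base_dir, value)
            with open(path, encoding="utf-8") as f:
                value = f.read()
            return expand(value, entities, os.path.dirname(path), depth + 1, max_depth)
        return expand(value, entities, base_dir, depth + 1, max_depth)

    return ENTITY_REF.sub(replace, text)
```

The depth limit is there because an entity that refers to itself, directly or through an included file, would otherwise recurse until Python gives up.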

None of this is hard, but it does require some care.
