the fall of XPath over filesystem

Many XPath tutorials use file paths as an analogy for XPath. While this is fine from a high-level point of view, the analogy is misleading, and actual technical implementations (one, another, mine) are kludges (mine, at least). Here are some issues.

The first one is the user interface. When a node (a file) is matched, what should be printed to the user:

/usr/bin/find, or the verbose form /node[name()='usr']/node[name()='bin']/node[name()='find']?

The other issues are technical.

The second issue: the file system isn’t a tree; there are symbolic links. On the one hand, as a user, I want the XPath “.//*[match(name(),'*.c')]” to find matches in the folder “src” even if that folder is actually a symbolic link. On the other hand, symbolic links can create hard-to-detect infinite loops.

Third. File systems have many features. One of them allows creating a folder whose children can’t be listed, although a subdirectory can still be entered if its name is known. Suppose a site lives in the folder “/var/web/pub/uVc7k/” and “pub” is such a folder. Then the XPath “//*[match(name(),'*.html')]” doesn’t find the site’s HTML files.
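The effect is easy to reproduce on a POSIX system: a directory with execute (“search”) permission but no read permission can be entered but not enumerated. A sketch reusing the folder names from the example above (it assumes a non-root user, since root bypasses these permission checks):

```python
import os
import stat
import tempfile

# Recreate the ".../pub/uVc7k/" situation inside a scratch directory.
base = tempfile.mkdtemp()
site = os.path.join(base, "pub", "uVc7k")
os.makedirs(site)
open(os.path.join(site, "index.html"), "w").close()

pub = os.path.join(base, "pub")
os.chmod(pub, stat.S_IXUSR)        # --x------: enterable, not listable

try:
    os.listdir(pub)                # a "//*"-style tree walk stops here
    print("listing worked (running as root?)")
except PermissionError:
    print("cannot list", pub)

# ...yet the known path still resolves:
print(os.path.exists(os.path.join(site, "index.html")))

os.chmod(pub, stat.S_IRWXU)        # restore so the scratch dir is removable
```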

Fourth. XPath tutorials suggest nice XPath expressions like “/usr/bin/find”. Unfortunately, actual expressions look like /node[name()='usr']/node[name()='bin']/node[name()='find'], because file names are not limited to ASCII letters and digits.
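For illustration, the mechanical translation from a plain path into the verbose form fits in a few lines of Python (the helper names here are hypothetical). The only subtlety is quoting: XPath 1.0 string literals have no escape sequences, so a name containing both quote characters must be assembled with concat():

```python
def xpath_literal(s):
    # Build an XPath 1.0 string literal for s.
    if "'" not in s:
        return "'%s'" % s
    if '"' not in s:
        return '"%s"' % s
    # s contains both quote kinds: stitch the pieces with concat().
    pieces = []
    for i, part in enumerate(s.split("'")):
        if i:
            pieces.append('"\'"')          # the single quote itself
        if part:
            pieces.append("'%s'" % part)
    return "concat(%s)" % ", ".join(pieces)

def path_to_xpath(path):
    # "/usr/bin/find" -> "/node[name()='usr']/node[name()='bin']/node[name()='find']"
    return "".join("/node[name()=%s]" % xpath_literal(seg)
                   for seg in path.split("/") if seg)
```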

Fifth, the situation is even worse. Many file systems allow any characters in file names, excluding “\000” and “/” but including the characters with codes “\001”, “\002”, and so on, which are forbidden in XML.
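To make the claim precise: the XML 1.0 Char production admits, among control characters, only tab, line feed, and carriage return, while a POSIX file name may contain any byte except “\000” and “/”. A small check, for illustration:

```python
def xml10_char_ok(ch):
    # XML 1.0 "Char" production:
    #   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    cp = ord(ch)
    return (cp in (0x09, 0x0A, 0x0D)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)
```

So a file named, say, "report\001.txt" is legal on disk yet cannot appear in a well-formed XML document at all, not even as a character reference.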

Sixth, and finally, the main issue: I just don’t see what practical problem could be solved by XPath over a file system. I hope I have overlooked something.

3 Responses to “the fall of XPath over filesystem”

  1. Philippe Poulard Says:

    I think that the first 3 issues are not related specifically to XPath: any tool that has to deal with file systems should propose a solution to them.

    The fourth point is somewhat unsettling; I simply wouldn’t use a system that writes paths like this: /node[name()='usr']/node[name()='bin']/node[name()='find']; my own implementation really does understand paths like this: /usr/bin/find

    The fifth point is a real problem; although files with codes like “\001”, “\002” can’t be named in XPath expressions, they can be handled anyway (“//*”). I have encountered that issue with legal XML characters that are illegal in XML names, and for such a case I use the alternative syntax *[name()='my file'] (as “my file” contains a white space); codes like “\001”, “\002” could be escaped; fortunately, these cases occur rarely.

    Sixth, I agree that using XPath on the command line to handle files won’t help much more than the useful “find”-like commands. However, using it in another environment may help; for example, instead of the numerous elements that define file sets in Ant (<fileset>, <dirset>, <include>, <exclude>, etc., and other ugly things like “**/*.jar”), we could consider XPath expressions (this is an exercise that I have submitted to my students).

    To conclude, my own experience of using XPath to walk through file systems is a success, certainly because I used it within a real XML-based system. When I have a set of XML files to transform with XSLT, I simply write it like this:

    <xcl:for-each name="file"
      select="{ io:file('/path/to/dir')//*[@io:is-file] }">
         <xcl:transform source="{ $file }"
             output="{ substring-before( $file/@io:path , '.xml' ) }.html"/>
    </xcl:for-each>

    This is an appropriate solution on an appropriate system.

  2. olpa Says:

    Philippe, thanks for commenting. It has led me to re-think the subject.

    I do agree that any reasonable implementation should support paths like /usr/bin/find. Mine does too.

    The problem with the codes \001, \002, etc. can also be worked around. Most systems check for bad characters only when parsing XML, and don’t check when manipulating XML in memory.
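    Python’s stdlib ElementTree shows this behaviour (as far as I can tell): the in-memory tree and the serializer accept a control character without complaint, and only the parser rejects it:

```python
import xml.etree.ElementTree as ET

node = ET.Element("file")
node.text = "report\x01.txt"   # accepted while manipulating in memory
data = ET.tostring(node)       # serialized without any character check

try:
    ET.fromstring(data)        # parsing is where the check finally happens
except ET.ParseError as err:
    print("re-parse failed:", err)
```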

    By mentioning Ant and your XSLT example, you’ve convinced me that practical use cases for XPath over filesystem do exist. Nice.

    In short, no issue on my list is a showstopper, and each can be worked around. But I still insist that XPath over the filesystem has failed, at least for my goals. The problem, which I failed to formulate in the original post, is the leaky abstraction. (For several minutes, I was distracted by re-reading “The Law of Leaky Abstractions”.) All these small problems, taken together, lead to a leaky abstraction.

    Initially, my XPath over filesystem was developed as a reference implementation for deploying my XML Virtual Machine. Unfortunately, the leakage meant worrying about lots of unrelated details instead of actually deploying.

    And I still have no good practical idea for a reference implementation.

  3. olpa Says:

    From a private comment by Erik Wilde (one of the authors of “XPath Filename Expansion in a Unix Shell“):

    interesting. partly, you are right. and this is the reason why we decided not to let the user enter the xpath directly. instead, we created an xpath-like syntax which is then mapped internally to a real xpath. /node[name()='usr']/node[name()='bin']/node[name()='find'] can be avoided this way, or even better, we don’t even use the xml element name for carrying the file name, so we don’t have problems with characters which are not allowed in xml names. however, characters which are not allowed in *xml* itself are still a problem…

    so why could it be useful? i think that as a productivity tool it could be quite useful for sysadmins and other power users, but we also never got around to implementing enough adaptors to make it really interesting…
