find with XPath over file system

/* Read also “The Fall of XPath Over Filesystem”. */

The standard UNIX utility find now supports XPath:

$ ./find -xpath '/bin/*[@size > /bin/bash/@size]'
/bin/ipv6calc
/bin/rpm

XML can be considered as an external representation of in-memory tree-like structures, and the XML-related standards -- as methods of processing such data. I'm working on a reusable XPath implementation, and the new find is a reference implementation for adding XPath into the C programs.

User's reference

The new find is as compatible as possible with the original find. The issues are given later.

An XPath expression is specified using the command line parameter -xpath. If it isn't given, the default XPath expression descendant-or-self::* is used.

The name of a element is the name of the file. Each element has attributes:

The attribute name is the name of the file. All other attributes represent fields of the struct stat from the sys/stat.h.

User-level issues

At the moment, I'm using a naive XPath implementation. Don't expect the document ordering or the uniqueness of nodes in the result. Somewhen I'll fix it.

Actions (for example, printing file names) are executed after collecting files, not during tree traversal.

The command-line parameters -depth and -noleaf are ignored.

You can use the variable base which is the reference to the base directory:

./find -xpath './/*[@ino = $base/somefile/@ino]' -ls

Going to the parent directory outside the base directory increases the depth level, so the -maxdepth parameter may affect parent steps.

The -maxdepth parameter affects XPath predicates. For example, the XPath expression a[b/c] always returns nothing if max depth is 2 because the file a/b/c is never seen.

The parameter -xdev is ignored when getting the parent directory or the root directory (programmer's fault).

The name of a file appears twice, as the name of the element and as the value of the attribute name. The former allows to write simple XPaths like /bin/cat and the latter allows to use the name as XPath step: dir[res1/@name=res2/@name].

There are some XPath issues due to the programmer's issues. They are discussed below.

Notes for programmers

There is the Scheme programming language, there is the SXML tools for S-expression-based XML parsing, query and conversion. There is Guile, an embeddable Scheme interpreter. We combine all together in one application.

The application uses XPath engine (currently txpath edition) from the SXML library. The file system tree is exposed as the SXML tree to the library.

The binding is "lazy", so nodes in the SXML tree are instantiated on demand. Guile was patched to allow lazy binding.

In order to switch between the file metadata (C layer) and nodes in the SXML tree (Scheme layer), the application maintains a map between them.

S-expressions have no parent links, so the parent and related steps in SXPath are quite expensive. But in the application, due to metadata, it's very easy to find the parent node for a node in the SXML tree. We exploit it and redefine some SXPath axes in the file sxpath-native.scm.

The map misses SXML attribute's nodes, so getting the parent of an attribute node doesn't work, and XPaths like file/@name/.. produce nothing (programmer's fault).

There are some differences between the file system tree and the XML tree. At first, the paths /bin and ../../../../bin may point to the same directory, but they look different for users. At second, we can lazily walk from the base directory to the root and at the same time descent from the root. A special attention should be paid to a meeting point. But I decided to ignore it due to the first issue, so nodes from relative and absolute XPaths are never equal.

The main issue is garbage collection. I have no experience with it, so I don't know if GC is safe in the program. For example, can Guile delete an SXML node even if the map has a link to it? I'm going to investigate the issue, but not right now.

Download and install

First install the Guile with the lazy patch: http://uucode.com/texts/lazypair/index.html.

Then download findutils 4.1.20 and compile them.

Download the changed source code of find, copy the files to the folder find in the findutils source tree. Touch the updated files:

$ pwd
...../findutils-4.1.20/find
$ touch find.c parser.c defs.h

Update the Makefile:

Now compile the new find and run it to test if it works. You might need to set up the environment variable LD_LIBRARY_PATH.

$ export LD_LIBRARY_PATH=...../guile-1.6.6/opt/lib
$ ./find
.
./sxpath-native.scm
./xpath-parser.scm
...

That's all.

Instead of compiling from sources, you can try to use the binary version (you still need the package with source code for scm-files). The program requires the following shared libraries: find.so (included), libdl.so.2, libm.so.6, libcrypt.so.1, libguile-ltdl.so.1, libc.so.6, libpthread.so.0, /lib/ld-linux.so.2. You might need to set up the LD_LIBRARY_PATH to help the system in finding the find.so.

Final notes

Some references are given in the section "Notes for programmers". Further references and background information can be found in the paper "An approach to portable implementation of the XQuery language".

Author is Oleg Paraschenko <olpa uucode com>, this page's URL is http://uucode.com/texts/xfind/.


http://uucode.com/texts/xfind/index.html
Oleg A. Paraschenko <olpa uucode com>