Fun with XSLT
Posted in syntax database on June 8th, 2008 by CapnKirk
The whole point of using XML is that one can easily transform it into another form, whether XML or not. And it is easy…until you want to do something complicated!
My goal was to convert our in-house syntax database into something “lean and mean”: small in size and free of the extraneous data (mostly embedded in attributes) that had accumulated ad hoc during development, largely to serve the needs of the parser. I also wanted to rename all the elements and attributes to something closer to the usual jargon of biblical scholars and Bible software vendors, the primary users of the raw XML form of the data.
An XSLT processor reads an XML document into memory, then starts at the top of the document and traverses each node, from parent to child and from sibling to sibling, through the document hierarchy. Along the way, the processor applies the best-matching “template” to each node it visits.
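To make the traversal concrete, here is a minimal sketch of the usual starting point for this kind of work: an identity transform plus one override. It assumes XSLT 1.0, and the element name "note" is hypothetical, chosen only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy each node or attribute unchanged,
       then recurse into its attributes and children. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- A more specific match takes precedence over the identity template.
       "note" is a hypothetical element name; the empty template body
       drops the element together with its whole subtree. -->
  <xsl:template match="note"/>

</xsl:stylesheet>
```

Every node falls through to the identity template unless a more specific template claims it, so each transform only has to describe what should change.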
In my innocence, I thought I would write one XSLT file and do everything I wanted in one go. Wrong. It’s not a simple matter to flatten the hierarchy (remove a layer of elements at every point) in two places at once, rename all the different elements and attributes in the same pass, and, oh yes, also strip from the terminal nodes a bunch of attributes that are not relevant for users.
The problem is that doing all those things in a single downward, recursive pass through the document hierarchy gets very complicated. There are ways around the problem, but they get really hairy. On the other hand, all the tasks except the global renaming are simple and straightforward if they are done separately. The trade-off is that one must run the XSLT processor several times, four in this case. I don’t know whether fewer runs would be faster, since the XSLTs would be a lot more complicated, but disk read/write times are rarely the limiting factor these days, as they once were. So I wrote four transforms and automated their invocation in a script.
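As a rough sketch of what two such single-task passes might look like (not the actual transforms, which are not reproduced here): the first removes one layer of elements, the second strips unwanted attributes from terminal nodes. It assumes XSLT 1.0, and the names "wrapper", "w", "parserFlag" and "debugId" are all hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Pass A (hypothetical): flatten one layer of the hierarchy. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy everything unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace each "wrapper" element with its children.
       Note that the wrapper's own attributes are discarded. -->
  <xsl:template match="wrapper">
    <xsl:apply-templates select="node()"/>
  </xsl:template>

</xsl:stylesheet>
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Pass B (hypothetical): drop parser-only attributes from terminal nodes. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy everything unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- An empty template suppresses the matched attributes on "w" elements. -->
  <xsl:template match="w/@parserFlag | w/@debugId"/>

</xsl:stylesheet>
```

Each pass is the identity transform plus a single override, and each reads the output of the previous one, which is what keeps the individual stylesheets easy to reason about.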
The hardest task was the global renaming of elements and attributes. Changing one element or attribute is trivial, but renaming everything globally in one transform is hard. But there is a solution. This task turns out to be a common one in the XML world, so a general solution is available: a table-driven rename transform, which one calls with a simple table of “from” and “to” names. I found the solution in Sal Mangano’s XSLT Cookbook, a set of “recipes” for “standard” tasks and problems.
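The table-driven idea can be sketched roughly as follows. This is not the Cookbook recipe itself, only a minimal illustration of the same approach, assuming XSLT 1.0 and a source document without namespaces. The "ren" namespace and every name in the table are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:ren="urn:example:rename">

  <!-- The rename table, embedded in the stylesheet as top-level data.
       All "from" and "to" names here are hypothetical. -->
  <ren:table>
    <ren:element from="Node" to="phrase"/>
    <ren:element from="Leaf" to="word"/>
    <ren:attribute from="Cat" to="category"/>
  </ren:table>

  <!-- document('') returns the stylesheet itself, which gives the
       templates below access to the embedded table. -->
  <xsl:variable name="table" select="document('')/*/ren:table"/>

  <!-- Elements: use the mapped name if the table has an entry,
       otherwise copy the element under its original name. -->
  <xsl:template match="*">
    <xsl:variable name="to"
                  select="$table/ren:element[@from = name(current())]/@to"/>
    <xsl:choose>
      <xsl:when test="$to">
        <xsl:element name="{$to}">
          <xsl:apply-templates select="@*|node()"/>
        </xsl:element>
      </xsl:when>
      <xsl:otherwise>
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Attributes: the same lookup against the attribute entries. -->
  <xsl:template match="@*">
    <xsl:variable name="to"
                  select="$table/ren:attribute[@from = name(current())]/@to"/>
    <xsl:choose>
      <xsl:when test="$to">
        <xsl:attribute name="{$to}">
          <xsl:value-of select="."/>
        </xsl:attribute>
      </xsl:when>
      <xsl:otherwise>
        <xsl:copy/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Text, comments and processing instructions pass through unchanged. -->
  <xsl:template match="text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```

The attraction of this design is that the templates never change; only the table does, so adding or correcting a rename is a one-line edit.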
The moral of this tale: unless you’re very experienced in XSLT, consider limiting each transform to a single task, or to a few similar tasks, and running several transforms in succession.
One interesting problem arose during this process: what am I going to do about discontinuous clauses (left-dislocation, etc.)? In order to preserve the syntax tree, the terminal nodes (the stream of text) have to be disrupted. To put it in XML terms, we have two different information hierarchies, and an XML document is designed to hold only one hierarchy at a time. More on this anon.
Now I can return to the business of importing my “lean & mean” database file into Emdros.

