Fun with XSLT

Posted in syntax database on June 8th, 2008 by CapnKirk

The whole point of using XML is that one can easily transform it into another form, whether XML or not. And it is easy…until you want to do something complicated!

My goal was to convert our in-house syntax database into something “lean and mean”: small in size and free of the extraneous data (mostly embedded in attributes) that had accumulated ad hoc during development, largely to serve the needs of the parser. I also wanted to rename all the elements and attributes to something closer to the usual jargon of biblical scholars and Bible software vendors, our primary users of the raw XML form of the data.

An XSLT processor reads an XML document into memory, and then starts at the top of the document, traversing each node from parent to child and from sibling to sibling in the document hierarchy. Along the way, the processor applies the best-matching “template” to each node it visits.
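To make the traversal concrete, here is a minimal Python sketch that loosely mimics it. It is an analogy, not how any real XSLT processor is implemented: a real processor descends only where a template asks it to (via xsl:apply-templates), whereas this sketch always descends. The element names and the "templates" table are made up for illustration.

```python
import xml.etree.ElementTree as ET

def walk(node, templates, out, depth=0):
    """Visit a node, then its children in document order."""
    # Pick the handler registered for this tag, else the catch-all "*".
    handler = templates.get(node.tag, templates.get("*"))
    if handler is not None:
        out.append(handler(node, depth))
    for child in node:  # parent to child, sibling to sibling
        walk(child, templates, out, depth + 1)

# Hypothetical element names, for illustration only.
SAMPLE = "<sentence><clause><word>a</word><word>b</word></clause></sentence>"

# Hypothetical "templates": one for <word>, one catch-all.
templates = {
    "word": lambda n, d: f"{'  ' * d}word: {n.text}",
    "*": lambda n, d: f"{'  ' * d}<{n.tag}>",
}

def render(xml_text):
    out = []
    walk(ET.fromstring(xml_text), templates, out)
    return out

if __name__ == "__main__":
    print("\n".join(render(SAMPLE)))
```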

In my innocence, I thought I would write one XSLT file and do everything I wanted in one go. Wrong. It’s not a simple matter to flatten the hierarchy (remove a layer of elements at every point) in two places at the same time and, in the same pass, rename all the different elements and attributes. Oh, yes, and strip from the terminal nodes a bunch of attributes that are not relevant for users.

The problem is that doing all those things at the same time, in one recursive downward pass through the document hierarchy, gets very complicated. There are ways around the problem, but they get really hairy. On the other hand, all the tasks except the global renaming are simple and straightforward if they are done separately. The trade-off is that one must run the processor several times, four in this case. I don’t know whether fewer runs would be faster, since the XSLTs would be a lot more complicated, but disk read/write time is rarely the limiting factor these days, as it once was. So I wrote four transforms and automated their invocation in a script.
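The one-task-per-pass idea can be sketched in Python. This is an in-memory analogy only: my actual pipeline ran XSLT files one after another, and the names here (Wrap, Node, parserHint) are invented. Each pass does one simple thing, and a small driver chains them.

```python
import xml.etree.ElementTree as ET

def flatten(elem, tag):
    """Remove every <tag> wrapper below elem, promoting its children.
    (Any text directly inside a removed wrapper is dropped.)"""
    kept = []
    for child in list(elem):
        flatten(child, tag)           # handle deeper levels first
        if child.tag == tag:
            kept.extend(list(child))  # splice grandchildren into place
        else:
            kept.append(child)
    for child in list(elem):
        elem.remove(child)
    elem.extend(kept)
    return elem

def drop_attributes(root, names):
    """Delete the named attributes wherever they occur."""
    for el in root.iter():
        for name in names:
            el.attrib.pop(name, None)
    return root

def run_pipeline(xml_text, passes):
    """Apply each single-purpose pass in order."""
    root = ET.fromstring(xml_text)
    for step in passes:
        root = step(root)
    return ET.tostring(root, encoding="unicode")

# Hypothetical input and pass list, for illustration only.
SAMPLE = '<Tree><Wrap><Node parserHint="x" cat="np"><Wrap><Leaf/></Wrap></Node></Wrap></Tree>'
PASSES = [
    lambda r: flatten(r, "Wrap"),
    lambda r: drop_attributes(r, ["parserHint"]),
]

if __name__ == "__main__":
    print(run_pipeline(SAMPLE, PASSES))
```

Each pass is short enough to check by eye, which is the whole appeal of splitting the work up.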

The hardest task was the global renaming of elements and attributes. Changing one element or attribute is trivial, but renaming all of them in a single transform is hard. There is a solution, though. It turns out that this task is a common one in the XML world, and so a general solution to the problem is available. It is a table-driven rename transform, which one then calls with a simple table of “from” and “to” names. I found the solution in Sal Mangano’s XSLT Cookbook, a set of “recipes” for “standard” tasks and problems.
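The table-driven idea itself is easy to show outside XSLT. The Python sketch below is not the Cookbook recipe, just the same principle: the renaming logic is written once, and the “from” and “to” names live in plain tables. The names in the tables are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical "from" -> "to" tables; the real names differ.
ELEMENT_NAMES = {"Cl": "clause", "Ph": "phrase"}
ATTRIBUTE_NAMES = {"Cat": "category"}

def rename(root, element_names, attribute_names):
    """Rename elements and attributes according to the lookup tables.
    Anything not listed in a table keeps its original name."""
    for el in root.iter():
        el.tag = element_names.get(el.tag, el.tag)
        renamed = {attribute_names.get(k, k): v for k, v in el.attrib.items()}
        el.attrib.clear()
        el.attrib.update(renamed)
    return root

if __name__ == "__main__":
    tree = ET.fromstring('<Cl Cat="x"><Ph id="1"/></Cl>')
    print(ET.tostring(rename(tree, ELEMENT_NAMES, ATTRIBUTE_NAMES), encoding="unicode"))
```

Adding another rename means adding a row to a table, not touching the code.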

The moral of this tale: unless you’re very experienced in XSLT, consider limiting each transform to one task, or to a few similar tasks, and running several transforms in succession.

One interesting problem arose during this process: what am I going to do about discontinuous clauses (left-dislocation, etc.)? In order to preserve the syntax tree, the terminal nodes (the stream of text) have to be disrupted. To put it in XML terms, we have two different information hierarchies and an XML document was designed to handle only one hierarchy at a time. More on this anon.

Now I can return to the business of importing my “lean & mean” database file into emdros.

Hebrew database update

Posted in syntax database on June 4th, 2008 by CapnKirk

Some considerable time has been spent determining the best way to import the Hebrew syntax data (in XML) into emdros. There is an XML import format that emdros supports called TigerXML. Ulrik, emdros’ developer, and I played around with mapping our XML structure to TigerXML. We finally decided not to use this approach. The reason is that TigerXML models syntax as acyclic graphs. My data uses a hierarchy known as rooted syntax trees, and although trees can be represented in graph theory as directed acyclic graphs, the TigerXML approach divides the nodes into terminal and non-terminal, with “edges” defining the links between the nodes. This is a radically different model from my data, which is strictly hierarchical. That means mapping my data to a TigerXML-compliant file would be more work than it is worth.
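To show how different the two shapes are, here is a rough Python sketch that takes a nested tree and restates it as lists of terminals, non-terminals, and edges. It only illustrates the general graph-style layout; it does not produce actual TigerXML, and the element names are invented.

```python
import xml.etree.ElementTree as ET
from itertools import count

def to_graph(root):
    """Restate a nested tree as (terminals, nonterminals, edges)."""
    ids = count(1)
    terminals, nonterminals, edges = [], [], []

    def visit(el):
        node_id = next(ids)
        if len(el) == 0:
            # No children: a terminal node carrying the text.
            terminals.append((node_id, el.tag, el.text))
        else:
            nonterminals.append((node_id, el.tag))
            for child in el:
                # The nesting becomes an explicit parent -> child edge.
                edges.append((node_id, visit(child)))
        return node_id

    visit(root)
    return terminals, nonterminals, edges

if __name__ == "__main__":
    # Hypothetical miniature tree, for illustration only.
    t, nt, e = to_graph(ET.fromstring("<S><NP><w>a</w></NP><w>b</w></S>"))
    print(t, nt, sorted(e), sep="\n")
```

In the nested form the links are implicit in the containment; in the graph form every link has to be spelled out and kept consistent.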

So, back to the drawing board. The normal way one uses emdros is to create a schema or data model. Then one writes a script to generate the MQL statements that actually populate the database with objects from the data files. Here’s what we’ve accomplished so far:

  • Created a test xml file for Genesis 1:1 from the syntax database
  • Created the emdros schema for the test
  • Ulrik wrote a Python script to convert the test xml file to MQL statements
  • Using the schema and MQL of Genesis 1:1, I created a test database
  • Using the Emdros Query Tool, retrieved the tree
  • Checked the tree for correctness
  • Established the exact details for the “final” version of the Hebrew syntax xml files
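One part of the conversion in the list above can be sketched simply. Emdros locates every object by the monads (integer text positions) it covers, so a converter has to number the words and work out which positions each higher node spans. The Python below is my own rough illustration of that step, not Ulrik's script: the element names are invented, it assumes every non-terminal contains at least one word, and it stops short of writing any MQL.

```python
import xml.etree.ElementTree as ET

def assign_monads(root, terminal_tag="w"):
    """Return (tag, first_monad, last_monad) for every node, in post-order.
    Assumes each non-terminal has at least one child."""
    records = []
    next_monad = 1

    def visit(el):
        nonlocal next_monad
        if el.tag == terminal_tag:
            # Each word takes the next position in the text stream.
            first = last = next_monad
            next_monad += 1
        else:
            # A higher node spans from its first child to its last.
            spans = [visit(child) for child in el]
            first, last = spans[0][0], spans[-1][1]
        records.append((el.tag, first, last))
        return first, last

    visit(root)
    return records

if __name__ == "__main__":
    # Hypothetical miniature tree, for illustration only.
    sample = "<clause><phrase><w>a</w><w>b</w></phrase><phrase><w>c</w></phrase></clause>"
    for record in assign_monads(ET.fromstring(sample)):
        print(record)
```

A real script would go on to turn each record, plus its attributes, into a statement that creates the corresponding database object.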

Along the way we solved various problems and discovered a bug in the query tool. Ulrik fixed the bug with alacrity, adding a new “horizontal tree” display feature along the way. The results are impressive:

[Image: gn1-1tree.png]

The next steps:

  • Write an XSLT to transform the “final” version of the data files into the “final-final” version
  • Modify the schema to handle all the data attributes
  • Modify the xml2mql.py script to handle those changes
  • Test the new schema/script with the book of Ruth

I’ll report the results shortly.

Creating an annotated text database

Posted in writing, syntax database on May 20th, 2008 by CapnKirk

After completing the initial chapters of A Guide to the Westminster Hebrew Syntax Database, I realized that I was going to need clear and elegant examples of Hebrew clauses. The only practical way to find them is to have a queryable database. But all I have right now is 231MB of xml files.

The solution is to use emdros, a database search engine optimized for annotated text. It is open source, and I know the developer, Ulrik Sandborg-Petersen, personally. This makes it an excellent choice, with plenty of technical support. But best of all, emdros has a query language, MQL, which is in the same idiom that I think in when I think linguistically. That means there is no wrestling with the query language to translate what I want into something it understands. I simply lay it out and we’re ready to go.

I’ve committed to presenting on the process of using emdros for text projects at the Computer Assisted Research Group for the Society of Biblical Literature’s annual meetings this November. I thought it practical to keep notes on my experience here. They should be useful to you, the reader, as well.

Here is an overview of the entire process:

  1. Finalize the xml file format
  2. Write an XSLT transform to convert the files into TigerXML format
  3. Use the emdros utility tigerxmlimport to move the files into an emdros database
  4. Test the import with a test suite of queries with known/expected results
  5. Evaluate the query client for usefulness

Of course, it’s going to be more complicated than this as we get into it. But this is a start. Today I read all the documentation to get an idea of the path ahead. Then I installed emdros on both my Mac laptop and my Fedora 8 desktop. On the latter machine I compiled from the source code to get the faster i686 binary executable. It’s quite simple; just download the source code file and execute two commands:

rpmbuild -ta --target i686-redhat-linux emdros-3.0.1.tar.gz
yum install --nogpgcheck ~/rpmbuild/RPMS/i686/emdros-3.0.1-1.fc8.i686.rpm

This assumes you already have rpmbuild installed, along with a host of other stuff for compiling source code. But that is a tutorial beyond the scope of this post!

Somebody agrees with me!

Posted in raison d'être on May 19th, 2008 by CapnKirk

One of my driving motivations as a scholar is to be a rocket scientist. I want to do the math when analyzing and understanding literary texts! This is because I believe literary criticism is fundamentally an objective — yes, I said o-b-j-e-c-t-i-v-e — endeavor. Empirical methods count in the study of literature, too. This, of course, flies in the face of accepted wisdom in the Academy today. “Truth” cannot be known, authors do not rule, readers do, and all narrative is about political competition and oppression. I think that’s all wrong. And I’m not alone.

Jonathan Gottschall teaches English at Washington & Jefferson College. He is the author of the forthcoming book Literature, Science, and a New Humanities. Recently, he wrote an article for the Boston Globe, “Measure for Measure.” Here are some juicy quotes:

We literary scholars have mostly failed to generate surer and firmer knowledge about the things we study. While most other fields gradually accumulate new and durable understanding about the world, the great minds of literary studies have, over the past few decades, chiefly produced theories and speculation with little relevance to anyone but the scholars themselves.

Literary studies should become more like the sciences. Literature professors should apply science’s research methods, its theories, its statistical tools, and its insistence on hypothesis and proof.

Literary scholars should actually do science. “Literary science” may seem laughably, even pathetically, oxymoronic, but in fact it is already being done, with real results.

The great wall dividing the two cultures of the sciences and humanities has no substance. We can walk right through it.

Just like I’ve been saying for 25 years…

Winding down

Posted in writing on May 13th, 2008 by CapnKirk

Tomorrow is my last full day here in Wilmington. Thursday I head off to Greensboro to help Kevin move out of the dorm into his summer housing. I have done all I’m going to for Chapter 1 and Appendix A this time around. Tomorrow I concentrate on the ostensible subject of this writing project: Hebrew syntax! Nevertheless, the foundations needed to be laid. I doubt I could have done such complex thinking without the isolation I’ve had here to really concentrate. I’m going to need more such concentrated time in the near future. But the conclusion I’ve come to is that in order to make any real headway for the rest of the book I absolutely must have a practical way to query the database files. For although I call it a “database,” it is only a collection of data files. When I return to Philadelphia next week that will be my top priority. I have a plan.

Technical Note

Problem: I needed a list of all the unique names of attributes in the 231MB of XML files in the database. The names are documented in the source code, but nowhere else. It’s not practical to eyeball the files manually.

Solution: the simplest way to do this is to use Extensible Stylesheet Language Transformations (XSLT). XSLT is a powerful way to work with and manipulate XML files. It is one of the advantages of using XML for our database files. How powerful? With just a few lines:

[Image: xslt.jpg]

I could now execute the command:

xsltproc getAttributeNames.xsl ??.trees.xml | sort -u

The sort command with the “-u” parameter not only sorts the output of xsltproc but also outputs only one instance of each line of the input.
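For comparison, the same job can be sketched in Python. This is an alternative route to the same list, not a translation of the stylesheet above. It streams each file rather than loading it whole, gathers the attribute names into a set (which does the work of “-u”), and sorts at the end. The function name is my own invention; the file pattern is the one from the command above.

```python
import glob
import xml.etree.ElementTree as ET

def attribute_names(paths):
    """Return the sorted, de-duplicated attribute names found in the files."""
    names = set()
    for path in paths:
        # iterparse reports each element as its end tag is read.
        for _event, el in ET.iterparse(path, events=("end",)):
            names.update(el.attrib)  # adds the attribute names (dict keys)
            el.clear()               # discard processed content to save memory
    return sorted(names)

if __name__ == "__main__":
    for name in attribute_names(glob.glob("??.trees.xml")):
        print(name)
```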

“Simple,” said I? Yes. Arcane? For everyday work in XML? Not at all. A trivial example, really. The power of computing “under the hood” cannot be overstated.

The Log

I record here my progress according to the metrics of page count, word count, and pages read. The numbers are cumulative.

Today’s PDF    Page Count    Word Count    Pages Read
Click here     29            6601          611