I am now a PhD

July 4th, 2008

On June 27, 2008, I defended my PhD thesis in front of an august panel of professors.  I passed, so I guess I am now a Doctor. It feels great :-) .
The thesis can be had here:

Ulrik Sandborg-Petersen’s PhD thesis.

It’s all about how Emdros can help save cultural heritage.

The Emdros code has in fact progressed since the last release (which was 3.0.1), even though I have been busy writing the thesis and defending it. One of the upcoming goodies which will be in 3.0.2 is a tree-display in the Emdros Query Tool.  Yes, you can now display the TIGER corpus, for example, or the Penn Treebank, or the BLLIP corpus, or almost any other treebank, as trees right inside the Emdros Query Tool.

Ulrik

Emdros Query Tool: New Harvesting algorithms

March 19th, 2008

Since its inception by Hendrik Jan Bosman many years ago, the Emdros Query Tool has only had one harvesting algorithm. Well, until today, that is. Now it has four, including the old one.

The overall harvesting algorithm is:

  1. Execute the query. This results in a sheaf.
  2. Traverse the sheaf and gather a list of “hits”: One monad set for each “hit”.
  3. Traverse the sheaf and gather the big-union of the sets of monads in all matched objects whose “Focus” boolean is true. This is called the “sheaf focus monad set”.
  4. Get a set of raster monad ranges based on the list of “hits”. A “raster monad range” determines how much context to show around a set of monads corresponding to a “hit”. See below for how it is calculated.
  5. Get all “data units” and their features, based on the set of monads being the big-union of all raster monad ranges. A “data unit” is an object type whose objects must be shown for any given hit. Typical data units include “Word”, “Phrase”, “Clause”, “Sentence”, etc. The objects are retrieved using the MQL statement “GET OBJECTS HAVING MONADS IN”.
  6. Traverse the list of monad sets corresponding to the “hits”. For each monad set, calculate one “solution” to be: (i) The “hit” set of monads; (ii) The set of monads arising from taking all of the raster units that overlap with a stretch of monads in the “hit” set of monads. This is called the “raster monad set” for this solution; (iii) All data unit objects which have monad sets which overlap with the “raster monad set”; (iv) A “focus set of monads”, which is the intersection of the “raster monad set” and the “sheaf focus monad set”.
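
Step 6 is the one that ties everything together, so here is a minimal Python sketch of how one “solution” could be assembled from the pieces gathered in the earlier steps. This is not the Emdros source code (which is C++); the names DataUnit, Solution and build_solution are made up for the illustration, and monad sets are modelled as plain Python sets of integers.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple

Monads = FrozenSet[int]


@dataclass(frozen=True)
class DataUnit:
    """Hypothetical stand-in for one data unit object (step 5)."""
    object_type: str  # e.g. "Word" or "Phrase"
    monads: Monads


@dataclass(frozen=True)
class Solution:
    hit: Monads                       # (i) the "hit" set of monads
    raster: Monads                    # (ii) the "raster monad set"
    data_units: Tuple[DataUnit, ...]  # (iii) overlapping data unit objects
    focus: Monads                     # (iv) the "focus set of monads"


def build_solution(
    hit: Iterable[int],
    raster_ranges: Iterable[Tuple[int, int]],
    data_units: Iterable[DataUnit],
    sheaf_focus: Iterable[int],
) -> Solution:
    """Assemble one solution for one hit, following step 6 (i)-(iv)."""
    hit_set = frozenset(hit)

    # (ii) Union of all raster ranges (first, last) that overlap the hit.
    raster = set()
    for first, last in raster_ranges:
        span = set(range(first, last + 1))
        if span & hit_set:
            raster |= span
    raster_set = frozenset(raster)

    # (iii) Data unit objects whose monad sets overlap the raster monad set.
    units = tuple(u for u in data_units if u.monads & raster_set)

    # (iv) Intersection of the raster monad set and the sheaf focus monad set.
    focus = raster_set & frozenset(sheaf_focus)

    return Solution(hit_set, raster_set, units, focus)
```

For example, a hit on monads {5, 6} with raster ranges (1, 10) and (11, 20) would get the raster monad set 1 to 10, and only the data units and focus monads that fall inside that range.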

There are two changes to the harvesting algorithm which I have made today. The first relates to step #2 (gathering “hit” monad sets), and the second relates to step #4 (gathering raster monad ranges).

The first change (gathering “hit” monad sets) now has four ways to do it, as opposed to only one before today:

  • “outermost”: This is the old one which was already there. It simply traverses the sheaf, and for each outermost straw, it calculates one set of monads being the big-union of the monad sets of all matched objects which are direct children of each outermost straw. Naturally, this can get unwieldy if the outermost block is, say, a “book”.
  • “focus”: This calculates one “hit” monad set for each matched object whose “focus” boolean is “true”. The “hit” monad set is simply the monad set of the matched object.
  • “innermost”: This calculates one “hit” for each straw which satisfies the condition that all its children are terminals in the sheaf tree, i.e., none of the children have an inner sheaf. The “hit” is simply the big-union of the monad sets of all matched objects in such straws.
  • “innermost_focus”: Like “innermost”, but only does the big-union of the monad sets of those matched objects in the straw whose “focus” boolean is “true”.
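
To make the difference between the four strategies concrete, here is a small Python sketch over a toy model of a sheaf. It is only an illustration, not the Emdros Query Tool's actual code: the MatchedObject class and the function names are invented for the example, a sheaf is modelled as a list of straws, and a straw as a list of matched objects. Skipping straws without any focused object in “innermost_focus” is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Iterator, List, Optional


@dataclass
class MatchedObject:
    """Toy matched object: a monad set, a focus flag, and an inner sheaf."""
    monads: FrozenSet[int]
    focus: bool = False
    inner: Optional["Sheaf"] = None  # None or empty list means "terminal"


Straw = List[MatchedObject]
Sheaf = List[Straw]


def big_union(objects: Iterable[MatchedObject]) -> FrozenSet[int]:
    """Big-union of the monad sets of the given matched objects."""
    result = set()
    for mo in objects:
        result |= mo.monads
    return frozenset(result)


def walk(sheaf: Sheaf) -> Iterator[Straw]:
    """Yield every straw in the sheaf tree, outer straws first."""
    for straw in sheaf:
        yield straw
        for mo in straw:
            if mo.inner:
                yield from walk(mo.inner)


def is_innermost(straw: Straw) -> bool:
    """True if none of the straw's matched objects has an inner sheaf."""
    return all(not mo.inner for mo in straw)


def hits_outermost(sheaf: Sheaf) -> List[FrozenSet[int]]:
    # One hit per outermost straw: union of its direct children.
    return [big_union(straw) for straw in sheaf]


def hits_focus(sheaf: Sheaf) -> List[FrozenSet[int]]:
    # One hit per matched object whose focus boolean is true.
    return [mo.monads for straw in walk(sheaf) for mo in straw if mo.focus]


def hits_innermost(sheaf: Sheaf) -> List[FrozenSet[int]]:
    # One hit per straw whose children are all terminals.
    return [big_union(s) for s in walk(sheaf) if is_innermost(s)]


def hits_innermost_focus(sheaf: Sheaf) -> List[FrozenSet[int]]:
    # Like innermost, but only the focused objects contribute monads.
    # Assumption: a straw with no focused object yields no hit.
    hits = []
    for straw in walk(sheaf):
        if is_innermost(straw):
            monads = big_union(mo for mo in straw if mo.focus)
            if monads:
                hits.append(monads)
    return hits
```

With a clause containing a phrase of two words plus a third word, “outermost” gives one hit covering the whole clause, whereas “innermost” gives one hit covering only the two words inside the phrase.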

The “innermost” and “innermost_focus” algorithms are especially well suited to making concordance-views (which I’ll hopefully blog about at some point).

The second change is to step #4, which calculates the raster monad ranges. The old way was to specify an object type (a “raster unit”) whose objects would determine the context range of monads. This would be done with GET OBJECTS HAVING MONADS IN, using the big-union of all “hit” monad sets, and using the “raster unit” object type as the object type to GET. This method is still available.

The new way, however, specifies two context sizes, “raster_context_before” and “raster_context_after”: Two independent, positive integers which determine the raster context ranges. The algorithm is to traverse the list of “hit” monad sets, and for each set of monads, take the first monad minus “raster_context_before” as the first monad of the range, and take the last monad plus “raster_context_after” as the last monad of the range. Again, this is especially useful for concordance-type views.
This will appear in the next public release after 3.0.1.
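
The context-based calculation of raster monad ranges is simple enough to sketch in a few lines of Python. Again, this is an illustration rather than the actual implementation: the function name is hypothetical, and clamping the start of the range to monad 1 is an assumption of the sketch, not something stated above.

```python
from typing import Iterable, List, Set, Tuple


def raster_ranges_from_context(
    hits: Iterable[Set[int]],
    raster_context_before: int,
    raster_context_after: int,
) -> List[Tuple[int, int]]:
    """Return one (first, last) raster monad range per non-empty hit."""
    ranges = []
    for hit in hits:
        if not hit:
            continue  # an empty hit has no first or last monad
        first = min(hit) - raster_context_before
        last = max(hit) + raster_context_after
        # Assumption: never extend the range below monad 1.
        ranges.append((max(first, 1), last))
    return ranges
```

So a hit with monads {10, 11, 15}, a “raster_context_before” of 3 and a “raster_context_after” of 2 would be shown with the monads from 7 to 17 as its context.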

As always, if anyone is interested in having a preview, please contact me.

Until then,

Ulrik

Emdros 3.0.1 released

March 19th, 2008

It has been a while, but I forgot to mention that Emdros version 3.0.1 was released on February 17, 2008.

Ulrik

Emdros 3.0.0 released

January 27th, 2008

After more than 4 years in the making, Emdros version 3.0.0 has been released over at SourceForge.Net:

http://emdros.org/download.html

This started off as a branch off of the 1.1-series of Emdros, way back in 2004 (or was that 2003, even?). It then became a long series of preview releases, labelled 1.2.0.preXX (running internally to 1.2.0.pre269!). I should, of course, have released 2.0 way earlier. Now it became 3.0, simply because that is what it is, in terms of feature-additions.

By the way, the primary reason I haven’t been so publicly active around Emdros is that I have gotten married (hence the change of surname you’ll see below). Things have been moving internally, though, so 3.0.0 is actually a long ways from 1.2.0.pre262, the last public release.

Enjoy!

Ulrik Sandborg-Petersen

Emdros demo website online

July 19th, 2007

Today I’ve made an Emdros demo website available on the ‘net. Be sure to check it out!

It is butt-ugly for now, but it works. It gives the user MQL-query access to the Penn Treebank sampler available with the Natural Language Toolkit (NLTK). That is, about 1 million words of the WSJ corpus can now be searched online with the demo website.

Enjoy!

Emdros downloads approaching 16000

July 15th, 2007

Within the next 24 hours, I expect that the 16000th copy of Emdros and related files will be downloaded from SourceForge.Net. This does not include those copies that may have been downloaded from elsewhere.

Emdros was first released to the public on October 11, 2001, as version 1.0.3. Since then, around 45 releases have been made public. One Linux distribution (the Russian “Alt Linux”) has picked it up and included it in their portfolio of packages. Two companies have bought licenses, and incorporated it into their software, so that Emdros may be in use by thousands of people every day. At least four academic settings have used Emdros for meeting their own needs, including IRIT in Toulouse, France, who are using Emdros as the foundation for a concordancer used by linguists in their research. Several individuals have been very kind, and have written to me with requests for help and enhancements, and some have even contributed bugfixes. Emdros has taken me to two countries to meet with people who were interested in using Emdros, and my work on Emdros has led to several new friends, some of whom I have not yet met face to face, but only via the Internet. So, I have been truly blessed by the Lord in his making me able to produce Emdros.

Update: It has already happened, as of around 09:15 GMT, on 2007-07-15.

Emdros 1.2.0.pre262 released!

July 4th, 2007

I’ve released Emdros version 1.2.0.pre262 over at SourceForge.Net.  It contains all the goodies I’ve been blogging about since March 2007 (i.e., since the last release, which was 1.2.0.pre242).

To summarize, this release brings:

  • C# bindings, or rather, .Net bindings
  • marks on topographic blocks
  • expansion of the topographic part of the language
  • a TIGER XML importer

Enjoy!

Ulrik

MQL takes a leap closer to QL

June 8th, 2007

One of my customers told me that the new Wrap block wasn’t what they really needed. The main point of complaint was that there was an implicit power block at the beginning of the innards of the wrap block. It wasn’t intuitive enough.

I looked at my implementation, and realized that, in order to fix the problem, I had to essentially rewrite large parts of the implementation of the topographic part of MQL.

So that’s what I did. I struck gold when I came up with a very simple solution:

  1. Remove the wrap block completely. If I ever reinstate it, it will most likely be done right :-) .
  2. Move the part that drives the computation forward in the monad stream from the block_string to the blocks construct.
  3. Implement the concept of “StartMonadIterator”, which iterates over the possible start monads. There are three kinds of StartMonadIterator: One iterating over an Inst, one iterating over a set of monads, and one iterating over the gaps in a set of monads.
  4. Do what Doedens prescribed w.r.t. block_strings: Split them up into a number of levels, such that:
     a) Blocks are handled at the bottom level, at which level we also handle [grouping].
     b) At the next level up, handle the first level, plus Kleene Star on the first level.
     c) At the third level, handle the second level, and let the third level also be the level at which concatenation is handled.
     d) At the fourth level, handle the third level, and also OR between strings of blocks.
  5. Handle power blocks at the same level as all other blocks, but with the usual checks: A power block … a) Cannot appear at the beginning or end of a context, b) Two power blocks cannot stand next to each other.
  6. Allow any kind of block after a power block (except a power block).
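
The checks on power blocks in points 5 and 6 can be illustrated with a short Python sketch. It models a context as a flat list of block kinds, which is a simplification, and the kind labels and function name are hypothetical; the real checks live in the MQL implementation and may well look different.

```python
from typing import List

# Hypothetical labels for the kinds of block in a context.
OBJECT, GAP, POWER = "object", "gap", "power"


def check_power_blocks(context: List[str]) -> List[str]:
    """Return a list of rule violations for the power blocks in a context."""
    errors = []
    last_index = len(context) - 1
    for i, kind in enumerate(context):
        if kind != POWER:
            continue
        # a) A power block cannot appear at the beginning or end of a context.
        if i == 0:
            errors.append("power block at the beginning of a context")
        if i == last_index:
            errors.append("power block at the end of a context")
        # b) Two power blocks cannot stand next to each other.
        if i < last_index and context[i + 1] == POWER:
            errors.append("two power blocks next to each other")
    return errors
```

Note that a gap block directly after a power block passes the check, which reflects point 6: any kind of block except another power block may follow a power block.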

The bottom line is that MQL is much more powerful now, as a result of the following: a) We now have “real grouping” of strings of blocks; b) Kleene star can now apply to groups as well as to individual blocks; c) We can now have any kind of block after a power block, not just object blocks.

These three points may not seem to be “big”. But let me assure you that this is indeed one, nay two, quantum leaps upward for MQL in expressive power. That is, MQL is now much closer to what Doedens had envisaged, and what he described in his PhD thesis as the language “QL”.

I have added many regression tests to the regression test suite in order to test the new functionality, and the old regression tests all run without a hiccup. I have also run valgrind’s memory checker on the regression test program, and it comes up with 0 memory leaks. Finally, all of the test queries in my corpus of test queries against a “real” database come up with the same answers as before, except for the order in which straws from OR-separated block_strings appear.

So things are looking good.

Again, I still have no schedule for when these changes become public. If you want to try it out, please drop me a line.

Until then,

Enjoy!

Ulrik

Unofficial Emdros blog interview

May 28th, 2007

The “Unofficial Emdros blog” has an interview with yours truly.

The Unofficial Emdros blog is run by one of my very good friends, and mostly has tidbits bantered in friendly conversation.

Ulrik

C# bindings for Emdros

May 19th, 2007

I’ve added C# bindings to Emdros.  They might still be a little rough around the edges, but overall they seem to work pretty well.

A small example is given in the source code, showing how to map the C++ API to the SWIG bindings.  Also, some documentation is provided.  Plus, the source code now comes with the C# bindings pre-compiled as a DLL, built with Mono on Linux, but capable of running on Windows(R).

Again, there is no time-frame for when this gets released to the public.  Drop me a line if you feel you must try the new features out.

Ulrik