Status of Hebrew Fonts Under Linux


As of April 2, 2004

Kirk E. Lowery
Westminster Hebrew Institute


Current State of Affairs

A Web search turned up several issues with using the SBLHebrew font under Linux:
  • The central issue is "combining characters": the diacritics are not rendered properly.
  • Linux does not support, at the kernel level, a "Level 2" or "Level 3" implementation of ISO 10646-1, the levels that include combining characters.
  • X-Windows does not support Level 2 either. FreeType supports OpenType fonts but not combining characters; these must be handled by the application itself (word processor, browser, editor).
  • X-Windows may eventually resolve the issue, but not any time soon.

If SBLHebrew is going to be used under Linux, programs such as OpenOffice, Emacs and Firefox (Mozilla) will have to provide the support for combining characters themselves, and this is not likely.
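
To make the central issue concrete, here is a minimal Python sketch that lists the combining marks in a pointed Hebrew syllable. It uses only the standard `unicodedata` module; the function name `combining_marks` and the sample text are illustrative choices, not something taken from the sources surveyed here.

```python
import unicodedata

def combining_marks(text):
    """Return (code point, name) for every combining mark in text.

    A character counts as a combining mark when its Unicode general
    category is Mn, Mc or Me (all of which start with "M").
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if unicodedata.category(ch).startswith("M")
    ]

# Bet + dagesh + sheva: one base letter carrying two diacritics.
sample = "\u05d1\u05bc\u05b0"
for code_point, name in combining_marks(sample):
    print(code_point, name)
```

Any software that cannot position such marks over or under the base letter will render pointed Hebrew incorrectly, whatever the font.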

Strategy for the Westminster Hebrew Institute

For now, it is clear that we cannot move to UTF-8 for maintenance, archiving and development of our Hebrew Bible morphology database. Instead, we will need to provide a software infrastructure to convert from Michigan encoding to UTF-8, such that we can provide UTF-8 Hebrew text on an ad hoc basis.

My recommendation is:

  1. continue to develop and maintain Morph using our current build toolchain

  2. move to an XML format that uses Michigan encoding for Hebrew text

  3. re-implement my Perl script (Michigan2UTF8) in Python, which has excellent UTF-8 support

  4. write small and efficient functions to be incorporated in our Zope applications

  5. when necessary, extract any arbitrary span of the Hebrew text from the database, up to and including the entire text at once

  6. wait until the technology catches up: we are looking at no less than five years
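
Step 3 might look roughly like the sketch below. It is not the Michigan2UTF8 script: the function names `michigan_to_unicode` and `michigan_to_utf8` and the partial consonant table are assumptions made for illustration, and every table entry should be checked against the authoritative Michigan encoding documentation before any real use. Vowel points, accents and final letter forms are deliberately left out.

```python
# Partial, illustrative table: Michigan consonant code -> Unicode letter.
# ASSUMPTION: verify every entry against the Michigan encoding specification.
CONSONANTS = {
    ")": "\u05d0",  # alef
    "B": "\u05d1",  # bet
    "G": "\u05d2",  # gimel
    "D": "\u05d3",  # dalet
    "H": "\u05d4",  # he
    "W": "\u05d5",  # vav
    "L": "\u05dc",  # lamed
    "R": "\u05e8",  # resh
    "T": "\u05ea",  # tav
}

def michigan_to_unicode(text, table=CONSONANTS):
    """Map each Michigan code in text to its Unicode character.

    Whitespace is passed through unchanged. A ValueError is raised on
    the first code the table does not cover, so gaps in the table are
    noticed rather than silently passed through.
    """
    out = []
    for pos, code in enumerate(text):
        if code.isspace():
            out.append(code)
            continue
        try:
            out.append(table[code])
        except KeyError:
            raise ValueError(
                f"no mapping for {code!r} at position {pos}"
            ) from None
    return "".join(out)

def michigan_to_utf8(text):
    """Convert Michigan-encoded text to UTF-8 bytes."""
    return michigan_to_unicode(text).encode("utf-8")
```

Keeping the table as plain data and the conversion as a small pure function would let the same code serve both a command-line build tool and the Zope applications mentioned in step 4.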


From UTF-8 and Unicode FAQ for Unix/Linux:

Not all systems can be expected to support all the advanced mechanisms of UCS, such as combining characters. Therefore, ISO 10646 specifies the following three implementation levels:

Level 1
Combining characters and Hangul Jamo characters are not supported.
[Hangul Jamo are an alternative representation of precomposed modern Hangul syllables as a sequence of consonants and vowels. They are required to fully support the Korean script including Middle Korean.]
Level 2
Like level 1, however in some scripts, a fixed list of combining characters is now allowed (e.g., for Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai and Lao). These scripts cannot be represented adequately in UCS without support for at least certain combining characters.
Level 3
All UCS characters are supported, such that, for example, mathematicians can place a tilde or an arrow (or both) on any character.

Normalization Form C (NFC): Use precomposed characters instead of combining sequences where possible, e.g. use U+00C4 ("Latin capital letter A with diaeresis") instead of U+0041 U+0308 ("Latin capital letter A", "combining diaeresis"). Also avoid deprecated characters, e.g. use U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) instead of U+212B (ANGSTROM SIGN).
NFC is the preferred form for Linux and WWW.

Full Unicode functionality with all bells and whistles (e.g. high-quality typesetting of the Arabic and Indic scripts) can only be expected from sophisticated multi-lingual word-processing packages. What Linux supports today on a broad base is far simpler and mainly aimed at replacing the old 8- and 16-bit character sets. Linux terminal emulators and command line tools usually only support a Level 1 implementation of ISO 10646-1 (no combining characters), and only scripts such as Latin, Greek, Cyrillic, Armenian, Georgian, CJK, and many scientific symbols are supported that need no further processing support. At this level, UCS support is very comparable to ISO 8859 support and the only significant difference is that we have now thousands of different characters available, that characters can be represented by multibyte sequences, and that ideographic Chinese/Japanese/Korean characters require two terminal character positions (double-width).

Level 2 support in the form of combining characters for selected scripts (in particular Thai) and Hangul Jamo is in parts also available (i.e., some fonts, terminal emulators and editors support it via simple overstriking), but precomposed characters should be preferred over combining character sequences where available. More formally, the preferred way of encoding text in Unicode under Linux should be Normalization Form C as defined in Unicode Technical Report #15.

Combining characters: The X11 specification does not support combining characters in any way. The font information lacks the data necessary to perform high-quality automatic accent placement (as it is found, for example, in all TeX fonts). Various people have experimented with implementing simplest overstriking combining characters using zero-width characters with ink on the left side of the origin, but details of how to do this exactly are unspecified (e.g., are zero-width characters allowed in CharCell and Monospaced fonts?) and this is therefore not yet widely established practice.

Several XFree86 team members are trying to work on these issues with X.Org, which is the official successor of the X Consortium and the Opengroup as the custodian of the X11 standards and the sample implementation. But things are moving rather slowly. Support for UTF8_STRING, UCS keysyms, and ISO10646-1 extensions of the core fonts will hopefully make it into R6.7.1 in 2004. With regard to the other font related problems, the solution will probably be to dump the old server-side font mechanisms entirely and use instead XFree86's new Xft. Another work-in-progress is a new Standard Type Services (ST) framework that Sun has been working on and plans to donate to XFree86 and X.org very soon.

From Thread on Combining Characters on freetype.org mailing list

FreeType doesn't handle normalization or even glyph substitutions like ligatures directly. This must be implemented on top of it.



Created by klowery
Last modified 2004-04-01 12:32 PM
 
