Status of Hebrew Fonts Under Linux
As of April 2, 2004
Kirk E. Lowery
Westminster Hebrew Institute
Current State of Affairs
- A Web search turned up a number of issues with using the SBLHebrew font under Linux:
- The central issue is about "combining characters": the diacritics are not properly rendered
- Linux does not support (from the kernel) "Level 2" or "Level 3" implementation of ISO10646-1, which includes combining characters
- X-Windows also does not support Level 2; FreeType does support OpenType fonts, but not combining characters. This must be handled by the software package, e.g., word processor, browser, editor!
- The issue may eventually be resolved by X-Windows, but not any time soon
If SBLHebrew is going to be used under Linux, programs such as OpenOffice, Emacs and Firefox (Mozilla) will themselves have to provide the support for combining characters, and this is not likely.
Strategy for the Westminster Hebrew Institute
For now, it is clear that we cannot move to UTF-8 for maintenance, archiving and development of our Hebrew Bible morphology database. Instead, we will need to provide a software infrastructure to convert from Michigan encoding to UTF-8, such that we can provide UTF-8 Hebrew text on an ad hoc basis.
My recommendation is:
- continue to develop and maintain Morph using our current build toolchain
- move to an XML format that uses Michigan encoding for Hebrew text
- re-implement my Perl script (Michigan2UTF8) in Python, which has excellent UTF-8 support
- write small and efficient functions to be incorporated in our Zope applications
- when necessary, extract the Hebrew text from the database to create any arbitrary span of the Hebrew text itself, including the entire text at once
- wait until the technology catches up: we are looking at no less than 5 years
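The Python re-implementation recommended above might start from something like the following. This is only a minimal sketch, not the existing Michigan2UTF8 script: the function names are hypothetical, the table is a partial mapping based on the usual Michigan-Claremont conventions and must be checked against our own tables, and multi-character codes (for example the composite shewas), accents and punctuation are left out.

```python
# Partial Michigan-Claremont to Unicode table (assumption: check every
# entry against the project's own Michigan2UTF8 tables before use).
CONSONANTS = {
    ")": "\u05D0", "B": "\u05D1", "G": "\u05D2", "D": "\u05D3",
    "H": "\u05D4", "W": "\u05D5", "Z": "\u05D6", "X": "\u05D7",
    "+": "\u05D8", "Y": "\u05D9", "K": "\u05DB", "L": "\u05DC",
    "M": "\u05DE", "N": "\u05E0", "S": "\u05E1", "(": "\u05E2",
    "P": "\u05E4", "C": "\u05E6", "Q": "\u05E7", "R": "\u05E8",
    "#": "\u05E9",          # shin/sin without a dot
    "&": "\u05E9\u05C2",    # sin  = letter + combining sin dot
    "$": "\u05E9\u05C1",    # shin = letter + combining shin dot
    "T": "\u05EA",
}
# Word-final letter forms, chosen by position rather than by a separate code.
FINAL_FORMS = {"K": "\u05DA", "M": "\u05DD", "N": "\u05DF",
               "P": "\u05E3", "C": "\u05E5"}
# Vowel points and dagesh: all of these are combining characters in Unicode.
POINTS = {
    ":": "\u05B0", "I": "\u05B4", '"': "\u05B5", "E": "\u05B6",
    "A": "\u05B7", "F": "\u05B8", "O": "\u05B9", "U": "\u05BB",
    ".": "\u05BC",
}

def convert_word(word):
    """Convert one Michigan-encoded word to a Unicode string."""
    # Index of the last consonant, which may need its final form.
    last = max((i for i, ch in enumerate(word) if ch in CONSONANTS), default=-1)
    out = []
    for i, ch in enumerate(word):
        if i == last and ch in FINAL_FORMS:
            out.append(FINAL_FORMS[ch])
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        else:
            # Points are mapped; anything unknown passes through unchanged.
            out.append(POINTS.get(ch, ch))
    return "".join(out)

def michigan_to_utf8(text):
    """Convert space-separated Michigan-encoded words to UTF-8 bytes."""
    return " ".join(convert_word(w) for w in text.split(" ")).encode("utf-8")
```

Calling `michigan_to_utf8('B.:R")$IYT')` would return the UTF-8 bytes for the first word of Genesis. Note that every pointed letter comes out as a base letter followed by one or more combining characters, which is exactly the case the rendering software described above fails to handle.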
From UTF-8 and Unicode FAQ for Unix/Linux:
Not all systems can be expected to support all the advanced mechanisms of UCS, such as combining characters. Therefore, ISO 10646 specifies the following three implementation levels:
- Level 1
- Combining characters and Hangul Jamo characters are not supported. [Hangul Jamo are an alternative representation of precomposed modern Hangul syllables as a sequence of consonants and vowels. They are required to fully support the Korean script including Middle Korean.]
- Level 2
- Like level 1, however in some scripts, a fixed list of combining characters is now allowed (e.g., for Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai and Lao). These scripts cannot be represented adequately in UCS without support for at least certain combining characters.
- Level 3
- All UCS characters are supported, such that, for example, mathematicians can place a tilde or an arrow (or both) on any character.
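As a rough illustration of what these levels mean for our text: a string needs more than Level 1 as soon as it contains a combining character. The sketch below uses Python's standard `unicodedata` module; the helper name is hypothetical, and it ignores the Hangul Jamo part of the definition.

```python
import unicodedata

def has_combining_marks(text):
    """Return True if any character is a combining mark (category Mn, Mc or Me)."""
    return any(unicodedata.category(ch).startswith("M") for ch in text)

consonants_only = "\u05D1\u05E8\u05D0"            # bet, resh, alef
pointed = "\u05D1\u05BC\u05B8\u05E8\u05B8\u05D0"  # same letters with dagesh and qamats

print(has_combining_marks(consonants_only))  # False
print(has_combining_marks(pointed))          # True
```

Unpointed Hebrew therefore fits within Level 1, while the pointed text of the Hebrew Bible does not.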
…
Normalization Form C (NFC): Use precomposed characters instead of combining sequences where possible, e.g. use U+00C4 ("Latin capital letter A with diaeresis") instead of U+0041 U+0308 ("Latin capital letter A", "combining diaeresis"). Also avoid deprecated characters, e.g. use U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) instead of U+212B (ANGSTROM SIGN).
NFC is the preferred form for Linux and WWW.
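The NFC examples in this quotation can be reproduced with Python's standard `unicodedata` module. The same check shows why NFC does not remove the problem for pointed Hebrew: a Hebrew letter plus a vowel point has no precomposed form, and the Hebrew presentation forms such as U+FB31 are excluded from composition, so the combining characters remain. A small sketch, with an illustrative helper name:

```python
import unicodedata

def nfc(text):
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

# Latin: the combining sequence collapses to one precomposed character.
assert nfc("\u0041\u0308") == "\u00C4"
# The deprecated ANGSTROM SIGN is replaced by U+00C5.
assert nfc("\u212B") == "\u00C5"

# Hebrew: bet + sheva has no precomposed form, so NFC leaves two code points.
bet_sheva = "\u05D1\u05B0"
print(len(nfc(bet_sheva)))              # 2
# Even the presentation form "bet with dagesh" is decomposed by NFC.
print(nfc("\uFB31") == "\u05D1\u05BC")  # True
```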
…
Full Unicode functionality with all bells and whistles (e.g. high-quality typesetting of the Arabic and Indic scripts) can only be expected from sophisticated multi-lingual word-processing packages. What Linux supports today on a broad base is far simpler and mainly aimed at replacing the old 8- and 16-bit character sets. Linux terminal emulators and command line tools usually only support a Level 1 implementation of ISO 10646-1 (no combining characters), and only scripts such as Latin, Greek, Cyrillic, Armenian, Georgian, CJK, and many scientific symbols are supported that need no further processing support. At this level, UCS support is very comparable to ISO 8859 support and the only significant difference is that we have now thousands of different characters available, that characters can be represented by multibyte sequences, and that ideographic Chinese/Japanese/Korean characters require two terminal character positions (double-width).
Level 2 support in the form of combining characters for selected scripts (in particular Thai) and Hangul Jamo is in parts also available (i.e., some fonts, terminal emulators and editors support it via simple overstriking), but precomposed characters should be preferred over combining character sequences where available. More formally, the preferred way of encoding text in Unicode under Linux should be Normalization Form C as defined in Unicode Technical Report #15.
…
Combining characters: The X11 specification does not support combining characters in any way. The font information lacks the data necessary to perform high-quality automatic accent placement (as it is found, for example, in all TeX fonts). Various people have experimented with implementing simplest overstriking combining characters using zero-width characters with ink on the left side of the origin, but details of how to do this exactly are unspecified (e.g., are zero-width characters allowed in CharCell and Monospaced fonts?) and this is therefore not yet widely established practice.
…
Several XFree86 team members are trying to work on these issues with X.Org, which is the official successor of the X Consortium and the Opengroup as the custodian of the X11 standards and the sample implementation. But things are moving rather slowly. Support for UTF8_STRING, UCS keysyms, and ISO10646-1 extensions of the core fonts will hopefully make it into R6.7.1 in 2004. With regard to the other font related problems, the solution will probably be to dump the old server-side font mechanisms entirely and use instead XFree86's new Xft. Another work-in-progress is a new Standard Type Services (ST) framework that Sun has been working on and plans to donate to XFree86 and X.org very soon.
…
From Thread on Combining Characters on freetype.org mailing list
FreeType doesn't handle normalization or even glyph substitutions like ligatures directly. This must be implemented on top of it.