Now uses terminology from the Unicode standard. No more talk of
characters, for example.

Normalization forms NFKD and NFD are supported for the Tibetan Unicode
range.  I don't like either, actually.  I've tested NFKD, but I've not yet
committed the tests.
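
For readers unfamiliar with the two forms, here is a minimal sketch of how NFD and NFKD differ on a few Tibetan codepoints. It is not the code in this commit: it assumes a modern JDK and uses the standard `java.text.Normalizer`, and the class name `TibetanNormalizationSketch` and its helper methods are hypothetical names chosen for illustration.

```java
import java.text.Normalizer;

public class TibetanNormalizationSketch {

    /** Canonical decomposition (NFD) via the JDK's built-in normalizer. */
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Compatibility decomposition (NFKD) via the JDK's built-in normalizer. */
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    /** Renders a string as space-separated U+XXXX codepoints for inspection. */
    static String codepoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = {
            "\u0F43", // TIBETAN LETTER GHA: has a canonical decomposition
            "\u0F0C", // TIBETAN MARK DELIMITER TSHEG BSTAR: compatibility decomposition only
            "\u0F00"  // TIBETAN SYLLABLE OM: no decomposition in the standard
        };
        for (String s : samples) {
            System.out.println(codepoints(s)
                + "  NFD: " + codepoints(nfd(s))
                + "  NFKD: " + codepoints(nfkd(s)));
        }
    }
}
```

The third sample is there because the standard gives U+0F00 no decomposition, so neither form turns it into U+0F68 U+0F7E U+0F7C; any code that treats the two as equal has to handle that equivalence itself.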
dchandler 2002-12-15 03:35:24 +00:00
parent 3199ff7926
commit a42347b224
7 changed files with 210 additions and 136 deletions

@ -23,26 +23,37 @@ package org.thdl.tib.text.tshegbar;
*
* <p> First, some terminology.</p>
*
* <ul> <li>When we talk about a <i>glyph</i>, we mean a picture
* found in a font. A single glyph may have one or more
* representations by sequences of Unicode characters, or it may not
* be representable because it is only part of one Unicode character
* or pictures a nonstandard character.</li> <li>When we talk about a
* <i>stack</i>, we mean either a number (or half-number), a mark or
* sign, a bit of punctuation, or a consonant stack.</li> <li>A
* <i>consonant stack</i> is one or more consonants stacked
* vertically, plus an optional vocalic modification such as an
* anusvara (DLC what do we call a bindu?) or visarga, plus zero or
* more signs like <code>&#92;u0F35</code>, plus an optional a-chung
* (<code>&#92;u0F71</code>), plus an optional simple vowel.</li> <li>By
* <i>simple vowel</i>, we mean any of <code>&#92;u0F72</code>,
* <code>&#92;u0F74</code>, <code>&#92;u0F7A</code>, <code>&#92;u0F7B</code>,
* <ul> <li>When we talk about a <i>grapheme cluster</i> (or
* <i>grcl</i>), we mean what the Unicode standard calls a "grapheme
* cluster". Most glyphs (i.e., pictures) found in a font are
* grapheme clusters, but the picture corresponding to the Unicode
* codepoint <code>&#92;u0F74</code> is not a grapheme cluster. In
* addition, in English, many fonts have a single glyph (a
* "ligature") for the combination of two grapheme clusters,
* e.g. "fi". A single grapheme cluster may have one or more
* representations by sequences of Unicode codepoints, or it may not
* be representable because it is only part of one Unicode codepoint
* or pictures a nonstandard character.</li> <li>We will attempt to
* avoid using the word "character", as it sometimes refers to a
* codepoint and sometimes refers to a glyph in a font and yet other
* times refers to a grapheme cluster.</li> <li>We'll try to avoid
* using the word "stack" because it sometimes refers to a sequence
* of stacked Tibetan consonants and sometimes refers to an entire
* grapheme cluster.</li> <li>A <i>Tibetan stack</i> is one or
* more consonants stacked vertically, plus an optional vocalic
* modification such as an anusvara (DLC what do we call a bindu?) or
* visarga, plus zero or more signs like <code>&#92;u0F35</code>,
* plus an optional a-chung (<code>&#92;u0F71</code>), plus an
* optional simple vowel.</li> <li>By <i>simple vowel</i>, we mean
* any of <code>&#92;u0F72</code>, <code>&#92;u0F74</code>,
* <code>&#92;u0F7A</code>, <code>&#92;u0F7B</code>,
* <code>&#92;u0F7C</code>, <code>&#92;u0F7D</code>, or
* <code>&#92;u0F80</code>.</li> </ul>
*
* (Note: The string <code>"&#92;u0F68&#92;u0F7E&#92;u0F7C"</code> seems to equal
* <code>"&#92;u0F00"</code>, though the Unicode standard does not
* indicate that it is so. This code treats it that way.)</p>
* <p>(Note: The string <code>"&#92;u0F68&#92;u0F7E&#92;u0F7C"</code>
* seems to equal <code>"&#92;u0F00"</code>, though the Unicode
* standard does not indicate that it is so. This code treats it
* that way.)</p>
*
* <p> This class allows for invalid tsheg bars, like those
* containing more than one prefix, more than two suffixes, an
@ -55,10 +66,10 @@ package org.thdl.tib.text.tshegbar;
* and for invalid tsheg bars. Note that correctness is at the tsheg
* bar level only; it may be grammatically incorrect to concatenate
* two valid tsheg bars. Some subclasses can be represented in
* Unicode, but others contain nonstandard glyphs and cannot be.</p>
* Unicode, but others contain nonstandard glyphs/characters and
* cannot be.</p>
*
* @author David Chandler
*/
* @author David Chandler */
public abstract class TshegBar implements UnicodeReadyThunk {
/** Returns true, as we consider a transliteration in the Tibetan
* alphabet of a non-Tibetan language, say Chinese, as being