Now uses terminology from the Unicode standard. No more talk of characters, for example.

Normalization forms NFKD and NFD are supported for the Tibetan Unicode range. I don't like either, actually. I've tested NFKD, but I've not yet committed the tests.
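
For readers unfamiliar with the two normalization forms named above, here is a minimal sketch of what NFD and NFKD do to a few Tibetan codepoints. It assumes the JDK's java.text.Normalizer (added in Java 6, so it postdates this commit) as a stand-in; it is not the normalization code in this repository, and the class name NormalizationSketch and its helper methods are hypothetical.

```java
import java.text.Normalizer;

// Illustration only: uses the JDK normalizer, not this project's own code.
public class NormalizationSketch {
    /** Canonical decomposition (NFD). */
    public static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Compatibility decomposition (NFKD). */
    public static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    /** Formats each UTF-16 code unit as U+XXXX, separated by spaces. */
    public static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // TIBETAN LETTER GHA has a canonical decomposition.
        System.out.println(hex(nfd("\u0F43")));   // U+0F42 U+0FB7
        // TIBETAN MARK DELIMITER TSHEG BSTAR has only a compatibility mapping.
        System.out.println(hex(nfd("\u0F0C")));   // U+0F0C
        System.out.println(hex(nfkd("\u0F0C")));  // U+0F0B
        // TIBETAN SYLLABLE OM has no decomposition in the standard.
        System.out.println(hex(nfd("\u0F00")));   // U+0F00
    }
}
```

The last case is why the Javadoc note in the diff below matters: the standard leaves U+0F00 alone under both forms, so treating it as equal to a three-codepoint sequence is this code's own choice and not something a standard normalizer does.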
2002-12-15 03:35:24 +00:00 · a42347b224
commit a42347b224
parent 3199ff7926
7 changed files with 210 additions and 136 deletions
--- a/source/org/thdl/tib/text/tshegbar/TshegBar.java
+++ b/source/org/thdl/tib/text/tshegbar/TshegBar.java
@@ -23,26 +23,37 @@ package org.thdl.tib.text.tshegbar;
 *
 *  <p> First, some terminology.</p>
 *
- *  <ul> <li>When we talk about a <i>glyph</i>, we mean a picture
- *  found in a font.  A single glyph may have one or more
- *  representations by sequences of Unicode characters, or it may not
- *  be representable becuase it is only part of one Unicode character
- *  or pictures a nonstandard character.</li> <li>When we talk about a
- *  <i>stack</i>, we mean either a number (or half-number), a mark or
- *  sign, a bit of punctuation, or a consonant stack.</li> <li>A
- *  <i>consonant stack</i> is or one or more consonants stacked
- *  vertically, plus an optional vocalic modification such as an
- *  anusvara (DLC what do we call a bindu?) or visarga, plus zero or
- *  more signs like <code>&#92;u0F35</code>, plus an optional a-chung
- *  (<code>&#92;u0F71</code>), plus an optional simple vowel.</li> <li>By
- *  <i>simple vowel</i>, we mean any of <code>&#92;u0F72</code>,
- *  <code>&#92;u0F74</code>, <code>&#92;u0F7A</code>, <code>&#92;u0F7B</code>,
+ *  <ul> <li>When we talk about a <i>grapheme cluster</i> (or
+ *  <i>grcl</i>), we mean what the Unicode standard calls a "grapheme
+ *  cluster".  Most glyphs (i.e., pictures) found in a font are
+ *  grapheme clusters, but the picture corresponding to the Unicode
+ *  codepoint <code>&#92;u0F74</code> is not a grapheme cluster.  In
+ *  addition, in English, many fonts have a single glyph (a
+ *  "ligature") for the combination of two grapheme clusters,
+ *  e.g. "fi".  A single grapheme cluster may have one or more
+ *  representations by sequences of Unicode codepoints, or it may not
+ *  be representable because it is only part of one Unicode codepoint
+ *  or pictures a nonstandard character.</li> <li>We will attempt to
+ *  avoid using the word "character", as it sometimes refers to a
+ *  codepoint and sometimes refers to a glyph in a font and yet other
+ *  times refers to a grapheme cluster.</li> <li>We'll try to avoid
+ *  using the word "stack" because it sometimes refers to a sequence
+ *  of stacked Tibetan consonants and sometimes refers to an entire
+ *  grapheme cluster.</li> <li>A <i>Tibetan stack</i> is one or
+ *  more consonants stacked vertically, plus an optional vocalic
+ *  modification such as an anusvara (DLC what do we call a bindu?) or
+ *  visarga, plus zero or more signs like <code>&#92;u0F35</code>,
+ *  plus an optional a-chung (<code>&#92;u0F71</code>), plus an
+ *  optional simple vowel.</li> <li>By <i>simple vowel</i>, we mean
+ *  any of <code>&#92;u0F72</code>, <code>&#92;u0F74</code>,
+ *  <code>&#92;u0F7A</code>, <code>&#92;u0F7B</code>,
 *  <code>&#92;u0F7C</code>, <code>&#92;u0F7D</code>, or
 *  <code>&#92;u0F80</code>.</li> </ul>
 *
- *  (Note: The string <code>"&#92;u0F68&#92;u0F7E&#92;u0F7C"</code> seems to equal
- *  <code>"&#92;u0F00"</code>, though the Unicode standard does not
- *  indicate that it is so.  This code treats it that way.)</p>
+ *  <p>(Note: The string <code>"&#92;u0F68&#92;u0F7E&#92;u0F7C"</code>
+ *  seems to equal <code>"&#92;u0F00"</code>, though the Unicode
+ *  standard does not indicate that it is so.  This code treats it
+ *  that way.)</p>
 *
 *  <p> This class allows for invalid tsheg bars, like those
 *  containing more than one prefix, more than two suffixes, an
@@ -55,10 +66,10 @@ package org.thdl.tib.text.tshegbar;
 *  and for invalid tsheg bars.  Note that correctness is at the tsheg
 *  bar level only; it may be grammatically incorrect to concatenate
 *  two valid tsheg bars.  Some subclasses can be represented in
- *  Unicode, but others contain nonstandard glyphs and cannot be.</p>
+ *  Unicode, but others contain nonstandard glyphs/characters and
+ *  cannot be.</p>
 *
- *  @author David Chandler
- */
+ *  @author David Chandler */
 public abstract class TshegBar implements UnicodeReadyThunk {
    /** Returns true, as we consider a transliteration in the Tibetan
     *  alphabet of a non-Tibetan language, say Chinese, as being