Now uses terminology from the Unicode standard. No more talk of
characters, for example. Normalization forms NFKD and NFD are supported for the Tibetan Unicode range. I don't like either, actually. I've tested NFKD, but I've not yet committed the tests.
parent 3199ff7926
commit a42347b224
7 changed files with 210 additions and 136 deletions
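For readers unfamiliar with the two normalization forms named in the commit message, here is a minimal sketch of how NFD and NFKD differ on Tibetan codepoints. It assumes a modern JDK and uses the built-in `java.text.Normalizer`; this is not the commit's own implementation. The class name `TibetanNormalizationSketch` and the helper `hex` are hypothetical, introduced only for illustration.

```java
import java.text.Normalizer;

// Hypothetical illustration class; not part of the commit.
public final class TibetanNormalizationSketch {

    /** Canonical decomposition (NFD) via the JDK's built-in normalizer. */
    public static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Compatibility decomposition (NFKD) via the JDK's built-in normalizer. */
    public static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    /** Renders each UTF-16 code unit as U+XXXX so output is readable without a Tibetan font. */
    public static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0F43 has a canonical decomposition, so NFD already splits it.
        System.out.println(hex(nfd("\u0F43")));   // U+0F42 U+0FB7
        // U+0F0C has only a compatibility decomposition: NFD leaves it alone,
        // while NFKD maps it to the ordinary tsheg U+0F0B.
        System.out.println(hex(nfd("\u0F0C")));   // U+0F0C
        System.out.println(hex(nfkd("\u0F0C")));  // U+0F0B
    }
}
```

NFKD applies every canonical mapping plus the compatibility mappings, so it discards distinctions (such as the non-breaking tsheg) that NFD preserves.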
@@ -23,26 +23,37 @@ package org.thdl.tib.text.tshegbar;
  *
  * <p> First, some terminology.</p>
  *
- * <ul> <li>When we talk about a <i>glyph</i>, we mean a picture
- * found in a font. A single glyph may have one or more
- * representations by sequences of Unicode characters, or it may not
- * be representable because it is only part of one Unicode character
- * or pictures a nonstandard character.</li> <li>When we talk about a
- * <i>stack</i>, we mean either a number (or half-number), a mark or
- * sign, a bit of punctuation, or a consonant stack.</li> <li>A
- * <i>consonant stack</i> is one or more consonants stacked
- * vertically, plus an optional vocalic modification such as an
- * anusvara (DLC what do we call a bindu?) or visarga, plus zero or
- * more signs like <code>\u0F35</code>, plus an optional a-chung
- * (<code>\u0F71</code>), plus an optional simple vowel.</li> <li>By
- * <i>simple vowel</i>, we mean any of <code>\u0F72</code>,
- * <code>\u0F74</code>, <code>\u0F7A</code>, <code>\u0F7B</code>,
+ * <ul> <li>When we talk about a <i>grapheme cluster</i> (or
+ * <i>grcl</i>), we mean what the Unicode standard calls a "grapheme
+ * cluster". Most glyphs (i.e., pictures) found in a font are
+ * grapheme clusters, but the picture corresponding to the Unicode
+ * codepoint <code>\u0F74</code> is not a grapheme cluster. In
+ * addition, in English, many fonts have a single glyph (a
+ * "ligature") for the combination of two grapheme clusters,
+ * e.g. "fi". A single grapheme cluster may have one or more
+ * representations by sequences of Unicode codepoints, or it may not
+ * be representable because it is only part of one Unicode codepoint
+ * or pictures a nonstandard character.</li> <li>We will attempt to
+ * avoid using the word "character", as it sometimes refers to a
+ * codepoint and sometimes refers to a glyph in a font and yet other
+ * times refers to a grapheme cluster.</li> <li>We'll try to avoid
+ * using the word "stack" because it sometimes refers to a sequence
+ * of stacked Tibetan consonants and sometimes refers to an entire
+ * grapheme cluster.</li> <li>A <i>Tibetan stack</i> is one or
+ * more consonants stacked vertically, plus an optional vocalic
+ * modification such as an anusvara (DLC what do we call a bindu?) or
+ * visarga, plus zero or more signs like <code>\u0F35</code>,
+ * plus an optional a-chung (<code>\u0F71</code>), plus an
+ * optional simple vowel.</li> <li>By <i>simple vowel</i>, we mean
+ * any of <code>\u0F72</code>, <code>\u0F74</code>,
+ * <code>\u0F7A</code>, <code>\u0F7B</code>,
  * <code>\u0F7C</code>, <code>\u0F7D</code>, or
  * <code>\u0F80</code>.</li> </ul>
  *
- * (Note: The string <code>"\u0F68\u0F7E\u0F7C"</code> seems to equal
- * <code>"\u0F00"</code>, though the Unicode standard does not
- * indicate that it is so. This code treats it that way.)</p>
+ * <p>(Note: The string <code>"\u0F68\u0F7E\u0F7C"</code>
+ * seems to equal <code>"\u0F00"</code>, though the Unicode
+ * standard does not indicate that it is so. This code treats it
+ * that way.)</p>
  *
  * <p> This class allows for invalid tsheg bars, like those
  * containing more than one prefix, more than two suffixes, an
@@ -55,10 +66,10 @@ package org.thdl.tib.text.tshegbar;
  * and for invalid tsheg bars. Note that correctness is at the tsheg
  * bar level only; it may be grammatically incorrect to concatenate
  * two valid tsheg bars. Some subclasses can be represented in
- * Unicode, but others contain nonstandard glyphs and cannot be.</p>
+ * Unicode, but others contain nonstandard glyphs/characters and
+ * cannot be.</p>
  *
- * @author David Chandler
- */
+ * @author David Chandler */
 public abstract class TshegBar implements UnicodeReadyThunk {
     /** Returns true, as we consider a transliteration in the Tibetan
      * alphabet of a non-Tibetan language, say Chinese, as being
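The terminology in the Javadoc above lends itself to a small illustration. The following is a rough sketch, not code from this commit: it encodes the list of simple vowels exactly as the comment enumerates them, and treats `"\u0F00"` as interchangeable with `"\u0F68\u0F7E\u0F7C"`, which the comment says the code does even though the Unicode standard defines no such mapping. The class `TibetanTerminologySketch` and its methods are hypothetical names chosen for this example.

```java
// Hypothetical illustration class; names are not taken from the commit.
public final class TibetanTerminologySketch {

    /** The simple vowels as enumerated in the Javadoc. */
    private static final String SIMPLE_VOWELS =
            "\u0F72\u0F74\u0F7A\u0F7B\u0F7C\u0F7D\u0F80";

    /** The sequence the Javadoc treats as equal to U+0F00. */
    public static final String OM_SEQUENCE = "\u0F68\u0F7E\u0F7C";

    /** Returns true if c is one of the simple vowels listed in the Javadoc. */
    public static boolean isSimpleVowel(char c) {
        return SIMPLE_VOWELS.indexOf(c) >= 0;
    }

    /** Replaces every U+0F00 with the three-codepoint sequence. */
    public static String expandOm(String s) {
        return s.replace("\u0F00", OM_SEQUENCE);
    }

    /**
     * Compares two strings after expanding U+0F00 on both sides.
     * This is a project-specific convention, not a Unicode equivalence.
     */
    public static boolean equalTreatingOmAsSequence(String a, String b) {
        return expandOm(a).equals(expandOm(b));
    }

    public static void main(String[] args) {
        System.out.println(isSimpleVowel('\u0F74'));                          // true
        System.out.println(isSimpleVowel('\u0F71'));                          // false: a-chung is listed separately
        System.out.println(equalTreatingOmAsSequence("\u0F00", OM_SEQUENCE)); // true
    }
}
```

Expanding to the longer sequence rather than contracting to `U+0F00` is an arbitrary choice made for this sketch; either direction gives the same comparison result.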