Jskad/source/org/thdl/tib/text/tshegbar/TshegBar.java

/*
The contents of this file are subject to the THDL Open Community License
Version 1.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License on the THDL web site 
(http://www.thdl.org/).

Software distributed under the License is distributed on an "AS IS" basis, 
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the 
License for the specific terms governing rights and limitations under the 
License. 

The Initial Developer of this software is the Tibetan and Himalayan Digital
Library (THDL). Portions created by the THDL are Copyright 2001 THDL.
All Rights Reserved. 

Contributor(s): ______________________________________.
*/

package org.thdl.tib.text.tshegbar;

/** A TshegBar (pronounced <i>tsek bar</i>) is roughly a Tibetan
 *  syllable.  In truth, it is the stuff between two <i>tsek</i>s.
 *
 *  <p> First, some terminology.</p>
 *
 *  <ul> <li>When we talk about a <i>grapheme cluster</i> (or
 *  <i>grcl</i>), we mean what the Unicode standard calls a "grapheme
 *  cluster".  Most glyphs (i.e., pictures) found in a font are
 *  grapheme clusters, but the picture corresponding to the Unicode
 *  codepoint <code>&#92;u0F74</code> is not a grapheme cluster.  In
 *  addition, in English, many fonts have a single glyph (a
 *  "ligature") for the combination of two grapheme clusters,
 *  e.g. "fi".  A single grapheme cluster may have one or more
 *  representations by sequences of Unicode codepoints, or it may not
 *  be representable because it is only part of one Unicode codepoint
 *  or pictures a nonstandard character.</li> <li>We will attempt to
 *  avoid using the word "character", as it sometimes refers to a
 *  codepoint and sometimes refers to a glyph in a font and yet other
 *  times refers to a grapheme cluster.</li> <li>We'll try to avoid
 *  using the word "stack" because it sometimes refers to a sequence
 *  of stacked Tibetan consonants and sometimes refers to an entire
 *  grapheme cluster.</li> <li>A <i>Tibetan stack</i> is one or
 *  more consonants stacked vertically, plus an optional vocalic
 *  modification such as an anusvara (DLC what do we call a bindu?) or
 *  visarga, plus zero or more signs like <code>&#92;u0F35</code>,
 *  plus an optional a-chung (<code>&#92;u0F71</code>), plus an
 *  optional simple vowel.</li> <li>By <i>simple vowel</i>, we mean
 *  any of <code>&#92;u0F72</code>, <code>&#92;u0F74</code>,
 *  <code>&#92;u0F7A</code>, <code>&#92;u0F7B</code>,
 *  <code>&#92;u0F7C</code>, <code>&#92;u0F7D</code>, or
 *  <code>&#92;u0F80</code>.</li> </ul>
 *
 *  <p>(Note: The string <code>"&#92;u0F68&#92;u0F7E&#92;u0F7C"</code>
 *  seems to equal <code>"&#92;u0F00"</code>, though the Unicode
 *  standard does not indicate that it is so.  This code treats it
 *  that way.)</p>
 *
 *  <p> This class allows for invalid tsheg bars, like those
 *  containing more than one prefix, more than two suffixes, an
 *  invalid postsuffix (secondary suffix), or more than one consonant
 *  stack (excluding the special case of what we call in THDL Extended
 *  Wylie "'i", which is technically a consonant stack but is used in
 *  Tibetan like a suffix).</p>
 *
 *  <p>Subclasses exist for valid, grammatically correct tsheg bars,
 *  and for invalid tsheg bars.  Note that correctness is at the tsheg
 *  bar level only; it may be grammatically incorrect to concatenate
 *  two valid tsheg bars.  Some subclasses can be represented in
 *  Unicode, but others contain nonstandard glyphs/characters and
 *  cannot be.</p>
 *
 *  @author David Chandler */
public abstract class TshegBar implements UnicodeReadyThunk {
    /** Returns true, as we consider a transliteration in the Tibetan
     *  alphabet of a non-Tibetan language, say Chinese, as being
     *  Tibetan.
     *  @return true */
    public boolean isTibetan() { return true; }
}
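
The class comment describes a tsheg bar as the stuff between two tseks and lists the codepoints it calls simple vowels. The following is a minimal sketch of both ideas, not code from this package. It assumes the only delimiter is U+0F0B (the intersyllabic tsheg) and ignores other punctuation such as the shad. `TshegBarSplitSketch`, `split`, and `isSimpleVowel` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of org.thdl.tib.text.tshegbar. */
final class TshegBarSplitSketch {
    /** U+0F0B, the intersyllabic tsheg (assumed here to be the only delimiter). */
    static final char TSHEG = '\u0F0B';

    /** The simple vowels listed in the TshegBar class comment. */
    static final String SIMPLE_VOWELS =
        "\u0F72\u0F74\u0F7A\u0F7B\u0F7C\u0F7D\u0F80";

    /** Returns true if c is one of the simple vowels named in the class comment. */
    static boolean isSimpleVowel(char c) {
        return SIMPLE_VOWELS.indexOf(c) >= 0;
    }

    /** Splits text at each tsheg and drops empty pieces. */
    static List<String> split(String text) {
        List<String> bars = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == TSHEG) {
                if (i > start) {
                    bars.add(text.substring(start, i));
                }
                start = i + 1;
            }
        }
        if (start < text.length()) {
            bars.add(text.substring(start));
        }
        return bars;
    }

    public static void main(String[] args) {
        // "bod" and "skad", each followed by a tsheg.
        String text = "\u0F56\u0F7C\u0F51\u0F0B\u0F66\u0F90\u0F51\u0F0B";
        for (String bar : split(text)) {
            StringBuilder hex = new StringBuilder();
            for (char c : bar.toCharArray()) {
                hex.append(String.format("U+%04X ", (int) c));
            }
            System.out.println(hex.toString().trim());
        }
    }
}
```

A splitter like this knows nothing about prefixes, suffixes, or stacks, so it cannot tell a valid tsheg bar from an invalid one; that distinction is what the subclasses described in the class comment are for.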
Commit history for this file:

2002-12-09 01:02:23 +00:00
This commit is for my benefit only; these classes are not ready for prime time, and the build system is not yet aware of them. I'm adding some classes for representing legal tsheg-bars (syllables, for the most part) in Unicode. These classes were designed bottom-up (OK, OK -- they weren't designed designed, but I had to write down everything I knew about Tibetan syntax somewhere). The classes are aware of extended wylie. I doubt the Javadocs work yet, and I'm still testing (and am not committing my testing code with these as it is not yet ready). Next on my list--fix these up to reflect my new awareness of suffix particles (like le'u'i'o) and add classes to support syntactically incorrect Unicode sequences. Then add a UnicodeReader, and we've got the back end of a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's Worldscript or FreeType Layout or Omega's OTPs). A top-down design would not have included LegalTshegBar. But now that my itch has been scratched, potential uses are lingering about. For example, it would be nice to scan some input and break it into LegalTshegBars, punctuation/marks/signs, and illegal stacks. Then we could alert the client of the illegality, its precise form, and its precise location. The real system for turning a Unicode stream into an internal representation suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be aware of Tibetan syntax. But to make the very best conversion from Unicode to, e.g., EWTS, it is necessary to know that gaskad is better represented as gskad, but that jaskad is not the same as jskad.

2002-12-09 02:35:39 +00:00
So that Unicode escape sequences appear correctly in javadocs.

2002-12-15 03:35:24 +00:00
Now uses terminology from the Unicode standard. No more talk of characters, for example. Normalization forms NFKD and NFD are supported for the Tibetan Unicode range. I don't like either, actually. I've tested NFKD, but I've not yet committed the tests.

2002-12-15 06:57:32 +00:00
Extended Wylie is referred to as THDL Extended Wylie or THDL Wylie because a Japanese scholar has an "Extended Wylie" also. NFKD and NFD have a new brother, NFTHDL. I wish there weren't a need, but as my yet-to-be-put-into-CVS break-unicode-into-grapheme-clusters code demonstrates, the-need-is-there. forgive-me for the hyphens, it's late.
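
The commit notes mention NFD and NFKD support for the Tibetan range, and the class comment says the code treats U+0F68 U+0F7E U+0F7C as equal to U+0F00 even though the Unicode standard does not. The sketch below shows the difference using the JDK's `java.text.Normalizer`, which is not what this package uses (it predates that API). `TibetanNormalizationSketch` and `expandOm` are hypothetical names, and `expandOm` is only a stand-in for the project-specific treatment; it is not claimed to match NFTHDL.

```java
import java.text.Normalizer;

/** Hypothetical illustration only; not part of org.thdl.tib.text.tshegbar. */
final class TibetanNormalizationSketch {
    /** Standard canonical decomposition (NFD) via the JDK. */
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Standard compatibility decomposition (NFKD) via the JDK. */
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    /**
     * Assumed project-specific step: rewrite U+0F00 as the three-codepoint
     * sequence given in the TshegBar class comment. Unicode defines no
     * decomposition for U+0F00, so NFD and NFKD never do this.
     */
    static String expandOm(String s) {
        return s.replace("\u0F00", "\u0F68\u0F7E\u0F7C");
    }

    /** Formats each UTF-16 code unit as U+XXXX, separated by spaces. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0F73 has a canonical decomposition to U+0F71 U+0F72.
        System.out.println(hex(nfd("\u0F73")));
        // U+0F00 has no decomposition mapping, so NFD leaves it unchanged.
        System.out.println(hex(nfd("\u0F00")));
        // The extra, non-standard step expands it.
        System.out.println(hex(expandOm("\u0F00")));
    }
}
```

Keeping the U+0F00 expansion as a separate step from the standard forms makes it clear which equivalences come from Unicode and which are a choice made by this code.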