This commit is for my benefit only; these classes are not ready for prime time,
and the build system is not yet aware of them.

I'm adding some classes for representing legal tsheg-bars (syllables, for the
most part) in Unicode.  These classes were designed bottom-up (OK, OK --
they weren't designed designed, but I had to write down everything I knew
about Tibetan syntax somewhere).  The classes are aware of extended
wylie.  I doubt the Javadocs work yet, and I'm still testing (and am not
committing my testing code with these as it is not yet ready).

Next on my list: fix these up to reflect my new awareness of suffix particles
(like le'u'i'o) and add classes to support syntactically incorrect Unicode
sequences.  Then add a UnicodeReader, and we've got the back end of
a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's
Worldscript or FreeType Layout or Omega's OTPs).

A top-down design would not have included LegalTshegBar.  But now that
my itch has been scratched, potential uses are lingering about.  For example,
it would be nice to scan some input and break it into LegalTshegBars,
punctuation/marks/signs, and illegal stacks.  Then we could alert the client
of the illegality, its precise form, and its precise location.
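
A minimal sketch of the scanning idea, assuming that U+0F0B (the tsheg) is the only delimiter and that a later pass would classify each run as a LegalTshegBar, punctuation, or an illegal stack. TshegBarScanSketch and Segment are hypothetical names and are not part of this commit; the point is only that each run keeps its offset so an error can be reported at a precise location.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not part of this commit. */
final class TshegBarScanSketch {
    /** U+0F0B, the tsheg that separates tsheg bars. */
    static final char TSHEG = '\u0F0B';

    /** A run of text between tshegs, plus the offset where it began. */
    static final class Segment {
        final int offset;
        final String text;
        Segment(int offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    /** Splits input at each tsheg, skipping empty runs and recording
     *  the starting offset of every non-empty run. */
    static List<Segment> scan(String input) {
        List<Segment> out = new ArrayList<Segment>();
        int start = 0;
        for (int i = 0; i <= input.length(); i++) {
            if (i == input.length() || input.charAt(i) == TSHEG) {
                if (i > start)
                    out.add(new Segment(start, input.substring(start, i)));
                start = i + 1;
            }
        }
        return out;
    }
}
```

A caller could then validate each Segment in turn and, on failure, report segment.offset together with the offending text.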

The real system for turning a Unicode stream into an internal representation
suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be
aware of Tibetan syntax.  But to make the very best conversion from
Unicode to, e.g., EWTS, it is necessary to know that gaskad is better
represented as gskad, but that jaskad is not the same as jskad.
dchandler 2002-12-09 01:02:23 +00:00
parent 03688b6137
commit f4a16f8e9d
7 changed files with 1837 additions and 0 deletions

File diff suppressed because it is too large.


@ -0,0 +1,68 @@
/*
The contents of this file are subject to the THDL Open Community License
Version 1.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License on the THDL web site
(http://www.thdl.org/).
Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
License for the specific terms governing rights and limitations under the
License.
The Initial Developer of this software is the Tibetan and Himalayan Digital
Library (THDL). Portions created by the THDL are Copyright 2001 THDL.
All Rights Reserved.
Contributor(s): ______________________________________.
*/
package org.thdl.tib.text.tshegbar;
/** A TshegBar (pronounced <i>tsek bar</i>) is roughly a Tibetan
* syllable. In truth, it is the stuff between two <i>tsek</i>s.
*
* <p> First, some terminology.</p>
*
* <ul> <li>When we talk about a <i>glyph</i>, we mean a picture
* found in a font. A single glyph may have one or more
* representations by sequences of Unicode characters, or it may not
* be representable because it is only part of one Unicode character
* or pictures a nonstandard character.</li> <li>When we talk about a
* <i>stack</i>, we mean either a number (or half-number), a mark or
* sign, a bit of punctuation, or a consonant stack.</li> <li>A
* <i>consonant stack</i> is one or more consonants stacked
* vertically, plus an optional vocalic modification such as an
* anusvara (DLC what do we call a bindu?) or visarga, plus zero or
* more signs like <code>\u0F35</code>, plus an optional a-chung
* (<code>\u0F71</code>), plus an optional simple vowel.</li> <li>By
* <i>simple vowel</i>, we mean any of <code>\u0F72</code>,
* <code>\u0F74</code>, <code>\u0F7A</code>, <code>\u0F7B</code>,
* <code>\u0F7C</code>, <code>\u0F7D</code>, or
* <code>\u0F80</code>.</li> </ul>
*
* <p>(Note: The string <code>"\u0F68\u0F7E\u0F7C"</code> seems to equal
* <code>"\u0F00"</code>, though the Unicode standard does not
* indicate that it is so. This code treats it that way.)</p>
*
* <p> This class allows for invalid tsheg bars, like those
* containing more than one prefix, more than two suffixes, an
* invalid postsuffix (secondary suffix), more than one consonant
* stack (excluding the special case of what we call in Extended
* Wylie "'i", which is technically a consonant stack but is used in
* Tibetan like a suffix).</p>
*
* <p>Subclasses exist for valid, grammatically correct tsheg bars,
* and for invalid tsheg bars. Note that correctness is at the tsheg
* bar level only; it may be grammatically incorrect to concatenate
* two valid tsheg bars. Some subclasses can be represented in
* Unicode, but others contain nonstandard glyphs and cannot be.</p>
*
* @author David Chandler
*/
public abstract class TshegBar implements UnicodeReadyThunk {
/** Returns true, as we consider a transliteration in the Tibetan
* alphabet of a non-Tibetan language, say Chinese, as being
* Tibetan.
* @return true */
public boolean isTibetan() { return true; }
}
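
The "simple vowel" definition in the TshegBar Javadoc above can be sketched as a predicate. This is an illustration only: SimpleVowelSketch is a hypothetical name, and the seven code points are copied from the terminology list rather than taken from any class in this commit.

```java
/** Hypothetical sketch, not part of this commit. */
final class SimpleVowelSketch {
    /** Returns true iff ch is one of the seven code points that the
     *  TshegBar Javadoc lists as simple vowels. */
    static boolean isSimpleVowel(char ch) {
        switch (ch) {
        case '\u0F72':
        case '\u0F74':
        case '\u0F7A':
        case '\u0F7B':
        case '\u0F7C':
        case '\u0F7D':
        case '\u0F80':
            return true;
        default:
            // Includes a-chung (U+0F71), which the Javadoc treats as a
            // separate, optional part of a consonant stack.
            return false;
        }
    }
}
```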


@ -0,0 +1,317 @@
/*
The contents of this file are subject to the THDL Open Community License
Version 1.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License on the THDL web site
(http://www.thdl.org/).
Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
License for the specific terms governing rights and limitations under the
License.
The Initial Developer of this software is the Tibetan and Himalayan Digital
Library (THDL). Portions created by the THDL are Copyright 2001 THDL.
All Rights Reserved.
Contributor(s): ______________________________________.
*/
package org.thdl.tib.text.tshegbar;
import org.thdl.tib.text.TibetanMachineWeb;
/** This noninstantiable class allows for converting from Unicode
* characters (i.e., code points) to Extended Wylie. It cannot be
* used for long stretches of text, though, as it is unaware of
* context, which is essential to understanding a non-trivial string
* of Tibetan Unicode.
*
* <p>See the document by Nathaniel Garson and David Germano entitled
* <i>Extended Wylie Transliteration Scheme</i>. Note that there are
* a couple of issues with the November 18, 2001 revision of that
* document; these issues are in the Bugs tracker at <a
* href="http://sourceforge.net/projects/thdltools">http://sourceforge.net/projects/thdltools</a>.</p>
*
* @author David Chandler */
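public class UnicodeCharToExtendedWylie {
/** Do not use this, as this class is not instantiable. */
private UnicodeCharToExtendedWylie() { super(); }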
/** Returns the extended Wylie for the very simple sequence x.
* Returns null iff some (Unicode) char in x has no extended
* Wylie representation. This is unaware of context, so use it
* sparingly. */
public static StringBuffer getExtendedWylieForUnicodeString(String x) {
StringBuffer sb = new StringBuffer();
for (int i = 0; i < x.length(); i++) {
String ew = getExtendedWylieForUnicodeChar(x.charAt(i));
if (null == ew)
return null;
sb.append(ew);
}
return sb;
}
/** Returns the extended Wylie for x, or null if there is none.
* Understand that multiple Unicode code points (chars) map to
* the same Extended Wylie representation. Understand also that
* the scrap of Extended Wylie returned is only valid in certain
* contexts. For example, not all consonants take ra-btags. DLC NOW what about canonicalization? */
public static String getExtendedWylieForUnicodeChar(char x) {
switch (x) {
case '\u0F00': return "oM";
case '\u0F01': return null;
case '\u0F02': return null;
case '\u0F03': return null;
case '\u0F04': return "@";
case '\u0F05': return "#";
case '\u0F06': return "$";
case '\u0F07': return "%";
case '\u0F08': return "!";
case '\u0F09': return null;
case '\u0F0A': return null;
case '\u0F0B': return " ";
case '\u0F0C': return "*"; // DLC NOW: Jskad does not support this!
case '\u0F0D': return "/";
case '\u0F0E': return "//"; // DLC FIXME: this is kind of a hack-- the Unicode standard says the spacing for this construct is different than the spacing for "\u0F0D\u0F0D"
case '\u0F0F': return ";";
case '\u0F10': return "[";
case '\u0F11': return "|";
case '\u0F12': return "]";
case '\u0F13': return "`";
case '\u0F14': return ":";
case '\u0F15': return null;
case '\u0F16': return null;
case '\u0F17': return null;
case '\u0F18': return null;
case '\u0F19': return null;
case '\u0F1A': return null;
case '\u0F1B': return null;
case '\u0F1C': return null;
case '\u0F1D': return null;
case '\u0F1E': return null;
case '\u0F1F': return null;
case '\u0F20': return "0";
case '\u0F21': return "1";
case '\u0F22': return "2";
case '\u0F23': return "3";
case '\u0F24': return "4";
case '\u0F25': return "5";
case '\u0F26': return "6";
case '\u0F27': return "7";
case '\u0F28': return "8";
case '\u0F29': return "9";
case '\u0F2A': return null;
case '\u0F2B': return null;
case '\u0F2C': return null;
case '\u0F2D': return null;
case '\u0F2E': return null;
case '\u0F2F': return null;
case '\u0F30': return null;
case '\u0F31': return null;
case '\u0F32': return null;
case '\u0F33': return null;
case '\u0F34': return "=";
case '\u0F35': return null;
case '\u0F36': return null;
case '\u0F37': return null;
case '\u0F38': return null;
case '\u0F39': return null;
case '\u0F3A': return "<";
case '\u0F3B': return ">";
case '\u0F3C': return "(";
case '\u0F3D': return ")";
case '\u0F3E': return "{";
case '\u0F3F': return "}";
case '\u0F40': return "k";
case '\u0F41': return "kh";
case '\u0F42': return "g";
case '\u0F43': return (getExtendedWylieForUnicodeChar('\u0F42')
+ TibetanMachineWeb.WYLIE_SANSKRIT_STACKING_KEY // DLC FIXME: is this right?
+ getExtendedWylieForUnicodeChar('\u0FB7'));
case '\u0F44': return "ng";
case '\u0F45': return "c";
case '\u0F46': return "ch";
case '\u0F47': return "j";
case '\u0F48': return null;
case '\u0F49': return "ny";
case '\u0F4A': return "T";
case '\u0F4B': return "Th";
case '\u0F4C': return "D";
case '\u0F4D': return (getExtendedWylieForUnicodeChar('\u0F4C')
+ TibetanMachineWeb.WYLIE_SANSKRIT_STACKING_KEY // DLC FIXME: is this right?
+ getExtendedWylieForUnicodeChar('\u0FB7'));
case '\u0F4E': return "N";
case '\u0F4F': return "t";
case '\u0F50': return "th";
case '\u0F51': return "d";
case '\u0F52': return (getExtendedWylieForUnicodeChar('\u0F51')
+ TibetanMachineWeb.WYLIE_SANSKRIT_STACKING_KEY // DLC FIXME: is this right?
+ getExtendedWylieForUnicodeChar('\u0FB7'));
case '\u0F53': return "n";
case '\u0F54': return "p";
case '\u0F55': return "ph";
case '\u0F56': return "b";
case '\u0F57': return (getExtendedWylieForUnicodeChar('\u0F56')
+ TibetanMachineWeb.WYLIE_SANSKRIT_STACKING_KEY // DLC FIXME: is this right?
+ getExtendedWylieForUnicodeChar('\u0FB7'));
case '\u0F58': return "m";
case '\u0F59': return "ts";
case '\u0F5A': return "tsh";
case '\u0F5B': return "dz";
case '\u0F5C': return (getExtendedWylieForUnicodeChar('\u0F5B')
+ TibetanMachineWeb.WYLIE_SANSKRIT_STACKING_KEY // DLC FIXME: is this right?
+ getExtendedWylieForUnicodeChar('\u0FB7'));
case '\u0F5D': return "w";
case '\u0F5E': return "zh";
case '\u0F5F': return "z";
case '\u0F60': return "'";
case '\u0F61': return "y";
case '\u0F62': return "r";
case '\u0F63': return "l";
case '\u0F64': return "sh";
case '\u0F65': return "Sh";
case '\u0F66': return "s";
case '\u0F67': return "h";
case '\u0F68': return "a"; // DLC: maybe the empty string is OK here because typing just 'i' into Jskad causes root letter \u0F68 to appear... yuck...
case '\u0F69': return (getExtendedWylieForUnicodeChar('\u0F40')
+ TibetanMachineWeb.WYLIE_SANSKRIT_STACKING_KEY // DLC FIXME: is this right?
+ getExtendedWylieForUnicodeChar('\u0FB5'));
case '\u0F6A': return "r";
case '\u0F6B': return null;
case '\u0F6C': return null;
case '\u0F6D': return null;
case '\u0F6E': return null;
case '\u0F6F': return null;
case '\u0F70': return null;
case '\u0F71': return "A";
case '\u0F72': return "i";
case '\u0F73': return "I";
case '\u0F74': return "u";
case '\u0F75': return "U";
case '\u0F76': return "r-i"; // DLC Ri or r-i? I put in a bug report.
case '\u0F77': return "r-I"; // DLC or RI?
case '\u0F78': return "l-i";
case '\u0F79': return "l-I";
case '\u0F7A': return "e";
case '\u0F7B': return "ai";
case '\u0F7C': return "o";
case '\u0F7D': return "au";
case '\u0F7E': return "M";
case '\u0F7F': return "H";
case '\u0F80': return "-i";
case '\u0F81': return "-I";
case '\u0F82': return "~^";// DLC unsupported in Jskad
case '\u0F83': return "~"; // DLC unsupported in Jskad
case '\u0F84': return "?";
case '\u0F85': return "&";
case '\u0F86': return null;
case '\u0F87': return null;
case '\u0F88': return null;
case '\u0F89': return null;
case '\u0F8A': return null;
case '\u0F8B': return null;
case '\u0F8C': return null;
case '\u0F8D': return null;
case '\u0F8E': return null;
case '\u0F8F': return null;
case '\u0F90': return "k";
case '\u0F91': return "kh";
case '\u0F92': return "g";
case '\u0F93': return (getExtendedWylieForUnicodeChar('\u0F92')
+ TibetanMachineWeb.WYLIE_SANSKRIT_STACKING_KEY // DLC FIXME: is this right?
+ getExtendedWylieForUnicodeChar('\u0FB7'));
case '\u0F94': return "ng";
case '\u0F95': return "c";
case '\u0F96': return "ch";
case '\u0F97': return "j";
case '\u0F98': return null;
case '\u0F99': return "ny";
case '\u0F9A': return "T";
case '\u0F9B': return "Th";
case '\u0F9C': return "D";
case '\u0F9D': return (getExtendedWylieForUnicodeChar('\u0F9C')
+ TibetanMachineWeb.WYLIE_SANSKRIT_STACKING_KEY // DLC FIXME: is this right?
+ getExtendedWylieForUnicodeChar('\u0FB7'));
case '\u0F9E': return "N";
case '\u0F9F': return "t";
case '\u0FA0': return "th";
case '\u0FA1': return "d";
case '\u0FA2': return (getExtendedWylieForUnicodeChar('\u0FA1')
+ TibetanMachineWeb.WYLIE_SANSKRIT_STACKING_KEY // DLC FIXME: is this right?
+ getExtendedWylieForUnicodeChar('\u0FB7'));
case '\u0FA3': return "n";
case '\u0FA4': return "p";
case '\u0FA5': return "ph";
case '\u0FA6': return "b";
case '\u0FA7': return (getExtendedWylieForUnicodeChar('\u0FA6')
+ TibetanMachineWeb.WYLIE_SANSKRIT_STACKING_KEY // DLC FIXME: is this right?
+ getExtendedWylieForUnicodeChar('\u0FB7'));
case '\u0FA8': return "m";
case '\u0FA9': return "ts";
case '\u0FAA': return "tsh";
case '\u0FAB': return "dz";
case '\u0FAC': return (getExtendedWylieForUnicodeChar('\u0FAB')
+ TibetanMachineWeb.WYLIE_SANSKRIT_STACKING_KEY // DLC FIXME: is this right?
+ getExtendedWylieForUnicodeChar('\u0FB7'));
case '\u0FAD': return "w";
case '\u0FAE': return "zh";
case '\u0FAF': return "z";
case '\u0FB0': return "'";
case '\u0FB1': return "y";
case '\u0FB2': return "r";
case '\u0FB3': return "l";
case '\u0FB4': return "sh";
case '\u0FB5': return "Sh";
case '\u0FB6': return "s";
case '\u0FB7': return "h";
case '\u0FB8': return "a"; // DLC see note on \u0F68 ...
case '\u0FB9': return (getExtendedWylieForUnicodeChar('\u0F90')
+ TibetanMachineWeb.WYLIE_SANSKRIT_STACKING_KEY // DLC FIXME: is this right?
+ getExtendedWylieForUnicodeChar('\u0FB5'));
case '\u0FBA': return "w";
case '\u0FBB': return "y";
case '\u0FBC': return "r";
case '\u0FBD': return null;
case '\u0FBE': return null;
case '\u0FBF': return null;
case '\u0FC0': return null;
case '\u0FC1': return null;
case '\u0FC2': return null;
case '\u0FC3': return null;
case '\u0FC4': return null;
case '\u0FC5': return null;
case '\u0FC6': return null;
case '\u0FC7': return null;
case '\u0FC8': return null;
case '\u0FC9': return null;
case '\u0FCA': return null;
case '\u0FCB': return null;
case '\u0FCC': return null;
case '\u0FCD': return null;
case '\u0FCE': return null;
case '\u0FCF': return ""; // DLC i added this to the 'EWTS document misspeaks' bug report... null I think...
default: {
// DLC handle space (EW's "_")
// This character is in the range 0FD0-0FFF or is not in
// the Tibetan range at all. In either case, there is no
// corresponding Extended Wylie.
return null;
}
} // end switch
}
}
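
To show what "unaware of context" means in practice, here is a self-contained sketch of the same char-by-char lookup with only three of the mappings from the switch above. ContextFreeWylieSketch and wylieFor are hypothetical names; the real class depends on TibetanMachineWeb, which this sketch deliberately avoids.

```java
/** Hypothetical sketch, not part of this commit. */
final class ContextFreeWylieSketch {
    /** Three mappings copied from the switch above; everything else
     *  is treated as having no Extended Wylie. */
    static String wylieFor(char ch) {
        switch (ch) {
        case '\u0F0B': return " ";
        case '\u0F40': return "k";
        case '\u0F72': return "i";
        default: return null;
        }
    }

    /** Concatenates the per-char results, or returns null as soon as
     *  one char has no mapping. */
    static String wylieFor(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            String ew = wylieFor(s.charAt(i));
            if (ew == null)
                return null;
            sb.append(ew);
        }
        return sb.toString();
    }
}
```

Note that a lone U+0F40 comes back as "k", not "ka": supplying the inherent vowel requires knowing what follows, which is exactly the context this lookup lacks.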


@ -0,0 +1,98 @@
/*
The contents of this file are subject to the THDL Open Community License
Version 1.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License on the THDL web site
(http://www.thdl.org/).
Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
License for the specific terms governing rights and limitations under the
License.
The Initial Developer of this software is the Tibetan and Himalayan Digital
Library (THDL). Portions created by the THDL are Copyright 2001 THDL.
All Rights Reserved.
Contributor(s): ______________________________________.
*/
package org.thdl.tib.text.tshegbar;
/** Provides handy Extended Wylie-inspired names for Unicode
* characters commonly used to represent Tibetan. The consonant that
* the Extended Wylie text "ka" refers to is named EWC_ka as in "The
* Extended Wylie Consonant ka", the vowel represented in Extended
* Wylie by "i" is EWV_i, and so on. There is at least one exception
* to the naming scheme, but exceptions are well-commented.
*
* @see org.thdl.tib.text.tshegbar#ValidTshegBar
*
* @author David Chandler */
public interface UnicodeConstants {
/** for those times when you need a char to represent a non-existent character */
static final char EW_ABSENT = '\u0000';
// the thirty consonants, in alphabetical order:
/** first letter of the alphabet: */
static final char EWC_ka = '\u0F40';
static final char EWC_kha = '\u0F41';
static final char EWC_ga = '\u0F42';
static final char EWC_nga = '\u0F44';
static final char EWC_ca = '\u0F45';
static final char EWC_cha = '\u0F46';
static final char EWC_ja = '\u0F47';
static final char EWC_nya = '\u0F49';
static final char EWC_ta = '\u0F4F';
static final char EWC_tha = '\u0F50';
static final char EWC_da = '\u0F51';
static final char EWC_na = '\u0F53';
static final char EWC_pa = '\u0F54';
static final char EWC_pha = '\u0F55';
static final char EWC_ba = '\u0F56';
static final char EWC_ma = '\u0F58';
static final char EWC_tsa = '\u0F59';
static final char EWC_tsha = '\u0F5A';
static final char EWC_dza = '\u0F5B';
static final char EWC_wa = '\u0F5D';
static final char EWC_zha = '\u0F5E';
static final char EWC_za = '\u0F5F';
/** Note the irregular name. The Extended Wylie representation is
<code>'a</code>. */
static final char EWC_achen = '\u0F60'; /* DLC NOW is this achen or achung? achen is EWC_a, right? comment it. replace EWC_achen everywhere if you change it. */
static final char EWC_ya = '\u0F61';
static final char EWC_ra = '\u0F62';
static final char EWC_la = '\u0F63';
static final char EWC_sha = '\u0F64';
static final char EWC_sa = '\u0F66';
static final char EWC_ha = '\u0F67';
static final char EWC_a = '\u0F68';
/** In the word for father, "pA lags", there is an a-chung (i.e.,
<code>\u0F71</code>). This is the constant for that little
guy. */
static final char EW_achung = '\u0F71';
/* Four of the five vowels, some say, or, others say, "the four
vowels": */
/** "gi gu" (DLC?), the 'i' sound in the English word keep: */
static final char EWV_i = '\u0F72';
/** "zhabs kyu", the 'u' sound in the English word tune: */
static final char EWV_u = '\u0F74';
/** "'greng bu" (also known as "'greng po", and pronounced <i>dang-bo</i>), the 'a' sound in the English word gate: */
static final char EWV_e = '\u0F7A';
/** "na ro" (DLC?), the 'o' sound in the English word bone: */
static final char EWV_o = '\u0F7C';
/** subscribed form of EWC_wa, a.k.a. wa-btags */
static final char EWSUB_wa_zur = '\u0FAD';
/** subscribed form of EWC_ya */
static final char EWSUB_ya_btags = '\u0FB1';
/** subscribed form of EWC_ra */
static final char EWSUB_ra_btags = '\u0FB2';
/** subscribed form of EWC_la */
static final char EWSUB_la_btags = '\u0FB3';
}


@ -0,0 +1,63 @@
/*
The contents of this file are subject to the THDL Open Community License
Version 1.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License on the THDL web site
(http://www.thdl.org/).
Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
License for the specific terms governing rights and limitations under the
License.
The Initial Developer of this software is the Tibetan and Himalayan Digital
Library (THDL). Portions created by the THDL are Copyright 2001 THDL.
All Rights Reserved.
Contributor(s): ______________________________________.
*/
package org.thdl.tib.text.tshegbar;
/** A UnicodeReadyThunk represents a string of characters. While
* there are ways to turn a string of Unicode characters into a list
* of UnicodeReadyThunks (DLC reference it), you cannot
* necessarily recover the exact sequence of Unicode characters from
* a UnicodeReadyThunk. For characters that are not Tibetan
* Unicode and are not one of a handful of other known characters,
* only the most primitive operations are available. Generally in
* this case you can recover the exact string of Unicode characters,
* but don't bank on it.
*
* @author David Chandler
*/
public interface UnicodeReadyThunk {
/** Returns true iff this thunk is entirely Tibetan (regardless of
whether or not all characters come from the Tibetan range of
Unicode 3, i.e. <code>0x0F00</code>-<code>0x0FFF</code>). */
public boolean isTibetan();
/** Returns a sequence of Unicode characters that is equivalent to
* this thunk if possible. It is only possible if {@link
* #hasEquivalentUnicode()} is true. Unicode has more than one
* way to refer to the same language element, so this is just one
* method. When more than one Unicode sequence exists, and when
* the thunk {@link #isTibetan() is Tibetan}, this method returns
* sequences that the Unicode 3.2 standard does not discourage.
* @exception UnsupportedOperationException if {@link
* #hasEquivalentUnicode()} is false
* @return a String of Unicode characters */
public String getEquivalentUnicode() throws UnsupportedOperationException;
/** Returns true iff there exists a sequence of Unicode characters
* that correctly represents this thunk. This will not be the
* case if the thunk contains Tibetan characters for which the
* Unicode standard does not provide. See the Extended Wylie
* Transliteration System (EWTS) document (DLC ref, DLC mention
* Dza,fa,va doc bug) for more info, and see the Unicode 3
* standard section 9.13. The presence of head marks or multiple
* vowels in the thunk would cause this to return false, for
* example. */
public boolean hasEquivalentUnicode();
}
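
The contract between hasEquivalentUnicode() and getEquivalentUnicode() can be sketched with a trivial implementation. NonTibetanThunkSketch is a hypothetical class that does not exist in this commit, and the interface is repeated here (without its Javadoc) only so that the sketch compiles on its own.

```java
/** The three methods declared above, repeated so the sketch compiles alone. */
interface UnicodeReadyThunk {
    boolean isTibetan();
    String getEquivalentUnicode() throws UnsupportedOperationException;
    boolean hasEquivalentUnicode();
}

/** Hypothetical thunk for a run of non-Tibetan text.  A null payload
 *  stands for "no Unicode sequence represents this thunk". */
final class NonTibetanThunkSketch implements UnicodeReadyThunk {
    private final String unicode;

    NonTibetanThunkSketch(String unicode) {
        this.unicode = unicode;
    }

    public boolean isTibetan() {
        return false;
    }

    public boolean hasEquivalentUnicode() {
        return unicode != null;
    }

    public String getEquivalentUnicode() {
        // Callers are expected to check hasEquivalentUnicode() first.
        if (!hasEquivalentUnicode())
            throw new UnsupportedOperationException("no equivalent Unicode");
        return unicode;
    }
}
```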


@ -0,0 +1,234 @@
/*
The contents of this file are subject to the THDL Open Community License
Version 1.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License on the THDL web site
(http://www.thdl.org/).
Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
License for the specific terms governing rights and limitations under the
License.
The Initial Developer of this software is the Tibetan and Himalayan Digital
Library (THDL). Portions created by the THDL are Copyright 2001 THDL.
All Rights Reserved.
Contributor(s): ______________________________________.
*/
package org.thdl.tib.text.tshegbar;
/** <p>This non-instantiable class contains utility routines for
* dealing with Tibetan Unicode characters and strings of such
* characters.</p>
*
* @author David Chandler */
public class UnicodeUtils {
/** Do not use this, as this class is not instantiable. */
private UnicodeUtils() { super(); }
/** Returns true iff x is a Unicode character that represents a
consonant or two-consonant stack that has a Unicode code
point. Returns true only for the usual suspects (like
<code>\u0F40</code>) and for Sanskrit consonants (like
<code>\u0F4C</code>) and the simple two-consonant stacks in
Unicode (like <code>\u0F43</code>). Returns false for, among
other things, subjoined consonants like
<code>\u0F90</code>. */
public static boolean isNonSubjoinedConsonant(char x) {
return ((x != '\u0F48' /* reserved in Unicode 3.2, but not in use */)
&& (x >= '\u0F40' && x <= '\u0F6A'));
}
/** Returns true iff x is a Unicode character that represents a
subjoined consonant or subjoined two-consonant stack that has
a Unicode code point. Returns true only for the usual
suspects (like <code>\u0F90</code>) and for Sanskrit
consonants (like <code>\u0F9C</code>) and the simple
two-consonant stacks in Unicode (like <code>\u0FAC</code>).
Returns false for, among other things, non-subjoined
consonants like <code>\u0F40</code>. */
public static boolean isSubjoinedConsonant(char x) {
return ((x != '\u0F98' /* reserved in Unicode 3.2, but not in use */)
&& (x >= '\u0F90' && x <= '\u0FBC'));
}
/** Returns true iff x is the preferred representation of a
Tibetan or Sanskrit consonant and cannot be broken down any
further. Returns false for, among other things, subjoined
consonants like <code>\u0F90</code>, two-component consonants
like <code>\u0F43</code>, and fixed-form consonants like
'\u0F6A'. The new consonants (for transcribing Chinese, I
believe) "\u0F55\u0F39" (which EWTS calls "fa"),
"\u0F56\u0F39" ("va"), and "\u0F5F\u0F39" ("Dza") are
two-character sequences, but you should be aware of them
also. */
public static boolean isPreferredFormOfConsonant(char x) {
return ((x != '\u0F48' /* reserved in Unicode 3.2, but not in use */)
&& (x >= '\u0F40' && x <= '\u0F68')
&& (x != '\u0F43')
&& (x != '\u0F4D')
&& (x != '\u0F52')
&& (x != '\u0F57')
&& (x != '\u0F5C'));
}
/** Returns true iff unicodeChar is a character from the Unicode
range U+0F00-U+0FFF.
@see #isEntirelyTibetanUnicode(String) */
public static boolean isInTibetanRange(char unicodeChar) {
return (unicodeChar >= '\u0F00' && unicodeChar <= '\u0FFF');
}
/** Returns true iff unicodeString consists only of characters
from the Unicode range U+0F00-U+0FFF. (Note that these
characters are typically not enough to represent a Tibetan
text; you may need ZWSP (zero-width space) and various
whitespace from other ranges.) */
public static boolean isEntirelyTibetanUnicode(String unicodeString) {
for (int i = 0; i < unicodeString.length(); i++) {
if (!isInTibetanRange(unicodeString.charAt(i)))
return false;
}
return true;
}
/** Modifies tibetanUnicode so that it is equivalent, according to
the Unicode 3.2 standard, to the input buffer. The Tibetan
passages of the returned string are in THDL-canonical form,
however. This form uses a maximum of characters, in general,
and never uses characters whose use has been {@link
#isDiscouraged(char) discouraged}. If the input contains
characters for which {@link #isInTibetanRange(char)} is not
true, then they will not be modified.
<p>Note well that only well-formed input guarantees
well-formed output.</p> */
public static void toCanonicalForm(StringBuffer tibetanUnicode) {
int offset = 0;
while (offset < tibetanUnicode.length()) {
String s = toCanonicalForm(tibetanUnicode.charAt(offset));
if (null == s) {
++offset;
} else {
// modify tibetanUnicode and update offset.
tibetanUnicode.deleteCharAt(offset);
tibetanUnicode.insert(offset, s);
}
}
}
/** Like {@link #toCanonicalForm(StringBuffer)}, but does not
modify its input. Instead, it returns the canonically-formed
version of tibetanUnicode. */
public static String toCanonicalForm(String tibetanUnicode) {
StringBuffer sb = new StringBuffer(tibetanUnicode);
toCanonicalForm(sb);
return sb.toString();
}
/** There are 19 characters in the Tibetan range of Unicode 3.2
which can be decomposed into longer strings of characters in
the Tibetan range of Unicode. These 19 are said not to be in
THDL-canonical form. This routine returns the canonical form
for such characters, and returns null for characters that are
already canonical or are not in the Tibetan range of Unicode.
@param tibetanUnicodeChar the character to canonicalize
@return null if tibetanUnicodeChar is canonical, or a string
of two or three characters otherwise */
public static String toCanonicalForm(char tibetanUnicodeChar) {
switch (tibetanUnicodeChar) {
case '\u0F43': return new String(new char[] { '\u0F42', '\u0FB7' });
case '\u0F4D': return new String(new char[] { '\u0F4C', '\u0FB7' });
case '\u0F52': return new String(new char[] { '\u0F51', '\u0FB7' });
case '\u0F57': return new String(new char[] { '\u0F56', '\u0FB7' });
case '\u0F5C': return new String(new char[] { '\u0F5B', '\u0FB7' });
case '\u0F69': return new String(new char[] { '\u0F40', '\u0FB5' });
case '\u0F73': return new String(new char[] { '\u0F71', '\u0F72' });
case '\u0F75': return new String(new char[] { '\u0F71', '\u0F74' });
case '\u0F76': return new String(new char[] { '\u0FB2', '\u0F80' });
case '\u0F77': return new String(new char[] { '\u0FB2', '\u0F71', '\u0F80' });
case '\u0F78': return new String(new char[] { '\u0FB3', '\u0F80' });
case '\u0F79': return new String(new char[] { '\u0FB3', '\u0F71', '\u0F80' });
case '\u0F81': return new String(new char[] { '\u0F71', '\u0F80' });
case '\u0F93': return new String(new char[] { '\u0F92', '\u0FB7' });
case '\u0F9D': return new String(new char[] { '\u0F9C', '\u0FB7' });
case '\u0FA2': return new String(new char[] { '\u0FA1', '\u0FB7' });
case '\u0FA7': return new String(new char[] { '\u0FA6', '\u0FB7' });
case '\u0FAC': return new String(new char[] { '\u0FAB', '\u0FB7' });
case '\u0FB9': return new String(new char[] { '\u0F90', '\u0FB5' });
default:
return null;
}
}
/** Returns true iff tibetanUnicodeChar {@link
#isInTibetanRange(char)} and if the Unicode 3.2 standard
discourages the use of tibetanUnicodeChar. */
public static boolean isDiscouraged(char tibetanUnicodeChar) {
return ('\u0F73' == tibetanUnicodeChar
|| '\u0F75' == tibetanUnicodeChar
|| '\u0F77' == tibetanUnicodeChar
|| '\u0F81' == tibetanUnicodeChar);
/* DLC FIXME -- I was using 3.0 p.437-440, check 3.2. */
}
/** Returns true iff ch corresponds to the Tibetan letter ra.
Several Unicode characters correspond to the Tibetan letter ra
(in its subscribed form or otherwise). Oftentimes,
<code>\u0F62</code> is thought of as the nominal
representation. Returns false for some characters that
contain ra but are not merely ra, such as <code>\u0F77</code> */
public static boolean isRa(char ch) {
return ('\u0F62' == ch
|| '\u0F6A' == ch
|| '\u0FB2' == ch
|| '\u0FBC' == ch);
}
/** Returns true iff ch corresponds to the Tibetan letter wa.
Several Unicode characters correspond to the Tibetan letter
wa. Oftentimes, <code>\u0F5D</code> is thought of as the
nominal representation. */
public static boolean isWa(char ch) {
return ('\u0F5D' == ch
|| '\u0FAD' == ch
|| '\u0FBA' == ch);
}
/** Returns true iff ch corresponds to the Tibetan letter ya.
Several Unicode characters correspond to the Tibetan letter
ya. Oftentimes, <code>\u0F61</code> is thought of as the
nominal representation. */
public static boolean isYa(char ch) {
return ('\u0F61' == ch
|| '\u0FB1' == ch
|| '\u0FBB' == ch);
}
/** Returns true iff there exists at least one character ch in
unicodeString such that ch {@link #isRa(char) is ra} or contains
ra (like <code>\u0F77</code>). This method is not implemented
as fast as it could be. It calls on the canonicalization code
in order to maximize reuse and minimize the possibility of
coder error. */
public static boolean containsRa(String unicodeString) {
String canonForm = toCanonicalForm(unicodeString);
for (int i = 0; i < canonForm.length(); i++) {
if (isRa(canonForm.charAt(i)))
return true;
}
return false;
}
/** Inefficient shortcut.
@see #containsRa(String) */
public static boolean containsRa(char unicodeChar) {
return containsRa(new String(new char[] { unicodeChar }));
}
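/** Returns "U+" followed by the code point of ch in lowercase
hexadecimal, without zero padding (e.g., "U+f40"). */
public static String unicodeCharToString(char ch) {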
return "U+" + Integer.toHexString((int)ch);
}
}
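
Why containsRa goes through canonicalization can be seen in a cut-down, self-contained sketch. CanonicalFormSketch is a hypothetical class that copies only three of the nineteen decompositions from the table above. It builds a new string in a single pass, which is a simplification of the in-place StringBuffer loop and is valid here because every decomposition consists solely of canonical characters.

```java
/** Hypothetical sketch, not part of this commit. */
final class CanonicalFormSketch {
    /** Three of the nineteen decompositions from the table above.
     *  Returns null if ch is not one of them. */
    static String decompose(char ch) {
        switch (ch) {
        case '\u0F43': return "\u0F42\u0FB7";
        case '\u0F73': return "\u0F71\u0F72";
        case '\u0F77': return "\u0FB2\u0F71\u0F80";
        default: return null;
        }
    }

    /** Replaces each decomposable char with its decomposition. */
    static String toCanonicalForm(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            String d = decompose(s.charAt(i));
            if (d == null)
                sb.append(s.charAt(i));
            else
                sb.append(d);
        }
        return sb.toString();
    }

    /** Same four code points as isRa(char) above. */
    static boolean isRa(char ch) {
        return ch == '\u0F62' || ch == '\u0F6A'
            || ch == '\u0FB2' || ch == '\u0FBC';
    }

    /** Finds ra even when it is hidden inside a precomposed char. */
    static boolean containsRa(String s) {
        String canon = toCanonicalForm(s);
        for (int i = 0; i < canon.length(); i++) {
            if (isRa(canon.charAt(i)))
                return true;
        }
        return false;
    }
}
```

Here U+0F77 is not itself ra, but its canonical form begins with U+0FB2, so containsRa finds it without a second table of "characters that contain ra".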


@ -0,0 +1,30 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<!--
@(#)package.html
Copyright 2002 Tibetan and Himalayan Digital Library
This software is the confidential and proprietary information of
the Tibetan and Himalayan Digital Library. You shall use such
information only in accordance with the terms of the license
agreement you entered into with the THDL.
-->
</head>
<body bgcolor="white">
Provides for manipulating Tibetan text at the <i>tsek bar</i> level.
Roughly speaking, a "tsheg bar" (pronounced <i>tsek bar</i>) is a
syllable.
<p>
This package allows for turning a string of Unicode characters into
our <i>TTBIR</i>, our Tibetan Tsheg Bar Internal Representation.
Said Unicode document may contain non-Tibetan characters also.
</p>
</body>
</html>