This commit is for my benefit only; these classes are not ready for prime time,
and the build system is not yet aware of them.
I'm adding some classes for representing legal tsheg-bars (syllables, for the
most part) in Unicode. These classes were designed bottom-up (OK, OK --
they weren't designed designed, but I had to write down everything I knew
about Tibetan syntax somewhere). The classes are aware of extended
wylie. I doubt the Javadocs work yet, and I'm still testing (and am not
committing my testing code with these as it is not yet ready).
Next on my list--fix these up to reflect my new awareness of suffix particles
(like le'u'i'o) add classes to support syntactically incorrect Unicode
sequences. Then add a UnicodeReader, and we've got the back end of
a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's
Worldscript or FreeType Layout or Omega's OTPs).
A top-down design would not have included LegalTshegBar. But now that
my itch has been scratched, potential uses are lingering about. For example,
it would be nice to scan some input and break it into LegalTshegBars,
punctuation/marks/signs, and illegal stacks. Then we could alert the client
of the illegality, its precise form, and its precise location.
The real system for turning a Unicode stream into an internal representation
suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be
aware of Tibetan syntax. But to make the very best conversion from
Unicode to, e.g., EWTS, it is necessary to konw that gaskad is better
represented as gskad, but that jaskad is not the same as jskad.
2002-12-09 01:02:23 +00:00
|
|
|
/*
|
|
|
|
The contents of this file are subject to the THDL Open Community License
|
|
|
|
Version 1.0 (the "License"); you may not use this file except in compliance
|
|
|
|
with the License. You may obtain a copy of the License on the THDL web site
|
|
|
|
(http://www.thdl.org/).
|
|
|
|
|
|
|
|
Software distributed under the License is distributed on an "AS IS" basis,
|
|
|
|
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
|
|
|
|
License for the specific terms governing rights and limitations under the
|
|
|
|
License.
|
|
|
|
|
|
|
|
The Initial Developer of this software is the Tibetan and Himalayan Digital
|
|
|
|
Library (THDL). Portions created by the THDL are Copyright 2001 THDL.
|
|
|
|
All Rights Reserved.
|
|
|
|
|
|
|
|
Contributor(s): ______________________________________.
|
|
|
|
*/
|
|
|
|
|
|
|
|
package org.thdl.tib.text.tshegbar;
|
|
|
|
|
2002-12-15 03:35:24 +00:00
|
|
|
/** A UnicodeReadyThunk represents a string of codepoints. While
|
|
|
|
* there are ways to turn a string of Unicode codepoints into a list
|
This commit is for my benefit only; these classes are not ready for prime time,
and the build system is not yet aware of them.
I'm adding some classes for representing legal tsheg-bars (syllables, for the
most part) in Unicode. These classes were designed bottom-up (OK, OK --
they weren't designed designed, but I had to write down everything I knew
about Tibetan syntax somewhere). The classes are aware of extended
wylie. I doubt the Javadocs work yet, and I'm still testing (and am not
committing my testing code with these as it is not yet ready).
Next on my list--fix these up to reflect my new awareness of suffix particles
(like le'u'i'o) add classes to support syntactically incorrect Unicode
sequences. Then add a UnicodeReader, and we've got the back end of
a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's
Worldscript or FreeType Layout or Omega's OTPs).
A top-down design would not have included LegalTshegBar. But now that
my itch has been scratched, potential uses are lingering about. For example,
it would be nice to scan some input and break it into LegalTshegBars,
punctuation/marks/signs, and illegal stacks. Then we could alert the client
of the illegality, its precise form, and its precise location.
The real system for turning a Unicode stream into an internal representation
suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be
aware of Tibetan syntax. But to make the very best conversion from
Unicode to, e.g., EWTS, it is necessary to konw that gaskad is better
represented as gskad, but that jaskad is not the same as jskad.
2002-12-09 01:02:23 +00:00
|
|
|
* of UnicodeReadyThunks (DLC reference it), you cannot
|
2002-12-15 03:35:24 +00:00
|
|
|
* necessarily recover the exact sequence of Unicode codepoints from
|
|
|
|
* a UnicodeReadyThunk. For codepoints that are not Tibetan
|
|
|
|
* Unicode and are not one of a handful of other known codepoints,
|
This commit is for my benefit only; these classes are not ready for prime time,
and the build system is not yet aware of them.
I'm adding some classes for representing legal tsheg-bars (syllables, for the
most part) in Unicode. These classes were designed bottom-up (OK, OK --
they weren't designed designed, but I had to write down everything I knew
about Tibetan syntax somewhere). The classes are aware of extended
wylie. I doubt the Javadocs work yet, and I'm still testing (and am not
committing my testing code with these as it is not yet ready).
Next on my list--fix these up to reflect my new awareness of suffix particles
(like le'u'i'o) add classes to support syntactically incorrect Unicode
sequences. Then add a UnicodeReader, and we've got the back end of
a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's
Worldscript or FreeType Layout or Omega's OTPs).
A top-down design would not have included LegalTshegBar. But now that
my itch has been scratched, potential uses are lingering about. For example,
it would be nice to scan some input and break it into LegalTshegBars,
punctuation/marks/signs, and illegal stacks. Then we could alert the client
of the illegality, its precise form, and its precise location.
The real system for turning a Unicode stream into an internal representation
suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be
aware of Tibetan syntax. But to make the very best conversion from
Unicode to, e.g., EWTS, it is necessary to konw that gaskad is better
represented as gskad, but that jaskad is not the same as jskad.
2002-12-09 01:02:23 +00:00
|
|
|
* only the most primitive operations are available. Generally in
|
2002-12-15 03:35:24 +00:00
|
|
|
* this case you can recover the exact string of Unicode codepoints,
|
This commit is for my benefit only; these classes are not ready for prime time,
and the build system is not yet aware of them.
I'm adding some classes for representing legal tsheg-bars (syllables, for the
most part) in Unicode. These classes were designed bottom-up (OK, OK --
they weren't designed designed, but I had to write down everything I knew
about Tibetan syntax somewhere). The classes are aware of extended
wylie. I doubt the Javadocs work yet, and I'm still testing (and am not
committing my testing code with these as it is not yet ready).
Next on my list--fix these up to reflect my new awareness of suffix particles
(like le'u'i'o) add classes to support syntactically incorrect Unicode
sequences. Then add a UnicodeReader, and we've got the back end of
a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's
Worldscript or FreeType Layout or Omega's OTPs).
A top-down design would not have included LegalTshegBar. But now that
my itch has been scratched, potential uses are lingering about. For example,
it would be nice to scan some input and break it into LegalTshegBars,
punctuation/marks/signs, and illegal stacks. Then we could alert the client
of the illegality, its precise form, and its precise location.
The real system for turning a Unicode stream into an internal representation
suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be
aware of Tibetan syntax. But to make the very best conversion from
Unicode to, e.g., EWTS, it is necessary to konw that gaskad is better
represented as gskad, but that jaskad is not the same as jskad.
2002-12-09 01:02:23 +00:00
|
|
|
* but don't bank on it.
|
|
|
|
*
|
|
|
|
* @author David Chandler
|
|
|
|
*/
|
|
|
|
public interface UnicodeReadyThunk {
|
|
|
|
|
|
|
|
/** Returns true iff this thunk is entirely Tibetan (regardless of
|
2002-12-15 03:35:24 +00:00
|
|
|
whether or not all codepoints come from the Tibetan range of
|
|
|
|
Unicode 3, i.e. <code>U+0F00</code>-<code>U+0FFF</code>, and
|
|
|
|
regardless of whether or not this thunk is syntactically legal
|
|
|
|
Tibetan). */
|
This commit is for my benefit only; these classes are not ready for prime time,
and the build system is not yet aware of them.
I'm adding some classes for representing legal tsheg-bars (syllables, for the
most part) in Unicode. These classes were designed bottom-up (OK, OK --
they weren't designed designed, but I had to write down everything I knew
about Tibetan syntax somewhere). The classes are aware of extended
wylie. I doubt the Javadocs work yet, and I'm still testing (and am not
committing my testing code with these as it is not yet ready).
Next on my list--fix these up to reflect my new awareness of suffix particles
(like le'u'i'o) add classes to support syntactically incorrect Unicode
sequences. Then add a UnicodeReader, and we've got the back end of
a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's
Worldscript or FreeType Layout or Omega's OTPs).
A top-down design would not have included LegalTshegBar. But now that
my itch has been scratched, potential uses are lingering about. For example,
it would be nice to scan some input and break it into LegalTshegBars,
punctuation/marks/signs, and illegal stacks. Then we could alert the client
of the illegality, its precise form, and its precise location.
The real system for turning a Unicode stream into an internal representation
suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be
aware of Tibetan syntax. But to make the very best conversion from
Unicode to, e.g., EWTS, it is necessary to konw that gaskad is better
represented as gskad, but that jaskad is not the same as jskad.
2002-12-09 01:02:23 +00:00
|
|
|
public boolean isTibetan();
|
|
|
|
|
2002-12-15 03:35:24 +00:00
|
|
|
/** Returns a sequence of Unicode codepoints that is equivalent to
|
This commit is for my benefit only; these classes are not ready for prime time,
and the build system is not yet aware of them.
I'm adding some classes for representing legal tsheg-bars (syllables, for the
most part) in Unicode. These classes were designed bottom-up (OK, OK --
they weren't designed designed, but I had to write down everything I knew
about Tibetan syntax somewhere). The classes are aware of extended
wylie. I doubt the Javadocs work yet, and I'm still testing (and am not
committing my testing code with these as it is not yet ready).
Next on my list--fix these up to reflect my new awareness of suffix particles
(like le'u'i'o) add classes to support syntactically incorrect Unicode
sequences. Then add a UnicodeReader, and we've got the back end of
a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's
Worldscript or FreeType Layout or Omega's OTPs).
A top-down design would not have included LegalTshegBar. But now that
my itch has been scratched, potential uses are lingering about. For example,
it would be nice to scan some input and break it into LegalTshegBars,
punctuation/marks/signs, and illegal stacks. Then we could alert the client
of the illegality, its precise form, and its precise location.
The real system for turning a Unicode stream into an internal representation
suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be
aware of Tibetan syntax. But to make the very best conversion from
Unicode to, e.g., EWTS, it is necessary to konw that gaskad is better
represented as gskad, but that jaskad is not the same as jskad.
2002-12-09 01:02:23 +00:00
|
|
|
* this thunk if possible. It is only possible if {@link
|
2002-12-15 03:35:24 +00:00
|
|
|
* #hasUnicodeRepresentation()} is true. Unicode has more than one
|
This commit is for my benefit only; these classes are not ready for prime time,
and the build system is not yet aware of them.
I'm adding some classes for representing legal tsheg-bars (syllables, for the
most part) in Unicode. These classes were designed bottom-up (OK, OK --
they weren't designed designed, but I had to write down everything I knew
about Tibetan syntax somewhere). The classes are aware of extended
wylie. I doubt the Javadocs work yet, and I'm still testing (and am not
committing my testing code with these as it is not yet ready).
Next on my list--fix these up to reflect my new awareness of suffix particles
(like le'u'i'o) add classes to support syntactically incorrect Unicode
sequences. Then add a UnicodeReader, and we've got the back end of
a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's
Worldscript or FreeType Layout or Omega's OTPs).
A top-down design would not have included LegalTshegBar. But now that
my itch has been scratched, potential uses are lingering about. For example,
it would be nice to scan some input and break it into LegalTshegBars,
punctuation/marks/signs, and illegal stacks. Then we could alert the client
of the illegality, its precise form, and its precise location.
The real system for turning a Unicode stream into an internal representation
suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be
aware of Tibetan syntax. But to make the very best conversion from
Unicode to, e.g., EWTS, it is necessary to konw that gaskad is better
represented as gskad, but that jaskad is not the same as jskad.
2002-12-09 01:02:23 +00:00
|
|
|
* way to refer to the same language element, so this is just one
|
|
|
|
* method. When more than one Unicode sequence exists, and when
|
|
|
|
* the thunk {@link #isTibetan() is Tibetan}, this method returns
|
|
|
|
* sequences that the Unicode 3.2 standard does not discourage.
|
|
|
|
* @exception UnsupportedOperationException if {@link
|
2002-12-15 03:35:24 +00:00
|
|
|
* #hasUnicodeRepresentation()} is false
|
|
|
|
* @return a String of Unicode codepoints */
|
|
|
|
public String getUnicodeRepresentation() throws UnsupportedOperationException;
|
This commit is for my benefit only; these classes are not ready for prime time,
and the build system is not yet aware of them.
I'm adding some classes for representing legal tsheg-bars (syllables, for the
most part) in Unicode. These classes were designed bottom-up (OK, OK --
they weren't designed designed, but I had to write down everything I knew
about Tibetan syntax somewhere). The classes are aware of extended
wylie. I doubt the Javadocs work yet, and I'm still testing (and am not
committing my testing code with these as it is not yet ready).
Next on my list--fix these up to reflect my new awareness of suffix particles
(like le'u'i'o) add classes to support syntactically incorrect Unicode
sequences. Then add a UnicodeReader, and we've got the back end of
a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's
Worldscript or FreeType Layout or Omega's OTPs).
A top-down design would not have included LegalTshegBar. But now that
my itch has been scratched, potential uses are lingering about. For example,
it would be nice to scan some input and break it into LegalTshegBars,
punctuation/marks/signs, and illegal stacks. Then we could alert the client
of the illegality, its precise form, and its precise location.
The real system for turning a Unicode stream into an internal representation
suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be
aware of Tibetan syntax. But to make the very best conversion from
Unicode to, e.g., EWTS, it is necessary to konw that gaskad is better
represented as gskad, but that jaskad is not the same as jskad.
2002-12-09 01:02:23 +00:00
|
|
|
|
2002-12-15 03:35:24 +00:00
|
|
|
/** Returns true iff there exists a sequence of Unicode codepoints
|
This commit is for my benefit only; these classes are not ready for prime time,
and the build system is not yet aware of them.
I'm adding some classes for representing legal tsheg-bars (syllables, for the
most part) in Unicode. These classes were designed bottom-up (OK, OK --
they weren't designed designed, but I had to write down everything I knew
about Tibetan syntax somewhere). The classes are aware of extended
wylie. I doubt the Javadocs work yet, and I'm still testing (and am not
committing my testing code with these as it is not yet ready).
Next on my list--fix these up to reflect my new awareness of suffix particles
(like le'u'i'o) add classes to support syntactically incorrect Unicode
sequences. Then add a UnicodeReader, and we've got the back end of
a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's
Worldscript or FreeType Layout or Omega's OTPs).
A top-down design would not have included LegalTshegBar. But now that
my itch has been scratched, potential uses are lingering about. For example,
it would be nice to scan some input and break it into LegalTshegBars,
punctuation/marks/signs, and illegal stacks. Then we could alert the client
of the illegality, its precise form, and its precise location.
The real system for turning a Unicode stream into an internal representation
suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be
aware of Tibetan syntax. But to make the very best conversion from
Unicode to, e.g., EWTS, it is necessary to konw that gaskad is better
represented as gskad, but that jaskad is not the same as jskad.
2002-12-09 01:02:23 +00:00
|
|
|
* that correctly represents this thunk. This will not be the
|
|
|
|
* case if the thunk contains Tibetan characters for which the
|
|
|
|
* Unicode standard does not provide. See the Extended Wylie
|
|
|
|
* Transliteration System (EWTS) document (DLC ref, DLC mention
|
|
|
|
* Dza,fa,va doc bug) for more info, and see the Unicode 3
|
|
|
|
* standard section 9.13. The presence of head marks or multiple
|
|
|
|
* vowels in the thunk would cause this to return false, for
|
|
|
|
* example. */
|
2002-12-15 03:35:24 +00:00
|
|
|
public boolean hasUnicodeRepresentation();
|
This commit is for my benefit only; these classes are not ready for prime time,
and the build system is not yet aware of them.
I'm adding some classes for representing legal tsheg-bars (syllables, for the
most part) in Unicode. These classes were designed bottom-up (OK, OK --
they weren't designed designed, but I had to write down everything I knew
about Tibetan syntax somewhere). The classes are aware of extended
wylie. I doubt the Javadocs work yet, and I'm still testing (and am not
committing my testing code with these as it is not yet ready).
Next on my list--fix these up to reflect my new awareness of suffix particles
(like le'u'i'o) add classes to support syntactically incorrect Unicode
sequences. Then add a UnicodeReader, and we've got the back end of
a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's
Worldscript or FreeType Layout or Omega's OTPs).
A top-down design would not have included LegalTshegBar. But now that
my itch has been scratched, potential uses are lingering about. For example,
it would be nice to scan some input and break it into LegalTshegBars,
punctuation/marks/signs, and illegal stacks. Then we could alert the client
of the illegality, its precise form, and its precise location.
The real system for turning a Unicode stream into an internal representation
suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be
aware of Tibetan syntax. But to make the very best conversion from
Unicode to, e.g., EWTS, it is necessary to konw that gaskad is better
represented as gskad, but that jaskad is not the same as jskad.
2002-12-09 01:02:23 +00:00
|
|
|
}
|
|
|
|
|