Now uses terminology from the Unicode standard: "codepoints" rather than
"characters", for example. Normalization forms NFKD and NFD are supported for
the Tibetan Unicode range. I don't like either, actually. I've tested NFKD,
but I've not yet committed the tests.
parent 3199ff7926
commit a42347b224
7 changed files with 210 additions and 136 deletions
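Reviewer note on the normalization forms named in the commit message: the sketch below is not code from this commit. It uses the JDK's `java.text.Normalizer` to show how NFD and NFKD can differ inside the Tibetan range. The class name `TibetanNormalizationSketch` and its helper methods are hypothetical, chosen only for illustration.

```java
import java.text.Normalizer;

// Hypothetical illustration class; nothing here comes from the repository.
final class TibetanNormalizationSketch {

    /** Canonical decomposition (NFD) via the JDK normalizer. */
    public static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Compatibility decomposition (NFKD) via the JDK normalizer. */
    public static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    /** Formats each codepoint of s as U+XXXX, separated by spaces. */
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String gha = "\u0F43";      // TIBETAN LETTER GHA, canonical decomposition
        String nbTsheg = "\u0F0C";  // non-breaking tsheg, compatibility decomposition only

        System.out.println("NFD  of U+0F43: " + describe(nfd(gha)));
        System.out.println("NFKD of U+0F43: " + describe(nfkd(gha)));
        System.out.println("NFD  of U+0F0C: " + describe(nfd(nbTsheg)));
        System.out.println("NFKD of U+0F0C: " + describe(nfkd(nbTsheg)));
    }
}
```

U+0F43 has a canonical decomposition, so both forms split it into U+0F42 U+0FB7. U+0F0C has only a compatibility mapping, so NFD leaves it alone while NFKD folds it to U+0F0B and loses the non-breaking distinction.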
@@ -18,14 +18,14 @@ Contributor(s): ______________________________________.
 
 package org.thdl.tib.text.tshegbar;
 
-/** A UnicodeReadyThunk represents a string of characters. While
- * there are ways to turn a string of Unicode characters into a list
+/** A UnicodeReadyThunk represents a string of codepoints. While
+ * there are ways to turn a string of Unicode codepoints into a list
  * of UnicodeReadyThunks (DLC reference it), you cannot
- * necessarily recover the exact sequence of Unicode characters from
- * a UnicodeReadyThunk. For characters that are not Tibetan
- * Unicode and are not one of a handful of other known characters,
+ * necessarily recover the exact sequence of Unicode codepoints from
+ * a UnicodeReadyThunk. For codepoints that are not Tibetan
+ * Unicode and are not one of a handful of other known codepoints,
  * only the most primitive operations are available. Generally in
- * this case you can recover the exact string of Unicode characters,
+ * this case you can recover the exact string of Unicode codepoints,
  * but don't bank on it.
  *
  * @author David Chandler
@@ -33,23 +33,25 @@ package org.thdl.tib.text.tshegbar;
 public interface UnicodeReadyThunk {
 
     /** Returns true iff this thunk is entirely Tibetan (regardless of
-        whether or not all characters come from the Tibetan range of
-        Unicode 3, i.e. <code>0x0F00</code>-<code>0x0FFF</code>). */
+        whether or not all codepoints come from the Tibetan range of
+        Unicode 3, i.e. <code>U+0F00</code>-<code>U+0FFF</code>, and
+        regardless of whether or not this thunk is syntactically legal
+        Tibetan). */
     public boolean isTibetan();
 
-    /** Returns a sequence of Unicode characters that is equivalent to
+    /** Returns a sequence of Unicode codepoints that is equivalent to
      * this thunk if possible. It is only possible if {@link
-     * #hasEquivalentUnicode()} is true. Unicode has more than one
+     * #hasUnicodeRepresentation()} is true. Unicode has more than one
      * way to refer to the same language element, so this is just one
      * method. When more than one Unicode sequence exists, and when
      * the thunk {@link #isTibetan() is Tibetan}, this method returns
      * sequences that the Unicode 3.2 standard does not discourage.
      * @exception UnsupportedOperationException if {@link
-     * #hasEquivalentUnicode()} is false
-     * @return a String of Unicode characters */
-    public String getEquivalentUnicode() throws UnsupportedOperationException;
+     * #hasUnicodeRepresentation()} is false
+     * @return a String of Unicode codepoints */
+    public String getUnicodeRepresentation() throws UnsupportedOperationException;
 
-    /** Returns true iff there exists a sequence of Unicode characters
+    /** Returns true iff there exists a sequence of Unicode codepoints
      * that correctly represents this thunk. This will not be the
      * case if the thunk contains Tibetan characters for which the
      * Unicode standard does not provide. See the Extended Wylie
@@ -58,6 +60,6 @@ public interface UnicodeReadyThunk {
      * standard section 9.13. The presence of head marks or multiple
      * vowels in the thunk would cause this to return false, for
      * example. */
-    public boolean hasEquivalentUnicode();
+    public boolean hasUnicodeRepresentation();
 }
 
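Reviewer note on the renamed methods: the Javadoc above says `getUnicodeRepresentation()` is usable only when `hasUnicodeRepresentation()` is true. The sketch below shows one way a caller might honor that contract. The interface is copied locally as it reads after this commit. `ThunkContractSketch`, `CodepointThunk`, `UnrepresentableThunk`, and `representationOr` are hypothetical names, not classes or methods from the repository. The `isTibetan()` range check is a simplifying assumption and is narrower than the Javadoc allows.

```java
// Hypothetical illustration; only the interface methods mirror the diff above.
final class ThunkContractSketch {

    /** Local copy of the interface as it reads after this commit. */
    public interface UnicodeReadyThunk {
        boolean isTibetan();
        String getUnicodeRepresentation() throws UnsupportedOperationException;
        boolean hasUnicodeRepresentation();
    }

    /** Assumed thunk backed directly by codepoints, so a representation always exists. */
    public static final class CodepointThunk implements UnicodeReadyThunk {
        private final String codepoints;

        public CodepointThunk(String codepoints) {
            this.codepoints = codepoints;
        }

        @Override
        public boolean isTibetan() {
            // Simplification: treat "Tibetan" as "every codepoint in U+0F00-U+0FFF".
            return !codepoints.isEmpty()
                && codepoints.codePoints().allMatch(cp -> cp >= 0x0F00 && cp <= 0x0FFF);
        }

        @Override
        public boolean hasUnicodeRepresentation() {
            return true;
        }

        @Override
        public String getUnicodeRepresentation() {
            return codepoints;
        }
    }

    /** Assumed thunk for content Unicode cannot express, e.g. one with head marks. */
    public static final class UnrepresentableThunk implements UnicodeReadyThunk {
        @Override
        public boolean isTibetan() {
            return true;
        }

        @Override
        public boolean hasUnicodeRepresentation() {
            return false;
        }

        @Override
        public String getUnicodeRepresentation() {
            throw new UnsupportedOperationException("no Unicode representation");
        }
    }

    /** Checks hasUnicodeRepresentation() first, so the getter never throws here. */
    public static String representationOr(UnicodeReadyThunk thunk, String fallback) {
        return thunk.hasUnicodeRepresentation() ? thunk.getUnicodeRepresentation() : fallback;
    }

    public static void main(String[] args) {
        UnicodeReadyThunk ka = new CodepointThunk("\u0F40"); // TIBETAN LETTER KA
        System.out.println(representationOr(ka, "?").equals("\u0F40"));
        System.out.println(representationOr(new UnrepresentableThunk(), "?"));
    }
}
```

Guarding with `hasUnicodeRepresentation()` keeps the `UnsupportedOperationException` for genuine misuse rather than ordinary control flow.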