I really hesitate to commit this because I'm not sure what it brings to the
table exactly and I fear that it makes the ACIP->Tibetan converter code
a lot uglier.  The TODO(DLC)[EWTS->Tibetan] comments littered throughout
are part of the ugliness; they point to the ugliness.  If each were addressed,
cleanliness could perhaps be achieved.

I've largely forgotten exactly what this change does, but it attempts to
improve EWTS->Tibetan conversion.  The lexer is probably really, really
primitive.  I concentrate here on converting a single tsheg bar rather than
a whole document.

Eclipse was used during part of my journey here and some imports were
reorganized merely because I could.  :)

(Eclipse was needed when the usual ant build failed to run a new test
EWTSTest.  And I wanted its debugger.)

Next steps: end-to-end EWTS tests should bring many problems to light.  Fix
those.  Triage all the TODO comments.

I don't know that I'll ever really trust the implementation.  The tests are
valuable, though.  A clean implementation of EWTS->Tibetan in Jython
might hold enough interest for me; I'd like to learn Python.
dchandler 2005-06-20 06:18:00 +00:00
parent f64bae8ea6
commit 7198f23361
45 changed files with 1666 additions and 695 deletions

@@ -506,5 +506,25 @@ public class UnicodeUtils implements UnicodeConstants {
        } while (mutated_this_time_through);
        return mutated;
    }

    /** Returns true iff ch is a valid Tibetan codepoint in Unicode
     *  4.0: */
    public boolean isTibetanUnicodeCodepoint(char ch) {
        // NOTE: could use an array of 256 booleans for speed but I'm lazy
        return ((ch >= '\u0f00' && ch <= '\u0fcf')
                && !(ch == '\u0f48'
                     || (ch > '\u0f6a' && ch < '\u0f71')
                     || (ch > '\u0f8b' && ch < '\u0f90')
                     || ch == '\u0f98'
                     || ch == '\u0fbd'
                     || ch == '\u0fcd'
                     || ch == '\u0fce'));
    }

    /** Returns true iff ch is in 0F00-0FFF but isn't a valid Tibetan
     *  codepoint in Unicode 4.0: */
    public boolean isInvalidTibetanUnicode(char ch) {
        return (isInTibetanRange(ch) && !isTibetanUnicodeCodepoint(ch));
    }
}
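
For illustration only, here is a standalone sketch of how the two new methods behave, along with the 256-boolean lookup table that the NOTE comment mentions but does not implement. It is not part of the commit. The class name TibetanRangeSketch and the method isTibetanUnicodeCodepointFast are hypothetical, the methods are made static for convenience, and isInTibetanRange is not shown in this hunk, so its body below is an assumption based on the "0F00-0FFF" wording in the javadoc.

```java
final class TibetanRangeSketch {
    // Lookup table indexed by (ch - 0x0F00); covers the whole 0F00-0FFF block.
    private static final boolean[] VALID = new boolean[256];

    static {
        // Fill the table once from the range-based test below.
        for (int i = 0; i < VALID.length; i++) {
            VALID[i] = isTibetanUnicodeCodepoint((char) (0x0F00 + i));
        }
    }

    /** Same range test as in the diff, made static for this sketch. */
    static boolean isTibetanUnicodeCodepoint(char ch) {
        return ((ch >= '\u0f00' && ch <= '\u0fcf')
                && !(ch == '\u0f48'
                     || (ch > '\u0f6a' && ch < '\u0f71')
                     || (ch > '\u0f8b' && ch < '\u0f90')
                     || ch == '\u0f98'
                     || ch == '\u0fbd'
                     || ch == '\u0fcd'
                     || ch == '\u0fce'));
    }

    /** Hypothetical table-driven variant of the same test. */
    static boolean isTibetanUnicodeCodepointFast(char ch) {
        return ch >= 0x0F00 && ch <= 0x0FFF && VALID[ch - 0x0F00];
    }

    /** Assumed stand-in for the real isInTibetanRange, which this hunk does not show. */
    static boolean isInTibetanRange(char ch) {
        return ch >= 0x0F00 && ch <= 0x0FFF;
    }

    /** Same composition as in the diff, using the assumed helper above. */
    static boolean isInvalidTibetanUnicode(char ch) {
        return isInTibetanRange(ch) && !isTibetanUnicodeCodepoint(ch);
    }

    public static void main(String[] args) {
        System.out.println(isTibetanUnicodeCodepoint('\u0f40'));     // true: TIBETAN LETTER KA
        System.out.println(isTibetanUnicodeCodepoint('\u0f48'));     // false: unassigned gap
        System.out.println(isInvalidTibetanUnicode('\u0f48'));       // true: in block, unassigned
        System.out.println(isInvalidTibetanUnicode('a'));            // false: outside the block
        System.out.println(isTibetanUnicodeCodepointFast('\u0f40')); // true: agrees with range test
    }
}
```

The table is built from the range test itself, so the two variants cannot drift apart. Whether the lookup is measurably faster than the chain of comparisons would need profiling.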