I really hesitate to commit this because I'm not sure what it brings to the table exactly and I fear that it makes the ACIP->Tibetan converter code a lot uglier. The TODO(DLC)[EWTS->Tibetan] comments littered throughout are part of the ugliness; they point to the ugliness. If each were addressed, cleanliness could perhaps be achieved.

I've largely forgotten exactly what this change does, but it attempts to improve EWTS->Tibetan conversion. The lexer is probably really, really primitive. I concentrate here on converting a single tsheg bar rather than a whole document.

Eclipse was used during part of my journey here and some imports were reorganized merely because I could. :) (Eclipse was needed when the usual ant build failed to run a new test, EWTSTest. And I wanted its debugger.)

Next steps: end-to-end EWTS tests should bring many problems to light. Fix those. Triage all the TODO comments.

I don't know that I'll ever really trust the implementation. The tests are valuable, though. A clean implementation of EWTS->Tibetan in Jython might hold enough interest for me; I'd like to learn Python.
2005-06-20 06:18:00 +00:00
commit 7198f23361
parent f64bae8ea6
45 changed files with 1666 additions and 695 deletions
--- a/source/org/thdl/tib/text/tshegbar/UnicodeUtils.java
+++ b/source/org/thdl/tib/text/tshegbar/UnicodeUtils.java
@@ -506,5 +506,25 @@ public class UnicodeUtils implements UnicodeConstants {
        } while (mutated_this_time_through);
        return mutated;
    }
+
+    /** Returns true iff ch is a valid Tibetan codepoint in Unicode
+     *  4.0: */
+    public boolean isTibetanUnicodeCodepoint(char ch) {
+        // NOTE: could use an array of 256 booleans for speed but I'm lazy
+        return ((ch >= '\u0f00' && ch <= '\u0fcf')
+                && !(ch == '\u0f48'
+                     || (ch > '\u0f6a' && ch < '\u0f71')
+                     || (ch > '\u0f8b' && ch < '\u0f90')
+                     || ch == '\u0f98'
+                     || ch == '\u0fbd'
+                     || ch == '\u0fcd'
+                     || ch == '\u0fce'));
+    }
+
+    /** Returns true iff ch is in 0F00-0FFF but isn't a valid Tibetan
+     *  codepoint in Unicode 4.0: */
+    public boolean isInvalidTibetanUnicode(char ch) {
+        return (isInTibetanRange(ch) && !isTibetanUnicodeCodepoint(ch));
+    }
 }
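
The range check added above can be exercised outside the converter with a small standalone sketch. The class name TibetanCodepointSketch is hypothetical and not part of this commit, and the body of isInTibetanRange is an assumption taken from the "0F00-0FFF" wording in the javadoc, since the diff does not show that method. The methods are static here only so the sketch runs without a UnicodeUtils instance; the committed versions are instance methods.

```java
/** Standalone sketch; this class name is hypothetical and not part of the commit. */
final class TibetanCodepointSketch {

    /** Assumed behaviour of isInTibetanRange: true iff ch is in U+0F00..U+0FFF. */
    static boolean isInTibetanRange(char ch) {
        return ch >= '\u0f00' && ch <= '\u0fff';
    }

    /** Same predicate as in the diff: assigned Tibetan codepoints as of Unicode 4.0. */
    static boolean isTibetanUnicodeCodepoint(char ch) {
        return (ch >= '\u0f00' && ch <= '\u0fcf')
            && !(ch == '\u0f48'
                 || (ch > '\u0f6a' && ch < '\u0f71')
                 || (ch > '\u0f8b' && ch < '\u0f90')
                 || ch == '\u0f98'
                 || ch == '\u0fbd'
                 || ch == '\u0fcd'
                 || ch == '\u0fce');
    }

    /** In the Tibetan block but not an assigned codepoint. */
    static boolean isInvalidTibetanUnicode(char ch) {
        return isInTibetanRange(ch) && !isTibetanUnicodeCodepoint(ch);
    }

    public static void main(String[] args) {
        // A few probes: the letter KA, the gap at U+0F48, and a code point
        // past U+0FCF that is in the block but unassigned in Unicode 4.0.
        char[] samples = { '\u0f40', '\u0f48', '\u0fd0', 'a' };
        for (char ch : samples) {
            System.out.println("U+" + Integer.toHexString(ch).toUpperCase()
                + " valid=" + isTibetanUnicodeCodepoint(ch)
                + " invalid=" + isInvalidTibetanUnicode(ch));
        }
    }
}
```

Note that a character outside the block, such as 'a', is reported as neither valid nor invalid, because isInvalidTibetanUnicode only flags code points inside the Tibetan range.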