conversion. The tag 'TODO(DLC)[EWTS->Tibetan]' exists all over the
place. EWTS->Tibetan isn't here yet; lexing isn't here yet; this is
mainly a refactoring so that the ACIP->Tibetan code can be reused to
do EWTS->Tibetan.
I'm committing this because tests pass (it shouldn't be breaking
anything), because I want a checkpoint, and because the laptop this
sandbox was on isn't my preferred development environment.
Fixed part of bug 998476 and part of an undocumented bug. Discovered a
new bug, "aM" should be generated but only "M" is.
The undocumented bug was that laMA was generated when lAM should have been.
The part of bug 998476 that was fixed: laM, laH, etc. are now generated.
This does nothing about paN etc.
Some refactoring here; this is not a minimal diff.
Added tests of TMW->EWTS that use ACIP to get the TMW in place
because EWTS->TMW is a faulty keyboard at present.
dropped the ball on) introduced two lines for 8,95. This is a bad thing, so
I've taken out the second line. I've also introduced a check in
TibetanMachineWeb.java such that we'll know that tibwn.ini has no such
error in the future just by running 'ant clean jskad-run' and making sure that
the GUI is indeed visible.
I also updated the test baselines now that F03A and 0F82 are squared away.
confused; many glyphs that should have yielded errors were not.
I've added a test case that transforms every TMW glyph save the one with
no TM mapping to ACIP. I hand-checked that it was correct.
ACIP->TMW is fixed for # and *. I never noticed it, but each needed an
extra swoosh (U+0F05).
Round-tripping would be good, as would testing real-world use of
TMW->ACIP.
Support for U+F021-U+F0FF, the PUA that the latest EWTS uses, is not provided.
Also, we've traded some speed for memory -- DuffCode now uses bytes, not ints.
Fixed ACIP->Unicode/TMW for BDE, which should be B-DE, not B+DE, because the former is legal Tibetan.
The ACIP->EWTS subroutine has improved.
TMW->Wylie and TMW->ACIP are improved in error cases.
TMW->ACIP has friendly embedded error messages now.
mappings, so I've put them back, but with the EWTS non-correspondences
\tmwXYYY.
Jskad no longer supports superscribed or subscribed numerals, because
EWTS does not.
that say "ya can take a ga prefix" etc.
The ACIP->Unicode converter now gives warnings (optionally, and by
default, inline). This converter now produces output even when
lexical errors occur, but the output has errors and warnings inline.
Our disambiguation is now perfect, happening when and only when it is
necessary. These are all illegal, so it shouldn't affect many
existing conversions. But if there were typos, it could.
or <?Numbers?> commands; it instead hard-codes the appropriate comma-
delimited lists. This is cleaner because WylieWord and Jskad had different
values for these lists.
TMW->Wylie conversions with the new-and-improved TMW->Wylie
algorithm faulty.
Now I'm using it a little more than you need to, e.g. b.lha instead of blha is
generated because bla and b.la are ambiguous.
like \bullet, \emdash, etc., and this fix only works for Windows or OS/2 RTF
files, not for Mac RTF files. So if you want a TM->TMW conversion to work,
use MS Word for Windows, not for the Mac.
'<' and '>'. The current keyboard implementation makes this an either-or
proposition, when fundamentally it need not be.
Added a <?Numbers?> command and an <?Input:Numbers?> command to
tibwn.ini; broke the numbers apart from the consonants. This facilitates the
new-and-improved Tibetan->Wylie conversion.
Tibetan->Wylie is now done by forming legal tsheg-bars. A legal tsheg bar
is converted into perfect THDL Wylie. See code comments to learn what
it thinks is a legal tsheg-bar, but it inlcudes bskyUMbsH minus the trailing
punctuation (H), e.g.
Illegal sequences, such as runs of transliterated Sanskrit, are turned into
unambiguous Wylie; each glyph is followed by a vowel or a disambiguator
('.').
I've made it so that the illegal sequences are as beautiful as possible. You
get 'pad+me', for example, not the equivalent but uglier 'pad+m.e.'.
Added support for two more oddballs.
Deprecated the oddball lookup method because it drops up to 30 glyphs in
TibetanMachine. The correct solution is to transform the RTF before Java's
busted RTF readers ever see it. \'97 becomes \u151, e.g.
Is the EWTS '_' to be represented as U+0020, or is it a wider space?
Does TMW9.42, Dza, map to U+0F5F,U+0F39?
Does TMW6.60, r+y, map to U+0F62,U+0FBB or to U+0F6A,U+0FBB? (Likewise with r+w, TMW6.61, TMW6.62, etc.)
Is U+0F7E a bindu? What Unicode does TMW7.96 map to, for example? What does TMW7.91 map to?
Should TMW8.97 and TMW8.98 map to swastiskas elsewhere in Unicode? If so, which codepoints? Likewise with TMW9.60, a Chinese character.
Does TMW7.68 map to U+0F39?
Does TMW7.74, the ITHI secret sign, have a Unicode mapping? f68,fa0,f80,f72 comes close, but fa0 would be too large, wouldn't it?
What Unicode does TMW9.61 map to? Is it for sequences like f40,f7c,f60,f72? Or is it for f60,f72,f7c?