bugs; it is pre-alpha. It's usable, though, and finds tons of errors
in ACIP input files, with the user deciding just how pedantic to be.
The biggest outstanding bug is the silent one: treating { }, space, as
tsheg instead of whitespace when we ought to know better.
that say "ya can take a ga prefix" etc.
The ACIP->Unicode converter now gives warnings (optionally, and by
default, inline). This converter now produces output even when
lexical errors occur, but the output has errors and warnings inline.
\, the Sanskrit virama, is not used. Of the 1370-odd ACIP texts I've
got here, about 57% make it through the gauntlet (fewer if you demand
a vowel or disambiguator on every stack of a non-Tibetan tsheg bar).
b, c, d, e, ... do not belong in ACIP, so the scanner rejects them.
This should make it even easier to distinguish automatically between
Tibetan and English texts.
is a correction, that's a comment, this is Tibetan, that's Latin
(English), that's Tibetan inter-tsheg-bar punctuation, etc.) It now
accepts more real-world ACIP files, i.e. it handles illegal
constructs. The error checking is more user-friendly. There are now
tests.
Added some tsheg bars that Peter E. Hauer of Linguasoft sent me to the
tests. Many thanks, Peter. I still need to implement rules that say,
"This is not Tibetan, it must be Sanskrit, because that letter doesn't
take a MA prefix."
up that String into tsheg bars, punctuation, etc., while finding
errors. I've tested it some, but I'm not yet committing the tests.
Next step: a converter that takes an ACIP file as input and outputs
TMW+Latin.
and it has the capability to produce error messages and warnings that
make sense to the user. One can now get the correct parse, if one
exists, for an ACIP tsheg bar.
One could even feed in ACIP and get a list of warnings about things as
innocuous as PADMA, which a dumb converter would have trouble with.
One could then turn ACIP into well-behaved ACIP for that dumb
converter, if you really wanted to.
Still to do:
o Scan ACIP files into tsheg bars.
o Produce TMW/Latin (from which you can get Unicode, etc.).
o E-mail the illegal tsheg bars to the ACIP fellows so they can fix
the affected documents (most of the Kangyur has unparseable
creatures).
Our disambiguation is now perfect, happening when and only when it is
necessary. These are all illegal, so it shouldn't affect many
existing conversions. But if there were typos, it could.
or <?Numbers?> commands; it instead hard-codes the appropriate comma-
delimited lists. This is cleaner because WylieWord and Jskad had different
values for these lists.
TMW->Wylie conversions with the new-and-improved TMW->Wylie
algorithm faulty.
Now I'm using it a little more than you need to, e.g. b.lha instead of blha is
generated because bla and b.la are ambiguous.
like \bullet, \emdash, etc., and this fix only works for Windows or OS/2 RTF
files, not for Mac RTF files. So if you want a TM->TMW conversion to work,
use MS Word for Windows, not for the Mac.
'<' and '>'. The current keyboard implementation makes this an either-or
proposition, when fundamentally it need not be.
Added a <?Numbers?> command and an <?Input:Numbers?> command to
tibwn.ini; broke the numbers apart from the consonants. This facilitates the
new-and-improved Tibetan->Wylie conversion.
Tibetan->Wylie is now done by forming legal tsheg-bars. A legal tsheg bar
is converted into perfect THDL Wylie. See code comments to learn what
it thinks is a legal tsheg-bar, but it inlcudes bskyUMbsH minus the trailing
punctuation (H), e.g.
Illegal sequences, such as runs of transliterated Sanskrit, are turned into
unambiguous Wylie; each glyph is followed by a vowel or a disambiguator
('.').
I've made it so that the illegal sequences are as beautiful as possible. You
get 'pad+me', for example, not the equivalent but uglier 'pad+m.e.'.
which means that the command-line tool can finally function with a headless
graphics device. Hopefully it will speed things up, too. It also means that
entering Roman text into the TMW->Unicode conversion and TMW->TM
conversion will be easy.
Added support for two more oddballs.
Deprecated the oddball lookup method because it drops up to 30 glyphs in
TibetanMachine. The correct solution is to transform the RTF before Java's
busted RTF readers ever see it. \'97 becomes \u151, e.g.
beginning of the document as they should and as they are documented to.
They now do, and they bracket the bad characters with the TM or TMW for
U+0F3C on the left and the TM or TMW for U+0F3D on the right.
Some cleanup.
the troublesome glyphs are now put at the beginning of the document
AFTER AN ACHEN. This makes a glyph like \tmw7095 visible atop the
achen.
Major fix to the handling of paragraphs in conversion; we were (for
whatever reason) dropping paragraphs before.
inserting 5 characters at a time and then skipping ahead just one
position. I don't think this affected correctness.
I believe there's still a terrible (exponential?) slowdown as the
input file gets bigger, however. Perhaps not -- but we run through
the first 1000 TMW glyphs in 6 seconds, the 20th thousand takes at
least 60 seconds. Is TMW->Wylie faster than TMW->Unicode? If so,
why?
Thought: don't use a DuffPane within TibetanConverter -- it can only
add overhead, right? My hprof profile said that the conversion was
taking just a couple of percent of the work; the rest was going to
display-related stuff that you should only see if you were displaying
the document. I'm not!