because I don't know which glyphs o and x correspond to. For that
reason, they cause ERRORs.
The proposed THDL Extended Wylie ~X and X is now used for U+0F35 and
U+0F37 respectively.
bugs; it is pre-alpha. It's usable, though, and finds tons of errors
in ACIP input files, with the user deciding just how pedantic to be.
The biggest outstanding bug is the silent one: treating { }, space, as
tsheg instead of whitespace when we ought to know better.
that say "ya can take a ga prefix" etc.
The ACIP->Unicode converter now gives warnings (optionally, and by
default, inline). This converter now produces output even when
lexical errors occur, but the output has errors and warnings inline.
\, the Sanskrit virama, is not used. Of the 1370-odd ACIP texts I've
got here, about 57% make it through the gauntlet (fewer if you demand
a vowel or disambiguator on every stack of a non-Tibetan tsheg bar).
b, c, d, e, ... do not belong in ACIP, so the scanner rejects them.
This should make it even easier to distinguish automatically between
Tibetan and English texts.
is a correction, that's a comment, this is Tibetan, that's Latin
(English), that's Tibetan inter-tsheg-bar punctuation, etc.) It now
accepts more real-world ACIP files, i.e. it handles illegal
constructs. The error checking is more user-friendly. There are now
tests.
Added some tsheg bars that Peter E. Hauer of Linguasoft sent me to the
tests. Many thanks, Peter. I still need to implement rules that say,
"This is not Tibetan, it must be Sanskrit, because that letter doesn't
take a MA prefix."
up that String into tsheg bars, punctuation, etc., while finding
errors. I've tested it some, but I'm not yet committing the tests.
Next step: a converter that takes an ACIP file as input and outputs
TMW+Latin.