Also, TMW->EWTS now generates \uF021-\uF0FF or \u0F00-\u0FFF escapes when appropriate. A few TMW glyphs still give errors.
Also, there's now a test to be sure that TM<->TMW and TMW->EWTS won't break in the future (except for the one glyph in TMW that isn't in TM, that one isn't tested). The baselines have not been hand-verified, but changes will be detected.
Readded the line for reversed dza; it should never have been deleted, as that breaks TM<->TMW. I tested the whole mapping by hand once; this incident shows that automation is very helpful.
'{' and '}' were swapped...
The Unicode for something was "", not "none".
+R, +W, +Y, R+ now in use (though more testing is needed)
stacks that use full-form subjoined RA and YA consonants.
ACIP {RVA} was converting to the wrong things.
The TMW for {RVA} was converting to the wrong ACIP.
Checked all the 'DLC' tags in the ttt (ACIP->Tibetan) package.
non-reference, publicly available ACIP files (hundreds of megabytes of
them) through the converter. The frequencies of these tsheg bars in
in the file, too.
This mechanism is for Andres (who noticed KAsh=>K+sh in practice) and power users only, and not power users until I document the thing outside of the source code.
Now Jskad has two "Convert selected ACIP to Tibetan" conversions, one with and one without warnings, built in to Jskad proper (not the converter, that is).
Now Jskad has two "Convert selected ACIP to Tibetan" conversions, one with and one without warnings, built in to Jskad proper (not the converter, that is).
The code would be cleaner if I could bear to delete my terrible hack. Maybe in a month, when I don't feel so dumb for coding it up in the first place.
The correct solution for such things is to give the ACIP->Tibetan converters a pre-filter mechanism. This would be before the lexer or part of the lexer (maybe you only want to filter tsheg bars), and it would allow the end user to specify things like "s/SNYAM'AM/S+NYAMA'AMA/g".
Fixed ACIP->Unicode/TMW for BDE, which should be B-DE, not B+DE, because the former is legal Tibetan.
The ACIP->EWTS subroutine has improved.
TMW->Wylie and TMW->ACIP are improved in error cases.
TMW->ACIP has friendly embedded error messages now.
because I don't know which glyphs o and x correspond to. For that
reason, they cause ERRORs.
The proposed THDL Extended Wylie ~X and X is now used for U+0F35 and
U+0F37 respectively.
TMW->Wylie text. All the conversions show you which format they take
as input and which format they give as output.
File filter for ACIP files added.
The GUI converter suggests a file extension wisely.
Fixed newline bug in ACIP->Unicode converter.
that the following would look better with shortened 'dreng-bu also,
but I'm sticking with the TM/TMW docs:
dz+r~137,2~~4,46~1,110~4,120~1,123~1,126~4,106~4,113~f5b,fb2
dz+w~138,2~~4,47~1,110~4,120~1,123~1,126~4,106~4,113~f5b,fad
dz+h~139,2~~4,48~1,110~4,120~1,123~1,126~4,106~4,113~0F5C
dz+h+y~140,2~~4,49~1,110~4,121~1,123~1,126~4,107~4,114~f5c,fb1
dz+h+r~141,2~~4,50~1,110~4,121~1,123~1,126~4,107~4,114~f5c,fb2
dz+h+l~249,2~~4,51~1,110~4,123~1,123~1,126~4,110~4,117~f5c,fb3
dz+h+w~143,2~~4,52~1,110~4,122~1,123~1,126~4,108~4,115~f5c,fad
mappings, so I've put them back, but with the EWTS non-correspondences
\tmwXYYY.
Jskad no longer supports superscribed or subscribed numerals, because
EWTS does not.
Brought <?Other?> closer to EWTS
Removed __TILDE__ (no longer in EWTS)
Changed M^ to ^M per new EWTS draft
Added ai, au, -i from WW tibwn.ini -- they were missing in this version
bugs; it is pre-alpha. It's usable, though, and finds tons of errors
in ACIP input files, with the user deciding just how pedantic to be.
The biggest outstanding bug is the silent one: treating { }, space, as
tsheg instead of whitespace when we ought to know better.
that say "ya can take a ga prefix" etc.
The ACIP->Unicode converter now gives warnings (optionally, and by
default, inline). This converter now produces output even when
lexical errors occur, but the output has errors and warnings inline.
\, the Sanskrit virama, is not used. Of the 1370-odd ACIP texts I've
got here, about 57% make it through the gauntlet (fewer if you demand
a vowel or disambiguator on every stack of a non-Tibetan tsheg bar).
b, c, d, e, ... do not belong in ACIP, so the scanner rejects them.
This should make it even easier to distinguish automatically between
Tibetan and English texts.
is a correction, that's a comment, this is Tibetan, that's Latin
(English), that's Tibetan inter-tsheg-bar punctuation, etc.) It now
accepts more real-world ACIP files, i.e. it handles illegal
constructs. The error checking is more user-friendly. There are now
tests.
Added some tsheg bars that Peter E. Hauer of Linguasoft sent me to the
tests. Many thanks, Peter. I still need to implement rules that say,
"This is not Tibetan, it must be Sanskrit, because that letter doesn't
take a MA prefix."
up that String into tsheg bars, punctuation, etc., while finding
errors. I've tested it some, but I'm not yet committing the tests.
Next step: a converter that takes an ACIP file as input and outputs
TMW+Latin.
and it has the capability to produce error messages and warnings that
make sense to the user. One can now get the correct parse, if one
exists, for an ACIP tsheg bar.
One could even feed in ACIP and get a list of warnings about things as
innocuous as PADMA, which a dumb converter would have trouble with.
One could then turn ACIP into well-behaved ACIP for that dumb
converter, if you really wanted to.
Still to do:
o Scan ACIP files into tsheg bars.
o Produce TMW/Latin (from which you can get Unicode, etc.).
o E-mail the illegal tsheg bars to the ACIP fellows so they can fix
the affected documents (most of the Kangyur has unparseable
creatures).
Our disambiguation is now perfect, happening when and only when it is
necessary. These are all illegal, so it shouldn't affect many
existing conversions. But if there were typos, it could.