Commit graph

11 commits

Author SHA1 Message Date
dchandler
717c3b94f3 Fixed ACIP->Unicode spaces/tshegs and newlines, especially with shads.
"NGA," becomes "NGA-tsheg-," automatically now.
2003-09-05 05:08:47 +00:00
dchandler
cc615f34df ACIP->TMW and ACIP->Unicode have my pre-stamp of non-approval. Except
for (NYAx} and {NYAo}, they're as good as I'll get them without input
from experts of the employ of a complementary, syllabary-based
approach.
2003-09-04 04:34:18 +00:00
dchandler
d2749cecd0 ACIP->TMW and ACIP->Unicode are now smart about when a newline is
really a newline and when a space is really a tsheg. The space in {KA
,MDO} is a tsheg, but the space in {GA ,MDO} is not.
2003-09-04 04:04:21 +00:00
dchandler
1982c5847b Jskad's converter now has ACIP-to-Unicode built in. There are known
bugs; it is pre-alpha.  It's usable, though, and finds tons of errors
in ACIP input files, with the user deciding just how pedantic to be.
The biggest outstanding bug is the silent one: treating { }, space, as
tsheg instead of whitespace when we ought to know better.
2003-08-24 06:40:53 +00:00
dchandler
d5ad760230 TMW->Wylie conversion now takes advantage of prefix rules, the rules
that say "ya can take a ga prefix" etc.

The ACIP->Unicode converter now gives warnings (optionally, and by
default, inline).  This converter now produces output even when
lexical errors occur, but the output has errors and warnings inline.
2003-08-23 22:03:37 +00:00
dchandler
1afb3a0fdd ACIP->Unicode, without going through TMW, is now possible, so long as
\, the Sanskrit virama, is not used.  Of the 1370-odd ACIP texts I've
got here, about 57% make it through the gauntlet (fewer if you demand
a vowel or disambiguator on every stack of a non-Tibetan tsheg bar).
2003-08-18 02:38:54 +00:00
dchandler
245aac4911 I'm now stricter about accepting alphabetic characters. F, Q, X, a,
b, c, d, e, ... do not belong in ACIP, so the scanner rejects them.
This should make it even easier to distinguish automatically between
Tibetan and English texts.
2003-08-17 02:38:58 +00:00
dchandler
39451d8879 Fixed a couple of small bugs.
Only 250 errors are reported now; this is important if you try to
convert an English document.
2003-08-17 02:12:49 +00:00
dchandler
4581a2d8ab Improved the ACIP scanner (the part of the converter that says, "This
is a correction, that's a comment, this is Tibetan, that's Latin
(English), that's Tibetan inter-tsheg-bar punctuation, etc.)  It now
accepts more real-world ACIP files, i.e. it handles illegal
constructs.  The error checking is more user-friendly.  There are now
tests.

Added some tsheg bars that Peter E. Hauer of Linguasoft sent me to the
tests.  Many thanks, Peter.  I still need to implement rules that say,
"This is not Tibetan, it must be Sanskrit, because that letter doesn't
take a MA prefix."
2003-08-17 01:45:55 +00:00
dchandler
0b91ed0beb I've improved the ACIP tsheg bar scanner to handle a lot of illegal
constructions that occur in practice.
2003-08-16 16:13:53 +00:00
dchandler
2b59d9838d I now have a function that takes as input a String of ACIP and breaks
up that String into tsheg bars, punctuation, etc., while finding
errors.  I've tested it some, but I'm not yet committing the tests.

Next step: a converter that takes an ACIP file as input and outputs
TMW+Latin.
2003-08-14 05:10:47 +00:00