Jskad

Author	SHA1	Message	Date
dchandler	aa5d86a6e3	The *->Unicode conversions were outputting Unicode that was not well-formed. They still do, but they do it less often. Chris Fynn wrote this a while back: By normal Tibetan & Dzongkha spelling, writing, and input rules Tibetan script stacks should be entered and written: 1 headline consonant (0F40->0F6A), any subjoined consonant(s) (0F90->0F9C), achung (0F71), shabkyu (0F74), any above headline vowel(s) (0F72 0F7A 0F7B 0F7C 0F7D and 0F80); any ngaro (0F7E, 0F82 and 0F83). Now efforts are made to ensure that the converters conform to the above rules.	2004-12-13 02:32:46 +00:00
dchandler	e2d42f36eb	Robert Chilton's experience inspired me to make the handling of errors and warnings in ACIP->Tibetan conversion much more configurable. You can now choose from short or long error messages, for one thing. You can change the severity of almost all warnings. Each error and warning has an error code. Errors and warnings are better tested. The converter GUI has a new checkbox for short messages; the converter CLI has a new mandatory option for short messages. I also fixed a bug whereby certain errors were not being appended to the 'errors' StringBuffer.	2004-04-24 17:49:16 +00:00
dchandler	542fb50bf1	The ~M and ~M` EWTS change had not fully been made. Someone submitted a bug report 911472 that alerted me to this.	2004-03-07 17:02:35 +00:00
dchandler	48b4c5cb07	Added a Unicode->ASCII dump for debugging *->Unicode conversions. To use it, use 'java -cp Jskad.jar org.thdl.util.VerboseUnicodeDump'.	2004-01-17 17:10:12 +00:00
dchandler	6232ee9170	Added comments referring to a user guide in development now.	2003-12-06 20:26:15 +00:00
dchandler	e7c4cc1874	Updated to be in sync with latest EWTS draft.	2003-11-29 22:59:39 +00:00
dchandler	4023be9612	Better prettyprinting. Untested.	2003-11-11 03:43:26 +00:00
dchandler	d99ae50d8a	The ACIP "BNA" was converting to B-NA instead of B+NA, even though NA cannot take a BA prefix. This was because BNA was interpreted as root-suffix. In ACIP, BN is surely B+N unless N takes a B prefix, so root-suffix is out of the question. Now Jskad has two "Convert selected ACIP to Tibetan" conversions, one with and one without warnings, built in to Jskad proper (not the converter, that is).	2003-10-26 00:24:28 +00:00
dchandler	306cf2817c	Private correspondence with Robert Chilton led me to add and remove a few prefix rules. BLC and BGL are here, BLK, BLG, BLNG, BLJ, BNG, BJ, BNY, BN, and BDZ are gone. Added a few new tests.	2003-10-25 21:47:34 +00:00
dchandler	f106deb884	Private correspondence with Robert Chilton led me to add and remove a few prefix rules. BLC and BGL are here, BLK, BLG, BLNG, BLJ, BNG, BJ, BNY, BN, and BDZ are gone. Added a few new tests.	2003-10-25 21:40:21 +00:00
dchandler	3b55ea509f	Prefix rules have changed. A few are gone; a few new ones are here. I've implemented here a list that Robert Chilton sent me in private correspondence. He doesn't describe it as definitive, but since it affects ACIP->Tibetan conversions, and it's the best I've got, here they are. There's still an optional warning about "Hey, prefix rules matter for this tsheg bar." I've left in a few rules that I didn't find on RC's list; I've asked him to look into these further.	2003-10-18 05:48:53 +00:00
dchandler	d5ad760230	TMW->Wylie conversion now takes advantage of prefix rules, the rules that say "ya can take a ga prefix" etc. The ACIP->Unicode converter now gives warnings (optionally, and by default, inline). This converter now produces output even when lexical errors occur, but the output has errors and warnings inline.	2003-08-23 22:03:37 +00:00
dchandler	1afb3a0fdd	ACIP->Unicode, without going through TMW, is now possible, so long as \, the Sanskrit virama, is not used. Of the 1370-odd ACIP texts I've got here, about 57% make it through the gauntlet (fewer if you demand a vowel or disambiguator on every stack of a non-Tibetan tsheg bar).	2003-08-18 02:38:54 +00:00
dchandler	b387c512e9	Fixed two bugs.	2003-06-15 03:08:57 +00:00
dchandler	6636d03a41	ant private-javadocs runs without warnings; cleaned up some as-yet-unused code.	2003-04-13 01:46:20 +00:00
dchandler	daacf6ee3b	I've got too many sandboxes, so I'm committing these changes, half-done, from one sandbox so as to consolidate my sandboxes.	2003-04-12 20:56:20 +00:00
dchandler	6e05b60cff	I'll need these when I turn a sequence of UnicodeGraphemeClusters into LegalTshegBars.	2003-04-12 20:19:02 +00:00
dchandler	eb71fb6075	"sgom pa'am " is correct, not "sgom pa'm ".	2003-04-07 23:49:07 +00:00
dchandler	d836b850e8	"sgom pa'm ", not "sgom pa'am", is now used. "pe'm " was being produced already, so the code was inconsistent. If it turns out that "pe'am " is preferred, I'll fix it later. Consistency is very appealing.	2003-03-31 01:38:27 +00:00
dchandler	33b3080068	Fixed a bunch of bugs; supports le'u'i'o, sgom pa'am, etc. Better tests. As part of that, I had to break TibetanMachineWeb into TibetanMachineWeb+THDLWylieConstants, because I don't want the class-wide initialization code from TibetanMachineWeb causing errors in LegalTshegBarTest.	2003-03-31 00:33:50 +00:00
dchandler	2b81020b0e	More and better tests; fixed some bugs in LegalTshegBar.	2003-03-28 03:49:49 +00:00
dchandler	08d2a5d702	Added a test for org.thdl.tib.text.tshegbar.UnicodeCodepointToThdlWylie.	2003-03-22 04:55:17 +00:00
dchandler	f2dcb0cbc3	I said I removed this earlier; I lied. Now it's gone.	2003-03-22 03:58:13 +00:00
dchandler	16cbfb6033	Moved ad-hoc test.java test cases to UnicodeGraphemeClusterTest.java, a JUnit test which can be run via 'ant check'. Removed test.java and its build process.	2003-03-22 03:55:39 +00:00
dchandler	395eca7bb1	Moved ad-hoc test.java test cases to LegalTshegBarTest.java, a JUnit test which can be run via 'ant check'.	2003-03-22 03:46:32 +00:00
dchandler	879b477902	Made some ad-hoc tests in test.java into JUnit tests, run by 'ant check'. NORM_NFD was replaced with NORM_NFKD in three cases in testMostlyNFKD.	2003-03-22 03:24:56 +00:00
dchandler	e5a63df1c1	Added a class skeleton that may not stay for long. I'm committing in order to sync with my laptop, really. This stuff will disappear and reappear in better form later, after a holiday of coding and eggless, alcohol-free nog.	2002-12-20 04:46:13 +00:00
dchandler	fdfedb4419	Added some tests for org.thdl.tib.text.tshegbar. These tests are preliminary, and for this package only. I'm committing in order to sync with my laptop, really. This stuff will disappear and reappear in better form later, after a holiday of coding and eggless, alcohol-free nog.	2002-12-20 04:34:56 +00:00
dchandler	7ea185fa01	Renamed UnicodeCharToExtendedWylie to UnicodeCodepointToThdlWylie.java. Added a new class, UnicodeGraphemeCluster, that can tell you the components of a grapheme cluster from top to bottom. It does not yet have good error checking; it is not yet finished. Next is to parse clean Unicode into GraphemeClusters. After that comes scanning dirty Unicode into best-guess GraphemeClusters, and scanning dirty Unicode to get nice error messages.	2002-12-17 13:51:18 +00:00
dchandler	8e8a23c6a6	Extended Wylie is referred to as THDL Extended Wylie or THDL Wylie because a Japanese scholar has an "Extended Wylie" also. NFKD and NFD have a new brother, NFTHDL. I wish there weren't a need, but as my yet-to-be-put-into-CVS break-unicode-into-grapheme-clusters code demonstrates, the-need-is-there. forgive-me for the hyphens, it's late.	2002-12-15 06:57:32 +00:00
dchandler	a42347b224	Now uses terminology from the Unicode standard. No more talk of characters, for example. Normalization forms NFKD and NFD are supported for the Tibetan Unicode range. I don't like either, actually. I've tested NFKD, but I've not yet committed the tests.	2002-12-15 03:35:24 +00:00
dchandler	26993a5093	So that Unicode escape sequences appear correctly in javadocs.	2002-12-09 02:35:39 +00:00
dchandler	2d6c8be804	So that Unicode escape sequences appear correctly in javadocs.	2002-12-09 02:29:09 +00:00
dchandler	22c6ec5406	Javadoc now works without warnings.	2002-12-09 01:48:34 +00:00
dchandler	f4a16f8e9d	This commit is for my benefit only; these classes are not ready for prime time, and the build system is not yet aware of them. I'm adding some classes for representing legal tsheg-bars (syllables, for the most part) in Unicode. These classes were designed bottom-up (OK, OK -- they weren't designed designed, but I had to write down everything I knew about Tibetan syntax somewhere). The classes are aware of extended wylie. I doubt the Javadocs work yet, and I'm still testing (and am not committing my testing code with these as it is not yet ready). Next on my list--fix these up to reflect my new awareness of suffix particles (like le'u'i'o) and add classes to support syntactically incorrect Unicode sequences. Then add a UnicodeReader, and we've got the back end of a Tibetan Unicode shaping system (like half of MS's Uniscribe or Apple's Worldscript or FreeType Layout or Omega's OTPs). A top-down design would not have included LegalTshegBar. But now that my itch has been scratched, potential uses are lingering about. For example, it would be nice to scan some input and break it into LegalTshegBars, punctuation/marks/signs, and illegal stacks. Then we could alert the client of the illegality, its precise form, and its precise location. The real system for turning a Unicode stream into an internal representation suitable for conversion to EWTS/ACIP/XHTML/what-have-you need not be aware of Tibetan syntax. But to make the very best conversion from Unicode to, e.g., EWTS, it is necessary to know that gaskad is better represented as gskad, but that jaskad is not the same as jskad.	2002-12-09 01:02:23 +00:00

35 commits