From 0378e38d4a42399780b59ece0c1cbcffa3d6b488 Mon Sep 17 00:00:00 2001
From: dchandler
Date: Wed, 10 Dec 2003 06:57:12 +0000
Subject: [PATCH] Better docs w.r.t. the lexer's handling of ACIP spaces etc.

---
 htdocs/ACIP_To_Tibetan_Converter.html | 93 ++++++++++++++++++++++++++-
 1 file changed, 92 insertions(+), 1 deletion(-)

diff --git a/htdocs/ACIP_To_Tibetan_Converter.html b/htdocs/ACIP_To_Tibetan_Converter.html
index e4cb38a..8bba84e 100644
--- a/htdocs/ACIP_To_Tibetan_Converter.html
+++ b/htdocs/ACIP_To_Tibetan_Converter.html
@@ -804,7 +804,81 @@ TIBETAN FONT AND NEEDS TO BE REDONE BY DOUBLE INPUT]"

- FIXME: describe when the converter treats a space as a tsheg and when a space is Tibetan whitespace.  Describe how a tsheg does not appear after {KA} and {GA} with most vowels, describe the handling of {NGA,} as {NGA ,}.  Talk about dzongkha vs. tibetan when it comes to a tsheg at the end of a string of tsheg bars.  Describe treatment of final line break or lack thereof.  Warn users to watch out for lines that end with {-}.  Describe treatment of {.} in certain contexts as U+0F0C.  Etc. + The converters will insert a tsheg in some places where no ACIP + { } appears; this happens after {PA} and {DAM,} below: +

+
+GA PA
+
+GA PHA 
+
+DAM,
+LHAG
+
+GA CA,
+
+GA 
+
+ +

+ Note that a space appears after {PHA}, and a comma appears after + {CA}, but {PA} has nothing between it and a line break.  The + converters are smart enough to insert a tsheg regardless. +

+ +

+ Also missing from the above ACIP, but inserted automatically by the + converters, is Tibetan whitespace; the converter sees + {DAM, LHAG} instead of {DAM,LHAG} above. +
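
The line-break behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the converter's actual lexer: the function name `join_acip_lines` is hypothetical, the sketch handles only lines that end in a letter or in {,}, and it treats the final line like any other line.

```python
def join_acip_lines(lines):
    """Join ACIP lines, adding the ACIP space { } implied at each line break.

    Assumption: a line ending in a letter is missing its tsheg, and a line
    ending in {,} is missing Tibetan whitespace; both are written as { }.
    Lines ending in anything else (including a space) are left alone.
    """
    pieces = []
    for line in lines:
        pieces.append(line)
        if line and (line[-1].isalpha() or line[-1] == ","):
            pieces.append(" ")
    return "".join(pieces)


example = ["GA PA", "GA PHA ", "DAM,", "LHAG", "GA CA,", "GA "]
print(repr(join_acip_lines(example)))
```

Run on the example lines, the sketch yields a single string in which {PA} is followed by a space and {DAM,LHAG} has become {DAM, LHAG}.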

+ +

+ If such automatic corrections are not desired, try using a Unicode + escape before the line break instead of {PA} + or {,}. +

+ +

+ The converters also treat {NGA,} as a typo for {NGA ,} + (actually, {NGA\u0F0C,} since one wouldn't want a line break to + occur after the tsheg and cause a shad to begin a + line; see the section on formatting Tibetan texts in the Tibetan! + 5.1 documentation) because Tibetan typesetting requires that NGA + not appear directly before a shad.  (Perhaps {NGA,} + would look too much like {KA}.) +
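
As a rough sketch of the {NGA,} rule, the following Python rewrites the ACIP text so that a non-breaking tsheg escape sits between {NGA} and the shad. The function name `fix_nga_shad` is hypothetical, and the plain string replacement is an assumption made for brevity; the real converter works on lexed tokens and may cover more cases.

```python
# The six-character ACIP Unicode escape for the non-breaking tsheg, U+0F0C.
NON_BREAKING_TSHEG_ESCAPE = "\\u0F0C"


def fix_nga_shad(acip):
    """Rewrite {NGA,} as {NGA\\u0F0C,}; leave everything else untouched."""
    return acip.replace("NGA,", "NGA" + NON_BREAKING_TSHEG_ESCAPE + ",")


print(fix_nga_shad("NGA,"))   # the escape now separates NGA from the shad
print(fix_nga_shad("NGA ,"))  # already has a space, so it is not rewritten
```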

+ +

+ The converters embody the rule that a tsheg does not appear + after GA or KA unless a shabs kyu vowel is on the GA or + KA.  For example, the space in {MA ,HA} is a tsheg, + and the space in {KU ,HA} is a tsheg, but the space in + {GA ,HA} is Tibetan whitespace. +
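
The three examples above can be reproduced with a small Python sketch. It is deliberately narrow and rests on assumptions: the name `space_before_shad` is hypothetical, only a bare {K} or {G} followed by one of the vowels A, I, U, E, O is modeled, and {U} is taken to be the shabs kyu vowel. Stacks, prefixes, and long vowels are not handled.

```python
import re

# A bare KA or GA syllable: one of K or G followed by a single vowel letter.
_BARE_KA_GA = re.compile(r"[KG]([AIUEO])")


def space_before_shad(syllable):
    """Classify the ACIP space between `syllable` and a following {,}."""
    match = _BARE_KA_GA.fullmatch(syllable)
    if match and match.group(1) != "U":
        return "whitespace"
    return "tsheg"


for syllable in ("MA", "KU", "GA"):
    print(syllable, space_before_shad(syllable))
```

Because the pattern must match the whole syllable, {NGA} and {KHA} are not mistaken for GA or KA in this sketch.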

+ +

+ If you find that the converters put a tsheg where it does not + belong, miss a tsheg, or put whitespace where it does not belong, + please contact the + developers. +

+ +

+ Though the ACIP standard does not mention it, it appears that some + ACIP Release IV texts use a period (i.e., {.}) to indicate a + non-breaking tsheg (i.e., U+0F0C).  Search for {NGO.,}, + {....,DAM}, etc.  Unless {,}, {.}, or a letter (i.e., a through + z) follows the {.}, it is only grudgingly interpreted as a + non-breaking tsheg -- a warning is generated, too.  FIXME: Is + this right?  Allow for treating {.} as an outright error. +
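
A minimal Python sketch of the {.} rule follows. The function name `interpret_period` and its return shape are hypothetical, and reading "a through z" as case-insensitive is an assumption; the sketch shows only when a warning would accompany the non-breaking tsheg.

```python
from typing import Tuple

NON_BREAKING_TSHEG = "\u0f0c"


def interpret_period(acip: str, index: int) -> Tuple[str, bool]:
    """Return (replacement, warn) for the {.} found at acip[index].

    The period always becomes U+0F0C here; `warn` is True unless the next
    character is {,}, {.}, or an ASCII letter (assumed case-insensitive).
    """
    if acip[index] != ".":
        raise ValueError("no {.} at the given index")
    following = acip[index + 1:index + 2]  # empty string at end of input
    expected = following in (",", ".") or (
        len(following) == 1 and following.isascii() and following.isalpha()
    )
    return NON_BREAKING_TSHEG, not expected


print(interpret_period("NGO.,", 3))  # followed by {,}: no warning
print(interpret_period("NGO. ", 3))  # followed by a space: warning
```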

+ +

+ Note that the treatment of the very last line in an input text is + circumspect. +

+ +
  • + The treatment of {:} directly before a line break is likely + incorrect; a tsheg is inserted right now after the + visarga. +
  • @@ -1486,6 +1572,11 @@ Nativeness The converter should warn for each occurrence of the vowels {'E}, {'O}, {'EE}, or {'OO}. +
  • + Default substitution rules should handle + {KAsh}, which seems to always mean {K+sh} in ACIP Release IV + texts. +