Better docs w.r.t. the lexer's handling of ACIP spaces etc.

dchandler 2003-12-10 06:57:12 +00:00
parent 8561623b5e
commit 0378e38d4a


@ -804,7 +804,81 @@ TIBETAN FONT AND NEEDS TO BE REDONE BY DOUBLE INPUT]"
</p>
<p>
FIXME: describe when the converter treats a space as a <i>tsheg</i> and when a space is Tibetan whitespace.&nbsp; Describe how a tsheg does not appear after {KA} and {GA} with most vowels, describe the handling of {NGA,} as {NGA&nbsp;,}.&nbsp; Talk about dzongkha vs. tibetan when it comes to a <i>tsheg</i> at the end of a string of <i>tsheg bar</i>s.&nbsp; Describe treatment of final line break or lack thereof.&nbsp; Warn users to watch out for lines that end with {-}.&nbsp; Describe treatment of {.} in certain contexts as U+0F0C.&nbsp; Etc.
The converters will insert a <i>tsheg</i> in some places where no ACIP
{&nbsp;} appears; this happens after {PA} and {DAM,} below:
</p>
<pre>
GA PA
GA PHA
DAM,
LHAG
GA CA,
GA
</pre>
<p>
Note that a space appears after {PHA}, and a comma appears after
{CA}, but {PA} has nothing between it and a line break.&nbsp; The
converters are smart enough to insert a <i>tsheg</i> regardless.
</p>
<p>
Also missing from the above ACIP, but inserted automatically by the
converters, is Tibetan whitespace; the converter sees
{DAM,&nbsp;LHAG} instead of {DAM,LHAG} above.
</p>
<p>
If such automatic corrections are not desired, try using a Unicode
<a href="#escapes">escape</a> before the line break instead of {PA}
or {,}.
</p>
<p>
The converters also treat {NGA,} as a typo for {NGA&nbsp;,}
(actually, {NGA\u0F0C,} since one wouldn't want a line break to
occur after the <i>tsheg</i> and cause a <i>shad</i> to begin a
line; see the section on formatting Tibetan texts in the <i>Tibetan!
5.1</i> documentation) because Tibetan typesetting requires that NGA
not appear directly before a <i>shad</i>.&nbsp; (Perhaps {NGA,}
would look too much like {KA}.)
</p>
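<p>
The rewrite described above can be sketched as a plain text
substitution.&nbsp; This is an illustration in Python, not the
converters' own code: the function name normalize_nga_shad is
hypothetical, and the sketch assumes the rule fires on every literal
{NGA,} and on nothing else.
</p>

```python
def normalize_nga_shad(acip):
    """Hypothetical helper: treat {NGA,} as {NGA\\u0F0C,}.

    Assumption: a literal substring match is enough. The result holds
    the ACIP Unicode escape as six literal characters (backslash, u,
    0, F, 0, C), not the character U+0F0C itself.
    """
    return acip.replace("NGA,", "NGA\\u0F0C,")


if __name__ == "__main__":
    print(normalize_nga_shad("NGA,"))
    print(normalize_nga_shad("NGA ,"))  # already has a space; left alone
```

<p>
Input that already has {NGA&nbsp;,} passes through unchanged in this
sketch; how the converters themselves treat that case is not shown
here.
</p>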
<p>
The converters embody the rule that a <i>shad</i> does not appear
after GA or KA unless a <i>shabs kyu</i> vowel is on the GA or
KA.&nbsp; For example, the space in {MA&nbsp;,HA} is a <i>tsheg</i>,
and the space in {KU&nbsp;,HA} is a <i>tsheg</i>, but the space in
{GA&nbsp;,HA} is Tibetan whitespace.
</p>
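<p>
The three examples above can be reproduced with a small classifier.&nbsp;
This is a sketch under stated assumptions, not the converters' logic:
the name space_before_shad is hypothetical, only a bare K or G plus one
vowel letter is recognized, and the vowels I, E, and O are assumed to
behave like A.
</p>

```python
def space_before_shad(syllable):
    """Classify the ACIP space in '<syllable> ,' (hypothetical helper).

    Returns "whitespace" when the syllable is a bare KA or GA without
    a shabs kyu (U) vowel, and "tsheg" otherwise.
    Assumptions: only two-character syllables are checked, and the
    vowels I, E, O are treated like A.
    """
    is_bare_ka_or_ga = (
        len(syllable) == 2
        and syllable[0] in "KG"
        and syllable[1] in "AIEO"
    )
    return "whitespace" if is_bare_ka_or_ga else "tsheg"


if __name__ == "__main__":
    for syllable in ("MA", "KU", "GA"):
        print(syllable, space_before_shad(syllable))
```

<p>
Stacks such as {NGA} or {KHA} are longer than two characters, so this
sketch reports a <i>tsheg</i> for them; whether the converters agree
in every such case is not established here.
</p>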
<p>
If you find that the converters put a <i>tsheg</i> where it does not
belong, miss a <i>tsheg</i>, or put whitespace where it does not belong,
please contact <a
href="mailto:thdltools-devel@lists.sourceforge.net">the
developers</a>.
</p>
<p>
Though the ACIP standard does not mention it, it appears that some
ACIP Release IV texts use a period (i.e., {.}) to indicate a
non-breaking tsheg (i.e., U+0F0C).&nbsp; Search for {NGO.,},
{....,DAM}, etc.&nbsp; Unless {,}, {.}, or a letter (i.e., a through
z) follows the {.}, it is only grudgingly interpreted as a
non-breaking tsheg -- a warning is generated, too.&nbsp; FIXME: Is
this right?&nbsp; Allow for treating {.} as an outright error.<!--
DLC FIXME -->
</p>
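<p>
The warning condition for {.} can be sketched as follows.&nbsp; This is
an illustration only: the name classify_periods is hypothetical, the
letter test is assumed to ignore case, and a {.} at the very end of the
input is assumed to warn because nothing follows it.
</p>

```python
def classify_periods(acip):
    """Return a list of (index, warn) pairs, one per '.' in acip.

    warn is False when the '.' is followed by ',', '.', or an ASCII
    letter, and True otherwise (including at end of input).
    Assumption: upper- and lowercase letters both count as letters.
    """
    results = []
    for i, ch in enumerate(acip):
        if ch != ".":
            continue
        nxt = acip[i + 1] if i + 1 < len(acip) else ""
        expected = nxt in (",", ".") or ("a" <= nxt.lower() <= "z")
        results.append((i, not expected))
    return results


if __name__ == "__main__":
    for text in ("NGO.,", "....,DAM", "KA. "):
        print(repr(text), classify_periods(text))
```

<p>
Each {.} is still read as a non-breaking <i>tsheg</i> in this sketch;
the flag only marks where a warning would accompany it.
</p>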
<p>
Note that the treatment of the very last line in an input text is
suspect.<!-- DLC FIXME -->
</p>
<!-- <h1>DLC</h1>
@ -1397,6 +1471,18 @@ Nativeness</h2>
a change in font size.)&nbsp; [<a
href="http://sourceforge.net/tracker/index.php?func=detail&aid=855519&group_id=61934&atid=502515">855519</a>]
</li>
<li>
A folio marker {@0B1} can appear; it gives an error at present.
</li>
<li>
The treatment of the very last line in an input text may be buggy
with regard to treatment of ACIP spaces, etc.<!-- DLC -->
</li>
<li>
The treatment of {:} directly before a line break is likely
incorrect; a <i>tsheg</i> is inserted right now after the
visarga.<!-- DLC FIXME -->
</li>
</ul>
@ -1486,6 +1572,11 @@ Nativeness</h2>
The converter should warn for each occurrence of the vowels {'E},
{'O}, {'EE}, or {'OO}.
</li>
<li>
Default <a href="#sub">substitution</a> rules should handle
{KAsh}, which seems to always mean {K+sh} in ACIP Release IV
texts.
</li>
</ul>