<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
|
<html xmlns="http://www.w3.org/1999/xhtml">
|
|
|
|
<!-- @author David Chandler -->
|
|
<!-- @editor Emacs, baby! -->
|
|
|
|
|
|
<head>
|
|
<title>ACIP To Tibetan Converters</title>
|
|
|
|
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
|
|
<script type="text/javascript" src="http://orion.lib.virginia.edu/thdl/scripts/thdl_scripts.js"></script>
|
|
<link rel="stylesheet" type="text/css" href="http://orion.lib.virginia.edu/thdl/style/thdl-styles.css"/>
|
|
</head>
|
|
|
|
<body>
|
|
|
|
<div id="banner">
|
|
<a id="logo" href="http://orion.lib.virginia.edu/thdl/index.html"><img id="test" alt="THDL Logo" src="http://orion.lib.virginia.edu/thdl/images/logo.png"/></a>
|
|
<h1>The Tibetan &amp; Himalayan Digital Library</h1>
|
|
|
|
<div id="menubar">
|
|
<script type='text/javascript'>function Go(){return}</script>
|
|
<script type='text/javascript' src='http://orion.lib.virginia.edu/thdl/scripts/new/thdl_menu_config.js'></script>
|
|
|
|
<script type='text/javascript' src='http://orion.lib.virginia.edu/thdl/scripts/new/menu_new.js'></script>
|
|
<script type='text/javascript' src='http://orion.lib.virginia.edu/thdl/scripts/new/menu9_com.js'></script>
|
|
<noscript><p>Your browser does not support javascript.</p></noscript>
|
|
<div id='MenuPos' >Menu Loading... </div>
|
|
</div><!--END menubar-->
|
|
|
|
</div><!--END banner-->
|
|
|
|
<div id="sub_banner">
|
|
<div id="search">
|
|
<form method="get" action="http://www.google.com/u/thdl">
|
|
<p>
|
|
<input type="text" name="q" id="q" size="15" maxlength="255" value="" />
|
|
<input type="submit" name="sa" id="sa" value="Search"/>
|
|
<input type="hidden" name="hq" id="hq" value="inurl:orion.lib.virginia.edu"/>
|
|
</p>
|
|
</form>
|
|
|
|
</div>
|
|
<div id="breadcrumbs">
|
|
<a href="http://orion.lib.virginia.edu/thdl/index.html">Home</a> &gt; <a href="index.html">Tools</a> &gt; <a href="http://orion.lib.virginia.edu/thdl/tools/allfonts.html">Fonts &amp; Input</a> &gt; <a href="http://orion.lib.virginia.edu/thdl/tools/conv.html">Converters</a> &gt; <a href="TMW_RTF_TO_THDL_WYLIE.html">Converters in Jskad</a> &gt; ACIP To Tibetan Converters
|
|
</div>
|
|
</div><!--END sub_banner-->
|
|
|
|
|
|
<div id="main">
|
|
|
|
<h2>ACIP To Tibetan Converters</h2>
|
|
|
|
<p>
|
|
This document describes the ACIP->Tibetan converters built atop
|
|
<a
|
|
href="http://orion.lib.virginia.edu/thdl/tools/jskad.html">Jskad</a>.
|
|
These converters were initially written by David Chandler, a
|
|
volunteer with the <a
|
|
href="http://orion.lib.virginia.edu/thdl/index.html">Tibetan and
|
|
Himalayan Digital Library</a>, in the latter half of 2003.
|
|
They built upon the work of Tony Duff, Edward Garrett, and Than
|
|
Garson, and they would not be possible without the assistance of
|
|
David Chapman, Robert Chilton, and Andrés Montano
|
|
Pellegrini. (Please correct, and forgive, any omissions from
|
|
these lists.)
|
|
</p>
|
|
|
|
<p>
|
|
These converters accept <a href="http://asianclassics.org">Asian
|
|
Classics Input Project</a> (ACIP) transliteration of Tibetan (using
|
|
ACIP's <a
|
|
href="http://asianclassics.org/download/tibetancode/ticode.pdf">Tibetan
|
|
Input Code</a>), a Roman transliteration scheme. ACIP has many
|
|
Buddhist texts available in ACIP transliteration, which alone makes
|
|
ACIP transliteration (or just ACIP for short) important.
|
|
</p>
|
|
|
|
<p>
|
|
The converters here accept a text file of ACIP and output either a
|
|
Unicode UTF-8-encoded text file or a Rich Text Format (RTF) file of
|
|
<a href="http://orion.lib.virginia.edu/thdl/tools/tmw.html">Tibetan
|
|
Machine Web</a> (TMW). The latter is ready to use onscreen and
|
|
to make beautiful hardcopy today; the former will be understood by
|
|
software for a long time to come.
|
|
</p>
|
|
|
|
<p>
|
|
The converters are meant to produce perfect results even for
|
|
imperfect input. To give you an idea of the thought and care
|
|
that went into these converters, consider the following partial list
|
|
of features:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
Four tiers of <a href="#diagnostics">warning and error
|
|
messages</a> are available.
|
|
</li>
|
|
<li>
|
|
Some transliterations specified by the ACIP standard are not
|
|
accepted (i.e., they cause <a href="#diagnostics">errors</a>)
|
|
because they are too often used improperly in Release V texts
|
|
(e.g., {\}); some non-standard transliteration is understood
|
|
because it is used in ACIP Release V texts (e.g., {[DD1]}).
|
|
</li>
|
|
<li>
|
|
Non-standard <a href="#escapes">Unicode character escapes</a> are
|
|
supported. (In this way, the glyph that the ACIP {\} refers
|
|
to according to the standard can in fact be represented, via
|
|
{\u0F84}.)
|
|
</li>
|
|
<li>
|
|
<a href="#colors">Color-coding</a> can help find typos in the
|
|
input.
|
|
</li>
|
|
<li>
|
|
A <a href="#sub">substitution</a> mechanism allows for correcting
|
|
erroneous documents on the fly.
|
|
</li>
|
|
<li>
|
|
The converters can output frequency <a
|
|
href="#stats">statistics</a>.
|
|
</li>
|
|
<li>
|
|
The <a href="#lex">"lexical analyzer"</a> and <a
|
|
href="#parse">"parser"</a> handle every intricacy of
|
|
real ACIP Release V texts.
|
|
</li>
|
|
<li>
|
|
The knowledge regarding the TMW font has been verified by
|
|
independent teams as described <a
|
|
href="TMW_or_TM_To_X_Converters.html#vv">here</a>.
|
|
</li>
|
|
</ul>
|
|
|
|
<p>
|
|
The ACIP->Unicode and ACIP->TMW converters are equally
good. There are some differences between the two,
though. Chiefly, the TMW font has only a fixed set of glyphs, whereas
Unicode can encode arbitrary Tibetan glyphs. Thus, the
hypothetical ACIP {GAI}, which parses as {G+AI} due to <a
href="#prefix">prefix rules</a>, will give an error in an
ACIP->TMW conversion because no glyph exists for this
stack. The ACIP->Unicode conversion will succeed, having
generated correct Unicode. This is the most important difference
between the two conversions; another is that <a
href="#colors">color-coding</a> is available only for ACIP->TMW.
|
|
</p>
|
|
|
|
<p>
|
|
The converters are actively maintained; your <a
|
|
href="mailto:thdltools-devel@lists.sourceforge.net">feedback</a> is
|
|
valued.
|
|
</p>
|
|
|
|
<p>
|
|
Note that there are also <a
|
|
href="TMW_or_TM_To_X_Converters.html">TMW->ACIP</a> converters
|
|
available; this document does not cover them.
|
|
</p>
|
|
|
|
<p>
|
|
In what follows, you will learn <a href="#using">how to use</a> the
|
|
converters, including all the features listed above, and you'll find
|
|
a list of <a href="#bugs">known bugs</a> and places where there is
|
|
<a href="#room">room for improvement</a>.
|
|
</p>
|
|
|
|
|
|
<a name="using"></a><h2>Using the Converters</h2>
|
|
|
|
<p>
|
|
This section briefly describes how the converters are best used.
|
|
</p>
|
|
|
|
<p>
|
|
The GUI and command-line interfaces are both sufficient; the GUI
|
|
interface is your best bet if you've not used the converters
|
|
before. To learn how to invoke these interfaces, read <a
|
|
href="TMW_RTF_TO_THDL_WYLIE.html#invok">these instructions</a>.
|
|
</p>
|
|
|
|
<p>
|
|
First, review the <a href="#bugs">known bugs</a> and be sure you can
|
|
live with them.
|
|
</p>
|
|
|
|
<p>
|
|
Now perform a trial conversion of your document with <a
href="#diagnostics">warnings</a> disabled. The first goal is to
ensure that the converter reports no outright <a
href="#diagnostics">errors</a> for
the input. If it does, make a copy of the input, edit the
copy, and feed it through again. Feel free to try this out as
soon as you're comfortable; the error messages themselves are
sometimes self-explanatory.
|
|
</p>
|
|
|
|
<p>
|
|
Once all errors have been corrected, do a conversion with warning
|
|
level 'Some'. If any warnings mark real problems, correct
|
|
those problems.
|
|
</p>
|
|
|
|
<p>
|
|
If you have the patience, now do a conversion with warning level
'Most'. Again, if any warnings mark real
problems, correct those problems.
|
|
</p>
|
|
|
|
<p>
|
|
The 'All' warning level is pedantic; you might find it useful if
|
|
you're writing software that is to produce ACIP transliteration that
|
|
is easily read by machines. If you find any useful warnings at
this level, report them as bugs -- such warnings should be 'Most' or
'Some' level.
|
|
</p>
|
|
|
|
<p>
|
|
For best results, produce <a href="#colors">color-coded
|
|
output</a>. Scan the output for non-<a
|
|
href="#native">native</a> <i>tsheg bar</i>s and ensure that they
|
|
match the original document (the one from which the ACIP
|
|
transliteration was produced). Color-coding is useful because,
|
|
for example, {ZHIGN} is probably a typo for {ZHING}; {ZHIGN} will
appear colored, whereas {ZHING} will not.
|
|
</p>
|
|
|
|
<p>
|
|
Note that the ACIP {%} gives a warning every time. Use the <a
|
|
href="#escapes">Unicode escape</a> {\u0F35} if you want to avoid
|
|
this warning, but <i>note well</i> that Unicode escapes are not part
|
|
of the ACIP standard. Thus, other tools that work with ACIP
|
|
transliteration will likely not understand {\u0F35}.
|
|
</p>
|
|
|
|
<p>
|
|
To save time, you may use the <a href="#sub"><i>tsheg-bar</i>
|
|
substitution</a> mechanism when appropriate.
|
|
</p>
|
|
|
|
<p>
|
|
Even if your desired end result is Unicode output, an ACIP->TMW
|
|
conversion is sometimes useful. One benefit is that errors
|
|
will appear for any ACIP <i>tsheg bar</i> that refers to a consonant
|
|
stack not included in TMW. These stacks should be scrutinized:
because TMW contains over 500 of the most common consonant stacks,
a stack that TMW lacks is rare and may well be a typo.
|
|
</p>
|
|
|
|
<p>
|
|
Finally, check a few folios by hand against the original document to
|
|
be sure that you're satisfied with the conversion.
|
|
</p>
|
|
|
|
|
|
|
|
|
|
|
|
<a name="diagnostics"></a><h2>Diagnostics: Warnings and Errors</h2>
|
|
|
|
<p>
|
|
These converters are designed such that the output is just what you
|
|
yourself would create by hand. Whenever there is doubt about
|
|
what output is desired, a warning or error is issued. This
|
|
means that a helpful warning or error message will appear in the
|
|
output, and that you will be told at the end of the conversion that
|
|
one or more warnings or errors have indeed occurred. You can
|
|
then search your output document for the text <tt>[#ERROR</tt> or
|
|
<tt>[#WARNING</tt>.
|
|
</p>
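<p>
If the output is too long to search by eye, the search can be
scripted. What follows is a minimal sketch in Java, assuming the
converted text is already in memory as a string. The class name
<tt>DiagnosticMarkerCount</tt> and the sample text are made up for
illustration; only the two marker strings come from this document.
</p>

```java
final class DiagnosticMarkerCount {
    /** Counts non-overlapping occurrences of marker in text. */
    static int count(String text, String marker) {
        int n = 0;
        int from = text.indexOf(marker);
        while (from >= 0) {
            n++;
            from = text.indexOf(marker, from + marker.length());
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical converter output; real output will look different.
        String output = "KA [#ERROR 130: {X}] KHA [#WARNING 505: {Y}] GA [#ERROR 134: {Z}]";
        System.out.println("errors:   " + count(output, "[#ERROR"));
        System.out.println("warnings: " + count(output, "[#WARNING"));
    }
}
```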
|
|
|
|
<p>
|
|
Some warning or error messages refer to lexical errors, that is,
|
|
errors that occur when <a href="#lex">breaking an input text up
|
|
into <i>tsheg bar</i>s</a>. Others are parsing errors, that
|
|
is, errors that occur during the <a href="#parse">interpretation of
|
|
ACIP <i>tsheg bar</i>s</a>. It helps to understand both these
|
|
processes.
|
|
</p>
|
|
|
|
<p>
|
|
There are four warning levels: 'None', 'Some', 'Most', and
|
|
'All'. Choose 'None' if you don't want any warnings to appear
in your output or be brought to your attention at the end of
|
|
conversion. Choose 'Some' if you want to see the most
|
|
important warnings, 'Most' if you want some real confidence in your
|
|
output, and 'All' if you've absolutely got to know that the output
|
|
is right.
|
|
</p>
|
|
|
|
<p>
|
|
Errors will always appear; you cannot disable them.
|
|
</p>
|
|
|
|
<p>
|
|
It is possible to alter the severity of a warning at runtime.
|
|
It is not possible to turn an error into a warning, however, and it is
not possible to turn a warning into an error (though that might be
useful; vote for RFE <a
href="http://sourceforge.net/tracker/index.php?func=detail&amp;aid=954903&amp;group_id=61934&amp;atid=502518">#954903</a>
if you want it). To change the severity of a warning, set the
system property <tt>thdl.acip.to.tibetan.warning.severity.XXX</tt>,
where XXX is the warning number, e.g. 501, to your choice of
|
|
<tt>DISABLED</tt>, <tt>Some</tt>, <tt>Most</tt>, or
|
|
<tt>All</tt>. Alternatively, alter <tt>options.txt</tt>, a
|
|
file found inside the top level of the JAR file, as the comments in
|
|
that file indicate. These instructions are for experts; please
|
|
contact <a href="mailto:thdltools-devel@lists.sourceforge.net">the
|
|
developers</a> if you need help.
|
|
</p>
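<p>
As an illustration only, here is how such a property name is put
together and set from Java code. The class name
<tt>WarningSeverityProperty</tt> is hypothetical, and whether the
converters honor a property that is set programmatically after they
have started is not confirmed here. The usual route for any Java
program is the JVM's <tt>-D</tt> option on the command line, as in
<tt>-Dthdl.acip.to.tibetan.warning.severity.504=DISABLED</tt>.
</p>

```java
final class WarningSeverityProperty {
    /** Prefix taken from this document; the warning number is appended to it. */
    static final String PREFIX = "thdl.acip.to.tibetan.warning.severity.";

    /** Builds the property name for one warning number, e.g. 501. */
    static String propertyName(int warningNumber) {
        return PREFIX + warningNumber;
    }

    public static void main(String[] args) {
        // The documented values are DISABLED, Some, Most, and All.
        System.setProperty(propertyName(504), "DISABLED");
        System.setProperty(propertyName(505), "Some");
        System.out.println(propertyName(504) + "=" + System.getProperty(propertyName(504)));
        System.out.println(propertyName(505) + "=" + System.getProperty(propertyName(505)));
    }
}
```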
|
|
|
|
<p>
|
|
One may choose to have ACIP->Tibetan ERRORS appear in long (i.e.,
verbose) form or in short (i.e., terse) form. When short
|
|
forms appear, they are embedded in the output like <tt>[#ERROR 130:
|
|
{X}]</tt>. The long forms are as follows:
|
|
</p>
|
|
|
|
<a name="101"></a><p><tt>101: There's not even a unique, non-illegal parse for {X}</tt></p>
<a name="102"></a><p><tt>102: Found an open bracket, 'X', within a [#COMMENT]-style comment.  Brackets may not appear in comments.</tt></p>
<a name="103"></a><p><tt>103: Found a truly unmatched close bracket, 'X'.</tt></p>
<a name="104"></a><p><tt>104: Found a closing bracket, 'X', without a matching open bracket.  Perhaps a [#COMMENT] incorrectly written as [COMMENT], or a [*CORRECTION] written incorrectly as [CORRECTION], caused this.</tt></p>
<a name="105"></a><p><tt>105: Found a truly unmatched open bracket, '[' or '{', prior to this current illegal open bracket, 'X'.</tt></p>
<a name="106"></a><p><tt>106: Found an illegal open bracket (in context, this is 'X').  Perhaps there is a [#COMMENT] written incorrectly as [COMMENT], or a [*CORRECTION] written incorrectly as [CORRECTION], or an unmatched open bracket?</tt></p>
<a name="107"></a><p><tt>107: Found an illegal at sign, @ (in context, this is X).  This folio marker has a period, '.', at the end of it, which is illegal.</tt></p>
<a name="108"></a><p><tt>108: Found an illegal at sign, @ (in context, this is X).  This folio marker is not followed by whitespace, as is expected.</tt></p>
<a name="109"></a><p><tt>109: Found an illegal at sign, @ (in context, this is X).  @012B is an example of a legal folio marker.</tt></p>
<a name="110"></a><p><tt>110: Found //, which could be legal (the Unicode would be \u0F3C\u0F3D), but is likely in an illegal construct like //NYA\\.</tt></p>
<a name="111"></a><p><tt>111: Found an illegal open parenthesis, '('.  Nesting of parentheses is not allowed.</tt></p>
<a name="112"></a><p><tt>112: Unexpected closing parenthesis, ')', found.</tt></p>
<a name="113"></a><p><tt>113: The ACIP {?}, found alone, may intend U+0F08, but it may intend a question mark, i.e. '?', in the output.  It may even mean that the original text could not be deciphered with certainty, like the ACIP {[?]} does.</tt></p>
<a name="114"></a><p><tt>114: Found an illegal, unprintable character.</tt></p>
<a name="115"></a><p><tt>115: Found a backslash, \, which the ACIP Tibetan Input Code standard says represents a Sanskrit virama.  In practice, though, this is so often misused (to represent U+0F3D) that {\} always generates this error.  If you want a Sanskrit virama, change the input document to use {\u0F84} instead of {\}.  If you want U+0F3D, use {/NYA/} or {/NYA\u0F3D}.</tt></p>
<a name="116"></a><p><tt>116: Found an illegal character, 'X', with ordinal (in decimal) 88.</tt></p>
<a name="117"></a><p><tt>117: Unexpected end of input; truly unmatched open bracket found.</tt></p>
<a name="118"></a><p><tt>118: Unmatched open bracket found.  A comment does not terminate.</tt></p>
<a name="119"></a><p><tt>119: Unmatched open bracket found.  A correction does not terminate.</tt></p>
<a name="120"></a><p><tt>120: Slashes are supposed to occur in pairs, but the input had an unmatched '/' character.</tt></p>
<a name="121"></a><p><tt>121: Parentheses are supposed to occur in pairs, but the input had an unmatched parenthesis, '('.</tt></p>
<a name="122"></a><p><tt>122: Warning, empty tsheg bar found while converting from ACIP!</tt></p>
<a name="123"></a><p><tt>123: Cannot convert ACIP {X} because it contains a number but also a non-number.</tt></p>
<a name="124"></a><p><tt>124: Cannot convert ACIP {X} because {V}, wa-zur, appears without being subscribed to a consonant.</tt></p>
<a name="125"></a><p><tt>125: Cannot convert ACIP {X} because we would be required to assume that {A} is a consonant, when it is not clear if it is a consonant or a vowel.</tt></p>
<a name="126"></a><p><tt>126: Cannot convert ACIP {X} because it ends with a '+'.</tt></p>
<a name="127"></a><p><tt>127: Cannot convert ACIP {X} because it ends with a '-'.</tt></p>
<a name="128"></a><p><tt>128: Cannot convert ACIP {X} because A: is a "vowel" without an associated consonant.</tt></p>
<a name="129"></a><p><tt>129: Cannot convert ACIP {X} because + is not an ACIP consonant.</tt></p>
<a name="130"></a><p><tt>130: The tsheg bar ("syllable") {X} is essentially nothing.</tt></p>
<a name="131"></a><p><tt>131: The ACIP caret, {^}, must precede a tsheg bar.</tt></p>
<a name="132"></a><p><tt>132: The ACIP {X} must be glued to the end of a tsheg bar, but this one was not.</tt></p>
<a name="133"></a><p><tt>133: Cannot convert the ACIP {X} to Tibetan because it is unclear what the result should be.  The correct output would likely require special mark-up.</tt></p>
<a name="134"></a><p><tt>134: The tsheg bar ("syllable") {X} has no legal parses.</tt></p>
<a name="135"></a><p><tt>135: The Unicode escape 'X' with ordinal (in decimal) 88 is specified by the Extended Wylie Transliteration Scheme (EWTS), but is in the private-use area (PUA) of Unicode and will thus not be written out into the output lest you think other tools will be able to understand this non-standard construction.</tt></p>
<a name="136"></a><p><tt>136: The Unicode escape with ordinal (in decimal) 88 does not match up with any TibetanMachineWeb glyph.</tt></p>
<a name="137"></a><p><tt>137: The ACIP {X} cannot be represented with the TibetanMachine or TibetanMachineWeb fonts because no such glyph exists in these fonts.  The TibetanMachineWeb font has only a limited number of ready-made, precomposed glyphs, and {X} is not one of them.</tt></p>
<a name="138"></a><p><tt>138: The Unicode escape 'X' with ordinal (in decimal) 88 is in the Tibetan range of Unicode (i.e., [U+0F00, U+0FFF]), but is a reserved code in that area.</tt></p>
<a name="139"></a><p><tt>139: Found an illegal open bracket (in context, this is 'X').  There is no matching closing bracket.</tt></p>
<a name="140"></a><p><tt>140: Unmatched closing bracket, 'X', found.  Pairs are expected, as in [#THIS] or [THAT].  Nesting is not allowed.</tt></p>
<a name="141"></a><p><tt>141: While waiting for a closing bracket, an opening bracket, 'X', was found instead.  Nesting of bracketed expressions is not permitted.</tt></p>
<a name="142"></a><p><tt>142: Because you requested conversion to a Unicode text file, there is no way to indicate that the font size is supposed to decrease starting here and continuing until error 143.  That is, this is the beginning of a region in YIG CHUNG.</tt></p>
<a name="143"></a><p><tt>143: Because you requested conversion to a Unicode text file, there is no way to indicate that the font size is supposed to increase (go back to the size it was before the last error 142, that is) starting here.  That is, this is the end of a region in YIG CHUNG.</tt></p>
|
|
|
|
|
|
|
|
<hr />
|
|
|
|
<p>
|
|
Just as with ERRORS, one may choose to have WARNINGS appear in
|
|
either short or long form. The long forms of warnings are as
|
|
follows:
|
|
</p>
|
|
|
|
<a name="501"></a><p><tt>501: Using X, but only because the tool's knowledge of prefix rules (see the documentation) says that XX is not a legal Tibetan tsheg bar ("syllable")</tt></p>
<a name="502"></a><p><tt>502: The last stack does not have a vowel in {X}; this may indicate a typo, because Sanskrit, which this probably is (because it's not legal Tibetan), should have a vowel after each stack.</tt></p>
<a name="503"></a><p><tt>503: Though {X} is unambiguous, it would be more computer-friendly if '+' signs were used to stack things because there are two (or more) ways to interpret this ACIP if you're not careful.</tt></p>
<a name="504"></a><p><tt>504: The ACIP {X} is treated by this converter as U+0F35, but sometimes might represent U+0F14 in practice.  To avoid seeing this warning again, change the input to use {\u0F35} instead of {X}.</tt></p>
<a name="505"></a><p><tt>505: There is a useless disambiguator in {X}.</tt></p>
<a name="506"></a><p><tt>506: There is a stack of three or more consonants in {X} that uses at least one '+' but does not use a '+' between each consonant.</tt></p>
<a name="507"></a><p><tt>507: There is a chance that the ACIP {X} was intended to represent more consonants than we parsed it as representing -- GHNYA, e.g., means GH+NYA, but you can imagine seeing GH+N+YA and typing GHNYA for it too.</tt></p>
<a name="508"></a><p><tt>508: The ACIP {X} has been interpreted as two stacks, not one, but you may wish to confirm that the original text had two stacks as it would be an easy mistake to make to see one stack (because there is such a stack used in Sanskrit transliteration for this particular sequence) and forget to input it with '+' characters.</tt></p>
<a name="509"></a><p><tt>509: The ACIP {X} has an initial sequence that has been interpreted as two stacks, a prefix and a root stack, not one nonnative stack, but you may wish to confirm that the original text had two stacks as it would be an easy mistake to make to see one stack (because there is such a stack used in Sanskrit transliteration for this particular sequence) and forget to input it with '+' characters.</tt></p>
<a name="510"></a><p><tt>510: A non-breaking tsheg, 'X', appeared, but not like "...," or ".," or ".dA" or ".DA".</tt></p>
<a name="511"></a><p><tt>511: The ACIP {X} cannot be represented with the TibetanMachine or TibetanMachineWeb fonts because no such glyph exists in these fonts.  The TibetanMachineWeb font has only a limited number of ready-made, precomposed glyphs, and {X} is not one of them.</tt></p>
<a name="512"></a><p><tt>512: There is a chance that the ACIP {X} was intended to represent more consonants than we parsed it as representing -- GHNYA, e.g., means GH+NYA, but you can imagine seeing GH+N+YA and typing GHNYA for it too.  In fact, there are glyphs in the Tibetan Machine font for N+N+Y, N+G+H, G+N+Y, G+H+N+Y, T+N+Y, T+S+TH, T+S+N, T+S+N+Y, TS+NY, TS+N+Y, H+N+Y, M+N+Y, T+S+M, T+S+M+Y, T+S+Y, T+S+R, T+S+V, N+T+S, T+S, S+H, R+T+S, R+T+S+N, R+T+S+N+Y, and N+Y, indicating the importance of these easily mistyped stacks, so the possibility is very real.</tt></p>
|
|
|
|
<hr />
|
|
|
|
<p>
|
|
The above messages are perhaps not verbose enough to help you figure
|
|
out what the converter thinks is wrong or questionable, so below is
|
|
further explanation of a few error and warning messages:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
Error <a href="#131">131</a> appears when more than one space
separates the caret from the <i>tsheg bar</i>, for example, because only
{^GONG SA} and {^ GONG SA} are supported in this
implementation.
|
|
</li>
|
|
<li>
|
|
Error <a href="#128">128</a> appears for the input {:} because {:}
|
|
cannot appear alone. (Sloppily, this message exposes you to
|
|
the internals of the converter, where {:} is thought of as {A:} in
|
|
some contexts.)
|
|
</li>
|
|
<li>
|
|
Error <a href="#132">132</a> appears because {%}, {o}, and {x} are
|
|
really only to be applied to whole <i>tsheg bar</i>s, and should
|
|
not occur alone.
|
|
</li>
|
|
<li>
|
|
Warnings <a href="#501">501</a>, <a href="#508">508</a>,
and <a href="#509">509</a> each appear in order to highlight the
effect of <a href="#prefix">prefix rules</a>, a subtle point
because the ACIP standard implies these rules but does not
discuss them in depth.
|
|
</li>
|
|
<li>
|
|
Warning <a href="#504">504</a> appears because some ACIP
|
|
transliteration out there does use {%} to mean U+0F14.
|
|
</li>
|
|
</ul>
|
|
|
|
<a name="colors"></a><h2>Coloration</h2>
|
|
|
|
<p>
|
|
For ACIP->TMW conversions (not ACIP->Unicode), color-coding of
|
|
<i>tsheg bar</i>s is an option. The command-line converters
|
|
accept a flag <tt>--colors yes|no</tt>; the conversion GUI in
|
|
Jskad has a checkbox for color-coding.
|
|
</p>
|
|
|
|
<p>
|
|
Warnings and errors appear in <font color="red">red</font>; <i>tsheg
|
|
bar</i>s that would parse differently if other <a
|
|
href="#prefix">prefix rules</a> were used appear in <font
|
|
color="yellow">yellow</font>; non-<a href="#native">native</a>
|
|
<i>tsheg bar</i>s appear in <font color="green">green</font>.
|
|
</p>
|
|
|
|
|
|
<a name="stats"></a><h2><i>Tsheg-bar</i> Statistics</h2>
|
|
|
|
<p>
|
|
The ACIP->Tibetan converters provide a simple-minded accounting
|
|
mechanism with which one can determine which <i>tsheg bar</i>s
|
|
appear in a conversion or how many times each <i>tsheg bar</i>
|
|
appears. This mechanism is for power users only at this point;
|
|
its user interface leaves much to be desired. If you wish to
|
|
produce frequency information, and if you are not familiar with some
|
|
sort of scripting (via Excel macros, Unix shell scripts, etc.), then
|
|
the output produced will likely be useless to you.
|
|
</p>
|
|
|
|
<p>
|
|
To support the calculation of frequency statistics, that is, how
|
|
many times each <i>tsheg bar</i> appears, the converter can output
|
|
all <i>tsheg bar</i>s to the Java error console (i.e.,
|
|
<tt>System.err</tt>). Each will appear on the console as many
|
|
times as it appears in the input. To activate this
|
|
functionality, <a href="#sysprops">set the system property</a>
|
|
<tt>org.thdl.tib.text.ttt.OutputAllTshegBars</tt> to <tt>true</tt>,
|
|
and be prepared for voluminous output. Massaging this output
|
|
into a friendly tabular format is quite possible but not described
|
|
here; contact <a
|
|
href="mailto:thdltools-devel@lists.sourceforge.net">the
|
|
developers</a> for help.
|
|
</p>
|
|
|
|
<p>
|
|
To support the generation of syllabaries, the converter can output
|
|
each <i>tsheg bar</i> encountered to the Java error console (i.e.,
|
|
<tt>System.err</tt>). Each will appear on the console only
|
|
once, no matter how many times it appears in the input. To
|
|
activate this functionality, <a href="#sysprops">set the system
|
|
property</a> <tt>org.thdl.tib.text.ttt.OutputUniqueTshegBars</tt> to
|
|
<tt>true</tt>, and be prepared for voluminous output.
|
|
</p>
|
|
|
|
<p>
|
|
If desired, each <i>tsheg bar</i> output can be prefixed with a
|
|
string of your choice by <a href="#sysprops">setting the system
|
|
property</a> <tt>org.thdl.tib.text.ttt.PrefixForOutputTshegBars</tt>
|
|
to that string. This is useful if the converter is producing
|
|
other output on the console and you want to separate that output
|
|
from the statistics.
|
|
</p>
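<p>
The massaging mentioned above might look like the following Java
sketch, which turns captured console lines into a small frequency
table. It assumes that each <i>tsheg bar</i> arrives on its own
console line and that you have chosen a prefix for it; the prefix
<tt>TSHEG:</tt>, the class name <tt>TshegBarTally</tt>, and the
sample lines are all hypothetical, and the real console format may
differ.
</p>

```java
import java.util.Arrays;

final class TshegBarTally {
    /**
     * Returns one "count TAB tsheg-bar" line per distinct tsheg bar,
     * looking only at console lines that start with the given prefix.
     */
    static String tally(String[] consoleLines, String prefix) {
        // Keep the prefixed lines, strip the prefix, and sort so that
        // equal tsheg bars end up next to each other.
        String[] bars = Arrays.stream(consoleLines)
                .filter(line -> line.startsWith(prefix))
                .map(line -> line.substring(prefix.length()))
                .sorted()
                .toArray(String[]::new);
        StringBuilder table = new StringBuilder();
        String current = null;
        int count = 0;
        for (String bar : bars) {
            if (!bar.equals(current)) {
                if (current != null) {
                    table.append(count).append('\t').append(current).append('\n');
                }
                current = bar;
                count = 0;
            }
            count++;
        }
        if (current != null) {
            table.append(count).append('\t').append(current).append('\n');
        }
        return table.toString();
    }

    public static void main(String[] args) {
        // Hypothetical captured console output.
        String[] lines = {"TSHEG:KA", "some other console output", "TSHEG:GONG", "TSHEG:KA"};
        System.out.print(tally(lines, "TSHEG:"));
    }
}
```

<p>
Sorting first keeps the sketch free of any collection classes; for
very large inputs a map from <i>tsheg bar</i> to count would do the
same job.
</p>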
|
|
|
|
<!-- DLC LINK TO THE EXCEL SPREADSHEET OF STATS -->
|
|
|
|
|
|
|
|
<a name="sub"></a><h2><i>Tsheg-bar</i> Substitution</h2>
|
|
|
|
<!-- NOTE WELL: The text here is largely the same as the text in the
|
|
class comment for org.thdl.tib.text.ttt.MidLexSubstitution. -->
|
|
|
|
<p>
|
|
The ACIP->Tibetan converters provide a mechanism for
|
|
automatically correcting common transliteration typos. For
|
|
example, if your document contains 100 occurrences of {KAsh} that
|
|
all in fact intend {K+sh}, then you can specify just once the rule
|
|
<tt>{KAsh}->{K+sh}</tt>, and all 100 occurrences will be treated
|
|
correctly. This mechanism is not very easy to use, but it is
|
|
completely customizable; you can specify any number of rules.
|
|
You can only perform such substitutions at the <i>tsheg bar</i>
|
|
level, though. This means, for example, that you cannot
|
|
specify the rule <tt>{GONG SA}->{^GONG SA}</tt>; you can only
|
|
specify <tt>{GONG}->{^GONG}</tt>, which would affect {GONG LA}
|
|
just as it would affect {GONG SA}.
|
|
</p>
|
|
|
|
<p>
|
|
To perform substitutions, <a href="#sysprops">set the system
|
|
property</a> <tt>org.thdl.tib.text.ttt.ReplacementMap</tt> to be a
|
|
comma-delimited list of <tt>x=>y</tt> pairs. For example,
|
|
if you think BLKU, which parses as B+L+KU, should parse as B-L+KU,
|
|
and you want KAsh to be parsed as K+sh because the input operators
|
|
mistyped it, then set <tt>org.thdl.tib.text.ttt.ReplacementMap</tt>
|
|
to <tt>BLKU=>B-L+KU,KAsh=>K+sh</tt>. Note that this will
|
|
not cause {B+L+KU} to become {B-L+KU} -- we are doing the
|
|
replacement during lexical analysis of the input file, not during
|
|
parsing. And it will cause {SBLKU} to become {SB-L+KU}, which
|
|
is parsed as {S+B-L+KU}, probably not what you wanted. If you
|
|
fear such things, you can see if they happen by setting the system
|
|
property <tt>org.thdl.tib.text.ttt.VerboseReplacementMap</tt> to
|
|
<tt>true</tt>, which will cause an informational message to be
|
|
printed on the Java console every time a replacement is made.
|
|
</p>
|
|
|
|
<p>
|
|
Furthermore, you can use the regular expression notations <tt>^</tt>
|
|
and <tt>$</tt> to denote the beginning and end of the <i>tsheg
|
|
bar</i>, respectively. For example, <tt>^BLKU$=>B-L+KU</tt>
|
|
is a useful rule. Note that full regular expressions are not
|
|
supported -- the tool just borrows a bit of the notation. The
|
|
rule <tt>^BLKU=>B-L+KU</tt> means that {BLKUM} and {BLKU} will
|
|
both be replaced, but {SBLKU} and {SBLKUM} will not be. The
|
|
caret, <tt>^</tt>, means that we only match if BLKU is at the
|
|
beginning. The dollar sign, <tt>$</tt>, means that we only
|
|
match if the pattern is at the end. The rule
|
|
<tt>BLKU$=>B-L+KU</tt> will cause {SBLKU} to be replaced, but not
|
|
{BLKUM}. Note that performance is far better for
|
|
<tt>^FOO$</tt> than for <tt>^FOO</tt>, <tt>FOO$</tt>, or
|
|
<tt>FOO</tt> alone.
|
|
</p>
|
|
|
|
<p>
|
|
Only one substitution is made per <i>tsheg bar</i>.
|
|
<tt>^FOO$</tt>-style mappings will be tried first, then
|
|
<tt>^FOO</tt>-style, then <tt>FOO$</tt>-style, and finally
|
|
<tt>FOO</tt>-style.
|
|
</p>
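<p>
The matching rules described above can be sketched in Java as
follows. This is an illustration of the documented behavior, not the
converter's actual code: the class name
<tt>ReplacementRuleSketch</tt> is hypothetical, rules of the same
style are tried in the order given, and a plain <tt>FOO</tt> rule is
assumed to replace only the first occurrence, neither of which this
document specifies. The sketch also assumes every rule is well
formed, i.e., contains <tt>=></tt>.
</p>

```java
final class ReplacementRuleSketch {
    /** Splits "x=>y,x2=>y2" into {pattern, replacement} pairs. */
    static String[][] parse(String map) {
        String[] pairs = map.split(",");
        String[][] rules = new String[pairs.length][];
        for (int i = 0; i != pairs.length; i++) {
            rules[i] = pairs[i].split("=>", 2);
        }
        return rules;
    }

    /** Returns 0 for ^FOO$, 1 for ^FOO, 2 for FOO$, and 3 for plain FOO. */
    static int style(String pattern) {
        boolean head = pattern.startsWith("^");
        boolean tail = pattern.endsWith("$");
        if (head) {
            return tail ? 0 : 1;
        }
        return tail ? 2 : 3;
    }

    /** Applies at most one rule to a single tsheg bar. */
    static String apply(String[][] rules, String bar) {
        // Try the four styles in the documented order of priority.
        for (int pass = 0; pass != 4; pass++) {
            for (String[] rule : rules) {
                if (style(rule[0]) != pass) {
                    continue;
                }
                // Strip the anchors to get the literal text to look for.
                String core = rule[0];
                if (pass == 0 || pass == 1) {
                    core = core.substring(1);
                }
                if (pass == 0 || pass == 2) {
                    core = core.substring(0, core.length() - 1);
                }
                int at;
                switch (pass) {
                    case 0: at = bar.equals(core) ? 0 : -1; break;
                    case 1: at = bar.startsWith(core) ? 0 : -1; break;
                    case 2: at = bar.endsWith(core) ? bar.length() - core.length() : -1; break;
                    default: at = bar.indexOf(core); break;
                }
                if (at >= 0) {
                    // Only one substitution is made per tsheg bar.
                    return bar.substring(0, at) + rule[1] + bar.substring(at + core.length());
                }
            }
        }
        return bar;
    }

    public static void main(String[] args) {
        String[][] rules = parse("^BLKU=>B-L+KU,KAsh=>K+sh");
        for (String bar : new String[] {"BLKU", "BLKUM", "SBLKU", "KAsh"}) {
            System.out.println(bar + " becomes " + apply(rules, bar));
        }
    }
}
```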
|
|
|
|
<p>
|
|
An example of a useful substitution is <tt>o$=>\u0F35</tt>.
|
|
This is useful because the converters interpret the ACIP {o} as
|
|
U+0F37 by default, but you might prefer U+0F35 in your output.
|
|
</p>
|
|
|
|
<p>
|
|
Note that you cannot literally replace {FOO} with {BAR} using this
|
|
mechanism -- because {F} is not an ACIP character, the lexical analyzer will not
|
|
get far enough to use this substitution mechanism. This is not
|
|
considered a design flaw -- serious errors require user
|
|
intervention. Sophisticated users can use something akin to
|
|
perl, sed, or awk scripts to preprocess the input.
|
|
</p>
|
|
|
|
<p>
|
|
Note also that you cannot use the rule <tt>ONYA=>O&amp;</tt>,
although it would be nice if you could. Technically, {&amp;}
|
|
is considered to be punctuation (i.e., that which divides <i>tsheg
|
|
bar</i>s) and is not understood inside a <i>tsheg bar</i>.
|
|
</p>
|
|
|
|
<p>
|
|
Note that this mechanism is also useful for fixing problems in the
|
|
converter itself rather than in the input.
|
|
</p>
|
|
|
|
<a name="escapes"></a><h2>Unicode Character Escapes</h2>
|
|
|
|
<p>
|
|
The ACIP->Tibetan converters support some non-standard extensions
|
|
to the <a
|
|
href="http://asianclassics.org/download/tibetancode/ticode.pdf">ACIP
|
|
Tibetan Input Code Standard</a>. One of those is Unicode
|
|
character escape sequences. This extension makes it possible
|
|
to represent characters that the <a
|
|
href="http://asianclassics.org/download/tibetancode/ticode.pdf">ACIP
|
|
standard</a> does not address, and to represent one character,
|
|
U+0F84, that ACIP does address with the transliteration {\} but that
|
|
is misused in practice so often to refer to U+0F3C that the
|
|
ACIP->Tibetan converters always produce an error upon seeing {\}.
|
|
</p>
|
|
|
|
<p>
|
|
Outside of comments and the like, {\uKLMN} is interpreted as
|
|
referring to the Unicode character with ordinal <i>KLMN</i>, where
|
|
each of K, L, M, and N is a case-insensitive hexadecimal
digit. For example, the ACIP {KA KHA GA NGA } is exactly
|
|
equivalent to
|
|
{\u0F40\u0f0B\u0F41\u0F0B\u0F42\u0F0B\u0F44\u0f0b}. Unicode
|
|
escapes produce the obvious Unicode in an ACIP->Unicode
|
|
conversion, and they produce the correct TMW glyph in an
|
|
ACIP->TMW conversion. There are limits, though, when
|
|
converting to TMW; multiple escapes in sequence are not handled
|
|
correctly. It would take a Unicode to TMW converter to produce
|
|
the correct glyphs for {\u0F42\u0F92\u0FB7\u0F7C}. The escapes
|
|
for vowels and other characters that are mapped to multiple TMW
|
|
glyphs are also not handled perfectly. Best practice is to use
|
|
escapes only when necessary in an ACIP->TMW conversion.
|
|
</p>
|
|
|
|
<p>
|
|
The Unicode character represented need not be a Tibetan one; for
|
|
example, {\u0040} produces the at sign, <tt>@</tt>.
|
|
</p>
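<p>
As a rough illustration of what an escape denotes, the following Python
sketch replaces each {\uKLMN} with the character it names. The name
<tt>decode_escapes</tt> is hypothetical, and the sketch ignores the
"outside of comments" restriction and the TMW limitations described
above.
</p>

```python
import re

# A backslash, the letter u, then exactly four hexadecimal digits (any case).
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_escapes(acip):
    """Replace every \\uKLMN escape with the Unicode character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), acip)
```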
|
|
|
|
<p>
|
|
The latest <a
|
|
href="http://orion.lib.virginia.edu/thdl/collections/langling/ewts/">Extended
|
|
Wylie Transliteration Scheme</a> standard has assigned private-use
|
|
area (PUA) Unicode codepoints to some TMW glyphs. ACIP
|
|
documents that have a <a href="#escapes">Unicode escape</a> in the
|
|
range U+F021 to U+F0FF, inclusive, are interpreted as intending
|
|
these TMW glyphs. ACIP->Unicode produces an error for such
|
|
an escape because it is font-dependent and not standard. Other
|
|
tools will likely not understand such Unicode, so the converter will
|
|
not produce it. If you want it in the output, it is there in
|
|
the error message.
|
|
</p>
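<p>
The range check described above might look like the following Python
sketch. The function name, the target names, and the returned labels are
invented for illustration; only the range U+F021 to U+F0FF comes from
the text.
</p>

```python
# U+F021 through U+F0FF, inclusive (range() excludes its upper bound).
TMW_PUA = range(0xF021, 0xF100)


def classify_escape(codepoint, target):
    """Say how an escape is handled for target 'TMW' or 'Unicode'."""
    if codepoint in TMW_PUA:
        # Font-dependent codepoint: fine for TMW, an error for Unicode output.
        return "tmw-glyph" if target == "TMW" else "error"
    return "ordinary"
```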
|
|
|
|
|
|
<p>
|
|
Note well the <a href="#bugs">known bug</a> with regard to
|
|
whitespace in transliteration that follows a Unicode escape.
|
|
In large part, this bug affects characters that can be
|
|
transliterated by other, simpler, standard means.
|
|
</p>
|
|
|
|
<p>
|
|
If you do want to disable the use of Unicode escapes, <a
|
|
href="#sysprops">set the system property</a>
|
|
<tt>thdl.tib.text.disallow.unicode.character.escapes.in.acip</tt> to
|
|
<tt>true</tt>.
|
|
</p>
|
|
|
|
|
|
<a name="lex"></a><h2>Breaking a Text Up Into <i>tsheg bar</i>s</h2>
|
|
|
|
<p>
|
|
The ACIP->Tibetan converters all take ACIP transliteration as
|
|
input. The first step in conversion is to break up the input
|
|
into manageable pieces. (This is known as <i>lexical
|
|
analysis</i> in the context of programming languages, and you may
|
|
see the term in diagnostic messages, though a linguist who studies
|
|
human language like Tibetan might balk at the term.) The
|
|
correct pieces in this case are <i>tsheg bar</i>s (in ACIP, {TSEG
|
|
BAR}), punctuation, comments, whitespace, folio markers, formatting
|
|
codes, etc. In this section, the intricacies of how the
|
|
converter does that will be laid bare. With luck, this will
|
|
help you understand why the converter treated one space character
|
|
(i.e., ' ', U+0020) as a <i>tsheg</i> and another as Tibetan
|
|
whitespace.
|
|
</p>
|
|
|
|
<p>
|
|
The Tibetan term <i>tsheg bar</i> refers to "the stuff between
|
|
the dots". In the ACIP {BKRA SHIS [# Notice that
|
|
this comment is embedded in the Tibetan greeting pronounced 'tashi
|
|
delay']BDE LEGS,}, there are four <i>tsheg bar</i>s, 'BKRA',
|
|
'SHIS', 'BDE', and 'LEGS'. In this case 'BDE' is literally
|
|
"between the dots"; i.e., it is sandwiched by two U+0F0B
|
|
characters (because comments are in a sense invisible). One of
|
|
the "dots" that touches 'LEGS' does not look like a dot --
|
|
it is a <i>shad</i>, U+0F0D. The lexical analyzer also finds
|
|
one comment, which will appear in a Latin typeface in the output,
|
|
and it finds four pieces of punctuation -- three <i>tsheg</i>s and a
|
|
<i>shad</i>.
|
|
</p>
|
|
|
|
<p>
|
|
The converter will not allow an illegal character into a <i>tsheg
|
|
bar</i>. For example, {jA} is an error and causes an error
|
|
message to appear in the output.
|
|
</p>
|
|
|
|
<p>
|
|
Now that the basic operation is clear from the above example, let's
|
|
cover the fine points of how standard ACIP is handled. We'll
|
|
also cover some non-standard constructs that appear commonly in
|
|
actual ACIP Release V texts.
|
|
</p>
|
|
|
|
<p>
|
|
The first construct that deserves explanation is the line
|
|
break. By the ACIP standard, line breaks in the input do not
|
|
become line breaks in the output unless there are two line breaks in
|
|
the input. For example, the ACIP snippet below has only one
|
|
line break in the output although three line breaks appear in the
|
|
input:
|
|
</p>
|
|
|
|
<pre>
|
|
BKRA SHIS
|
|
BDE LEGS,
|
|
|
|
THUGS RJE CHE ... and so on ...
|
|
</pre>
|
|
|
|
<p>
|
|
One fine point is that the converter does not require a space before
|
|
a line break. If {SHIS} appears before a line break, the converter
|
|
inserts a space so that it's treated just like {SHIS } is
|
|
treated. This oddity is needed to convert real ACIP documents.
|
|
</p>
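<p>
The line-break handling just described can be approximated as follows.
This is a sketch under assumptions, not the converter's code: the name
<tt>fold_line_breaks</tt> is made up, a run of two or more line breaks
is assumed to yield exactly one line break, and only Unix-style line
endings are considered.
</p>

```python
import re


def fold_line_breaks(acip):
    """Join single line breaks; keep one break per blank-line gap."""
    folded = []
    for paragraph in re.split(r"\n{2,}", acip):
        pieces = []
        for line in paragraph.split("\n"):
            # Add the space that the text may have omitted before the break.
            if line and not line.endswith(" "):
                line += " "
            pieces.append(line)
        folded.append("".join(pieces))
    return "\n".join(folded)
```

<p>
Applied to the snippet above, this yields the single line break the
text promises, with {SHIS} followed by a space just as if {SHIS } had
been typed.
</p>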
|
|
|
|
<p>
|
|
Another fine point is that ACIP's {^} character "eats" a
|
|
following space or a newline. This is so that
|
|
{^ GONG SA } is treated identically to
|
|
{^GONG SA }.
|
|
</p>
|
|
|
|
<p>
|
|
Text inside a matching pair of square brackets (e.g., <tt>[# A
|
|
COMMENT]</tt> or <tt>[BP]</tt>) is passed through untouched into the
|
|
output; the brackets <em>remain</em>. Nesting is not
|
|
allowed. Text inside a matching pair of curly brackets (e.g.,
|
|
<tt>{# A COMMENT}</tt> or <tt>{BP}</tt>) is passed through untouched
|
|
into the output; the brackets <em>disappear</em>. Nesting is
|
|
not allowed. (Note that the source code implements two
|
|
algorithms for handling square and curly brackets; the one described
|
|
here is presently in use. But if you desire different
|
|
handling, please e-mail the <a
|
|
href="mailto:thdltools-devel@lists.sourceforge.net">developers</a>
|
|
to ask if it isn't a five-minute job to make that happen.)
|
|
</p>
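<p>
The bracket handling currently in use can be pictured with this Python
sketch. The function name and the piece labels are hypothetical, and
the sketch does not report unmatched or nested brackets the way a real
lexical analyzer would have to.
</p>

```python
import re

# One square-bracketed or curly-bracketed section with no nesting.
BRACKETED = re.compile(r"\[[^\[\]]*\]|\{[^{}]*\}")


def split_bracketed(acip):
    """Split text into ('acip', ...) and ('passthrough', ...) pieces."""
    pieces, pos = [], 0
    for m in BRACKETED.finditer(acip):
        if m.start() > pos:
            pieces.append(("acip", acip[pos:m.start()]))
        token = m.group(0)
        # Square brackets remain in the output; curly brackets disappear.
        text = token if token.startswith("[") else token[1:-1]
        pieces.append(("passthrough", text))
        pos = m.end()
    if len(acip) > pos:
        pieces.append(("acip", acip[pos:]))
    return pieces
```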
|
|
<!-- The old method, ACIPTshegBarScanner.BRACKETED_SECTIONS_PASS_THROUGH_UNMODIFIED==false:
|
|
<p>
|
|
Comments appear in a Latin typeface always. Comments are not
|
|
allowed just anywhere - - a comment cannot occur within a single
|
|
<i>tsheg bar</i>, for example, and it cannot appear between a
|
|
<i>tsheg bar</i> and the <i>tsheg</i> that follows it. That
|
|
is, {BD[#COMMENT]E} is not like {BDE}, and {BDE[#COMMENT] LEGS}
|
|
is not like {BDE LEGS} (though {BDE [#COMMENT]LEGS} is).
|
|
</p>
|
|
|
|
<p>
|
|
Corrections are interpreted as Tibetan, not English, by default, but
|
|
there is a built-in list of corrections that should appear in the
|
|
output in a Latin typeface. (Actually, any correction that
|
|
starts with a certain string will appear in a Latin typeface.)
|
|
The full list is the following:
|
|
</p>
|
|
|
|
<pre>
|
|
"LINE" // from KD0001I1.ACT
|
|
"DATA" // from KL0009I2.INC
|
|
"BLANK" // from KL0009I2.INC
|
|
"NOTE" // from R0001F.ACM
|
|
"alternate" // from R0018F.ACE
|
|
"02101-02150 missing" // from R1003A3.INC
|
|
"51501-51550 missing" // from R1003A52.ACT
|
|
"BRTAGS ETC" // from S0002N.ACT
|
|
"TSAN, ETC" // from S0015N.ACT
|
|
"SNYOMS, THROUGHOUT" // from S0016N.ACT
|
|
"KYIS ETC" // from S0019N.ACT
|
|
"MISSING" // from S0455M.ACT
|
|
"this" // from S6850I1B.ALT
|
|
"THIS" // from S0057M.ACT
|
|
</pre>
|
|
|
|
<p>
|
|
Somewhat related is the converter's treatment of a few oddball
|
|
comments. The oddity is that these comments use the syntax
|
|
{[COMMENT]} rather than the standard syntax {[#COMMENT]}. The
|
|
converter will treat the following as comments:
|
|
</p>
|
|
|
|
<pre>
|
|
From S5274I.ACT:
|
|
"[FIRST]"
|
|
From S5274I.ACT:
|
|
"[SECOND]"
|
|
From S0216M.ACT:
|
|
"[Additional verses added by Khen Rinpoche here are]"
|
|
From S0216M.ACT:
|
|
"[ADDENDUM: The text of]"
|
|
From S0216M.ACT:
|
|
"[END OF ADDENDUM]"
|
|
From S0216M.ACT:
|
|
"[Some of the verses added here by Khen Rinpoche include:]"
|
|
From S0216M.ACT (note the typo):
|
|
"[Note that, in the second verse, the {YUL LJONG} was orignally {GANG LJONG},
|
|
and is now recited this way since the ceremony is not only taking place in Tibet.]"
|
|
From S6954E1.ACT:
|
|
"[text missing]"
|
|
From TD3817I.INC:
|
|
"[INCOMPLETE]"
|
|
From S0935m.act:
|
|
"[MISSING PAGE]"
|
|
From S0975I.INC:
|
|
"[MISSING FOLIO]"
|
|
From S0839D1I.INC:
|
|
"[UNCLEAR LINE]"
|
|
From SE6260A.INC:
|
|
"[THE FOLLOWING TEXT HAS INCOMPLETE SECTIONS, WHICH ARE ON ORDER]"
|
|
From SE6260A.INC:
|
|
"[@DATA INCOMPLETE HERE]"
|
|
From SE6260A.INC:
|
|
"[@DATA MISSING HERE]"
|
|
From TD4035I.INC:
|
|
"[LINE APPARENTLY MISSING THIS PAGE]"
|
|
From TD4226I2.INC:
|
|
"[DATA INCOMPLETE HERE]"
|
|
To be consistent with the above:
|
|
"[DATA MISSING HERE]"
|
|
From S0018N.ACT:
|
|
"[FOLLOWING SECTION WAS NOT AVAILABLE WHEN THIS EDITION WAS
|
|
PRINTED, AND IS SUPPLIED FROM ANOTHER, PROBABLY THE ORIGINAL:]"
|
|
From S0018N.ACT:
|
|
"[THESE PAGE NUMBERS RESERVED IN THIS EDITION FOR PAGES
|
|
MISSING FROM ORIGINAL ON WHICH IT WAS BASED]"
|
|
From S0018N.ACT:
|
|
"[PAGE NUMBERS RESERVED FROM THIS EDITION FOR MISSING
|
|
SECTION SUPPLIED BY PRECEDING]"
|
|
From S0057M.ACT:
|
|
"[SW: OK]"
|
|
From S0057M.ACT:
|
|
"[m:ok]"
|
|
From S0057M.ACT:
|
|
"[A FIRST ONE
|
|
MISSING HERE?]"
|
|
From S0195A1.INC:
|
|
"[THE INITIAL PART OF THIS TEXT WAS INPUT BY THE SERA MEY LIBRARY IN
|
|
TIBETAN FONT AND NEEDS TO BE REDONE BY DOUBLE INPUT]"
|
|
</pre>
|
|
-->
|
|
|
|
<p>
|
|
The converter also supports several non-standard folio
|
|
markers. A review of ACIP Release V texts determined that the
|
|
following types of folio markers can appear:
|
|
</p>
|
|
|
|
<pre>
|
|
@001
|
|
@001A
|
|
@001B
|
|
@01A.3
|
|
@012A.3
|
|
@[07B]
|
|
@00007B
|
|
@00007
|
|
@B00007
|
|
@[00007A]
|
|
</pre>
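<p>
One way to recognize exactly the shapes listed above is a regular
expression such as the one in this Python sketch. The pattern is my own
generalization from the ten examples, not the converter's actual rule,
and <tt>is_folio_marker</tt> is a hypothetical name.
</p>

```python
import re

# Either @[digits plus optional A/B], or @ plus optional B, digits,
# optional A/B, and an optional ".line" part.
FOLIO = re.compile(r"@(?:\[\d+[AB]?\]|B?\d+[AB]?(?:\.\d+)?)")


def is_folio_marker(token):
    """True if the whole token has one of the listed folio-marker shapes."""
    return FOLIO.fullmatch(token) is not None
```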
|
|
|
|
<!-- If ACIPTshegBarScanner.BRACKETED_SECTIONS_PASS_THROUGH_UNMODIFIED==false:
|
|
<p>
|
|
Similarly, to support real ACIP Release V texts, the converter
|
|
treats {[DD1]}, {[DD2]}, {[ DD ]}, and {[DDD]} just like {[DD]}
|
|
(which is specified in the ACIP standard). It treats {[ BP ]}
|
|
and {[BLANK PAGE]} just like {[BP]}, also.
|
|
</p> -->
|
|
|
|
<p>
|
|
The <!-- The old method,
|
|
ACIPTshegBarScanner.BRACKETED_SECTIONS_PASS_THROUGH_UNMODIFIED==false:
|
|
lists above were --> list above was created by a most fallible
|
|
process of reviewing a large number of ACIP Release V texts.
|
|
Your suggestions for additions to <!-- The old method,
|
|
ACIPTshegBarScanner.BRACKETED_SECTIONS_PASS_THROUGH_UNMODIFIED==false:
|
|
these lists --> this list are highly valued; please contact <a
|
|
href="mailto:thdltools-devel@lists.sourceforge.net">the
|
|
developers</a>.
|
|
</p>
|
|
|
|
<p>
|
|
The converters will insert a <i>tsheg</i> in some places where no ACIP
|
|
{ } appears; this happens after {PA} and {DANG,} below:
|
|
</p>
|
|
<pre>
|
|
GA PA
|
|
|
|
GA PHA
|
|
|
|
DAM,
|
|
LHAG
|
|
|
|
GA CA,
|
|
|
|
GA
|
|
</pre>
|
|
|
|
<p>
|
|
Note that a space appears after {PHA}, and a comma appears after
|
|
{CA}, but {PA} has nothing between it and a line break. The
|
|
converters are smart enough to insert a <i>tsheg</i> regardless.
|
|
</p>
|
|
|
|
<p>
|
|
Also missing from the above ACIP, but inserted automatically by the
|
|
converters, is Tibetan whitespace; the converter sees
|
|
{DAM, LHAG} instead of {DAM,LHAG} above.
|
|
</p>
|
|
|
|
<p>
|
|
If such automatic corrections are not desired, try using a Unicode
|
|
<a href="#escapes">escape</a> before the line break instead of {PA}
|
|
or {,}.
|
|
</p>
|
|
|
|
<p>
|
|
The converters also treat {NGA,} as a typo for {NGA ,}
|
|
(actually, {NGA\u0F0C,} since one wouldn't want a line break to
|
|
occur after the <i>tsheg</i> and cause a <i>shad</i> to begin a
|
|
line; see the section on formatting Tibetan texts in the <i>Tibetan!
|
|
5.1</i> documentation) because Tibetan typesetting requires that NGA
|
|
not appear directly before a <i>shad</i>. (Perhaps {NGA,}
|
|
would look too much like {KA}.)
|
|
</p>
|
|
|
|
<p>
|
|
The converters embody the rule that a <i>shad</i> does not appear
|
|
after GA or KA unless a <i>shabs kyu</i> vowel is on the GA or
|
|
KA. For example, the space in {MA ,HA} is a <i>tsheg</i>,
|
|
and the space in {KU ,HA} is a <i>tsheg</i>, but the space in
|
|
{GA ,HA} is Tibetan whitespace.
|
|
</p>
|
|
|
|
<p>
|
|
If you find that the converters put a <i>tsheg</i> where it does not
|
|
belong, miss a <i>tsheg</i>, or put whitespace where it does not belong,
|
|
please contact <a
|
|
href="mailto:thdltools-devel@lists.sourceforge.net">the
|
|
developers</a>.
|
|
</p>
|
|
|
|
<p>
|
|
Though the ACIP standard does not mention it, it appears that some
|
|
ACIP Release V texts use a period (i.e., {.}) to indicate a
|
|
non-breaking tsheg (i.e., U+0F0C). Search for {NGO.,},
|
|
{....,DAM}, etc. Unless {,}, {.}, or a letter (i.e., 'a'
|
|
through 'z' or 'A' through 'Z') follows the {.}, it is only
|
|
grudgingly interpreted as a non-breaking tsheg -- a warning is
|
|
generated, too.<!-- FIXME: Is this right? Allow for
|
|
treating {.} as an outright error. DLC FIXME -->
|
|
</p>
|
|
|
|
<p>
|
|
Note that the treatment of the very last line in an input text is
|
|
suspect.<!-- DLC FIXME -->
|
|
</p>
|
|
|
|
|
|
<!-- <h1>DLC</h1>
|
|
|
|
<pre>
|
|
DLC warn on BUR'ANG because BUR'ANG and BUR-'ANG both appear once in ACIP files. Tell RC. DLC: subst it!
|
|
|
|
lex: Spaces
|
|
DLC configurable tibetan spaces bhutan vs. tibet
|
|
{NGA,} -> {NGA ,}
|
|
|
|
DLC hyphens and ' end lines but shouldn't, eh?
|
|
|
|
DLC ACIP {.} is just an error! Have the error message mention {\u0F0C} for those desiring a non-breaking tsheg.
|
|
|
|
DLC whitespace - - newlines [final newline DLC?], spaces
|
|
|
|
</pre> -->
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<a name="parse"></a><h2>Parsing <i>tsheg bar</i>s: Greedy Stacking and
|
|
Nativeness</h2>
|
|
|
|
<p>
|
|
This section is a technical reference sufficiently detailed so that
|
|
you can fully understand the inner workings of the converter as it
|
|
decides which Unicode or TMW to use for a given <i>tsheg
|
|
bar</i>. The problem of <a href="#lex">breaking up a text into
|
|
<i>tsheg bar</i>s</a> is a separate issue; this section describes
|
|
what happens to a <i>tsheg bar</i> after it's been chipped away from
|
|
the text.
|
|
</p>
|
|
|
|
<a name="native"></a>
|
|
<p>
|
|
The ACIP->Tibetan converters have a notion of
|
|
<i>nativeness</i>. Each <i>tsheg bar</i> is either native
|
|
Tibetan or non-native. For example, in Buddhist texts written
|
|
in Tibetan, Sanskrit mantras often appear in Tibetan
|
|
characters. This "Tibetanized Sanskrit" is
|
|
non-native. The <i>tsheg bar</i>s that make up this mantra
|
|
(and here, take "tsheg bar" somewhat literally to mean the
|
|
characters delimited by punctuation and whitespace) are some native
|
|
and some non-native in the converter's eyes. For example, the
|
|
<i>tsheg bar</i> {MA } appears in some mantras, and is thus in
|
|
fact non-native. The converter, however, treats {MA } as
|
|
native in all contexts. Thus, "native" is a
|
|
technical term with a slightly different meaning than usual.
|
|
</p>
|
|
|
|
<p>
|
|
The idea of nativeness is important because it affects how the
|
|
converter treats a <i>tsheg bar</i>. In ACIP transliteration,
|
|
the rule is that consonants stack up until punctuation, whitespace,
|
|
or a vowel appears. For example, {RDZYA} is equivalent to
|
|
{R+DZ+YA}. ({DZA} always means the letter {DZA} itself, never
|
|
{D+ZA}.) But this greedy stacking does not apply to {SOGS},
|
|
which is equivalent to {SOG-S}, not {SOG+S}. Why not?
|
|
Because {SOGS} is a native <i>tsheg bar</i> where GA is the suffix
|
|
and SA is the postsuffix. Similarly, {GNAD} is {G-NAD}, not
|
|
{G+NAD}. Why? Because GA is a prefix in this native
|
|
Tibetan <i>tsheg bar</i>.
|
|
</p>
|
|
|
|
<p>
|
|
In this section, we will illustrate the inner workings of this
|
|
aspect of the converter. You will be able to determine which
|
|
snippets of transliteration the converter considers to be native
|
|
<i>tsheg bar</i>s, where greedy stacking does not apply except for
|
|
the root stack, and which snippets are non-native, and thus wholly
|
|
subject to greedy stacking.
|
|
</p>
|
|
|
|
<h3>Anatomy of a Native <i>tsheg bar</i></h3>
|
|
|
|
<p>
|
|
First, the <a href="#lex">lexical analyzer</a> ensures that only the
|
|
Tibetan and Sanskrit consonants, the vowels {A}, {I}, {U}, {E}, {O},
|
|
{OO}, {EE}, {i}, {'A}, {'I}, {'U}, {'E}, {'O}, {'OO}, {'EE}, and
|
|
{'i}, and the adornments {m} and {:} are allowed in a <i>tsheg
|
|
bar</i>.
|
|
</p>
|
|
|
|
<p>
|
|
As far as the converter is concerned, a native <i>tsheg bar</i>
|
|
consists of an optional prefix, a native root stack, an optional
|
|
suffix, an optional postsuffix (also known as a secondary suffix)
|
|
that may only be present if a suffix is present, and zero or more
|
|
<i>appendages</i> (my term, created because I don't know what a
|
|
grammarian calls such a thing). An appendage is one of the
|
|
following stack sequences:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>{'E}</li>
|
|
<li>{'I}</li>
|
|
<li>{'O}</li>
|
|
<li>{'U}</li>
|
|
<li>{'US}</li>
|
|
<li>{'UR}</li>
|
|
<li>{'UM}</li>
|
|
<li>{'ONG}</li>
|
|
<li>{'ONGS}</li>
|
|
<li>{'OS}</li>
|
|
<li>{'IS}</li>
|
|
<li>{'UNG}</li>
|
|
<li>{'ANG}</li>
|
|
<li>{'AM}</li>
|
|
</ul>
|
|
|
|
<p>
|
|
A <i>tsheg bar</i> is non-native if it has a non-native root stack
|
|
or if it contains the {:} character. Any vowel is allowed on a
|
|
native root stack, even {'EEm}, {i}, or the like.
|
|
</p>
|
|
<p>
|
|
The rule about native root stacks is important, for example, in
|
|
determining that {KTYAMS} is {K+T+YAM+SA} instead of {K+T+YAMASA}
|
|
(because K+T+YA is not a native stack). Another example is
|
|
{GNVA}, which is treated like {G+N+VA}, not {G-N+VA}, even though
|
|
{GNA} is treated like {G-NA} because NA can take a GA prefix.
|
|
The complete list of native stacks is the following:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>KA</li>
|
|
<li>KHA</li>
|
|
<li>GA</li>
|
|
<li>NGA</li>
|
|
<li>CA</li>
|
|
<li>CHA</li>
|
|
<li>JA</li>
|
|
<li>NYA</li>
|
|
<li>TA</li>
|
|
<li>THA</li>
|
|
<li>DA</li>
|
|
<li>NA</li>
|
|
<li>PA</li>
|
|
<li>PHA</li>
|
|
<li>BA</li>
|
|
<li>MA</li>
|
|
<li>TZA</li>
|
|
<li>TSA</li>
|
|
<li>DZA</li>
|
|
<li>WA</li>
|
|
<li>ZHA</li>
|
|
<li>ZA</li>
|
|
<li>'A</li>
|
|
<li>YA</li>
|
|
<li>RA</li>
|
|
<li>LA</li>
|
|
<li>SHA</li>
|
|
<li>SA</li>
|
|
<li>HA</li>
|
|
<li>AA</li>
|
|
<li>R+KA (RKA)</li>
|
|
<li>R+GA (RGA)</li>
|
|
<li>R+NGA (RNGA)</li>
|
|
<li>R+JA (RJA)</li>
|
|
<li>R+NYA (RNYA)</li>
|
|
<li>R+TA (RTA)</li>
|
|
<li>R+DA (RDA)</li>
|
|
<li>R+NA (RNA)</li>
|
|
<li>R+BA (RBA)</li>
|
|
<li>R+MA (RMA)</li>
|
|
<li>R+TZA (RTZA)</li>
|
|
<li>R+DZA (RDZA)</li>
|
|
<li>L+KA (LKA)</li>
|
|
<li>L+GA (LGA)</li>
|
|
<li>L+NGA (LNGA)</li>
|
|
<li>L+CA (LCA)</li>
|
|
<li>L+JA (LJA)</li>
|
|
<li>L+TA (LTA)</li>
|
|
<li>L+DA (LDA)</li>
|
|
<li>L+PA (LPA)</li>
|
|
<li>L+BA (LBA)</li>
|
|
<li>L+HA (LHA)</li>
|
|
<li>S+KA (SKA)</li>
|
|
<li>S+GA (SGA)</li>
|
|
<li>S+NGA (SNGA)</li>
|
|
<li>S+NYA (SNYA)</li>
|
|
<li>S+TA (STA)</li>
|
|
<li>S+DA (SDA)</li>
|
|
<li>S+NA (SNA)</li>
|
|
<li>S+PA (SPA)</li>
|
|
<li>S+BA (SBA)</li>
|
|
<li>S+MA (SMA)</li>
|
|
<li>S+TZA (STZA)</li>
|
|
<li>K+VA (KVA)</li>
|
|
<li>KH+VA (KHVA)</li>
|
|
<li>G+VA (GVA)</li>
|
|
<li>C+VA (CVA)</li>
|
|
<li>NY+VA (NYVA)</li>
|
|
<li>T+VA (TVA)</li>
|
|
<li>D+VA (DVA)</li>
|
|
<li>TZ+VA (TZVA)</li>
|
|
<li>TS+VA (TSVA)</li>
|
|
<li>ZH+VA (ZHVA)</li>
|
|
<li>Z+VA (ZVA)</li>
|
|
<li>R+VA (RVA)</li>
|
|
<li>SH+VA (SHVA)</li>
|
|
<li>S+VA (SVA)</li>
|
|
<li>H+VA (HVA)</li>
|
|
<li>K+YA (KYA)</li>
|
|
<li>KH+YA (KHYA)</li>
|
|
<li>G+YA (GYA)</li>
|
|
<li>P+YA (PYA)</li>
|
|
<li>PH+YA (PHYA)</li>
|
|
<li>B+YA (BYA)</li>
|
|
<li>M+YA (MYA)</li>
|
|
<li>K+RA (KRA)</li>
|
|
<li>KH+RA (KHRA)</li>
|
|
<li>G+RA (GRA)</li>
|
|
<li>T+RA (TRA)</li>
|
|
<li>TH+RA (THRA)</li>
|
|
<li>D+RA (DRA)</li>
|
|
<li>P+RA (PRA)</li>
|
|
<li>PH+RA (PHRA)</li>
|
|
<li>B+RA (BRA)</li>
|
|
<li>M+RA (MRA)</li>
|
|
<li>SH+RA (SHRA)</li>
|
|
<li>S+RA (SRA)</li>
|
|
<li>H+RA (HRA)</li>
|
|
<li>K+LA (KLA)</li>
|
|
<li>G+LA (GLA)</li>
|
|
<li>B+LA (BLA)</li>
|
|
<li>Z+LA (ZLA)</li>
|
|
<li>R+LA (RLA)</li>
|
|
<li>S+LA (SLA)</li>
|
|
<li>R+K+YA (RKYA)</li>
|
|
<li>R+G+YA (RGYA)</li>
|
|
<li>R+M+YA (RMYA)</li>
|
|
<li>R+G+VA (RGVA)</li>
|
|
<li>R+TZ+VA (RTZVA)</li>
|
|
<li>S+K+YA (SKYA)</li>
|
|
<li>S+G+YA (SGYA)</li>
|
|
<li>S+P+YA (SPYA)</li>
|
|
<li>S+B+YA (SBYA)</li>
|
|
<li>S+M+YA (SMYA)</li>
|
|
<li>S+K+RA (SKRA)</li>
|
|
<li>S+G+RA (SGRA)</li>
|
|
<li>S+N+RA (SNRA)</li>
|
|
<li>S+P+RA (SPRA)</li>
|
|
<li>S+B+RA (SBRA)</li>
|
|
<li>S+M+RA (SMRA)</li>
|
|
<li>G+R+VA (GRVA)</li>
|
|
<li>D+R+VA (DRVA)</li>
|
|
<li>PH+Y+VA (PHYVA)</li>
|
|
</ul>
|
|
|
|
<p>
|
|
(Some would argue that LVA is notably absent. It is seen in
|
|
ACIP Buddhist texts in {AELVA}, {LVAm}, {LVU}, {LVUN}, {LVAR},
|
|
{LVE}, {LVANG}, and {LVA}. Greedy stacking affects none of
|
|
these <i>tsheg bar</i>s' parsing, however.)
|
|
</p>
|
|
|
|
<a name="prefix"></a>
|
|
<p>
|
|
Not all characters can be prefixes and the like. Only the five
|
|
prefixes (GA, DA, BA, MA, 'A), ten suffixes (GA, NGA, DA, NA, BA,
|
|
MA, 'A, RA, LA, SA), and two postsuffixes (DA, SA) that every Tibetan
|
|
student knows are allowed, and they cannot appear with vowels.
|
|
(In {LE'U}, {'} is not a suffix -- it is part of an
|
|
appendage.) In fact, certain prefixes may only appear with
|
|
certain root stacks. The reason that these prefix rules matter
|
|
is that they govern how <i>tsheg bar</i>s are parsed. For
|
|
example, {GNA} is parsed like {G-NA}, because NA takes a GA
|
|
prefix. But {GPA} is parsed like {G+PA}, because PA does not
|
|
take a GA prefix.
|
|
</p>
|
|
|
|
<p>
|
|
Prefix rules are a topic of some controversy; different grammars
|
|
give different lists of prefix rules. For a converter, it is
|
|
important that the converter's knowledge of prefix rules matches the
|
|
knowledge of the person who typed in the ACIP transliteration, not
|
|
that the converter agrees with a grammarian. For example, if
|
|
the input technician thought that PA could take a GA prefix, then
|
|
the converter will produce {G+PA} when {G-PA} was intended.
|
|
For this reason, the converter can produce a warning every time a
|
|
prefix rule prohibits the treatment of one of the five prefixes as
|
|
a prefix. For example, {GPA} produces this warning.
|
|
However, {GNA} produces no warning, because the converter assumes
|
|
that it is unlikely that an input technician would enter {GNA} upon
|
|
seeing {G+NA}. Part of the reason for this assumption is that
|
|
the <i>Asian Classics Input Project Entry Operator Transcription
|
|
Chart</i> as of Spring, 1993, explicitly enumerates the following
|
|
cases for special treatment by input operators:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>{BDA'} vs. {B+DA}</li>
|
|
<li>{DBANG} vs. {D+BA}</li>
|
|
<li>{DGA'} vs. {D+GA}</li>
|
|
<li>{DGRA} vs. {D+GRA}</li>
|
|
<li>{DGYES} vs. {D+GYA}</li>
|
|
<li>{DMAR} vs. {D+MA}</li>
|
|
<li>{GDA'} vs. {G+DA}</li>
|
|
<li>{GNAD} vs. {G+NA}</li>
|
|
<li>{MNA'} vs. {M+NA}</li>
|
|
</ul>
|
|
|
|
<p>
|
|
Regardless, for best results, you should ensure that the input
|
|
technician's knowledge of prefix rules matches the converter's
|
|
knowledge. The following are the legal combinations of prefix
|
|
and root stack in the converter's eyes:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
The BA prefix may occur with any of the following stacks:
|
|
<ul>
|
|
<li>KA</li>
|
|
<li>SA</li>
|
|
<li>CA</li>
|
|
<li>TA</li>
|
|
<li>TZA</li>
|
|
<li>GA</li>
|
|
<li>DA</li>
|
|
<li>ZHA</li>
|
|
<li>ZA</li>
|
|
<li>SHA</li>
|
|
<li>K+YA (KYA)</li>
|
|
<li>G+YA (GYA)</li>
|
|
<li>K+RA (KRA)</li>
|
|
<li>G+RA (GRA)</li>
|
|
<li>S+RA (SRA)</li>
|
|
<li>G+LA (GLA)</li>
|
|
<li>K+LA (KLA)</li>
|
|
<li>Z+LA (ZLA)</li>
|
|
<li>R+LA (RLA)</li>
|
|
<li>S+LA (SLA)</li>
|
|
<li>S+KA (SKA)</li>
|
|
<li>S+GA (SGA)</li>
|
|
<li>S+NGA (SNGA)</li>
|
|
<li>S+NYA (SNYA)</li>
|
|
<li>S+TA (STA)</li>
|
|
<li>S+DA (SDA)</li>
|
|
<li>S+NA (SNA)</li>
|
|
<li>S+TZA (STZA)</li>
|
|
<li>R+KA (RKA)</li>
|
|
<li>R+GA (RGA)</li>
|
|
<li>R+NGA (RNGA)</li>
|
|
<li>R+JA (RJA)</li>
|
|
<li>R+NYA (RNYA)</li>
|
|
<li>R+TA (RTA)</li>
|
|
<li>R+DA (RDA)</li>
|
|
<li>R+NA (RNA)</li>
|
|
<li>R+TZA (RTZA)</li>
|
|
<li>R+DZA (RDZA)</li>
|
|
<li>L+CA (LCA)</li>
|
|
<li>L+TA (LTA)</li>
|
|
<li>L+DA (LDA)</li>
|
|
<li>R+K+YA (RKYA)</li>
|
|
<li>R+G+YA (RGYA)</li>
|
|
<li>S+K+YA (SKYA)</li>
|
|
<li>S+G+YA (SGYA)</li>
|
|
<li>S+K+RA (SKRA)</li>
|
|
<li>S+G+RA (SGRA)</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
The GA prefix may occur with any of the following stacks:
|
|
<ul>
|
|
<li>CA</li>
|
|
<li>DA</li>
|
|
<li>NA</li>
|
|
<li>NYA</li>
|
|
<li>SA</li>
|
|
<li>SHA</li>
|
|
<li>TA</li>
|
|
<li>TZA</li>
|
|
<li>YA</li>
|
|
<li>ZA</li>
|
|
<li>ZHA</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
The 'A prefix may occur with any of the following stacks:
|
|
<ul>
|
|
<li>GA</li>
|
|
<li>JA</li>
|
|
<li>DA</li>
|
|
<li>BA</li>
|
|
<li>DZA</li>
|
|
<li>KHA</li>
|
|
<li>CHA</li>
|
|
<li>THA</li>
|
|
<li>PHA</li>
|
|
<li>TSA</li>
|
|
<li>PH+YA (PHYA)</li>
|
|
<li>B+YA (BYA)</li>
|
|
<li>KH+YA (KHYA)</li>
|
|
<li>G+YA (GYA)</li>
|
|
<li>B+RA (BRA)</li>
|
|
<li>KH+RA (KHRA)</li>
|
|
<li>G+RA (GRA)</li>
|
|
<li>D+RA (DRA)</li>
|
|
<li>PH+RA (PHRA)</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
The MA prefix may occur with any of the following stacks:
|
|
<ul>
|
|
<li>KHA</li>
|
|
<li>GA</li>
|
|
<li>CHA</li>
|
|
<li>JA</li>
|
|
<li>THA</li>
|
|
<li>TSA</li>
|
|
<li>DA</li>
|
|
<li>DZA</li>
|
|
<li>NGA</li>
|
|
<li>NYA</li>
|
|
<li>NA</li>
|
|
<li>KH+YA (KHYA)</li>
|
|
<li>G+YA (GYA)</li>
|
|
<li>KH+RA (KHRA)</li>
|
|
<li>G+RA (GRA)</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
The DA prefix may occur with any of the following stacks:
|
|
<ul>
|
|
<li>BA</li>
|
|
<li>GA</li>
|
|
<li>KA</li>
|
|
<li>MA</li>
|
|
<li>NGA</li>
|
|
<li>PA</li>
|
|
<li>B+RA (BRA)</li>
|
|
<li>B+YA (BYA)</li>
|
|
<li>G+RA (GRA)</li>
|
|
<li>G+YA (GYA)</li>
|
|
<li>K+RA (KRA)</li>
|
|
<li>K+YA (KYA)</li>
|
|
<li>M+YA (MYA)</li>
|
|
<li>P+RA (PRA)</li>
|
|
<li>P+YA (PYA)</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
|
|
<p>
|
|
In the above list, the presence of wa-zur (ACIP {V}) does not
|
|
disallow a prefix-root combination; nor does the presence of any
|
|
vowel, even {'EEm}. The presence of {:} does disallow
|
|
prefix-root combinations; e.g., {GN'EEm} is {G-N'EEm}, but {GNA:} is
|
|
{G+NA:}. ({GNVA} is parsed as {G+N+VA} not because NVA cannot
|
|
take a GA prefix, but because NVA is not a native stack.)
|
|
</p>
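<p>
The prefix rules above amount to a table lookup. The Python sketch
below transcribes the five lists and chooses between {-} and {+}; the
names <tt>PREFIX_ROOTS</tt> and <tt>join_prefix</tt> are made up, root
stacks are written without plus signs (e.g., GRA for G+RA), and the
sketch deliberately ignores wa-zur, vowels, {:}, and the check that the
root is a native stack.
</p>

```python
# Prefix consonant -> root stacks it may precede, as listed above.
PREFIX_ROOTS = {
    "B": set(("KA SA CA TA TZA GA DA ZHA ZA SHA KYA GYA KRA GRA SRA GLA KLA "
              "ZLA RLA SLA SKA SGA SNGA SNYA STA SDA SNA STZA RKA RGA RNGA "
              "RJA RNYA RTA RDA RNA RTZA RDZA LCA LTA LDA RKYA RGYA SKYA "
              "SGYA SKRA SGRA").split()),
    "G": set("CA DA NA NYA SA SHA TA TZA YA ZA ZHA".split()),
    "'": set(("GA JA DA BA DZA KHA CHA THA PHA TSA PHYA BYA KHYA GYA BRA "
              "KHRA GRA DRA PHRA").split()),
    "M": set(("KHA GA CHA JA THA TSA DA DZA NGA NYA NA KHYA GYA KHRA "
              "GRA").split()),
    "D": set(("BA GA KA MA NGA PA BRA BYA GRA GYA KRA KYA MYA PRA "
              "PYA").split()),
}


def join_prefix(prefix, root):
    """Return e.g. 'G-NA' if the prefix rule allows it, else 'G+PA'."""
    allowed = root in PREFIX_ROOTS.get(prefix, set())
    return prefix + ("-" if allowed else "+") + root
```

<p>
A real implementation would also emit the warning mentioned earlier
whenever the {+} branch is taken for one of the five prefixes.
</p>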
|
|
|
|
<p>
|
|
The converter will allow any suffix to go with any native root or
|
|
prefix-root combination; it will allow any postsuffix to follow any
|
|
suffix. It will allow any appendage on any native <i>tsheg
|
|
bar</i>.
|
|
</p>
|
|
|
|
<p>
|
|
For example, {SOGS}, {BSOGS}, {BS'EEmGS}, {LE'U'I'O} and
|
|
{BSKYABS-'UR-'UNG-'O} are all native <i>tsheg bar</i>s in the
|
|
converter's eyes. Note the need for disambiguation: {PAM-'AM}
|
|
is a native <i>tsheg bar</i>, but {PAM'AM}, which parses as the
|
|
three stacks {PA}, {M'A}, and {MA}, is not. (In practice,
|
|
appendages rarely occur after prefixes. {BUR-'ANG} appears at
|
|
least once in ACIP files and {DGA'-'AM} appears at least twice, but
|
|
these may be typos. The converter does allow it, though.
|
|
It thinks {BIR'U} and {WAN'U} (which also occur, but only very
|
|
rarely) are both non-native, though, and thus treats {'} as U+0F71
|
|
(subscribed) and not U+0F60 (full form) in each case.)
|
|
</p>
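<p>
To make the role of the disambiguating hyphen concrete, here is a
Python sketch that peels appendages off the end of a <i>tsheg bar</i>.
It is a guess at the behavior, not the converter's algorithm: the name
<tt>split_appendages</tt> is hypothetical, and it assumes that an
appendage without a hyphen must directly follow a vowel, which is
consistent with the examples above but is not stated by them. It does
not check that what remains is a native <i>tsheg bar</i>.
</p>

```python
import re

# A vowel or a hyphen, then one of the appendages, at the end of the string.
APPENDAGE = re.compile(
    r"([AIUEOi]|-)'(ONGS|ONG|UNG|ANG|US|UR|UM|OS|IS|AM|E|I|O|U)$"
)


def split_appendages(tsheg_bar):
    """Return (base, appendages), peeling appendages from right to left."""
    rest, found = tsheg_bar, []
    while True:
        m = APPENDAGE.search(rest)
        if m is None:
            return rest, found
        found.insert(0, "'" + m.group(2))
        lead = m.group(1)
        # Drop a separating hyphen, but keep a preceding vowel in the base.
        rest = rest[: m.start()] + ("" if lead == "-" else lead)
```

<p>
Under these assumptions {PAM-'AM} splits into {PAM} plus {'AM}, while
{PAM'AM} is left whole because {'} directly follows the consonant {M}.
</p>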
|
|
|
|
<p>
|
|
Note a fine point. When turning a <i>tsheg bar</i> into
|
|
Tibetan, the ACIP->Tibetan converters assume that subjoined YA
|
|
and RA consonants are not fixed-form -- not U+0FBB and U+0FBC -- but
|
|
rather are the usual subjoined forms U+0FB1 and U+0FB2. The
|
|
only exceptions are the stacks R+Y, Y+Y, and n+d+Y, which are known
|
|
to have fixed-form subjoined YA, and the stacks n+d+R+Y (where RA
|
|
but not YA is full-form) and K+sh+R, which are known to have
|
|
fixed-form subjoined RA. (Wa-zur, U+0FAD, is never confused
|
|
with full-form subjoined WA, U+0FBA, because ACIP represents the
|
|
former with {V} and the latter with {W}.) Furthermore, the
|
|
converter only generates U+0F6A, the fixed-form RA (<i>rango</i>),
|
|
for the stacks R+W, R+Y, R+SH, R+SH+Y, R+sh, R+sh+n, R+sh+n+Y,
|
|
R+sh+M, R+sh+Y, and R+S; U+0F62 is always produced for the top-most
|
|
RA in any other stack. (Note that U+0F62 is sometimes
|
|
displayed as a fixed-form RA itself, as in {RNYA}.)
|
|
</p>
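<p>
The choice of codepoint for the top-most RA reduces to a set lookup, as
in this Python sketch. The names are hypothetical; the stack list is
copied from the paragraph above and is written without vowels.
</p>

```python
# Stacks for which the fixed-form RA, U+0F6A, is generated.
FIXED_FORM_RA_STACKS = {
    "R+W", "R+Y", "R+SH", "R+SH+Y", "R+sh",
    "R+sh+n", "R+sh+n+Y", "R+sh+M", "R+sh+Y", "R+S",
}


def top_ra(stack):
    """Return the character used for the top-most RA of an R+... stack."""
    return "\u0F6A" if stack in FIXED_FORM_RA_STACKS else "\u0F62"
```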
|
|
|
|
<p>
|
|
Another fine point: The tool treats {N+DZYA} like {N+DZ+YA}, except
|
|
that it warns, "<tt>There is a stack of three or more consonants in
|
|
N+DZYA that uses at least one '+' but does not use a '+' between
|
|
each consonant.</tt>". The tool is inconsistent, however; it
|
|
will not treat {R+TS+NYA} like {R+T+S+N+YA} (and that would be a
|
|
terrible idea).
|
|
</p>
|
|
|
|
<p>
|
|
So far, we have spoken about consonants and vowels. In fact,
|
|
it is not trivial to determine when something is a consonant and
|
|
when it is a vowel. {A} can represent U+0F68, the Tibetan
|
|
letter, or the implicit vowel. {'} can represent U+0F71, the
|
|
subscribed a-chung, or U+0F60, the full-sized consonant
|
|
a-chung. The converter treats {TAA} as {T+AA}, not {TA-AA},
|
|
but treats {TAAA} like {TA-AA}, not {T+AA-A}. It treats
|
|
{PA'AM} like {PA-'A-M}, not {P+A'A-M}. In short, it first
|
|
tries out treating {'} and {A} like vowels, but will backtrack if
|
|
that leads to a clearly invalid <i>tsheg bar</i>.
|
|
</p>
|
|
|
|
<p>
|
|
Finally, a string of numbers can be a <i>tsheg bar</i> also.
|
|
It is illegal for numbers and consonants to appear together within
|
|
one <i>tsheg bar</i>, however.
|
|
</p>
|
|
|
|
<p>
|
|
The above is a complete description of the converter's
algorithms for parsing <i>tsheg bar</i>s. You, the native
Tibetan speaker, may know that {BSKYABS-'UR-'UNG-'O} is not allowed
|
|
and thus think that {B+S+K+YAB+S-'UR-'UNG-'O} should be the result,
|
|
but the converter has no such knowledge, and thinks this is a native
|
|
tsheg bar equivalent to {B-S+K+YAB-S-'UR-'UNG-'O}.
|
|
</p>
|
|
|
|
|
|
|
|
<a name="sysprops"></a><h2>System Properties</h2>
|
|
|
|
<p>
|
|
The <a href="#sub"><i>tsheg-bar</i> substitution</a> mechanism is
|
|
customizable via system properties. Java developers likely
|
|
know what these are, but few users do. This section will
|
|
perhaps get a determined person started, but if you have trouble,
|
|
contact <a href="mailto:thdltools-devel@lists.sourceforge.net">the
|
|
developers</a> so that we can improve this documentation or create a
|
|
better user interface.
|
|
</p>
|
|
|
|
<p>
|
|
For the tool to respect the value of a system property, you must
|
|
invoke the tool from the command line as follows:
|
|
</p>
|
|
|
|
<p>
|
|
<tt>
|
|
java
|
|
"-Dorg.thdl.tib.text.ttt.ReplacementMap=KAsh=>K+sh,ONYA=>[#ERROR-ONYA-IS-O&]"
|
|
-Dorg.thdl.tib.text.ttt.VerboseReplacementMap=true
|
|
-jar Jskad.jar
|
|
</tt>
|
|
</p>
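<p>
Judging only from the example above, the value of the
<tt>ReplacementMap</tt> property looks like a comma-separated list of
<tt>PATTERN=>REPLACEMENT</tt> rules. The Python sketch below splits
such a value under that assumption; the function name is made up, and
the real tool may well treat commas or whitespace differently.
</p>

```python
def parse_replacement_map(value):
    """Split 'A=>B,C=>D' into [('A', 'B'), ('C', 'D')]."""
    rules = []
    for entry in value.split(","):
        # Assumption: neither patterns nor replacements contain a comma.
        pattern, replacement = entry.split("=>", 1)
        rules.append((pattern, replacement))
    return rules
```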
|
|
|
|
|
|
<a name="bugs"></a><h2>Known Bugs</h2>
|
|
|
|
<p>
|
|
This section presents areas where the current tool's behavior is
|
|
wrong. Before doing serious work with the converter,
|
|
familiarize yourself with this section and develop a plan to work
|
|
around the bugs or to ensure that your documents will not trigger
|
|
the bugs. At the same time, if any of these bugs affects you,
|
|
contact <a href="mailto:thdltools-devel@lists.sourceforge.net">the
|
|
developers</a> so that we can fix them. The squeaky wheel
|
|
surely gets the grease; these bugs may never be fixed if there are
|
|
no complaints.
|
|
</p>
|
|
|
|
<p>
|
|
The following are all known bugs:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
When ACIP {MTHARo} is given, the {o} glyph should be centered
|
|
under the THA glyph in ACIP->TMW conversions. At present,
|
|
the {o} glyph appears underneath the rightmost stack.
|
|
Similarly, {\u0F35} and {\u0F37} are not centered properly.
|
|
[<a
|
|
href="http://sourceforge.net/tracker/index.php?func=detail&aid=838594&group_id=61934&atid=502515">838594</a>]
|
|
</li>
|
|
<li>
|
|
ACIP->TMW conversion for {\u0F3E} is not correct. Fear
|
|
not; the character U+0F3E is so rare that no ACIP transliteration
|
|
exists for it. [<a
|
|
href="http://sourceforge.net/tracker/index.php?func=detail&aid=855478&group_id=61934&atid=502515">855478</a>]
|
|
</li>
|
|
<li>
|
|
A folio marker {@0B1} can appear in ACIP Release V texts; it gives an error at present.
|
|
</li>
|
|
<li>
|
|
The treatment of the very last line in an input text may be buggy
|
|
with regard to treatment of ACIP spaces, etc.<!-- DLC -->
|
|
</li>
|
|
<li>
|
|
The treatment of {:} directly before a line break is likely
|
|
incorrect; a <i>tsheg</i> is inserted right now after the
|
|
visarga.<!-- DLC FIXME -->
|
|
</li>
|
|
<li>
|
|
<!--DLC FIXME: --> The number of errors after which processing is
|
|
aborted (under the assumption that the input is probably not ACIP)
|
|
is absolute, not a per capita measurement (e.g., one error per 100
|
|
characters of input) and is not easily configured.
|
|
</li>
|
|
</ul>
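<p>
The last item above contrasts an absolute error limit with a per
capita one. A per capita check could look roughly like the
following Java sketch. Nothing here is taken from the converter:
<tt>ErrorBudget</tt>, <tt>shouldAbort</tt>, and all of the
parameters are hypothetical, and the minimum-input guard is an added
assumption so that a single early error does not abort a short text.
</p>

```java
public class ErrorBudget {
    /**
     * Returns true when the error rate exceeds the allowed number of
     * errors per 100 characters of input (hypothetical policy).
     *
     * @param errors               errors seen so far
     * @param charsRead            characters of input consumed so far
     * @param maxErrorsPer100Chars allowed errors per 100 characters
     * @param minChars             do not abort before this much input
     */
    public static boolean shouldAbort(int errors, int charsRead,
                                      int maxErrorsPer100Chars, int minChars) {
        if (charsRead < minChars) {
            return false; // too little input to judge the rate
        }
        // errors / charsRead > max / 100, rewritten without division
        return (long) errors * 100 > (long) maxErrorsPer100Chars * charsRead;
    }

    public static void main(String[] args) {
        System.out.println(shouldAbort(5, 1000, 1, 200));
        System.out.println(shouldAbort(30, 1000, 1, 200));
    }
}
```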
<a name="room"></a><h2>Room for Improvement</h2>
|
|
|
|
<p>
|
|
This section presents areas where the current tool could be
|
|
improved. None of the current behavior described here is
incontrovertibly flawed (i.e., no bugs are described here; see
<a href="#bugs">known bugs</a> for those); current behavior is
technically correct. However, the current behavior is not, in
everyone's eyes, perfect.
|
|
</p>
|
|
|
|
<p>
|
|
The following are the current areas in which the tool could be
|
|
better:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
The Unicode U+0F43 is equivalent to the sequence U+0F42 followed
|
|
by U+0FB7. There are several distinct but similar
|
|
cases. The converter should have an option that allows for
|
|
producing one form or the other instead. (In practice, doing
|
|
Unicode normalization on the output is probably going to give you
|
|
results just as good, and having a separate normalizer facilitates
|
|
code reuse.) See issue <a
|
|
href="http://sourceforge.net/tracker/index.php?func=detail&aid=946063&group_id=61934&atid=502518">946043</a>.
|
|
</li>
|
|
<li>
|
|
The fact that stacks G+N+Y and M+N+Y exist in the TMW font means
|
|
that the ACIP snippets {GNY} and {MNY} should, in some cases,
|
|
trigger warning 512. They do not do so at present, and
|
|
warning 507 is not given either. See issue <a
|
|
href="http://sourceforge.net/tracker/index.php?func=detail&aid=946058&group_id=61934&atid=502518">946058</a>.
|
|
</li>
|
|
<li>
|
|
At present, an error in ACIP->TMW conversion is given when ACIP
|
|
like {RTSNY} or {NNY} is seen; this is because no glyph R+TS+NY or
|
|
N+NY is in the TMW font. For these snippets, though, there
|
|
is a significant possibility that R+T+S+N+Y or N+N+Y was intended
|
|
because TMW does have glyphs for both. It would be best to
|
|
give an error or stern warning when creating the Unicode for RTSNY
|
|
etc., and to allow for giving an error when creating TMW for RTSNY
|
|
etc. (whereas right now warning 512 is given). See issue <a
|
|
href="http://sourceforge.net/tracker/index.php?func=detail&aid=936998&group_id=61934&atid=502518">936998</a>.
|
|
</li>
|
|
<li>
|
|
It would be best to produce a warning (but not an error) when
|
|
converting R+W, Y+Y, n+d+R+Y, K+sh+R, n+d+Y, R+Y, R+SH, R+SH+Y,
|
|
R+sh, R+sh+n, R+sh+n+Y, R+sh+M, R+sh+Y, or R+S to Tibetan.
|
|
The ACIP scheme does not really say when the unusual, full-formed
|
|
consonant is intended. See issue <a
|
|
href="http://sourceforge.net/tracker/index.php?func=detail&aid=933240&group_id=61934&atid=502518">933240</a>.
|
|
</li>
|
|
<li>
|
|
Some warnings are false alarms (false positives). The only
|
|
known false positive is the warning <tt>[#WARNING CONVERTING ACIP
|
|
DOCUMENT: There is a chance that the ACIP KshA was intended to
|
|
represent more consonants than we parsed it as representing --
|
|
NNYA, e.g., means N+NYA, but you can imagine seeing N+N+YA and
|
|
typing NNYA for it too.]</tt> when it is given for ACIP
|
|
constructions that cannot possibly be interpreted any other
|
|
way. See issue <a
|
|
href="http://sourceforge.net/tracker/index.php?func=detail&aid=932896&group_id=61934&atid=502515">932896</a>.
|
|
</li>
|
|
<li>
|
|
The only time Unicode output will contain U+0F6A, full-formed RA,
|
|
instead of U+0F62, RA (possibly superscribed), is when a Unicode
|
|
escape sequence for U+0F6A is used or when one of the ten stacks
|
|
RY, RW, RSH, RSHY, Rsh, Rshn, RshnY, RshM, RshY, or RS
|
|
appears. The Unicode standard is not as clear as it could be
|
|
on the issue of when to use full-formed code points like U+0F6A,
|
|
so this treatment might not be the best.
|
|
</li>
|
|
<li>
|
|
The glyph TibetanMachineWeb9.61 -- the {O'I} special combination
|
|
(i.e., the glyph for the Unicode string U+0F7C,U+0F60,U+0F72) --
|
|
is never output by the ACIP->TMW converter. It is
|
|
sometimes more beautiful than the glyphs that are presently output
|
|
(three separate glyphs instead of the one).
|
|
</li>
|
|
<li>
|
|
Though the ACIP standard disallows it, you will find in ACIP
|
|
documents from the Buddhist Canon things like {/NYA\} where the
|
|
standard demands {/NYA/}. Presently, this triggers an error;
|
|
it would be better if this were converted like {/NYA/} is, and
|
|
triggered only a <tt>Most</tt>-level warning.
|
|
</li>
|
|
<li>
|
|
The hypothetical comment {[# \u0F40 may have been intended...]}
|
|
should cause a warning saying that Unicode escapes do not apply
|
|
within comments.
|
|
</li>
|
|
<li>
|
|
The whitespace after a <a href="#escapes">Unicode escape</a> is
|
|
not interpreted correctly when that Unicode escape represents
|
|
something that is part of a <i>tsheg bar</i>. For example,
|
|
the space in {KA KHA} is treated as a <i>tsheg</i> (i.e., U+0F0B),
|
|
but the space in {\u0F40 KHA} is wrongly treated as Tibetan
|
|
whitespace. [<a
|
|
href="http://sourceforge.net/tracker/index.php?func=detail&aid=855482&group_id=61934&atid=502515">855482</a>]
|
|
</li>
|
|
<li>
|
|
Though not standard, {:} and {:-} sometimes are intended to
|
|
represent U+0F14. The latter causes an error; it should
|
|
cause a warning suggesting that the <a href="#escapes">Unicode
|
|
escape</a> {\u0F14} be used instead. The former is always
|
|
treated as U+0F7F; it should cause a warning in some or all
|
|
contexts.
|
|
</li>
|
|
<li>
|
|
The <a href="#sub"><i>tsheg-bar</i> substitution</a> mechanism
|
|
should be more general. The useful rule
|
|
<tt>ONYA=>O&</tt> should be supported and used by default.
|
|
</li>
|
|
<li>
|
|
The converters should support a white list of acceptable
|
|
non-native <i>tsheg bar</i>s (where the term "tsheg bar"
|
|
is to be interpreted somewhat literally here as any characters
|
|
between punctuation). Non-native <i>tsheg bar</i>s not on
|
|
the list should produce warnings or errors. Similarly, but
|
|
perhaps less urgently, a syllabary of native <i>tsheg bar</i>s
|
|
should be supported too. (A workaround is to use <a
|
|
href="#colors">coloring</a>, have your word processor delete
|
|
everything but the colored text, sort the colored <i>tsheg
|
|
bar</i>s, and inspect them all by hand. Also, <a
|
|
href="#stats"><i>tsheg-bar</i> statistics</a> will help you to
|
|
find uncommon <i>tsheg bar</i>s.)
|
|
</li>
|
|
<li>
|
|
ACIP->Unicode conversions produce Unicode text files at
|
|
present. While more compact than Rich Text Format (RTF)
files, text files cannot represent the two font
sizes in {KA (KA)}. A workaround is to use an ACIP->TMW
|
|
conversion followed by a separate <a
|
|
href="TMW_or_TM_To_X_Converters.html">TMW->Unicode</a>
|
|
conversion.
|
|
</li>
|
|
<li>
|
|
The converter should warn for each occurrence of the vowels {'E},
|
|
{'O}, {'EE}, or {'OO}.
|
|
</li>
|
|
<li>
|
|
Default <a href="#sub">substitution</a> rules should handle
|
|
{KAsh}, which seems to always mean {K+sh} in ACIP Release V texts.
|
|
</li>
|
|
</ul>
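<p>
The first item above mentions running a separate Unicode normalizer
over the converter's output. As a sketch of what that could look
like, the following uses the standard <tt>java.text.Normalizer</tt>
class (Java 6 or later), which is not part of the converter. The
class name <tt>GhaNormalization</tt> and the helper methods are
hypothetical. The sketch only produces the decomposed form, U+0F42
followed by U+0FB7.
</p>

```java
import java.text.Normalizer;

public class GhaNormalization {
    /** Applies canonical decomposition (NFD) to the given text. */
    public static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Formats each UTF-16 code unit as four hex digits, for inspection. */
    public static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The precomposed letter decomposes to base letter plus subjoined HA.
        System.out.println(hex(decompose("\u0F43"))); // prints 0F42 0FB7
    }
}
```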
<h2>License</h2>
|
|
|
|
<p>Both the ACIP->Tibetan converters and this document are released
|
|
under the <a
|
|
href="http://orion.lib.virginia.edu/thdl/tools/thdl_license.txt">THDL
|
|
Open Community License Version 1.0</a>.</p>
<p>
|
|
Please
|
|
|
|
<a href="mailto:thdltools-devel@lists.sourceforge.net">
|
|
e-mail us</a>
|
|
|
|
your comments about this page.
|
|
</p>
|
|
|
|
<p>
|
|
The
|
|
<a href="http://www.sourceforge.net/projects/thdltools">
|
|
THDL Tools</a>
|
|
project is generously hosted by:
|
|
<!--
|
|
|
|
DO NOT DELETE THE SF.NET LOGO.
|
|
|
|
We have a choice of colors and sizes for this logo (see
|
|
"https://sourceforge.net/docman/display_doc.php?docid=790&group_id=1"),
|
|
but we do not have the option of removing it. SourceForge requests
|
|
that we put it on each web page for our project, and to give us
|
|
incentive to do so, they will not track the number of hits for our
|
|
project web pages unless we put this link in. To track hits, see
|
|
"http://sourceforge.net/project/stats/index.php?report=months&group_id=61934".
|
|
|
|
-->
|
|
<a href="http://sourceforge.net/">
|
|
<img src="http://sourceforge.net/sflogo.php?group_id=61934&type=1"
|
|
width="88" height="31" alt="SourceForge Logo" />
|
|
</a>
|
|
<!-- AGAIN, DO NOT DELETE THE SF.NET LOGO. -->
|
|
</p>
|
|
</div>
</body>
|
|
</html>
|