<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
|
<html xmlns="http://www.w3.org/1999/xhtml">
|
|
|
|
<!-- @author David Chandler -->
|
|
<!-- @editor Emacs, baby! -->
|
|
|
|
|
|
<head>
|
|
<title>ACIP To Tibetan Converters</title>
|
|
|
|
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
|
|
<script type="text/javascript" src="http://iris.lib.virginia.edu/tibet/scripts/thdl_scripts.js"></script>
|
|
<link rel="stylesheet" type="text/css" href="http://iris.lib.virginia.edu/tibet/style/thdl-styles.css"/>
|
|
</head>
|
|
|
|
<body>
|
|
|
|
<div id="banner">
|
|
<a id="logo" href="http://iris.lib.virginia.edu/tibet/index.html"><img id="test" alt="THDL Logo" src="http://iris.lib.virginia.edu/tibet/images/logo.png"/></a>
|
|
<h1>The Tibetan &amp; Himalayan Digital Library</h1>
|
|
|
|
<div id="menubar">
|
|
<script type='text/javascript'>function Go(){return}</script>
|
|
<script type='text/javascript' src='http://iris.lib.virginia.edu/tibet/scripts/new/thdl_menu_config.js'></script>
|
|
|
|
<script type='text/javascript' src='http://iris.lib.virginia.edu/tibet/scripts/new/menu_new.js'></script>
|
|
<script type='text/javascript' src='http://iris.lib.virginia.edu/tibet/scripts/new/menu9_com.js'></script>
|
|
<noscript><p>Your browser does not support javascript.</p></noscript>
|
|
<div id='MenuPos' >Menu Loading... </div>
|
|
</div><!--END menubar-->
|
|
|
|
</div><!--END banner-->
|
|
|
|
<div id="sub_banner">
|
|
<div id="search">
|
|
<form method="get" action="http://www.google.com/u/thdl">
|
|
<p>
|
|
<input type="text" name="q" id="q" size="15" maxlength="255" value="" />
|
|
<input type="submit" name="sa" id="sa" value="Search"/>
|
|
<input type="hidden" name="hq" id="hq" value="inurl:iris.lib.virginia.edu"/>
|
|
</p>
|
|
</form>
|
|
|
|
</div>
|
|
<div id="breadcrumbs">
|
|
<a href="http://iris.lib.virginia.edu/tibet/index.html">Home</a> &gt; <a href="index.html">Tools</a> &gt; <a href="http://iris.lib.virginia.edu/tibet/tools/allfonts.html">Fonts &amp; Input</a> &gt; <a href="http://iris.lib.virginia.edu/tibet/tools/conv.html">Converters</a> &gt; <a href="TMW_RTF_TO_THDL_WYLIE.html">Converters in Jskad</a> &gt; ACIP To Tibetan Converters
|
|
</div>
|
|
</div><!--END sub_banner-->
|
|
|
|
|
|
<div id="main">
|
|
|
|
<h2>ACIP To Tibetan Converters</h2>
|
|
|
|
<p>
|
|
This document describes the ACIP->Tibetan converters built atop
|
|
<a
|
|
href="http://iris.lib.virginia.edu/tibet/tools/jskad.html">Jskad</a>.
|
|
These converters were initially written by David Chandler, a
|
|
volunteer with the <a
|
|
href="http://iris.lib.virginia.edu/tibet/index.html">Tibetan and
|
|
Himalayan Digital Library</a>, in the latter half of 2003.
|
|
They built upon the work of Tony Duff, Edward Garrett, and Than
|
|
Garson, and they would not be possible without the assistance of
|
|
David Chapman, Robert Chilton, and Andrés Montano
|
|
Pellegrini. (Please correct, and forgive, any omissions from
|
|
these lists.)
|
|
</p>
|
|
|
|
<p>
|
|
These converters accept <a href="http://asianclassics.org">Asian
|
|
Classics Input Project</a> (ACIP) transliteration of Tibetan (using
|
|
ACIP's <a
|
|
href="http://asianclassics.org/download/tibetancode/ticode.pdf">Tibetan
|
|
Input Code</a>), a Roman transliteration scheme. ACIP has many
|
|
Buddhist texts available in ACIP transliteration, which alone makes
|
|
ACIP transliteration (or just ACIP for short) important.
|
|
</p>
|
|
|
|
<p>
|
|
The converters here accept a text file of ACIP and output either a
|
|
Unicode UTF-8-encoded text file or a Rich Text Format (RTF) file of
|
|
<a href="http://iris.lib.virginia.edu/tibet/tools/tmw.html">Tibetan
|
|
Machine Web</a> (TMW). The latter is ready to use onscreen and
|
|
to make beautiful hardcopy today; the former will be understood by
|
|
software for a long time to come.
|
|
</p>
|
|
|
|
<p>
|
|
The converters are meant to produce perfect results even for
|
|
imperfect input. To give you an idea of the thought and care
|
|
that went into these converters, consider the following partial list
|
|
of features:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
Four tiers of <a href="#diagnostics">warning and error
|
|
messages</a> are available.
|
|
</li>
|
|
<li>
|
|
Some transliterations specified by the ACIP standard are not
|
|
accepted (i.e., they cause <a href="#diagnostics">errors</a>)
|
|
because they are used too often improperly in Release IV texts
|
|
(e.g., {\}); some non-standard transliteration is understood
|
|
because it is used in ACIP Release IV texts (e.g., {[DD1]}).
|
|
</li>
|
|
<li>
|
|
Non-standard <a href="#escapes">Unicode character escapes</a> are
|
|
supported. (In this way, the glyph that the ACIP {\} refers
|
|
to according to the standard can in fact be represented, via
|
|
{\u0F84}.)
|
|
</li>
|
|
<li>
|
|
<a href="#colors">Color-coding</a> can help find typos in the
|
|
input.
|
|
</li>
|
|
<li>
|
|
A <a href="#sub">substitution</a> mechanism allows for correcting
|
|
erroneous documents on the fly.
|
|
</li>
|
|
<li>
|
|
The converters can output frequency <a
|
|
href="#stats">statistics</a>.
|
|
</li>
|
|
<li>
|
|
The <a href="#lex">"lexical analyzer"</a> and <a
|
|
href="#parse">"parser"</a> handle every intricacy of
|
|
real ACIP Release IV texts.
|
|
</li>
|
|
<li>
|
|
The knowledge regarding the TMW font has been verified by
|
|
independent teams as described <a
|
|
href="TMW_or_TM_To_X_Converters.html#vv">here</a>.
|
|
</li>
|
|
</ul>
|
|
|
|
<p>
|
|
The ACIP->Unicode and ACIP->TMW converters are equally
|
|
good. They differ in one respect,
though. The TMW font has only a fixed set of glyphs, whereas
|
|
Unicode can encode arbitrary Tibetan glyphs. Thus, the
|
|
hypothetical ACIP {GAI}, which parses as {G+AI} due to <a
|
|
href="#prefix">prefix rules</a>, will give an error in an
|
|
ACIP->TMW conversion because no glyph exists for this
|
|
stack. The ACIP->Unicode conversion will succeed, having
|
|
generated correct Unicode. This is the only difference between
|
|
the two conversions.
|
|
</p>
|
|
|
|
<p>
|
|
The converters are actively maintained; your <a
|
|
href="mailto:thdltools-devel@lists.sourceforge.net">feedback</a> is
|
|
valued.
|
|
</p>
|
|
|
|
<p>
|
|
Note that there are also <a
|
|
href="TMW_or_TM_To_X_Converters.html">TMW->ACIP</a> converters
|
|
available; this document does not cover them.
|
|
</p>
|
|
|
|
<p>
|
|
In what follows, you will learn <a href="#using">how to use</a> the
|
|
converters, including all the features listed above, and you'll find
|
|
a list of <a href="#bugs">known bugs</a> and places where there is
|
|
<a href="#room">room for improvement</a>.
|
|
</p>
|
|
|
|
|
|
<a name="using"></a><h2>Using the Converters</h2>
|
|
|
|
<p>
|
|
This section briefly describes how the converters are best used.
|
|
</p>
|
|
|
|
<p>
|
|
The GUI and command-line interfaces are both sufficient; the GUI
|
|
interface is your best bet if you've not used the converters
|
|
before. To learn how to invoke these interfaces, read <a
|
|
href="TMW_RTF_TO_THDL_WYLIE.html#invok">these instructions</a>.
|
|
</p>
|
|
|
|
<p>
|
|
First, review the <a href="#bugs">known bugs</a> and be sure you can
|
|
live with them.
|
|
</p>
|
|
|
|
<p>
|
|
Now perform a trial conversion of your document with <a
|
|
href="#diagnostics">warnings</a> disabled. The goal of this first
pass is to ensure that no outright <a href="#diagnostics">errors</a> appear in
the input. If any do, make a copy of the input, edit the
copy, and feed it through again. Feel free to try this out as
soon as you're comfortable; the error messages themselves are
sometimes self-explanatory.
|
|
</p>
|
|
|
|
<p>
|
|
Once all errors have been corrected, do a conversion with warning
|
|
level 'Some'. If any warnings mark real problems, correct
|
|
those problems.
|
|
</p>
|
|
|
|
<p>
|
|
If you have the patience, now do a conversion with warning level
'Most'. If any warnings mark real problems, correct those
problems.
|
|
</p>
|
|
|
|
<p>
|
|
The 'All' warning level is pedantic; you might find it useful if
|
|
you're writing software that is to produce ACIP transliteration that
|
|
is easily read by machines. If you find any useful warnings at
this level, report them as bugs -- such warnings should be 'Most' or
|
|
'Some' level.
|
|
</p>
|
|
|
|
<p>
|
|
For best results, produce <a href="#colors">color-coded
|
|
output</a>. Scan the output for non-<a
|
|
href="#native">native</a> <i>tsheg bar</i>s and ensure that they
|
|
match the original document (the one from which the ACIP
|
|
transliteration was produced). Color-coding is useful because,
|
|
for example, {ZHIGN} is probably a typo for {ZHING}; {ZHIGN} will
|
|
appear colored, whereas {ZHING} is not colored.
|
|
</p>
|
|
|
|
<p>
|
|
Note that the ACIP {%} gives a warning every time. Use the <a
|
|
href="#escapes">Unicode escape</a> {\u0F35} if you want to avoid
|
|
this warning, but <i>note well</i> that Unicode escapes are not part
|
|
of the ACIP standard. Thus, other tools that work with ACIP
|
|
transliteration will likely not understand {\u0F35}.
|
|
</p>
|
|
|
|
<p>
|
|
To save time, you may use the <a href="#sub"><i>tsheg-bar</i>
|
|
substitution</a> mechanism when appropriate.
|
|
</p>
|
|
|
|
<p>
|
|
Even if your desired end result is Unicode output, an ACIP->TMW
|
|
conversion is sometimes useful. One benefit is that errors
|
|
will appear for any ACIP <i>tsheg bar</i> that refers to a consonant
|
|
stack not included in TMW. Such stacks should be scrutinized:
TMW contains over 500 of the most common consonant stacks, so a stack
that it lacks may well be a typo.
|
|
</p>
|
|
|
|
<p>
|
|
Finally, check a few folios by hand against the original document to
|
|
be sure that you're satisfied with the conversion.
|
|
</p>
|
|
|
|
|
|
|
|
|
|
|
|
<a name="diagnostics"></a><h2>Diagnostics: Warnings and Errors</h2>
|
|
|
|
<p>
|
|
These converters are designed such that the output is just what you
|
|
yourself would create by hand. Whenever there is doubt about
|
|
what output is desired, a warning or error is issued. This
|
|
means that a helpful warning or error message will appear in the
|
|
output, and that you will be told at the end of the conversion that
|
|
one or more warnings or errors have indeed occurred. You can
|
|
then search your output document for the text <tt>[#ERROR</tt> or
|
|
<tt>[#WARNING</tt>.
|
|
</p>
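<p>
If your output is a plain-text (UTF-8) file, this search can be
automated. The following Python sketch is not part of the
converters; it assumes only that the markers <tt>[#ERROR</tt> and
<tt>[#WARNING</tt> appear literally in the output, as described
above. The function names and the sample text are made up for
illustration.
</p>

<pre>
```python
def count_diagnostics(text):
    """Count the diagnostic markers that the converters embed in their output."""
    return {
        "errors": text.count("[#ERROR"),
        "warnings": text.count("[#WARNING"),
    }


def diagnostic_lines(text):
    """Return (line number, line) pairs for every line holding a marker."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "[#ERROR" in line or "[#WARNING" in line:
            hits.append((number, line))
    return hits


# A made-up stand-in for the contents of a converted file.
sample = (
    "some converted text\n"
    "[#ERROR CONVERTING ACIP DOCUMENT: Lexical error: The ACIP {^} "
    "must precede a tsheg bar.]\n"
    "[#WARNING CONVERTING ACIP DOCUMENT: Lexical warning: ...]\n"
)
print(count_diagnostics(sample))
for number, line in diagnostic_lines(sample):
    print(number, line)
```
</pre>

<p>
To check a real file, read it with
<tt>open(path, encoding="utf-8").read()</tt> and pass the result to
these functions. RTF output is better searched in a word
processor.
</p>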
|
|
|
|
<p>
|
|
There are four warning levels: 'None', 'Some', 'Most', and
|
|
'All'. Choose 'None' if you don't want any warnings to appear
|
|
in your output and be brought to your attention at the end of
|
|
conversion. Choose 'Some' if you want to see the most
|
|
important warnings, 'Most' if you want some real confidence in your
|
|
output, and 'All' if you've absolutely got to know that the output
|
|
is right.
|
|
</p>
|
|
|
|
<p>
|
|
Errors will always appear; you cannot disable them.
|
|
</p>
|
|
|
|
<p>
|
|
The following are some (but not all) error and warning messages,
|
|
accompanied by further explication:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<tt>[#ERROR CONVERTING ACIP DOCUMENT: The Unicode escape with
|
|
ordinal 3912 does not match up with any TibetanMachineWeb
|
|
glyph.]</tt> appears for the input {\u0F48} because there is no
|
|
character at the Unicode codepoint U+0F48 (decimal 3912).
|
|
</li>
|
|
<li>
|
|
<tt>[#ERROR The ACIP {G+N+NA} cannot be represented with the
|
|
TibetanMachine or TibetanMachineWeb fonts because no such glyph
|
|
exists in these fonts.]</tt> appears because the Tibetan Machine
|
|
Web font has only a limited number of ready-made, precomposed
|
|
glyphs, and {G+N+NA} is not one of them. You'll only see
|
|
this error in an ACIP->TMW conversion, not an ACIP->Unicode
|
|
conversion.
|
|
</li>
|
|
<li>
|
|
<tt>[#ERROR CONVERTING ACIP DOCUMENT: This converter cannot
|
|
convert the ACIP {x} to Tibetan because it is unclear what the
|
|
result should be.]</tt> appears because the appropriate output for
|
|
this likely requires special mark-up.
|
|
</li>
|
|
<li>
|
|
<tt>[#ERROR CONVERTING ACIP DOCUMENT: Lexical error: The ACIP {^}
|
|
must precede a tsheg bar.]</tt> appears for
{^&nbsp;&nbsp;GONG SA} (two spaces after the caret), for example, because only
{^GONG SA} and {^&nbsp;GONG SA} are supported in this
implementation.
|
|
</li>
|
|
<li>
|
|
<tt>[#ERROR CONVERTING ACIP DOCUMENT: The tsheg bar ("syllable") :
|
|
has these errors: Cannot convert ACIP A: because A: is a "vowel"
|
|
without an associated consonant]</tt> appears for the input {:}
|
|
because {:} cannot appear alone. (Sloppily, this message
|
|
exposes you to the internals of the converter, where {:} is
|
|
thought of as {A:} in some contexts.)
|
|
</li>
|
|
<li>
|
|
<tt>[#ERROR CONVERTING ACIP DOCUMENT: Lexical error: The ACIP x
|
|
must be glued to the end of a tsheg bar, but this one was
|
|
not]</tt> appears because {%}, {o}, and {x} are really only to be
|
|
applied to whole <i>tsheg bar</i>s, and should not occur alone.
|
|
</li>
|
|
<li>
|
|
<tt>[#WARNING CONVERTING ACIP DOCUMENT: The ACIP DGYA has been
|
|
interpreted as two stacks, not one, but you may wish to confirm
|
|
that the original text had two stacks as it would be an easy
|
|
mistake to make to see one stack and forget to input it with '+'
|
|
characters.]</tt> appears because it helps evince the impact of <a
|
|
href="#prefix">prefix rules</a>, a subtle point with regard to
|
|
ACIP because they are implied, but not discussed explicitly in
|
|
depth, by the ACIP standard.
|
|
</li>
|
|
<li>
|
|
<tt>[#WARNING CONVERTING ACIP DOCUMENT: Warning: We're going with
|
|
{B+NA}, but only because our knowledge of prefix rules says that
|
|
{B}{NA} is not a legal Tibetan tsheg bar ("syllable")]</tt>
|
|
appears for the same reason as above.
|
|
</li>
|
|
<li>
|
|
<tt>[#WARNING CONVERTING ACIP DOCUMENT: Lexical warning: The ACIP
|
|
{%} is treated by this converter as U+0F35, but sometimes might
|
|
represent U+0F14 in practice. To avoid seeing this warning again,
|
|
change the input to use {\u0F35} instead of {%}.]</tt> appears
|
|
because some ACIP transliteration out there does use {%} to mean
|
|
U+0F14.
|
|
</li>
|
|
</ul>
|
|
|
|
<p>
|
|
When warning or error messages refer to a 'Lexical error', that is
|
|
an error that occurs when <a href="#lex">breaking an input text up
|
|
into <i>tsheg bar</i>s</a>. To fully understand all warning
|
|
and error messages, a thorough understanding of <a href="#lex">that
|
|
process</a> and of the <a href="#parse">interpretation of ACIP
|
|
<i>tsheg bar</i>s</a> is required.
|
|
</p>
|
|
|
|
|
|
<a name="colors"></a><h2>Coloration</h2>
|
|
|
|
<p>
|
|
For ACIP->TMW conversions (not ACIP->Unicode), color-coding of
|
|
<i>tsheg bar</i>s is an option. The command-line converters
|
|
accept a flag <tt>--colors yes|no</tt>; the conversion GUI in
|
|
Jskad has a checkbox for color-coding.
|
|
</p>
|
|
|
|
<p>
|
|
Warnings and errors appear in <font color="red">red</font>; <i>tsheg
|
|
bar</i>s that would parse differently if other <a
|
|
href="#prefix">prefix rules</a> were used appear in <font
|
|
color="yellow">yellow</font>; non-<a href="#native">native</a>
|
|
<i>tsheg bar</i>s appear in <font color="green">green</font>.
|
|
</p>
|
|
|
|
|
|
<a name="stats"></a><h2><i>Tsheg-bar</i> Statistics</h2>
|
|
|
|
<p>
|
|
The ACIP->Tibetan converters provide a simple-minded accounting
|
|
mechanism with which one can determine which <i>tsheg bar</i>s
|
|
appear in a conversion or how many times each <i>tsheg bar</i>
|
|
appears. This mechanism is for power users only at this point;
|
|
its user interface leaves much to be desired. If you wish to
|
|
produce frequency information, and if you are not familiar with some
|
|
sort of scripting (via Excel macros, Unix shell scripts, etc.), then
|
|
the output produced will likely be useless to you.
|
|
</p>
|
|
|
|
<p>
|
|
To support the calculation of frequency statistics, that is, how
|
|
many times each <i>tsheg bar</i> appears, the converter can output
|
|
all <i>tsheg bar</i>s to the Java error console (i.e.,
|
|
<tt>System.err</tt>). Each will appear on the console as many
|
|
times as it appears in the input. To activate this
|
|
functionality, <a href="#sysprops">set the system property</a>
|
|
<tt>org.thdl.tib.text.ttt.OutputAllTshegBars</tt> to <tt>true</tt>,
|
|
and be prepared for voluminous output. Massaging this output
|
|
into a friendly tabular format is quite possible but not described
|
|
here; contact <a
|
|
href="mailto:thdltools-devel@lists.sourceforge.net">the
|
|
developers</a> for help.
|
|
</p>
|
|
|
|
<p>
|
|
To support the generation of syllabaries, the converter can output
|
|
each <i>tsheg bar</i> encountered to the Java error console (i.e.,
|
|
<tt>System.err</tt>). Each will appear on the console only
|
|
once, no matter how many times it appears in the input. To
|
|
activate this functionality, <a href="#sysprops">set the system
|
|
property</a> <tt>org.thdl.tib.text.ttt.OutputUniqueTshegBars</tt> to
|
|
<tt>true</tt>, and be prepared for voluminous output.
|
|
</p>
|
|
|
|
<p>
|
|
If desired, each <i>tsheg bar</i> output can be prefixed with a
|
|
string of your choice by <a href="#sysprops">setting the system
|
|
property</a> <tt>org.thdl.tib.text.ttt.PrefixForOutputTshegBars</tt>
|
|
to that string. This is useful if the converter is producing
|
|
other output on the console and you want to separate that output
|
|
from the statistics.
|
|
</p>
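<p>
As a starting point for massaging the console output, here is a
minimal Python sketch. It is not part of the converters, and it
rests on assumptions: that you have captured the console output in
a file or list of lines, that each <i>tsheg bar</i> occupies its
own line, and that you chose the prefix <tt>TSHEGBAR:</tt> (a
made-up value) for
<tt>org.thdl.tib.text.ttt.PrefixForOutputTshegBars</tt>.
</p>

<pre>
```python
from collections import Counter


def tally_tsheg_bars(console_lines, prefix):
    """Count the tsheg bars on console lines that start with the given prefix."""
    counts = Counter()
    for line in console_lines:
        line = line.rstrip("\n")
        if line.startswith(prefix):
            # Everything after the prefix is taken to be the tsheg bar.
            counts[line[len(prefix):]] += 1
    return counts


# Made-up console output; lines without the prefix are ignored.
captured = [
    "TSHEGBAR:BKRA\n",
    "TSHEGBAR:SHIS\n",
    "some unrelated console message\n",
    "TSHEGBAR:BKRA\n",
]
counts = tally_tsheg_bars(captured, "TSHEGBAR:")
for tsheg_bar, count in counts.most_common():
    print(tsheg_bar, count)
```
</pre>

<p>
The keys of the resulting table also serve as a list of unique
<i>tsheg bar</i>s, should you want a syllabary rather than
frequencies.
</p>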
|
|
|
|
<!-- DLC LINK TO THE EXCEL SPREADSHEET OF STATS -->
|
|
|
|
|
|
|
|
<a name="sub"></a><h2><i>Tsheg-bar</i> Substitution</h2>
|
|
|
|
<!-- NOTE WELL: The text here is largely the same as the text in the
|
|
class comment for org.thdl.tib.text.ttt.MidLexSubstitution. -->
|
|
|
|
<p>
|
|
The ACIP->Tibetan converters provide a mechanism for
|
|
automatically correcting common transliteration typos. For
|
|
example, if your document contains 100 occurrences of {KAsh} that
|
|
all in fact intend {K+sh}, then you can specify just once the rule
|
|
<tt>{KAsh}->{K+sh}</tt>, and all 100 occurrences will be treated
|
|
correctly. This mechanism is not very easy to use, but it is
|
|
completely customizable; you can specify any number of rules.
|
|
You can only perform such substitutions at the <i>tsheg bar</i>
|
|
level, though. This means, for example, that you cannot
|
|
specify the rule <tt>{GONG SA}->{^GONG SA}</tt>; you can only
|
|
specify <tt>{GONG}->{^GONG}</tt>, which would affect {GONG LA}
|
|
just as it would affect {GONG SA}.
|
|
</p>
|
|
|
|
<p>
|
|
To perform substitutions, <a href="#sysprops">set the system
|
|
property</a> <tt>org.thdl.tib.text.ttt.ReplacementMap</tt> to be a
|
|
comma-delimited list of <tt>x=>y</tt> pairs. For example,
|
|
if you think BLKU, which parses as B+L+KU, should parse as B-L+KU,
|
|
and you want KAsh to be parsed as K+sh because the input operators
|
|
mistyped it, then set <tt>org.thdl.tib.text.ttt.ReplacementMap</tt>
|
|
to <tt>BLKU=>B-L+KU,KAsh=>K+sh</tt>. Note that this will
|
|
not cause {B+L+KU} to become {B-L+KU} -- we are doing the
|
|
replacement during lexical analysis of the input file, not during
|
|
parsing. And it will cause {SBLKU} to become {SB-L+KU}, which
|
|
is parsed as {S+B-L+KU}, probably not what you wanted. If you
|
|
fear such things, you can see if they happen by setting the system
|
|
property <tt>org.thdl.tib.text.ttt.VerboseReplacementMap</tt> to
|
|
<tt>true</tt>, which will cause an informational message to be
|
|
printed on the Java console every time a replacement is made.
|
|
</p>
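<p>
To make the behavior just described concrete, here is a small
Python model of it. This is a sketch, not the converter's code:
the function names are invented, rules whose text contains a comma
are not handled, and replacing only the first occurrence of a
pattern is an assumption.
</p>

<pre>
```python
def parse_replacement_map(value):
    """Split a comma-delimited list of x=>y pairs into (x, y) tuples."""
    rules = []
    for pair in value.split(","):
        pattern, replacement = pair.split("=>", 1)
        rules.append((pattern, replacement))
    return rules


def substitute(tsheg_bar, rules):
    """Apply the first rule whose pattern occurs in the tsheg bar, once."""
    for pattern, replacement in rules:
        if pattern in tsheg_bar:
            return tsheg_bar.replace(pattern, replacement, 1)
    return tsheg_bar


rules = parse_replacement_map("BLKU=>B-L+KU,KAsh=>K+sh")
for tsheg_bar in ["BLKU", "KAsh", "B+L+KU", "SBLKU"]:
    print(tsheg_bar, "->", substitute(tsheg_bar, rules))
```
</pre>

<p>
Because the rules operate on the raw text of the <i>tsheg bar</i>,
{B+L+KU} passes through untouched while {SBLKU} is rewritten, just
as the paragraph above warns.
</p>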
|
|
|
|
<p>
|
|
Furthermore, you can use the regular expression notations <tt>^</tt>
|
|
and <tt>$</tt> to denote the beginning and end of the <i>tsheg
|
|
bar</i>, respectively. For example, <tt>^BLKU$=>B-L+KU</tt>
|
|
is a useful rule. Note that full regular expressions are not
|
|
supported -- the tool just borrows a bit of the notation. The
|
|
rule <tt>^BLKU=>B-L+KU</tt> means that {BLKUM} and {BLKU} will
|
|
both be replaced, but {SBLKU} and {SBLKUM} will not be. The
|
|
caret, <tt>^</tt>, means that we only match if BLKU is at the
|
|
beginning. The dollar sign, <tt>$</tt>, means that we only
|
|
match if the pattern is at the end. The rule
|
|
<tt>BLKU$=>B-L+KU</tt> will cause {SBLKU} to be replaced, but not
|
|
{BLKUM}. Note that performance is far better for
|
|
<tt>^FOO$</tt> than for <tt>^FOO</tt>, <tt>FOO$</tt>, or
|
|
<tt>FOO</tt> alone.
|
|
</p>
|
|
|
|
<p>
|
|
Only one substitution is made per <i>tsheg bar</i>.
|
|
<tt>^FOO$</tt>-style mappings will be tried first, then
|
|
<tt>^FOO</tt>-style, then <tt>FOO$</tt>-style, and finally
|
|
<tt>FOO</tt>-style.
|
|
</p>
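<p>
The anchoring notation and the order in which rule styles are
tried can be modeled in a few lines of Python. This is an
illustrative sketch with invented names, not the converter's
implementation; in particular, how ties between rules of the same
style are broken is an assumption (here, the order in which they
were given).
</p>

<pre>
```python
def classify(pattern):
    """Return (rank, bare text) for a rule pattern; lower ranks are tried first."""
    starts = pattern.startswith("^")
    ends = pattern.endswith("$")
    bare = pattern
    if starts:
        bare = bare[1:]
    if ends:
        bare = bare[:-1]
    if starts and ends:
        rank = 0  # ^FOO$
    elif starts:
        rank = 1  # ^FOO
    elif ends:
        rank = 2  # FOO$
    else:
        rank = 3  # FOO
    return rank, bare


def apply_rules(tsheg_bar, rules):
    """Make at most one substitution, trying ^FOO$, ^FOO, FOO$, then FOO."""
    ranked = sorted(rules, key=lambda rule: classify(rule[0])[0])
    for pattern, replacement in ranked:
        rank, bare = classify(pattern)
        if rank == 0 and tsheg_bar == bare:
            return replacement
        if rank == 1 and tsheg_bar.startswith(bare):
            return replacement + tsheg_bar[len(bare):]
        if rank == 2 and tsheg_bar.endswith(bare):
            return tsheg_bar[: len(tsheg_bar) - len(bare)] + replacement
        if rank == 3 and bare in tsheg_bar:
            return tsheg_bar.replace(bare, replacement, 1)
    return tsheg_bar


rules = [("BLKU", "B+L+KU"), ("^BLKU$", "B-L+KU")]
print(apply_rules("BLKU", rules))   # the ^FOO$ rule is tried first
print(apply_rules("SBLKU", rules))  # only the unanchored rule matches
```
</pre>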
|
|
|
|
<p>
|
|
An example of a useful substitution is <tt>o$=>\u0F35</tt>.
|
|
This is useful because the converters interpret the ACIP {o} as
|
|
U+0F37 by default, but you might prefer U+0F35 in your output.
|
|
</p>
|
|
|
|
<p>
|
|
Note that you cannot literally replace {FOO} with {BAR} using this
|
|
mechanism -- because {F} is not an ACIP character, the lex will not
|
|
get far enough to use this substitution mechanism. This is not
|
|
considered a design flaw -- serious errors require user
|
|
intervention. Sophisticated users can use something akin to
|
|
perl, sed, or awk scripts to preprocess the input.
|
|
</p>
|
|
|
|
<p>
|
|
Note also that you cannot use the rule <tt>ONYA=>O&amp;</tt>,
although it would be nice if you could. Technically, {&amp;}
|
|
is considered to be punctuation (i.e., that which divides <i>tsheg
|
|
bar</i>s) and is not understood inside a <i>tsheg bar</i>.
|
|
</p>
|
|
|
|
<p>
|
|
Note that this mechanism is also useful for fixing problems in the
|
|
converter itself rather than in the input.
|
|
</p>
|
|
|
|
<a name="escapes"></a><h2>Unicode Character Escapes</h2>
|
|
|
|
<p>
|
|
The ACIP->Tibetan converters support some non-standard extensions
|
|
to the <a
|
|
href="http://asianclassics.org/download/tibetancode/ticode.pdf">ACIP
|
|
Tibetan Input Code Standard</a>. One of those is Unicode
|
|
character escape sequences. This extension makes it possible
|
|
to represent characters that the <a
|
|
href="http://asianclassics.org/download/tibetancode/ticode.pdf">ACIP
|
|
standard</a> does not address, and to represent one character,
|
|
U+0F84, that ACIP does address with the transliteration {\} but that
|
|
is misused in practice so often to refer to U+0F3C that the
|
|
ACIP->Tibetan converters always produce an error upon seeing {\}.
|
|
</p>
|
|
|
|
<p>
|
|
Outside of comments, {\uKLMN} is interpreted as referring to the
|
|
Unicode character with ordinal <i>KLMN</i>, where each of K, L, M,
and N is a case-insensitive hexadecimal digit. For example,
|
|
the ACIP {KA KHA GA NGA } is exactly equivalent to
|
|
{\u0F40\u0f0B\u0F41\u0F0B\u0F42\u0F0B\u0F44\u0f0b}. Unicode
|
|
escapes produce the obvious Unicode in an ACIP->Unicode
|
|
conversion, and they produce the correct TMW glyph in an
|
|
ACIP->TMW conversion. There are limits, though, when
|
|
converting to TMW; multiple escapes in sequence are not handled
|
|
correctly. It would take a Unicode to TMW converter to produce
|
|
the correct glyphs for {\u0F42\u0F92\u0FB7\u0F7C}. The escapes
|
|
for vowels and other characters that are mapped to multiple TMW
|
|
glyphs are also not handled perfectly. Best practice is to use
|
|
escapes only when necessary in an ACIP->TMW conversion.
|
|
</p>
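<p>
For the ACIP->Unicode direction, the meaning of an escape can be
shown with a short Python sketch. It is an illustration only, not
the converter's code: the function name is invented, and the
sketch ignores the rule that escapes inside comments are left
alone.
</p>

<pre>
```python
import re

# A backslash, a literal 'u', then exactly four hexadecimal digits of either case.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def expand_escapes(acip):
    """Replace each four-digit Unicode escape with the character it names."""
    return ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), acip)


text = expand_escapes(r"\u0F40\u0f0B\u0F41\u0F0B\u0F42\u0F0B\u0F44\u0f0b")
# Show the codepoints rather than the glyphs, which need a Tibetan font.
print([hex(ord(character)) for character in text])
print(expand_escapes(r"\u0040"))
```
</pre>

<p>
The same hexadecimal arithmetic explains the ordinals that appear
in error messages: <tt>int("0F48", 16)</tt> is 3912.
</p>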
|
|
|
|
<p>
|
|
The Unicode character represented need not be a Tibetan one; for
|
|
example, {\u0040} produces the at sign, <tt>@</tt>.
|
|
</p>
|
|
|
|
<p>
|
|
The latest <a
|
|
href="http://iris.lib.virginia.edu/tibet/collections/langling/ewts/">Extended
|
|
Wylie Transliteration Scheme</a> standard has assigned private-use
|
|
area (PUA) Unicode codepoints to some TMW glyphs. ACIP
|
|
documents that have a <a href="#escapes">Unicode escape</a> in the
|
|
range U+F021 to U+F0FF, inclusive, are interpreted as intending
|
|
these TMW glyphs. ACIP->Unicode produces an error for such
|
|
an escape because it is font-dependent and not standard. Other
|
|
tools will likely not understand such Unicode, so the converter will
|
|
not produce it. If you want it in the output, it is there in
|
|
the error message.
|
|
</p>
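<p>
If you want to find such escapes in your input before converting,
a range check suffices. The Python below is a sketch with invented
names; it takes the four hexadecimal digits of an escape and tests
them against the inclusive range given above.
</p>

<pre>
```python
# Inclusive bounds of the private-use range described above.
TMW_PUA = range(0xF021, 0xF0FF + 1)


def is_tmw_pua(hex_digits):
    """Tell whether four hex digits, e.g. 'F021', name a codepoint in that range."""
    return int(hex_digits, 16) in TMW_PUA


for digits in ["F020", "F021", "f0ff", "F100", "0F40"]:
    print(digits, is_tmw_pua(digits))
```
</pre>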
|
|
|
|
|
|
<p>
|
|
Note well the <a href="#bugs">known bug</a> with regard to
|
|
whitespace in transliteration that follows a Unicode escape.
|
|
In large part, this bug affects characters that can be
|
|
transliterated by other, simpler, standard means.
|
|
</p>
|
|
|
|
<p>
|
|
If you do want to disable the use of Unicode escapes, <a
|
|
href="#sysprops">set the system property</a>
|
|
<tt>thdl.tib.text.disallow.unicode.character.escapes.in.acip</tt> to
|
|
<tt>true</tt>.
|
|
</p>
|
|
|
|
|
|
<a name="lex"></a><h2>Breaking a Text Up Into <i>tsheg bar</i>s</h2>
|
|
|
|
<p>
|
|
The ACIP->Tibetan converters all take ACIP transliteration as
|
|
input. The first step in conversion is to break up the input
|
|
into manageable pieces. (This is known as <i>lexical
|
|
analysis</i> in the context of programming languages, and you may
|
|
see the term in diagnostic messages though a linguist who studies
|
|
human language like Tibetan might balk at the term.) The
|
|
correct pieces in this case are <i>tsheg bar</i>s (in ACIP, {TSEG
|
|
BAR}), punctuation, comments, whitespace, folio markers, formatting
|
|
codes, etc. In this section, the intricacies of how the
|
|
converter does that will be laid bare. With luck, this will
|
|
help you understand why the converter treated one space character
|
|
(i.e., ' ', U+0020) as a <i>tsheg</i> and another as Tibetan
|
|
whitespace.
|
|
</p>
|
|
|
|
<p>
|
|
The Tibetan term <i>tsheg bar</i> refers to "the stuff between
|
|
the dots". In the ACIP {BKRA SHIS [# Notice that
|
|
this comment is embedded in the Tibetan greeting pronounced 'tashi
|
|
delay']BDE LEGS,}, there are four <i>tsheg bar</i>s, 'BKRA',
|
|
'SHIS', 'BDE', and 'LEGS'. In this case 'BDE' is literally
|
|
"between the dots"; i.e., it is sandwiched by two U+0F0B
|
|
characters (because comments are in a sense invisible). One of
|
|
the "dots" that touches 'LEGS' does not look like a dot --
|
|
it is a <i>shad</i>, U+0F0D. The lexical analyzer also finds
|
|
one comment, which will appear in a Latin typeface in the output,
|
|
and it finds four pieces of punctuation -- three <i>tsheg</i>s and a
|
|
<i>shad</i>.
|
|
</p>
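<p>
The breakdown of that greeting can be reproduced with a toy
lexical analyzer. The Python below is a deliberately naive sketch,
not the converter's lexer: it treats every space as a
<i>tsheg</i>, accepts any Latin letter in a <i>tsheg bar</i>
(real ACIP is stricter), and knows nothing of folio markers or
formatting codes. All names in it are invented.
</p>

<pre>
```python
import re
from collections import Counter

# Each entry pairs a token kind with the pattern that recognizes it.
# Comments are tried first so that their contents are never split up.
TOKEN_KINDS = [
    ("comment", re.compile(r"\[#[^\]]*\]")),
    ("shad", re.compile(r",")),
    ("tsheg", re.compile(r" ")),
    ("tsheg bar", re.compile(r"[A-Za-z'+-]+")),
]


def lex(acip):
    """Split ACIP text into (kind, text) tokens, failing on anything unknown."""
    tokens = []
    position = 0
    while position != len(acip):
        for kind, pattern in TOKEN_KINDS:
            match = pattern.match(acip, position)
            if match:
                tokens.append((kind, match.group()))
                position = match.end()
                break
        else:
            # No pattern matched at this offset.
            raise ValueError("cannot lex at offset %d" % position)
    return tokens


tokens = lex("BKRA SHIS [# a comment]BDE LEGS,")
print(Counter(kind for kind, text in tokens))
```
</pre>

<p>
For this input the sketch finds four <i>tsheg bar</i>s, three
<i>tsheg</i>s, one <i>shad</i>, and one comment, matching the
description above.
</p>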
|
|
|
|
<p>
|
|
The converter will not allow an illegal character into a <i>tsheg
|
|
bar</i>. For example, {jA} is an error and causes an error
|
|
message to appear in the output.
|
|
</p>
|
|
|
|
<p>
|
|
Now that the basic operation is clear from the above example, let's
|
|
cover the fine points of how standard ACIP is handled. We'll
|
|
also cover some non-standard constructs that appear commonly in
|
|
actual ACIP Release IV texts.
|
|
</p>
|
|
|
|
<p>
|
|
The first construct that deserves explanation is the line
|
|
break. By the ACIP standard, line breaks in the input do not
|
|
become line breaks in the output unless there are two line breaks in
|
|
the input. For example, the ACIP snippet below has only one
|
|
line break in the output although three line breaks appear in the
|
|
input:
|
|
</p>
|
|
|
|
<pre>
|
|
BKRA SHIS
|
|
BDE LEGS,
|
|
|
|
THUGS RJE CHE ... and so on ...
|
|
</pre>
|
|
|
|
<p>
|
|
One fine point is that the converter does not require a space before
|
|
a line break. If {SHIS} appears before a line break, the converter
|
|
inserts a space so that it's treated just like {SHIS } is
|
|
treated. This oddity is needed to convert real ACIP documents.
|
|
</p>
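<p>
The line-break rule can be approximated in Python as follows.
This is a rough sketch with an invented function name, and it
simplifies: it appends a space to every non-empty line that lacks
one, whereas the converter's actual treatment of spaces (after a
<i>shad</i>, or at the very end of the input, say) is subtler
than this.
</p>

<pre>
```python
def fold_line_breaks(acip):
    """Keep only double line breaks; join other lines, adding a space if needed."""
    paragraphs = []
    for block in acip.split("\n\n"):
        joined = ""
        for line in block.split("\n"):
            # Treat {SHIS} at the end of a line just like {SHIS }.
            if line and not line.endswith(" "):
                line += " "
            joined += line
        paragraphs.append(joined)
    return "\n".join(paragraphs)


sample = "BKRA SHIS\nBDE LEGS,\n\nTHUGS RJE CHE"
print(fold_line_breaks(sample))
```
</pre>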
|
|
|
|
<p>
|
|
Another fine point is that ACIP's {^} character "eats" a
|
|
following space or a newline. This is so that
|
|
{^ GONG SA } is treated identically to
|
|
{^GONG SA }.
|
|
</p>
|
|
|
|
<p>
|
|
Comments always appear in a Latin typeface. Comments are not
|
|
allowed just anywhere -- a comment cannot occur within a single
|
|
<i>tsheg bar</i>, for example, and it cannot appear between a
|
|
<i>tsheg bar</i> and the <i>tsheg</i> that follows it. That
|
|
is, {BD[#COMMENT]E} is not like {BDE}, and {BDE[#COMMENT] LEGS}
|
|
is not like {BDE LEGS} (though {BDE [#COMMENT]LEGS} is).
|
|
</p>
|
|
|
|
<p>
|
|
Corrections are interpreted as Tibetan, not English, by default, but
|
|
there is a built-in list of corrections that should appear in the
|
|
output in a Latin typeface. (Actually, any correction that
|
|
starts with a certain string will appear in a Latin typeface.)
|
|
The full list is the following:
|
|
</p>
|
|
|
|
<pre>
|
|
"LINE" // from KD0001I1.ACT
|
|
"DATA" // from KL0009I2.INC
|
|
"BLANK" // from KL0009I2.INC
|
|
"NOTE" // from R0001F.ACM
|
|
"alternate" // from R0018F.ACE
|
|
"02101-02150 missing" // from R1003A3.INC
|
|
"51501-51550 missing" // from R1003A52.ACT
|
|
"BRTAGS ETC" // from S0002N.ACT
|
|
"TSAN, ETC" // from S0015N.ACT
|
|
"SNYOMS, THROUGHOUT" // from S0016N.ACT
|
|
"KYIS ETC" // from S0019N.ACT
|
|
"MISSING" // from S0455M.ACT
|
|
"this" // from S6850I1B.ALT
|
|
"THIS" // from S0057M.ACT
|
|
</pre>
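<p>
Since the test is simply whether a correction starts with one of
these strings, it can be sketched in Python as below. The function
name is invented, and the sketch assumes it is handed the text of
the correction with whatever delimiters surround it already
removed.
</p>

<pre>
```python
# Prefixes copied from the list above.
LATIN_PREFIXES = (
    "LINE", "DATA", "BLANK", "NOTE", "alternate",
    "02101-02150 missing", "51501-51550 missing",
    "BRTAGS ETC", "TSAN, ETC", "SNYOMS, THROUGHOUT",
    "KYIS ETC", "MISSING", "this", "THIS",
)


def is_latin_correction(correction_text):
    """Tell whether a correction should be rendered in a Latin typeface."""
    return correction_text.startswith(LATIN_PREFIXES)


print(is_latin_correction("NOTE: see the colophon"))
print(is_latin_correction("BDE LEGS"))
```
</pre>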
|
|
|
|
<p>
|
|
Somewhat related is the converter's treatment of a few oddball
|
|
comments. The oddity is that these comments use the syntax
|
|
{[COMMENT]} rather than the standard syntax {[#COMMENT]}. The
|
|
converter will treat the following as comments:
|
|
</p>
|
|
|
|
<pre>
|
|
From S5274I.ACT:
|
|
"[FIRST]"
|
|
From S5274I.ACT:
|
|
"[SECOND]"
|
|
From S0216M.ACT:
|
|
"[Additional verses added by Khen Rinpoche here are]"
|
|
From S0216M.ACT:
|
|
"[ADDENDUM: The text of]"
|
|
From S0216M.ACT:
|
|
"[END OF ADDENDUM]"
|
|
From S0216M.ACT:
|
|
"[Some of the verses added here by Khen Rinpoche include:]"
|
|
From S0216M.ACT (note the typo):
|
|
"[Note that, in the second verse, the {YUL LJONG} was orignally {GANG LJONG},
|
|
and is now recited this way since the ceremony is not only taking place in Tibet.]"
|
|
From S6954E1.ACT:
|
|
"[text missing]"
|
|
From TD3817I.INC:
|
|
"[INCOMPLETE]"
|
|
From S0935m.act:
|
|
"[MISSING PAGE]"
|
|
From S0975I.INC:
|
|
"[MISSING FOLIO]"
|
|
From S0839D1I.INC:
|
|
"[UNCLEAR LINE]"
|
|
From SE6260A.INC:
|
|
"[THE FOLLOWING TEXT HAS INCOMPLETE SECTIONS, WHICH ARE ON ORDER]"
|
|
From SE6260A.INC:
|
|
"[@DATA INCOMPLETE HERE]"
|
|
From SE6260A.INC:
|
|
"[@DATA MISSING HERE]"
|
|
From TD4035I.INC:
|
|
"[LINE APPARENTLY MISSING THIS PAGE]"
|
|
From TD4226I2.INC:
|
|
"[DATA INCOMPLETE HERE]"
|
|
To be consistent with the above:
|
|
"[DATA MISSING HERE]"
|
|
From S0018N.ACT:
|
|
"[FOLLOWING SECTION WAS NOT AVAILABLE WHEN THIS EDITION WAS
|
|
PRINTED, AND IS SUPPLIED FROM ANOTHER, PROBABLY THE ORIGINAL:]"
|
|
From S0018N.ACT:
|
|
"[THESE PAGE NUMBERS RESERVED IN THIS EDITION FOR PAGES
|
|
MISSING FROM ORIGINAL ON WHICH IT WAS BASED]"
|
|
From S0018N.ACT:
|
|
"[PAGE NUMBERS RESERVED FROM THIS EDITION FOR MISSING
|
|
SECTION SUPPLIED BY PRECEDING]"
|
|
From S0057M.ACT:
|
|
"[SW: OK]"
|
|
From S0057M.ACT:
|
|
"[m:ok]"
|
|
From S0057M.ACT:
|
|
"[A FIRST ONE
|
|
MISSING HERE?]"
|
|
From S0195A1.INC:
|
|
"[THE INITIAL PART OF THIS TEXT WAS INPUT BY THE SERA MEY LIBRARY IN
|
|
TIBETAN FONT AND NEEDS TO BE REDONE BY DOUBLE INPUT]"
|
|
</pre>
|
|
|
|
<p>
|
|
The converter also supports several non-standard folio
|
|
markers. A review of ACIP Release IV texts determined that the
|
|
following types of folio markers can appear:
|
|
</p>
|
|
|
|
<pre>
|
|
@001
|
|
@001A
|
|
@001B
|
|
@01A.3
|
|
@012A.3
|
|
@[07B]
|
|
@00007B
|
|
@00007
|
|
@B00007
|
|
@[00007A]
|
|
</pre>
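<p>
One way to recognize all ten shapes is a single regular
expression. The pattern in this Python sketch is a generalization
from the samples above, not the rule the converter itself uses; it
may accept or reject markers that the converter treats
differently.
</p>

<pre>
```python
import re

# Either a bracketed form such as @[07B], or an optional side letter,
# digits, an optional side letter, and an optional ".line" suffix.
FOLIO_MARKER = re.compile(
    r"@(?:\[[0-9]+[AB]?\]|[AB]?[0-9]+[AB]?(?:\.[0-9]+)?)"
)

SAMPLES = [
    "@001", "@001A", "@001B", "@01A.3", "@012A.3",
    "@[07B]", "@00007B", "@00007", "@B00007", "@[00007A]",
]

for marker in SAMPLES:
    print(marker, bool(FOLIO_MARKER.fullmatch(marker)))
```
</pre>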
|
|
|
|
<p>
|
|
Similarly, to support real ACIP Release IV texts, the converter
|
|
treats {[DD1]}, {[DD2]}, {[ DD ]}, and {[DDD]} just like {[DD]}
|
|
(which is specified in the ACIP standard). It treats {[ BP ]}
|
|
and {[BLANK PAGE]} just like {[BP]}, also.
|
|
</p>
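<p>
These equivalences amount to a small lookup table. The Python
sketch below, with invented names, shows the idea; it models only
the variants listed above.
</p>

<pre>
```python
# Non-standard spellings mapped to the standard code each is treated like.
EQUIVALENT_CODES = {
    "[DD1]": "[DD]",
    "[DD2]": "[DD]",
    "[ DD ]": "[DD]",
    "[DDD]": "[DD]",
    "[ BP ]": "[BP]",
    "[BLANK PAGE]": "[BP]",
}


def normalize_code(code):
    """Return the standard code for a known variant, else the code unchanged."""
    return EQUIVALENT_CODES.get(code, code)


for code in ["[DD1]", "[ DD ]", "[BLANK PAGE]", "[DD]"]:
    print(code, "->", normalize_code(code))
```
</pre>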
|
|
|
|
<p>
|
|
The lists above were created by a most fallible process of reviewing
|
|
a large number of ACIP Release IV texts. Your suggestions for
|
|
additions to these lists are highly valued; please contact <a
|
|
href="mailto:thdltools-devel@lists.sourceforge.net">the
|
|
developers</a>.
|
|
</p>
|
|
|
|
<p>
|
|
FIXME: describe when the converter treats a space as a <i>tsheg</i> and when a space is Tibetan whitespace. Describe how a tsheg does not appear after {KA} and {GA} with most vowels, describe the handling of {NGA,} as {NGA ,}. Talk about dzongkha vs. tibetan when it comes to a <i>tsheg</i> at the end of a string of <i>tsheg bar</i>s. Describe treatment of final line break or lack thereof. Warn users to watch out for lines that end with {-}. Describe treatment of {.} in certain contexts as U+0F0C. Etc.
|
|
|
|
|
|
<!-- <h1>DLC</h1>
|
|
|
|
<pre>
|
|
DLC warn on BUR'ANG because BUR'ANG and BUR-'ANG both appear once in ACIP files. Tell RC. DLC: subst it!
|
|
|
|
lex: Spaces
|
|
DLC configurable tibetan spaces bhutan vs. tibet
|
|
{NGA,} -> {NGA ,}
|
|
|
|
DLC hyphens and ' end lines but shouldn't, eh?
|
|
|
|
DLC ACIP {.} is just an error! Have the error message mention {\u0F0C} for those desiring a non-breaking tsheg.
|
|
|
|
DLC whitespace -- newlines [final newline DLC?], spaces
|
|
|
|
</pre> -->
|
|
|
|
</p>
<a name="parse"></a><h2>Parsing <i>tsheg bar</i>s: Greedy Stacking and
|
|
Nativeness</h2>
|
|
|
|
<p>
|
|
This section is a technical reference, detailed enough that you can
fully understand how the converter decides which Unicode characters
or TMW glyphs to use for a given <i>tsheg bar</i>.  The problem of
<a href="#lex">breaking up a text into <i>tsheg bar</i>s</a> is a
separate issue; this section describes what happens to a <i>tsheg
bar</i> after it has been split off from the text.
|
|
</p>
|
|
|
|
<a name="native"></a>
|
|
<p>
|
|
The ACIP->Tibetan converters have a notion of <i>nativeness</i>.
Each <i>tsheg bar</i> is either native Tibetan or non-native.  For
example, Buddhist texts written in Tibetan often contain Sanskrit
mantras spelled out in Tibetan characters.  This "Tibetanized
Sanskrit" is non-native.  The converter, however, judges each
<i>tsheg bar</i> on its own (here, take "tsheg bar" somewhat
literally to mean the characters delimited by punctuation and
whitespace), so it sees some of the <i>tsheg bar</i>s in a mantra as
native and others as non-native.  For example, the <i>tsheg bar</i>
{MA } appears in some mantras, where it is in fact non-native, yet
the converter treats {MA } as native in all contexts.  Thus,
"native" is a technical term here, with a slightly different meaning
than usual.
|
|
</p>
|
|
|
|
<p>
|
|
The idea of nativeness is important because it affects how the
|
|
converter treats a <i>tsheg bar</i>. In ACIP transliteration,
|
|
the rule is that consonants stack up until punctuation, whitespace,
|
|
or a vowel appears. For example, {RDZYA} is equivalent to
|
|
{R+DZ+YA}. ({DZA} always means the letter {DZA} itself, never
|
|
{D+ZA}.) But this greedy stacking does not apply to {SOGS},
|
|
which is equivalent to {SOG-S}, not {SOG+S}. Why not?
|
|
Because {SOGS} is a native <i>tsheg bar</i> where GA is the suffix
|
|
and SA is the postsuffix. Similarly, {GNAD} is {G-NAD}, not
|
|
{G+NAD}. Why? Because GA is a prefix in this native
|
|
Tibetan <i>tsheg bar</i>.
|
|
</p>
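<p>
The greedy stacking rule on its own can be sketched in a few lines of
Python.  This is a simplified illustration, not the converter's code:
the function name <tt>greedy_stack</tt> is made up for this example,
only the Tibetan consonants and the plain vowels are recognized, and
{'} is always treated as a consonant.  The sketch shows what
<em>pure</em> greedy stacking yields; for native <i>tsheg bar</i>s
such as {SOGS} and {GNAD} the converter overrides this result as
described above.
</p>

```python
# Two-letter consonants come first so that, e.g., DZ is never read as D+Z.
CONSONANTS = [
    "KH", "NG", "CH", "NY", "TH", "PH", "TZ", "TS", "DZ", "ZH", "SH",
    "K", "G", "C", "J", "T", "D", "N", "P", "B", "M", "W", "V",
    "Z", "'", "Y", "R", "L", "S", "H",
]
VOWELS = ["OO", "EE", "A", "I", "U", "E", "O", "i"]


def greedy_stack(tsheg_bar):
    """Join consecutive consonants with '+' until a vowel appears."""
    stacks = []   # finished stacks, in order
    run = []      # consonants collected for the stack being built
    i = 0
    while i < len(tsheg_bar):
        for consonant in CONSONANTS:
            if tsheg_bar.startswith(consonant, i):
                run.append(consonant)
                i += len(consonant)
                break
        else:
            for vowel in VOWELS:
                if tsheg_bar.startswith(vowel, i):
                    # A vowel closes the current stack.
                    stacks.append("+".join(run) + vowel)
                    run = []
                    i += len(vowel)
                    break
            else:
                raise ValueError("unexpected character: " + tsheg_bar[i])
    if run:
        # Trailing consonants with no explicit vowel still stack together.
        stacks.append("+".join(run))
    return "".join(stacks)


if __name__ == "__main__":
    for example in ["RDZYA", "SOGS", "GNAD"]:
        print(example, "->", greedy_stack(example))
```

<p>
For {RDZYA} this yields {R+DZ+YA}, which is what the converter
produces.  For {SOGS} and {GNAD} it yields {SOG+S} and {G+NAD}, which
are exactly the readings the nativeness rules exist to prevent.
</p>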
|
|
|
|
<p>
|
|
In this section, we will illustrate the inner workings of this
|
|
aspect of the converter. You will be able to determine which
|
|
snippets of transliteration the converter considers to be native
|
|
<i>tsheg bar</i>s, where greedy stacking does not apply except for
|
|
the root stack, and which snippets are non-native, and thus wholly
|
|
subject to greedy stacking.
|
|
</p>
|
|
|
|
<h3>Anatomy of a Native <i>tsheg bar</i></h3>
|
|
|
|
<p>
|
|
First, the <a href="#lex">lexical analyzer</a> ensures that only the
|
|
Tibetan and Sanskrit consonants, the vowels {A}, {I}, {U}, {E}, {O},
|
|
{OO}, {EE}, {i}, {'A}, {'I}, {'U}, {'E}, {'O}, {'OO}, {'EE}, and
|
|
{'i}, and the adornments {m} and {:} are allowed in a <i>tsheg
|
|
bar</i>.
|
|
</p>
|
|
|
|
<p>
|
|
As far as the converter is concerned, a native <i>tsheg bar</i>
|
|
consists of an optional prefix, a native root stack, an optional
|
|
suffix, an optional postsuffix (also known as a secondary suffix)
|
|
that may only be present if a suffix is present, and zero or more
|
|
<i>appendages</i> (my term, created because I don't know what a
|
|
grammarian calls such a thing). An appendage is one of the
|
|
following stack sequences:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>{'E}</li>
|
|
<li>{'I}</li>
|
|
<li>{'O}</li>
|
|
<li>{'U}</li>
|
|
<li>{'US}</li>
|
|
<li>{'UR}</li>
|
|
<li>{'UM}</li>
|
|
<li>{'ONG}</li>
|
|
<li>{'ONGS}</li>
|
|
<li>{'OS}</li>
|
|
<li>{'IS}</li>
|
|
<li>{'UNG}</li>
|
|
<li>{'ANG}</li>
|
|
<li>{'AM}</li>
|
|
</ul>
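<p>
To make the appendage list concrete, here is a small Python sketch
that splits the tail of a <i>tsheg bar</i> into appendages.  The
function name <tt>split_appendages</tt> is hypothetical, and the
sketch assumes that any {-} disambiguators may simply be dropped
before matching; the converter itself may handle them differently.
</p>

```python
import re

# The fourteen appendages listed above.
APPENDAGES = {
    "'E", "'I", "'O", "'U", "'US", "'UR", "'UM",
    "'ONG", "'ONGS", "'OS", "'IS", "'UNG", "'ANG", "'AM",
}


def split_appendages(tail):
    """Split a tail such as "'U'I'O" into appendages.

    Returns a list of appendages (possibly empty), or None if the tail
    is not made up entirely of appendages.
    """
    tail = tail.replace("-", "")            # assumption: ignore disambiguators
    pieces = re.findall(r"'[^']*", tail)    # each piece starts at an apostrophe
    if "".join(pieces) != tail:
        return None                         # something precedes the first apostrophe
    if any(piece not in APPENDAGES for piece in pieces):
        return None
    return pieces


if __name__ == "__main__":
    print(split_appendages("'U'I'O"))
    print(split_appendages("-'UR-'UNG-'O"))
    print(split_appendages("'AX"))
```

<p>
Because every appendage begins with {'} and contains no other {'},
splitting at each {'} is unambiguous, so no backtracking is needed in
this simplified setting.
</p>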
|
|
|
|
<p>
|
|
A <i>tsheg bar</i> is non-native if it has a non-native root stack
|
|
or if it contains the {:} character. Any vowel is allowed on a
|
|
native root stack, even {'EEm}, {i}, or the like.
|
|
</p>
|
|
<p>
|
|
The rule about native root stacks is important, for example, in
|
|
determining that {KTYAMS} is {K+T+YAM+SA} instead of {K+T+YAMASA}
|
|
(because K+T+YA is not a native stack). Another example is
|
|
{GNVA}, which is treated like {G+N+VA}, not {G-N+VA}, even though
|
|
{GNA} is treated like {G-NA} because NA can take a GA prefix.
|
|
The complete list of native stacks is the following:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>KA</li>
|
|
<li>KHA</li>
|
|
<li>GA</li>
|
|
<li>NGA</li>
|
|
<li>CA</li>
|
|
<li>CHA</li>
|
|
<li>JA</li>
|
|
<li>NYA</li>
|
|
<li>TA</li>
|
|
<li>THA</li>
|
|
<li>DA</li>
|
|
<li>NA</li>
|
|
<li>PA</li>
|
|
<li>PHA</li>
|
|
<li>BA</li>
|
|
<li>MA</li>
|
|
<li>TZA</li>
|
|
<li>TSA</li>
|
|
<li>DZA</li>
|
|
<li>WA</li>
|
|
<li>ZHA</li>
|
|
<li>ZA</li>
|
|
<li>'A</li>
|
|
<li>YA</li>
|
|
<li>RA</li>
|
|
<li>LA</li>
|
|
<li>SHA</li>
|
|
<li>SA</li>
|
|
<li>HA</li>
|
|
<li>AA</li>
|
|
<li>R+KA (RKA)</li>
|
|
<li>R+GA (RGA)</li>
|
|
<li>R+NGA (RNGA)</li>
|
|
<li>R+JA (RJA)</li>
|
|
<li>R+NYA (RNYA)</li>
|
|
<li>R+TA (RTA)</li>
|
|
<li>R+DA (RDA)</li>
|
|
<li>R+NA (RNA)</li>
|
|
<li>R+BA (RBA)</li>
|
|
<li>R+MA (RMA)</li>
|
|
<li>R+TZA (RTZA)</li>
|
|
<li>R+DZA (RDZA)</li>
|
|
<li>L+KA (LKA)</li>
|
|
<li>L+GA (LGA)</li>
|
|
<li>L+NGA (LNGA)</li>
|
|
<li>L+CA (LCA)</li>
|
|
<li>L+JA (LJA)</li>
|
|
<li>L+TA (LTA)</li>
|
|
<li>L+DA (LDA)</li>
|
|
<li>L+PA (LPA)</li>
|
|
<li>L+BA (LBA)</li>
|
|
<li>L+HA (LHA)</li>
|
|
<li>S+KA (SKA)</li>
|
|
<li>S+GA (SGA)</li>
|
|
<li>S+NGA (SNGA)</li>
|
|
<li>S+NYA (SNYA)</li>
|
|
<li>S+TA (STA)</li>
|
|
<li>S+DA (SDA)</li>
|
|
<li>S+NA (SNA)</li>
|
|
<li>S+PA (SPA)</li>
|
|
<li>S+BA (SBA)</li>
|
|
<li>S+MA (SMA)</li>
|
|
<li>S+TZA (STZA)</li>
|
|
<li>K+VA (KVA)</li>
|
|
<li>KH+VA (KHVA)</li>
|
|
<li>G+VA (GVA)</li>
|
|
<li>C+VA (CVA)</li>
|
|
<li>NY+VA (NYVA)</li>
|
|
<li>T+VA (TVA)</li>
|
|
<li>D+VA (DVA)</li>
|
|
<li>TZ+VA (TZVA)</li>
|
|
<li>TS+VA (TSVA)</li>
|
|
<li>ZH+VA (ZHVA)</li>
|
|
<li>Z+VA (ZVA)</li>
|
|
<li>R+VA (RVA)</li>
|
|
<li>SH+VA (SHVA)</li>
|
|
<li>S+VA (SVA)</li>
|
|
<li>H+VA (HVA)</li>
|
|
<li>K+YA (KYA)</li>
|
|
<li>KH+YA (KHYA)</li>
|
|
<li>G+YA (GYA)</li>
|
|
<li>P+YA (PYA)</li>
|
|
<li>PH+YA (PHYA)</li>
|
|
<li>B+YA (BYA)</li>
|
|
<li>M+YA (MYA)</li>
|
|
<li>K+RA (KRA)</li>
|
|
<li>KH+RA (KHRA)</li>
|
|
<li>G+RA (GRA)</li>
|
|
<li>T+RA (TRA)</li>
|
|
<li>TH+RA (THRA)</li>
|
|
<li>D+RA (DRA)</li>
|
|
<li>P+RA (PRA)</li>
|
|
<li>PH+RA (PHRA)</li>
|
|
<li>B+RA (BRA)</li>
|
|
<li>M+RA (MRA)</li>
|
|
<li>SH+RA (SHRA)</li>
|
|
<li>S+RA (SRA)</li>
|
|
<li>H+RA (HRA)</li>
|
|
<li>K+LA (KLA)</li>
|
|
<li>G+LA (GLA)</li>
|
|
<li>B+LA (BLA)</li>
|
|
<li>Z+LA (ZLA)</li>
|
|
<li>R+LA (RLA)</li>
|
|
<li>S+LA (SLA)</li>
|
|
<li>R+K+YA (RKYA)</li>
|
|
<li>R+G+YA (RGYA)</li>
|
|
<li>R+M+YA (RMYA)</li>
|
|
<li>R+G+VA (RGVA)</li>
|
|
<li>R+TZ+VA (RTZVA)</li>
|
|
<li>S+K+YA (SKYA)</li>
|
|
<li>S+G+YA (SGYA)</li>
|
|
<li>S+P+YA (SPYA)</li>
|
|
<li>S+B+YA (SBYA)</li>
|
|
<li>S+M+YA (SMYA)</li>
|
|
<li>S+K+RA (SKRA)</li>
|
|
<li>S+G+RA (SGRA)</li>
|
|
<li>S+N+RA (SNRA)</li>
|
|
<li>S+P+RA (SPRA)</li>
|
|
<li>S+B+RA (SBRA)</li>
|
|
<li>S+M+RA (SMRA)</li>
|
|
<li>G+R+VA (GRVA)</li>
|
|
<li>D+R+VA (DRVA)</li>
|
|
<li>PH+Y+VA (PHYVA)</li>
|
|
</ul>
|
|
|
|
<p>
|
|
(Some would argue that LVA is notably absent. It is seen in
|
|
ACIP Buddhist texts in {AELVA}, {LVAm}, {LVU}, {LVUN}, {LVAR},
|
|
{LVE}, {LVANG}, and {LVA}. Greedy stacking affects none of
|
|
these <i>tsheg bar</i>s' parsing, however.)
|
|
</p>
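<p>
The list of native stacks above amounts to a lookup table.  The
following Python sketch transcribes it into a set, writing each stack
in the explicit {+} notation used in the list.  The names
<tt>NATIVE_STACKS</tt> and <tt>is_native_stack</tt> are illustrative
only and do not come from the converter's source.
</p>

```python
# Native stacks, transcribed from the list in this document.
NATIVE_STACKS = set("""
KA KHA GA NGA CA CHA JA NYA TA THA DA NA PA PHA BA MA TZA TSA DZA WA
ZHA ZA 'A YA RA LA SHA SA HA AA
R+KA R+GA R+NGA R+JA R+NYA R+TA R+DA R+NA R+BA R+MA R+TZA R+DZA
L+KA L+GA L+NGA L+CA L+JA L+TA L+DA L+PA L+BA L+HA
S+KA S+GA S+NGA S+NYA S+TA S+DA S+NA S+PA S+BA S+MA S+TZA
K+VA KH+VA G+VA C+VA NY+VA T+VA D+VA TZ+VA TS+VA ZH+VA Z+VA R+VA
SH+VA S+VA H+VA
K+YA KH+YA G+YA P+YA PH+YA B+YA M+YA
K+RA KH+RA G+RA T+RA TH+RA D+RA P+RA PH+RA B+RA M+RA SH+RA S+RA H+RA
K+LA G+LA B+LA Z+LA R+LA S+LA
R+K+YA R+G+YA R+M+YA R+G+VA R+TZ+VA
S+K+YA S+G+YA S+P+YA S+B+YA S+M+YA
S+K+RA S+G+RA S+N+RA S+P+RA S+B+RA S+M+RA
G+R+VA D+R+VA PH+Y+VA
""".split())


def is_native_stack(stack):
    """Return True if the stack, written like 'R+K+YA', is in the list."""
    return stack in NATIVE_STACKS


if __name__ == "__main__":
    for stack in ["R+K+YA", "K+T+YA", "N+VA", "L+VA"]:
        print(stack, is_native_stack(stack))
```

<p>
With this table, K+T+YA and N+VA are rejected, which is why {KTYAMS}
and {GNVA} are treated as non-native in the examples above.
</p>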
|
|
|
|
<a name="prefix"></a>
|
|
<p>
|
|
Not every consonant can serve as a prefix, suffix, or postsuffix.
Only the five prefixes (GA, DA, BA, MA, 'A), ten suffixes (GA, NGA,
DA, NA, BA, MA, 'A, RA, LA, SA), and two postsuffixes (DA, SA) that
every Tibetan student knows are allowed, and none of them may carry
a vowel.  (In {LE'U}, {'} is not a suffix; it is part of an
appendage.)  Moreover, each prefix may appear only with certain
root stacks.  These prefix rules matter because they govern how
<i>tsheg bar</i>s are parsed.  For example, {GNA} is parsed like
{G-NA}, because NA takes a GA prefix.  But {GPA} is parsed like
{G+PA}, because PA does not take a GA prefix.
|
|
</p>
|
|
|
|
<p>
|
|
Prefix rules are a topic of some controversy; different grammars
|
|
give different lists of prefix rules. For a converter, it is
|
|
important that the converter's knowledge of prefix rules matches the
|
|
knowledge of the person who typed in the ACIP transliteration, not
|
|
that the converter agrees with a grammarian. For example, if
|
|
the input technician thought that PA could take a GA prefix, then
|
|
the converter will produce {G+PA} when {G-PA} was intended.
|
|
For this reason, the converter can produce a warning every time a
prefix rule prohibits the treatment of one of the five prefixes as
a prefix.  For example, {GPA} produces this warning.
|
|
However, {GNA} produces no warning, because the converter assumes
|
|
that it is unlikely that an input technician would enter {GNA} upon
|
|
seeing {G+NA}. Part of the reason for this assumption is that
|
|
the <i>Asian Classics Input Project Entry Operator Transcription
|
|
Chart</i> as of Spring, 1993, explicitly enumerates the following
|
|
cases for special treatment by input operators:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>{BDA'} vs. {B+DA}</li>
|
|
<li>{DBANG} vs. {D+BA}</li>
|
|
<li>{DGA'} vs. {D+GA}</li>
|
|
<li>{DGRA} vs. {D+GRA}</li>
|
|
<li>{DGYES} vs. {D+GYA}</li>
|
|
<li>{DMAR} vs. {D+MA}</li>
|
|
<li>{GDA'} vs. {G+DA}</li>
|
|
<li>{GNAD} vs. {G+NA}</li>
|
|
<li>{MNA'} vs. {M+NA}</li>
|
|
</ul>
|
|
|
|
<p>
|
|
Regardless, for best results, you should ensure that the input
|
|
technician's knowledge of prefix rules matches the converter's
|
|
knowledge. The following are the legal combinations of prefix
|
|
and root stack in the converter's eyes:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
The BA prefix may occur with any of the following stacks:
|
|
<ul>
|
|
<li>KA</li>
|
|
<li>SA</li>
|
|
<li>CA</li>
|
|
<li>TA</li>
|
|
<li>TZA</li>
|
|
<li>GA</li>
|
|
<li>DA</li>
|
|
<li>ZHA</li>
|
|
<li>ZA</li>
|
|
<li>SHA</li>
|
|
<li>K+YA (KYA)</li>
|
|
<li>G+YA (GYA)</li>
|
|
<li>K+RA (KRA)</li>
|
|
<li>G+RA (GRA)</li>
|
|
<li>S+RA (SRA)</li>
|
|
<li>G+LA (GLA)</li>
|
|
<li>K+LA (KLA)</li>
|
|
<li>Z+LA (ZLA)</li>
|
|
<li>R+LA (RLA)</li>
|
|
<li>S+LA (SLA)</li>
|
|
<li>S+KA (SKA)</li>
|
|
<li>S+GA (SGA)</li>
|
|
<li>S+NGA (SNGA)</li>
|
|
<li>S+NYA (SNYA)</li>
|
|
<li>S+TA (STA)</li>
|
|
<li>S+DA (SDA)</li>
|
|
<li>S+NA (SNA)</li>
|
|
<li>S+TZA (STZA)</li>
|
|
<li>R+KA (RKA)</li>
|
|
<li>R+GA (RGA)</li>
|
|
<li>R+NGA (RNGA)</li>
|
|
<li>R+JA (RJA)</li>
|
|
<li>R+NYA (RNYA)</li>
|
|
<li>R+TA (RTA)</li>
|
|
<li>R+DA (RDA)</li>
|
|
<li>R+NA (RNA)</li>
|
|
<li>R+TZA (RTZA)</li>
|
|
<li>R+DZA (RDZA)</li>
|
|
<li>L+CA (LCA)</li>
|
|
<li>L+TA (LTA)</li>
|
|
<li>L+DA (LDA)</li>
|
|
<li>R+K+YA (RKYA)</li>
|
|
<li>R+G+YA (RGYA)</li>
|
|
<li>S+K+YA (SKYA)</li>
|
|
<li>S+G+YA (SGYA)</li>
|
|
<li>S+K+RA (SKRA)</li>
|
|
<li>S+G+RA (SGRA)</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
The GA prefix may occur with any of the following stacks:
|
|
<ul>
|
|
<li>CA</li>
|
|
<li>DA</li>
|
|
<li>NA</li>
|
|
<li>NYA</li>
|
|
<li>SA</li>
|
|
<li>SHA</li>
|
|
<li>TA</li>
|
|
<li>TZA</li>
|
|
<li>YA</li>
|
|
<li>ZA</li>
|
|
<li>ZHA</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
The 'A prefix may occur with any of the following stacks:
|
|
<ul>
|
|
<li>GA</li>
|
|
<li>JA</li>
|
|
<li>DA</li>
|
|
<li>BA</li>
|
|
<li>DZA</li>
|
|
<li>KHA</li>
|
|
<li>CHA</li>
|
|
<li>THA</li>
|
|
<li>PHA</li>
|
|
<li>TSA</li>
|
|
<li>PH+YA (PHYA)</li>
|
|
<li>B+YA (BYA)</li>
|
|
<li>KH+YA (KHYA)</li>
|
|
<li>G+YA (GYA)</li>
|
|
<li>B+RA (BRA)</li>
|
|
<li>KH+RA (KHRA)</li>
|
|
<li>G+RA (GRA)</li>
|
|
<li>D+RA (DRA)</li>
|
|
<li>PH+RA (PHRA)</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
The MA prefix may occur with any of the following stacks:
|
|
<ul>
|
|
<li>KHA</li>
|
|
<li>GA</li>
|
|
<li>CHA</li>
|
|
<li>JA</li>
|
|
<li>THA</li>
|
|
<li>TSA</li>
|
|
<li>DA</li>
|
|
<li>DZA</li>
|
|
<li>NGA</li>
|
|
<li>NYA</li>
|
|
<li>NA</li>
|
|
<li>KH+YA (KHYA)</li>
|
|
<li>G+YA (GYA)</li>
|
|
<li>KH+RA (KHRA)</li>
|
|
<li>G+RA (GRA)</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
The DA prefix may occur with any of the following stacks:
|
|
<ul>
|
|
<li>BA</li>
|
|
<li>GA</li>
|
|
<li>KA</li>
|
|
<li>MA</li>
|
|
<li>NGA</li>
|
|
<li>PA</li>
|
|
<li>B+RA (BRA)</li>
|
|
<li>B+YA (BYA)</li>
|
|
<li>G+RA (GRA)</li>
|
|
<li>G+YA (GYA)</li>
|
|
<li>K+RA (KRA)</li>
|
|
<li>K+YA (KYA)</li>
|
|
<li>M+YA (MYA)</li>
|
|
<li>P+RA (PRA)</li>
|
|
<li>P+YA (PYA)</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
|
|
<p>
|
|
In the above list, the presence of wa-zur (ACIP {V}) does not
|
|
disallow a prefix-root combination; nor does the presence of any
|
|
vowel, even {'EEm}. The presence of {:} does disallow
|
|
prefix-root combinations; e.g., {GN'EEm} is {G-N'EEm}, but {GNA:} is
|
|
{G+NA:}. ({GNVA} is parsed as {G+N+VA} not because NVA cannot
|
|
take a GA prefix, but because NVA is not a native stack.)
|
|
</p>
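<p>
The prefix-root combinations listed above can likewise be sketched as
a lookup table in Python.  The names <tt>PREFIX_RULES</tt>,
<tt>takes_prefix</tt>, and <tt>join_prefix</tt> are assumptions made
for this example, and the sketch deliberately ignores wa-zur, vowels
other than {A}, and {:}, all of which the converter handles as
described above.
</p>

```python
# Legal prefix/root-stack combinations, transcribed from this document.
PREFIX_RULES = {
    "BA": set("""
        KA SA CA TA TZA GA DA ZHA ZA SHA
        K+YA G+YA K+RA G+RA S+RA G+LA K+LA Z+LA R+LA S+LA
        S+KA S+GA S+NGA S+NYA S+TA S+DA S+NA S+TZA
        R+KA R+GA R+NGA R+JA R+NYA R+TA R+DA R+NA R+TZA R+DZA
        L+CA L+TA L+DA
        R+K+YA R+G+YA S+K+YA S+G+YA S+K+RA S+G+RA
    """.split()),
    "GA": set("CA DA NA NYA SA SHA TA TZA YA ZA ZHA".split()),
    "'A": set("""
        GA JA DA BA DZA KHA CHA THA PHA TSA
        PH+YA B+YA KH+YA G+YA B+RA KH+RA G+RA D+RA PH+RA
    """.split()),
    "MA": set("""
        KHA GA CHA JA THA TSA DA DZA NGA NYA NA
        KH+YA G+YA KH+RA G+RA
    """.split()),
    "DA": set("""
        BA GA KA MA NGA PA
        B+RA B+YA G+RA G+YA K+RA K+YA M+YA P+RA P+YA
    """.split()),
}


def takes_prefix(prefix, root_stack):
    """Return True if root_stack (e.g. 'S+K+YA') may take the given prefix."""
    return root_stack in PREFIX_RULES.get(prefix, set())


def join_prefix(prefix, root_stack):
    """Show how a leading prefix letter combines with the root stack.

    '-' marks a true prefix; '+' means the letter stacks onto the root.
    """
    letter = prefix[:-1]   # drop the inherent A: 'GA' -> 'G'
    separator = "-" if takes_prefix(prefix, root_stack) else "+"
    return letter + separator + root_stack


if __name__ == "__main__":
    print(join_prefix("GA", "NA"))
    print(join_prefix("GA", "PA"))
```

<p>
Under these assumptions, the pair (GA, NA) gives {G-NA} and the pair
(GA, PA) gives {G+PA}, matching the examples given earlier.
</p>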
|
|
|
|
<p>
|
|
The converter will allow any suffix to go with any native root or
|
|
prefix-root combination; it will allow any postsuffix to follow any
|
|
suffix. It will allow any appendage on any native <i>tsheg
|
|
bar</i>.
|
|
</p>
|
|
|
|
<p>
|
|
For example, {SOGS}, {BSOGS}, {BS'EEmGS}, {LE'U'I'O} and
|
|
{BSKYABS-'UR-'UNG-'O} are all native <i>tsheg bar</i>s in the
|
|
converter's eyes. Note the need for disambiguation: {PAM-'AM}
|
|
is a native <i>tsheg bar</i>, but {PAM'AM}, which parses as the
|
|
three stacks {PA}, {M'A}, and {MA}, is not.  (In practice,
appendages rarely occur after suffixes.  {BUR-'ANG} appears at
least once in ACIP files and {DGA'-'AM} appears at least twice, but
these may be typos.  The converter allows such combinations,
though.  It considers {BIR'U} and {WAN'U} (which also occur, but
only very rarely) to be non-native, however, and thus treats {'} as
U+0F71 (subscribed) and not U+0F60 (full form) in each case.)
|
|
</p>
|
|
|
|
<p>
|
|
Note a fine point. When turning a <i>tsheg bar</i> into
|
|
Tibetan, the ACIP->Tibetan converters assume that subjoined YA
|
|
and RA consonants are not fixed-form -- not U+0FBB and U+0FBC -- but
|
|
rather are the usual subjoined forms U+0FB1 and U+0FB2. The
|
|
only exceptions are the stacks R+Y, Y+Y, and n+d+Y, which are known
|
|
to have fixed-form subjoined YA, and the stacks n+d+R+Y (where RA
|
|
but not YA is full-form) and K+sh+R, which are known to have
|
|
fixed-form subjoined RA. Wa-zur, U+0FAD, is never confused
|
|
with full-form subjoined WA, U+0FBA, though, because ACIP represents
|
|
the former with {V} and the latter with {W}. Furthermore, the
|
|
converter never generates U+0F6A, the fixed-form RA (<i>rango</i>);
|
|
U+0F62 is always produced. (Note that U+0F62 is often
|
|
displayed as a fixed-form RA itself, as in {RNYA}.)
|
|
</p>
|
|
|
|
<p>
|
|
So far, we have spoken about consonants and vowels. In fact,
|
|
it is not trivial to determine when something is a consonant and
|
|
when it is a vowel. {A} can represent U+0F68, the Tibetan
|
|
letter, or the implicit vowel. {'} can represent U+0F71, the
|
|
subscribed a-chung, or U+0F60, the full-sized consonant
|
|
a-chung. The converter treats {TAA} as {T+AA}, not {TA-AA},
|
|
but treats {TAAA} like {TA-AA}, not {T+AA-A}. It treats
|
|
{PA'AM} like {PA-'A-M}, not {P+A'A-M}. In short, it first
|
|
tries out treating {'} and {A} like vowels, but will backtrack if
|
|
that leads to a clearly invalid <i>tsheg bar</i>.
|
|
</p>
|
|
|
|
<p>
|
|
Finally, a string of numbers can be a <i>tsheg bar</i> also.
|
|
It is illegal for numbers and consonants to appear together within
|
|
one <i>tsheg bar</i>, however.
|
|
</p>
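<p>
The rule about numbers is simple enough to sketch directly.  The
Python function below, whose name <tt>classify_tsheg_bar</tt> is made
up for this example, flags any <i>tsheg bar</i> that mixes digits with
other characters.  Treating a mix of digits and vowels as illegal too
is an assumption; the text above only speaks of digits and consonants.
</p>

```python
DIGITS = "0123456789"


def classify_tsheg_bar(tsheg_bar):
    """Return 'number', 'text', or 'illegal' for a single tsheg bar."""
    has_digit = any(ch in DIGITS for ch in tsheg_bar)
    has_other = any(ch not in DIGITS for ch in tsheg_bar)
    if has_digit and has_other:
        return "illegal"   # digits may not share a tsheg bar with letters
    return "number" if has_digit else "text"


if __name__ == "__main__":
    for example in ["108", "KA", "KA1"]:
        print(example, classify_tsheg_bar(example))
```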
|
|
|
|
<p>
|
|
The above is a complete account of the converter's algorithm for
parsing <i>tsheg bar</i>s.  You, the native Tibetan speaker, may
know that {BSKYABS-'UR-'UNG-'O} is not allowed and thus think that
{B+S+K+YAB+S-'UR-'UNG-'O} should be the result, but the converter
has no such knowledge, and treats this as a native <i>tsheg bar</i>
equivalent to {B-S+K+YAB-S-'UR-'UNG-'O}.
|
|
</p>
<a name="sysprops"></a><h2>System Properties</h2>
|
|
|
|
<p>
|
|
The <a href="#sub"><i>tsheg-bar</i> substitution</a> mechanism is
|
|
customizable via system properties. Java developers likely
|
|
know what these are, but few users do. This section will
|
|
perhaps get a determined person started, but if you have trouble,
|
|
contact <a href="mailto:thdltools-devel@lists.sourceforge.net">the
|
|
developers</a> so that we can improve this documentation or create a
|
|
better user interface.
|
|
</p>
|
|
|
|
<p>
|
|
For the tool to respect the value of a system property, you must
|
|
invoke the tool from the command line as follows:
|
|
</p>
|
|
|
|
<p>
|
|
<tt>
|
|
java
|
|
"-Dorg.thdl.tib.text.ttt.ReplacementMap=KAsh=>K+sh,ONYA=>[#ERROR-ONYA-IS-O&]"
|
|
-Dorg.thdl.tib.text.ttt.VerboseReplacementMap=true
|
|
-jar Jskad.jar
|
|
</tt>
|
|
</p>
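<p>
To show the shape of the <tt>ReplacementMap</tt> value in the command
above, here is a Python sketch that splits such a string into
individual rules.  It assumes, based only on the example, that rules
are separated by commas and that each rule has the form
<tt>source=>target</tt>.  The function name
<tt>parse_replacement_map</tt> is hypothetical, and the converter's
actual parsing (for instance, how it handles a comma inside a target)
may differ.
</p>

```python
def parse_replacement_map(value):
    """Split 'A=>B,C=>D' into {'A': 'B', 'C': 'D'}.

    Assumes comma-separated rules of the form source=>target.
    """
    rules = {}
    for rule in value.split(","):
        source, arrow, target = rule.partition("=>")
        if not arrow or not source:
            raise ValueError("malformed rule: " + repr(rule))
        rules[source] = target
    return rules


if __name__ == "__main__":
    example = "KAsh=>K+sh,ONYA=>[#ERROR-ONYA-IS-O&]"
    for source, target in parse_replacement_map(example).items():
        print(source, "is replaced by", target)
```

<p>
Checking a rule string this way before passing it on the command line
can catch a missing <tt>=></tt> early.  Note also that the whole
<tt>-D</tt> argument is quoted in the command above, since the value
contains characters that many shells treat specially.
</p>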
|
|
|
|
|
|
<a name="bugs"></a><h2>Known Bugs</h2>
|
|
|
|
<p>
|
|
This section presents areas where the current tool's behavior is
|
|
wrong. Before doing serious work with the converter,
|
|
familiarize yourself with this section and develop a plan to work
|
|
around the bugs or to ensure that your documents will not trigger
|
|
the bugs. At the same time, if any of these bugs affects you,
|
|
contact <a href="mailto:thdltools-devel@lists.sourceforge.net">the
|
|
developers</a> so that we can fix them. The squeaky wheel
|
|
surely gets the grease; these bugs may never be fixed if there are
|
|
no complaints.
|
|
</p>
|
|
|
|
<p>
|
|
The following are all known bugs:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
When ACIP {MTHARo} is given, the {o} glyph should be centered
|
|
under the THA glyph in ACIP->TMW conversions. At present,
|
|
the {o} glyph appears underneath the rightmost stack.
|
|
Similarly, {\u0F35} and {\u0F37} are not centered properly.
|
|
[<a
|
|
href="http://sourceforge.net/tracker/index.php?func=detail&aid=838594&group_id=61934&atid=502515">838594</a>]
|
|
</li>
|
|
<li>
|
|
ACIP->TMW conversion for {\u0F3E} is not correct. Fear
|
|
not; the character U+0F3E is so rare that no ACIP transliteration
|
|
exists for it. [<a
|
|
href="http://sourceforge.net/tracker/index.php?func=detail&aid=855478&group_id=61934&atid=502515">855478</a>]
|
|
</li>
|
|
<li>
|
|
In a command-line ACIP->Unicode text file conversion, no
|
|
warning or error is given when the input is {KA (KHA)}. (The
|
|
output is a text file and does not have a mechanism for indicating
|
|
a change in font size.) [<a
|
|
href="http://sourceforge.net/tracker/index.php?func=detail&aid=855519&group_id=61934&atid=502515">855519</a>]
|
|
</li>
|
|
</ul>
<a name="room"></a><h2>Room for Improvement</h2>
|
|
|
|
<p>
|
|
This section presents areas where the current tool could be
|
|
improved.  None of the behavior described here is
incontrovertibly flawed; that is, no bugs are described here (see
<a href="#bugs">known bugs</a> for those).  The current behavior is
technically correct, but it is not perfect in everyone's eyes.
|
|
</p>
|
|
|
|
<p>
|
|
The following are the current areas in which the tool could be
|
|
better:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
The glyph TibetanMachineWeb9.61 -- the {O'I} special combination
|
|
(i.e., the glyph for the Unicode string U+0F7C,U+0F60,U+0F72) --
|
|
is never output by the ACIP->TMW converter. It is
|
|
sometimes more beautiful than the glyphs that are presently output
|
|
(three separate glyphs instead of the one).
|
|
</li>
|
|
<li>
|
|
Though the ACIP standard disallows it, you will find in ACIP
|
|
documents from the Buddhist Canon things like {/NYA\} where the
|
|
standard demands {/NYA/}. Presently, this triggers an error;
|
|
it would be better if this were converted like {/NYA/} is, and
|
|
triggered only a <tt>Most</tt>-level warning.
|
|
</li>
|
|
<li>
|
|
The hypothetical comment {[# \u0F40 may have been intended...]}
|
|
should cause a warning saying that Unicode escapes do not apply
|
|
within comments.
|
|
</li>
|
|
<li>
|
|
The whitespace after a <a href="#escapes">Unicode escape</a> is
|
|
not interpreted correctly when that Unicode escape represents
|
|
something that is part of a <i>tsheg bar</i>. For example,
|
|
the space in {KA KHA} is treated as a <i>tsheg</i> (i.e., U+0F0B),
|
|
but the space in {\u0F40 KHA} is wrongly treated as Tibetan
|
|
whitespace. [<a
|
|
href="http://sourceforge.net/tracker/index.php?func=detail&aid=855482&group_id=61934&atid=502515">855482</a>]
|
|
</li>
|
|
<li>
|
|
Though not standard, {:} and {:-} sometimes are intended to
|
|
represent U+0F14. The latter causes an error; it should
|
|
cause a warning suggesting that the <a href="#escapes">Unicode
escape</a> {\u0F14} be used instead.  The former is always
|
|
treated as U+0F7F; it should cause a warning in some or all
|
|
contexts.
|
|
</li>
|
|
<li>
|
|
The <a href="#sub"><i>tsheg-bar</i> substitution</a> mechanism
|
|
should be more general. The useful rule
|
|
<tt>ONYA=>O&</tt> should be supported.
|
|
</li>
|
|
<li>
|
|
The converters should support a white list of acceptable
|
|
non-native <i>tsheg bar</i>s (where the term "tsheg bar"
|
|
is to be interpreted somewhat literally here as any characters
|
|
between punctuation). Non-native <i>tsheg bar</i>s not on
|
|
the list should produce warnings or errors. Similarly, but
|
|
perhaps less urgently, a syllabary of native <i>tsheg bar</i>s
|
|
should be supported too. (A workaround is to use <a
|
|
href="#colors">coloring</a>, have your word processor delete
|
|
everything but the colored text, sort the colored <i>tsheg
|
|
bar</i>s, and inspect them all by hand. Also, <a
|
|
href="#stats"><i>tsheg-bar</i> statistics</a> will help you to
|
|
find uncommon <i>tsheg bar</i>s.)
|
|
</li>
|
|
<li>
|
|
ACIP->Unicode conversions produce Unicode text files at
|
|
present. While more compact than Rich Text Format (RTF)
|
|
files, a text file does not allow for supporting the two font
|
|
sizes in {KA (KA)}. A workaround is to use an ACIP->TMW
|
|
conversion followed by a separate <a
|
|
href="TMW_or_TM_To_X_Converters.html">TMW->Unicode</a>
|
|
conversion.
|
|
</li>
|
|
<li>
|
|
The converter should warn for each occurrence of the vowels {'E},
|
|
{'O}, {'EE}, or {'OO}.
|
|
</li>
|
|
</ul>
|
|
|
|
|
|
<h2>License</h2>
|
|
|
|
<p>Both the ACIP->Tibetan converters and this document are released
|
|
under the <a
|
|
href="http://iris.lib.virginia.edu/tibet/tools/thdl_license.txt">THDL
|
|
Open Community License Version 1.0</a>.</p>
|
|
|
|
|
|
<p>
|
|
Please
|
|
|
|
<a href="mailto:thdltools-devel@lists.sourceforge.net">
|
|
e-mail us</a>
|
|
|
|
your comments about this page.
|
|
</p>
|
|
|
|
<p>
|
|
The
|
|
<a href="http://www.sourceforge.net/projects/thdltools">
|
|
THDL Tools</a>
|
|
project is generously hosted by:
|
|
<!--
|
|
|
|
DO NOT DELETE THE SF.NET LOGO.
|
|
|
|
We have a choice of colors and sizes for this logo (see
|
|
"https://sourceforge.net/docman/display_doc.php?docid=790&group_id=1"),
|
|
but we do not have the option of removing it. SourceForge requests
|
|
that we put it on each web page for our project, and to give us
|
|
incentive to do so, they will not track the number of hits for our
|
|
project web pages unless we put this link in. To track hits, see
|
|
"http://sourceforge.net/project/stats/index.php?report=months&group_id=61934".
|
|
|
|
-->
|
|
<a href="http://sourceforge.net/">
|
|
<img src="http://sourceforge.net/sflogo.php?group_id=61934&type=1"
|
|
width="88" height="31" alt="SourceForge Logo" />
|
|
</a>
|
|
<!-- AGAIN, DO NOT DELETE THE SF.NET LOGO. -->
|
|
</p>
|
|
</div>
|
|
|
|
|
|
</body>
|
|
</html>
|