ACIP To Tibetan Converters
+ ++ This document describes the ACIP->Tibetan converters built atop + Jskad. + These converters were initially written by David Chandler, a + volunteer with the Tibetan and + Himalayan Digital Library, in the latter half of 2003. + They built upon the work of Tony Duff, Edward Garrett, and Than + Garson, and they would not be possible without the assistance of + David Chapman, Robert Chilton, and Andrés Montano + Pellegrini. (Please correct, and forgive, any omissions from + these lists.) +
+ ++ These converters accept Asian + Classics Input Project (ACIP) transliteration of Tibetan (using + ACIP's Tibetan + Input Code), a Roman transliteration scheme. ACIP has many + Buddhist texts available in ACIP transliteration, which alone makes + ACIP transliteration (or just ACIP for short) important. +
+ ++ The converters here accept a text file of ACIP and output either a + Unicode UTF-8-encoded text file or a Rich Text Format (RTF) file of + Tibetan + Machine Web (TMW). The latter is ready to use onscreen and + to make beautiful hardcopy today; the former will be understood by + software for a long time to come. +
+ ++ The converters are meant to produce perfect results even for + imperfect input. To give you an idea of the thought and care + that went into these converters, consider the following partial list + of features: +
+ +-
+
- + Four tiers of warning and error + messages are available. + +
- + Some transliterations specified by the ACIP standard are not + accepted (i.e., they cause errors) + because they are used too often improperly in Release IV texts + (e.g., {\}); some non-standard transliteration is understood + because it is used in ACIP Release IV texts (e.g., {[DD1]}). + +
- + Non-standard Unicode character escapes are + supported. (In this way, the glyph that the ACIP {\} refers + to according to the standard can in fact be represented, via + {\u0F84}.) + +
- + Color-coding can help find typos in the + input. + +
- + A substitution mechanism allows for correcting + erroneous documents on the fly. + +
- + The converters can output frequency statistics. + +
- + The "lexical analyzer" and "parser" handle every intricacy of + real ACIP Release IV texts. + +
- + The knowledge regarding the TMW font has been verified by + independent teams as described here. + +
The ACIP->Unicode and ACIP->TMW converters are equally good, but they differ in what they can represent. The TMW font has only a fixed set of glyphs, whereas Unicode can encode arbitrary Tibetan glyphs. Thus, the hypothetical ACIP {GAI}, which parses as {G+AI} due to prefix rules, will give an error in an ACIP->TMW conversion because no glyph exists for this stack. The ACIP->Unicode conversion will succeed, having generated correct Unicode. This is the chief difference between the two conversions; in addition, color-coding is available only for ACIP->TMW conversions.
+ ++ The converters are actively maintained; your feedback is + valued. +
+ ++ Note that there are also TMW->ACIP converters + available; this document does not cover them. +
+ ++ In what follows, you will learn how to use the + converters, including all the features listed above, and you'll find + a list of known bugs and places where there is + room for improvement. +
+ + +Using the Converters
+ ++ This section briefly describes how the converters are best used. +
The GUI and command-line interfaces are both sufficient; the GUI is your best bet if you've not used the converters before. To learn how to invoke these interfaces, read these instructions.
+ ++ First, review the known bugs and be sure you can + live with them. +
Now perform a trial conversion of your document with warnings disabled. The first goal is to ensure that no outright errors appear in the input. If any do, make a copy of the input, edit the copy, and feed it through again. Feel free to try this out as soon as you're comfortable; the error messages themselves are sometimes self-explanatory.
+ ++ Once all errors have been corrected, do a conversion with warning + level 'Some'. If any warnings mark real problems, correct + those problems. +
If you have the patience, now do a conversion with warning level 'Most'. If any warnings mark real problems, correct those problems.
The 'All' warning level is pedantic; you might find it useful if you're writing software that is to produce ACIP transliteration that is easily read by machines. If you find any useful warnings at this level, report them as bugs -- such warnings should be 'Most' or 'Some' level.
For best results, produce color-coded output. Scan the output for non-native tsheg bars and ensure that they match the original document (the one from which the ACIP transliteration was produced). Color-coding is useful because, for example, {ZHIGN} is probably a typo for {ZHING}; {ZHIGN} will appear colored, whereas {ZHING} will not.
+ ++ Note that the ACIP {%} gives a warning every time. Use the Unicode escape {\u0F35} if you want to avoid + this warning, but note well that Unicode escapes are not part + of the ACIP standard. Thus, other tools that work with ACIP + transliteration will likely not understand {\u0F35}. +
+ ++ To save time, you may use the tsheg-bar + substitution mechanism when appropriate. +
Even if your desired end result is Unicode output, an ACIP->TMW conversion is sometimes useful. One benefit is that errors will appear for any ACIP tsheg bar that refers to a consonant stack not included in TMW. These stacks should be scrutinized: because TMW contains over 500 of the most common consonant stacks, a stack that TMW lacks is unusual enough that it may well be a typo.
+ ++ Finally, check a few folios by hand against the original document to + be sure that you're satisfied with the conversion. +
+ + + + + +Diagnostics: Warnings and Errors
+ ++ These converters are designed such that the output is just what you + yourself would create by hand. Whenever there is doubt about + what output is desired, a warning or error is issued. This + means that a helpful warning or error message will appear in the + output, and that you will be told at the end of the conversion that + one or more warnings or errors have indeed occurred. You can + then search your output document for the text [#ERROR or + [#WARNING. +
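The search described above is easy to script. The following is a minimal sketch in Python, assuming the converted output is available as a plain (UTF-8) text string; the function names are illustrative and are not part of the converters.

```python
import re
from collections import Counter

# Matches the start of a diagnostic such as "[#ERROR ..." or "[#WARNING ...".
DIAGNOSTIC = re.compile(r"\[#(ERROR|WARNING)\b")

def count_diagnostics(text):
    """Return a Counter of how many ERROR and WARNING markers occur in text."""
    return Counter(match.group(1) for match in DIAGNOSTIC.finditer(text))

def lines_with_diagnostics(text):
    """Yield (line_number, line) for every line that contains a diagnostic."""
    for number, line in enumerate(text.splitlines(), start=1):
        if DIAGNOSTIC.search(line):
            yield number, line
```

Running count_diagnostics over the output of each trial conversion gives a quick measure of how much cleanup remains.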
+ ++ There are four warning levels: 'None', 'Some', 'Most', and + 'All'. Choose 'None' if you don't want any warnings to appear + in your output and be brought to your attention at the end of + conversion. Choose 'Some' if you want to see the most + important warnings, 'Most' if you want some real confidence in your + output, and 'All' if you've absolutely got to know that the output + is right. +
+ ++ Errors will always appear; you cannot disable them. +
+ ++ The following are some (but not all) error and warning messages, + accompanied by further explication: +
+ +-
+
- + [#ERROR CONVERTING ACIP DOCUMENT: The Unicode escape with + ordinal 3912 does not match up with any TibetanMachineWeb + glyph.] appears for the input {\u0F48} because there is no + character at the Unicode codepoint U+0F48 (decimal 3912). + +
- + [#ERROR The ACIP {G+N+NA} cannot be represented with the + TibetanMachine or TibetanMachineWeb fonts because no such glyph + exists in these fonts.] appears because the Tibetan Machine + Web font has only a limited number of ready-made, precomposed + glyphs, and {G+N+NA} is not one of them. You'll only see + this error in an ACIP->TMW conversion, not an ACIP->Unicode + conversion. + +
- + [#ERROR CONVERTING ACIP DOCUMENT: This converter cannot + convert the ACIP {x} to Tibetan because it is unclear what the + result should be.] appears because the appropriate output for + this likely requires special mark-up. + +
- + [#ERROR CONVERTING ACIP DOCUMENT: Lexical error: The ACIP {^} + must precede a tsheg bar.] appears for + {^ GONG SA}, for example, because only + {^GONG SA} and {^ GONG SA} are supported in this + implementation. + +
- + [#ERROR CONVERTING ACIP DOCUMENT: The tsheg bar ("syllable") : + has these errors: Cannot convert ACIP A: because A: is a "vowel" + without an associated consonant] appears for the input {:} + because {:} cannot appear alone. (Sloppily, this message + exposes you to the internals of the converter, where {:} is + thought of as {A:} in some contexts.) + +
- + [#ERROR CONVERTING ACIP DOCUMENT: Lexical error: The ACIP x + must be glued to the end of a tsheg bar, but this one was + not] appears because {%}, {o}, and {x} are really only to be + applied to whole tsheg bars, and should not occur alone. + +
- [#WARNING CONVERTING ACIP DOCUMENT: The ACIP DGYA has been interpreted as two stacks, not one, but you may wish to confirm that the original text had two stacks as it would be an easy mistake to make to see one stack and forget to input it with '+' characters.] appears because it draws attention to the effect of prefix rules, a subtle point with regard to ACIP because the ACIP standard implies these rules but does not discuss them explicitly or in depth.
- + [#WARNING CONVERTING ACIP DOCUMENT: Warning: We're going with + {B+NA}, but only because our knowledge of prefix rules says that + {B}{NA} is not a legal Tibetan tsheg bar ("syllable")] + appears for the same reason as above. + +
- + [#WARNING CONVERTING ACIP DOCUMENT: Lexical warning: The ACIP + {%} is treated by this converter as U+0F35, but sometimes might + represent U+0F14 in practice. To avoid seeing this warning again, + change the input to use {\u0F35} instead of {%}.] appears + because some ACIP transliteration out there does use {%} to mean + U+0F14. + +
When a warning or error message refers to a 'Lexical error' (or 'Lexical warning'), the problem was found while breaking the input text up into tsheg bars. To fully understand all warning and error messages, you need a thorough understanding of that process and of the interpretation of ACIP tsheg bars.
+ + +Coloration
+ ++ For ACIP->TMW conversions (not ACIP->Unicode), color-coding of + tsheg bars is an option. The command-line converters + accept a flag --colors yes|no; the conversion GUI in + Jskad has a checkbox for color-coding. +
+ ++ Warnings and errors appear in red; tsheg + bars that would parse differently if other prefix rules were used appear in yellow; non-native + tsheg bars appear in green. +
+ + +Tsheg-bar Statistics
+ ++ The ACIP->Tibetan converters provide a simple-minded accounting + mechanism with which one can determine which tsheg bars + appear in a conversion or how many times each tsheg bar + appears. This mechanism is for power users only at this point; + its user interface leaves much to be desired. If you wish to + produce frequency information, and if you are not familiar with some + sort of scripting (via Excel macros, Unix shell scripts, etc.), then + the output produced will likely be useless to you. +
+ ++ To support the calculation of frequency statistics, that is, how + many times each tsheg bar appears, the converter can output + all tsheg bars to the Java error console (i.e., + System.err). Each will appear on the console as many + times as it appears in the input. To activate this + functionality, set the system property + org.thdl.tib.text.ttt.OutputAllTshegBars to true, + and be prepared for voluminous output. Massaging this output + into a friendly tabular format is quite possible but not described + here; contact the + developers for help. +
+ ++ To support the generation of syllabaries, the converter can output + each tsheg bar encountered to the Java error console (i.e., + System.err). Each will appear on the console only + once, no matter how many times it appears in the input. To + activate this functionality, set the system + property org.thdl.tib.text.ttt.OutputUniqueTshegBars to + true, and be prepared for voluminous output. +
+ ++ If desired, each tsheg bar output can be prefixed with a + string of your choice by setting the system + property org.thdl.tib.text.ttt.PrefixForOutputTshegBars + to that string. This is useful if the converter is producing + other output on the console and you want to separate that output + from the statistics. +
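System properties such as these are normally set with the JVM's standard -Dname=value option; the rest of the command line is whatever you already use to run the converter. Once the console output has been captured to a file, turning it into frequency counts takes only a few lines of script. The sketch below is in Python and rests on two assumptions not confirmed by this document: that each tsheg bar is printed on its own line, and that the prefix string (here the made-up value "TSHEGBAR:") comes immediately before it.

```python
from collections import Counter

def tally_tsheg_bars(console_lines, prefix):
    """Count the tsheg bars found on lines that start with prefix.

    Assumes one tsheg bar per line, printed directly after the prefix;
    lines without the prefix are treated as unrelated console output.
    """
    counts = Counter()
    for line in console_lines:
        line = line.rstrip("\n")
        if prefix and line.startswith(prefix):
            counts[line[len(prefix):]] += 1
    return counts
```

For example, tally_tsheg_bars(open("console.txt", encoding="utf-8"), "TSHEGBAR:").most_common(20) would list the twenty most frequent tsheg bars, where console.txt is a hypothetical file holding the captured console output.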
+ + + + + +Tsheg-bar Substitution
+ + + ++ The ACIP->Tibetan converters provide a mechanism for + automatically correcting common transliteration typos. For + example, if your document contains 100 occurrences of {KAsh} that + all in fact intend {K+sh}, then you can specify just once the rule + {KAsh}->{K+sh}, and all 100 occurrences will be treated + correctly. This mechanism is not very easy to use, but it is + completely customizable; you can specify any number of rules. + You can only perform such substitutions at the tsheg bar + level, though. This means, for example, that you cannot + specify the rule {GONG SA}->{^GONG SA}; you can only + specify {GONG}->{^GONG}, which would affect {GONG LA} + just as it would affect {GONG SA}. +
+ ++ To perform substitutions, set the system + property org.thdl.tib.text.ttt.ReplacementMap to be a + comma-delimited list of x=>y pairs. For example, + if you think BLKU, which parses as B+L+KU, should parse as B-L+KU, + and you want KAsh to be parsed as K+sh because the input operators + mistyped it, then set org.thdl.tib.text.ttt.ReplacementMap + to BLKU=>B-L+KU,KAsh=>K+sh. Note that this will + not cause {B+L+KU} to become {B-L+KU} -- we are doing the + replacement during lexical analysis of the input file, not during + parsing. And it will cause {SBLKU} to become {SB-L+KU}, which + is parsed as {S+B-L+KU}, probably not what you wanted. If you + fear such things, you can see if they happen by setting the system + property org.thdl.tib.text.ttt.VerboseReplacementMap to + true, which will cause an informational message to be + printed on the Java console every time a replacement is made. +
+ ++ Furthermore, you can use the regular expression notations ^ + and $ to denote the beginning and end of the tsheg + bar, respectively. For example, ^BLKU$=>B-L+KU + is a useful rule. Note that full regular expressions are not + supported -- the tool just borrows a bit of the notation. The + rule ^BLKU=>B-L+KU means that {BLKUM} and {BLKU} will + both be replaced, but {SBLKU} and {SBLKUM} will not be. The + caret, ^, means that we only match if BLKU is at the + beginning. The dollar sign, $, means that we only + match if the pattern is at the end. The rule + BLKU$=>B-L+KU will cause {SBLKU} to be replaced, but not + {BLKUM}. Note that performance is far better for + ^FOO$ than for ^FOO, FOO$, or + FOO alone. +
+ ++ Only one substitution is made per tsheg bar. + ^FOO$-style mappings will be tried first, then + ^FOO-style, then FOO$-style, and finally + FOO-style. +
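To make the matching order concrete, here is a small Python model of the rules as described in this section. It is a sketch, not the converter's code: in particular, it assumes that rules of the same kind are tried in the order listed and that an unanchored rule replaces only the first occurrence, neither of which is stated above.

```python
def parse_replacement_map(spec):
    """Split a string like 'x=>y,x2=>y2' into a list of (pattern, replacement)."""
    rules = []
    for item in spec.split(","):
        pattern, replacement = item.split("=>", 1)
        rules.append((pattern, replacement))
    return rules

def _kind(pattern):
    """Rank a pattern: 0 for ^FOO$, 1 for ^FOO, 2 for FOO$, 3 for plain FOO."""
    starts, ends = pattern.startswith("^"), pattern.endswith("$")
    if starts and ends:
        return 0
    if starts:
        return 1
    if ends:
        return 2
    return 3

def substitute(tsheg_bar, rules):
    """Apply at most one rule to tsheg_bar, trying the most specific kinds first."""
    for rank in range(4):
        for pattern, replacement in rules:
            if _kind(pattern) != rank:
                continue
            core = pattern[1:] if pattern.startswith("^") else pattern
            core = core[:-1] if core.endswith("$") else core
            if rank == 0 and tsheg_bar == core:
                return replacement
            if rank == 1 and tsheg_bar.startswith(core):
                return replacement + tsheg_bar[len(core):]
            if rank == 2 and tsheg_bar.endswith(core):
                return tsheg_bar[:len(tsheg_bar) - len(core)] + replacement
            if rank == 3 and core in tsheg_bar:
                return tsheg_bar.replace(core, replacement, 1)
    return tsheg_bar  # no rule matched
```

With the rule BLKU=>B-L+KU, this model turns {SBLKU} into {SB-L+KU} and leaves {B+L+KU} alone, matching the behavior described above; with ^BLKU=>B-L+KU it leaves {SBLKU} untouched.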
+ ++ An example of a useful substitution is o$=>\u0F35. + This is useful because the converters interpret the ACIP {o} as + U+0F37 by default, but you might prefer U+0F35 in your output. +
Note that you cannot literally replace {FOO} with {BAR} using this mechanism -- because {F} is not an ACIP character, the lexical analyzer will not get far enough to use this substitution mechanism. This is not considered a design flaw -- serious errors require user intervention. Sophisticated users can use something akin to perl, sed, or awk scripts to preprocess the input.
+ ++ Note also that you cannot use the rule ONYA=>O&, + although it would be nice if you could. Technically, {&} + is considered to be punctuation (i.e., that which divides tsheg + bars) and is not understood inside a tsheg bar. +
+ ++ Note that this mechanism is also useful for fixing problems in the + converter itself rather than in the input. +
+ +Unicode Character Escapes
+ ++ The ACIP->Tibetan converters support some non-standard extensions + to the ACIP + Tibetan Input Code Standard. One of those is Unicode + character escape sequences. This extension makes it possible + to represent characters that the ACIP + standard does not address, and to represent one character, + U+0F84, that ACIP does address with the transliteration {\} but that + is misused in practice so often to refer to U+0F3C that the + ACIP->Tibetan converters always produce an error upon seeing {\}. +
Outside of comments, {\uKLMN} is interpreted as referring to the Unicode character with ordinal KLMN, where each of K, L, M, and N is a case-insensitive hexadecimal digit. For example, the ACIP {KA KHA GA NGA } is exactly equivalent to {\u0F40\u0f0B\u0F41\u0F0B\u0F42\u0F0B\u0F44\u0f0b}. Unicode escapes produce the obvious Unicode in an ACIP->Unicode conversion, and they produce the correct TMW glyph in an ACIP->TMW conversion. There are limits, though, when converting to TMW: multiple escapes in sequence are not handled correctly. It would take a Unicode-to-TMW converter to produce the correct glyphs for {\u0F42\u0F92\u0FB7\u0F7C}. The escapes for vowels and other characters that are mapped to multiple TMW glyphs are also not handled perfectly. Best practice is to use escapes only when necessary in an ACIP->TMW conversion.
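The escape syntax itself is simple enough to model in a few lines. The Python sketch below expands {\uKLMN} escapes into the characters they name, which corresponds to the ACIP->Unicode case only. It is a simplification: it does not skip comments, and it says nothing about how the converter chooses TMW glyphs.

```python
import re

# A backslash, a lowercase "u", then exactly four hexadecimal digits (either case).
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def expand_unicode_escapes(acip):
    """Replace each \\uKLMN escape with the Unicode character it names."""
    return ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), acip)
```

For instance, expand_unicode_escapes(r"\u0F40\u0f0B") yields the letter U+0F40 followed by the tsheg U+0F0B.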
+ ++ The Unicode character represented need not be a Tibetan one; for + example, {\u0040} produces the at sign, @. +
+ ++ Note well the known bug with regard to + whitespace in transliteration that follows a Unicode escape. + In large part, this bug affects characters that can be + transliterated by other, simpler, standard means. +
+ ++ If you do want to disable the use of Unicode escapes, set the system property + thdl.tib.text.disallow.unicode.character.escapes.in.acip to + true. +
+ + +Breaking a Text Up Into tsheg bars
The ACIP->Tibetan converters all take ACIP transliteration as input. The first step in conversion is to break up the input into manageable pieces. (This is known as lexical analysis in the context of programming languages, and you may see the term in diagnostic messages, though a linguist who studies a human language like Tibetan might balk at the term.) The correct pieces in this case are tsheg bars (in ACIP, {TSEG BAR}), punctuation, comments, whitespace, folio markers, formatting codes, etc. In this section, the intricacies of how the converter does that will be laid bare. With luck, this will help you understand why the converter treated one space character (i.e., ' ', U+0020) as a tsheg and another as Tibetan whitespace.
+ ++ The Tibetan term tsheg bar refers to "the stuff between + the dots". In the ACIP {BKRA SHIS [# Notice that + this comment is embedded in the Tibetan greeting pronounced 'tashi + delay']BDE LEGS,}, there are four tsheg bars, 'BKRA', + 'SHIS', 'BDE', and 'LEGS'. In this case 'BDE' is literally + "between the dots"; i.e., it is sandwiched by two U+0F0B + characters (because comments are in a sense invisible). One of + the "dots" that touches 'LEGS' does not look like a dot -- + it is a shad, U+0F0D. The lexical analyzer also finds + one comment, which will appear in a Latin typeface in the output, + and it finds four pieces of punctuation -- three tshegs and a + shad. +
+ ++ The converter will not allow an illegal character into a tsheg + bar. For example, {jA} is an error and causes an error + message to appear in the output. +
+ ++ Now that the basic operation is clear from the above example, let's + cover the fine points of how standard ACIP is handled. We'll + also cover some non-standard constructs that appear commonly in + actual ACIP Release IV texts. +
+ ++ The first construct that deserves explanation is the line + break. By the ACIP standard, line breaks in the input do not + become line breaks in the output unless there are two line breaks in + the input. For example, the ACIP snippet below has only one + line break in the output although three line breaks appear in the + input: +
BKRA SHIS
BDE LEGS,

THUGS RJE CHE ... and so on ...
+ One fine point is that the converter does not require a space before + a line break. If {SHIS} appears before a line break, the converter + inserts a space so that it's treated just like {SHIS } is + treated. This oddity is needed to convert real ACIP documents. +
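The two line-break rules just described (a single line break is not a break, and a missing space before a line break is supplied) can be modeled as a preprocessing step. The Python sketch below is only an illustration of those two rules as stated here; how the converter handles runs of three or more line breaks, or line breaks after punctuation, is not described in this document and is not modeled.

```python
def normalize_line_breaks(acip):
    """Join lines separated by one newline; keep one break per blank line."""
    joined = []
    for paragraph in acip.split("\n\n"):
        lines = paragraph.split("\n")
        pieces = []
        for line in lines[:-1]:
            # Supply the space that may have been omitted before the line break,
            # so that {SHIS} at the end of a line is treated like {SHIS }.
            pieces.append(line if line.endswith(" ") else line + " ")
        pieces.append(lines[-1])
        joined.append("".join(pieces))
    return "\n".join(joined)
```

Applied to the snippet above, this model produces two output lines: {BKRA SHIS BDE LEGS,} and {THUGS RJE CHE ... and so on ...}.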
+ ++ Another fine point is that ACIP's {^} character "eats" a + following space or a newline. This is so that + {^ GONG SA } is treated identically to + {^GONG SA }. +
+ ++ Comments appear in a Latin typeface always. Comments are not + allowed just anywhere -- a comment cannot occur within a single + tsheg bar, for example, and it cannot appear between a + tsheg bar and the tsheg that follows it. That + is, {BD[#COMMENT]E} is not like {BDE}, and {BDE[#COMMENT] LEGS} + is not like {BDE LEGS} (though {BDE [#COMMENT]LEGS} is). +
+ ++ Corrections are interpreted as Tibetan, not English, by default, but + there is a built-in list of corrections that should appear in the + output in a Latin typeface. (Actually, any correction that + starts with a certain string will appear in a Latin typeface.) + The full list is the following: +
"LINE" // from KD0001I1.ACT
"DATA" // from KL0009I2.INC
"BLANK" // from KL0009I2.INC
"NOTE" // from R0001F.ACM
"alternate" // from R0018F.ACE
"02101-02150 missing" // from R1003A3.INC
"51501-51550 missing" // from R1003A52.ACT
"BRTAGS ETC" // from S0002N.ACT
"TSAN, ETC" // from S0015N.ACT
"SNYOMS, THROUGHOUT" // from S0016N.ACT
"KYIS ETC" // from S0019N.ACT
"MISSING" // from S0455M.ACT
"this" // from S6850I1B.ALT
"THIS" // from S0057M.ACT
+ Somewhat related is the converter's treatment of a few oddball + comments. The oddity is that these comments use the syntax + {[COMMENT]} rather than the standard syntax {[#COMMENT]}. The + converter will treat the following as comments: +
From S5274I.ACT:
"[FIRST]"
From S5274I.ACT:
"[SECOND]"
From S0216M.ACT:
"[Additional verses added by Khen Rinpoche here are]"
From S0216M.ACT:
"[ADDENDUM: The text of]"
From S0216M.ACT:
"[END OF ADDENDUM]"
From S0216M.ACT:
"[Some of the verses added here by Khen Rinpoche include:]"
From S0216M.ACT (note the typo):
"[Note that, in the second verse, the {YUL LJONG} was orignally {GANG LJONG},
and is now recited this way since the ceremony is not only taking place in Tibet.]"
From S6954E1.ACT:
"[text missing]"
From TD3817I.INC:
"[INCOMPLETE]"
From S0935m.act:
"[MISSING PAGE]"
From S0975I.INC:
"[MISSING FOLIO]"
From S0839D1I.INC:
"[UNCLEAR LINE]"
From SE6260A.INC:
"[THE FOLLOWING TEXT HAS INCOMPLETE SECTIONS, WHICH ARE ON ORDER]"
From SE6260A.INC:
"[@DATA INCOMPLETE HERE]"
From SE6260A.INC:
"[@DATA MISSING HERE]"
From TD4035I.INC:
"[LINE APPARENTLY MISSING THIS PAGE]"
From TD4226I2.INC:
"[DATA INCOMPLETE HERE]"
To be consistent with the above:
"[DATA MISSING HERE]"
From S0018N.ACT:
"[FOLLOWING SECTION WAS NOT AVAILABLE WHEN THIS EDITION WAS
PRINTED, AND IS SUPPLIED FROM ANOTHER, PROBABLY THE ORIGINAL:]"
From S0018N.ACT:
"[THESE PAGE NUMBERS RESERVED IN THIS EDITION FOR PAGES
MISSING FROM ORIGINAL ON WHICH IT WAS BASED]"
From S0018N.ACT:
"[PAGE NUMBERS RESERVED FROM THIS EDITION FOR MISSING
SECTION SUPPLIED BY PRECEDING]"
From S0057M.ACT:
"[SW: OK]"
From S0057M.ACT:
"[m:ok]"
From S0057M.ACT:
"[A FIRST ONE
MISSING HERE?]"
From S0195A1.INC:
"[THE INITIAL PART OF THIS TEXT WAS INPUT BY THE SERA MEY LIBRARY IN
TIBETAN FONT AND NEEDS TO BE REDONE BY DOUBLE INPUT]"
+ The converter also supports several non-standard folio + markers. A review of ACIP Release IV texts determined that the + following types of folio markers can appear: +
@001
@001A
@001B
@01A.3
@012A.3
@[07B]
@00007B
@00007
@B00007
@[00007A]
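If you preprocess ACIP files with your own scripts, a single pattern can recognize all of the marker shapes listed above. The regular expression in the following Python sketch is a generalization inferred from those ten examples, not the converter's actual grammar; it may therefore accept shapes (such as @B001A) that the converter treats differently.

```python
import re

FOLIO_MARKER = re.compile(
    r"@(?:\[\d+[AB]?\]"            # bracketed form, e.g. @[07B] or @[00007A]
    r"|[AB]?\d+[AB]?(?:\.\d+)?)"   # plain form, e.g. @001A, @B00007, @01A.3
)

def is_folio_marker(token):
    """Return True if the whole token has one of the folio-marker shapes."""
    return FOLIO_MARKER.fullmatch(token) is not None
```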
+ Similarly, to support real ACIP Release IV texts, the converter + treats {[DD1]}, {[DD2]}, {[ DD ]}, and {[DDD]} just like {[DD]} + (which is specified in the ACIP standard). It treats {[ BP ]} + and {[BLANK PAGE]} just like {[BP]}, also. +
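The equivalences in the preceding paragraph amount to a small lookup table. The Python sketch below records exactly the variants named there and nothing more; the names CANONICAL_CODES and canonical_code are illustrative.

```python
# Each non-standard variant maps to the standard code it is treated like.
CANONICAL_CODES = {
    "[DD]": "[DD]",
    "[DD1]": "[DD]",
    "[DD2]": "[DD]",
    "[ DD ]": "[DD]",
    "[DDD]": "[DD]",
    "[BP]": "[BP]",
    "[ BP ]": "[BP]",
    "[BLANK PAGE]": "[BP]",
}

def canonical_code(token):
    """Return the standard code for token, or None if token is not listed."""
    return CANONICAL_CODES.get(token)
```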
+ ++ The lists above were created by a most fallible process of reviewing + a large number of ACIP Release IV texts. Your suggestions for + additions to these lists are highly valued; please contact the + developers. +
+ ++ FIXME: describe when the converter treats a space as a tsheg and when a space is Tibetan whitespace. Describe how a tsheg does not appear after {KA} and {GA} with most vowels, describe the handling of {NGA,} as {NGA ,}. Talk about dzongkha vs. tibetan when it comes to a tsheg at the end of a string of tsheg bars. Describe treatment of final line break or lack thereof. Warn users to watch out for lines that end with {-}. Describe treatment of {.} in certain contexts as U+0F0C. Etc. + + + + +
Parsing tsheg bars: Greedy Stacking and Nativeness
+ ++ This section is a technical reference sufficiently detailed so that + you can fully understand the inner workings of the converter as it + decides which Unicode or TMW to use for a given tsheg + bar. The problem of breaking up a text into + tsheg bars is a separate issue; this section describes + what happens to a tsheg bar after it's been chipped away from + the text. +
The ACIP->Tibetan converters have a notion of nativeness. Each tsheg bar is either native Tibetan or non-native. For example, in Buddhist texts written in Tibetan, Sanskrit mantras often appear in Tibetan characters. This "Tibetanized Sanskrit" is non-native. The tsheg bars that make up such a mantra (and here, take "tsheg bar" somewhat literally to mean the characters delimited by punctuation and whitespace) are some native and some non-native in the converter's eyes. For example, the tsheg bar {MA } appears in some mantras, and is thus in fact non-native there. The converter, however, treats {MA } as native in all contexts. Thus, "native" is a technical term with a slightly different meaning than usual.
+ ++ The idea of nativeness is important because it affects how the + converter treats a tsheg bar. In ACIP transliteration, + the rule is that consonants stack up until punctuation, whitespace, + or a vowel appears. For example, {RDZYA} is equivalent to + {R+DZ+YA}. ({DZA} always means the letter {DZA} itself, never + {D+ZA}.) But this greedy stacking does not apply to {SOGS}, + which is equivalent to {SOG-S}, not {SOG+S}. Why not? + Because {SOGS} is a native tsheg bar where GA is the suffix + and SA is the postsuffix. Similarly, {GNAD} is {G-NAD}, not + {G+NAD}. Why? Because GA is a prefix in this native + Tibetan tsheg bar. +
+ ++ In this section, we will illustrate the inner workings of this + aspect of the converter. You will be able to determine which + snippets of transliteration the converter considers to be native + tsheg bars, where greedy stacking does not apply except for + the root stack, and which snippets are non-native, and thus wholly + subject to greedy stacking. +
+ +Anatomy of a Native tsheg bar
+ ++ First, the lexical analyzer ensures that only the + Tibetan and Sanskrit consonants, the vowels {A}, {I}, {U}, {E}, {O}, + {OO}, {EE}, {i}, {'A}, {'I}, {'U}, {'E}, {'O}, {'OO}, {'EE}, and + {'i}, and the adornments {m} and {:} are allowed in a tsheg + bar. +
+ ++ As far as the converter is concerned, a native tsheg bar + consists of an optional prefix, a native root stack, an optional + suffix, an optional postsuffix (also known as a secondary suffix) + that may only be present if a suffix is present, and zero or more + appendages (my term, created because I don't know what a + grammarian calls such a thing). An appendage is one of the + following stack sequences: +
+ +-
+
- {'E} +
- {'I} +
- {'O} +
- {'U} +
- {'US} +
- {'UR} +
- {'UM} +
- {'ONG} +
- {'ONGS} +
- {'OS} +
- {'IS} +
- {'UNG} +
- {'ANG} +
- {'AM} +
+ A tsheg bar is non-native if it has a non-native root stack + or if it contains the {:} character. Any vowel is allowed on a + native root stack, even {'EEm}, {i}, or the like. +
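As an illustration of the "zero or more appendages" part of this anatomy, the Python sketch below peels the listed appendages off the end of a tsheg bar. It works purely on the text of the transliteration, so it is a simplification: the converter itself works with stacks and also checks the prefix, root stack, suffix, and postsuffix, none of which is modeled here.

```python
APPENDAGES = ["'E", "'I", "'O", "'U", "'US", "'UR", "'UM", "'ONG", "'ONGS",
              "'OS", "'IS", "'UNG", "'ANG", "'AM"]

def split_appendages(tsheg_bar):
    """Peel listed appendages off the end; return (remainder, appendages)."""
    found = []
    rest = tsheg_bar
    while True:
        # Try longer appendages first so that 'ONGS is preferred over 'OS, etc.
        for appendage in sorted(APPENDAGES, key=len, reverse=True):
            # Never peel the whole string: something must remain as the root.
            if rest.endswith(appendage) and len(rest) > len(appendage):
                found.insert(0, appendage)
                rest = rest[:-len(appendage)]
                break
        else:
            return rest, found
```

For {LE'U}, this returns the remainder LE and the single appendage 'U; for {BDA'}, where {'} is a suffix, it finds no appendage.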
++ The rule about native root stacks is important, for example, in + determining that {KTYAMS} is {K+T+YAM+SA} instead of {K+T+YAMASA} + (because K+T+YA is not a native stack). Another example is + {GNVA}, which is treated like {G+N+VA}, not {G-N+VA}, even though + {GNA} is treated like {G-NA} because NA can take a GA prefix. + The complete list of native stacks is the following: +
+ +-
+
- KA +
- KHA +
- GA +
- NGA +
- CA +
- CHA +
- JA +
- NYA +
- TA +
- THA +
- DA +
- NA +
- PA +
- PHA +
- BA +
- MA +
- TZA +
- TSA +
- DZA +
- WA +
- ZHA +
- ZA +
- 'A +
- YA +
- RA +
- LA +
- SHA +
- SA +
- HA +
- AA +
- R+KA (RKA) +
- R+GA (RGA) +
- R+NGA (RNGA) +
- R+JA (RJA) +
- R+NYA (RNYA) +
- R+TA (RTA) +
- R+DA (RDA) +
- R+NA (RNA) +
- R+BA (RBA) +
- R+MA (RMA) +
- R+TZA (RTZA) +
- R+DZA (RDZA) +
- L+KA (LKA) +
- L+GA (LGA) +
- L+NGA (LNGA) +
- L+CA (LCA) +
- L+JA (LJA) +
- L+TA (LTA) +
- L+DA (LDA) +
- L+PA (LPA) +
- L+BA (LBA) +
- L+HA (LHA) +
- S+KA (SKA) +
- S+GA (SGA) +
- S+NGA (SNGA) +
- S+NYA (SNYA) +
- S+TA (STA) +
- S+DA (SDA) +
- S+NA (SNA) +
- S+PA (SPA) +
- S+BA (SBA) +
- S+MA (SMA) +
- S+TZA (STZA) +
- K+VA (KVA) +
- KH+VA (KHVA) +
- G+VA (GVA) +
- C+VA (CVA) +
- NY+VA (NYVA) +
- T+VA (TVA) +
- D+VA (DVA) +
- TZ+VA (TZVA) +
- TS+VA (TSVA) +
- ZH+VA (ZHVA) +
- Z+VA (ZVA) +
- R+VA (RVA) +
- SH+VA (SHVA) +
- S+VA (SVA) +
- H+VA (HVA) +
- K+YA (KYA) +
- KH+YA (KHYA) +
- G+YA (GYA) +
- P+YA (PYA) +
- PH+YA (PHYA) +
- B+YA (BYA) +
- M+YA (MYA) +
- K+RA (KRA) +
- KH+RA (KHRA) +
- G+RA (GRA) +
- T+RA (TRA) +
- TH+RA (THRA) +
- D+RA (DRA) +
- P+RA (PRA) +
- PH+RA (PHRA) +
- B+RA (BRA) +
- M+RA (MRA) +
- SH+RA (SHRA) +
- S+RA (SRA) +
- H+RA (HRA) +
- K+LA (KLA) +
- G+LA (GLA) +
- B+LA (BLA) +
- Z+LA (ZLA) +
- R+LA (RLA) +
- S+LA (SLA) +
- R+K+YA (RKYA) +
- R+G+YA (RGYA) +
- R+M+YA (RMYA) +
- R+G+VA (RGVA) +
- R+TZ+VA (RTZVA) +
- S+K+YA (SKYA) +
- S+G+YA (SGYA) +
- S+P+YA (SPYA) +
- S+B+YA (SBYA) +
- S+M+YA (SMYA) +
- S+K+RA (SKRA) +
- S+G+RA (SGRA) +
- S+N+RA (SNRA) +
- S+P+RA (SPRA) +
- S+B+RA (SBRA) +
- S+M+RA (SMRA) +
- G+R+VA (GRVA) +
- D+R+VA (DRVA) +
- PH+Y+VA (PHYVA) +
+ (Some would argue that LVA is notably absent. It is seen in + ACIP Buddhist texts in {AELVA}, {LVAm}, {LVU}, {LVUN}, {LVAR}, + {LVE}, {LVANG}, and {LVA}. Greedy stacking affects none of + these tsheg bars' parsing, however.) +
Not all characters can be prefixes and the like. Only the five prefixes (GA, DA, BA, MA, 'A), ten suffixes (GA, NGA, DA, NA, BA, MA, 'A, RA, LA, SA), and two postsuffixes (DA, SA) that every Tibetan student knows are allowed, and they cannot appear with vowels. (In {LE'U}, {'} is not a suffix -- it is part of an appendage.) Moreover, certain prefixes may only appear with certain root stacks. The reason that these prefix rules matter is that they govern how tsheg bars are parsed. For example, {GNA} is parsed like {G-NA}, because NA takes a GA prefix. But {GPA} is parsed like {G+PA}, because PA does not take a GA prefix.
Prefix rules are a topic of some controversy; different grammars give different lists of prefix rules. For a converter, it is important that the converter's knowledge of prefix rules matches the knowledge of the person who typed in the ACIP transliteration, not that the converter agrees with a grammarian. For example, if the input technician thought that PA could take a GA prefix, then the converter will produce {G+PA} when {G-PA} was intended. For this reason, the converter can produce a warning every time a prefix rule prohibits the treatment of one of the five prefixes as a prefix. For example, {GPA} produces this warning. However, {GNA} produces no warning, because the converter assumes that it is unlikely that an input technician would enter {GNA} upon seeing {G+NA}. Part of the reason for this assumption is that the Asian Classics Input Project Entry Operator Transcription Chart as of Spring, 1993, explicitly enumerates the following cases for special treatment by input operators:
+ +-
+
- {BDA'} vs. {B+DA} +
- {DBANG} vs. {D+BA} +
- {DGA'} vs. {D+GA} +
- {DGRA} vs. {D+GRA} +
- {DGYES} vs. {D+GYA} +
- {DMAR} vs. {D+MA} +
- {GDA'} vs. {G+DA} +
- {GNAD} vs. {G+NA} +
- {MNA'} vs. {M+NA} +
+ Regardless, for best results, you should ensure that the input + technician's knowledge of prefix rules matches the converter's + knowledge. The following are the legal combinations of prefix + and root stack in the converter's eyes: +
+ +-
+
-
+ The BA prefix may occur with any of the following stacks:
+
-
+
- KA +
- SA +
- CA +
- TA +
- TZA +
- GA +
- DA +
- ZHA +
- ZA +
- SHA +
- K+YA (KYA) +
- G+YA (GYA) +
- K+RA (KRA) +
- G+RA (GRA) +
- S+RA (SRA) +
- G+LA (GLA) +
- K+LA (KLA) +
- Z+LA (ZLA) +
- R+LA (RLA) +
- S+LA (SLA) +
- S+KA (SKA) +
- S+GA (SGA) +
- S+NGA (SNGA) +
- S+NYA (SNYA) +
- S+TA (STA) +
- S+DA (SDA) +
- S+NA (SNA) +
- S+TZA (STZA) +
- R+KA (RKA) +
- R+GA (RGA) +
- R+NGA (RNGA) +
- R+JA (RJA) +
- R+NYA (RNYA) +
- R+TA (RTA) +
- R+DA (RDA) +
- R+NA (RNA) +
- R+TZA (RTZA) +
- R+DZA (RDZA) +
- L+CA (LCA) +
- L+TA (LTA) +
- L+DA (LDA) +
- R+K+YA (RKYA) +
- R+G+YA (RGYA) +
- S+K+YA (SKYA) +
- S+G+YA (SGYA) +
- S+K+RA (SKRA) +
- S+G+RA (SGRA) +
+ -
+ The GA prefix may occur with any of the following stacks:
+
-
+
- CA +
- DA +
- NA +
- NYA +
- SA +
- SHA +
- TA +
- TZA +
- YA +
- ZA +
- ZHA +
+ -
+ The 'A prefix may occur with any of the following stacks:
+
-
+
- GA +
- JA +
- DA +
- BA +
- DZA +
- KHA +
- CHA +
- THA +
- PHA +
- TSA +
- PH+YA (PHYA) +
- B+YA (BYA) +
- KH+YA (KHYA) +
- G+YA (GYA) +
- B+RA (BRA) +
- KH+RA (KHRA) +
- G+RA (GRA) +
- D+RA (DRA) +
- PH+RA (PHRA) +
+ -
+ The MA prefix may occur with any of the following stacks:
+
-
+
- KHA +
- GA +
- CHA +
- JA +
- THA +
- TSA +
- DA +
- DZA +
- NGA +
- NYA +
- NA +
- KH+YA (KHYA) +
- G+YA (GYA) +
- KH+RA (KHRA) +
- G+RA (GRA) +
+ -
+ The DA prefix may occur with any of the following stacks:
+
-
+
- BA +
- GA +
- KA +
- MA +
- NGA +
- PA +
- B+RA (BRA) +
- B+YA (BYA) +
- G+RA (GRA) +
- G+YA (GYA) +
- K+RA (KRA) +
- K+YA (KYA) +
- M+YA (MYA) +
- P+RA (PRA) +
- P+YA (PYA) +
+
+ In the above list, the presence of wa-zur (ACIP {V}) does not + disallow a prefix-root combination; nor does the presence of any + vowel, even {'EEm}. The presence of {:} does disallow + prefix-root combinations; e.g., {GN'EEm} is {G-N'EEm}, but {GNA:} is + {G+NA:}. ({GNVA} is parsed as {G+N+VA} not because NVA cannot + take a GA prefix, but because NVA is not a native stack.) +
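The prefix rules above amount to a table lookup. The following is a minimal sketch of that idea, not the converter's actual code: the class and method names are hypothetical, only the GA and DA tables are reproduced (copied from the lists above), and the root stack is assumed to be already split off and written in its {K+YA} form. Vowels, wa-zur, and {:} are not handled here.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a prefix-rule lookup; not the converter's actual code. */
final class PrefixRuleSketch {
    // Only the GA and DA tables are shown, copied from the lists above.
    // The BA, MA, and 'A tables would be filled in the same way.
    private static final Map<String, Set<String>> LEGAL_ROOTS = Map.of(
        "G", Set.of("CA", "DA", "NA", "NYA", "SA", "SHA", "TA", "TZA",
                    "YA", "ZA", "ZHA"),
        "D", Set.of("BA", "GA", "KA", "MA", "NGA", "PA",
                    "B+RA", "B+YA", "G+RA", "G+YA", "K+RA", "K+YA",
                    "M+YA", "P+RA", "P+YA"));

    /** True if the table lists rootStack as a legal root for the given prefix. */
    static boolean takesPrefix(String prefix, String rootStack) {
        Set<String> roots = LEGAL_ROOTS.get(prefix);
        return roots != null && roots.contains(rootStack);
    }

    /** Joins with '-' when the prefix rule applies, and with '+' (a stack) otherwise. */
    static String parse(String prefix, String rootStack) {
        return prefix + (takesPrefix(prefix, rootStack) ? "-" : "+") + rootStack;
    }

    /** A warning is due only when a prefix rule blocked the prefix reading. */
    static boolean warns(String prefix, String rootStack) {
        return !takesPrefix(prefix, rootStack);
    }

    public static void main(String[] args) {
        System.out.println(parse("G", "NA")); // G-NA
        System.out.println(parse("G", "PA")); // G+PA
        System.out.println(warns("G", "PA")); // true
    }
}
```

Keeping the tables as data makes it easy to compare them against whatever prefix rules the input technician used.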
+ ++ The converter will allow any suffix to go with any native root or + prefix-root combination; it will allow any postsuffix to follow any + suffix. It will allow any appendage on any native tsheg + bar. +
+ ++ For example, {SOGS}, {BSOGS}, {BS'EEmGS}, {LE'U'I'O} and + {BSKYABS-'UR-'UNG-'O} are all native tsheg bars in the + converter's eyes. Note the need for disambiguation: {PAM-'AM} + is a native tsheg bar, but {PAM'AM}, which parses as the + three stacks {PA}, {M'A}, and {MA}, is not. (In practice, + appendages rarely occur after prefixes. {BUR-'ANG} appears at + least once in ACIP files and {DGA'-'AM} appears at least twice, but + these may be typos. The converter does allow them, though. + It thinks {BIR'U} and {WAN'U} (which also occur, but only very + rarely) are both non-native, though, and thus treats {'} as U+0F71 + (subscribed) and not U+0F60 (full form) in each case.) +
+ ++ Note a fine point. When turning a tsheg bar into + Tibetan, the ACIP->Tibetan converters assume that subjoined YA + and RA consonants are not fixed-form -- not U+0FBB and U+0FBC -- but + rather are the usual subjoined forms U+0FB1 and U+0FB2. The + only exceptions are the stacks R+Y, Y+Y, and n+d+Y, which are known + to have fixed-form subjoined YA, and the stacks n+d+R+Y (where RA + but not YA is full-form) and K+sh+R, which are known to have + fixed-form subjoined RA. Wa-zur, U+0FAD, is never confused + with full-form subjoined WA, U+0FBA, though, because ACIP represents + the former with {V} and the latter with {W}. Furthermore, the + converter never generates U+0F6A, the fixed-form RA (rango); + U+0F62 is always produced. (Note that U+0F62 is often + displayed as a fixed-form RA itself, as in {RNYA}.) +
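The choice between the usual and the fixed-form subjoined consonants can be pictured as two small exception sets. This is only a sketch under assumptions: the class and method names are hypothetical, and the stacks are keyed by the vowel-less spellings used in the paragraph above.

```java
import java.util.Set;

/** Hypothetical sketch of choosing subjoined YA and RA codepoints. */
final class SubjoinedFormSketch {
    // Stacks named above as having fixed-form subjoined YA (U+0FBB).
    private static final Set<String> FIXED_YA = Set.of("R+Y", "Y+Y", "n+d+Y");
    // Stacks named above as having fixed-form subjoined RA (U+0FBC).
    private static final Set<String> FIXED_RA = Set.of("n+d+R+Y", "K+sh+R");

    /** U+0FBB for the listed exceptions, otherwise the usual U+0FB1. */
    static int subjoinedYa(String stack) {
        return FIXED_YA.contains(stack) ? 0x0FBB : 0x0FB1;
    }

    /** U+0FBC for the listed exceptions, otherwise the usual U+0FB2. */
    static int subjoinedRa(String stack) {
        return FIXED_RA.contains(stack) ? 0x0FBC : 0x0FB2;
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X%n", subjoinedYa("R+Y"));     // U+0FBB
        System.out.printf("U+%04X%n", subjoinedYa("K+Y"));     // U+0FB1
        System.out.printf("U+%04X%n", subjoinedRa("n+d+R+Y")); // U+0FBC
        System.out.printf("U+%04X%n", subjoinedYa("n+d+R+Y")); // U+0FB1
    }
}
```

Note that {n+d+R+Y} appears only in the RA set, which matches the statement that RA but not YA is fixed-form in that stack.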
+ ++ So far, we have spoken about consonants and vowels. In fact, + it is not trivial to determine when something is a consonant and + when it is a vowel. {A} can represent U+0F68, the Tibetan + letter, or the implicit vowel. {'} can represent U+0F71, the + subscribed a-chung, or U+0F60, the full-sized consonant + a-chung. The converter treats {TAA} as {T+AA}, not {TA-AA}, + but treats {TAAA} like {TA-AA}, not {T+AA-A}. It treats + {PA'AM} like {PA-'A-M}, not {P+A'A-M}. In short, it first + tries out treating {'} and {A} like vowels, but will backtrack if + that leads to a clearly invalid tsheg bar. +
+ ++ Finally, a string of numbers can be a tsheg bar also. + It is illegal for numbers and consonants to appear together within + one tsheg bar, however. +
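The rule about numbers can be sketched as a simple classification of a tsheg bar's characters. This is an illustration only: the names are hypothetical, and treating every ASCII letter as consonant material is a simplifying assumption, not the converter's real tokenization.

```java
/** Hypothetical sketch of the digits-versus-letters rule for one tsheg bar. */
final class NumericTshegBarSketch {
    enum Kind { NUMBER, TEXT, ILLEGAL_MIX }

    static Kind classify(String tshegBar) {
        boolean sawDigit = false;
        boolean sawLetter = false;
        for (int i = 0; i < tshegBar.length(); i++) {
            char c = tshegBar.charAt(i);
            if (c >= '0' && c <= '9') {
                sawDigit = true;
            } else if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) {
                // Assumption: any ASCII letter counts as consonant material.
                sawLetter = true;
            }
        }
        if (sawDigit && sawLetter) {
            return Kind.ILLEGAL_MIX;
        }
        return sawDigit ? Kind.NUMBER : Kind.TEXT;
    }

    public static void main(String[] args) {
        System.out.println(classify("108"));   // NUMBER
        System.out.println(classify("BSOGS")); // TEXT
        System.out.println(classify("KA1"));   // ILLEGAL_MIX
    }
}
```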
+ ++ The above is a complete account of the converter's + algorithms for parsing tsheg bars. You, the native + Tibetan speaker, may know that {BSKYABS-'UR-'UNG-'O} is not allowed + and thus think that {B+S+K+YAB+S-'UR-'UNG-'O} should be the result, + but the converter has no such knowledge, and thinks this is a native + tsheg bar equivalent to {B-S+K+YAB-S-'UR-'UNG-'O}. +
+ + + +System Properties
+ ++ The tsheg-bar substitution mechanism is + customizable via system properties. Java developers likely + know what these are, but few users do. This section will + perhaps get a determined person started, but if you have trouble, + contact the + developers so that we can improve this documentation or create a + better user interface. +
+ ++ For the tool to respect the value of a system property, you must + invoke the tool from the command line as follows: +
+ ++ + java + "-Dorg.thdl.tib.text.ttt.ReplacementMap=KAsh=>K+sh,ONYA=>[#ERROR-ONYA-IS-O&]" + -Dorg.thdl.tib.text.ttt.VerboseReplacementMap=true + -jar Jskad.jar + +
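For readers unfamiliar with system properties, the sketch below shows how a Java program can read the two properties named in the command above. The property names come from that command; everything else is an assumption made for illustration. In particular, the class name is hypothetical, and the syntax (comma-separated rules, each split on the first "=>") is inferred from the example value and may not match the converter's real parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of reading the replacement-map system properties. */
final class ReplacementMapSketch {
    static final String MAP_PROPERTY = "org.thdl.tib.text.ttt.ReplacementMap";
    static final String VERBOSE_PROPERTY = "org.thdl.tib.text.ttt.VerboseReplacementMap";

    /** Assumed syntax: comma-separated rules, each split on its first "=>". */
    static Map<String, String> parse(String value) {
        Map<String, String> rules = new LinkedHashMap<>();
        if (value == null || value.isEmpty()) {
            return rules; // property not set: no substitutions
        }
        for (String rule : value.split(",")) {
            int arrow = rule.indexOf("=>");
            if (arrow < 0) {
                throw new IllegalArgumentException("missing => in rule: " + rule);
            }
            rules.put(rule.substring(0, arrow), rule.substring(arrow + 2));
        }
        return rules;
    }

    public static void main(String[] args) {
        // System.getProperty returns null when no -D option supplied the property.
        Map<String, String> rules = parse(System.getProperty(MAP_PROPERTY));
        // Boolean.getBoolean is true only if the property exists and equals "true".
        boolean verbose = Boolean.getBoolean(VERBOSE_PROPERTY);
        if (verbose) {
            rules.forEach((from, to) -> System.out.println(from + " => " + to));
        }
    }
}
```

The quotation marks around the first -D option in the command line matter: without them, most shells would interpret the > and & characters themselves.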
+ + +Known Bugs
+ ++ This section presents areas where the current tool's behavior is + wrong. Before doing serious work with the converter, + familiarize yourself with this section and develop a plan to work + around the bugs or to ensure that your documents will not trigger + the bugs. At the same time, if any of these bugs affects you, + contact the + developers so that we can fix them. The squeaky wheel + surely gets the grease; these bugs may never be fixed if there are + no complaints. +
+ ++ The following are all known bugs: +
+ +-
+
- + When ACIP {MTHARo} is given, the {o} glyph should be centered + under the THA glyph in ACIP->TMW conversions. At present, + the {o} glyph appears underneath the rightmost stack. + Similarly, {\u0F35} and {\u0F37} are not centered properly. + [838594] + +
- + ACIP->TMW conversion for {\u0F3E} is not correct. Fear + not; the character U+0F3E is so rare that no ACIP transliteration + exists for it. [855478] + +
- + In a command-line ACIP->Unicode text file conversion, no + warning or error is given when the input is {KA (KHA)}. (The + output is a text file and does not have a mechanism for indicating + a change in font size.) [855519] + +
Room for Improvement
+ ++ This section presents areas where the current tool could be + improved. None of the current behavior described here is + incontrovertibly flawed (i.e., there are no bugs described here; see + known bugs for those); current behavior is + technically correct. However, the current behavior is not, in + everyone's eyes, perfect. +
+ ++ The following are the current areas in which the tool could be + better: +
+ +-
+
- + The glyph TibetanMachineWeb9.61 -- the {O'I} special combination + (i.e., the glyph for the Unicode string U+0F7C,U+0F60,U+0F72) -- + is never output by the ACIP->TMW converter. It is + sometimes more beautiful than the glyphs that are presently output + (three separate glyphs instead of the one). + +
- + Though the ACIP standard disallows it, you will find in ACIP + documents from the Buddhist Canon things like {/NYA\} where the + standard demands {/NYA/}. Presently, this triggers an error; + it would be better if this were converted like {/NYA/} is, and + triggered only a Most-level warning. + +
- + The hypothetical comment {[# \u0F40 may have been intended...]} + should cause a warning saying that Unicode escapes do not apply + within comments. + +
- + The whitespace after a Unicode escape is + not interpreted correctly when that Unicode escape represents + something that is part of a tsheg bar. For example, + the space in {KA KHA} is treated as a tsheg (i.e., U+0F0B), + but the space in {\u0F40 KHA} is wrongly treated as Tibetan + whitespace. [855482] + +
- + Though not standard, {:} and {:-} sometimes are intended to + represent U+0F14. The latter causes an error; it should + cause a warning suggesting that the Unicode + escape {\u0F14} be used instead. The former is always + treated as U+0F7F; it should cause a warning in some or all + contexts. + +
- + The latest Extended + Wylie Transliteration Scheme standard has assigned private-use + area Unicode codepoints to some TMW glyphs. ACIP documents + that have a Unicode escape in the range + U+F021 to U+F0FF, inclusive, should be interpreted as intending + these TMW glyphs. ACIP->Unicode should by default produce + errors for such things (because they are font-dependent and not + standard); optionally, the private-use area codepoints should be + passed along into the output. + +
- + The tsheg-bar substitution mechanism + should be more general. The useful rule + ONYA=>O& should be supported. + +
- + The converters should support a white list of acceptable + non-native tsheg bars (where the term "tsheg bar" + is to be interpreted somewhat literally here as any characters + between punctuation). Non-native tsheg bars not on + the list should produce warnings or errors. Similarly, but + perhaps less urgently, a syllabary of native tsheg bars + should be supported too. (A workaround is to use coloring, have your word processor delete + everything but the colored text, sort the colored tsheg + bars, and inspect them all by hand. Also, tsheg-bar statistics will help you to + find uncommon tsheg bars.) + +
- + ACIP->Unicode conversions produce Unicode text files at + present. While more compact than Rich Text Format (RTF) + files, a text file does not allow for supporting the two font + sizes in {KA (KA)}. A workaround is to use an ACIP->TMW + conversion followed by a separate TMW->Unicode + conversion. + +
- + The converter should warn for each occurrence of the vowels {'E}, + {'O}, {'EE}, or {'OO}. + +
License
+ +Both the ACIP->Tibetan converters and this document are released +under the THDL +Open Community License Version 1.0.
+ + ++ Please + + + e-mail us + + your comments about this page. +