diff --git a/htdocs/ACIP_To_Tibetan_Converter.html b/htdocs/ACIP_To_Tibetan_Converter.html new file mode 100644 index 0000000..13174a1 --- /dev/null +++ b/htdocs/ACIP_To_Tibetan_Converter.html @@ -0,0 +1,1534 @@ + + + + + + + + + ACIP To Tibetan Converters + + + + + + + + + + +
+ + +
+ + +
+ +

ACIP To Tibetan Converters

+ +

+ This document describes the ACIP->Tibetan converters built atop + Jskad.  + These converters were initially written by David Chandler, a + volunteer with the Tibetan and + Himalayan Digital Library, in the latter half of 2003.  + They built upon the work of Tony Duff, Edward Garrett, and Than + Garson, and they would not be possible without the assistance of + David Chapman, Robert Chilton, and Andrés Montano + Pellegrini.  (Please correct, and forgive, any omissions from + these lists.) +

+ +

+ These converters accept Asian + Classics Input Project (ACIP) transliteration of Tibetan (using + ACIP's Tibetan + Input Code), a Roman transliteration scheme.  ACIP has many + Buddhist texts available in ACIP transliteration, which alone makes + ACIP transliteration (or just ACIP for short) important. +

+ +

+ The converters here accept a text file of ACIP and output either a + Unicode UTF-8-encoded text file or a Rich Text Format (RTF) file of + Tibetan + Machine Web (TMW).  The latter is ready to use onscreen and + to make beautiful hardcopy today; the former will be understood by + software for a long time to come. +

+ +

+ The converters are meant to produce perfect results even for + imperfect input.  To give you an idea of the thought and care + that went into these converters, consider the following partial list + of features: +

+ + + +

+ The ACIP->Unicode and ACIP->TMW converters are equally + good.  There are some differences between the two, + though.  The TMW font has only a fixed set of glyphs, whereas + Unicode can encode arbitrary Tibetan glyphs.  Thus, the + hypothetical ACIP {GAI}, which parses as {G+AI} due to prefix rules, will give an error in an + ACIP->TMW conversion because no glyph exists for this + stack.  The ACIP->Unicode conversion will succeed, having + generated correct Unicode.  This is the only difference between + the two conversions. +

+ +

+ The converters are actively maintained; your feedback is + valued. +

+ +

+ Note that there are also TMW->ACIP converters + available; this document does not cover them. +

+ +

+ In what follows, you will learn how to use the + converters, including all the features listed above, and you'll find + a list of known bugs and places where there is + room for improvement. +

+ + +

Using the Converters

+ +

+ This section briefly describes how the converters are best used. +

+ +

+ The GUI and command-line interfaces are both sufficient; the GUI + interface is your best bet if you've not used the converters + before.  To learn how to invoke these interfaces, read these instructions. +

+ +

+ First, review the known bugs and be sure you can + live with them. +

+ +

+ Now perform a trial conversion of your document with warnings disabled.  You will first + ensure that no outright errors appear in + the input.  If any do, make a copy of the input, edit the + input, and feed it through again.  Feel free to try this out as + soon as you're comfortable; the error messages themselves are + sometimes self-explanatory. +

+ +

+ Once all errors have been corrected, do a conversion with warning + level 'Some'.  If any warnings mark real problems, correct + those problems. +

+ +

+ If you have the patience, now do a conversion with warning level + 'Most'.  If any further warnings mark real + problems, correct those problems. +

+ +

+ The 'All' warning level is pedantic; you might find it useful if + you're writing software that is to produce ACIP transliteration that + is easily read by machines.  If you find any useful warnings at + this level, report them as bugs -- such warnings should be at the 'Most' or + 'Some' level. +

+ +

+ For best results, produce color-coded + output.  Scan the output for non-native tsheg bars and ensure that they + match the original document (the one from which the ACIP + transliteration was produced).  Color-coding is useful because, + for example, {ZHIGN} is probably a typo for {ZHING}; {ZHIGN} will + appear colored, whereas {ZHING} is not colored. +

+ +

+ Note that the ACIP {%} gives a warning every time.  Use the Unicode escape {\u0F35} if you want to avoid + this warning, but note well that Unicode escapes are not part + of the ACIP standard.  Thus, other tools that work with ACIP + transliteration will likely not understand {\u0F35}. +

+ +

+ To save time, you may use the tsheg-bar + substitution mechanism when appropriate. +

+ +

+ Even if your desired end result is Unicode output, an ACIP->TMW + conversion is sometimes useful.  One benefit is that errors + will appear for any ACIP tsheg bar that refers to a consonant + stack not included in TMW.  These stacks should be scrutinized, + because TMW contains over 500 of the most common consonant stacks. +

+ +

+ Finally, check a few folios by hand against the original document to + be sure that you're satisfied with the conversion. +

+ + + + + +

Diagnostics: Warnings and Errors

+ +

+ These converters are designed such that the output is just what you + yourself would create by hand.  Whenever there is doubt about + what output is desired, a warning or error is issued.  This + means that a helpful warning or error message will appear in the + output, and that you will be told at the end of the conversion that + one or more warnings or errors have indeed occurred.  You can + then search your output document for the text [#ERROR or + [#WARNING. +
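Searching for the two markers can also be scripted. The following is a minimal Python sketch, assuming the converted output is available as plain text; the function name `find_diagnostics` and the sample message bodies are hypothetical, and only the opening markers `[#ERROR` and `[#WARNING` come from the description above.

```python
import re

# Only the opening markers are documented above, so only those are matched;
# the message text and the closing bracket are left alone.
DIAGNOSTIC = re.compile(r"\[#(ERROR|WARNING)")


def find_diagnostics(text):
    """Return a (line_number, kind) pair for each marker, in document order."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in DIAGNOSTIC.finditer(line):
            hits.append((line_number, match.group(1)))
    return hits


# Placeholder output; the message wording here is invented for illustration.
sample = "BKRA SHIS\n[#ERROR placeholder message]\nBDE LEGS [#WARNING placeholder message]\n"
for line_number, kind in find_diagnostics(sample):
    print(line_number, kind)
```

A list of line numbers like this makes it easier to jump to each problem in an editor than repeating a manual search.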

+ +

+ There are four warning levels: 'None', 'Some', 'Most', and + 'All'.  Choose 'None' if you don't want any warnings to appear + in your output and be brought to your attention at the end of + conversion.  Choose 'Some' if you want to see the most + important warnings, 'Most' if you want some real confidence in your + output, and 'All' if you've absolutely got to know that the output + is right. +

+ +

+ Errors will always appear; you cannot disable them. +

+ +

+ The following are some (but not all) error and warning messages, + accompanied by further explication: +

+ + + +

+ When warning or error messages refer to a 'Lexical error', that is + an error that occurs when breaking an input text up + into tsheg bars.  To fully understand all warning + and error messages, a thorough understanding of that + process and of the interpretation of ACIP + tsheg bars is required. +

+ + +

Coloration

+ +

+ For ACIP->TMW conversions (not ACIP->Unicode), color-coding of + tsheg bars is an option.  The command-line converters + accept a flag --colors yes|no; the conversion GUI in + Jskad has a checkbox for color-coding. +

+ +

+ Warnings and errors appear in red; tsheg + bars that would parse differently if other prefix rules were used appear in yellow; non-native + tsheg bars appear in green. +

+ + +

Tsheg-bar Statistics

+ +

+ The ACIP->Tibetan converters provide a simple-minded accounting + mechanism with which one can determine which tsheg bars + appear in a conversion or how many times each tsheg bar + appears.  This mechanism is for power users only at this point; + its user interface leaves much to be desired.  If you wish to + produce frequency information, and if you are not familiar with some + sort of scripting (via Excel macros, Unix shell scripts, etc.), then + the output produced will likely be useless to you. +

+ +

+ To support the calculation of frequency statistics, that is, how + many times each tsheg bar appears, the converter can output + all tsheg bars to the Java error console (i.e., + System.err).  Each will appear on the console as many + times as it appears in the input.  To activate this + functionality, set the system property + org.thdl.tib.text.ttt.OutputAllTshegBars to true, + and be prepared for voluminous output.  Massaging this output + into a friendly tabular format is quite possible but not described + here; contact the + developers for help. +

+ +

+ To support the generation of syllabaries, the converter can output + each tsheg bar encountered to the Java error console (i.e., + System.err).  Each will appear on the console only + once, no matter how many times it appears in the input.  To + activate this functionality, set the system + property org.thdl.tib.text.ttt.OutputUniqueTshegBars to + true, and be prepared for voluminous output. +

+ +

+ If desired, each tsheg bar output can be prefixed with a + string of your choice by setting the system + property org.thdl.tib.text.ttt.PrefixForOutputTshegBars + to that string.  This is useful if the converter is producing + other output on the console and you want to separate that output + from the statistics. +
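As a starting point for massaging the console output into a table, here is a minimal Python sketch. It assumes, without confirmation from the converter itself, that each tsheg bar is written on its own line of the captured error console; the function name `tally_tsheg_bars` and the prefix `TSHEG:` are hypothetical choices for illustration.

```python
from collections import Counter


def tally_tsheg_bars(console_lines, prefix=""):
    """Count tsheg bars in captured console lines.

    Assumes one tsheg bar per line (an assumption, not documented behavior).
    If a prefix is given, lines without it are treated as unrelated output.
    """
    counts = Counter()
    for line in console_lines:
        line = line.rstrip("\n")
        if prefix:
            if not line.startswith(prefix):
                continue  # some other console message
            line = line[len(prefix):]
        if line:
            counts[line] += 1
    return counts


captured = ["TSHEG:BKRA\n", "unrelated console message\n", "TSHEG:SHIS\n", "TSHEG:BKRA\n"]
for tsheg_bar, count in tally_tsheg_bars(captured, prefix="TSHEG:").most_common():
    print(f"{tsheg_bar}\t{count}")
```

Choosing a prefix that cannot occur in ACIP transliteration keeps the statistics separable from any other console output.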

+ + + + + +

Tsheg-bar Substitution

+ + + +

+ The ACIP->Tibetan converters provide a mechanism for + automatically correcting common transliteration typos.  For + example, if your document contains 100 occurrences of {KAsh} that + all in fact intend {K+sh}, then you can specify just once the rule + {KAsh}->{K+sh}, and all 100 occurrences will be treated + correctly.  This mechanism is not very easy to use, but it is + completely customizable; you can specify any number of rules.  + You can only perform such substitutions at the tsheg bar + level, though.  This means, for example, that you cannot + specify the rule {GONG SA}->{^GONG SA}; you can only + specify {GONG}->{^GONG}, which would affect {GONG LA} + just as it would affect {GONG SA}. +

+ +

+ To perform substitutions, set the system + property org.thdl.tib.text.ttt.ReplacementMap to be a + comma-delimited list of x=>y pairs.  For example, + if you think BLKU, which parses as B+L+KU, should parse as B-L+KU, + and you want KAsh to be parsed as K+sh because the input operators + mistyped it, then set org.thdl.tib.text.ttt.ReplacementMap + to BLKU=>B-L+KU,KAsh=>K+sh.  Note that this will + not cause {B+L+KU} to become {B-L+KU} -- we are doing the + replacement during lexical analysis of the input file, not during + parsing.  And it will cause {SBLKU} to become {SB-L+KU}, which + is parsed as {S+B-L+KU}, probably not what you wanted.  If you + fear such things, you can see if they happen by setting the system + property org.thdl.tib.text.ttt.VerboseReplacementMap to + true, which will cause an informational message to be + printed on the Java console every time a replacement is made. +

+ +

+ Furthermore, you can use the regular expression notations ^ + and $ to denote the beginning and end of the tsheg + bar, respectively.  For example, ^BLKU$=>B-L+KU + is a useful rule.  Note that full regular expressions are not + supported -- the tool just borrows a bit of the notation.  The + rule ^BLKU=>B-L+KU means that {BLKUM} and {BLKU} will + both be replaced, but {SBLKU} and {SBLKUM} will not be.  The + caret, ^, means that we only match if BLKU is at the + beginning.  The dollar sign, $, means that we only + match if the pattern is at the end.  The rule + BLKU$=>B-L+KU will cause {SBLKU} to be replaced, but not + {BLKUM}.  Note that performance is far better for + ^FOO$ than for ^FOO, FOO$, or + FOO alone. +

+ +

+ Only one substitution is made per tsheg bar.  + ^FOO$-style mappings will be tried first, then + ^FOO-style, then FOO$-style, and finally + FOO-style. +
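The matching order can be illustrated with a small Python sketch. This is not the converter's implementation; it is a model of the documented behavior under two assumptions: within one style, rules are tried in the order given, and an unanchored rule replaces only the first occurrence. The names `parse_rules` and `substitute` are hypothetical.

```python
def parse_rules(spec):
    """Split an 'x=>y,x2=>y2' list into buckets, one per rule style."""
    buckets = {"exact": [], "start": [], "end": [], "anywhere": []}
    for pair in spec.split(","):
        pattern, replacement = pair.split("=>", 1)
        starts = pattern.startswith("^")
        ends = pattern.endswith("$")
        core = pattern
        if starts:
            core = core[1:]
        if ends:
            core = core[:-1]
        if starts and ends:
            style = "exact"
        elif starts:
            style = "start"
        elif ends:
            style = "end"
        else:
            style = "anywhere"
        buckets[style].append((core, replacement))
    return buckets


def substitute(tsheg_bar, buckets):
    """Apply at most one rule: ^FOO$ first, then ^FOO, then FOO$, then FOO."""
    for core, replacement in buckets["exact"]:
        if tsheg_bar == core:
            return replacement
    for core, replacement in buckets["start"]:
        if tsheg_bar.startswith(core):
            return replacement + tsheg_bar[len(core):]
    for core, replacement in buckets["end"]:
        if tsheg_bar.endswith(core):
            return tsheg_bar[:len(tsheg_bar) - len(core)] + replacement
    for core, replacement in buckets["anywhere"]:
        if core in tsheg_bar:
            return tsheg_bar.replace(core, replacement, 1)
    return tsheg_bar


rules = parse_rules("BLKU=>B-L+KU,KAsh=>K+sh")
print(substitute("SBLKU", rules))   # the unanchored rule also fires inside SBLKU
print(substitute("B+L+KU", rules))  # no rule matches, so the input is returned
```

Modeling the rules this way before a long conversion run shows which tsheg bars an unanchored rule would touch unintentionally.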

+ +

+ An example of a useful substitution is o$=>\u0F35.  + This is useful because the converters interpret the ACIP {o} as + U+0F37 by default, but you might prefer U+0F35 in your output. +

+ +

+ Note that you cannot literally replace {FOO} with {BAR} using this + mechanism -- because {F} is not an ACIP character, the lexical analyzer will not + get far enough to use this substitution mechanism.  This is not + considered a design flaw -- serious errors require user + intervention.  Sophisticated users can use something akin to + perl, sed, or awk scripts to preprocess the input. +

+ +

+ Note also that you cannot use the rule ONYA=>O&, + although it would be nice if you could.  Technically, {&} + is considered to be punctuation (i.e., that which divides tsheg + bars) and is not understood inside a tsheg bar. +

+ +

+ Note that this mechanism is also useful for fixing problems in the + converter itself rather than in the input. +

+ +

Unicode Character Escapes

+ +

+ The ACIP->Tibetan converters support some non-standard extensions + to the ACIP + Tibetan Input Code Standard.  One of those is Unicode + character escape sequences.  This extension makes it possible + to represent characters that the ACIP + standard does not address, and to represent one character, + U+0F84, that ACIP does address with the transliteration {\} but that + is misused in practice so often to refer to U+0F3C that the + ACIP->Tibetan converters always produce an error upon seeing {\}. +

+ +

+ Outside of comments, {\uKLMN} is interpreted as referring to the + Unicode character with ordinal KLMN, where each of K, L, M, + and N is a case-insensitive hexadecimal digit.  For example, + the ACIP {KA KHA GA NGA } is exactly equivalent to + {\u0F40\u0f0B\u0F41\u0F0B\u0F42\u0F0B\u0F44\u0f0b}.  Unicode + escapes produce the obvious Unicode in an ACIP->Unicode + conversion, and they produce the correct TMW glyph in an + ACIP->TMW conversion.  There are limits, though, when + converting to TMW; multiple escapes in sequence are not handled + correctly.  It would take a Unicode-to-TMW converter to produce + the correct glyphs for {\u0F42\u0F92\u0FB7\u0F7C}.  The escapes + for vowels and other characters that are mapped to multiple TMW + glyphs are also not handled perfectly.  Best practice is to use + escapes only when necessary in an ACIP->TMW conversion. +

+ +

+ The Unicode character represented need not be a Tibetan one; for + example, {\u0040} produces the at sign, @. +
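The escape syntax itself is simple enough to model in a few lines of Python. The sketch below only illustrates the {\uKLMN} notation: it does not skip comments, knows nothing about TMW glyphs, and the name `expand_unicode_escapes` is hypothetical.

```python
import re

# Backslash, lowercase "u", then exactly four hexadecimal digits of either case.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def expand_unicode_escapes(acip):
    """Replace each \\uKLMN escape with the character whose ordinal is KLMN."""
    return ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), acip)


print(expand_unicode_escapes(r"\u0040"))        # the at sign
print(expand_unicode_escapes(r"\u0F40\u0f0B"))  # U+0F40 followed by U+0F0B
```

Running the input through a check like this before conversion is one way to see exactly which code points a string of escapes stands for.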

+ +

+ Note well the known bug with regard to + whitespace in transliteration that follows a Unicode escape.  + In large part, this bug affects characters that can be + transliterated by other, simpler, standard means. +

+ +

+ If you do want to disable the use of Unicode escapes, set the system property + thdl.tib.text.disallow.unicode.character.escapes.in.acip to + true. +

+ + +

Breaking a Text Up Into tsheg bars

+ +

+ The ACIP->Tibetan converters all take ACIP transliteration as + input.  The first step in conversion is to break up the input + into manageable pieces.  (This is known as lexical + analysis in the context of programming languages, and you may + see the term in diagnostic messages, though a linguist who studies + human languages like Tibetan might balk at the term.)  The + correct pieces in this case are tsheg bars (in ACIP, {TSEG + BAR}), punctuation, comments, whitespace, folio markers, formatting + codes, etc.  In this section, the intricacies of how the + converter does that will be laid bare.  With luck, this will + help you understand why the converter treated one space character + (i.e., ' ', U+0020) as a tsheg and another as Tibetan + whitespace. +

+ +

+ The Tibetan term tsheg bar refers to "the stuff between + the dots".  In the ACIP {BKRA SHIS [# Notice that + this comment is embedded in the Tibetan greeting pronounced 'tashi + delay']BDE LEGS,}, there are four tsheg bars, 'BKRA', + 'SHIS', 'BDE', and 'LEGS'.  In this case 'BDE' is literally + "between the dots"; i.e., it is sandwiched by two U+0F0B + characters (because comments are in a sense invisible).  One of + the "dots" that touches 'LEGS' does not look like a dot -- + it is a shad, U+0F0D.  The lexical analyzer also finds + one comment, which will appear in a Latin typeface in the output, + and it finds four pieces of punctuation -- three tshegs and a + shad. +

+ +

+ The converter will not allow an illegal character into a tsheg + bar.  For example, {jA} is an error and causes an error + message to appear in the output. +

+ +

+ Now that the basic operation is clear from the above example, let's + cover the fine points of how standard ACIP is handled.  We'll + also cover some non-standard constructs that appear commonly in + actual ACIP Release IV texts. +

+ +

+ The first construct that deserves explanation is the line + break.  By the ACIP standard, line breaks in the input do not + become line breaks in the output unless there are two line breaks in + the input.  For example, the ACIP snippet below has only one + line break in the output although three line breaks appear in the + input: +

+ +
+BKRA SHIS 
+BDE LEGS,
+
+THUGS RJE CHE ... and so on ...
+
+ +

+ One fine point is that the converter does not require a space before + a line break. If {SHIS} appears before a line break, the converter + inserts a space so that it's treated just like {SHIS } is + treated.  This oddity is needed to convert real ACIP documents. +

+ +

+ Another fine point is that ACIP's {^} character "eats" a + following space or a newline.  This is so that + {^ GONG SA } is treated identically to + {^GONG SA }. +

+ +

+ Comments always appear in a Latin typeface.  Comments are not + allowed just anywhere -- a comment cannot occur within a single + tsheg bar, for example, and it cannot appear between a + tsheg bar and the tsheg that follows it.  That + is, {BD[#COMMENT]E} is not like {BDE}, and {BDE[#COMMENT] LEGS} + is not like {BDE LEGS} (though {BDE [#COMMENT]LEGS} is). +

+ +

+ Corrections are interpreted as Tibetan, not English, by default, but + there is a built-in list of corrections that should appear in the + output in a Latin typeface.  (Actually, any correction that + starts with a certain string will appear in a Latin typeface.)  + The full list is the following: +

+ +
+"LINE"                // from KD0001I1.ACT
+"DATA"                // from KL0009I2.INC
+"BLANK"               // from KL0009I2.INC
+"NOTE"                // from R0001F.ACM
+"alternate"           // from R0018F.ACE
+"02101-02150 missing" // from R1003A3.INC
+"51501-51550 missing" // from R1003A52.ACT
+"BRTAGS ETC"          // from S0002N.ACT
+"TSAN, ETC"           // from S0015N.ACT
+"SNYOMS, THROUGHOUT"  // from S0016N.ACT
+"KYIS ETC"            // from S0019N.ACT
+"MISSING"             // from S0455M.ACT
+"this"                // from S6850I1B.ALT
+"THIS"                // from S0057M.ACT
+
+ +

+ Somewhat related is the converter's treatment of a few oddball + comments.  The oddity is that these comments use the syntax + {[COMMENT]} rather than the standard syntax {[#COMMENT]}.  The + converter will treat the following as comments: +

+ +
+From S5274I.ACT:
+"[FIRST]"
+From S5274I.ACT:
+"[SECOND]"
+From S0216M.ACT:
+"[Additional verses added by Khen Rinpoche here are]"
+From S0216M.ACT:
+"[ADDENDUM: The text of]"
+From S0216M.ACT:
+"[END OF ADDENDUM]"
+From S0216M.ACT:
+"[Some of the verses added here by Khen Rinpoche include:]"
+From S0216M.ACT (note the typo):
+"[Note that, in the second verse, the {YUL LJONG} was orignally {GANG LJONG},
+and is now recited this way since the ceremony is not only taking place in Tibet.]"
+From S6954E1.ACT:
+"[text missing]"
+From TD3817I.INC:
+"[INCOMPLETE]"
+From S0935m.act:
+"[MISSING PAGE]"
+From S0975I.INC:
+"[MISSING FOLIO]"
+From S0839D1I.INC:
+"[UNCLEAR LINE]"
+From SE6260A.INC:
+"[THE FOLLOWING TEXT HAS INCOMPLETE SECTIONS, WHICH ARE ON ORDER]"
+From SE6260A.INC:
+"[@DATA INCOMPLETE HERE]"
+From SE6260A.INC:
+"[@DATA MISSING HERE]"
+From TD4035I.INC:
+"[LINE APPARENTLY MISSING THIS PAGE]"
+From TD4226I2.INC:
+"[DATA INCOMPLETE HERE]"
+To be consistent with the above:
+"[DATA MISSING HERE]"
+From S0018N.ACT:
+"[FOLLOWING SECTION WAS NOT AVAILABLE WHEN THIS EDITION WAS
+PRINTED, AND IS SUPPLIED FROM ANOTHER, PROBABLY THE ORIGINAL:]"
+From S0018N.ACT:
+"[THESE PAGE NUMBERS RESERVED IN THIS EDITION FOR PAGES
+MISSING FROM ORIGINAL ON WHICH IT WAS BASED]"
+From S0018N.ACT:
+"[PAGE NUMBERS RESERVED FROM THIS EDITION FOR MISSING
+SECTION SUPPLIED BY PRECEDING]"
+From S0057M.ACT:
+"[SW: OK]"
+From S0057M.ACT:
+"[m:ok]"
+From S0057M.ACT:
+"[A FIRST ONE
+MISSING HERE?]"
+From S0195A1.INC:
+"[THE INITIAL PART OF THIS TEXT WAS INPUT BY THE SERA MEY LIBRARY IN
+TIBETAN FONT AND NEEDS TO BE REDONE BY DOUBLE INPUT]"
+
+ +

+ The converter also supports several non-standard folio + markers.  A review of ACIP Release IV texts determined that the + following types of folio markers can appear: +

+ +
+@001
+@001A
+@001B
+@01A.3
+@012A.3
+@[07B]
+@00007B
+@00007
+@B00007
+@[00007A]
+
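The shapes in the list above can be summarized with one regular expression. This Python sketch is a hypothetical pattern written to accept exactly the listed forms; it is not taken from the converter's source, and it assumes that only the side letters A and B occur, since those are the only ones shown.

```python
import re

# Either a bracketed number with an optional side letter, or a bare number with
# an optional leading or trailing side letter and an optional ".line" part.
FOLIO_MARKER = re.compile(r"@(?:\[\d+[AB]?\]|[AB]?\d+[AB]?(?:\.\d+)?)")

LISTED = ["@001", "@001A", "@001B", "@01A.3", "@012A.3",
          "@[07B]", "@00007B", "@00007", "@B00007", "@[00007A]"]


def is_folio_marker(text):
    """Return True if the whole string has one of the folio-marker shapes."""
    return FOLIO_MARKER.fullmatch(text) is not None


for marker in LISTED:
    print(marker, is_folio_marker(marker))
```

The pattern is slightly more permissive than the list (it would also accept a combination such as a leading side letter with a line number), so treat it as a screening aid rather than a specification.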
+ +

+ Similarly, to support real ACIP Release IV texts, the converter + treats {[DD1]}, {[DD2]}, {[ DD ]}, and {[DDD]} just like {[DD]} + (which is specified in the ACIP standard).  It treats {[ BP ]} + and {[BLANK PAGE]} just like {[BP]}, also. +

+ +

+ The lists above were created by a most fallible process of reviewing + a large number of ACIP Release IV texts.  Your suggestions for + additions to these lists are highly valued; please contact the + developers. +

+ +

+ FIXME: describe when the converter treats a space as a tsheg and when a space is Tibetan whitespace.  Describe how a tsheg does not appear after {KA} and {GA} with most vowels, describe the handling of {NGA,} as {NGA ,}.  Talk about dzongkha vs. tibetan when it comes to a tsheg at the end of a string of tsheg bars.  Describe treatment of final line break or lack thereof.  Warn users to watch out for lines that end with {-}.  Describe treatment of {.} in certain contexts as U+0F0C.  Etc. + + + + +

+ + + + +

Parsing tsheg bars: Greedy Stacking and +Nativeness

+ +

+ This section is a technical reference sufficiently detailed so that + you can fully understand the inner workings of the converter as it + decides which Unicode or TMW to use for a given tsheg + bar.  The problem of breaking up a text into + tsheg bars is a separate issue; this section describes + what happens to a tsheg bar after it's been chipped away from + the text. +

+ + +

+ The ACIP->Tibetan converters have a notion of + nativeness.  Each tsheg bar is either native + Tibetan or non-native.  For example, in Buddhist texts written + in Tibetan, Sanskrit mantras often appear in Tibetan + characters.  This "Tibetanized Sanskrit" is + non-native.  The tsheg bars that make up this mantra + (and here, take "tsheg bar" somewhat literally to mean the + characters delimited by punctuation and whitespace) are some native + and some non-native in the converter's eyes.  For example, the + tsheg bar {MA } appears in some mantras, and is thus in + fact non-native.  The converter, however, treats {MA } as + native in all contexts.  Thus, "native" is a + technical term with a slightly different meaning than usual. +

+ +

+ The idea of nativeness is important because it affects how the + converter treats a tsheg bar.  In ACIP transliteration, + the rule is that consonants stack up until punctuation, whitespace, + or a vowel appears.  For example, {RDZYA} is equivalent to + {R+DZ+YA}.  ({DZA} always means the letter {DZA} itself, never + {D+ZA}.)  But this greedy stacking does not apply to {SOGS}, + which is equivalent to {SOG-S}, not {SOG+S}.  Why not?  + Because {SOGS} is a native tsheg bar where GA is the suffix + and SA is the postsuffix.  Similarly, {GNAD} is {G-NAD}, not + {G+NAD}.  Why?  Because GA is a prefix in this native + Tibetan tsheg bar. +

+ +

+ In this section, we will illustrate the inner workings of this + aspect of the converter.  You will be able to determine which + snippets of transliteration the converter considers to be native + tsheg bars, where greedy stacking does not apply except for + the root stack, and which snippets are non-native, and thus wholly + subject to greedy stacking. +

+ +

Anatomy of a Native tsheg bar

+ +

+ First, the lexical analyzer ensures that only the + Tibetan and Sanskrit consonants, the vowels {A}, {I}, {U}, {E}, {O}, + {OO}, {EE}, {i}, {'A}, {'I}, {'U}, {'E}, {'O}, {'OO}, {'EE}, and + {'i}, and the adornments {m} and {:} are allowed in a tsheg + bar. +

+ +

+ As far as the converter is concerned, a native tsheg bar + consists of an optional prefix, a native root stack, an optional + suffix, an optional postsuffix (also known as a secondary suffix) + that may only be present if a suffix is present, and zero or more + appendages (my term, created because I don't know what a + grammarian calls such a thing).  An appendage is one of the + following stack sequences: +

+ + + +

+ A tsheg bar is non-native if it has a non-native root stack + or if it contains the {:} character.  Any vowel is allowed on a + native root stack, even {'EEm}, {i}, or the like. +

+

+ The rule about native root stacks is important, for example, in + determining that {KTYAMS} is {K+T+YAM+SA} instead of {K+T+YAMASA} + (because K+T+YA is not a native stack).  Another example is + {GNVA}, which is treated like {G+N+VA}, not {G-N+VA}, even though + {GNA} is treated like {G-NA} because NA can take a GA prefix.  + The complete list of native stacks is the following: +

+ + + +

+ (Some would argue that LVA is notably absent.  It is seen in + ACIP Buddhist texts in {AELVA}, {LVAm}, {LVU}, {LVUN}, {LVAR}, + {LVE}, {LVANG}, and {LVA}.  Greedy stacking affects none of + these tsheg bars' parsing, however.) +

+ + +

+ Not all characters can be prefixes and the like.  Only the five + prefixes (GA, DA, BA, MA, 'A), ten suffixes (GA, NGA, DA, NA, BA, + MA, 'A, RA, LA, SA), and two postsuffixes (DA, SA) every Tibetan + student knows are allowed, and they cannot appear with vowels.  + (In {LE'U}, {'} is not a suffix -- it is part of an + appendage.)  In fact, certain prefixes may only appear with + certain root stacks.  The reason that these prefix rules matter + is that they govern how tsheg bars are parsed.  For + example, {GNA} is parsed like {G-NA}, because NA takes a GA + prefix.  But {GPA} is parsed like {G+PA}, because PA does not + take a GA prefix. +

+ +

+ Prefix rules are a topic of some controversy; different grammars + give different lists of prefix rules.  For a converter, it is + important that the converter's knowledge of prefix rules matches the + knowledge of the person who typed in the ACIP transliteration, not + that the converter agrees with a grammarian.  For example, if + the input technician thought that PA could take a GA prefix, then + the converter will produce {G+PA} when {G-PA} was intended.  + For this reason, the converter can produce a warning every time a + prefix rule prohibited the treatment of one of the five prefixes as + a prefix.  For example, {GPA} produces this warning.  + However, {GNA} produces no warning, because the converter assumes + that it is unlikely that an input technician would enter {GNA} upon + seeing {G+NA}.  Part of the reason for this assumption is that + the Asian Classics Input Project Entry Operator Transcription + Chart as of Spring, 1993, explicitly enumerates the following + cases for special treatment by input operators: +

+ + + +

+ Regardless, for best results, you should ensure that the input + technician's knowledge of prefix rules matches the converter's + knowledge.  The following are the legal combinations of prefix + and root stack in the converter's eyes: +

+ + + +

+ In the above list, the presence of wa-zur (ACIP {V}) does not + disallow a prefix-root combination; nor does the presence of any + vowel, even {'EEm}.  The presence of {:} does disallow + prefix-root combinations; e.g., {GN'EEm} is {G-N'EEm}, but {GNA:} is + {G+NA:}.  ({GNVA} is parsed as {G+N+VA} not because NVA cannot + take a GA prefix, but because NVA is not a native stack.) +

+ +

+ The converter will allow any suffix to go with any native root or + prefix-root combination; it will allow any postsuffix to follow any + suffix.  It will allow any appendage on any native tsheg + bar. +

+ +

+ For example, {SOGS}, {BSOGS}, {BS'EEmGS}, {LE'U'I'O} and + {BSKYABS-'UR-'UNG-'O} are all native tsheg bars in the + converter's eyes.  Note the need for disambiguation: {PAM-'AM} + is a native tsheg bar, but {PAM'AM}, which parses as the + three stacks {PA}, {M'A}, and {MA}, is not.  (In practice, + appendages rarely occur after suffixes.  {BUR-'ANG} appears at + least once in ACIP files and {DGA'-'AM} appears at least twice, but + these may be typos.  The converter does allow them, though.  + It thinks {BIR'U} and {WAN'U} (which also occur, but only very + rarely) are both non-native, though, and thus treats {'} as U+0F71 + (subscribed) and not U+0F60 (full form) in each case.) +

+ +

+ Note a fine point.  When turning a tsheg bar into + Tibetan, the ACIP->Tibetan converters assume that subjoined YA + and RA consonants are not fixed-form -- not U+0FBB and U+0FBC -- but + rather are the usual subjoined forms U+0FB1 and U+0FB2.  The + only exceptions are the stacks R+Y, Y+Y, and n+d+Y, which are known + to have fixed-form subjoined YA, and the stacks n+d+R+Y (where RA + but not YA is full-form) and K+sh+R, which are known to have + fixed-form subjoined RA.  Wa-zur, U+0FAD, is never confused + with full-form subjoined WA, U+0FBA, though, because ACIP represents + the former with {V} and the latter with {W}.  Furthermore, the + converter never generates U+0F6A, the fixed-form RA (rango); + U+0F62 is always produced.  (Note that U+0F62 is often + displayed as a fixed-form RA itself, as in {RNYA}.) +

+ +

+ So far, we have spoken about consonants and vowels.  In fact, + it is not trivial to determine when something is a consonant and + when it is a vowel.  {A} can represent U+0F68, the Tibetan + letter, or the implicit vowel.  {'} can represent U+0F71, the + subscribed a-chung, or U+0F60, the full-sized consonant + a-chung.  The converter treats {TAA} as {T+AA}, not {TA-AA}, + but treats {TAAA} like {TA-AA}, not {T+AA-A}.  It treats + {PA'AM} like {PA-'A-M}, not {P+A'A-M}.  In short, it first + tries out treating {'} and {A} like vowels, but will backtrack if + that leads to a clearly invalid tsheg bar. +

+ +

+ Finally, a string of numbers can be a tsheg bar also.  + It is illegal for numbers and consonants to appear together within + one tsheg bar, however. +

+ +

+ The above is a complete account of the converter's + algorithms for parsing tsheg bars.  You, the native + Tibetan speaker, may know that {BSKYABS-'UR-'UNG-'O} is not allowed + and thus think that {B+S+K+YAB+S-'UR-'UNG-'O} should be the result, + but the converter has no such knowledge, and thinks this is a native + tsheg bar equivalent to {B-S+K+YAB-S-'UR-'UNG-'O}. +

+ + + +

System Properties

+ +

+ The tsheg-bar substitution mechanism is + customizable via system properties.  Java developers likely + know what these are, but few users do.  This section will + perhaps get a determined person started, but if you have trouble, + contact the + developers so that we can improve this documentation or create a + better user interface. +

+ +

+ For the tool to respect the value of a system property, you must + invoke the tool from the command line as follows: +

+ +

+ + java + "-Dorg.thdl.tib.text.ttt.ReplacementMap=KAsh=>K+sh,ONYA=>[#ERROR-ONYA-IS-O&]" + -Dorg.thdl.tib.text.ttt.VerboseReplacementMap=true + -jar Jskad.jar + +
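Because the replacement map contains characters such as `>` and `&` that many shells treat specially, one option is to launch the tool from a script that passes each argument separately. The Python sketch below only assembles the argument list shown above; the helper name `build_command` is hypothetical, and actually launching the tool (for example with `subprocess.run(command)`) is left to the reader.

```python
def build_command(properties, jar="Jskad.jar"):
    """Assemble a java invocation with one -Dname=value argument per property."""
    command = ["java"]
    for name, value in properties.items():
        command.append(f"-D{name}={value}")
    command += ["-jar", jar]
    return command


command = build_command({
    "org.thdl.tib.text.ttt.ReplacementMap": "KAsh=>K+sh,ONYA=>[#ERROR-ONYA-IS-O&]",
    "org.thdl.tib.text.ttt.VerboseReplacementMap": "true",
})
for argument in command:
    print(argument)
```

Passing the arguments as a list means no shell ever interprets the property values, so no quoting is needed.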

+ + +

Known Bugs

+ +

+ This section presents areas where the current tool's behavior is + wrong.  Before doing serious work with the converter, + familiarize yourself with this section and develop a plan to work + around the bugs or to ensure that your documents will not trigger + the bugs.  At the same time, if any of these bugs affects you, + contact the + developers so that we can fix them.  The squeaky wheel + surely gets the grease; these bugs may never be fixed if there are + no complaints. +

+ +

+ The following are all known bugs: +

+ + + + + +

Room for Improvement

+ +

+ This section presents areas where the current tool could be + improved.  None of the current behavior described here is + incontrovertibly flawed (i.e., no bugs are described here; see + known bugs for those); current behavior is + technically correct.  However, the current behavior is not, in + everyone's eyes, perfect. +

+ +

+ The following are the current areas in which the tool could be + better: +

+ + + + +

License

+ +

Both the ACIP->Tibetan converters and this document are released +under the THDL +Open Community License Version 1.0.

+ + +

+ Please + + + e-mail us + + your comments about this page. +

+ +

+The + + THDL Tools +project is generously hosted by: + + + SourceForge Logo + + +

+
+ + + + diff --git a/htdocs/TMW_RTF_TO_THDL_WYLIE.html b/htdocs/TMW_RTF_TO_THDL_WYLIE.html index abb5ea9..6d6d3cc 100644 --- a/htdocs/TMW_RTF_TO_THDL_WYLIE.html +++ b/htdocs/TMW_RTF_TO_THDL_WYLIE.html @@ -7,7 +7,7 @@ - Tibetan Machine Web Converter + Converters in Jskad @@ -44,259 +44,141 @@
-

Tibetan Machine Web Converter

+

Converters in Jskad

In recent versions of Jskad, the 'Tools' menu has an option 'Launch - Converter...'.  If you use that option, you will find a - first-class Tibetan-to-Tibetan and Tibetan-to-Wylie converter.  - That converter has a user-friendly GUI interface, and it tells you - when things go wrong (even things as subtle as your having selected - the wrong conversion).  If you need a command-line interface to - that converter, however, read on. + Converter...'.  If you use that option, you will find a set of + first-class converters that can convert digital Tibetan from one + form to another.  (A command-line interface is also available; + see below.)

- In the same JAR file as Jskad, power users will find a command-line - utility that converts Tibetan documents from one digital - representation to another.  The converter embodies the same - technology as Jskad itself, but often works even when Jskad fails - due to Java's presently poor support for viewing RTF - documents.  This command-line utility converts a Tibetan - Machine Web-encoded (TMW-encoded) Rich Text Format (RTF) file to - either of these three output formats: + Some of the converters there are based on Jskad technology, but all + are first-class in the sense that they are well thought-out, well + tested, and handle errors + nicely.  Certain features in Jskad are quite buggy; for + example, its keyboards do not always work as desired, and even when they + do, they silently drop certain input characters.  Do not worry + that the converters described here suffer from these flaws; not one + character of input is ever silently dropped.  It is the + intention of the developers that a Buddhist canon one day could be + entrusted to these converters.  Before you do that, though, + please contact the + developers to be sure that this documentation is up to date and + to develop a custom validation and verification plan.  None of + the converters has yet been hand-validated on a real text of any + size, but extensive unit testing has been performed for each + conversion at every stage of development.

+ The following converters are available:

+

- In addition, this converter can convert Tibetan Machine RTF files to - Tibetan Machine Web RTF files, and takes precautions to ensure that - only a 100% perfect conversion is done in both directions - (TM->TMW and TMW>TM).  One such precaution is that two - independent teams (Garrett and Garson, Chandler) turned the Tibetan - Machine Web - documentation into TM<->TMW tables.  These tables - were compared, giving full confidence that the tables are as - accurate as the documentation (which has a - few flaws itself).  That documentation has not been - extensively verified against the actual fonts, however.  - Another precaution is that any unknown characters cause the - conversion to fail, and the result is a document containing merely - the unknown characters.  (There are some known, illegal glyphs - created by Tibet Doc, and the converter handles the ones it knows of - and treats the rest as unknown.) + Moreover, EWTS->Unicode and EWTS->TMW converters are in + development.  Wylie + Word 2.0 has better EWTS support at present.

- This converter is smart enough to solve the "curly-brace - problem", wherein Tahoma '{', '}', and '\' characters appear - instead of the TMW stacks they are supposed to represent.  This - problem originates with certain versions of Microsoft Word's Rich - Text Format writing capabilities. + Above, RTF is an abbreviation for Rich Text Format; + Text refers to an unformatted text file (in one of several + encodings); TMW refers to the Tibetan + Machine Web font; TM refers to the Tibetan + Machine font; Unicode refers to the Tibetan Unicode characters in the range + U+0F00-U+0FFF mainly but also sometimes includes other Unicode + characters; EWTS refers to Tibetan encoded using the Extended + Wylie Transliteration Scheme, a Roman transliteration scheme; + ACIP refers to Tibetan encoded using Asian Classics Input Project + (ACIP) Tibetan + Input Code, another Roman transliteration scheme. +

+ +

Invoking the Converters

+ +

+ The converters have a user-friendly GUI interface, and it tells you + when things go wrong (from things like the lack of a needed glyph in + the output font to things like your having selected the wrong + conversion).  The GUI is not properly documented here, and + probably will not be until you contact the + developers and ask them to document it.

- Further, this converter gives a polite error message when a given - .rtf file simply cannot be read by the version of Java used. + To use the GUI, first launch Jskad + itself.  Then select 'Launch Converter...' from the 'Tools' + menu.  Let's hope from there it's self-explanatory, because it + is not yet properly documented.

- Perhaps most importantly, the converter has a - --find-some-non-tmw mode of operation that gives you, the - user, confidence that RTF reading and writing idiosyncrasies are not - going to interfere with a flawless conversion.  It does so by - printing out the first occurrence of a given character in a non-TMW - font.  Here is some example output: + For batch conversions of many files, a command-line interface to the + converters may be more suitable than the GUI interface.  In the + same JAR file as Jskad, power users will find a command-line utility + that can do everything the GUI interface to the converters can + do.  To learn how to invoke it, see the output you get when you + use this invocation:

 java -cp "c:\my thdl tools\Jskad.jar" \
-     org.thdl.tib.input.TibetanConverter \
-        --find-some-non-tmw \
-        "Dalai Lama Fifth History 01.rtf"
-
-Non-TMW character newline [decimal 10] in the font Tahoma appears first at location 39
-Non-TMW character ' ' [decimal 32] in the font TimesNewRoman appears first at location 45
-Non-TMW character '}' [decimal 125] in the font Tahoma appears first at location 66
-Non-TMW character '{' [decimal 123] in the font Tahoma appears first at location 219
-Non-TMW character '\' [decimal 92] in the font Tahoma appears first at location 1237
-Non-TMW character newline [decimal 10] in the font Times New Roman appears first at location 9754
+     org.thdl.tib.input.TibetanConverter --help
 
-

- Given the above output, you can be sure that a flawless conversion - (barring the appearance of known bugs) will - result when you run java -cp "c:\my thdl tools\Jskad.jar" - org.thdl.tib.input.TibetanConverter --to-wylie "Dalai Lama Fifth - History 01.rtf" > "Dalai Lama Fifth History 01 in THDL Extended - Wylie.rtf".  (Note that the '>' causes the output to be - directed to the file named thereafter; this is quite handy.)  - This is because the only text in the input file besides Tibetan is - whitespace and the Tahoma characters '{', '}', and - '\'. These Tahoma characters are understood by the tool; - they are symptoms of the "curly-brace problem". + where you must replace "c:\my thdl tools\Jskad.jar" with the + appropriate path on your system.

-

Failed Conversions

+ -

- In this section, you'll learn how to tell if a conversion has - succeeded in full, ran into minor problems, or failed altogether. -

+
License
-

TMW to Wylie

+

Both the converters and this document are released under the THDL +Open Community License Version 1.0.

-

- - This section is too up-to-date -- this is documenting plans for the - future. At present, an error message like - <<[[JSKAD_TMW_TO_WYLIE_ERROR_NO_SUCH_WYLIE: Cannot - convert DuffCode <duffcode font=TibetanMachineWeb7 charNum=72 - character=H/> to THDL Extended Wylie. Please see the - documentation for the TMW font and transcribe this - yourself.]]>> appears. - -

- -

- Note that some TMW glyphs have no transliteration in Exteded - Wylie.  When you encounter such a glyph, you'll find - \tmwXYYY in your output, where X tells you which TMW font - the troublesome glyph comes from and YYY is the decimal number of - the glyph in that font (which is a number between 000 and 255 - inclusive, usually between 33 and 126).  The following are - values corresponding to X: -

- - - -

- Upon finding a \tmwXYYY sequence in your output, you should - consult the - documentation for the specific TMW font named.  Find the - glyph (by its YYY value) and decide how to proceed.  If you - find a glyph that you believe should have been converted into - Extended Wylie by the tool, please report this as a bug through the - SourceForge website or via e-mail. -

- -

Other Conversions

- -

- The other conversions are all-or-nothing.  That is, if you run - into any trouble whatsoever, the result will be a file containing - just the problematic glyphs, each preceded by achen (i.e., U+0F68, - the letter whose THDL Extended Wylie representation is 'a').  - These glyphs will be bracketed on the left by U+0F3C (for which the - THDL Extended Wylie is '(') and on the right by U+0F3D (for which - the THDL Extended Wylie is ')').  If your result is as long as - your input, then the conversion went flawlessly. -

- -

- There is one TMW glyph (TibetanMachineWeb7, glyph 91 [\tmw7091]) - that has no Tibetan Machine equivalent.  This glyph is the only - TMW glyph that can cause a TMW->TM conversion to fail.  It - is fairly common, though, especially if you've used Jskad to prepare - your document.  It might be appropriate to change the document - to use TibetanMachineWeb7, glyph 90 [\tmw7090], a similar glyph that - does have a TM equivalent. -

- -

- You might consider using Jskad to convert documents that give - errors, as it has better error reporting and can tell you just - what's wrong. -

-

- If you ever encounter problems in a TM->TMW conversion, please - send us mail with the error report (and the problem input document's - resulting document) so that we can improve our tools. -

- -

Invoking the Converter

- -

- First add Jskad.jar to your CLASSPATH.  You can do this by - setting an environment variable CLASSPATH to contain the absolute - path of the Jskad.jar file and then running the command java - org.thdl.tib.input.TibetanConverter.  Alternatively, you - can use java -cp "c:\my tibetan documents\Jskad.jar" - org.thdl.tib.input.TibetanConverter where you put in the - appropriate path to Jskad.jar.  You will see usage information - appear if you do this correctly; you'll see a message like - java.lang.NoClassDefFoundError: - org/thdl/tib/input/TibetanConverter; Exception in thread - "main" if you've not correctly told Java where to find - Jskad.jar. -

- -

Known Bugs

- -

- All known bugs are listed in this section.  They're more likely - to be fixed if users complain, so complain away. -

- -

- There are no known bugs at present. -

Please diff --git a/htdocs/TMW_or_TM_To_X_Converters.html b/htdocs/TMW_or_TM_To_X_Converters.html new file mode 100644 index 0000000..e2d4812 --- /dev/null +++ b/htdocs/TMW_or_TM_To_X_Converters.html @@ -0,0 +1,368 @@ + + + + + + + + + Converting from TM or TMW + + + + + + + + +

+ +
+ + +
+ + +
+ +

Converting from Tibetan Machine or Tibetan Machine Web

+ +

+ Among the converters in + Jskad are some converters that take input that is encoded to use + either the Tibetan + Machine (TM) or Tibetan + Machine Web (TMW) fonts.  These converters are described + here. +

+ +

+ First, to learn how to invoke the converters, see these instructions. +

+ +

+ The converters embody the same technology as Jskad + itself, but often work even when Jskad fails due to Java's presently + poor support for viewing Rich Text Format (RTF) documents.  + These converters can convert a TMW-encoded RTF file to any of these + output formats: +

+ + +

+ In addition, these converters can convert a Tibetan Machine RTF file to + a Tibetan Machine Web RTF file. +

+ + +

+ All the converters take precautions to ensure that only a 100% + perfect conversion is done.  One such precaution is that two + independent teams (Garrett and Garson, Chandler) turned the Tibetan + Machine Web + documentation into TM<->TMW tables.  These tables + were compared, giving full confidence that the tables are as + accurate as the documentation (which has a few flaws itself, + documented in the errata we have + created).  That documentation has been verified against the + actual fonts.  David Chapman's assistance in this area has been + invaluable. +

+ +

+ Another precaution is that any unknown characters (in the font being + converted from) cause the conversion to fail, + and the result is either a document containing merely the unknown + characters or a document with conspicuous error messages + interspersed. +

+ +

+ These converters are smart enough to solve the "curly-brace + problem", wherein '{', '}', and '\' characters in the Tahoma + font appear instead of the TMW stacks they are supposed to + represent.  This problem originates with certain versions of + Microsoft Word's Rich Text Format writing capabilities.  These + converters are also smart enough to work around Java's Bug + 4907759. +

+ +

+ Furthermore, these converters give a polite error message when a + given RTF file simply cannot be read by the version of Java used. +

+ + +

Invoking the Converters

+ +

+ See here for details + on how to invoke the converters. +

+ + + +

Failed Conversions

+ +

+ In this section, you'll learn how to tell whether a conversion + succeeded in full, ran into minor problems, or failed altogether. +

+ +

TMW to ACIP

+ +

+ When a TMW->ACIP conversion fails, a message such as + [# JSKAD_TMW_TO_ACIP_ERROR_NO_SUCH_ACIP: Cannot convert + <glyph font=TibetanMachineWeb8 charNum=38 character=&/> to + ACIP. Please transcribe this yourself.] will appear in your + output, but it will be amidst the successfully converted text. +

+ +

TMW to Wylie (i.e., EWTS)

+ +

+ A TMW to EWTS conversion rarely fails; EWTS is almost entirely + comprehensive (and may have been revised to be comprehensive by the + time you read this). +

+ +

+ That said, you may want to search the output for EWTS constructs + that you don't like, such as \u0F39- and + \uF021-style escape sequences. +
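One way to run such a search is sketched below. It assumes the output has been read into a string as plain text and that an escape sequence is a backslash, the letter u, and exactly four hexadecimal digits, as in the two examples just given; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeFinderSketch {
    // Regex for: one literal backslash, the letter u, then four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u[0-9A-Fa-f]{4}");

    /** Returns every escape sequence found, in order of appearance. */
    static List<String> findEscapes(String ewts) {
        List<String> found = new ArrayList<>();
        Matcher m = ESCAPE.matcher(ewts);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The doubled backslashes are Java string syntax for one backslash.
        String sample = "ka\\u0F39 kha \\uF021";
        for (String escape : findEscapes(sample)) {
            System.out.println(escape);
        }
    }
}
```

If the output is an RTF file rather than plain text, backslashes may themselves be escaped by the RTF writer, so the pattern would need adjusting.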

+ +

+ If a TMW glyph has no transliteration according to EWTS, + then an error message like + <<[[JSKAD_TMW_TO_WYLIE_ERROR_NO_SUCH_WYLIE: Cannot convert + <glyph font=TibetanMachineWeb7 charNum=95 character=_/> to + THDL Extended Wylie. Please see the documentation for the TM or TMW + font and transcribe this yourself.]]>> appears in the + output. +

+ +

+ Upon finding such a message in your output, you should consult the + + documentation for the specific TMW font named.  Find the + glyph and decide how to proceed.  If you find a glyph that you + believe should have been converted into Extended Wylie by the tool, + please report this as a bug through the SourceForge website or via + e-mail. +
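For long documents, looking for these messages by eye is tedious. The following is a rough sketch of a checker that counts them. The two marker strings are taken from the example messages in this document; everything else (the class name, reading the file as UTF-8 text, treating the markers as contiguous text) is an assumption, and in an RTF file formatting codes could in principle split a marker.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class ErrorMarkerScanSketch {
    // Marker strings as they appear in the example error messages.
    static final String[] MARKERS = {
        "JSKAD_TMW_TO_WYLIE_ERROR_NO_SUCH_WYLIE",
        "JSKAD_TMW_TO_ACIP_ERROR_NO_SUCH_ACIP"
    };

    /** Counts all occurrences of the known markers in the text. */
    static int countMarkers(String text) {
        int count = 0;
        for (String marker : MARKERS) {
            int from = 0;
            while ((from = text.indexOf(marker, from)) >= 0) {
                count++;
                from += marker.length();
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ErrorMarkerScanSketch.java <converted-file>");
            return;
        }
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println(countMarkers(text) + " conversion error message(s) found");
    }
}
```

A count of zero suggests, but does not prove, that no glyph was left unconverted.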

+ + +

TMW to Unicode, TM to TMW, and TMW to TM Conversions

+ +

+ The TMW->Unicode, TM->TMW, and TMW->TM conversions are + all-or-nothing.  That is, if you run into any trouble + whatsoever, the result will be a file containing just the + problematic glyphs, each preceded by a-chen (i.e., U+0F68, the + letter whose THDL Extended Wylie representation is 'a').  These + glyphs will be bracketed on the left by U+0F3C (for which the THDL + Extended Wylie is '(') and on the right by U+0F3D (for which the + THDL Extended Wylie is ')').  If your result is as long as your + input, then the conversion went flawlessly. +

+ +

+ There is one TMW glyph (TibetanMachineWeb7, glyph 91 [\tmw7091]) + that has no Tibetan Machine equivalent.  This glyph is the only + TMW glyph that can cause a TMW->TM conversion to fail.  It + is fairly common, though, especially if you've used Jskad to prepare + your document.  It might be appropriate to change the document + to use TibetanMachineWeb7, glyph 90 (decimal ordinal 90, that is), a + similar glyph that does have a TM equivalent. +

+ +

+ You might consider using the GUI converter interface in Jskad to + convert documents that give impenetrable errors when converted by + the command-line tool, as the GUI has better error reporting and can + tell you just what's wrong. +

+ + +

Finding Potential Problems Before Conversion

+ +

+ The converters that take TM and TMW input deal with problematic + input in a clean way, but you might prefer the mechanism described + here. +

+ +

+ There is a --find-some-non-tmw mode of operation that gives + you, the user, confidence that RTF reading and writing + idiosyncrasies are not going to interfere with a flawless + conversion.  It does so by printing out the first occurrence of + a given character in a non-TMW font.  Here is some example + output: +

+
+java -cp "c:\my thdl tools\Jskad.jar" \
+     org.thdl.tib.input.TibetanConverter \
+        --find-some-non-tmw \
+        "Dalai Lama Fifth History 01.rtf"
+
+Non-TMW character newline [decimal 10] in the font Tahoma appears first at location 39
+Non-TMW character ' ' [decimal 32] in the font TimesNewRoman appears first at location 45
+Non-TMW character '}' [decimal 125] in the font Tahoma appears first at location 66
+Non-TMW character '{' [decimal 123] in the font Tahoma appears first at location 219
+Non-TMW character '\' [decimal 92] in the font Tahoma appears first at location 1237
+Non-TMW character newline [decimal 10] in the font Times New Roman appears first at location 9754
+
+ +

+ Given the above output, you can be sure that a flawless conversion + (barring the appearance of known bugs) will + result when you run java -cp "c:\my thdl tools\Jskad.jar" + org.thdl.tib.input.TibetanConverter --to-wylie "Dalai Lama Fifth + History 01.rtf" > "Dalai Lama Fifth History 01 in THDL Extended + Wylie.rtf".  (Note that the '>' causes the output to be + directed to the file named thereafter; this is quite handy.)  + This is because the only text in the input file besides Tibetan is + whitespace and the Tahoma characters '{', '}', and + '\'. These Tahoma characters are understood by the tool; + they are symptoms of the "curly-brace problem". +
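The reasoning above can be automated. Here is a hedged sketch that reads report lines of the form shown in the example output and decides whether each describes only whitespace or one of the three Tahoma characters. The line format is inferred from that single example; the class name and the choice of which decimal values count as whitespace (9, 10, 13, 32) are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonTmwReportSketch {
    // Captures: decimal value, font name, location.
    private static final Pattern LINE = Pattern.compile(
        "Non-TMW character .+ \\[decimal (\\d+)\\] in the font (.+) appears first at location (\\d+)");

    /** True if the line reports whitespace or a Tahoma '{', '}', or backslash. */
    static boolean isHarmless(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return false; // unrecognized line: treat it as suspicious
        }
        int code = Integer.parseInt(m.group(1));
        String font = m.group(2);
        boolean whitespace = code == 9 || code == 10 || code == 13 || code == 32;
        boolean curlyBrace = font.equals("Tahoma")
            && (code == 123 || code == 125 || code == 92);
        return whitespace || curlyBrace;
    }

    public static void main(String[] args) throws IOException {
        // Pipe the --find-some-non-tmw report into this program.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.trim().isEmpty() && !isHarmless(line)) {
                System.out.println("Check by hand: " + line);
            }
        }
    }
}
```

Any line the sketch prints deserves a look in the original document before you trust the conversion.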

+ +

+ There is a similar --find-some-non-tm mode of operation, + useful for ensuring a trouble-free TM->TMW conversion. +

+ + +

Known Bugs

+ +

+ All known bugs are listed in this section.  They're more likely + to be fixed if users complain, so complain away.  And if you + ever encounter problems in a conversion that are not listed here, + please send us mail with the error report (and the document that + resulted from the problem input) so that we can improve our + tools.  The bugs are as follows: +

+ + + +

+

+ +

License

+ +

Both the converters and this document are released under the THDL +Open Community License Version 1.0.

+ + + +

+ Please + + + e-mail us + + your comments about this page. +

+ +

+The + + THDL Tools +project is generously hosted by: + + + SourceForge Logo + + +

+
+ + + +