Converting from TM or TMW

Converting from Tibetan Machine or Tibetan Machine Web

Jskad includes converters that take input encoded for either the Tibetan Machine (TM) or Tibetan Machine Web (TMW) fonts. These converters are described here.

First, to learn how to invoke the converters, see these instructions.

The converters embody the same technology as Jskad itself, but often work even when Jskad fails due to Java's presently poor support for viewing Rich Text Format (RTF) documents. These converters can convert a TMW-encoded RTF file to any of these output formats:

an RTF file using Unicode, a standard encoding that will be widely supported in the future
an RTF file using the appropriate THDL Extended Wylie (EWTS) instead of TMW
a text file using the appropriate THDL Extended Wylie (EWTS) instead of TMW
an RTF file using the appropriate Asian Classics Input Project (ACIP) Tibetan Input Code instead of TMW
a text file using the appropriate Asian Classics Input Project (ACIP) Tibetan Input Code instead of TMW
an RTF file using the Tibetan Machine encoding (used in legacy systems).

In addition, these converters can convert a Tibetan Machine RTF file to a Tibetan Machine Web RTF file.

All the converters take precautions to ensure that only a 100% perfect conversion is done. One such precaution is that two independent teams (Garrett and Garson, Chandler) turned the Tibetan Machine Web documentation into TM<->TMW tables. These tables were compared, giving full confidence that the tables are as accurate as the documentation (which has a few flaws itself, documented in the errata we have created). That documentation has been verified against the actual fonts. David Chapman's assistance in this area has been invaluable.

Another precaution is that any unknown characters (in the font being converted from) cause the conversion to fail, and the result is either a document containing merely the unknown characters or a document with conspicuous error messages interspersed.

These converters are smart enough to solve the "curly-brace problem", wherein '{', '}', and '\' characters in the Tahoma font appear instead of the TMW stacks they are supposed to represent. This problem originates with certain versions of Microsoft Word's Rich Text Format writing capabilities. These converters are also smart enough to work around Java's Bug 4907759.

Furthermore, these converters give a polite error message when a given RTF file simply cannot be read by the version of Java used.

Invoking the Converters

See here for details on how to invoke the converters.

Failed Conversions

In this section, you'll learn how to tell whether a conversion has succeeded in full, run into minor problems, or failed altogether.

TMW to ACIP

When a TMW->ACIP conversion fails, a message such as [# JSKAD_TMW_TO_ACIP_ERROR_NO_SUCH_ACIP: Cannot convert <glyph font=TibetanMachineWeb8 charNum=38 character=&/> to ACIP. Please transcribe this yourself.] will appear in your output, but it will be amidst the successfully converted text.
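Because these messages sit in the middle of otherwise good output, it can help to list them mechanically. The following is a minimal sketch, not part of Jskad: the pattern is modelled only on the sample message above, the function name find_acip_errors is made up for illustration, and the surrounding ACIP text in the sample is invented. It assumes you are scanning plain-text output; RTF output may wrap or escape the message differently.

```python
import re

# Pattern modelled on the single sample message shown above (assumption:
# other messages follow the same shape).
ACIP_ERROR = re.compile(
    r"\[# JSKAD_TMW_TO_ACIP_ERROR_NO_SUCH_ACIP: Cannot convert "
    r"<glyph font=(?P<font>\w+) charNum=(?P<char_num>\d+) "
    r"character=(?P<character>.*?)/> to ACIP\."
)


def find_acip_errors(text):
    """Return (font, char_num, character) for each error message in text."""
    return [
        (m.group("font"), int(m.group("char_num")), m.group("character"))
        for m in ACIP_ERROR.finditer(text)
    ]


# Invented sample: the ACIP around the message is for illustration only.
sample = (
    "KA KHA [# JSKAD_TMW_TO_ACIP_ERROR_NO_SUCH_ACIP: Cannot convert "
    "<glyph font=TibetanMachineWeb8 charNum=38 character=&/> to ACIP. "
    "Please transcribe this yourself.] GA"
)
for font, char_num, character in find_acip_errors(sample):
    print(font, char_num, character)
```

Each tuple names the font and glyph number to look up when you transcribe the glyph by hand.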

TMW to Wylie (i.e., EWTS)

A TMW to EWTS conversion rarely fails; EWTS is almost entirely comprehensive (and may have been revised to be comprehensive by the time you read this).

That said, you may want to search the output for EWTS constructs that you don't like, such as \u0F39- and \uF021-style escape sequences.
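One way to run such a search is sketched below. It is an illustration rather than a Jskad feature: the function name count_escapes and the sample text are invented, and the pattern assumes every escape is a backslash, a 'u', and exactly four hexadecimal digits, as in the two examples above. Apply it to text output only, since RTF uses its own \u control words.

```python
import re
from collections import Counter

# Matches \uXXXX with exactly four hex digits, e.g. \u0F39 or \uF021.
ESCAPE = re.compile(r"\\u[0-9A-Fa-f]{4}")


def count_escapes(ewts_text):
    """Count how often each \\uXXXX escape occurs in EWTS text."""
    return Counter(m.group(0) for m in ESCAPE.finditer(ewts_text))


# Invented sample text, for illustration only.
sample = r"bkra shis \u0F39 bde legs \uF021 \u0F39"
for escape, count in sorted(count_escapes(sample).items()):
    print(escape, count)
```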

If a TMW glyph has no transliteration according to EWTS, then an error message like <<[[JSKAD_TMW_TO_WYLIE_ERROR_NO_SUCH_WYLIE: Cannot convert <glyph font=TibetanMachineWeb7 charNum=95 character=_/> to THDL Extended Wylie. Please see the documentation for the TM or TMW font and transcribe this yourself.]]>> appears in the output.

Upon finding such a message in your output, you should consult the documentation for the specific TMW font named. Find the glyph and decide how to proceed. If you find a glyph that you believe should have been converted into Extended Wylie by the tool, please report this as a bug through the SourceForge website or via e-mail.
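If a long document yields many such messages, grouping them by font tells you which font documentation to open. Here is a rough sketch under the assumption that every message has the same shape as the sample above; the function name glyphs_to_look_up and the sample text are hypothetical, and the sketch targets plain-text output.

```python
import re
from collections import defaultdict

# Pattern modelled on the sample message shown above (assumption).
WYLIE_ERROR = re.compile(
    r"<<\[\[JSKAD_TMW_TO_WYLIE_ERROR_NO_SUCH_WYLIE: Cannot convert "
    r"<glyph font=(\w+) charNum=(\d+) character=.*?/>"
)


def glyphs_to_look_up(text):
    """Map each font name to the sorted glyph numbers that failed to convert."""
    by_font = defaultdict(set)
    for font, char_num in WYLIE_ERROR.findall(text):
        by_font[font].add(int(char_num))
    return {font: sorted(nums) for font, nums in by_font.items()}


message = (
    "<<[[JSKAD_TMW_TO_WYLIE_ERROR_NO_SUCH_WYLIE: Cannot convert "
    "<glyph font=TibetanMachineWeb7 charNum=95 character=_/> to THDL "
    "Extended Wylie. Please see the documentation for the TM or TMW font "
    "and transcribe this yourself.]]>>"
)
# Invented sample: the same glyph fails twice and is reported once.
sample = "bla ma " + message + " dge " + message
print(glyphs_to_look_up(sample))
```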

TMW to Unicode, TM to TMW, and TMW to TM Conversions

The TMW->Unicode, TM->TMW, and TMW->TM conversions are all-or-nothing. That is, if you run into any trouble whatsoever, the result will be a file containing just the problematic glyphs, each preceded by a-chen (i.e., U+0F68, the letter whose THDL Extended Wylie representation is 'a'). These glyphs will be bracketed on the left by U+0F3C (for which the THDL Extended Wylie is '(') and on the right by U+0F3D (for which the THDL Extended Wylie is ')'). If your result is as long as your input, then the conversion went flawlessly.

There is one TMW glyph (TibetanMachineWeb7, glyph 91 [\tmw7091]) that has no Tibetan Machine equivalent. This glyph is the only TMW glyph that can cause a TMW->TM conversion to fail. It is fairly common, though, especially if you've used Jskad to prepare your document. It might be appropriate to change the document to use TibetanMachineWeb7, glyph 90 (decimal ordinal 90, that is), a similar glyph that does have a TM equivalent.
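The substitution can be pictured as follows. This is only a sketch: it assumes a hypothetical in-memory representation of a document as (font, glyph number) pairs, which is not how Jskad or an RTF file stores text, and the function name substitute_glyph_91 is made up. It records where substitutions happened so that you can review each one, since glyph 90 is similar rather than identical.

```python
# Hypothetical representation: a document as a list of (font, char_num) pairs.
NO_TM_EQUIVALENT = ("TibetanMachineWeb7", 91)
SIMILAR_GLYPH = ("TibetanMachineWeb7", 90)


def substitute_glyph_91(glyphs):
    """Return a copy with glyph 91 replaced by glyph 90, plus replaced positions."""
    result, replaced = [], []
    for index, glyph in enumerate(glyphs):
        if glyph == NO_TM_EQUIVALENT:
            result.append(SIMILAR_GLYPH)
            replaced.append(index)
        else:
            result.append(glyph)
    return result, replaced


# Invented sample document.
document = [("TibetanMachineWeb", 33), ("TibetanMachineWeb7", 91)]
converted, positions = substitute_glyph_91(document)
print(converted, positions)
```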

You might consider using the GUI converter interface in Jskad to convert documents that give impenetrable errors when converted by the command-line tool, as the GUI has better error reporting and can tell you just what's wrong.

Finding Potential Problems Before Conversion

The converters that take TM and TMW input deal with problematic input in a clean way, but you might prefer the mechanism described here.

There is a --find-some-non-tmw mode of operation that gives you, the user, confidence that RTF reading and writing idiosyncrasies are not going to interfere with a flawless conversion. It does so by printing out the first occurrence of a given character in a non-TMW font. Here is some example output:

java -cp "c:\my thdl tools\Jskad.jar" \
     org.thdl.tib.input.TibetanConverter \
        --find-some-non-tmw \
        "Dalai Lama Fifth History 01.rtf"

Non-TMW character newline [decimal 10] in the font Tahoma appears first at location 39
Non-TMW character ' ' [decimal 32] in the font TimesNewRoman appears first at location 45
Non-TMW character '}' [decimal 125] in the font Tahoma appears first at location 66
Non-TMW character '{' [decimal 123] in the font Tahoma appears first at location 219
Non-TMW character '\' [decimal 92] in the font Tahoma appears first at location 1237
Non-TMW character newline [decimal 10] in the font Times New Roman appears first at location 9754

Given the above output, you can be sure that a flawless conversion (barring the appearance of known bugs) will result when you run the following command:

java -cp "c:\my thdl tools\Jskad.jar" org.thdl.tib.input.TibetanConverter --to-wylie "Dalai Lama Fifth History 01.rtf" > "Dalai Lama Fifth History 01 in THDL Extended Wylie.rtf"

(Note that the '>' causes the output to be directed to the file named thereafter; this is quite handy.) This is because the only text in the input file besides Tibetan is whitespace and the Tahoma characters '{', '}', and '\'. These Tahoma characters are understood by the tool; they are symptoms of the "curly-brace problem".
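The same reasoning can be applied mechanically to a longer report. The sketch below assumes that every report line follows the format of the example output above; the function name unexpected_characters is invented, and the check itself (whitespace in any font, or '{', '}', '\' in Tahoma) is simply the rule stated in the preceding paragraph.

```python
import re

# Line format inferred from the example output above (assumption).
LINE = re.compile(
    r"Non-TMW character .*? \[decimal (\d+)\] in the font (.+?) "
    r"appears first at location (\d+)"
)
CURLY_BRACE_CODES = {ord("{"), ord("}"), ord("\\")}


def unexpected_characters(report):
    """Return (code, font, location) for entries that need a closer look."""
    flagged = []
    for code, font, location in LINE.findall(report):
        code = int(code)
        is_whitespace = chr(code).isspace()
        is_curly_brace_symptom = font == "Tahoma" and code in CURLY_BRACE_CODES
        if not (is_whitespace or is_curly_brace_symptom):
            flagged.append((code, font, int(location)))
    return flagged


sample_report = r"""
Non-TMW character newline [decimal 10] in the font Tahoma appears first at location 39
Non-TMW character ' ' [decimal 32] in the font TimesNewRoman appears first at location 45
Non-TMW character '}' [decimal 125] in the font Tahoma appears first at location 66
Non-TMW character '{' [decimal 123] in the font Tahoma appears first at location 219
Non-TMW character '\' [decimal 92] in the font Tahoma appears first at location 1237
Non-TMW character newline [decimal 10] in the font Times New Roman appears first at location 9754
"""
print(unexpected_characters(sample_report))  # prints []
```

An empty list means the report contains nothing beyond whitespace and the curly-brace symptoms; anything else in the list is worth inspecting at the reported location before you convert.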

There is a similar --find-some-non-tm mode of operation, useful for ensuring a trouble-free TM->TMW conversion.
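If you drive these checks from a script, building the command as an argument list avoids quoting trouble with spaces in file names. The sketch below is an illustration only: the function name converter_command is invented, it covers just the modes named on this page, and the jar path and file name are the ones from the example above.

```python
# Only the modes named on this page; the tool may accept others.
MODES_ON_THIS_PAGE = {"--find-some-non-tmw", "--find-some-non-tm", "--to-wylie"}


def converter_command(jar_path, mode, rtf_path):
    """Build the argument list for one converter run."""
    if mode not in MODES_ON_THIS_PAGE:
        raise ValueError(f"mode not described on this page: {mode}")
    return [
        "java", "-cp", jar_path,
        "org.thdl.tib.input.TibetanConverter",
        mode, rtf_path,
    ]


command = converter_command(
    r"c:\my thdl tools\Jskad.jar",
    "--find-some-non-tm",
    "Dalai Lama Fifth History 01.rtf",
)
print(command)
# To run it, pass the list to subprocess.run (after "import subprocess"):
#   subprocess.run(command, capture_output=True, text=True)
```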

Known Bugs

All known bugs are listed in this section. They're more likely to be fixed if users complain, so complain away. And if you ever encounter problems in a conversion that are not listed here, please e-mail us the error report (and the document that resulted from the problem input) so that we can improve our tools. The bugs are as follows:

TMW->ACIP does not produce {KA (KHA)} to indicate differing font sizes.
TMW to Unicode fails subtly when the TMW for {\u0F28\u0F3E} is converted: {\u0F3E\u0F28} appears instead. [855480]
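Until the second bug is fixed, you can at least locate the places where it may have struck. The following sketch assumes you have the converted text as a decoded Unicode string (not raw RTF); the function name find_swapped_pairs and the sample string are invented. A hit is only a candidate for manual review, since the sketch cannot tell whether a given occurrence came from this bug.

```python
SWAPPED = "\u0F3E\u0F28"   # the order the bug produces
INTENDED = "\u0F28\u0F3E"  # the order the TMW source represents


def find_swapped_pairs(unicode_text):
    """Return the index of every occurrence of U+0F3E followed by U+0F28."""
    positions = []
    start = unicode_text.find(SWAPPED)
    while start != -1:
        positions.append(start)
        start = unicode_text.find(SWAPPED, start + 1)
    return positions


# Invented sample: one swapped pair, then one pair in the intended order.
sample = "\u0F40" + SWAPPED + "\u0F0B" + INTENDED
print(find_swapped_pairs(sample))
```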

License

Both the converters and this document are released under the THDL Open Community License Version 1.0.

Please e-mail us your comments about this page.
