Tibetan and Himalayan Library - THL

Tibetan Format Converter Design Document

This document describes the design of a mechanism for converting from any of a number of representations of Tibetan+Roman text to any of a number of representations. This converter will store Tibetan+Roman text internally in an org.thdl.tib.text.TibetanDocument, and it will use an org.thdl.tib.text.TibetanKeyboard to populate a TibetanDocument. These two classes exist presently inside the Jskad application, but will be modified as needed so that servlets, console applications, and AWT/Swing-based applications can all make use of them.

The difficulty lies in fault-tolerance, reliability (DLC: address both verification AND validation), and speed. Of these, speed is the least concern.

Input formats

The converter will support, in a modular fashion, mixed Tibetan and Roman input in the following formats:

  • An HTML file with embedded <tibetan translit="extended-wylie">sgra</tibetan> tags (from the SimpleTibetanAndRomanDocument DTD mentioned below)
  • Unicode (regardless of the order of consonants in a stack)
  • RTF for TibetanMachine
  • RTF for TibetanMachineWeb
  • RTF for Sambhota Old
  • RTF for Sambhota New
  • Edward and Than's XHTML

In addition, the converter will support, in a modular fashion, strictly Tibetan input in the following formats:

  • Extended Wylie, ACIP, and any other format for which there exists a Jskad keyboard (i.e., a .ini file in the desired format). In practice, only ACIP and some Wylie variants are used for storing Tibetan, but the mechanism is general. (This input will be in UTF-8 with no metadata.)

The converter will attempt to accept input that has minor flaws, but it will also have a mode that rejects input with even the slightest flaw.
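
The two modes described above could be exposed as a single flag on the input parser. The following is a minimal sketch, not Jskad code: the class name ToyTransliterationChecker, its Mode enum, and the rule for which characters count as legal are all assumptions made for this illustration. In lenient mode an offending character is skipped and reported as a warning; in strict mode the first offending character aborts the conversion.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one parser, two tolerance modes. Not part of Jskad. */
final class ToyTransliterationChecker {
    enum Mode { LENIENT, STRICT }

    /** Result of a scan: the cleaned text plus any warnings raised. */
    static final class Result {
        final String text;
        final List<String> warnings;

        Result(String text, List<String> warnings) {
            this.text = text;
            this.warnings = warnings;
        }
    }

    /** Assumption for illustration only: a tiny set of legal characters. */
    private static boolean isLegal(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || " .'+-/".indexOf(c) >= 0;
    }

    /** Scans the input, skipping flaws (LENIENT) or rejecting them (STRICT). */
    static Result scan(String input, Mode mode) {
        StringBuilder kept = new StringBuilder();
        List<String> warnings = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isLegal(c)) {
                kept.append(c);
            } else if (mode == Mode.STRICT) {
                throw new IllegalArgumentException(
                    "illegal character at offset " + i);
            } else {
                warnings.add("skipped illegal character at offset " + i);
            }
        }
        return new Result(kept.toString(), warnings);
    }
}
```

Calling ToyTransliterationChecker.scan("sgra#", Mode.LENIENT) keeps "sgra" and records one warning, while the same input in Mode.STRICT throws. A real implementation would presumably report flaws in terms of the input format's own grammar rather than single characters.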

Output formats

The converter will support, in a modular fashion, outputting a TibetanDocument that is entirely Tibetan, entirely Roman, or a mix of Tibetan and Roman, to the following output formats:

  • A proprietary, not-very-well-thought-out XML format of David Chandler's design. For ease of reference, let's say that this will adhere to the LetterByLetterTibetanAndRomanDocument DTD. This format is useful for testing the software, and also because it can easily be transformed into as-yet-unthought-of output formats.
  • Extended Wylie or ACIP, inside a trivial XML (UTF-8) document that identifies the tool that produced the file and links to a versioned DTD on the THDL web server. [Only these two are used, but we could generate output in the TCC keyboard #1 "transliteration" because the mechanism is general.]
  • Unicode (DLC: in which order for consonantal stacks? also, normalized or not?)
  • RTF for TibetanMachine
  • RTF for TibetanMachineWeb
  • RTF for Sambhota Old
  • RTF for Sambhota New
  • Edward and Than's XHTML
  • XML that is much leaner and has <tibetan translit="acip | extended-wylie"> and <roman> tags (just a minimum of them). This will be according to the not-yet-in-existence SimpleTibetanAndRomanDocument DTD.

The converter will support, in a modular fashion, outputting a TibetanDocument that contains only Tibetan and no Roman text to the following additional output formats:

  • Extended Wylie, ACIP, and any other format for which there exists a Jskad keyboard (i.e., a .ini file in the desired format). In practice, only ACIP and some Wylie variants are used for storing Tibetan, but the mechanism is general. (This output will be in UTF-8 with no metadata.)
  • Phonetic Tibetan (ACIP loose standard)
  • Phonetic Tibetan (THDL standard)
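
One way to read "in a modular fashion" is that each output format is a small module registered under a name, so that new formats can be added without touching the converter core. The sketch below illustrates that idea only; OutputFormatRegistry and its OutputFormat interface are hypothetical names, and a plain String stands in for a TibetanDocument.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of pluggable output formats. Names are not from Jskad. */
final class OutputFormatRegistry {
    /** One output module; a String stands in for a TibetanDocument here. */
    interface OutputFormat {
        String write(String documentText);
    }

    private final Map<String, OutputFormat> formats = new LinkedHashMap<>();

    /** Adds (or replaces) the module registered under the given name. */
    void register(String name, OutputFormat format) {
        formats.put(name, format);
    }

    /** Lists the registered format names in registration order. */
    Set<String> names() {
        return formats.keySet();
    }

    /** Runs the named module, failing loudly if no such module exists. */
    String convert(String name, String documentText) {
        OutputFormat format = formats.get(name);
        if (format == null) {
            throw new IllegalArgumentException("unknown output format: " + name);
        }
        return format.write(documentText);
    }

    /** Escapes the characters that are special in XML text and attributes. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

A module for the lean XML format could then be registered as a lambda that wraps escapeXml(text) in a tibetan element with translit="extended-wylie", and the same registry could back both the command-line tools and a "Save As" menu.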

What formats am I missing? Please e-mail them to me.

Advantages and Benefits

After this work item is completed, Jskad will be a powerful viewer of the various input formats described above.

Command-line tools will exist to convert between any supported input format and any supported output format. The most useful conversions will be to and from Unicode. This will allow long-term storage in a format that will exist for years, while still allowing day-to-day work on systems without support for rendering Unicode.

In addition, it will be possible with a little extra work to use Jskad as an HTML source editor in place of Notepad. You will be able to save a document as the ugly, uneditable XHTML source that browsers can display, or preview it in your system's default browser.

Edward envisions a servlet that allows users to paste in, type in, or upload Tibetan in their format of choice. This will be shown on the left side of the web page. Upon identifying that format (perhaps the servlet will make an educated guess, even), they can then select any of our supported output formats and see the result (and download at their leisure) on the right half of the web page.
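
The servlet's "educated guess" could start from a few cheap checks on the submitted text. The sketch below is one possible heuristic under stated assumptions, not a design decision: FormatGuesser and its Guess categories are hypothetical, and the checks only sort input into coarse families. Telling TibetanMachine RTF from Sambhota RTF would require inspecting the fonts named in the file, and ACIP cannot be told apart from Extended Wylie this way, so the user would still confirm the final choice.

```java
/** Hypothetical sketch of the servlet's format guess; heuristics are assumptions. */
final class FormatGuesser {
    enum Guess { RTF, MARKUP, UNICODE_TIBETAN, ROMAN_TRANSLITERATION }

    static Guess guess(String input) {
        String trimmed = input.trim();
        if (trimmed.startsWith("{\\rtf")) {
            return Guess.RTF;       // RTF files begin with the {\rtf header
        }
        if (trimmed.startsWith("<")) {
            return Guess.MARKUP;    // HTML, XHTML, or XML of some kind
        }
        // The Unicode Tibetan block spans U+0F00 through U+0FFF.
        boolean hasTibetan = input.codePoints()
                .anyMatch(cp -> cp >= 0x0F00 && cp <= 0x0FFF);
        return hasTibetan ? Guess.UNICODE_TIBETAN : Guess.ROMAN_TRANSLITERATION;
    }
}
```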

Implementation Plan

To implement this converter, we will do the following:

  1. Have TibetanDocument output a dense XML document that adheres to the LetterByLetterTibetanAndRomanDocument DTD.
  2. Play with XSLT and use it where appropriate to create output.
  3. Get the keyboard input logic out of org.thdl.tib.input.DuffPane. At this point, it will be possible to programmatically simulate a human user at the keyboard. Automated tests that verify that certain Tibetan keyboards work correctly will be written at this point, and these tests will work off the LetterByLetterTibetanAndRomanDocument that TibetanDocument was made to output in step 1.
  4. Create a command-line tool to convert from ACIP or Extended Wylie to the currently supported output formats using Chandler's modified gengetopt-2.4 [dubbed 2.4j] for command-line parameter processing.
  5. Add "Save As [Unicode|Extended-Wylie|ACIP|XHTML|RTF(TMW)|RTF(SambhotaNew)|...]" options to Jskad.
  6. Code up Edward's servlet (described above).
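
Step 3 is what makes automated keyboard tests possible: once the input logic no longer depends on a GUI component, a test can feed keystrokes to a keyboard object and compare the resulting document with the expected one. The sketch below shows the shape such a test target might take. ToyKeyboard, its four-entry key table, and its longest-match rule are hypothetical stand-ins and do not reflect the org.thdl.tib.text.TibetanKeyboard API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a keyboard that can be driven without a GUI. */
final class ToyKeyboard {
    // Assumed key table: a few Wylie-style keys and their Unicode letters.
    private final Map<String, String> keys = new HashMap<>();
    private final StringBuilder typed = new StringBuilder();

    ToyKeyboard() {
        keys.put("k", "\u0F40");   // TIBETAN LETTER KA
        keys.put("kh", "\u0F41");  // TIBETAN LETTER KHA
        keys.put("g", "\u0F42");   // TIBETAN LETTER GA
        keys.put("ng", "\u0F44");  // TIBETAN LETTER NGA
    }

    /** Simulates one key press. */
    void press(char key) {
        typed.append(key);
    }

    /** Simulates a user typing a whole string, one key at a time. */
    void type(String keystrokes) {
        for (int i = 0; i < keystrokes.length(); i++) {
            press(keystrokes.charAt(i));
        }
    }

    /** Converts everything typed so far, preferring two-key sequences. */
    String document() {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < typed.length()) {
            String two = i + 2 <= typed.length() ? typed.substring(i, i + 2) : null;
            String one = typed.substring(i, i + 1);
            if (two != null && keys.containsKey(two)) {
                out.append(keys.get(two));
                i += 2;
            } else if (keys.containsKey(one)) {
                out.append(keys.get(one));
                i += 1;
            } else {
                out.append(one);   // unmapped keys pass through unchanged
                i += 1;
            }
        }
        return out.toString();
    }
}
```

A test would create a ToyKeyboard, call type("khng"), and check that document() returns the letters KHA and NGA. The planned tests would make the same kind of comparison against the letter-by-letter XML output rather than a raw string.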

DLC: address fault-tolerance etc.

Things to think more about...

  • Unicode normalization
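
To make the normalization question concrete, the sketch below uses the standard java.text.Normalizer class to show two effects that normalization has on Tibetan text. The class name TibetanNormalizationDemo and its helper methods are made up for this illustration. First, the precomposed letter U+0F43 (GHA) decomposes to U+0F42 U+0FB7 and, because it is excluded from recomposition, stays decomposed even under NFC. Second, combining vowel signs are reordered by combining class, so U+0F74 followed by U+0F72 comes out as U+0F72 followed by U+0F74.

```java
import java.text.Normalizer;

/** Illustrative helpers for inspecting normalized Tibetan text. */
final class TibetanNormalizationDemo {
    /** Normalizes to NFC (canonical decomposition, then composition). */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Normalizes to NFD (canonical decomposition only). */
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Renders a string as space-separated code points, e.g. "U+0F42 U+0FB7". */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }
}
```

For example, hex(nfc("\u0F43")) yields "U+0F42 U+0FB7". A converter that compares or round-trips Unicode input would therefore need to decide whether to normalize before comparing, and whether its Unicode output should be emitted in a normalized form.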

Please e-mail us your comments about this page.

The THDL Tools project is generously hosted by SourceForge.

