Added a design document concerning the Tibetan Format Converter,
a.k.a. the "Rosetta Stone".
parent
58287c09a5
commit
5cfbcdfd30
1 changed files with 293 additions and 0 deletions
293
htdocs/TibetanFormatConverterDesign.html
Normal file
|
@ -0,0 +1,293 @@
|
|||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
||||
<html>
|
||||
|
||||
<!-- @author David Chandler -->
|
||||
<!-- @date November 14, 2002 -->
|
||||
<!-- @editor Emacs, baby! -->
|
||||
|
||||
<head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
||||
<title>Tibetan Format Converter Design Document</title>
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<h1>Tibetan Format Converter Design Document</h1>
|
||||
|
||||
<p>
|
||||
This document describes the design of a mechanism for converting
from any of a number of input representations of Tibetan+Roman text
to any of a number of output representations. This converter will store
|
||||
Tibetan+Roman text internally in a
|
||||
org.thdl.tib.text.TibetanDocument, and it will use a
|
||||
org.thdl.tib.text.TibetanKeyboard to populate a TibetanDocument.
|
||||
These two classes exist presently inside the Jskad application, but
|
||||
will be modified as needed so that servlets, console applications,
|
||||
and AWT/Swing-based applications can all make use of them.
|
||||
</p>
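<p>
One way to picture this arrangement is the sketch below. It is only an
illustration under stated assumptions, not the planned API: ToyDocument
stands in for org.thdl.tib.text.TibetanDocument, and InputFormat,
OutputFormat, ToyConverter, HexInput, and UPlusOutput are hypothetical
names invented here. The point is that each format needs one reader and
one writer around the shared in-memory document, rather than one
converter per pair of formats.
</p>
<pre>
```java
/** Hypothetical stand-in for org.thdl.tib.text.TibetanDocument; holds plain Unicode text only. */
final class ToyDocument {
    private final StringBuilder text = new StringBuilder();
    void append(String s) { text.append(s); }
    String text() { return text.toString(); }
}

/** Hypothetical: one implementation per supported input format. */
interface InputFormat { ToyDocument read(String source); }

/** Hypothetical: one implementation per supported output format. */
interface OutputFormat { String write(ToyDocument doc); }

final class ToyConverter {
    /** Every conversion goes through the shared document. */
    static String convert(InputFormat in, OutputFormat out, String source) {
        return out.write(in.read(source));
    }
}

/** Toy input format: whitespace-separated hexadecimal code units, e.g. "0F40 0F0B". */
final class HexInput implements InputFormat {
    public ToyDocument read(String source) {
        ToyDocument doc = new ToyDocument();
        for (String token : source.trim().split("\\s+")) {
            doc.append(String.valueOf((char) Integer.parseInt(token, 16)));
        }
        return doc;
    }
}

/** Toy output format: "U+XXXX" notation, one item per code unit. */
final class UPlusOutput implements OutputFormat {
    public String write(ToyDocument doc) {
        StringBuilder sb = new StringBuilder();
        for (char c : doc.text().toCharArray()) {
            if (sb.length() != 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }
}
```
</pre>
<p>
For example, ToyConverter.convert(new HexInput(), new UPlusOutput(),
"0F40 0F0B") passes the two code units through the document and writes
them back out in the other notation.
</p>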
|
||||
|
||||
<p>
|
||||
The difficulty lies in fault tolerance, reliability (DLC: address both
verification AND validation), and speed. Of the three, speed will be
of least concern.
|
||||
</p>
|
||||
|
||||
<h3>Input formats</h3>
|
||||
|
||||
<p>
|
||||
The converter will support, in a modular fashion, <b>mixed Tibetan
|
||||
and Roman</b> input in the following formats:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
An HTML file with embedded &lt;tibetan
translit="extended-wylie"&gt;sgra&lt;/tibetan&gt; tags (from the
SimpleTibetanAndRomanDocument DTD mentioned below)
|
||||
</li>
|
||||
<li>
|
||||
Unicode (regardless of the order of consonants in a stack)
|
||||
</li>
|
||||
<li>
|
||||
RTF for TibetanMachine
|
||||
</li>
|
||||
<li>
|
||||
RTF for TibetanMachineWeb
|
||||
</li>
|
||||
<li>
|
||||
RTF for Sambhota Old
|
||||
</li>
|
||||
<li>
|
||||
RTF for Sambhota New
|
||||
</li>
|
||||
<li>
|
||||
Edward and Than's XHTML
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<p>
|
||||
In addition, the converter will support, in a modular fashion,
|
||||
<b>strictly Tibetan</b> input in the following formats:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
Extended Wylie, ACIP, and any other format for which there
exists a Jskad keyboard (i.e., a .ini file in the desired
format). In practice, only ACIP and some Wylie variants are
used for storing Tibetan, but the mechanism is general. (This
input will be UTF-8 text with no metadata.)
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
|
||||
<p>
|
||||
The converter will attempt to accept input that has minor flaws, but
|
||||
it will also have a mode that rejects input with even the slightest
|
||||
flaw.
|
||||
</p>
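<p>
The two modes could share one code path, as in the sketch below. This
is an assumption-laden toy, not the planned design: ToyParser, its
ALLOWED character set, and the choice to skip a bad character in
lenient mode are all invented here for illustration. What counts as a
"minor flaw" in real input still has to be decided per format.
</p>
<pre>
```java
/** Hypothetical parser showing a lenient mode and a strict mode over the same loop. */
final class ToyParser {
    /** Characters this toy accepts: lower-case ASCII letters, space, apostrophe, period, plus. */
    private static final String ALLOWED = "abcdefghijklmnopqrstuvwxyz '.+";

    /**
     * Returns the accepted characters of the input.
     * In strict mode the first unexpected character aborts the parse;
     * in lenient mode it is skipped and noted in the warnings buffer.
     */
    static String parse(String input, boolean strict, StringBuilder warnings) {
        StringBuilder accepted = new StringBuilder();
        int offset = 0;
        for (char c : input.toCharArray()) {
            if (ALLOWED.indexOf(c) != -1) {
                accepted.append(c);
            } else if (strict) {
                throw new IllegalArgumentException(
                        "unexpected character at offset " + offset);
            } else {
                warnings.append("skipped character at offset ")
                        .append(offset)
                        .append('\n');
            }
            offset++;
        }
        return accepted.toString();
    }
}
```
</pre>
<p>
Keeping the warnings in lenient mode means that a caller can still
report every flaw that strict mode would have rejected.
</p>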
|
||||
|
||||
|
||||
<h3>Output formats</h3>
|
||||
|
||||
<p>
|
||||
The converter will support, in a modular fashion, outputting a
|
||||
TibetanDocument that is <b>entirely Tibetan, entirely Roman, or a
|
||||
mix of Tibetan and Roman</b>, to the following output formats:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
A proprietary, not-very-well-thought-out XML file of David
Chandler's design. For ease of imputation, let's say that this
will adhere to the LetterByLetterTibetanAndRomanDocument DTD.
This format is useful for testing the software. It is also useful
because it can easily be transformed into as-yet-unthought-of
output formats.
|
||||
</li>
|
||||
<li>
|
||||
Extended Wylie or ACIP (inside a trivial XML document, encoded in
UTF-8, that describes the tool that output the file and links to a
versioned DTD on the THDL web server). Only these two are used,
but we could also generate output in the TCC keyboard #1
"transliteration" because the mechanism is general.
|
||||
</li>
|
||||
<li>
|
||||
Unicode (DLC: in which order for consonantal stacks? also,
|
||||
normalized or not?)
|
||||
</li>
|
||||
<li>
|
||||
RTF for TibetanMachine
|
||||
</li>
|
||||
<li>
|
||||
RTF for TibetanMachineWeb
|
||||
</li>
|
||||
<li>
|
||||
RTF for Sambhota Old
|
||||
</li>
|
||||
<li>
|
||||
RTF for Sambhota New
|
||||
</li>
|
||||
<li>
|
||||
Edward and Than's XHTML
|
||||
</li>
|
||||
<li>
|
||||
XML that is much leaner and has &lt;tibetan translit="acip |
extended-wylie"&gt; and &lt;roman&gt; tags (just a minimum of
them). This will be according to the not-yet-in-existence
SimpleTibetanAndRomanDocument DTD.
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<p>
|
||||
The converter will support, in a modular fashion, outputting a
|
||||
TibetanDocument that contains <b>only Tibetan and no Roman text</b>
|
||||
to the following additional output formats:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
Extended Wylie, ACIP, and any other format for which there
exists a Jskad keyboard (i.e., a .ini file in the desired
format). In practice, only ACIP and some Wylie variants are
used for storing Tibetan, but the mechanism is general. (This
output will be UTF-8 text with no metadata.)
|
||||
</li>
|
||||
<li>
|
||||
Phonetic Tibetan (ACIP loose standard)
|
||||
</li>
|
||||
<li>
|
||||
Phonetic Tibetan (THDL standard)
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<p>
|
||||
What formats am I missing? Please <a
href="mailto:dchandler@users.sourceforge.net">e-mail me</a> about them.
|
||||
</p>
|
||||
|
||||
<h3>Advantages and Benefits</h3>
|
||||
|
||||
<p>
|
||||
After this work item is completed, Jskad will be a powerful viewer
|
||||
of the various input formats described above.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Command-line tools will exist to convert between the supported formats.
The most useful conversions will be to and from Unicode. This will
allow long-term storage in a format that will exist for years, while
still allowing day-to-day work on systems without support for
rendering Unicode.
|
||||
</p>
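<p>
As a rough idea of what such a tool's front end might look like, here
is a minimal sketch. The class ToyOptions and the --from and --to
options are hypothetical; the real option names and parsing have not
been designed.
</p>
<pre>
```java
/** Hypothetical holder for the two options a conversion tool would need. */
final class ToyOptions {
    final String from;
    final String to;

    private ToyOptions(String from, String to) {
        this.from = from;
        this.to = to;
    }

    /** Parses arguments of the form: --from FORMAT --to FORMAT (in either order). */
    static ToyOptions parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("every option needs a value");
        }
        String from = null;
        String to = null;
        // Options come in name/value pairs, so step two arguments at a time.
        for (int i = 0; i != args.length; i += 2) {
            if (args[i].equals("--from")) {
                from = args[i + 1];
            } else if (args[i].equals("--to")) {
                to = args[i + 1];
            } else {
                throw new IllegalArgumentException("unknown option: " + args[i]);
            }
        }
        if (from == null || to == null) {
            throw new IllegalArgumentException("usage: --from FORMAT --to FORMAT");
        }
        return new ToyOptions(from, to);
    }
}
```
</pre>
<p>
A round trip through Unicode would then be two invocations, one with
Unicode as the output format and one with Unicode as the input format.
</p>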
|
||||
|
||||
<p>
|
||||
In addition, it will be possible with a little extra work to use
Jskad as an HTML source editor in place of Notepad. You will be able
to save the ugly, uneditable XHTML source that browsers can display, or
preview it in your system's default browser.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Edward envisions a servlet that allows users to paste in, type in,
or upload Tibetan in their format of choice. This input will be shown
on the left half of the web page. Once the format is identified
(perhaps the servlet will even make an educated guess), users can
select any of our supported output formats and see the result
(and download it at their leisure) on the right half of the web page.
|
||||
</p>
|
||||
|
||||
<h3>Implementation Plan</h3>
|
||||
|
||||
<p>
|
||||
To implement this converter, we will do the following:
|
||||
</p>
|
||||
<ol>
|
||||
<li>
|
||||
Have TibetanDocument output a dense XML document that adheres to
|
||||
the LetterByLetterTibetanAndRomanDocument DTD.
|
||||
</li>
|
||||
<li>
|
||||
Play with XSLT and use it where appropriate to create output.
|
||||
</li>
|
||||
<li>
|
||||
Get the keyboard input logic out of org.thdl.tib.input.DuffPane.
|
||||
At this point, it will be possible to programmatically simulate
|
||||
a human user at the keyboard. Automated tests that certain
|
||||
Tibetan keyboards are working correctly will be performed at
|
||||
this point, and these tests will work off the
|
||||
LetterByLetterTibetanAndRomanDocument that TibetanDocument was
|
||||
made to output above.
|
||||
</li>
|
||||
<li>
|
||||
Create a command-line tool to convert from ACIP or Extended
|
||||
Wylie to the currently supported output formats using Chandler's
|
||||
modified gengetopt-2.4 [dubbed 2.4j] for command-line parameter
|
||||
processing.
|
||||
</li>
|
||||
<li>
|
||||
Add "Save As
|
||||
[Unicode|Extended-Wylie|ACIP|XHTML|RTF(TMW)|RTF(SambhotaNew)|...]"
|
||||
options to Jskad.
|
||||
</li>
|
||||
<li>
|
||||
Code up Edward's servlet (described above).
|
||||
</li>
|
||||
</ol>
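<p>
Step 3 is what makes automated keyboard tests possible. The sketch
below shows the idea on a toy scale. ToyKeyboard is a hypothetical
class, not the logic that lives in org.thdl.tib.input.DuffPane; it
knows only four Wylie consonants, treats "a" as the key that commits a
consonant, and maps the space key to the tsheg. A test feeds it
keystrokes one at a time and compares the resulting text.
</p>
<pre>
```java
/** Hypothetical keyboard model that a test can drive without any GUI. */
final class ToyKeyboard {
    private final StringBuilder pending = new StringBuilder();
    private final StringBuilder output = new StringBuilder();

    /** Handles one keystroke, as if a user had pressed that key. */
    void press(char key) {
        if (key == 'a') {
            // The inherent vowel commits the consonant typed so far.
            output.append(lookup(pending.toString()));
            pending.setLength(0);
        } else if (key == ' ') {
            output.append('\u0F0B'); // tsheg, the syllable delimiter
        } else {
            pending.append(key);
        }
    }

    /** Toy table covering only k, kh, g, and ng. */
    private static char lookup(String consonant) {
        switch (consonant) {
            case "k":  return '\u0F40';
            case "kh": return '\u0F41';
            case "g":  return '\u0F42';
            case "ng": return '\u0F44';
            default:
                throw new IllegalArgumentException("unknown consonant: " + consonant);
        }
    }

    String text() { return output.toString(); }

    /** Replays a whole string of keystrokes through a fresh keyboard. */
    static String typeAll(String keys) {
        ToyKeyboard keyboard = new ToyKeyboard();
        for (char key : keys.toCharArray()) {
            keyboard.press(key);
        }
        return keyboard.text();
    }
}
```
</pre>
<p>
The real tests would compare against the
LetterByLetterTibetanAndRomanDocument output from step 1 rather than
against a raw string.
</p>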
|
||||
|
||||
<p>
|
||||
DLC: address fault-tolerance etc.
|
||||
</p>
|
||||
|
||||
<h3>Things to think more about...</h3>
|
||||
|
||||
<p>
|
||||
Open questions so far:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
Unicode normalization
|
||||
</li>
|
||||
</ul>
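<p>
To see why normalization needs thought, consider what the standard
java.text.Normalizer class does to Tibetan text. The sketch below uses
only that class; NormalizationDemo and its method names are made up for
this illustration. It shows two effects: a precomposed letter such as
U+0F43 comes back as two code points under both NFC and NFD, and vowel
signs are reordered by combining class.
</p>
<pre>
```java
import java.text.Normalizer;

/** Hypothetical helper for inspecting what normalization does to Tibetan text. */
final class NormalizationDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Renders a string in "U+XXXX" notation, one item per code unit. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() != 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0F43 decomposes canonically to U+0F42 U+0FB7 and is excluded
        // from recomposition, so NFC and NFD give the same two code points.
        System.out.println(describe(nfc("\u0F43")));
        System.out.println(describe(nfd("\u0F43")));
        // U+0F71 has a lower combining class than U+0F72, so normalization
        // moves it in front of U+0F72.
        System.out.println(describe(nfc("\u0F40\u0F72\u0F71")));
    }
}
```
</pre>
<p>
Whichever choice is made for output, comparisons in the test suite
would need to normalize both sides the same way.
</p>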
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- THDLTools FOOTER: -->
|
||||
<hr>
|
||||
|
||||
<i>
|
||||
Please
|
||||
|
||||
<a href="mailto:thdltools-devel@lists.sourceforge.net">
|
||||
e-mail us</a>
|
||||
|
||||
your comments about this page.
|
||||
</i>
|
||||
|
||||
<hr>
|
||||
|
||||
The
|
||||
|
||||
<a href="index.html">
|
||||
THDL Tools</a>
|
||||
|
||||
project is generously hosted by:
|
||||
|
||||
<!--
|
||||
|
||||
DO NOT DELETE THE SF.NET LOGO.
|
||||
|
||||
We have a choice of colors and sizes for this logo (see
|
||||
"https://sourceforge.net/docman/display_doc.php?docid=790&group_id=1"),
|
||||
but we do not have the option of removing it. SourceForge requests
|
||||
that we put it on each web page for our project, and to give us
|
||||
incentive to do so, they will not track the number of hits for our
|
||||
project web pages unless we put this link in. To track hits, see
|
||||
"http://sourceforge.net/project/stats/index.php?report=months&group_id=61934".
|
||||
|
||||
-->
|
||||
<a href="http://sourceforge.net/">
|
||||
<img src="http://sourceforge.net/sflogo.php?group_id=61934&amp;type=1"
|
||||
width="88" height="31" border="0" alt="SourceForge Logo">
|
||||
</a>
|
||||
<!-- AGAIN, DO NOT DELETE THE SF.NET LOGO. -->
|
||||
|
||||
|
||||
</body>
|
||||
</html>
|