Added a design document concerning the Tibetan Format Converter,
a.k.a. the "Rosetta Stone".
parent 58287c09a5
commit 5cfbcdfd30
1 changed file with 293 additions and 0 deletions
				
			
		
							
								
								
									
htdocs/TibetanFormatConverterDesign.html: Normal file, 293 additions
							
						
						
									
@@ -0,0 +1,293 @@
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> | ||||
| <html> | ||||
| 
 | ||||
| <!-- @author David Chandler --> | ||||
| <!-- @date November 14, 2002 --> | ||||
| <!-- @editor Emacs, baby! --> | ||||
| 
 | ||||
| <head> | ||||
|   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> | ||||
|   <title>Tibetan Format Converter Design Document</title> | ||||
| </head> | ||||
| 
 | ||||
| <body> | ||||
| <h1>Tibetan Format Converter Design Document</h1> | ||||
| 
 | ||||
| <p> | ||||
|   This document describes the design of a mechanism for converting | ||||
|   from any of a number of representations of Tibetan+Roman text to any | ||||
|   of a number of representations.  This converter will store | ||||
|   Tibetan+Roman text internally in a | ||||
|   org.thdl.tib.text.TibetanDocument, and it will use a | ||||
|   org.thdl.tib.text.TibetanKeyboard to populate a TibetanDocument. | ||||
|   These two classes exist presently inside the Jskad application, but | ||||
|   will be modified as needed so that servlets, console applications, | ||||
|   and AWT/Swing-based applications can all make use of them. | ||||
| </p> | ||||
| 
 | ||||
| <p> | ||||
|   The difficult parts are fault tolerance, reliability (DLC: address | ||||
|   both verification AND validation), and speed.  Of these, speed is | ||||
|   the least concern. | ||||
| </p> | ||||
| 
 | ||||
| <h3>Input formats</h3> | ||||
| 
 | ||||
| <p> | ||||
|   The converter will support, in a modular fashion, <b>mixed Tibetan | ||||
|   and Roman</b> input in the following formats: | ||||
| </p> | ||||
|   <ul> | ||||
|     <li> | ||||
|       An HTML file with embedded &lt;tibetan | ||||
|       translit="extended-wylie"&gt;sgra&lt;/tibetan&gt; tags (from the | ||||
|       SimpleTibetanAndRomanDocument DTD mentioned below) | ||||
|     </li> | ||||
|     <li> | ||||
|       Unicode (regardless of the order of consonants in a stack) | ||||
|     </li> | ||||
|     <li> | ||||
|       RTF for TibetanMachine | ||||
|     </li> | ||||
|     <li> | ||||
|       RTF for TibetanMachineWeb | ||||
|     </li> | ||||
|     <li> | ||||
|       RTF for Sambhota Old | ||||
|     </li> | ||||
|     <li> | ||||
|       RTF for Sambhota New | ||||
|     </li> | ||||
|     <li> | ||||
|       Edward and Than's XHTML | ||||
|     </li> | ||||
|   </ul> | ||||
| 
 | ||||
| <p> | ||||
|   In addition, the converter will support, in a modular fashion, | ||||
|   <b>strictly Tibetan</b> input in the following formats: | ||||
| </p> | ||||
|   <ul> | ||||
|     <li> | ||||
|       Extended Wylie, ACIP, and any other format for which there | ||||
|       exists a Jskad keyboard (i.e., a .ini file in the desired | ||||
|       format).  In practice, only ACIP and some Wylie variants are | ||||
|       used for storing Tibetan, but the mechanism is general.  (This | ||||
|       input will be in UTF-8 with no metadata.) | ||||
|     </li> | ||||
|   </ul> | ||||
| 
 | ||||
| 
 | ||||
| <p> | ||||
|   The converter will attempt to accept input that has minor flaws, but | ||||
|   it will also have a mode that rejects input with even the slightest | ||||
|   flaw. | ||||
| </p> | ||||
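The two modes described above can be sketched as one reader with a strictness flag. This is a minimal illustration, not Jskad code: the class name `LenientOrStrictReader`, the exception type, and the flaw test (a stray control character) are all hypothetical stand-ins for whatever each real input format treats as a flaw.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of strict versus lenient input handling. */
final class LenientOrStrictReader {

    /** Thrown in strict mode as soon as the first flaw is seen. */
    static final class FlawedInputException extends RuntimeException {
        FlawedInputException(String message) {
            super(message);
        }
    }

    /** What a lenient read produces: repaired text plus one warning per flaw. */
    static final class Result {
        final String text;
        final List<String> warnings;

        Result(String text, List<String> warnings) {
            this.text = text;
            this.warnings = warnings;
        }
    }

    /** Stand-in flaw test (an assumption): a control character other than tab, LF, or CR. */
    static boolean isFlaw(char c) {
        return Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r';
    }

    /** Strict mode rejects the input at the first flaw; lenient mode drops it and records a warning. */
    static Result read(String input, boolean strict) {
        StringBuilder out = new StringBuilder();
        List<String> warnings = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isFlaw(c)) {
                String message = "unexpected control character at offset " + i;
                if (strict) {
                    throw new FlawedInputException(message);
                }
                warnings.add(message);
                continue;
            }
            out.append(c);
        }
        return new Result(out.toString(), warnings);
    }

    public static void main(String[] args) {
        Result r = read("sgra\u0000 bsgrubs", false);
        System.out.println(r.text + " (" + r.warnings.size() + " warning)");
    }
}
```

Keeping the warnings alongside the repaired text lets a caller show the user exactly what the lenient mode changed, which the strict mode would have refused.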
| 
 | ||||
| 
 | ||||
| <h3>Output formats</h3> | ||||
| 
 | ||||
| <p> | ||||
|   The converter will support, in a modular fashion, outputting a | ||||
|   TibetanDocument that is <b>entirely Tibetan, entirely Roman, or a | ||||
|   mix of Tibetan and Roman</b>, to the following output formats: | ||||
| </p> | ||||
|   <ul> | ||||
|     <li> | ||||
|       A proprietary, not-very-well-thought-out XML file of David | ||||
|       Chandler's design.  For ease of reference, let's say that this | ||||
|       will adhere to the LetterByLetterTibetanAndRomanDocument DTD. | ||||
|       This format is useful for testing the software, and also because | ||||
|       it can easily be transformed into as-yet-unthought-of output | ||||
|       formats. | ||||
|     </li> | ||||
|     <li> | ||||
|       Extended Wylie or ACIP (inside a trivial XML[UTF8] document that | ||||
|       describes the tool that output this file and links to a | ||||
|       versioned DTD on the THDL web server) [only these two are used, | ||||
|       but we could generate output in the TCC keyboard #1 | ||||
|       "transliteration" because the mechanism is general] | ||||
|     </li> | ||||
|     <li> | ||||
|       Unicode (DLC: in which order for consonantal stacks? also, | ||||
|       normalized or not?) | ||||
|     </li> | ||||
|     <li> | ||||
|       RTF for TibetanMachine | ||||
|     </li> | ||||
|     <li> | ||||
|       RTF for TibetanMachineWeb | ||||
|     </li> | ||||
|     <li> | ||||
|       RTF for Sambhota Old | ||||
|     </li> | ||||
|     <li> | ||||
|       RTF for Sambhota New | ||||
|     </li> | ||||
|     <li> | ||||
|       Edward and Than's XHTML | ||||
|     </li> | ||||
|     <li> | ||||
|       XML that is much leaner and has &lt;tibetan translit="acip | | ||||
|       extended-wylie"&gt; and &lt;roman&gt; tags (just a minimum of | ||||
|       them).  This output will conform to the not-yet-in-existence | ||||
|       SimpleTibetanAndRomanDocument DTD. | ||||
|     </li> | ||||
|   </ul> | ||||
| 
 | ||||
| <p> | ||||
|   The converter will support, in a modular fashion, outputting a | ||||
|   TibetanDocument that contains <b>only Tibetan and no Roman text</b> | ||||
|   to the following additional output formats: | ||||
| </p> | ||||
|   <ul> | ||||
|     <li> | ||||
|       Extended Wylie, ACIP, and any other format for which there | ||||
|       exists a Jskad keyboard (i.e., a .ini file in the desired | ||||
|       format).  In practice, only ACIP and some Wylie variants are | ||||
|       used for storing Tibetan, but the mechanism is general.  (This | ||||
|       output will be in UTF-8 with no metadata.) | ||||
|     </li> | ||||
|     <li> | ||||
|       Phonetic Tibetan (ACIP loose standard) | ||||
|     </li> | ||||
|     <li> | ||||
|       Phonetic Tibetan (THDL standard) | ||||
|     </li> | ||||
|   </ul> | ||||
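One way to read "in a modular fashion" is that every input format gets a reader into the shared internal document and every output format gets a writer out of it, so N inputs and M outputs need N + M modules rather than N x M converters. The sketch below assumes that reading. `ConverterRegistry` and its methods are hypothetical names, and a plain `String` stands in for org.thdl.tib.text.TibetanDocument to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a plug-in registry; String stands in for TibetanDocument. */
final class ConverterRegistry {
    private final Map<String, Function<String, String>> readers = new LinkedHashMap<>();
    private final Map<String, Function<String, String>> writers = new LinkedHashMap<>();

    /** Registers a module that turns raw input in the named format into the internal document. */
    void addReader(String format, Function<String, String> reader) {
        readers.put(format, reader);
    }

    /** Registers a module that turns the internal document into the named output format. */
    void addWriter(String format, Function<String, String> writer) {
        writers.put(format, writer);
    }

    /** Converts by way of the internal document: read first, then write. */
    String convert(String from, String to, String input) {
        Function<String, String> reader = readers.get(from);
        if (reader == null) {
            throw new IllegalArgumentException("no reader for format: " + from);
        }
        Function<String, String> writer = writers.get(to);
        if (writer == null) {
            throw new IllegalArgumentException("no writer for format: " + to);
        }
        return writer.apply(reader.apply(input));
    }

    public static void main(String[] args) {
        ConverterRegistry registry = new ConverterRegistry();
        // Toy modules only; real readers and writers would parse and emit actual formats.
        registry.addReader("trimmed", s -> s.trim());
        registry.addWriter("bracketed", doc -> "[" + doc + "]");
        System.out.println(registry.convert("trimmed", "bracketed", "  sgra "));
    }
}
```

Adding a new format then means registering one reader, one writer, or both, without touching the existing modules.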
| 
 | ||||
| <p> | ||||
| What formats am I missing?  Please <a | ||||
| href="mailto:dchandler@users.sourceforge.net">e-mail me</a> about them. | ||||
| </p> | ||||
| 
 | ||||
| <h3>Advantages and Benefits</h3> | ||||
| 
 | ||||
| <p> | ||||
|   After this work item is completed, Jskad will be a powerful viewer | ||||
|   of the various input formats described above. | ||||
| </p> | ||||
| 
 | ||||
| <p> | ||||
|   Command-line tools will exist to convert between the supported | ||||
|   formats.  The most useful conversions will be to and from Unicode. | ||||
|   This will allow long-term storage in a format that will exist for | ||||
|   years, while still allowing day-to-day work on systems without | ||||
|   support for rendering Unicode. | ||||
| </p> | ||||
| 
 | ||||
| <p> | ||||
|   In addition, it will be possible with a little extra work to use | ||||
|   Jskad as an HTML source editor rather than Notepad.  You will be | ||||
|   able to save the ugly, uneditable XHTML source that browsers can | ||||
|   display, or preview it in your system's default browser. | ||||
| </p> | ||||
| 
 | ||||
| <p> | ||||
|   Edward envisions a servlet that allows users to paste in, type in, | ||||
|   or upload Tibetan in their format of choice.  This will be shown on | ||||
|   the left side of the web page.  Upon identifying that format | ||||
|   (perhaps the servlet will make an educated guess, even), they can | ||||
|   then select any of our supported output formats and see the result | ||||
|   (and download at their leisure) on the right half of the web page. | ||||
| </p> | ||||
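The "educated guess" mentioned above could start from two cheap signals: RTF files begin with the control word `{\rtf`, and Unicode Tibetan lives in the block U+0F00 through U+0FFF. The sketch below uses only those two facts; the class and enum names are hypothetical, and it makes no attempt to tell Extended Wylie from ACIP or one RTF font encoding from another, which would need format-specific knowledge.

```java
/** Hypothetical sketch of a first-pass format guess for pasted or uploaded text. */
final class FormatGuesser {

    enum Guess { RTF, UNICODE_TIBETAN, ROMAN_TRANSLITERATION }

    /** True if the character lies in the Unicode Tibetan block, U+0F00 through U+0FFF. */
    static boolean isTibetanBlock(char c) {
        return c >= 0x0F00 && c <= 0x0FFF;
    }

    static Guess guess(String input) {
        // RTF documents open with the "{\rtf" control word.
        if (input.trim().startsWith("{\\rtf")) {
            return Guess.RTF;
        }
        // Any character from the Tibetan block means the text is Unicode Tibetan,
        // possibly mixed with Roman text.
        for (int i = 0; i < input.length(); i++) {
            if (isTibetanBlock(input.charAt(i))) {
                return Guess.UNICODE_TIBETAN;
            }
        }
        // Fallback (an assumption): treat everything else as some Roman transliteration.
        return Guess.ROMAN_TRANSLITERATION;
    }

    public static void main(String[] args) {
        System.out.println(guess("{\\rtf1\\ansi sample}"));
        System.out.println(guess("\u0F66\u0F92\u0FB2"));
        System.out.println(guess("sgra"));
    }
}
```

Because the result is only a guess, a servlet using it would still want to let the user override the detected format before converting.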
| 
 | ||||
| <h3>Implementation Plan</h3> | ||||
| 
 | ||||
| <p> | ||||
|   To implement this converter, we will do the following: | ||||
| </p> | ||||
|   <ol> | ||||
|     <li> | ||||
|       Have TibetanDocument output a dense XML document that adheres to | ||||
|       the LetterByLetterTibetanAndRomanDocument DTD. | ||||
|     </li> | ||||
|     <li> | ||||
|       Play with XSLT and use it where appropriate to create output. | ||||
|     </li> | ||||
|     <li> | ||||
|       Get the keyboard input logic out of org.thdl.tib.input.DuffPane. | ||||
|       At this point, it will be possible to programmatically simulate | ||||
|       a human user at the keyboard.  Automated tests that certain | ||||
|       Tibetan keyboards are working correctly will be performed at | ||||
|       this point, and these tests will work off the | ||||
|       LetterByLetterTibetanAndRomanDocument that TibetanDocument was | ||||
|       made to output above. | ||||
|     </li> | ||||
|     <li> | ||||
|       Create a command-line tool to convert from ACIP or Extended | ||||
|       Wylie to the currently supported output formats using Chandler's | ||||
|       modified gengetopt-2.4 [dubbed 2.4j] for command-line parameter | ||||
|       processing. | ||||
|     </li> | ||||
|     <li> | ||||
|       Add "Save As | ||||
|       [Unicode|Extended-Wylie|ACIP|XHTML|RTF(TMW)|RTF(SambhotaNew)|...]"  | ||||
|       options to Jskad. | ||||
|     </li> | ||||
|     <li> | ||||
|       Code up Edward's servlet (described above). | ||||
|     </li> | ||||
|   </ol> | ||||
| 
 | ||||
| <p> | ||||
| DLC: address fault-tolerance etc. | ||||
| </p> | ||||
| 
 | ||||
| <h3>Things to think more about...</h3> | ||||
| 
 | ||||
| <p> | ||||
|   Things to think more about: | ||||
| </p> | ||||
|   <ul> | ||||
|     <li> | ||||
|       Unicode normalization | ||||
|     </li> | ||||
|   </ul> | ||||
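To make the normalization question concrete: Unicode gives some Tibetan letters both a precomposed code point and an equivalent base-plus-subjoined sequence, for example U+0F43 and the pair U+0F42 U+0FB7. A converter that compares or tests its Unicode output has to decide which form it expects. The sketch below uses the standard `java.text.Normalizer` class to compare two strings after bringing both to NFD; choosing NFD for the comparison is an assumption made for illustration, not a decision this document has taken, and `NormalizedCompare` is a hypothetical name.

```java
import java.text.Normalizer;

/** Hypothetical sketch: compare Tibetan Unicode strings independent of composed/decomposed form. */
final class NormalizedCompare {

    /** Returns the canonical decomposition (NFD) of the text. */
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    /** True if the two strings are canonically equivalent. */
    static boolean sameText(String a, String b) {
        return decompose(a).equals(decompose(b));
    }

    public static void main(String[] args) {
        String precomposed = "\u0F43";
        String sequence = "\u0F42\u0FB7";
        System.out.println("raw equals:        " + precomposed.equals(sequence));
        System.out.println("normalized equals: " + sameText(precomposed, sequence));
    }
}
```

Automated round-trip tests (for instance, Unicode to Extended Wylie and back) would likely need a comparison of this kind, since a byte-for-byte check fails whenever the input and output differ only in normalization form.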
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| <!-- THDLTools FOOTER: --> | ||||
| <hr> | ||||
| 
 | ||||
|   <i> | ||||
|     Please | ||||
| 
 | ||||
|     <a href="mailto:thdltools-devel@lists.sourceforge.net"> | ||||
|       e-mail us</a>  | ||||
| 
 | ||||
|     your comments about this page. | ||||
|   </i> | ||||
| 
 | ||||
| <hr> | ||||
| 
 | ||||
| The | ||||
| 
 | ||||
| <a href="index.html"> | ||||
|   THDL Tools</a> | ||||
| 
 | ||||
| project is generously hosted by: | ||||
| 
 | ||||
| <!-- | ||||
| 
 | ||||
| DO NOT DELETE THE SF.NET LOGO. | ||||
| 
 | ||||
| We have a choice of colors and sizes for this logo (see | ||||
| "https://sourceforge.net/docman/display_doc.php?docid=790&group_id=1"), | ||||
| but we do not have the option of removing it.  SourceForge requests | ||||
| that we put it on each web page for our project, and to give us | ||||
| incentive to do so, they will not track the number of hits for our | ||||
| project web pages unless we put this link in.  To track hits, see | ||||
| "http://sourceforge.net/project/stats/index.php?report=months&group_id=61934". | ||||
| 
 | ||||
| --> | ||||
| <a href="http://sourceforge.net/"> | ||||
|   <img src="http://sourceforge.net/sflogo.php?group_id=61934&type=1" | ||||
|        width="88" height="31" border="0" alt="SourceForge Logo"> | ||||
| </a> | ||||
| <!-- AGAIN, DO NOT DELETE THE SF.NET LOGO. --> | ||||
| 
 | ||||
| 
 | ||||
| </body> | ||||
| </html> | ||||