added documentation

2002-11-28 06:54:46 +00:00 · 2002-11-28 06:54:46 +00:00 · 178ffcb800
commit 178ffcb800
parent efa69fe225
5 changed files with 267 additions and 12 deletions
--- a/source/org/thdl/tib/scanner/AcipToWylie.java
+++ b/source/org/thdl/tib/scanner/AcipToWylie.java
@ -21,8 +21,20 @@ package org.thdl.tib.scanner;
 import java.net.*;
 import java.io.*;
-/** Provides interfase to convert from tibetan text transliterated in
+/** Provides an interface to convert Tibetan text transliterated in the Acip scheme to THDL's <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">Extended Wylie</a> scheme.
-    the Acip scheme to THDL's extended wylie scheme.
+
 <p>If no arguments are sent, it takes the Acip text from the standard input and sends the 
 Wylie text to the standard output. If one argument is sent, it interprets it as the
 file name for the input. If two arguments are sent, it interprets the first one as the file name for the input and
 the second one as the file name for the output. For example, the following 
 command converts <i>lam-rim-chen-mo.act</i>, storing the results in
 <i>lam-rim-chen-mo.txt</i>:</p>
 <pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.AcipToWylie lam-rim-chen-mo.act lam-rim-chen-mo.txt</pre>
 <p>Alternatively, you can perform the same job by redirecting the standard input 
 and output:</p>
 <pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.AcipToWylie &lt; lam-rim-chen-mo.act &gt; lam-rim-chen-mo.txt</pre>
 <p>If you only want to display the results to the screen, you can run:</p>
 <pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.AcipToWylie lam-rim-chen-mo.act | more</pre>
    @author Andr&eacute;s Montano Pellegrini
 	@see WindowScannerFilter
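The argument convention described in the AcipToWylie comment above can be sketched roughly as follows. This is only an illustration, not the actual AcipToWylie source: the class `AcipArgsSketch` and its methods are hypothetical names, and rejecting more than two arguments is an assumption, since the documentation does not say what happens in that case.

```java
// Hypothetical sketch of the documented argument convention; not the real AcipToWylie code.
class AcipArgsSketch {

    /** Summarises which source and destination the given arguments select. */
    static String describe(String[] args) {
        switch (args.length) {
            case 0:
                return "stdin -> stdout";
            case 1:
                return args[0] + " -> stdout";
            case 2:
                return args[0] + " -> " + args[1];
            default:
                // Assumption: the documentation does not cover this case.
                throw new IllegalArgumentException("expected at most two arguments");
        }
    }

    /** Opens the input: the first argument if present, otherwise standard input. */
    static java.io.InputStream openInput(String[] args) throws java.io.IOException {
        return args.length >= 1 ? new java.io.FileInputStream(args[0]) : System.in;
    }

    /** Opens the output: the second argument if present, otherwise standard output. */
    static java.io.PrintStream openOutput(String[] args) throws java.io.IOException {
        return args.length >= 2
                ? new java.io.PrintStream(new java.io.FileOutputStream(args[1]))
                : System.out;
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

Keeping the stream selection in two small helpers means the conversion loop itself would not need to know whether it is reading a file or a redirected standard input.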
--- a/source/org/thdl/tib/scanner/BinaryFileGenerator.java
+++ b/source/org/thdl/tib/scanner/BinaryFileGenerator.java
@ -23,8 +23,121 @@ import java.io.*;
 	into a binary file tree structure format, to be used
 	by some implementations of the SyllableListTree.
-	<p>The text files must be in the format used by the
+<p>Syntax (Dictionary files are assumed to be .txt. Don't include extensions!):<ul>
-	The Rangjung Yeshe Tibetan-English Dictionary of Buddhist Culture.</p>
+	<li><b>For one dictionary</b>, to read the definitions stored in <i>
    dict-name.txt</i> and organize them into <i>dict-name.wrd</i> and <i>
    dict-name.def</i>:<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator [-delimiter] dict-name</pre>
 	</li>
 	<li><b>For multiple dictionaries</b>, to read the definitions stored in <i>
    dict-name1.txt</i>, <i>dict-name2.txt</i>, etc., and organize them into <i>
    dest-file-name.wrd</i> and <i>dest-file-name.def</i>:<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator dest-file-name [-delimiter1] dict-name1 [[-delimiter2] dict-name2 ...]</pre>
 	</li>
 </ul>
 <p>-delimiter<ul>
 <li><b>If this option is omitted</b>, it is assumed that each line is an entry 
 (no multiple-line entries) and the definition and definiendum are separated 
 by '-' (a dash). Although it is not 
 required, it is highly recommended to put a space before and after the dash 
 (to eliminate any possible ambiguity with the transliteration of 
 reverse vowels in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
    Extended Wylie</a>). A sample entry for the dictionary is:
    <hr>
    <pre>bkra shis - 1) auspiciousness, good luck, good fortune, goodness, prosperity, happiness. 2) auspicious, favorable, fortunate, successful, felicitous, lucky. 3) verse of auspiciousness; benediction, blessing. 4) a personal name.
 bde legs - 1) goodness, happiness, well-being, welfare, auspiciousness, good fortune. 2) well, fine.</pre>
 <hr>
    <p>If this were the content of a file called &quot;<i>my-glossary.txt</i>&quot; the 
    binary tree file would be generated with the command:</p>
    <pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator my-glossary</pre>
    </li>
 <li>-<b>tab</b>: it is assumed that each line is an entry (no multiple-line 
 entries) and the definition and definiendum are separated by '\t' (a horizontal tab). 
 One tab is enough; there is no need to &quot;align&quot; the definitions in your 
 word processor. A sample entry for the dictionary is:<hr>
    <pre>bkra shis	1) auspiciousness, good luck, good fortune, goodness, prosperity, happiness. 2) auspicious, favorable, fortunate, successful, felicitous, lucky. 3) verse of auspiciousness; benediction, blessing. 4) a personal name.
 bde legs	1) goodness, happiness, well-being, welfare, auspiciousness, good fortune. 2) well, fine.</pre>
 <hr>
    <p>Here, the 
    binary tree file would be generated with the command:</p>
    <pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator -tab my-glossary</pre>
 </li>
 <li>
 <b>-<i>string</i></b>: it is assumed that each line is an entry (no multiple-line 
 entries) and the definition and definiendum are separated by the character or 
 string of characters specified by the user. A sample entry for the dictionary 
 is:<hr>
    <pre>bkra shis ** 1) auspiciousness, good luck, good fortune, goodness, prosperity, happiness. 2) auspicious, favorable, fortunate, successful, felicitous, lucky. 3) verse of auspiciousness; benediction, blessing. 4) a personal name.
 bde legs ** 1) goodness, happiness, well-being, welfare, auspiciousness, good fortune. 2) well, fine.</pre>
 <hr>
    <p>Here, the 
    binary tree file would be generated with the command:</p>
    <pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator -** my-glossary</pre>
 </li>
 <li>-<b>acip</b>: it is assumed that the electronic file is a transliteration of 
 a Tibetan dictionary. It is called &quot;acip&quot; because it accepts Acip's comment 
 codes ('@' to mark page numbers, brackets to mark comments, etc). Nevertheless, 
 it still requires the files to be in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
    Extended Wylie</a>, so if your file is in Acip's transliteration scheme make 
 sure to run <i><a href="#org.thdl.tib.scanner.AcipToWylie">org.thdl.tib.scanner.AcipToWylie</a></i> first. Definitions here can 
 span multiple lines, but with no blank lines in between. It is assumed that the 
 definiendum starts after a blank line (except at the beginning of a new page, 
 where it could start with the last part of the previous definition) and runs up to the <i>
 shad</i> (except when the <i>shad</i> is omitted because of grammar rules; for 
 instance, there is no <i>shad</i> after a &quot;ga&quot; suffix without a secondary suffix). Each 
 time a new letter starts, it should be clearly marked in brackets ('[', ']'), 
 parentheses ('(', ')') or braces ('{', '}'). A sample entry for the dictionary is:
 <hr>
 <pre>@1
 (ka)
 ka ba/ gdung 'degs don byed nus pa/
 rkyen/ grogs byed
@2
 (kha)
 khyod dngos po dang de byung 'brel/  khyod dngos po las byung
 zhing/ dngos po ldog stops kyis khyod ldog pa/
 khyod dngos po dang bdag gcig 'brel/ khyod ngos po dang bdag
 nyid gcig pa'i sgo nas tha dad gang zhig/ dngos po ldog
 stops kyis khyod ldog pa/
 khyod dngos po dang 'brel pa/ khyod dngos po dang tha dad gang
@3
 zhig/ ngos po ldog stobs kyis khyod ldog pa/
 kha dog  mdog du rung ba'am/ sngo ser dkar dmar sogs mdog tu
 rung ba'i gzugs/</pre>
 <hr>
    <p>Here the 
    binary tree file would be generated with the command:</p>
    <pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator -acip my-glossary</pre>
 <p><i>Comments:</i>&nbsp; Notice in the sample text that at the beginning of page 3, &quot;<i>zhig</i>&quot; is not a 
 new definiendum, but is still part of the definition of &quot;<i>khyod dngos po dang 'brel 
 pa</i>&quot;. Also, the definiendum of the last entry is &quot;<i>kha dog</i>&quot; 
 (the <i>shad</i> was omitted after the &quot;<i>ga</i>&quot; suffix) and not &quot;<i>kha dog mdog du rung ba'am</i>&quot;. 
 Nevertheless, the definiendum of the second term is not &quot;<i>khyod dngos po dang bdag</i>&quot;, 
 since there is no omitted <i>shad</i> after that &quot;<i>ga</i>&quot; suffix; the 
 definiendum is &quot;<i>khyod dngos po dang bdag gcig 'brel</i>&quot;. As is clear from the 
 sample text, the tool has to make a series of &quot;smart guesses&quot; to try to figure 
 out where each definiendum ends and its definition starts.&nbsp; This process is 
 not foolproof, so expect some mistakes.<br>
 &nbsp;</p>
 </li>
  <li>
 <p>Dictionaries in different formats can be processed together. For instance the 
 command:
 <pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator alldicts ry-dic99 -acip myglossary_uma -tab myglossary_rdzogs-chen</pre>
 <p>would generate <i>alldicts.def</i> and <i>alldicts.wrd</i> processing <i>ry-dic99.txt</i> 
 as dash-separated, <i>myglossary_rdzogs-chen.txt</i> as tab-separated and <i>
 myglossary_uma.txt</i> in the transliteration format explained above.<br>
 &nbsp;</li>
 </ul>
    @author Andr&eacute;s Montano Pellegrini
    @see SyllableListTree
--- a/source/org/thdl/tib/scanner/ConsoleScannerFilter.java
+++ b/source/org/thdl/tib/scanner/ConsoleScannerFilter.java
@ -23,7 +23,12 @@ import java.util.*;
 /** Inputs a Tibetan text and displays the words with their
    definitions through the console over a shell. Use when no
-    graphical interfase is supported or for batch processes.
+    graphical interface is supported or for batch processes. For instance:</p>
    <pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.ConsoleScannerFilter ry-dic99</pre>
    <p>It reads from the standard input and prints the results to the
    standard output. For example, if you want to parse a text stored in <i>puja.txt</i>
    and save the results in <i>puja_words.txt</i>, you can run the command:</p>
    <pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.ConsoleScannerFilter ry-dic99 &lt; puja.txt &gt; puja_words.txt</pre>
    @author Andr&eacute;s Montano Pellegrini
 */
--- a/source/org/thdl/tib/scanner/WindowScannerFilter.java
+++ b/source/org/thdl/tib/scanner/WindowScannerFilter.java
@ -33,7 +33,12 @@ import org.thdl.tib.input.DuffPane;
    Tibetan script) and displays the words (Roman or Tibetan script)
    with their definitions. Works without Tibetan script in
    platforms that don't support Swing. Can access dictionaries stored
-    locally or remotely.
+    locally or remotely. For example, to access the public dictionary database, run the command:</p>
    <pre>java -jar DictionarySearchStandalone.jar http://iris.lib.virginia.edu/tibetan/servlet/org.thdl.tib.scanner.RemoteScannerFilter</pre>
  <p>If the JRE you installed does not support <i>Swing</i> classes but supports
    <i>AWT</i> (such as the JRE for handhelds), run the command:</p>
    <pre>java -jar DictionarySearchHandheld.jar -simple ry-dic99</pre>
    @author Andr&eacute;s Montano Pellegrini
 */
--- a/source/org/thdl/tib/scanner/package.html
+++ b/source/org/thdl/tib/scanner/package.html
@ -8,14 +8,134 @@
 -->
 </head>
 <body bgcolor="white">
-Provides classes and methods for translating Tibetan text to English.
+Provides the classes to take Tibetan language passages and divide the passages up
-<p>
+into their component phrases and words, and display corresponding dictionary definitions.
-Right now, this package scans Tibetan text, but we aim to make it parse Tibetan text.
+<p>This tool helps Tibetan-to-English translators partially automate the
-<p>
+translation process. In the Tibetan language, the boundaries of individual words
-Author: Andr&eacute;s Montano Pellegrini
+are not marked in the way that spaces separate and mark
 words in English. Instead, there is
 a punctuation mark called a &quot;tsheg&quot; which separates each syllable. Thus while syllabic boundaries are utterly explicit, word boundaries are
 often unclear. One of the main
 difficulties beginning students thus have with translating Tibetan texts is
 figuring out where each word ends and the next word starts, and determining what
 series of syllables to look up in the dictionary either as constituting a single
 word or a larger compound phrase. This
 entails a very time-consuming process of looking up multiple combinations of
 syllables to determine which are found within a given dictionary.</p>
 <p>The tool partially automates that process by
 breaking up a sentence/paragraph entered in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
    Extended Wylie</a>  or Tibetan script
 into the biggest component parts it can find in multiple dictionary databases.
 Then for each component part found, it displays its stored definitions and
 relevant information. This will
 thus often yield only the definition of a long phrase, rather than its component
 words, but one can also search for the syllables of that phrase one by one
 separately.</p>
 <p>The tool can run on-line through a:</p>
 <ul>
  <li> 
  Java servlet (using Roman script for input and Tibetan script for output) 
  directly on a browser<p>The text is typed (or pasted) using <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
    Extended Wylie</a> in a
 text box within a form. All of the processing is done on the server, and the
 results are returned in plain HTML. This allows the user to run this version on
 even the most basic browser without needing any additional software installed.
 Also, because the results are returned in HTML, features of HTML like
 hyperlinks, tables, and text formatting allow it to be skimmed more easily. The
 user can choose between seeing the Tibetan within the results in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
    Extended Wylie</a> or in Tibetan script
 (<a href="http://iris.lib.virginia.edu/tibet/tools/tmw.html" target="_blank">using
 Tibetan Machine Web font now available for free</a>).<br>
 &nbsp;</p>  
  </li>
  <li> 
  Java applet &amp; application (using Tibetan script for both input and output) 
  communicating to a servlet<p>The text is typed in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
    Extended Wylie</a>, but with the added
 value that optionally the user can choose to see it directly in the Tibetan script
 (<a href="http://iris.lib.virginia.edu/tibet/tools/tmw.html" target="_blank">using
 Tibetan Machine Web font now available for free</a>) while typing. We
 eventually plan to support other keyboard methods of entry as well. Here all the processing is also done on the server side, and the results
 are displayed interactively within the program's window. Again the user can choose
 to see the results in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
    Extended Wylie</a> or in the Tibetan script.&nbsp;</p>
 <p>Even though the application runs as a stand-alone application on the
 user's desktop, a connection to the Internet is still necessary to access the
 dictionary databases. The application can be launched easily over the
 Internet using <a href="http://java.sun.com/products/javawebstart/">Java Web
 Start</a>, which comes with <a href="#Sun's Java Runtime Environment">
 Sun's Java Runtime Environment version 1.4</a> or higher. This is the
 recommended way to run the tool.</p>
 <p>The applet runs within a browser. The browser not only needs
 to support Java, but since the classes that handle the Tibetan font use <i>Swing</i>, <a href="#Sun's Java Runtime Environment">
 Sun's Java Runtime Environment version 1.4</a> or higher must additionally be installed.<br>
 &nbsp;</p>  
  </li>
 </ul>
 <p>The tool can  also run off-line in:</p>
 <ul>
  <li><b>Desktop &amp; laptop computers</b>
 supporting Sun's <i>Java Runtime Environment</i> version 1.2 or higher,
    although <a href="#Sun's Java Runtime Environment">version 1.4</a> or higher is
    recommended.&nbsp;This is distributed as <i>DictionarySearchStandalone.jar</i>.</li>
  <li><b>Handheld devices</b> supporting <a href="http://java.sun.com/products/personaljava/">PersonalJava
 Application Environment</a> version 1.2a or higher. This is distributed as <i>
  DictionarySearchHandheld.jar</i>.</li>
 </ul>
 <p>The classes designed to be run from the command-line are:</p>
 <ul>
  <li>BinaryFileGenerator 
 (included only in DictionarySearchStandalone.jar)</li>
  <li>AcipToWylie 
 (included only in DictionarySearchStandalone.jar)</li>
  <li>WindowScannerFilter 
 (included in both DictionarySearchStandalone.jar and 
 DictionarySearchHandheld.jar)</li>
  <li>ConsoleScannerFilter 
  (included in both DictionarySearchStandalone.jar and 
  DictionarySearchHandheld.jar)</li>
 </ul>
 <p><i>Notes on Input:</i></p>
 <ul>
  <li>For the &quot;punctuation marks&quot;, the tool assumes that
  <ul>
  <li>'&nbsp; ' (tsheg), '_' (space), &lt;enter&gt;, &lt;tab&gt;:
    function as syllable separators and may appear inside a component word or phrase.</li>
  <li> '/' (shad), ';', '|', '!', ':', '[', ']', '^', '@', '#', '$', '%', '=', '&lt;', '&gt;',
    '(', ')', '{', '}', <i>blank line</i> (two enters in a row): may not appear inside a
    component word or phrase (and hence are interpreted as marking the end of a
    component word or phrase). See <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
    Extended Wylie</a> documentation for the corresponding symbols in the
    Tibetan script.
  </li>
  <li>all other characters are part of the syllable<br>
  </li>
  </ul></li>
 <li>To force the parser to &quot;break up&quot; a component word or phrase into
 its individual syllables, use any character of the second set in between the syllables. For
  example, if the entry is:
  <p><i>chos nyid</i></p>
  <p>or</p>
 <p><i>chos<br>
 nyid</i></p>
 <p>the parser will recognize it as a single word &quot;<i>chos nyid</i>&quot;.
 But if the entry is:</p>
 <p><i>chos / nyid</i></p>
 <p>or</p>
 <p><i>chos</i></p>
 <p><i>nyid</i></p>
 <p>the parser will assume &quot;chos&quot; and &quot;nyid&quot; are independent
 and will look them up separately.</p>
 </li>
 </ul>
 <p>Author: Andr&eacute;s Montano Pellegrini</p>
 <p>
 <h2>Related Documentation</h2>
@see <a href="../text/package-summary.html">org.thdl.tib.text</a>
@see <a href="../input/package-summary.html">org.thdl.tib.input</a>
 </body>
-</html>
+</html>
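The "Notes on Input" rules in package.html can be illustrated with a small sketch that cuts a passage into runs of syllables, where only syllables within the same run may be combined into a component word or phrase. This is not the scanner's actual tokenizer: the class `InputSplitSketch`, its constants, and the choice to rewrite a blank line as a shad before scanning are all assumptions made for the illustration.

```java
// Hypothetical sketch of the documented input rules; not the scanner's real tokenizer.
class InputSplitSketch {

    // Characters that may not appear inside a component word or phrase.
    static final String HARD_BREAKS = "/;|!:[]^@#$%=<>(){}";

    // Characters that only separate syllables: tsheg (typed as a space), '_', enter, tab.
    static final String SYLLABLE_SEPARATORS = " _\n\t\r";

    /** Returns the runs of syllables that the parser may combine into words or phrases. */
    static java.util.List<java.util.List<String>> segments(String text) {
        // A blank line (two enters in a row) counts as a hard break, so rewrite it as '/'.
        String normalised = text.replace("\r\n", "\n").replace("\n\n", "/");
        java.util.List<java.util.List<String>> result = new java.util.ArrayList<>();
        java.util.List<String> current = new java.util.ArrayList<>();
        StringBuilder syllable = new StringBuilder();
        for (char c : normalised.toCharArray()) {
            boolean hard = HARD_BREAKS.indexOf(c) >= 0;
            if (hard || SYLLABLE_SEPARATORS.indexOf(c) >= 0) {
                // Any separator ends the syllable being collected.
                if (syllable.length() > 0) {
                    current.add(syllable.toString());
                    syllable.setLength(0);
                }
                // A hard break additionally ends the current run.
                if (hard && !current.isEmpty()) {
                    result.add(current);
                    current = new java.util.ArrayList<>();
                }
            } else {
                syllable.append(c); // all other characters are part of the syllable
            }
        }
        if (syllable.length() > 0) {
            current.add(syllable.toString());
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(segments("chos nyid"));
        System.out.println(segments("chos / nyid"));
    }
}
```

With this split, "chos nyid" stays in one run and can be looked up as a single entry, while "chos / nyid" yields two runs that would be looked up separately, matching the examples in the notes.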