added documentation

amontano 2002-11-28 06:54:46 +00:00
parent efa69fe225
commit 178ffcb800
5 changed files with 267 additions and 12 deletions

View file

@ -21,8 +21,20 @@ package org.thdl.tib.scanner;
import java.net.*;
import java.io.*;
/** Provides interface to convert from Tibetan text transliterated in
the Acip scheme to THDL's extended wylie scheme.
/** Provides an interface to convert from Tibetan text transliterated in the Acip scheme to THDL's <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">Extended Wylie</a> scheme.
<p>If no arguments are given, it reads the Acip text from the standard input and writes the
Wylie text to the standard output. If one argument is given, it is interpreted as the
input file name. If two arguments are given, the first is interpreted as the input file name and
the second as the output file name. For example, the following
command converts <i>lam-rim-chen-mo.act</i>, storing the results in <i>
lam-rim-chen-mo.txt</i>:</p>
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.AcipToWylie lam-rim-chen-mo.act lam-rim-chen-mo.txt</pre>
<p>Alternatively, you can do the same job by redirecting the standard input and
output:</p>
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.AcipToWylie &lt; lam-rim-chen-mo.act &gt; lam-rim-chen-mo.txt</pre>
<p>If you only want to display the results on the screen, you can run:</p>
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.AcipToWylie lam-rim-chen-mo.act | more</pre>
@author Andr&eacute;s Montano Pellegrini
@see WindowScannerFilter

View file

@ -23,8 +23,121 @@ import java.io.*;
into a binary file tree structure format, to be used
by some implementations of the SyllableListTree.
<p>The text files must be in the format used by
The Rangjung Yeshe Tibetan-English Dictionary of Buddhist Culture.</p>
<p>Syntax (dictionary files are assumed to have the extension .txt; do not include the extension):<ul>
<li><b>For one dictionary</b>, to read the definitions stored in <i>
dict-name.txt</i> and organize them into <i>dict-name.wrd</i> and <i>
dict-name.def</i>:<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator [-delimiter] dict-name</pre>
</li>
<li><b>For multiple dictionaries</b>, to read the definitions stored in <i>
dict-name1.txt</i>, <i>dict-name2.txt</i>, etc., and organize them into <i>
dest-file-name.wrd</i> and <i>dest-file-name.def</i>:<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator dest-file-name [-delimiter1] dict-name1 [[-delimiter2] dict-name2 ...]</pre>
</li>
</ul>
<p>-delimiter<ul>
<li><b>If this option is omitted</b>, it is assumed that each line is an entry
(no multiple-line entries) and that the definiendum and the definition are separated
by '-' (a dash). Although it is not
required, it is highly recommended to include a space before and after the dash
(to eliminate any possible ambiguity with regard to the transliteration of
reverse vowels in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
Extended Wylie</a>). A sample entry for the dictionary is:
<hr>
<pre>bkra shis - 1) auspiciousness, good luck, good fortune, goodness, prosperity, happiness. 2) auspicious, favorable, fortunate, successful, felicitous, lucky. 3) verse of auspiciousness; benediction, blessing. 4) a personal name.
bde legs - 1) goodness, happiness, well-being, welfare, auspiciousness, good fortune. 2) well, fine.</pre>
<hr>
<p>If this were the content of a file called &quot;<i>my-glossary.txt</i>&quot; the
binary tree file would be generated with the command:</p>
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator my-glossary</pre>
</li>
<li>-<b>tab</b>: it is assumed that each line is an entry (no multiple-line
entries) and that the definiendum and the definition are separated by '\t' (a horizontal tab).
One tab is enough; there is no need to &quot;align&quot; the definitions in your
word processor. A sample entry for the dictionary is:<hr>
<pre>bkra shis	1) auspiciousness, good luck, good fortune, goodness, prosperity, happiness. 2) auspicious, favorable, fortunate, successful, felicitous, lucky. 3) verse of auspiciousness; benediction, blessing. 4) a personal name.
bde legs	1) goodness, happiness, well-being, welfare, auspiciousness, good fortune. 2) well, fine.</pre>
<hr>
<p>Here, the
binary tree file would be generated with the command:</p>
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator -tab my-glossary</pre>
</li>
<li>
<b>-<i>string</i></b>: it is assumed that each line is an entry (no multiple-line
entries) and the definition and definiendum are separated by the character or
string of characters specified by the user. A sample entry for the dictionary
is:<hr>
<pre>bkra shis ** 1) auspiciousness, good luck, good fortune, goodness, prosperity, happiness. 2) auspicious, favorable, fortunate, successful, felicitous, lucky. 3) verse of auspiciousness; benediction, blessing. 4) a personal name.
bde legs ** 1) goodness, happiness, well-being, welfare, auspiciousness, good fortune. 2) well, fine.</pre>
<hr>
<p>Here, the
binary tree file would be generated with the command:</p>
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator -** my-glossary</pre>
</li>
<li>-<b>acip</b>: it is assumed that the electronic file is a transliteration of
a Tibetan dictionary. It is called &quot;acip&quot; because it accepts Acip's comment
codes ('@' to mark page numbers, brackets to mark comments, etc). Nevertheless,
it still requires the files to be in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
Extended Wylie</a>, so if your file is in Acip's transliteration scheme make
sure to run <i><a href="#org.thdl.tib.scanner.AcipToWylie">org.thdl.tib.scanner.AcipToWylie</a></i> first. Definitions here can
be of multiple lines, but with no blank lines in between. It is assumed that the
definiendum starts after a blank line (except at the beginning of a new page,
where the text could start with the last part of the previous definition) and extends up to the <i>
shad</i> (except when the <i>shad</i> is omitted because of grammar rules; for
instance, there is no <i>shad</i> after a &quot;ga&quot; suffix without a secondary suffix). Each
time a new letter starts, it should be clearly marked in brackets ('[', ']'),
parentheses ('(', ')') or braces ('{', '}'). A sample entry for the dictionary is:
<hr>
<pre>@1
(ka)
ka ba/ gdung 'degs don byed nus pa/
rkyen/ grogs byed
@2
(kha)
khyod dngos po dang de byung 'brel/ khyod dngos po las byung
zhing/ dngos po ldog stops kyis khyod ldog pa/
khyod dngos po dang bdag gcig 'brel/ khyod ngos po dang bdag
nyid gcig pa'i sgo nas tha dad gang zhig/ dngos po ldog
stops kyis khyod ldog pa/
khyod dngos po dang 'brel pa/ khyod dngos po dang tha dad gang
@3
zhig/ ngos po ldog stobs kyis khyod ldog pa/
kha dog mdog du rung ba'am/ sngo ser dkar dmar sogs mdog tu
rung ba'i gzugs/</pre>
<hr>
<p>Here the
binary tree file would be generated with the command:</p>
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator -acip my-glossary</pre>
<p><i>Comments:</i>&nbsp; Notice in the sample text that at the beginning of page 3, &quot;<i>zhig</i>&quot; is not a
new definiendum, but is still part of the definition of &quot;<i>khyod dngos po dang 'brel
pa</i>&quot;. Also, the definiendum of the last entry is &quot;<i>kha dog</i>&quot;
(the <i>shad</i> was omitted after the &quot;<i>ga</i>&quot; suffix) and not &quot;<i>kha dog mdog du rung ba'am</i>&quot;.
Nevertheless, the definiendum of the second entry on page 2 is not &quot;<i>khyod dngos po dang bdag</i>&quot;,
since there is no omitted <i>shad</i> after that &quot;<i>ga</i>&quot; suffix; the
definiendum is &quot;<i>khyod dngos po dang bdag gcig 'brel</i>&quot;. As is clear from the
sample text, the tool has to make a series of &quot;smart guesses&quot; to figure
out where each definiendum ends and its definition starts.&nbsp; This process is
not foolproof, so expect some mistakes.<br>
&nbsp;</p>
</li>
<li>
<p>Dictionaries in different formats can be processed together. For instance, the
command:
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator alldicts ry-dic99 -acip myglossary_uma -tab myglossary_rdzogs-chen</pre>
<p>would generate <i>alldicts.def</i> and <i>alldicts.wrd</i> processing <i>ry-dic99.txt</i>
as dash-separated, <i>myglossary_rdzogs-chen.txt</i> as tab-separated and <i>
myglossary_uma.txt</i> in the transliteration format explained above.<br>
&nbsp;</li>
</ul>
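<p>As an illustration only, the single-line delimiter formats described above could be split as in the following Java sketch. This is not the code of <i>BinaryFileGenerator</i>: the class and method names are hypothetical, and cutting each line at the first occurrence of the delimiter is an assumption. The multiple-line <i>-acip</i> format is deliberately left out.</p>

```java
class EntrySplitSketch {

    /** Maps a command-line option to the delimiter it stands for (hypothetical helper). */
    static String delimiterFor(String option) {
        if (option == null) {
            return "-";                      // option omitted: dash-separated entries
        }
        if (option.equals("-tab")) {
            return "\t";                     // -tab: tab-separated entries
        }
        if (option.equals("-acip")) {
            // The multiple-line format needs real parsing; it is outside this sketch.
            throw new IllegalArgumentException("-acip is not covered by this sketch");
        }
        return option.substring(1);          // -string: user-defined delimiter, e.g. -** gives **
    }

    /**
     * Splits one line into {definiendum, definition}.
     * Assumption: the first occurrence of the delimiter is the separator.
     * Returns null when the delimiter does not occur in the line.
     */
    static String[] splitEntry(String line, String delimiter) {
        int pos = line.indexOf(delimiter);
        if (pos < 0) {
            return null;
        }
        String definiendum = line.substring(0, pos).trim();
        String definition = line.substring(pos + delimiter.length()).trim();
        return new String[] { definiendum, definition };
    }

    public static void main(String[] args) {
        String[] entry = splitEntry("bde legs ** 1) goodness, happiness. 2) well, fine.",
                                    delimiterFor("-**"));
        System.out.println("definiendum: " + entry[0]);
        System.out.println("definition:  " + entry[1]);
    }
}
```

<p>Because the sketch cuts at the first delimiter, a dash inside the definiendum would be misread as the separator; this is the ambiguity that the recommended surrounding spaces are meant to avoid.</p>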
@author Andr&eacute;s Montano Pellegrini
@see SyllableListTree

View file

@ -23,7 +23,12 @@ import java.util.*;
/** Inputs a Tibetan text and displays the words with their
definitions through the console over a shell. Use when no
graphical interface is supported or for batch processes.
graphical interface is supported or for batch processes. For instance:</p>
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.ConsoleScannerFilter ry-dic99</pre>
<p>It reads from the standard input and prints the results to the
standard output. For example, if you want to parse a text stored in <i>puja.txt</i>
and save the results in <i>puja_words.txt</i>, you can run the command:</p>
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.ConsoleScannerFilter ry-dic99 &lt; puja.txt &gt; puja_words.txt</pre>
@author Andr&eacute;s Montano Pellegrini
*/

View file

@ -33,7 +33,12 @@ import org.thdl.tib.input.DuffPane;
Tibetan script) and displays the words (Roman or Tibetan script)
with their definitions. Works without Tibetan script on
platforms that don't support Swing. Can access dictionaries stored
locally or remotely.
locally or remotely. For example, to access the public dictionary database, run the command:</p>
<pre>java -jar DictionarySearchStandalone.jar http://iris.lib.virginia.edu/tibetan/servlet/org.thdl.tib.scanner.RemoteScannerFilter</pre>
<p>If the JRE you installed does not support <i>Swing</i> classes but supports
<i>
AWT</i> (such as the JRE for handhelds), run the command: </p>
<pre>java -jar DictionarySearchHandheld.jar -simple ry-dic99</pre>
@author Andr&eacute;s Montano Pellegrini
*/

View file

@ -8,11 +8,131 @@
-->
</head>
<body bgcolor="white">
Provides classes and methods for translating Tibetan text to English.
<p>
Right now, this package scans Tibetan text, but we aim to make it parse Tibetan text.
<p>
Author: Andr&eacute;s Montano Pellegrini
Provides the classes to take Tibetan language passages, divide them
into their component phrases and words, and display the corresponding dictionary definitions.
<p>This tool helps Tibetan-to-English translators partially automate the
translation process. In the Tibetan language, the boundaries of individual words
are not marked the way spaces separate and mark
words in English. Instead, there is
a punctuation mark called a &quot;tsheg&quot; which separates each syllable. Thus, while syllabic boundaries are explicit, word boundaries are
often unclear. One of the main
difficulties beginning students have with translating Tibetan texts is therefore
figuring out where each word ends and the next word starts, and determining what
series of syllables to look up in the dictionary, either as constituting a single
word or a larger compound phrase. This
entails a very time-consuming process of looking up multiple combinations of
syllables to determine which are found within a given dictionary.</p>
<p>The tool partially automates that process by
breaking up a sentence or paragraph entered in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
Extended Wylie</a> or Tibetan script
into the biggest component parts it can find in multiple dictionary databases.
Then, for each component part found, it displays its stored definitions and
relevant information. This will
often yield only the definition of a long phrase, rather than those of its component
words, but one can also search for the syllables of that phrase one by one
separately.</p>
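<p>The idea of finding the biggest component parts can be pictured as a greedy longest-match search over syllables. The Java sketch below is only an illustration under stated assumptions: it uses a plain <i>HashSet</i> of space-joined entries as a stand-in for the dictionary (the package itself refers to a <i>SyllableListTree</i>), the class and method names are hypothetical, and keeping an unknown syllable as a part of its own is a choice made here, not documented behavior.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LongestMatchSketch {

    /**
     * Splits a sequence of syllables into the longest runs found in the dictionary.
     * Assumption: dictionary entries are syllables joined by single spaces.
     */
    static List<String> segment(String[] syllables, Set<String> dictionary) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < syllables.length) {
            int matchedEnd = -1;
            String matched = null;
            // Try the longest candidate first and shrink it one syllable at a time.
            for (int end = syllables.length; end > start; end--) {
                String candidate = String.join(" ", Arrays.copyOfRange(syllables, start, end));
                if (dictionary.contains(candidate)) {
                    matchedEnd = end;
                    matched = candidate;
                    break;
                }
            }
            if (matched == null) {
                // No entry starts here: keep the syllable alone and move on.
                parts.add(syllables[start]);
                start++;
            } else {
                parts.add(matched);
                start = matchedEnd;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("bkra shis", "bde legs", "bde", "legs"));
        String[] syllables = "bkra shis bde legs".split(" ");
        System.out.println(segment(syllables, dictionary));
    }
}
```

<p>With this strategy the phrase "bde legs" is returned as one part even though "bde" and "legs" are also entries, which matches the remark that a long phrase may hide the definitions of its component words.</p>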
<p>The tool can run on-line through a:</p>
<ul>
<li>
Java servlet (using Roman script for input and Tibetan script for output)
directly on a browser<p>The text is typed (or pasted) using <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
Extended Wylie</a> in a
text box within a form. All of the processing is done on the server, and the
results are returned in plain HTML. This allows the user to run this version on
even the most basic browser without needing any additional software installed.
Also, because the results are returned in HTML, features of HTML like
hyperlinks, tables, and text formatting allow it to be skimmed more easily. The
user can choose between seeing the Tibetan within the results in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
Extended Wylie</a> or in Tibetan script
(<a href="http://iris.lib.virginia.edu/tibet/tools/tmw.html" target="_blank">using
Tibetan Machine Web font now available for free</a>).<br>
&nbsp;</p>
</li>
<li>
Java applet &amp; application (using Tibetan script for both input and output)
communicating to a servlet<p>The text is typed in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
Extended Wylie</a>, but with the added
value that the user can optionally choose to see it directly in the Tibetan script
(<a href="http://iris.lib.virginia.edu/tibet/tools/tmw.html" target="_blank">using
Tibetan Machine Web font now available for free</a>) while typing. We
eventually plan to support other keyboard methods of entry as well. Here, too, all the processing is done on the server side, and the results
are displayed interactively within the program's window. Again the user can choose
to see the results in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
Extended Wylie</a> or in the Tibetan script.&nbsp;</p>
<p>Even though the application runs as a stand-alone application on the
user's desktop, a connection to the Internet is still necessary to access the
dictionary databases. Easy launching of the application can be done over the
Internet using <a href="http://java.sun.com/products/javawebstart/">Java Web
Start</a>, which comes with <a href="#Sun's Java Runtime Environment">
Sun's Java Runtime Environment version 1.4</a> or higher. This is the
recommended way to run the tool.</p>
<p>The applet runs within a browser. The browser not only needs
to support Java, but since the classes that handle the Tibetan font use <i>Swing</i>, <a href="#Sun's Java Runtime Environment">
Sun's Java Runtime Environment version 1.4</a> or higher must additionally be installed.<br>
&nbsp;</p>
</li>
</ul>
<p>The tool can also run off-line in:</p>
<ul>
<li><b>Desktop &amp; laptop computers</b>
supporting Sun's <i>Java Runtime Environment</i> version 1.2 or higher,
although <a href="#Sun's Java Runtime Environment">version 1.4</a> or higher is
recommended.&nbsp;This is distributed as <i>DictionarySearchStandalone.jar</i>.</li>
<li><b>Handheld devices</b> supporting <a href="http://java.sun.com/products/personaljava/">PersonalJava
Application Environment</a> version 1.2a or higher. This is distributed as <i>
DictionarySearchHandheld.jar</i>.</li>
</ul>
<p>The classes designed to be run from the command-line are:</p>
<ul>
<li>BinaryFileGenerator
(included only in DictionarySearchStandalone.jar)</li>
<li>AcipToWylie
(included only in DictionarySearchStandalone.jar)</li>
<li>WindowScannerFilter
(included in both DictionarySearchStandalone.jar and
DictionarySearchHandheld.jar)</li>
<li>ConsoleScannerFilter
(included in both DictionarySearchStandalone.jar and
DictionarySearchHandheld.jar)</li>
</ul>
<p><i>Notes on Input:</i></p>
<ul>
<li>For the &quot;punctuation marks&quot;, the tool assumes that
<ul>
<li>'&nbsp; ' (tsheg), '_' (space), &lt;enter&gt;, &lt;tab&gt;:
function as syllable separators and may appear inside a component word or phrase.</li>
<li> '/' (shad), ';', '|', '!', ':', '[', ']', '^', '@', '#', '$', '%', '=', '&lt;', '&gt;',
'(', ')', '{', '}', <i>blank line</i> (two enters in a row): may not appear inside a
component word or phrase (and hence are interpreted as marking the end of a
component word or phrase). See <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
Extended Wylie</a> documentation for the corresponding symbols in the
Tibetan script.
</li>
<li>all other characters are part of the syllable<br>
</li>
</ul></li>
<li>To force the parser to &quot;break up&quot; a component word or phrase into
its individual syllables, use any character of the second set in between the syllables. For
example, if the entry is:
<p><i>chos nyid</i></p>
<p>or</p>
<p><i>chos<br>
nyid</i></p>
<p>the parser will recognize it as a single word &quot;<i>chos nyid</i>&quot;.
But if the entry is:</p>
<p><i>chos / nyid</i></p>
<p>or</p>
<p><i>chos</i></p>
<p><i>nyid</i></p>
<p>the parser will assume &quot;chos&quot; and &quot;nyid&quot; are independent,
and will look them up separately.</p>
</li>
</ul>
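<p>The input rules above can be sketched in Java as follows. This is a minimal illustration, not the parser shipped in this package: the class and method names are hypothetical. Characters of the first set only end a syllable, while characters of the second set (and a blank line, taken here as two consecutive line breaks) also end the current run of syllables, so that no word or phrase is matched across them.</p>

```java
import java.util.ArrayList;
import java.util.List;

class InputSplitSketch {

    // First set: syllable separators (tsheg typed as a space, '_', enter, tab).
    private static final String SYLLABLE_SEPARATORS = " _\n\t";
    // Second set: characters that may not appear inside a component word or phrase.
    private static final String PHRASE_BREAKERS = "/;|!:[]^@#$%=<>(){}";

    /** Returns the runs of syllables within which words or phrases may be looked up. */
    static List<List<String>> split(String text) {
        List<List<String>> runs = new ArrayList<>();
        List<String> run = new ArrayList<>();
        StringBuilder syllable = new StringBuilder();
        String normalized = text.replace("\r\n", "\n");
        char previous = 0;
        for (int i = 0; i < normalized.length(); i++) {
            char c = normalized.charAt(i);
            boolean blankLine = c == '\n' && previous == '\n';
            if (PHRASE_BREAKERS.indexOf(c) >= 0 || blankLine) {
                endSyllable(syllable, run);
                run = endRun(run, runs);
            } else if (SYLLABLE_SEPARATORS.indexOf(c) >= 0) {
                endSyllable(syllable, run);
            } else {
                syllable.append(c);          // all other characters belong to the syllable
            }
            previous = c;
        }
        endSyllable(syllable, run);
        endRun(run, runs);
        return runs;
    }

    private static void endSyllable(StringBuilder syllable, List<String> run) {
        if (syllable.length() > 0) {
            run.add(syllable.toString());
            syllable.setLength(0);
        }
    }

    private static List<String> endRun(List<String> run, List<List<String>> runs) {
        if (run.isEmpty()) {
            return run;
        }
        runs.add(run);
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        System.out.println(split("chos nyid"));
        System.out.println(split("chos / nyid"));
    }
}
```

<p>In this sketch "chos nyid" stays in one run, so the two syllables can still be matched together as a phrase, whereas "chos / nyid" yields two separate runs.</p>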
<p>Author: Andr&eacute;s Montano Pellegrini</p>
<p>
<h2>Related Documentation</h2>
@see <a href="../text/package-summary.html">org.thdl.tib.text</a>