added documentation
parent
efa69fe225
commit
178ffcb800
5 changed files with 267 additions and 12 deletions
|
@ -21,8 +21,20 @@ package org.thdl.tib.scanner;
|
||||||
import java.net.*;
|
import java.net.*;
|
||||||
import java.io.*;
|
import java.io.*;
|
||||||
|
|
||||||
/** Provides interfase to convert from tibetan text transliterated in
|
/** Provides an interface to convert from Tibetan text transliterated in the Acip scheme to THDL's <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">Extended Wylie</a> scheme.
|
||||||
the Acip scheme to THDL's extended wylie scheme.
|
|
||||||
|
<p>If no arguments are sent, it takes the Acip text from the standard input and sends the
|
||||||
|
Wylie text to the standard output. If one argument is sent, it interprets it as the
|
||||||
|
file name for the input. If two arguments are sent, it interprets the first one as the file name for the input and
|
||||||
|
the second one as the file name for the output. For example, the following
|
||||||
|
command converts <i>lam-rim-chen-mo.act</i>, storing the results in <i>
|
||||||
|
lam-rim-chen-mo.txt</i>:</p>
|
||||||
|
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.AcipToWylie lam-rim-chen-mo.act lam-rim-chen-mo.txt</pre>
|
||||||
|
<p>Alternatively, by redirecting the standard input/output you can perform the same
|
||||||
|
job:</p>
|
||||||
|
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.AcipToWylie < lam-rim-chen-mo.act > lam-rim-chen-mo.txt</pre>
|
||||||
|
<p>If you only want to display the results to the screen, you can run:</p>
|
||||||
|
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.AcipToWylie lam-rim-chen-mo.act | more</pre>
|
||||||
|
|
||||||
@author Andrés Montano Pellegrini
|
@author Andrés Montano Pellegrini
|
||||||
@see WindowScannerFilter
|
@see WindowScannerFilter
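The argument handling described above (no arguments, one argument, two arguments) could be sketched roughly as follows. This is only an illustration, not the actual AcipToWylie source: the class name `StreamSelectionSketch`, its helper methods, the UTF-8 charset, and the pass-through `convertLine` placeholder are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and charset are assumptions, not the real tool.
final class StreamSelectionSketch {

    /** First argument, if present, names the input file; otherwise use standard input. */
    static InputStream openInput(String[] args) {
        try {
            return args.length >= 1 ? new FileInputStream(args[0]) : System.in;
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Second argument, if present, names the output file; otherwise use standard output. */
    static OutputStream openOutput(String[] args) {
        try {
            return args.length >= 2 ? new FileOutputStream(args[1]) : System.out;
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Placeholder for the real Acip-to-Wylie conversion; returns the line unchanged. */
    static String convertLine(String acipLine) {
        return acipLine;
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(openInput(args), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                 new OutputStreamWriter(openOutput(args), StandardCharsets.UTF_8), true)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(convertLine(line)); // one converted line per input line
            }
        }
    }
}
```

Because both helpers fall back to the standard streams, the same program works with file names, with shell redirection, or in a pipe such as `| more`.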
|
||||||
|
|
|
@ -23,8 +23,121 @@ import java.io.*;
|
||||||
into a binary file tree structure format, to be used
|
into a binary file tree structure format, to be used
|
||||||
by some implementations of the SyllableListTree.
|
by some implementations of the SyllableListTree.
|
||||||
|
|
||||||
<p>The text files must be in the format used by the
|
<p>Syntax (Dictionary files are assumed to be .txt. Don't include extensions!):<ul>
|
||||||
The Rangjung Yeshe Tibetan-English Dictionary of Buddhist Culture.</p>
|
<li><b>For one dictionary</b>, to read the definitions stored in <i>
|
||||||
|
dict-name.txt</i> and organize them into <i>dict-name.wrd</i> and <i>
|
||||||
|
dict-name.def</i>:<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator [-delimiter] dict-name</pre>
|
||||||
|
</li>
|
||||||
|
<li><b>For multiple dictionaries</b>, to read the definitions stored in <i>
|
||||||
|
dict-name1.txt</i>, <i>dict-name2.txt</i>, etc. and organize them into <i>
|
||||||
|
dest-file-name.wrd</i> and <i>dest-file-name.def</i>:<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator dest-file-name [-delimiter1] dict-name1 [[-delimiter2] dict-name2 ...]</pre>
|
||||||
|
</li>
|
||||||
|
</ul>
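One way to read the two syntaxes above is sketched below. It is not taken from the BinaryFileGenerator source; the class `GeneratorArgsSketch`, its nested types, and in particular the rule used to tell the two forms apart (a single argument, or a first argument starting with a dash, means "one dictionary") are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the command-line syntax; not the actual parser.
final class GeneratorArgsSketch {

    /** One input dictionary and the option written before it (null when omitted). */
    static final class Source {
        final String name;
        final String option;
        Source(String name, String option) { this.name = name; this.option = option; }
        @Override public String toString() { return name + ":" + (option == null ? "dash" : option); }
    }

    /** Destination base name (for the .wrd and .def files) plus the inputs. */
    static final class Plan {
        final String destination;
        final List<Source> sources;
        Plan(String destination, List<Source> sources) {
            this.destination = destination;
            this.sources = sources;
        }
    }

    static Plan parse(String[] args) {
        if (args.length == 0) throw new IllegalArgumentException("no dictionary name given");
        // Assumption: without a leading destination name there is exactly one dictionary.
        boolean single = args.length == 1 || args[0].startsWith("-");
        List<Source> sources = new ArrayList<>();
        int i = single ? 0 : 1;
        while (i < args.length) {
            String option = null;
            if (args[i].startsWith("-")) {          // e.g. -tab, -acip, -**
                option = args[i].substring(1);
                i++;
                if (i == args.length) throw new IllegalArgumentException("option without a dictionary name");
            }
            sources.add(new Source(args[i], option));
            i++;
        }
        if (single && sources.size() != 1) throw new IllegalArgumentException("expected exactly one dictionary");
        String destination = single ? sources.get(0).name : args[0];
        return new Plan(destination, sources);
    }

    public static void main(String[] args) {
        Plan plan = parse(args);
        System.out.println(plan.destination + ".wrd and " + plan.destination + ".def from " + plan.sources);
    }
}
```

Under these assumptions, dictionary names that merely contain a dash (such as `ry-dic99`) are not mistaken for options, because only a leading dash is checked.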
|
||||||
|
<p>-delimiter<ul>
|
||||||
|
<li><b>If this option is omitted</b>, it is assumed that each line is an entry
|
||||||
|
(no multiple-line entries) and the definition and definiendum are separated
|
||||||
|
by '-' (a dash). Even though it is not
|
||||||
|
required, it is highly recommended to include a space before and after it
|
||||||
|
(to eliminate any possible ambiguity with regard to the transliteration of
|
||||||
|
reverse vowels in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
|
||||||
|
Extended Wylie</a>). A sample entry for the dictionary is:
|
||||||
|
<hr>
|
||||||
|
<pre>bkra shis - 1) auspiciousness, good luck, good fortune, goodness, prosperity, happiness. 2) auspicious, favorable, fortunate, successful, felicitous, lucky. 3) verse of auspiciousness; benediction, blessing. 4) a personal name.
|
||||||
|
bde legs - 1) goodness, happiness, well-being, wellfare, auspiciousness, good fortune. 2) well, fine.</pre>
|
||||||
|
<hr>
|
||||||
|
<p>If this were the content of a file called "<i>my-glossary.txt</i>" the
|
||||||
|
binary tree file would be generated with the command:</p>
|
||||||
|
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator my-glossary</pre>
|
||||||
|
</li>
|
||||||
|
<li>-<b>tab</b>: it is assumed that each line is an entry (no multiple-line
|
||||||
|
entries) and the definition and definiendum are separated by '\t' (horizontal tabulation).
|
||||||
|
One tabulation is enough; don't feel the need to "align" the definitions in your
|
||||||
|
word-processor. A sample entry for the dictionary is:<hr>
|
||||||
|
<pre>bkra shis 1) auspiciousness, good luck, good fortune, goodness, prosperity, happiness. 2) auspicious, favorable, fortunate, successful, felicitous, lucky. 3) verse of auspiciousness; benediction, blessing. 4) a personal name.
|
||||||
|
bde legs 1) goodness, happiness, well-being, wellfare, auspiciousness, good fortune. 2) well, fine.</pre>
|
||||||
|
<hr>
|
||||||
|
<p>Here, the
|
||||||
|
binary tree file would be generated with the command:</p>
|
||||||
|
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator -tab my-glossary</pre>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<b>-<i>string</i></b>: it is assumed that each line is an entry (no multiple-line
|
||||||
|
entries) and the definition and definiendum are separated by the character or
|
||||||
|
string of characters specified by the user. A sample entry for the dictionary
|
||||||
|
is:<hr>
|
||||||
|
<pre>bkra shis ** 1) auspiciousness, good luck, good fortune, goodness, prosperity, happiness. 2) auspicious, favorable, fortunate, successful, felicitous, lucky. 3) verse of auspiciousness; benediction, blessing. 4) a personal name.
|
||||||
|
bde legs ** 1) goodness, happiness, well-being, wellfare, auspiciousness, good fortune. 2) well, fine.</pre>
|
||||||
|
<hr>
|
||||||
|
<p>Here, the
|
||||||
|
binary tree file would be generated with the command:</p>
|
||||||
|
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator -** my-glossary</pre>
|
||||||
|
</li>
|
||||||
|
<li>-<b>acip</b>: it is assumed that the electronic file is a transliteration of
|
||||||
|
a Tibetan dictionary. It is called "acip" because it accepts Acip's comment
|
||||||
|
codes ('@' to mark page numbers, brackets to mark comments, etc.). Nevertheless,
|
||||||
|
it still requires the files to be in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
|
||||||
|
Extended Wylie</a>, so if your file is in Acip's transliteration scheme make
|
||||||
|
sure to run <i><a href="#org.thdl.tib.scanner.AcipToWylie">org.thdl.tib.scanner.AcipToWylie</a></i> first. Definitions here can
|
||||||
|
span multiple lines, but with no blank lines in between. It is assumed that the
|
||||||
|
definiendum starts after a blank line (except at the beginning of a new page
|
||||||
|
where it could start with the last part of the previous definition) up to the <i>
|
||||||
|
shad</i> (except when the <i>shad</i> is omitted because of grammar rules; for
|
||||||
|
instance, there is no <i>shad</i> after a "ga" suffix without a secondary suffix). Each
|
||||||
|
time a new letter starts, it should be clearly marked in brackets ('[', ']'),
|
||||||
|
parentheses ('(', ')') or braces ('{', '}'). A sample entry for the dictionary is:
|
||||||
|
<hr>
|
||||||
|
<pre>@1
|
||||||
|
|
||||||
|
(ka)
|
||||||
|
|
||||||
|
ka ba/ gdung 'degs don byed nus pa/
|
||||||
|
|
||||||
|
rkyen/ grogs byed
|
||||||
|
|
||||||
|
@2
|
||||||
|
|
||||||
|
(kha)
|
||||||
|
|
||||||
|
khyod dngos po dang de byung 'brel/ khyod dngos po las byung
|
||||||
|
zhing/ dngos po ldog stops kyis khyod ldog pa/
|
||||||
|
|
||||||
|
khyod dngos po dang bdag gcig 'brel/ khyod ngos po dang bdag
|
||||||
|
nyid gcig pa'i sgo nas tha dad gang zhig/ dngos po ldog
|
||||||
|
stops kyis khyod ldog pa/
|
||||||
|
|
||||||
|
khyod dngos po dang 'brel pa/ khyod dngos po dang tha dad gang
|
||||||
|
|
||||||
|
@3
|
||||||
|
|
||||||
|
zhig/ ngos po ldog stobs kyis khyod ldog pa/
|
||||||
|
|
||||||
|
kha dog mdog du rung ba'am/ sngo ser dkar dmar sogs mdog tu
|
||||||
|
rung ba'i gzugs/</pre>
|
||||||
|
<hr>
|
||||||
|
<p>Here the
|
||||||
|
binary tree file would be generated with the command:</p>
|
||||||
|
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator -acip my-glossary</pre>
|
||||||
|
<p><i>Comments:</i> Notice in the sample text that at the beginning of page 3, "<i>zhig</i>" is not a
|
||||||
|
new definiendum, but is still part of the definition of "<i>khyod dngos po dang 'brel
|
||||||
|
pa</i>". Also the definiendum of the last entry is "<i>kha dog</i>"
|
||||||
|
(the <i>shad</i> was omitted after "<i>ga</i>" suffix) and not "<i>kha dog mdog du rung ba'am</i>".
|
||||||
|
Nevertheless the definiendum of the second term is not "<i>khyod dngos po dang bdag</i>"
|
||||||
|
since there is no omitted <i>shad</i> after that "<i>ga</i>" suffix; the
|
||||||
|
definiendum is "<i>khyod dngos po dang bdag gcig 'brel</i>". As is clear from the
|
||||||
|
sample text, the tool has to make a series of "smart guesses" to try to figure
|
||||||
|
out where each definiendum ends and its definition starts. This process is
|
||||||
|
not 100% foolproof, so expect some mistakes.<br>
|
||||||
|
</p>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<p>Dictionaries in different formats can be processed together. For instance the
|
||||||
|
command:
|
||||||
|
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator alldicts ry-dic99 -acip myglossary_uma -tab myglossary_rdzogs-chen</pre>
|
||||||
|
<p>would generate <i>alldicts.def</i> and <i>alldicts.wrd</i> processing <i>ry-dic99.txt</i>
|
||||||
|
as dash-separated, <i>myglossary_rdzogs-chen.txt</i> as tab-separated and <i>
|
||||||
|
myglossary_uma.txt</i> in the transliteration format explained above.<br>
|
||||||
|
</li>
|
||||||
|
</ul>
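For the single-line formats (dash, tab, or a user-specified string), splitting an entry into definiendum and definition could look roughly like the sketch below. The class `EntrySplitSketch` and its methods are hypothetical. Preferring a dash surrounded by spaces over the first bare dash is an assumption that mirrors the recommendation above about reverse vowels, not a description of what the tool actually does. The multi-line <i>-acip</i> format is not covered.

```java
// Hypothetical sketch of single-line entry splitting; not the tool's real code.
final class EntrySplitSketch {

    /** Maps the command-line option (without its leading dash) to a delimiter. */
    static String delimiterFor(String option) {
        if (option == null) return "-";          // option omitted: dash-separated
        if (option.equals("tab")) return "\t";   // -tab: horizontal tabulation
        return option;                           // any other string, e.g. "**"
    }

    /**
     * Returns {definiendum, definition}, or null when the line has no delimiter.
     * Assumption: for the dash format, a dash with a space on each side wins
     * over an earlier bare dash, so a definiendum such as "k-i" stays intact.
     */
    static String[] split(String line, String delimiter) {
        int at = -1;
        if (delimiter.equals("-")) {
            int spaced = line.indexOf(" - ");
            if (spaced >= 0) at = spaced + 1;    // index of the dash itself
        }
        if (at < 0) at = line.indexOf(delimiter);
        if (at < 0) return null;
        String definiendum = line.substring(0, at).trim();
        String definition = line.substring(at + delimiter.length()).trim();
        return new String[] { definiendum, definition };
    }

    public static void main(String[] args) {
        String[] entry = split("bde legs - 1) goodness, happiness, well-being. 2) well, fine.", delimiterFor(null));
        System.out.println(entry[0]);
        System.out.println(entry[1]);
    }
}
```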
|
||||||
|
|
||||||
@author Andrés Montano Pellegrini
|
@author Andrés Montano Pellegrini
|
||||||
@see SyllableListTree
|
@see SyllableListTree
|
||||||
|
|
|
@ -23,7 +23,12 @@ import java.util.*;
|
||||||
|
|
||||||
/** Inputs a Tibetan text and displays the words with their
|
/** Inputs a Tibetan text and displays the words with their
|
||||||
definitions through the console over a shell. Use when no
|
definitions through the console over a shell. Use when no
|
||||||
graphical interfase is supported or for batch processes.
|
graphical interface is supported or for batch processes. For instance:</p>
|
||||||
|
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.ConsoleScannerFilter ry-dic99</pre>
|
||||||
|
<p>It reads from the standard input and prints the results to the
|
||||||
|
standard output. For example, if you want to parse a text stored in <i>puja.txt</i>
|
||||||
|
and save the results in <i>puja_words.txt</i>, you can run the command:</p>
|
||||||
|
<pre>java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.ConsoleScannerFilter ry-dic99 < puja.txt > puja_words.txt</pre>
|
||||||
|
|
||||||
@author Andrés Montano Pellegrini
|
@author Andrés Montano Pellegrini
|
||||||
*/
|
*/
|
||||||
|
|
|
@ -33,7 +33,12 @@ import org.thdl.tib.input.DuffPane;
|
||||||
Tibetan script) and displays the words (Roman or Tibetan script)
|
Tibetan script) and displays the words (Roman or Tibetan script)
|
||||||
with their definitions. Works without Tibetan script in
|
with their definitions. Works without Tibetan script in
|
||||||
platforms that don't support Swing. Can access dictionaries stored
|
platforms that don't support Swing. Can access dictionaries stored
|
||||||
locally or remotely.
|
locally or remotely. For example, to access the public dictionary database run the command:</p>
|
||||||
|
<pre>java -jar DictionarySearchStandalone.jar http://iris.lib.virginia.edu/tibetan/servlet/org.thdl.tib.scanner.RemoteScannerFilter</pre>
|
||||||
|
<p>If the JRE you installed does not support <i> Swing</i> classes but supports
|
||||||
|
<i>
|
||||||
|
AWT</i> (such as the JRE for handhelds), run the command: </p>
|
||||||
|
<pre>java -jar DictionarySearchHandheld.jar -simple ry-dic99</pre>
|
||||||
|
|
||||||
@author Andrés Montano Pellegrini
|
@author Andrés Montano Pellegrini
|
||||||
*/
|
*/
|
||||||
|
|
|
@ -8,14 +8,134 @@
|
||||||
-->
|
-->
|
||||||
</head>
|
</head>
|
||||||
<body bgcolor="white">
|
<body bgcolor="white">
|
||||||
Provides classes and methods for translating Tibetan text to English.
|
Provides the classes to take Tibetan language passages and divide the passages up
|
||||||
<p>
|
into their component phrases and words, and display corresponding dictionary definitions.
|
||||||
Right now, this package scans Tibetan text, but we aim to make it parse Tibetan text.
|
<p>This tool helps Tibetan to English translators partially automate the
|
||||||
<p>
|
translation process. In the Tibetan language, the boundaries of individual words
|
||||||
Author: Andrés Montano Pellegrini
|
are not marked in any manner such as the way in which spaces separate and mark
|
||||||
|
words in English. Instead, there is
|
||||||
|
a punctuation mark called a "tsheg" which separates each syllable. Thus while syllabic boundaries are utterly explicit, word boundaries are
|
||||||
|
often unclear. One of the main
|
||||||
|
difficulties beginning students thus have with translating Tibetan texts is
|
||||||
|
figuring out where each word ends and the next word starts, and determining what
|
||||||
|
series of syllables to look up in the dictionary either as constituting a single
|
||||||
|
word or a larger compound phrase. This
|
||||||
|
entails a very time-consuming process of looking up multiple combinations of
|
||||||
|
syllables to determine which are found within a given dictionary.</p>
|
||||||
|
<p>It partially automates that process by
|
||||||
|
breaking up a sentence/paragraph entered in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
|
||||||
|
Extended Wylie</a> or Tibetan script
|
||||||
|
into the biggest component parts it can find in multiple dictionary databases.
|
||||||
|
Then for each component part found, it displays its stored definitions and
|
||||||
|
relevant information. This will
|
||||||
|
thus often yield only the definition of a long phrase, rather than its component
|
||||||
|
words, but one can also search for the syllables of that phrase one by one
|
||||||
|
separately.</p>
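The "biggest component parts" behavior described above amounts to a greedy longest-match search over syllables. The following is a minimal sketch of that idea, assuming the dictionary is just an in-memory set of space-joined entries; the package itself refers to tree structures (SyllableListTree), so the class `LongestMatchSketch`, the `HashSet` lookup, and the handling of unknown syllables are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of greedy longest-match segmentation over syllables.
final class LongestMatchSketch {

    /**
     * Starting at each position, tries the longest run of syllables first and
     * shortens it until the run is found in the dictionary. A syllable with no
     * entry at all is emitted on its own (an assumption made for this sketch).
     */
    static List<String> segment(String[] syllables, Set<String> dictionary) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < syllables.length) {
            int end = syllables.length;
            String match = null;
            while (end > start) {
                String candidate = String.join(" ", Arrays.copyOfRange(syllables, start, end));
                if (dictionary.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                parts.add(syllables[start]);   // unknown syllable: keep it and move on
                start++;
            } else {
                parts.add(match);
                start = end;                   // continue after the matched run
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("bkra shis", "chos", "nyid", "chos nyid"));
        System.out.println(segment("bkra shis chos nyid".split(" "), dictionary));
    }
}
```

With "chos nyid" in the dictionary, the two syllables are reported as one phrase; this is why the text notes that the individual syllables may need to be searched separately.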
|
||||||
|
<p>The tool can run on-line through a:</p>
|
||||||
|
<ul>
|
||||||
|
<li>
|
||||||
|
Java servlet (using Roman script for input and Tibetan script for output)
|
||||||
|
directly on a browser<p>The text is typed (or pasted) using <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
|
||||||
|
Extended Wylie</a> in a
|
||||||
|
text box within a form. All of the processing is done on the server, and the
|
||||||
|
results are returned in plain HTML. This allows the user to run this version on
|
||||||
|
even the most basic browser without needing any additional software installed.
|
||||||
|
Also, because the results are returned in HTML, features of HTML like
|
||||||
|
hyperlinks, tables, and text formatting allow it to be skimmed more easily. The
|
||||||
|
user can choose between seeing the Tibetan within the results in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
|
||||||
|
Extended Wylie</a> or in Tibetan script
|
||||||
|
(<a href="http://iris.lib.virginia.edu/tibet/tools/tmw.html" target="_blank">using
|
||||||
|
Tibetan Machine Web font now available for free</a>).<br>
|
||||||
|
</p>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
Java applet & application (using Tibetan script for both input and output)
|
||||||
|
communicating to a servlet<p>The text is typed in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
|
||||||
|
Extended Wylie</a>, but with the added
|
||||||
|
value that optionally the user can choose to see it directly in the Tibetan script
|
||||||
|
(<a href="http://iris.lib.virginia.edu/tibet/tools/tmw.html" target="_blank">using
|
||||||
|
Tibetan Machine Web font now available for free</a>) as they type. We
|
||||||
|
eventually plan to support other keyboard methods of entry as well. Here all the processing is also done on the server side, and the results
|
||||||
|
are displayed interactively within the program's window. Again the user can choose
|
||||||
|
to see the results in <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
|
||||||
|
Extended Wylie</a> or in the Tibetan script. </p>
|
||||||
|
<p>Even though the application runs as a stand-alone application on the
|
||||||
|
user's desktop, a connection to the Internet is still necessary to access the
|
||||||
|
dictionary databases. Easy launching of the application can be done over the
|
||||||
|
Internet using <a href="http://java.sun.com/products/javawebstart/">Java Web
|
||||||
|
Start</a>, which comes with <a href="#Sun's Java Runtime Environment">
|
||||||
|
Sun's Java Runtime Environment version 1.4</a> or higher. This is the
|
||||||
|
recommended way to run the tool.</p>
|
||||||
|
<p>The applet runs within a browser. The browser not only needs
|
||||||
|
to support Java, but since the classes that handle the Tibetan font use <i>Swing</i>, <a href="#Sun's Java Runtime Environment">
|
||||||
|
Sun's Java Runtime Environment version 1.4</a> or higher must additionally be installed.<br>
|
||||||
|
</p>
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
<p>The tool can also run off-line in:</p>
|
||||||
|
<ul>
|
||||||
|
<li><b>Desktop & laptop computers</b>
|
||||||
|
supporting Sun's <i>Java Runtime Environment</i> version 1.2 or higher;
|
||||||
|
although <a href="#Sun's Java Runtime Environment">version 1.4</a> or higher is
|
||||||
|
recommended. This is distributed as <i>DictionarySearchStandalone.jar</i>.</li>
|
||||||
|
<li><b>Handheld devices</b> supporting <a href="http://java.sun.com/products/personaljava/">PersonalJava
|
||||||
|
Application Environment</a> version 1.2a or higher. This is distributed as <i>
|
||||||
|
DictionarySearchHandheld.jar</i>.</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<p>The classes designed to be run from the command-line are:</p>
|
||||||
|
<ul>
|
||||||
|
<li>BinaryFileGenerator
|
||||||
|
(included only in DictionarySearchStandalone.jar)</li>
|
||||||
|
<li>AcipToWylie
|
||||||
|
(included only in DictionarySearchStandalone.jar)</li>
|
||||||
|
<li>WindowScannerFilter
|
||||||
|
(included in both DictionarySearchStandalone.jar and
|
||||||
|
DictionarySearchHandheld.jar)</li>
|
||||||
|
<li>ConsoleScannerFilter
|
||||||
|
(included in both DictionarySearchStandalone.jar and
|
||||||
|
DictionarySearchHandheld.jar)</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
|
||||||
|
<p><i>Notes on Input:</i></p>
|
||||||
|
<ul>
|
||||||
|
<li>For the "punctuation marks", the tool assumes that
|
||||||
|
<ul>
|
||||||
|
<li>' ' (tsheg), '_' (space), <enter>, <tab>:
|
||||||
|
function as syllable separators and may appear inside a component word or phrase.</li>
|
||||||
|
<li> '/' (shad), ';', '|', '!', ':', '[', ']', '^', '@', '#', '$', '%', '=', '<', '>',
|
||||||
|
'(', ')', '{', '}', <i>blank line</i> (two enters in a row): may not appear inside a
|
||||||
|
component word or phrase (and hence are interpreted as marking the end of a
|
||||||
|
component word or phrase). See <a href="http://iris.lib.virginia.edu/tibet/tools/ewts.pdf" target="_blank">
|
||||||
|
Extended Wylie</a> documentation for the corresponding symbols in the
|
||||||
|
Tibetan script.
|
||||||
|
</li>
|
||||||
|
<li>all other characters are part of the syllable<br>
|
||||||
|
</li>
|
||||||
|
</ul></li>
|
||||||
|
<li>To force the parser to "break up" a component word or phrase into
|
||||||
|
its individual syllables, use any character of the second set in between the syllables. For
|
||||||
|
example, if the entry is:
|
||||||
|
<p><i>chos nyid</i></p>
|
||||||
|
<p>or</p>
|
||||||
|
<p><i>chos<br>
|
||||||
|
nyid</i></p>
|
||||||
|
<p>the parser will recognize it as a single word "<i>chos nyid</i>".
|
||||||
|
But if the entry is:</p>
|
||||||
|
<p><i>chos / nyid</i></p>
|
||||||
|
<p>or</p>
|
||||||
|
<p><i>chos</i></p>
|
||||||
|
<p><i>nyid</i></p>
|
||||||
|
<p>the parser will assume "chos" and "nyid" are independent,
|
||||||
|
and will look them up separately.</p>
|
||||||
|
</li>
|
||||||
|
</ul>
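The two sets of punctuation marks in the notes above can be illustrated with a small tokenizer sketch: characters of the first set only end a syllable, while characters of the second set (or a blank line) also end the current run of syllables, so nothing is matched across them. The class `InputBoundarySketch` is hypothetical, and treating a carriage return as a plain separator is an added assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the input rules; not the scanner's actual tokenizer.
final class InputBoundarySketch {

    // First set: syllable separators that may occur inside a word or phrase.
    private static final String SEPARATORS = " _\t\n\r";
    // Second set: characters that always end a component word or phrase.
    private static final String BOUNDARIES = "/;|!:[]^@#$%=<>(){}";

    /** Splits text into runs of syllables; each run may be looked up as a phrase. */
    static List<List<String>> chunks(String text) {
        List<List<String>> result = new ArrayList<>();
        List<String> chunk = new ArrayList<>();
        StringBuilder syllable = new StringBuilder();
        int newlines = 0; // consecutive line breaks seen since the last syllable
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean separator = SEPARATORS.indexOf(c) >= 0;
            boolean boundary = BOUNDARIES.indexOf(c) >= 0;
            if (!separator && !boundary) {     // every other character is part of the syllable
                syllable.append(c);
                newlines = 0;
                continue;
            }
            if (syllable.length() > 0) {       // a separator or boundary ends the syllable
                chunk.add(syllable.toString());
                syllable.setLength(0);
            }
            if (c == '\n') newlines++;
            boolean blankLine = newlines >= 2; // two enters in a row
            if ((boundary || blankLine) && !chunk.isEmpty()) {
                result.add(chunk);
                chunk = new ArrayList<>();
            }
        }
        if (syllable.length() > 0) chunk.add(syllable.toString());
        if (!chunk.isEmpty()) result.add(chunk);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(chunks("chos nyid"));   // one run of two syllables
        System.out.println(chunks("chos / nyid")); // two independent runs
    }
}
```

Each inner list could then be handed to a dictionary lookup on its own, which is what keeps "chos" and "nyid" independent when a shad or a blank line sits between them.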
|
||||||
|
|
||||||
|
<p>Author: Andrés Montano Pellegrini</p>
|
||||||
<p>
|
<p>
|
||||||
<h2>Related Documentation</h2>
|
<h2>Related Documentation</h2>
|
||||||
@see <a href="../text/package-summary.html">org.thdl.tib.text</a>
|
@see <a href="../text/package-summary.html">org.thdl.tib.text</a>
|
||||||
@see <a href="../input/package-summary.html">org.thdl.tib.input</a>
|
@see <a href="../input/package-summary.html">org.thdl.tib.input</a>
|
||||||
</body>
|
</body>
|
||||||
</html>
|
</html>
|