/*
The contents of this file are subject to the AMP Open Community License
Version 1.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License on the AMP web site
(http://www.tibet.iteso.mx/Guatemala/).

Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
License for the specific terms governing rights and limitations under the
License.

The Initial Developer of this software is Andres Montano Pellegrini. Portions
created by Andres Montano Pellegrini are Copyright 2001 Andres Montano
Pellegrini. All Rights Reserved.

Contributor(s): ______________________________________.
*/

package org.thdl.tib.scanner;

import java.io.*;

/** Converts Tibetan dictionaries stored in text files into a binary file tree structure format, to be used by some implementations of the SyllableListTree.

Syntax (Dictionary files are assumed to be .txt. Don't include extensions!):

For one dictionary, to read the definitions stored in dict-name.txt and organize them into dict-name.wrd and dict-name.def:
```
java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator [-delimiter] dict-name
```
For multiple dictionaries, to read the definitions stored in dict-name1.txt, dict-name2.txt, etc. and organize them into dest-file-name.wrd and dest-file-name.def:
```
java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator dest-file-name [-delimiter1] dict-name1 [[-delimiter2] dict-name2 ...]
```

-delimiter

If this option is omitted, it is assumed that each line is an entry (no multiple-line entries) and that the definiendum and definition are separated by '-' (a dash). Although it is not required, it is highly recommended to include a space before and after the dash (to eliminate any possible ambiguity with regard to the transliteration of reverse vowels in Extended Wylie). A sample entry for the dictionary is:
```
bkra shis - 1) auspiciousness, good luck, good fortune, goodness, prosperity, happiness. 2) auspicious, favorable, fortunate, successful, felicitous, lucky. 3) verse of auspiciousness; benediction, blessing. 4) a personal name.
bde legs - 1) goodness, happiness, well-being, wellfare, auspiciousness, good fortune. 2) well, fine.
```
If this were the content of a file called "my-glossary.txt" the binary tree file would be generated with the command:
```
java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator my-glossary
```
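
The recommendation above (a space on each side of the dash) can be illustrated with a small splitter. This is only a sketch under stated assumptions, not the parsing code of this class: the class name DashEntrySketch and the method split are hypothetical, and the rule "prefer a spaced dash, fall back to the first bare dash" is an assumption made for illustration.

```java
class DashEntrySketch {
    // Splits one dictionary line into {definiendum, definition}.
    // Assumption: a dash surrounded by spaces is preferred as the separator;
    // only when none exists is the first bare dash used.
    // Returns null when the line contains no dash at all.
    static String[] split(String line) {
        int at = line.indexOf(" - ");
        int width = 3;
        if (at < 0) {
            at = line.indexOf('-');
            width = 1;
        }
        if (at < 0) {
            return null;
        }
        String definiendum = line.substring(0, at).trim();
        String definition = line.substring(at + width).trim();
        return new String[] { definiendum, definition };
    }

    public static void main(String[] args) {
        String[] entry = split("bde legs - 1) goodness, happiness, well-being. 2) well, fine.");
        System.out.println("definiendum: " + entry[0]);
        System.out.println("definition:  " + entry[1]);
    }
}
```

With the spaced form, a dash inside the definition (as in "well-being") or inside the transliteration does not compete with the separator.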

-tab: it is assumed that each line is an entry (no multiple-line entries) and the definition and definiendum are separated by '\t' (horizontal tabulation). One tabulation is enough; don't feel the need to "align" the definitions in your word-processor. A sample entry for the dictionary is:

```
bkra shis	1) auspiciousness, good luck, good fortune, goodness, prosperity, happiness. 2) auspicious, favorable, fortunate, successful, felicitous, lucky. 3) verse of auspiciousness; benediction, blessing. 4) a personal name.
bde legs	1) goodness, happiness, well-being, wellfare, auspiciousness, good fortune. 2) well, fine.
```

Here, the binary tree file would be generated with the command:

```
java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator -tab my-glossary
```

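-string: it is assumed that each line is an entry (no multiple-line entries) and that the definiendum and definition are separated by the character or string of characters specified by the user. Here "string" stands for the delimiter itself, written immediately after the dash. A sample entry for the dictionary, using "**" as the delimiter, is: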

```
bkra shis ** 1) auspiciousness, good luck, good fortune, goodness, prosperity, happiness. 2) auspicious, favorable, fortunate, successful, felicitous, lucky. 3) verse of auspiciousness; benediction, blessing. 4) a personal name.
bde legs ** 1) goodness, happiness, well-being, wellfare, auspiciousness, good fortune. 2) well, fine.
```

Here, the binary tree file would be generated with the command:

```
java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator -** my-glossary
```
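
The -tab option and the user-specified delimiter both come down to splitting each line at a delimiter string. The sketch below shows one way this could look. The mapping in delimiterFor follows the option handling in main ("-tab" becomes a tab character, any other option loses its leading dash), while the class name DelimiterSplitSketch, both method names and the choice to split at the first occurrence are assumptions for illustration. The -acip option is handled separately and is not covered here.

```java
class DelimiterSplitSketch {
    // Maps an option such as "-tab" or "-**" to the delimiter text it selects.
    static String delimiterFor(String option) {
        if (option.equals("-tab")) {
            return "\t";
        }
        return option.substring(1); // everything after the leading dash
    }

    // Splits a line at the first occurrence of the delimiter.
    // Returns {definiendum, definition}, or null if the delimiter is absent.
    static String[] split(String line, String delimiter) {
        int at = line.indexOf(delimiter);
        if (at < 0) {
            return null;
        }
        return new String[] {
            line.substring(0, at).trim(),
            line.substring(at + delimiter.length()).trim()
        };
    }

    public static void main(String[] args) {
        String[] byStars = split("bde legs ** 1) goodness. 2) well, fine.", delimiterFor("-**"));
        String[] byTab = split("bde legs\t1) goodness. 2) well, fine.", delimiterFor("-tab"));
        System.out.println(byStars[0] + " | " + byStars[1]);
        System.out.println(byTab[0] + " | " + byTab[1]);
    }
}
```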

-acip: it is assumed that the electronic file is a transliteration of a Tibetan dictionary. It is called "acip" because it accepts Acip's comment codes ('@' to mark page numbers, brackets to mark comments, etc). Nevertheless, it still requires the files to be in Extended Wylie, so if your file is in Acip's transliteration scheme make sure to run org.thdl.tib.scanner.AcipToWylie first. Definitions here can span multiple lines, but with no blank lines in between. It is assumed that the definiendum starts after a blank line (except at the beginning of a new page, where the text could start with the last part of the previous definition) and runs up to the shad (except when the shad is omitted because of grammar rules, for instance no shad after a "ga" suffix without a secondary suffix). Each time a new letter starts, it should be clearly marked in brackets ('[', ']'), parentheses ('(', ')') or braces ('{','}'). A sample entry for the dictionary is:
```
@1

(ka)

ka ba/ gdung 'degs don byed nus pa/

rkyen/ grogs byed

@2

(kha)

khyod dngos po dang de byung 'brel/  khyod dngos po las byung
zhing/ dngos po ldog stops kyis khyod ldog pa/

khyod dngos po dang bdag gcig 'brel/ khyod ngos po dang bdag
nyid gcig pa'i sgo nas tha dad gang zhig/ dngos po ldog
stops kyis khyod ldog pa/

khyod dngos po dang 'brel pa/ khyod dngos po dang tha dad gang

@3

zhig/ ngos po ldog stobs kyis khyod ldog pa/

kha dog  mdog du rung ba'am/ sngo ser dkar dmar sogs mdog tu
rung ba'i gzugs/
```
Here the binary tree file would be generated with the command:
```
java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator -acip my-glossary
```
Comments: Notice in the sample text that at the beginning of page 3, "zhig" is not a new definiendum but is still part of the definition of "khyod dngos po dang 'brel pa". Also, the definiendum of the last entry is "kha dog" (the shad was omitted after the "ga" suffix) and not "kha dog mdog du rung ba'am". Nevertheless, the definiendum of the second term is not "khyod dngos po dang bdag", since there is no omitted shad after that "ga" suffix; the definiendum is "khyod dngos po dang bdag gcig 'brel". As is clear from the sample text, the tool has to make a series of "smart guesses" to figure out where each definiendum ends and its definition starts. This process is not foolproof, so expect some mistakes.
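
The line-level cues described above (page markers, letter markers, the first shad) can be sketched as follows. This is a simplified illustration under stated assumptions and not the heuristics of this class: the class name AcipLineSketch and its three methods are hypothetical, a shad is assumed to be written as '/', and the "smart guesses" for an omitted shad or for a definition continuing across a page break are deliberately left out. Since brackets may also enclose comments, isLetterMarker is only a rough test.

```java
class AcipLineSketch {
    // Returns the page number of a marker line such as "@12", or -1 otherwise.
    static int pageNumber(String line) {
        String s = line.trim();
        if (s.isEmpty() || s.charAt(0) != '@') {
            return -1;
        }
        int end = 1;
        while (end < s.length() && Character.isDigit(s.charAt(end))) {
            end++;
        }
        return end > 1 ? Integer.parseInt(s.substring(1, end)) : -1;
    }

    // True for a line wrapped like "(ka)", "[ka]" or "{ka}".
    static boolean isLetterMarker(String line) {
        String s = line.trim();
        if (s.length() < 3) {
            return false;
        }
        char open = s.charAt(0);
        char close = s.charAt(s.length() - 1);
        return (open == '(' && close == ')')
            || (open == '[' && close == ']')
            || (open == '{' && close == '}');
    }

    // Text before the first shad, or null when the line has no shad.
    static String beforeFirstShad(String line) {
        int at = line.indexOf('/');
        return at < 0 ? null : line.substring(0, at).trim();
    }

    public static void main(String[] args) {
        System.out.println(pageNumber("@2"));
        System.out.println(isLetterMarker("(kha)"));
        System.out.println(beforeFirstShad("ka ba/ gdung 'degs don byed nus pa/"));
        // The naive rule overshoots here: the intended definiendum is "kha dog".
        System.out.println(beforeFirstShad("kha dog  mdog du rung ba'am/ sngo ser dkar dmar sogs mdog tu"));
    }
}
```

The last call shows why cutting at the first shad is not enough on its own: for the "kha dog" entry it returns the whole first clause, which is exactly the case the omitted-shad guess has to catch.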
Dictionaries in different formats can be processed together. For instance the command:
```
java -cp DictionarySearchStandalone.jar org.thdl.tib.scanner.BinaryFileGenerator alldicts ry-dic99 -acip myglossary_uma -tab myglossary_rdzogs-chen
```
would generate alldicts.def and alldicts.wrd processing ry-dic99.txt as dash-separated, myglossary_rdzogs-chen.txt as tab-separated and myglossary_uma.txt in the transliteration format explained above.
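
How each dictionary name picks up its format can be traced with a stand-alone sketch that mirrors the argument loop in main for the multiple-dictionary case: an option applies only to the file name that follows it, and a name without an option is read as dash-separated. The class name ArgumentPlanSketch, the method plan and the textual labels are hypothetical and exist only for this illustration. Nothing is read or written.

```java
import java.util.ArrayList;
import java.util.List;

class ArgumentPlanSketch {
    // Lists, for each source dictionary, the file that would be read and the
    // format it would be read in. args[0] is the destination file name.
    static List<String> plan(String[] args) {
        List<String> steps = new ArrayList<>();
        int i = 1;
        while (i < args.length) {
            String format = "dash"; // default when no option precedes the name
            if (args[i].startsWith("-")) {
                if (args[i].equals("-tab")) {
                    format = "tab";
                } else if (args[i].equals("-acip")) {
                    format = "acip";
                } else {
                    format = "string:" + args[i].substring(1);
                }
                i++;
            }
            if (i >= args.length) {
                break; // an option with no file name after it
            }
            steps.add(args[i] + ".txt as " + format);
            i++;
        }
        return steps;
    }

    public static void main(String[] args) {
        String[] sample = {
            "alldicts", "ry-dic99", "-acip", "myglossary_uma", "-tab", "myglossary_rdzogs-chen"
        };
        for (String step : plan(sample)) {
            System.out.println(step);
        }
    }
}
```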

@author Andrés Montano Pellegrini
@see SyllableListTree
@see FileSyllableListTree
@see CachedSyllableListTree
*/
public class BinaryFileGenerator extends LinkedList
{
    private long posHijos;
    private String sil, def[];
    private static String delimiter;
    private static int delimiterType;
    private final static int delimiterGeneric=0;
    private final static int delimiterAcip=1;
    private final static int delimiterDash=2;

    /** Number of dictionary. If 0, partial word (no definition). */
    private DictionarySource sourceDef;

    public static RandomAccessFile wordRaf;
    private static RandomAccessFile defRaf;

    static
    {
        wordRaf = null;
        defRaf = null;
        delimiter = null;
        delimiterType=delimiterDash;
    }

    public BinaryFileGenerator()
    {
        super();
        sil = null;
        def = null;
        posHijos=-1;
        sourceDef = null;
    }

    public BinaryFileGenerator(String sil, String def, int numDef)
    {
        super();
        int marker = sil.indexOf(" ");
        this.sourceDef = new DictionarySource();

        if (marker<0)
        {
            // Last syllable of the word: this node carries the definition.
            this.sil = sil;
            this.def = new String[1];
            this.def[0] = def;
            this.sourceDef.add(numDef);
        }
        else
        {
            // More syllables follow: keep the first one here and hang the rest below it.
            this.sil = sil.substring(0, marker);
            this.def = null;
            addLast(new BinaryFileGenerator(sil.substring(marker+1).trim(), def, numDef));
        }
        posHijos=-1;
    }

    public String toString()
    {
        return sil;
    }

    private static String deleteQuotes(String s)
    {
        int length = s.length();
        if (length>2)
        {
            // The end index of substring is exclusive, so length-1 drops only the closing quote.
            if ((s.charAt(0)=='\"') && (s.charAt(length-1)=='\"'))
                return s.substring(1,length-1);
        }
        return s;
    }

    public void addFile(String archivo, int defNum) throws Exception
    {
        final short newDefiniendum=1, halfDefiniendum=2, definition=3;
        short status=newDefiniendum;
        int marker, len, marker2;
        // int n=0;
        int currentPage=0, currentLine=1;
        char ch;
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(archivo)));
        String entrada="", s1="", s2="", currentLetter="", temp="", lastWeirdDefiniendum="";
        boolean markerNotFound; // used for acip dict

        switch(delimiterType)
        {
            case delimiterAcip:
            outAHere:
            while (true)
            {
                entrada=br.readLine();
                if (entrada==null)
                    break;

                currentLine++;
                entrada = entrada.trim();
                len = entrada.length();
                if (len<=0) continue;

                // get page number
                if (entrada.charAt(0)=='@')
                {
                    marker = 1;
                    while(marker0) currentPage=Integer.parseInt(temp); if (marker=0 && marker2)
            {
                printSintax();
                return;
            }
            sl.addFile(args[1] + ".txt",0);
            a=1;
        }
        else
        {
            a=0;
            if (args.length==1)
            {
                sl.addFile(args[0] + ".txt",0);
            }
            else
            {
                i=1;
                while(i< args.length)
                {
                    if (args[i].charAt(0)=='-')
                    {
                        if (args[i].equals("-tab"))
                        {
                            delimiterType=delimiterGeneric;
                            delimiter="\t";
                        }
                        else if (args[i].equals("-acip"))
                            delimiterType=delimiterAcip;
                        else
                        {
                            delimiterType=delimiterGeneric;
                            delimiter=args[i].substring(1);
                        }
                        i++;
                    }
                    else
                    {
                        delimiterType=delimiterDash;
                    }
                    sl.addFile(args[i] + ".txt", n);
                    n++;
                    i++;
                }
            }
        }

        File wordF = new File(args[a] + ".wrd"), defF = new File(args[a] + ".def");
        wordF.delete();
        defF.delete();
        wordRaf = new RandomAccessFile(wordF,"rw");
        defRaf = new RandomAccessFile(defF,"rw");
        sl.print();
        wordRaf.writeInt((int)sl.posHijos);
    }
}