Jskad/source/org/thdl/tib/text/ttt/ACIPString.java

/*
The contents of this file are subject to the THDL Open Community License
Version 1.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License on the THDL web site 
(http://www.thdl.org/).

Software distributed under the License is distributed on an "AS IS" basis, 
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the 
License for the specific terms governing rights and limitations under the 
License. 

The Initial Developer of this software is the Tibetan and Himalayan Digital
Library (THDL). Portions created by the THDL are Copyright 2003 THDL.
All Rights Reserved. 

Contributor(s): ______________________________________.
*/

package org.thdl.tib.text.ttt;

/**
* An ACIPString is some Latin text and a type, the type stating
* whether said text is Latin (usually English) or transliteration of
* Tibetan and which particular kind.  Scanning errors are also encoded
* as ACIPStrings using a special type.
*
* @author David Chandler
*/
public class ACIPString {
    private int type;
    private String text;

    /** Returns true if and only if an ACIPString of the given
     *  <i>type</i> is to be converted to Latin, not Tibetan, text. */
    public static boolean isLatin(int type) {
        return (type != TIBETAN_NON_PUNCTUATION
                && type != TIBETAN_PUNCTUATION
                && type != TSHEG_BAR_ADORNMENT
                && type != START_PAREN
                && type != END_PAREN
                && type != START_SLASH
                && type != END_SLASH);
    }

    /** For [#COMMENTS] */
    public static final int COMMENT = 0;
    /** For Folio markers like @012B */
    public static final int FOLIO_MARKER = 1;
    /** For Latin letters and numbers etc.  [*LINE BREAK?] uses this,
     *  for example. */
    public static final int LATIN = 2;
    /** For Tibetan letters and numbers etc. */
    public static final int TIBETAN_NON_PUNCTUATION = 3;
    /** For tshegs, whitespace and the like, but not combining
     *  punctuation like %, o, :, m, and x */
    public static final int TIBETAN_PUNCTUATION = 4;
    /** For the start of a [*probable correction] or [*possible correction?] */
    public static final int CORRECTION_START = 5;
    /** Denotes the end of a [*probable correction] */
    public static final int PROBABLE_CORRECTION = 6;
    /** Denotes the end of a [*possible correction?] */
    public static final int POSSIBLE_CORRECTION = 7;
    /** For [BP] -- blank page */
    public static final int BP = 8;
    /** For [LS] -- Lanycha script on page */
    public static final int LS = 9;
    /** For [DR] -- picture (without caption) on page */
    public static final int DR = 10;
    /** For [DD], [DDD], [DD1], [DD2], etc. -- picture with caption on page */
    public static final int DD = 11;
    /** For [?] */
    public static final int QUESTION = 12;
    /** For the first / in /NYA/ */
    public static final int START_SLASH = 13;
    /** For the last / in /NYA/ */
    public static final int END_SLASH = 14;
    /** For the opening ( in (NYA) */
    public static final int START_PAREN = 15;
    /** For the closing ) in (NYA) */
    public static final int END_PAREN = 16;
    /** For things that may not be legal syntax, such as {KA . KHA} */
    public static final int WARNING = 17;
    /** For ACIP %, o, and x */
    public static final int TSHEG_BAR_ADORNMENT = 18;
    /** For things that are not legal syntax, such as a file that
     * contains just "[# HALF A COMMEN" */
    public static final int ERROR = 19;

    /** Returns the type of this string, one of the constants defined
     *  above, such as {@link #LATIN} or {@link #TIBETAN_NON_PUNCTUATION}.
     *  Use {@link #isLatin(int)} to test whether that type is rendered
     *  as Latin rather than Tibetan. */
    public int getType() {
        return type;
    }

    /** Returns the non-null, non-empty String of text associated with
     *  this string. */
    public String getText() {
        return text;
    }

    private void setType(int t) {
        if (t < COMMENT || t > ERROR)
            throw new IllegalArgumentException("Bad type");
        type = t;
    }

    private void setText(String t) {
        if (t == null || "".equals(t))
            throw new IllegalArgumentException("null or empty text; a DD, for example, should have the text [DD]");
        text = t;
    }

    /** Don't instantiate me. */
    private ACIPString() { }

    /** Creates a new ACIPString with source text <i>text</i> and type
     *  <i>type</i>, a characterization like {@link #DD}. */
    public ACIPString(String text, int type) {
        setType(type);
        setText(text);
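    }

    /** Returns the name of this string's type followed by its text
     *  in braces, e.g. "LATIN:{text}". */
    public String toString() {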
        String typeString = "HUH?????";
        if (type == COMMENT) typeString = "COMMENT";
        if (type == FOLIO_MARKER) typeString = "FOLIO_MARKER";
        if (type == LATIN) typeString = "LATIN";
        if (type == TIBETAN_NON_PUNCTUATION) typeString = "TIBETAN_NON_PUNCTUATION";
        if (type == TIBETAN_PUNCTUATION) typeString = "TIBETAN_PUNCTUATION";
        if (type == CORRECTION_START) typeString = "CORRECTION_START";
        if (type == PROBABLE_CORRECTION) typeString = "PROBABLE_CORRECTION";
        if (type == POSSIBLE_CORRECTION) typeString = "POSSIBLE_CORRECTION";
        if (type == BP) typeString = "BP";
        if (type == LS) typeString = "LS";
        if (type == DR) typeString = "DR";
        if (type == DD) typeString = "DD";
        if (type == QUESTION) typeString = "QUESTION";
        if (type == START_SLASH) typeString = "START_SLASH";
        if (type == END_SLASH) typeString = "END_SLASH";
        if (type == START_PAREN) typeString = "START_PAREN";
        if (type == END_PAREN) typeString = "END_PAREN";
        if (type == WARNING) typeString = "WARNING";
        if (type == TSHEG_BAR_ADORNMENT) typeString = "TSHEG_BAR_ADORNMENT";
        if (type == ERROR) typeString = "ERROR";
        return typeString + ":{" + getText() + "}";
    }
}
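
The (text, type) pairing that ACIPString defines can be tried out with a small, self-contained sketch. Everything in it is an assumption made for illustration: `AcipTokenDemo`, its nested `Token` class, `sampleTokens()`, and `latinOnly()` are hypothetical names, `Token` is a reduced stand-in for ACIPString so that the example compiles on its own, and the token list is written by hand rather than produced by the real scanner. Only the numeric type codes and the `isLatin` rule are copied from the class above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo; Token is a reduced stand-in for ACIPString. */
final class AcipTokenDemo {
    // Type codes copied from ACIPString (only the ones used here).
    static final int COMMENT = 0;
    static final int FOLIO_MARKER = 1;
    static final int TIBETAN_NON_PUNCTUATION = 3;
    static final int TIBETAN_PUNCTUATION = 4;
    static final int START_SLASH = 13;
    static final int END_SLASH = 14;
    static final int START_PAREN = 15;
    static final int END_PAREN = 16;
    static final int TSHEG_BAR_ADORNMENT = 18;

    /** Same rule as ACIPString.isLatin(int): any type outside these seven is Latin. */
    static boolean isLatin(int type) {
        return type != TIBETAN_NON_PUNCTUATION
            && type != TIBETAN_PUNCTUATION
            && type != TSHEG_BAR_ADORNMENT
            && type != START_PAREN
            && type != END_PAREN
            && type != START_SLASH
            && type != END_SLASH;
    }

    /** One (text, type) pair; rejects null or empty text as ACIPString does. */
    static final class Token {
        final String text;
        final int type;

        Token(String text, int type) {
            if (text == null || text.isEmpty())
                throw new IllegalArgumentException("null or empty text");
            this.text = text;
            this.type = type;
        }
    }

    /** Hand-written tokens for the pieces of "@012B [#COMMENT] /NYA/"; no scanner is involved. */
    static List<Token> sampleTokens() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("@012B", FOLIO_MARKER));
        tokens.add(new Token("[#COMMENT]", COMMENT));
        tokens.add(new Token("/", START_SLASH));
        tokens.add(new Token("NYA", TIBETAN_NON_PUNCTUATION));
        tokens.add(new Token("/", END_SLASH));
        return tokens;
    }

    /** Joins, with single spaces, the text of every token whose type is Latin. */
    static String latinOnly(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            if (!isLatin(t.type)) continue;
            if (sb.length() > 0) sb.append(' ');
            sb.append(t.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (Token t : sampleTokens()) {
            String target = isLatin(t.type) ? "Latin" : "Tibetan";
            System.out.println(t.text + " -> " + target);
        }
        System.out.println("Latin only: " + latinOnly(sampleTokens()));
    }
}
```

In this sketch the folio marker and the comment are classified as Latin, while the slashes and the syllable between them are classified as Tibetan, which is the split that `isLatin` encodes.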
Change history for this file, oldest first, taken from the blame view (which otherwise repeats the source above line by line):

2003-08-14 05:10:47 +00:00
I now have a function that takes as input a String of ACIP and breaks up that String into tsheg bars, punctuation, etc., while finding errors. I've tested it some, but I'm not yet committing the tests. Next step: a converter that takes an ACIP file as input and outputs TMW+Latin.
Attributed lines: the bulk of the file (license header, class Javadoc, fields, most type constants, accessors, constructors, most of toString()).

2003-08-17 01:45:55 +00:00
Improved the ACIP scanner (the part of the converter that says, "This is a correction, that's a comment, this is Tibetan, that's Latin (English), that's Tibetan inter-tsheg-bar punctuation, etc.) It now accepts more real-world ACIP files, i.e. it handles illegal constructs. The error checking is more user-friendly. There are now tests. Added some tsheg bars that Peter E. Hauer of Linguasoft sent me to the tests. Many thanks, Peter. I still need to implement rules that say, "This is not Tibetan, it must be Sanskrit, because that letter doesn't take a MA prefix."
Attributed lines: the LATIN constant and its Javadoc, the TIBETAN_NON_PUNCTUATION and TIBETAN_PUNCTUATION declarations, the LATIN branch and the return statement of toString().

2003-08-18 02:38:54 +00:00
ACIP->Unicode, without going through TMW, is now possible, so long as \, the Sanskrit virama, is not used. Of the 1370-odd ACIP texts I've got here, about 57% make it through the gauntlet (fewer if you demand a vowel or disambiguator on every stack of a non-Tibetan tsheg bar).
Attributed lines: the isLatin Javadoc and signature, and its TIBETAN_NON_PUNCTUATION, TIBETAN_PUNCTUATION, START_SLASH, and END_SLASH checks.

2003-08-24 06:40:53 +00:00
Jskad's converter now has ACIP-to-Unicode built in. There are known bugs; it is pre-alpha. It's usable, though, and finds tons of errors in ACIP input files, with the user deciding just how pedantic to be. The biggest outstanding bug is the silent one: treating { }, space, as tsheg instead of whitespace when we ought to know better.
Attributed lines: the WARNING constant, its Javadoc, and its toString() branch.

2003-09-07 16:19:50 +00:00
The ACIP {NYA%} is supported. {NYAo} and {NYAx} are confusing to me, because I don't know which glyphs o and x correspond to. For that reason, they cause ERRORs. The proposed THDL Extended Wylie ~X and X is now used for U+0F35 and U+0F37 respectively.
Attributed lines: the TSHEG_BAR_ADORNMENT constant, its Javadoc, and its toString() branch; the ERROR constant declaration.

2003-09-07 18:30:59 +00:00
ACIP font shrinking as in {KA (GA)} is now supported.
Attributed lines: the TSHEG_BAR_ADORNMENT, START_PAREN, and END_PAREN checks in isLatin.