2003-08-14 05:10:47 +00:00
/*
The contents of this file are subject to the THDL Open Community License
Version 1.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License on the THDL web site
(http://www.thdl.org/).

Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
License for the specific terms governing rights and limitations under the
License.

The Initial Developer of this software is the Tibetan and Himalayan Digital
Library (THDL). Portions created by the THDL are Copyright 2003 THDL.
All Rights Reserved.

Contributor(s): ______________________________________.
*/

package org.thdl.tib.text.ttt;

import java.io.*;
import java.util.ArrayList;
import java.util.Stack;

import org.thdl.util.ThdlDebug;

/**
 * This class is able to break up Strings of ACIP text (for example, an
 * entire sutra file) into tsheg bars, comments, etc. Folio markers,
 * comments, and the like are segregated (so that consumers can ensure
 * that they remain in Latin), and Tibetan passages are broken up into
 * tsheg bars.
 * @author David Chandler
 */
public class ACIPTshegBarScanner {
Improved the ACIP scanner (the part of the converter that says, "This
is a correction, that's a comment, this is Tibetan, that's Latin
(English), that's Tibetan inter-tsheg-bar punctuation," etc.). It now
accepts more real-world ACIP files, i.e. it handles illegal
constructs. The error checking is more user-friendly. There are now
tests.
Added some tsheg bars that Peter E. Hauer of Linguasoft sent me to the
tests. Many thanks, Peter. I still need to implement rules that say,
"This is not Tibetan, it must be Sanskrit, because that letter doesn't
take a MA prefix."
2003-08-17 01:45:55 +00:00
    /** Useful for testing. Gives error messages on standard output
     *  about why we can't scan the document perfectly and exits with
     *  non-zero return code, or says "Good scan!" otherwise and exits
     *  with code zero. <p>FIXME: not so efficient; copies the whole
     *  file into memory first. */
2003-08-16 16:13:53 +00:00
    public static void main(String[] args) throws IOException {
2003-08-24 06:40:53 +00:00
        if (args.length != 1) {
            System.out.println("Bad args! Need just the name of the ACIP text file.");
2003-08-16 16:13:53 +00:00
            System.exit(1);
        }
        StringBuffer errors = new StringBuffer();
2003-08-17 02:12:49 +00:00
        int maxErrors = 250;
2003-08-24 06:40:53 +00:00
        ArrayList al = scanFile(args[0], errors, maxErrors - 1);
2003-08-17 02:12:49 +00:00
        if (null == al) {
            System.out.println(maxErrors + " or more errors occurred while scanning ACIP input file; is this");
            System.out.println("Tibetan or English input?");
            System.out.println("");
            System.out.println("First " + maxErrors + " errors scanning ACIP input file: ");
            System.out.println(errors);
            System.out.println("Exiting with " + maxErrors + " or more errors; please fix input file and try again.");
            System.exit(1);
        }
2003-08-16 16:13:53 +00:00
        if (errors.length() > 0) {
            System.out.println("Errors scanning ACIP input file: ");
            System.out.println(errors);
            System.out.println("Exiting; please fix input file and try again.");
            System.exit(1);
        }
        System.out.println("Good scan!");
        System.exit(0);
    }
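The contract that main relies on (a null result once the error cap is reached, a non-empty error buffer for any lesser number of errors, exit code 0 only for a clean scan) can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical and is not part of ACIPTshegBarScanner: the class ErrorCapSketch, the stand-in scanner scanSketch (which only flags unmatched ']' characters), and the helper exitCodeFor are assumptions made for illustration. Only the cap test itself, maxErrors >= 0 && ++numErrors >= maxErrors, is copied from the scanner.

```java
import java.util.ArrayList;
import java.util.List;

class ErrorCapSketch {
    /** Hypothetical stand-in for a scanner: records an error token for
     *  every ']' that has no matching '[', and gives up (returns null)
     *  once the error cap is reached. A negative maxErrors disables the cap. */
    static List<String> scanSketch(String s, StringBuilder errors, int maxErrors) {
        List<String> tokens = new ArrayList<>();
        int depth = 0;
        int numErrors = 0;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '[') {
                depth++;
            } else if (ch == ']') {
                if (depth > 0) {
                    depth--;
                } else {
                    tokens.add("ERROR@" + i);
                    if (errors != null)
                        errors.append("Offset ").append(i).append(": unmatched ]\n");
                    // Same cap test as in the scanner's error paths.
                    if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
                }
            }
        }
        return tokens;
    }

    /** Mirrors how main maps a scan result to a process exit code. */
    static int exitCodeFor(String input, int maxErrors) {
        StringBuilder errors = new StringBuilder();
        List<String> result = scanSketch(input, errors, maxErrors);
        if (result == null) return 1;        // cap reached
        return errors.length() > 0 ? 1 : 0;  // some errors, or a good scan
    }

    public static void main(String[] args) {
        System.out.println(exitCodeFor("[#A COMMENT] KA", 250));
        System.out.println(exitCodeFor("KA ] KHA", 250));
    }
}
```

Returning null keeps the caller from processing a scan that is probably of the wrong kind of input, at the cost of discarding the tokens collected so far; the error buffer still holds the messages gathered before the cap was hit.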
2003-08-17 01:45:55 +00:00
    /** Scans an ACIP file with path fname into tsheg bars. If errors
2003-08-24 06:40:53 +00:00
     *  is non-null, error messages will be appended to it. Returns a
     *  list of ACIPStrings that is the scan. <p>FIXME: not so
     *  efficient; copies the whole file into memory first.
2003-08-17 01:45:55 +00:00
     *  @throws IOException if we cannot read in the ACIP input file */
2003-08-24 06:40:53 +00:00
    public static ArrayList scanFile(String fname, StringBuffer errors, int maxErrors)
        throws IOException
    {
        return scanStream(new FileInputStream(fname),
                          errors, maxErrors);
    }
    /** Scans a stream of ACIP into tsheg bars. If errors is
     *  non-null, error messages will be appended to it. You can
     *  recover both errors and warnings (modulo offset information)
     *  from the result, though. Returns a list of ACIPStrings that
     *  is the scan, or null if more than maxErrors occur. <p>FIXME:
     *  not so efficient; copies the whole file into memory first.
     *  @throws IOException if we cannot read the whole ACIP stream */
    public static ArrayList scanStream(InputStream stream, StringBuffer errors,
                                       int maxErrors)
2003-08-17 02:12:49 +00:00
        throws IOException
    {
2003-08-16 16:13:53 +00:00
        StringBuffer s = new StringBuffer();
        char ch[] = new char[8192];
        BufferedReader in
2003-08-24 06:40:53 +00:00
            = new BufferedReader(new InputStreamReader(stream, "US-ASCII"));
2003-08-16 16:13:53 +00:00
        int amt;
        while (-1 != (amt = in.read(ch))) {
            s.append(ch, 0, amt);
        }
2003-08-18 02:38:54 +00:00
        in.close();
2003-08-24 06:40:53 +00:00
        return scan(s.toString(), errors, maxErrors);
2003-08-16 16:13:53 +00:00
    }
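The read-everything-then-scan approach that the FIXME mentions can be sketched in isolation as follows. This is a hedged illustration and not the project's code: the class ReadAllSketch and the method readAll are hypothetical names, and the sketch assumes Java 7 or later so that try-with-resources can close the reader even when a read fails (the 2003 code closes it only on the success path).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class ReadAllSketch {
    /** Reads the whole stream into memory as US-ASCII text, 8192 chars
     *  at a time, and closes the stream when done or on failure. */
    static String readAll(InputStream stream) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(stream, StandardCharsets.US_ASCII))) {
            int amt;
            while ((amt = in.read(buf)) != -1) {
                sb.append(buf, 0, amt);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] demo = "[#A COMMENT] KA KHA".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readAll(new ByteArrayInputStream(demo)));
    }
}
```

Holding the full text in one String keeps offset arithmetic simple for a scanner that looks ahead and behind by index, which is presumably why the copy was tolerated; a streaming design would need its own lookahead buffer.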
2003-08-14 05:10:47 +00:00
    /** Returns a list of {@link ACIPString ACIPStrings} corresponding
     *  to s, possibly the empty list (when the empty string is the
     *  input). Each String is either a Latin comment, some Latin
     *  text, a tsheg bar (minus the tsheg or shad or whatever), a
     *  String of inter-tsheg-bar punctuation, etc.
     *
2003-08-24 06:40:53 +00:00
     *  <p>This not only scans; it finds all the errors and warnings a
     *  parser would too, like "NYA x" and "(" and ")" and "/NYA" etc.
     *  It puts those in as ACIPStrings with type {@link
     *  ACIPString#ERROR} or {@link ACIPString#WARNING}, and also, if
     *  errors is non-null, appends helpful messages to errors, each
     *  followed by a '\n'.
2003-08-17 02:12:49 +00:00
     *  @param s the ACIP text
     *  @param errors if non-null, the buffer to which to append error
2003-08-24 06:40:53 +00:00
     *  messages (DLC FIXME: kludge, just get this info by scanning
     *  the result for ACIPString.ERROR (and maybe ACIPString.WARNING,
     *  if you care about warnings), but then we'd have to put the
     *  Offset info in the ACIPString)
2003-08-17 02:12:49 +00:00
     *  @param maxErrors if nonnegative, then scanning will stop when
     *  more than maxErrors errors occur. In this event, null is
     *  returned.
     *  @return null if more than maxErrors errors occur, or the scan
     *  otherwise
2003-08-14 05:10:47 +00:00
     */
2003-08-24 06:40:53 +00:00
    public static ArrayList scan(String s, StringBuffer errors, int maxErrors) {
2003-08-14 05:10:47 +00:00
        // the size depends on whether it's mostly Tibetan or mostly
        // Latin and a number of other factors. This is meant to be
        // an underestimate, but not too much of an underestimate.
2003-08-17 02:12:49 +00:00
        int numErrors = 0;
2003-08-14 05:10:47 +00:00
        ArrayList al = new ArrayList(s.length() / 10);
2003-08-17 01:45:55 +00:00
        boolean waitingForMatchingIllegalClose = false;
2003-08-14 05:10:47 +00:00
        int sl = s.length();
        int currentType = ACIPString.ERROR;
        int startOfString = 0;
        Stack bracketTypeStack = new Stack();
        int startSlashIndex = -1;
        int startParenIndex = -1;
2003-08-23 22:03:37 +00:00
        int numNewlines = 0;
2003-08-14 05:10:47 +00:00
        for (int i = 0; i < sl; i++) {
            if (i < startOfString) throw new Error("bad reset");
            char ch;
            ch = s.charAt(i);
2003-08-23 22:03:37 +00:00
            if (ch == '\n') ++numNewlines;
2003-08-16 16:13:53 +00:00
            if (ACIPString.COMMENT == currentType && ch != ']') {
                if ('[' == ch) {
2003-08-17 01:45:55 +00:00
                    al.add(new ACIPString("Found an open bracket within a [#COMMENT]-style comment. Brackets may not appear in comments.\n",
2003-08-16 16:13:53 +00:00
                                          ACIPString.ERROR));
                    if (null != errors)
2003-08-23 22:03:37 +00:00
                        errors.append("Offset " + i + " or maybe " + (i - numNewlines) + ": "
2003-08-17 01:45:55 +00:00
                                      + "Found an open bracket within a [#COMMENT]-style comment. Brackets may not appear in comments.\n");
2003-08-17 02:12:49 +00:00
                    if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
2003-08-16 16:13:53 +00:00
                }
2003-08-14 05:10:47 +00:00
                continue;
2003-08-16 16:13:53 +00:00
            }
2003-08-14 05:10:47 +00:00
            switch (ch) {
2003-08-16 16:13:53 +00:00
            case '}':
2003-08-14 05:10:47 +00:00
            case ']':
                if (bracketTypeStack.empty()) {
                    // Error.
                    if (startOfString < i) {
                        al.add(new ACIPString(s.substring(startOfString, i),
                                              currentType));
                    }
2003-08-17 01:45:55 +00:00
                    if (!waitingForMatchingIllegalClose) {
2003-08-24 06:40:53 +00:00
                        al.add(new ACIPString("Found a truly unmatched close bracket, " + s.substring(i, i + 1),
                                              ACIPString.ERROR));
2003-08-17 01:45:55 +00:00
                        if (null != errors) {
2003-08-23 22:03:37 +00:00
                            errors.append("Offset " + i + " or maybe " + (i - numNewlines) + ": "
2003-08-17 02:12:49 +00:00
                                          + "Found a truly unmatched close bracket, ] or }.\n");
2003-08-17 01:45:55 +00:00
                        }
2003-08-17 02:12:49 +00:00
                        if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
2003-08-17 01:45:55 +00:00
                    }
                    waitingForMatchingIllegalClose = false;
2003-08-24 06:40:53 +00:00
                    al.add(new ACIPString("Found a closing bracket without a matching open bracket. Perhaps a [#COMMENT] incorrectly written as [COMMENT], or a [*CORRECTION] written incorrectly as [CORRECTION], caused this.",
                                          ACIPString.ERROR));
2003-08-16 16:13:53 +00:00
                    if (null != errors)
2003-08-23 22:03:37 +00:00
                        errors.append("Offset " + i + " or maybe " + (i - numNewlines) + ": "
2003-08-17 01:45:55 +00:00
                                      + "Found a closing bracket without a matching open bracket. Perhaps a [#COMMENT] incorrectly written as [COMMENT], or a [*CORRECTION] written incorrectly as [CORRECTION], caused this.\n");
2003-08-17 02:12:49 +00:00
                    if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
2003-08-14 05:10:47 +00:00
                    startOfString = i + 1;
                    currentType = ACIPString.ERROR;
                } else {
                    int stackTop = ((Integer)bracketTypeStack.pop()).intValue();
2003-08-17 01:45:55 +00:00
                    int end = startOfString;
2003-08-14 05:10:47 +00:00
                    if (ACIPString.CORRECTION_START == stackTop) {
2003-08-17 01:45:55 +00:00
                        // This definitely indicates a new token.
2003-08-14 05:10:47 +00:00
                        char prevCh = s.charAt(i - 1);
2003-08-17 01:45:55 +00:00
                        if (prevCh == '?')
                            end = i - 1;
                        else
                            end = i;
                        if (startOfString < end) {
                            al.add(new ACIPString(s.substring(startOfString, end),
                                                  currentType));
                        }
2003-08-14 05:10:47 +00:00
                        if ('?' != prevCh) {
                            currentType = ACIPString.PROBABLE_CORRECTION;
                        } else {
                            currentType = ACIPString.POSSIBLE_CORRECTION;
                        }
                    }
2003-08-17 01:45:55 +00:00
                    al.add(new ACIPString(s.substring(end, i + 1), currentType));
2003-08-14 05:10:47 +00:00
                    startOfString = i + 1;
                    currentType = ACIPString.ERROR;
                }
2003-08-16 16:13:53 +00:00
                break; // end ']','}' case
2003-08-14 05:10:47 +00:00
2003-08-16 16:13:53 +00:00
            case '{': // NOTE WELL: KX0016I.ACT, KD0095M.ACT, and a
                      // host of other ACIP files use {} brackets like
                      // [] brackets. I treat both the same.
2003-08-14 05:10:47 +00:00
            case '[':
                // This definitely indicates a new token.
                if (startOfString < i) {
                    al.add(new ACIPString(s.substring(startOfString, i),
                                          currentType));
                    startOfString = i;
                    currentType = ACIPString.ERROR;
                }
                String thingy = null;
                if (i + "[DD]".length() <= sl
2003-08-16 16:13:53 +00:00
                    && (s.substring(i, i + "[DD]".length()).equals("[DD]")
                        || s.substring(i, i + "[DD]".length()).equals("{DD}"))) {
2003-08-14 05:10:47 +00:00
                    thingy = "[DD]";
                    currentType = ACIPString.DD;
                } else if (i + "[DD1]".length() <= sl
2003-08-16 16:13:53 +00:00
                           && (s.substring(i, i + "[DD1]".length()).equals("[DD1]")
                               || s.substring(i, i + "[DD1]".length()).equals("{DD1}"))) {
2003-08-14 05:10:47 +00:00
                    thingy = "[DD1]";
                    currentType = ACIPString.DD;
                } else if (i + "[DD2]".length() <= sl
2003-08-16 16:13:53 +00:00
                           && (s.substring(i, i + "[DD2]".length()).equals("[DD2]")
                               || s.substring(i, i + "[DD2]".length()).equals("{DD2}"))) {
2003-08-14 05:10:47 +00:00
                    thingy = "[DD2]";
                    currentType = ACIPString.DD;
                } else if (i + "[DDD]".length() <= sl
2003-08-16 16:13:53 +00:00
                           && (s.substring(i, i + "[DDD]".length()).equals("[DDD]")
                               || s.substring(i, i + "[DDD]".length()).equals("{DDD}"))) {
2003-08-14 05:10:47 +00:00
                    thingy = "[DDD]";
                    currentType = ACIPString.DD;
                } else if (i + "[DR]".length() <= sl
2003-08-16 16:13:53 +00:00
                           && (s.substring(i, i + "[DR]".length()).equals("[DR]")
                               || s.substring(i, i + "[DR]".length()).equals("{DR}"))) {
2003-08-14 05:10:47 +00:00
                    thingy = "[DR]";
                    currentType = ACIPString.DR;
                } else if (i + "[LS]".length() <= sl
2003-08-16 16:13:53 +00:00
                           && (s.substring(i, i + "[LS]".length()).equals("[LS]")
                               || s.substring(i, i + "[LS]".length()).equals("{LS}"))) {
2003-08-14 05:10:47 +00:00
                    thingy = "[LS]";
                    currentType = ACIPString.LS;
                } else if (i + "[BP]".length() <= sl
2003-08-16 16:13:53 +00:00
                           && (s.substring(i, i + "[BP]".length()).equals("[BP]")
                               || s.substring(i, i + "[BP]".length()).equals("{BP}"))) {
2003-08-14 05:10:47 +00:00
                    thingy = "[BP]";
                    currentType = ACIPString.BP;
2003-08-23 22:03:37 +00:00
                } else if (i + "[BLANK PAGE]".length() <= sl
                           && (s.substring(i, i + "[BLANK PAGE]".length()).equals("[BLANK PAGE]")
                               || s.substring(i, i + "[BLANK PAGE]".length()).equals("{BLANK PAGE}"))) {
                    thingy = "[BLANK PAGE]";
                    currentType = ACIPString.BP;
2003-08-16 16:13:53 +00:00
                } else if (i + "[ BP ]".length() <= sl
                           && (s.substring(i, i + "[ BP ]".length()).equals("[ BP ]")
                               || s.substring(i, i + "[ BP ]".length()).equals("{ BP }"))) {
                    thingy = "{ BP }"; // found in TD3790E2.ACT
                    currentType = ACIPString.BP;
                } else if (i + "[ DD ]".length() <= sl
                           && (s.substring(i, i + "[ DD ]".length()).equals("[ DD ]")
                               || s.substring(i, i + "[ DD ]".length()).equals("{ DD }"))) {
                    thingy = "{ DD }"; // found in TD3790E2.ACT
                    currentType = ACIPString.DD;
2003-08-14 05:10:47 +00:00
                } else if (i + "[?]".length() <= sl
2003-08-16 16:13:53 +00:00
                           && (s.substring(i, i + "[?]".length()).equals("[?]")
                               || s.substring(i, i + "[?]".length()).equals("{?}"))) {
2003-08-14 05:10:47 +00:00
                    thingy = "[?]";
                    currentType = ACIPString.QUESTION;
2003-08-16 16:13:53 +00:00
                } else {
                    // We see comments appear not as [#COMMENT], but
                    // as [COMMENT] sometimes. We make special cases
2003-08-17 01:45:55 +00:00
                    // for some English comments. There's no need to
                    // make this mechanism extensible, because you
                    // can easily edit the ACIP text so that it uses
                    // [#COMMENT] notation instead of [COMMENT].
2003-08-16 16:13:53 +00:00
                    String[] englishComments = new String[] {
                        "FIRST", "SECOND", // S5274I.ACT
                        "Additional verses added by Khen Rinpoche here are", // S0216M.ACT
                        "ADDENDUM: The text of", // S0216M.ACT
                        "END OF ADDENDUM", // S0216M.ACT
                        "Some of the verses added here by Khen Rinpoche include:", // S0216M.ACT
                        "Note that, in the second verse, the {YUL LJONG} was orignally {GANG LJONG},\nand is now recited this way since the ceremony is not only taking place in Tibet.", // S0216M.ACT
                        "Note that, in the second verse, the {YUL LJONG} was orignally {GANG LJONG},\r\nand is now recited this way since the ceremony is not only taking place in Tibet.", // S0216M.ACT
                        "text missing", // S6954E1.ACT
                        "INCOMPLETE", // TD3817I.INC
                        "MISSING PAGE", // S0935m.act
                        "MISSING FOLIO", // S0975I.INC
                        "UNCLEAR LINE", // S0839D1I.INC
                        "THE FOLLOWING TEXT HAS INCOMPLETE SECTIONS, WHICH ARE ON ORDER", // SE6260A.INC
                        "@DATA INCOMPLETE HERE", // SE6260A.INC
                        "@DATA MISSING HERE", // SE6260A.INC
2003-08-17 01:45:55 +00:00
                        "LINE APPARENTLY MISSING THIS PAGE", // TD4035I.INC
2003-08-16 16:13:53 +00:00
                        "DATA INCOMPLETE HERE", // TD4226I2.INC
                        "DATA MISSING HERE", // just being consistent
                        "FOLLOWING SECTION WAS NOT AVAILABLE WHEN THIS EDITION WAS\nPRINTED, AND IS SUPPLIED FROM ANOTHER, PROBABLY THE ORIGINAL:", // S0018N.ACT
                        "FOLLOWING SECTION WAS NOT AVAILABLE WHEN THIS EDITION WAS\r\nPRINTED, AND IS SUPPLIED FROM ANOTHER, PROBABLY THE ORIGINAL:", // S0018N.ACT
                        "THESE PAGE NUMBERS RESERVED IN THIS EDITION FOR PAGES\nMISSING FROM ORIGINAL ON WHICH IT WAS BASED", // S0018N.ACT
                        "THESE PAGE NUMBERS RESERVED IN THIS EDITION FOR PAGES\r\nMISSING FROM ORIGINAL ON WHICH IT WAS BASED", // S0018N.ACT
                        "PAGE NUMBERS RESERVED FROM THIS EDITION FOR MISSING\nSECTION SUPPLIED BY PRECEDING", // S0018N.ACT
                        "PAGE NUMBERS RESERVED FROM THIS EDITION FOR MISSING\r\nSECTION SUPPLIED BY PRECEDING", // S0018N.ACT
                        "SW: OK", // S0057M.ACT
                        "m:ok", // S0057M.ACT
                        "A FIRST ONE\nMISSING HERE?", // S0057M.ACT
                        "A FIRST ONE\r\nMISSING HERE?", // S0057M.ACT
                        "THE INITIAL PART OF THIS TEXT WAS INPUT BY THE SERA MEY LIBRARY IN\nTIBETAN FONT AND NEEDS TO BE REDONE BY DOUBLE INPUT", // S0195A1.INC
                        "THE INITIAL PART OF THIS TEXT WAS INPUT BY THE SERA MEY LIBRARY IN\r\nTIBETAN FONT AND NEEDS TO BE REDONE BY DOUBLE INPUT", // S0195A1.INC
                    };
                    boolean foundOne = false;
                    for (int ec = 0; ec < englishComments.length; ec++) {
                        // Accept both [COMMENT] and {COMMENT}, since {}
                        // brackets are treated like [] brackets.
                        if (i + 2 + englishComments[ec].length() <= sl
                            && (s.substring(i, i + 2 + englishComments[ec].length()).equals("[" + englishComments[ec] + "]")
                                || s.substring(i, i + 2 + englishComments[ec].length()).equals("{" + englishComments[ec] + "}"))) {
                            al.add(new ACIPString("[#" + englishComments[ec] + "]",
                                                  ACIPString.COMMENT));
                            startOfString = i + 2 + englishComments[ec].length();
                            i = startOfString - 1;
                            foundOne = true;
                            break;
                        }
                    }
2003-08-17 01:45:55 +00:00
                    if (!foundOne && i + 1 < sl && s.charAt(i + 1) == '*') {
                        // Identify [*LINE BREAK?] as an English
                        // correction. Every correction not on this
                        // list is considered to be Tibetan. DLC
                        // FIXME: make this extensible via a config
                        // file or at least a System property (which
                        // could be a comma-separated list of these
                        // creatures).

                        // If "LINE" is in the list below, then [*
                        // LINE], [* LINE?], [*LINE], [*LINE?], [*
                        // LINE OUT ?], etc. will be considered
                        // English corrections. I.e., whitespace
                        // before and anything after doesn't prevent a
                        // match.
                        String[] englishCorrections = new String[] {
                            "LINE", // KD0001I1.ACT
                            "DATA", // KL0009I2.INC
                            "BLANK", // KL0009I2.INC
                            "NOTE", // R0001F.ACM
                            "alternate", // R0018F.ACE
                            "02101-02150 missing", // R1003A3.INC
                            "51501-51550 missing", // R1003A52.ACT
                            "BRTAGS ETC", // S0002N.ACT
                            "TSAN, ETC", // S0015N.ACT
                            "SNYOMS, THROUGHOUT", // S0016N.ACT
                            "KYIS ETC", // S0019N.ACT
                            "MISSING", // S0455M.ACT
                            "this", // S6850I1B.ALT
                            "THIS", // S0057M.ACT
                        };
                        // Skip whitespace after the "[*" opener.
                        int begin;
                        for (begin = i + 2; begin < sl; begin++) {
                            if (!isWhitespace(s.charAt(begin)))
                                break;
                        }
                        // Find the closing ']' of this correction.
                        int end;
                        for (end = i + 2; end < sl; end++) {
                            if (s.charAt(end) == ']')
                                break;
                        }
                        // A trailing '?' belongs to the closer, not the text.
                        int realEnd = end;
                        if (end < sl && s.charAt(end - 1) == '?')
                            --realEnd;
                        if (end < sl && begin < realEnd) {
                            String interestingSubstring
                                = s.substring(begin, realEnd);
                            for (int ec = 0; ec < englishCorrections.length; ec++) {
                                if (interestingSubstring.startsWith(englishCorrections[ec])) {
                                    al.add(new ACIPString(s.substring(i, i + 2),
                                                          ACIPString.CORRECTION_START));
                                    al.add(new ACIPString(s.substring(i + 2, realEnd),
                                                          ACIPString.LATIN));
                                    if (s.charAt(end - 1) == '?') {
                                        al.add(new ACIPString(s.substring(end - 1, end + 1),
                                                              ACIPString.POSSIBLE_CORRECTION));
                                    } else {
                                        al.add(new ACIPString(s.substring(end, end + 1),
                                                              ACIPString.PROBABLE_CORRECTION));
                                    }
                                    foundOne = true;
                                    startOfString = end + 1;
                                    i = startOfString - 1;
                                    break;
                                }
                            }
                        }
                    }
2003-08-16 16:13:53 +00:00
                    if (foundOne)
                        break;
2003-08-14 05:10:47 +00:00
                }
                if (null != thingy) {
                    al.add(new ACIPString(thingy,
                                          currentType));
                    startOfString = i + thingy.length();
                    i = startOfString - 1;
                } else {
                    if (i + 1 < sl) {
                        char nextCh = s.charAt(i + 1);
                        if ('*' == nextCh) {
                            currentType = ACIPString.CORRECTION_START;
                            bracketTypeStack.push(new Integer(currentType));
2003-08-17 01:45:55 +00:00
                            al.add(new ACIPString(s.substring(i, i + 2),
                                                  ACIPString.CORRECTION_START));
                            currentType = ACIPString.ERROR;
                            startOfString = i + 2;
                            i = startOfString - 1;
2003-08-14 05:10:47 +00:00
                            break;
                        } else if ('#' == nextCh) {
                            currentType = ACIPString.COMMENT;
                            bracketTypeStack.push(new Integer(currentType));
                            break;
                        }
                    }
2003-08-17 01:45:55 +00:00
// This is an error. Sometimes [COMMENTS APPEAR
// WITHOUT # MARKS]. Though "... [" could cause
// this too.
if ( waitingForMatchingIllegalClose ) {
al.add(new ACIPString("Found a truly unmatched open bracket, [ or {, prior to this current illegal open bracket.",
ACIPString.ERROR));
if (null != errors) {
errors.append("Offset " + i + " or maybe " + (i-numNewlines) + ": "
+ "Found a truly unmatched open bracket, [ or {, prior to this current illegal open bracket.\n");
}
if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
}
waitingForMatchingIllegalClose = true ;
if (null != errors) {
String inContext = s.substring(i, i+Math.min(sl-i, 10));
if (inContext.indexOf("\r") >= 0) {
inContext = inContext.substring(0, inContext.indexOf("\r"));
} else if (inContext.indexOf("\n") >= 0) {
inContext = inContext.substring(0, inContext.indexOf("\n"));
} else {
if (sl-i > 10) {
inContext = inContext + "...";
}
}
al.add(new ACIPString("Found an illegal open bracket (in context, this is " + inContext + "). Perhaps there is a [#COMMENT] written incorrectly as [COMMENT], or a [*CORRECTION] written incorrectly as [CORRECTION], or an unmatched open bracket?",
ACIPString.ERROR));
errors.append("Offset " + i + " or maybe " + (i-numNewlines) + ": "
+ "Found an illegal open bracket (in context, this is " + inContext + "). Perhaps there is a [#COMMENT] written incorrectly as [COMMENT], or a [*CORRECTION] written incorrectly as [CORRECTION], or an unmatched open bracket?\n");
if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
}
startOfString = i + 1 ;
currentType = ACIPString . ERROR ;
}
break ; // end '[','{' case
case '@' :
// This definitely indicates a new token.
if ( startOfString < i ) {
al . add ( new ACIPString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
startOfString = i ;
currentType = ACIPString . ERROR ;
}
// We look for {@N{AB}, @NN{AB}, ..., @NNNNNN{AB}},
// {@[N{AB}], @[NN{AB}], ..., @[NNNNNN{AB}]},
// {@N{AB}.N, @NN{AB}.N, ..., @NNNNNN{AB}.N}, {@N,
// @NN, ..., @NNNNNN}, and {@{AB}N, @{AB}NN,
// ... @{AB}NNNNNN} only, that is, from one to six
// digits. Each of these folio marker formats occurs
// in practice.
for ( int numdigits = 6 ; numdigits > = 1 ; numdigits - - ) {
// @NNN{AB} and @NNN{AB}.N cases:
if ( i + numdigits + 1 < sl
& & ( s . charAt ( i + numdigits + 1 ) = = 'A' | | s . charAt ( i + numdigits + 1 ) = = 'B' ) ) {
boolean allAreNumeric = true ;
for ( int k = 1 ; k < = numdigits ; k + + ) {
if ( ! isNumeric ( s . charAt ( i + k ) ) ) {
allAreNumeric = false ;
break ;
}
}
if ( allAreNumeric ) {
// Is this "@012B " or "@012B.3 "?
int extra ;
if ( i + numdigits + 2 < sl & & s . charAt ( i + numdigits + 2 ) = = '.' ) {
if ( ! ( i + numdigits + 4 < sl & & isNumeric ( s . charAt ( i + numdigits + 3 ) )
& & ! isNumeric ( s . charAt ( i + numdigits + 4 ) ) ) ) {
String inContext = s.substring(i, i+Math.min(sl-i, 10));
if (inContext.indexOf("\r") >= 0) {
inContext = inContext.substring(0, inContext.indexOf("\r"));
} else if (inContext.indexOf("\n") >= 0) {
inContext = inContext.substring(0, inContext.indexOf("\n"));
} else {
if (sl-i > 10) {
inContext = inContext + "...";
}
}
al.add(new ACIPString("Found an illegal at sign, @ (in context, this is " + inContext + "). This folio marker has a period, '.', at the end of it, which is illegal.",
ACIPString.ERROR));
if (null != errors)
errors.append("Offset " + i + " or maybe " + (i-numNewlines) + ": "
+ "Found an illegal at sign, @ (in context, this is " + inContext + "). This folio marker has a period, '.', at the end of it, which is illegal.\n");
if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
startOfString = i + numdigits + 3 ;
i = startOfString - 1 ;
currentType = ACIPString . ERROR ;
break ;
}
if ( i + numdigits + 4 < sl & & ( s . charAt ( i + numdigits + 4 ) = = '.' | | s . charAt ( i + numdigits + 4 ) = = 'A' | | s . charAt ( i + numdigits + 4 ) = = 'B' | | s . charAt ( i + numdigits + 4 ) = = 'a' | | s . charAt ( i + numdigits + 4 ) = = 'b' | | isNumeric ( s . charAt ( i + numdigits + 4 ) ) ) ) {
String inContext = s.substring(i, i+Math.min(sl-i, 10));
if (inContext.indexOf("\r") >= 0) {
inContext = inContext.substring(0, inContext.indexOf("\r"));
} else if (inContext.indexOf("\n") >= 0) {
inContext = inContext.substring(0, inContext.indexOf("\n"));
} else {
if (sl-i > 10) {
inContext = inContext + "...";
}
}
al.add(new ACIPString("Found an illegal at sign, @ (in context, this is " + inContext + "). This folio marker is not followed by whitespace, as is expected.",
ACIPString.ERROR));
if (null != errors)
errors.append("Offset " + i + " or maybe " + (i-numNewlines) + ": "
+ "Found an illegal at sign, @ (in context, this is " + inContext + "). This folio marker is not followed by whitespace, as is expected.\n");
if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
startOfString = i + 1 ; // DLC FIXME: skip over more?
currentType = ACIPString . ERROR ;
break ;
}
extra = 4 ;
} else {
extra = 2 ;
}
al . add ( new ACIPString ( s . substring ( i , i + numdigits + extra ) ,
ACIPString . FOLIO_MARKER ) ) ;
startOfString = i + numdigits + extra ;
i = startOfString - 1 ;
currentType = ACIPString . ERROR ;
break ;
}
}
// @{AB}NNN case:
if ( i + numdigits + 1 < sl
& & ( s . charAt ( i + 1 ) = = 'A' | | s . charAt ( i + 1 ) = = 'B' ) ) {
boolean allAreNumeric = true ;
for ( int k = 1 ; k < = numdigits ; k + + ) {
if ( ! isNumeric ( s . charAt ( i + 1 + k ) ) ) {
allAreNumeric = false ;
break ;
}
}
if (allAreNumeric) {
al.add(new ACIPString(s.substring(i, i+numdigits+2),
ACIPString.FOLIO_MARKER));
startOfString = i+numdigits+2;
i = startOfString - 1 ;
currentType = ACIPString . ERROR ;
break ;
}
}
// @[NNN{AB}] case:
if ( i + numdigits + 3 < sl
& & s . charAt ( i + 1 ) = = '[' & & s . charAt ( i + numdigits + 3 ) = = ']'
& & ( s . charAt ( i + numdigits + 2 ) = = 'A' | | s . charAt ( i + numdigits + 2 ) = = 'B' ) ) {
boolean allAreNumeric = true ;
for ( int k = 1 ; k < = numdigits ; k + + ) {
if ( ! isNumeric ( s . charAt ( i + 1 + k ) ) ) {
allAreNumeric = false ;
break ;
}
}
if ( allAreNumeric ) {
al . add ( new ACIPString ( s . substring ( i , i + numdigits + 4 ) ,
ACIPString . FOLIO_MARKER ) ) ;
startOfString = i + numdigits + 4 ;
i = startOfString - 1 ;
currentType = ACIPString . ERROR ;
break ;
}
}
// This case, @NNN, must come after the @NNN{AB} case.
if ( i + numdigits + 1 < sl & & ( s . charAt ( i + numdigits + 1 ) = = ' '
| | s . charAt ( i + numdigits + 1 ) = = '\n'
| | s . charAt ( i + numdigits + 1 ) = = '\r' ) ) {
boolean allAreNumeric = true ;
for ( int k = 1 ; k < = numdigits ; k + + ) {
if ( ! isNumeric ( s . charAt ( i + k ) ) ) {
allAreNumeric = false ;
break ;
}
}
if ( allAreNumeric ) {
al . add ( new ACIPString ( s . substring ( i , i + numdigits + 1 ) ,
ACIPString . FOLIO_MARKER ) ) ;
startOfString = i + numdigits + 1 ;
i = startOfString - 1 ;
currentType = ACIPString . ERROR ;
break ;
}
}
}
if ( startOfString = = i ) {
String inContext = s.substring(i, i+Math.min(sl-i, 10));
if (inContext.indexOf("\r") >= 0) {
inContext = inContext.substring(0, inContext.indexOf("\r"));
} else if (inContext.indexOf("\n") >= 0) {
inContext = inContext.substring(0, inContext.indexOf("\n"));
} else {
if (sl-i > 10) {
inContext = inContext + "...";
}
}
al.add(new ACIPString("Found an illegal at sign, @ (in context, this is " + inContext + "). @012B is an example of a legal folio marker.",
ACIPString.ERROR));
if (null != errors)
errors.append("Offset " + i + " or maybe " + (i-numNewlines) + ": "
+ "Found an illegal at sign, @ (in context, this is " + inContext + "). @012B is an example of a legal folio marker.\n");
if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
startOfString = i + 1 ;
currentType = ACIPString . ERROR ;
}
break ; // end '@' case
case '/' :
// This definitely indicates a new token.
if ( startOfString < i ) {
al . add ( new ACIPString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
startOfString = i ;
currentType = ACIPString . ERROR ;
}
if ( startSlashIndex > = 0 ) {
if ( startSlashIndex + 1 = = i ) {
/* //NYA\\ appears in ACIP input, and I think
 * it means /NYA/. We warn about // for this
 * reason. \\ causes a tsheg-bar error (DLC
 * FIXME: verify this is so). */
al.add(new ACIPString("Found //, which could be legal (the Unicode would be \\u0F3C\\u0F3D), but is likely in an illegal construct like //NYA\\\\.",
ACIPString.ERROR));
if (errors != null) {
errors.append("Offset " + i + " or maybe " + (i-numNewlines) + ": "
+ "Found //, which could be legal (the Unicode would be \\u0F3C\\u0F3D), but is likely in an illegal construct like //NYA\\\\.\n");
}
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
}
al . add ( new ACIPString ( s . substring ( i , i + 1 ) ,
ACIPString . END_SLASH ) ) ;
startOfString = i + 1 ;
currentType = ACIPString . ERROR ;
startSlashIndex = - 1 ;
} else {
startSlashIndex = i ;
al . add ( new ACIPString ( s . substring ( i , i + 1 ) ,
ACIPString . START_SLASH ) ) ;
startOfString = i + 1 ;
currentType = ACIPString . ERROR ;
}
break ; // end '/' case
case '(' :
case ')' :
// This definitely indicates a new token.
if ( startOfString < i ) {
al . add ( new ACIPString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
startOfString = i ;
currentType = ACIPString . ERROR ;
}
// We do not support nesting like (NYA (BA)).
if (startParenIndex >= 0) {
if (ch == '(') {
al.add(new ACIPString("Found an illegal open parenthesis, (. Nesting of parentheses is not allowed.",
ACIPString.ERROR));
if (null != errors)
errors.append("Offset " + i + " or maybe " + (i-numNewlines) + ": "
+ "Found an illegal open parenthesis, (. Nesting of parentheses is not allowed.\n");
if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
} else {
al . add ( new ACIPString ( s . substring ( i , i + 1 ) , ACIPString . END_PAREN ) ) ;
startParenIndex = - 1 ;
}
startOfString = i + 1 ;
currentType = ACIPString . ERROR ;
} else {
if (ch == ')') {
al.add(new ACIPString("Unexpected closing parenthesis, ), found.",
ACIPString.ERROR));
if (null != errors)
errors.append("Offset " + i + " or maybe " + (i-numNewlines) + ": "
+ "Unexpected closing parenthesis, ), found.\n");
if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
} else {
startParenIndex = i ;
al . add ( new ACIPString ( s . substring ( i , i + 1 ) , ACIPString . START_PAREN ) ) ;
}
startOfString = i + 1 ;
currentType = ACIPString . ERROR ;
}
break ; // end '(',')' case
case '?' :
if ( bracketTypeStack . empty ( ) | | i + 1 > = sl
| | ( s . charAt ( i + 1 ) ! = ']' & & s . charAt ( i + 1 ) ! = '}' ) ) {
// The tsheg bar ends here; new token.
if ( startOfString < i ) {
al . add ( new ACIPString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
}
al . add ( new ACIPString ( s . substring ( i , i + 1 ) ,
ACIPString . QUESTION ) ) ;
startOfString = i + 1 ;
currentType = ACIPString . ERROR ;
} // else this is [*TR'A ?] or the like.
break ; // end '?' case
case '.' :
// This definitely indicates a new token.
if ( startOfString < i ) {
al . add ( new ACIPString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
startOfString = i ;
currentType = ACIPString . ERROR ;
}
// . is used for a non-breaking tsheg, such as in
// {NGO.,} and {....,DAM}. We give a warning unless
// '.', ',', a line break, or [A-Za-z] follows '.'.
al . add ( new ACIPString ( s . substring ( i , i + 1 ) ,
ACIPString . TIBETAN_PUNCTUATION ) ) ;
if ( ! ( i + 1 < sl
& & ( s . charAt ( i + 1 ) = = '.' | | s . charAt ( i + 1 ) = = ','
| | ( s . charAt ( i + 1 ) = = '\r' | | s . charAt ( i + 1 ) = = '\n' )
| | ( s . charAt ( i + 1 ) > = 'a' & & s . charAt ( i + 1 ) < = 'z' )
| | ( s . charAt ( i + 1 ) > = 'A' & & s . charAt ( i + 1 ) < = 'Z' ) ) ) ) {
al.add(new ACIPString("A non-breaking tsheg, '.', appeared, but not like \"...,\" or \".,\" or \".dA\" or \".DA\".",
ACIPString.WARNING));
}
startOfString = i + 1 ;
break ; // end '.' case
// Classic tsheg bar enders:
case ' ' :
case '\t' :
case '\r' :
case '\n' :
case ',' :
case '*' :
case ';' :
case '`' :
case '#' :
// The tsheg bar ends here; new token.
if ( startOfString < i ) {
al . add ( new ACIPString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
}
// Insert a tsheg if necessary. ACIP files aren't
// careful, so "KA\r\n" and "GA\n" appear where "KA
// \r\n" and "GA \n" should appear.
if ( ( '\r' = = ch
| | '\n' = = ch )
& & ! al . isEmpty ( )
& & ( ( ACIPString ) al . get ( al . size ( ) - 1 ) ) . getType ( ) = = ACIPString . TIBETAN_NON_PUNCTUATION ) {
al . add ( new ACIPString ( " " ,
ACIPString . TIBETAN_PUNCTUATION ) ) ;
}
// Don't add in a "\r\n" or "\n" unless there's a
// blank line.
boolean rn = false ;
boolean realNewline = false;
if (('\n' != ch && '\r' != ch)
|| (realNewline
= ((rn = ('\n' == ch && i >= 3 && s.charAt(i-3) == '\r' && s.charAt(i-2) == '\n' && s.charAt(i-1) == '\r'))
|| ('\n' == ch && i >= 1 && s.charAt(i-1) == '\n')))) {
for (int h = 0; h < (realNewline ? 2 : 1); h++)
al.add(new ACIPString(rn ? s.substring(i - 1, i+1) : s.substring(i, i+1),
ACIPString.TIBETAN_PUNCTUATION));
}
startOfString = i + 1 ;
currentType = ACIPString . ERROR ;
break ; // end TIBETAN_PUNCTUATION case
default :
if ( ! bracketTypeStack . empty ( ) ) {
int stackTop = ( ( Integer ) bracketTypeStack . peek ( ) ) . intValue ( ) ;
if ( ACIPString . CORRECTION_START = = stackTop & & '?' = = ch ) {
// allow it through...
break ;
}
}
if ( i + 1 = = sl & & 26 = = ( int ) ch )
// Silently allow the last character to be ^Z,
// which just marks end of file.
break ;
if ( ! ( isNumeric ( ch ) | | isAlpha ( ch ) ) ) {
if ( startOfString < i ) {
al . add ( new ACIPString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
}
2003-08-23 22:03:37 +00:00
if ( ( int ) ch = = 65533 ) {
al . add ( new ACIPString ( " Found an illegal, unprintable character. " ,
ACIPString . ERROR ) ) ;
if ( null ! = errors )
errors . append ( " Offset " + i + " or maybe " + ( i - numNewlines ) + " : "
Improved the ACIP scanner (the part of the converter that says, "This
is a correction, that's a comment, this is Tibetan, that's Latin
(English), that's Tibetan inter-tsheg-bar punctuation, etc.) It now
accepts more real-world ACIP files, i.e. it handles illegal
constructs. The error checking is more user-friendly. There are now
tests.
Added some tsheg bars that Peter E. Hauer of Linguasoft sent me to the
tests. Many thanks, Peter. I still need to implement rules that say,
"This is not Tibetan, it must be Sanskrit, because that letter doesn't
take a MA prefix."
2003-08-17 01:45:55 +00:00
                                          + "Found an illegal, unprintable character.\n");
2003-08-23 22:03:37 +00:00
                    } else if ('\\' == ch) {
                        al.add(new ACIPString("Found a Sanskrit virama, \\, but the converter currently doesn't treat these properly. Sorry! Please do complain to the maintainers.",
                                              ACIPString.ERROR));
                        if (null != errors)
                            errors.append("Offset " + i + " or maybe " + (i - numNewlines) + ": "
2003-08-18 02:38:54 +00:00
                                          + "Found a Sanskrit virama, \\, but the converter currently doesn't treat these properly. Sorry! Please do complain to the maintainers.\n");
2003-08-23 22:03:37 +00:00
                    } else {
                        al.add(new ACIPString("Found an illegal character, " + ch + ", with ordinal " + (int) ch + ".",
                                              ACIPString.ERROR));
                        if (null != errors)
                            errors.append("Offset " + i + " or maybe " + (i - numNewlines) + ": "
2003-08-17 01:45:55 +00:00
                                          + "Found an illegal character, " + ch + ", with ordinal " + (int) ch + ".\n");
                    }
2003-08-17 02:12:49 +00:00
                    if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
2003-08-14 05:10:47 +00:00
                    startOfString = i + 1;
                    currentType = ACIPString.ERROR;
                } else {
                    // Continue through the loop.
                    if (ACIPString.ERROR == currentType)
                        currentType = ACIPString.TIBETAN_NON_PUNCTUATION;
                }
                break; // end default case
            }
        }
        if (startOfString < sl) {
            al.add(new ACIPString(s.substring(startOfString, sl),
                                  currentType));
2003-08-16 16:13:53 +00:00
        }
2003-08-17 01:45:55 +00:00
        if (waitingForMatchingIllegalClose) {
            al.add(new ACIPString("UNEXPECTED END OF INPUT",
                                  ACIPString.ERROR));
            if (null != errors) {
                errors.append("Offset END: "
                              + "Truly unmatched open bracket found.\n");
            }
2003-08-17 02:12:49 +00:00
            if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
2003-08-17 01:45:55 +00:00
        }
2003-08-16 16:13:53 +00:00
        if (!bracketTypeStack.empty()) {
2003-08-24 06:40:53 +00:00
            al.add(new ACIPString("Unmatched open bracket found. A " + ((ACIPString.COMMENT == currentType) ? "comment" : "correction") + " does not terminate.",
2003-08-16 16:13:53 +00:00
                                  ACIPString.ERROR));
            if (null != errors) {
2003-08-24 06:40:53 +00:00
                errors.append("Offset END: "
                              + "Unmatched open bracket found. A " + ((ACIPString.COMMENT == currentType) ? "comment" : "correction") + " does not terminate.\n");
2003-08-14 05:10:47 +00:00
            }
2003-08-17 02:12:49 +00:00
            if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
2003-08-14 05:10:47 +00:00
        }
2003-08-16 16:13:53 +00:00
        if (startSlashIndex >= 0) {
            al.add(new ACIPString("Slashes are supposed to occur in pairs, but the input had an unmatched '/' character.",
                                  ACIPString.ERROR));
            if (null != errors)
                errors.append("Offset END: "
                              + "Slashes are supposed to occur in pairs, but the input had an unmatched '/' character.\n");
2003-08-17 02:12:49 +00:00
            if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
2003-08-16 16:13:53 +00:00
        }
        if (startParenIndex >= 0) {
            al.add(new ACIPString("Parentheses are supposed to occur in pairs, but the input had an unmatched parenthesis.",
                                  ACIPString.ERROR));
            if (null != errors)
                errors.append("Offset END: "
                              + "Unmatched open parenthesis, (, found.\n");
2003-08-17 02:12:49 +00:00
            if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
2003-08-16 16:13:53 +00:00
        }
2003-08-14 05:10:47 +00:00
        return al;
    }
2003-08-16 16:13:53 +00:00
2003-08-14 05:10:47 +00:00
    /** Returns true if and only if ch is an ASCII digit, '0' through
     *  '9'. */
    private static boolean isNumeric(char ch) {
        return ch >= '0' && ch <= '9';
    }
2003-08-17 01:45:55 +00:00
    /** Returns true if and only if ch is a space, tab, carriage
     *  return, or newline. */
    private static boolean isWhitespace(char ch) {
        return ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n';
    }
2003-08-14 05:10:47 +00:00
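    /** Returns true if and only if ch is one of the characters this
     *  scanner treats as alphabetic: the apostrophe, the combining
     *  punctuation and vowel marks listed below, an uppercase Latin
     *  letter other than X, Q, and F, or one of the lowercase letters
     *  i, t, h, d, n, and s. */
    private static boolean isAlpha(char ch) {
2003-08-16 16:13:53 +00:00
        return ch == '\'' // 23rd consonant
2003-08-14 05:10:47 +00:00
2003-08-16 16:13:53 +00:00
            // combining punctuation, vowels:
2003-08-14 05:10:47 +00:00
            || ch == '%'
            || ch == 'o'
2003-08-17 02:38:58 +00:00
            || ch == 'm'
2003-08-14 05:10:47 +00:00
            || ch == 'x'
2003-08-16 16:13:53 +00:00
            || ch == ':'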
2003-08-17 01:45:55 +00:00
            || ch == '^'
2003-08-18 02:38:54 +00:00
            // DLC FIXME: we must treat this guy like a vowel, a special vowel that numerals can take on. Until then, warn. || ch == '\\'
2003-08-16 16:13:53 +00:00
            || ch == '-'
            || ch == '+'
2003-08-17 02:38:58 +00:00
            || ((ch >= 'A' && ch <= 'Z') && ch != 'X' && ch != 'Q' && ch != 'F')
            || ch == 'i'
            || ch == 't'
            || ch == 'h'
            || ch == 'd'
            || ch == 'n'
            || ch == 's';
2003-08-14 05:10:47 +00:00
    }
}