2003-08-14 05:10:47 +00:00
/*
The contents of this file are subject to the THDL Open Community License
Version 1.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License on the THDL web site
(http://www.thdl.org/).

Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
License for the specific terms governing rights and limitations under the
License.

The Initial Developer of this software is the Tibetan and Himalayan Digital
Library (THDL). Portions created by the THDL are Copyright 2003 THDL.
All Rights Reserved.

Contributor(s): ______________________________________.
*/
package org.thdl.tib.text.ttt;

import java.io.*;
import java.util.ArrayList;
import java.util.Stack;

import org.thdl.util.ThdlDebug;
import org.thdl.util.ThdlOptions;

/**
* This class is able to break up Strings of ACIP text (for example, an
* entire sutra file) into tsheg bars, comments, etc. Folio markers,
* comments, and the like are segregated (so that consumers can ensure
* that they remain in Latin), and Tibetan passages are broken up into
* tsheg bars.
*
* <p><b>FIXME:</b> We should be handling {KA\n\nKHA} vs. {KA\nKHA} in
* the parser, not here in the lexical analyzer. That'd be cleaner,
* and more like how you'd do things if you used lex and yacc.
*
* @author David Chandler */
public class ACIPTshegBarScanner {
    // Improved the ACIP scanner (the part of the converter that says, "This
    // is a correction, that's a comment, this is Tibetan, that's Latin
    // (English), that's Tibetan inter-tsheg-bar punctuation, etc."). It now
    // accepts more real-world ACIP files, i.e. it handles illegal
    // constructs. The error checking is more user-friendly. There are now
    // tests.
    //
    // Added some tsheg bars that Peter E. Hauer of Linguasoft sent me to the
    // tests. Many thanks, Peter. I still need to implement rules that say,
    // "This is not Tibetan, it must be Sanskrit, because that letter doesn't
    // take a MA prefix."

    /** Useful for testing. Gives error messages on standard output
     *  about why we can't scan the document perfectly and exits with
     *  non-zero return code, or says "Good scan!" otherwise and exits
     *  with code zero. <p>FIXME: not so efficient; copies the whole
     *  file into memory first. */
    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Bad args! Need just the name of the ACIP text file.");
            System.exit(1);
        }
        StringBuffer errors = new StringBuffer();
        int maxErrors = 250;
        ArrayList al = scanFile(args[0], errors, maxErrors - 1);

        if (null == al) {
            System.out.println(maxErrors + " or more errors occurred while scanning ACIP input file; is this");
            System.out.println("Tibetan or English input?");
            System.out.println("");
            System.out.println("First " + maxErrors + " errors scanning ACIP input file: ");
            System.out.println(errors);
            System.out.println("Exiting with " + maxErrors + " or more errors; please fix input file and try again.");
            System.exit(1);
        }
        if (errors.length() > 0) {
            System.out.println("Errors scanning ACIP input file: ");
            System.out.println(errors);
            System.out.println("Exiting; please fix input file and try again.");
            System.exit(1);
        }

        System.out.println("Good scan!");
        System.exit(0);
    }

    /** Scans an ACIP file with path fname into tsheg bars. If errors
     *  is non-null, error messages will be appended to it. Returns a
     *  list of TStrings that is the scan. <p>FIXME: not so
     *  efficient; copies the whole file into memory first.
     *  @throws IOException if we cannot read in the ACIP input file */
    public static ArrayList scanFile(String fname, StringBuffer errors, int maxErrors)
        throws IOException
    {
        return scanStream(new FileInputStream(fname),
                          errors, maxErrors);
    }

    /** Scans a stream of ACIP into tsheg bars. If errors is
     *  non-null, error messages will be appended to it. You can
     *  recover both errors and warnings (modulo offset information)
     *  from the result, though. Returns a list of TStrings that
     *  is the scan, or null if more than maxErrors occur. <p>FIXME:
     *  not so efficient; copies the whole file into memory first.
     *  @throws IOException if we cannot read the whole ACIP stream */
    public static ArrayList scanStream(InputStream stream, StringBuffer errors,
                                       int maxErrors)
        throws IOException
    {
        StringBuffer s = new StringBuffer();
        char ch[] = new char[8192];
        BufferedReader in
            = new BufferedReader(new InputStreamReader(stream, "US-ASCII"));

        int amt;
        while (-1 != (amt = in.read(ch))) {
            s.append(ch, 0, amt);
        }
        in.close();
        return scan(s.toString(), errors, maxErrors);
    }
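
/*
A minimal standalone sketch of the read loop used by scanStream, written
for a newer Java than this file targets (try-with-resources needs Java 7
or later). The class name SlurpSketch and the method slurp are
hypothetical and are not part of this package; the sketch only shows the
"copy the whole stream into memory as US-ASCII" step that the FIXME
above refers to.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class SlurpSketch {
    // Reads the whole stream into memory, decoding it as US-ASCII.
    // Hypothetical helper; mirrors the buffer size used by scanStream.
    public static String slurp(InputStream stream) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(stream, "US-ASCII"))) {
            int amt;
            while ((amt = in.read(buf)) != -1) {
                sb.append(buf, 0, amt);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A tiny in-memory stand-in for an ACIP file.
        byte[] acip = "KA KHA\nGA".getBytes("US-ASCII");
        String text = slurp(new ByteArrayInputStream(acip));
        System.out.println(text.length()); // prints 9
    }
}
```

Wrapping the reader in a try block closes it even when a read fails
partway, which the explicit in.close() call above does not do.
*/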

    /** Helper. Here because ACIP {MTHAR%\nKHA} should be treated the
        same w.r.t. tsheg insertion regardless of the lex errors and
        lex warnings found. */
    private static boolean lastNonExceptionalThingWasNonPunctish(ArrayList al) {
        int i = al.size() - 1;
        while (i >= 0 && (((TString)al.get(i)).getType() == TString.WARNING
                          || ((TString)al.get(i)).getType() == TString.ERROR))
            --i;
        return (i >= 0 && // FIXME: or maybe i < 0 || ...
                (((TString)al.get(i)).getType() == TString.TIBETAN_NON_PUNCTUATION
                 || ((TString)al.get(i)).getType() == TString.TSHEG_BAR_ADORNMENT));
    }
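
/*
A minimal standalone sketch of the backward skip performed by the helper
above. The integer constants are hypothetical stand-ins for the TString
type codes (their real values are not shown in this file), and the class
LastRealTokenSketch is not part of this package. The idea: walk back
past ERROR and WARNING entries, then test the first "real" token found.

```java
import java.util.Arrays;
import java.util.List;

public class LastRealTokenSketch {
    // Hypothetical stand-ins for TString type codes.
    public static final int ERROR = 0;
    public static final int WARNING = 1;
    public static final int TIBETAN_NON_PUNCTUATION = 2;
    public static final int TSHEG_BAR_ADORNMENT = 3;
    public static final int TIBETAN_PUNCTUATION = 4;

    // True if the last token that is neither an error nor a warning
    // is Tibetan non-punctuation or a tsheg bar adornment.
    public static boolean lastRealTokenIsNonPunctish(List<Integer> types) {
        int i = types.size() - 1;
        while (i >= 0 && (types.get(i) == ERROR || types.get(i) == WARNING))
            --i;
        return i >= 0
            && (types.get(i) == TIBETAN_NON_PUNCTUATION
                || types.get(i) == TSHEG_BAR_ADORNMENT);
    }

    public static void main(String[] args) {
        // Trailing error and warning entries are ignored.
        System.out.println(lastRealTokenIsNonPunctish(
            Arrays.asList(TIBETAN_NON_PUNCTUATION, ERROR, WARNING)));
        // Punctuation as the last real token gives false.
        System.out.println(lastRealTokenIsNonPunctish(
            Arrays.asList(TIBETAN_NON_PUNCTUATION, TIBETAN_PUNCTUATION, WARNING)));
        // A list holding only errors gives false, matching the i >= 0 test.
        System.out.println(lastRealTokenIsNonPunctish(Arrays.asList(ERROR)));
    }
}
```

Because errors and warnings are skipped, the answer depends only on the
surrounding Tibetan tokens, which is the property the helper's comment
asks for.
*/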

    /** Returns a list of {@link TString TStrings} corresponding
     *  to s, possibly the empty list (when the empty string is the
     *  input). Each String is either a Latin comment, some Latin
     *  text, a tsheg bar (minus the tsheg or shad or whatever), a
     *  String of inter-tsheg-bar punctuation, etc.
     *
     *  <p>This not only scans; it finds all the errors and warnings a
     *  parser would too, like "NYA x" and "(" and ")" and "/NYA" etc.
     *  It puts those in as TStrings with type {@link
     *  TString#ERROR} or {@link TString#WARNING}, and also, if
     *  errors is non-null, appends helpful messages to errors, each
     *  followed by a '\n'.
     *  @param s the ACIP text
     *  @param errors if non-null, the buffer to which to append error
     *  messages (FIXME: kludge, just get this info by scanning
     *  the result for TString.ERROR (and maybe TString.WARNING,
     *  if you care about warnings), but then we'd have to put the
     *  Offset info in the TString)
     *  @param maxErrors if nonnegative, then scanning will stop when
     *  more than maxErrors errors occur. In this event, null is
     *  returned.
     *  @return null if more than maxErrors errors occur, or the scan
     *  otherwise
     */
    public static ArrayList scan(String s, StringBuffer errors, int maxErrors) {
        int numErrors = 0;

        // the size depends on whether it's mostly Tibetan or mostly
        // Latin and a number of other factors. This is meant to be
        // an underestimate, but not too much of an underestimate.
        ArrayList al = new ArrayList(s.length() / 10);

        boolean waitingForMatchingIllegalClose = false;
        int sl = s.length();
        int currentType = TString.ERROR;
        int startOfString = 0;
        Stack bracketTypeStack = new Stack();
        int startSlashIndex = -1;
        int startParenIndex = -1;
        int numNewlines = 0;
        for (int i = 0; i < sl; i++) {
            if (i < startOfString) throw new Error("bad reset");
            char ch;
            ch = s.charAt(i);
            if (ch == '\n') ++numNewlines;
            if (TString.COMMENT == currentType && ch != ']') {
                if ('[' == ch) {
                    al.add(new TString("Found an open bracket within a [#COMMENT]-style comment. Brackets may not appear in comments.\n",
                                       TString.ERROR));
                    if (null != errors)
                        errors.append("Offset " + i + ((numNewlines == 0) ? "" : (" or maybe " + (i - numNewlines))) + ": "
                                      + "Found an open bracket within a [#COMMENT]-style comment. Brackets may not appear in comments.\n");
                    if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
                }
                continue;
            }
            switch (ch) {
            case '}': // fall through...
            case ']':
                if (bracketTypeStack.empty()) {
                    // Error.
                    if (startOfString < i) {
                        al.add(new TString(s.substring(startOfString, i),
                                           currentType));
                    }
                    if (!waitingForMatchingIllegalClose) {
                        al.add(new TString("Found a truly unmatched close bracket, " + s.substring(i, i + 1),
                                           TString.ERROR));
                        if (null != errors) {
                            errors.append("Offset " + i + ((numNewlines == 0) ? "" : (" or maybe " + (i - numNewlines))) + ": "
                                          + "Found a truly unmatched close bracket, ] or }.\n");
                        }
                        if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
                    }
                    waitingForMatchingIllegalClose = false;
                    al.add(new TString("Found a closing bracket without a matching open bracket. Perhaps a [#COMMENT] incorrectly written as [COMMENT], or a [*CORRECTION] written incorrectly as [CORRECTION], caused this.",
                                       TString.ERROR));
                    if (null != errors)
                        errors.append("Offset " + i + ((numNewlines == 0) ? "" : (" or maybe " + (i - numNewlines))) + ": "
                                      + "Found a closing bracket without a matching open bracket. Perhaps a [#COMMENT] incorrectly written as [COMMENT], or a [*CORRECTION] written incorrectly as [CORRECTION], caused this.\n");
                    if (maxErrors >= 0 && ++numErrors >= maxErrors) return null;
                    startOfString = i + 1;
                    currentType = TString.ERROR;
                } else {
                    int stackTop = ((Integer)bracketTypeStack.pop()).intValue();

                    int end = startOfString;
                    if (TString.CORRECTION_START == stackTop) {

                        // This definitely indicates a new token.
                        char prevCh = s.charAt(i - 1);
                        if (prevCh == '?')
                            end = i - 1;
                        else
                            end = i;
                        if (startOfString < end) {
                            al.add(new TString(s.substring(startOfString, end),
                                               currentType));
                        }

                        if ('?' != prevCh) {
                            currentType = TString.PROBABLE_CORRECTION;
                        } else {
                            currentType = TString.POSSIBLE_CORRECTION;
                        }
                    }
                    al.add(new TString(s.substring(end, i + 1), currentType));
                    startOfString = i + 1;
                    currentType = TString.ERROR;
                }
                break; // end ']','}' case

            case '{': // NOTE WELL: KX0016I.ACT, KD0095M.ACT, and a
                      // host of other ACIP files use {} brackets like
                      // [] brackets. I treat both the same.

                // fall through...
            case '[':
                // This definitely indicates a new token.
                if (startOfString < i) {
                    al.add(new TString(s.substring(startOfString, i),
                                       currentType));
                    startOfString = i;
                    currentType = TString.ERROR;
                }
                String thingy = null;

                if (i + "[DD]".length() <= sl
                    && (s.substring(i, i + "[DD]".length()).equals("[DD]")
                        || s.substring(i, i + "[DD]".length()).equals("{DD}"))) {
                    thingy = "[DD]";
                    currentType = TString.DD;
                } else if (i + "[DD1]".length() <= sl
                           && (s.substring(i, i + "[DD1]".length()).equals("[DD1]")
                               || s.substring(i, i + "[DD1]".length()).equals("{DD1}"))) {
                    thingy = "[DD1]";
                    currentType = TString.DD;
                } else if (i + "[DD2]".length() <= sl
                           && (s.substring(i, i + "[DD2]".length()).equals("[DD2]")
                               || s.substring(i, i + "[DD2]".length()).equals("{DD2}"))) {
                    thingy = "[DD2]";
                    currentType = TString.DD;
                } else if (i + "[DDD]".length() <= sl
                           && (s.substring(i, i + "[DDD]".length()).equals("[DDD]")
                               || s.substring(i, i + "[DDD]".length()).equals("{DDD}"))) {
                    thingy = "[DDD]";
                    currentType = TString.DD;
                } else if (i + "[DR]".length() <= sl
                           && (s.substring(i, i + "[DR]".length()).equals("[DR]")
                               || s.substring(i, i + "[DR]".length()).equals("{DR}"))) {
                    thingy = "[DR]";
                    currentType = TString.DR;
                } else if (i + "[LS]".length() <= sl
                           && (s.substring(i, i + "[LS]".length()).equals("[LS]")
                               || s.substring(i, i + "[LS]".length()).equals("{LS}"))) {
                    thingy = "[LS]";
                    currentType = TString.LS;
                } else if (i + "[BP]".length() <= sl
                           && (s.substring(i, i + "[BP]".length()).equals("[BP]")
                               || s.substring(i, i + "[BP]".length()).equals("{BP}"))) {
                    thingy = "[BP]";
                    currentType = TString.BP;
                } else if (i + "[BLANK PAGE]".length() <= sl
                           && (s.substring(i, i + "[BLANK PAGE]".length()).equals("[BLANK PAGE]")
                               || s.substring(i, i + "[BLANK PAGE]".length()).equals("{BLANK PAGE}"))) {
                    thingy = "[BLANK PAGE]";
                    currentType = TString.BP;
                } else if (i + "[ BP ]".length() <= sl
                           && (s.substring(i, i + "[ BP ]".length()).equals("[ BP ]")
                               || s.substring(i, i + "[ BP ]".length()).equals("{ BP }"))) {
                    thingy = "{ BP }"; // found in TD3790E2.ACT
                    currentType = TString.BP;
                } else if (i + "[ DD ]".length() <= sl
                           && (s.substring(i, i + "[ DD ]".length()).equals("[ DD ]")
                               || s.substring(i, i + "[ DD ]".length()).equals("{ DD }"))) {
                    thingy = "{ DD }"; // found in TD3790E2.ACT
                    currentType = TString.DD;
                } else if (i + "[?]".length() <= sl
                           && (s.substring(i, i + "[?]".length()).equals("[?]")
                               || s.substring(i, i + "[?]".length()).equals("{?}"))) {
                    thingy = "[?]";
                    currentType = TString.QUESTION;
                } else {
                    // We see comments appear not as [#COMMENT], but
                    // as [COMMENT] sometimes. We make special cases
                    // for some English comments. There's no need to
                    // make this mechanism extensible, because you
                    // can easily edit the ACIP text so that it uses
                    // [#COMMENT] notation instead of [COMMENT].

                    String[] englishComments = new String[] {
                        "FIRST", "SECOND", // S5274I.ACT
                        "Additional verses added by Khen Rinpoche here are", // S0216M.ACT
                        "ADDENDUM: The text of", // S0216M.ACT
                        "END OF ADDENDUM", // S0216M.ACT
                        "Some of the verses added here by Khen Rinpoche include:", // S0216M.ACT
                        "Note that, in the second verse, the {YUL LJONG} was orignally {GANG LJONG},\nand is now recited this way since the ceremony is not only taking place in Tibet.", // S0216M.ACT
                        "Note that, in the second verse, the {YUL LJONG} was orignally {GANG LJONG},\r\nand is now recited this way since the ceremony is not only taking place in Tibet.", // S0216M.ACT
                        "text missing", // S6954E1.ACT
                        "INCOMPLETE", // TD3817I.INC
                        "MISSING PAGE", // S0935m.act
                        "MISSING FOLIO", // S0975I.INC
                        "UNCLEAR LINE", // S0839D1I.INC
                        "THE FOLLOWING TEXT HAS INCOMPLETE SECTIONS, WHICH ARE ON ORDER", // SE6260A.INC
                        "@DATA INCOMPLETE HERE", // SE6260A.INC
                        "@DATA MISSING HERE", // SE6260A.INC
                        "LINE APPARENTLY MISSING THIS PAGE", // TD4035I.INC
                        "DATA INCOMPLETE HERE", // TD4226I2.INC
                        "DATA MISSING HERE", // just being consistent
                        "FOLLOWING SECTION WAS NOT AVAILABLE WHEN THIS EDITION WAS\nPRINTED, AND IS SUPPLIED FROM ANOTHER, PROBABLY THE ORIGINAL:", // S0018N.ACT
                        "FOLLOWING SECTION WAS NOT AVAILABLE WHEN THIS EDITION WAS\r\nPRINTED, AND IS SUPPLIED FROM ANOTHER, PROBABLY THE ORIGINAL:", // S0018N.ACT
                        "THESE PAGE NUMBERS RESERVED IN THIS EDITION FOR PAGES\nMISSING FROM ORIGINAL ON WHICH IT WAS BASED", // S0018N.ACT
                        "THESE PAGE NUMBERS RESERVED IN THIS EDITION FOR PAGES\r\nMISSING FROM ORIGINAL ON WHICH IT WAS BASED", // S0018N.ACT
                        "PAGE NUMBERS RESERVED FROM THIS EDITION FOR MISSING\nSECTION SUPPLIED BY PRECEDING", // S0018N.ACT
                        "PAGE NUMBERS RESERVED FROM THIS EDITION FOR MISSING\r\nSECTION SUPPLIED BY PRECEDING", // S0018N.ACT
                        "SW: OK", // S0057M.ACT
                        "m:ok", // S0057M.ACT
                        "A FIRST ONE\nMISSING HERE?", // S0057M.ACT
                        "A FIRST ONE\r\nMISSING HERE?", // S0057M.ACT
                        "THE INITIAL PART OF THIS TEXT WAS INPUT BY THE SERA MEY LIBRARY IN\nTIBETAN FONT AND NEEDS TO BE REDONE BY DOUBLE INPUT", // S0195A1.INC
                        "THE INITIAL PART OF THIS TEXT WAS INPUT BY THE SERA MEY LIBRARY IN\r\nTIBETAN FONT AND NEEDS TO BE REDONE BY DOUBLE INPUT", // S0195A1.INC
                    };
                    boolean foundOne = false;
                    for (int ec = 0; ec < englishComments.length; ec++) {
                        // NOTE: both alternatives below compare against the
                        // same "[...]" form, so the second is redundant as
                        // written.
                        if (i + 2 + englishComments[ec].length() <= sl
                            && (s.substring(i, i + 2 + englishComments[ec].length()).equals("[" + englishComments[ec] + "]")
                                || s.substring(i, i + 2 + englishComments[ec].length()).equals("[" + englishComments[ec] + "]"))) {
                            al.add(new TString("[#" + englishComments[ec] + "]",
                                               TString.COMMENT));
                            startOfString = i + 2 + englishComments[ec].length();
                            i = startOfString - 1;
                            foundOne = true;
                            break;
                        }
                    }
                    if (!foundOne && i + 1 < sl && s.charAt(i + 1) == '*') {
                        // Identify [*LINE BREAK?] as an English
                        // correction. Every correction not on this
                        // list is considered to be Tibetan.
                        // FIXME: make this extensible via a config
                        // file or at least a System property (which
                        // could be a comma-separated list of these
                        // creatures).

                        // If "LINE" is in the list below, then [*
                        // LINE], [* LINE?], [*LINE], [*LINE?], [*
                        // LINE OUT ?], etc. will be considered
                        // English corrections. I.e., whitespace
                        // before and anything after doesn't prevent a
                        // match.
                        String[] englishCorrections = new String[] {
                            "LINE", // KD0001I1.ACT
                            "DATA", // KL0009I2.INC
                            "BLANK", // KL0009I2.INC
                            "NOTE", // R0001F.ACM
                            "alternate", // R0018F.ACE
                            "02101-02150 missing", // R1003A3.INC
                            "51501-51550 missing", // R1003A52.ACT
                            "BRTAGS ETC", // S0002N.ACT
                            "TSAN, ETC", // S0015N.ACT
                            "SNYOMS, THROUGHOUT", // S0016N.ACT
                            "KYIS ETC", // S0019N.ACT
                            "MISSING", // S0455M.ACT
                            "this", // S6850I1B.ALT
                            "THIS", // S0057M.ACT
                        };
int begin ;
for ( begin = i + 2 ; begin < sl ; begin + + ) {
if ( ! isWhitespace ( s . charAt ( begin ) ) )
break ;
}
int end ;
for ( end = i + 2 ; end < sl ; end + + ) {
if ( s . charAt ( end ) = = ']' )
break ;
}
int realEnd = end ;
if ( end < sl & & s . charAt ( end - 1 ) = = '?' )
- - realEnd ;
if ( end < sl & & begin < realEnd ) {
String interestingSubstring
= s . substring ( begin , realEnd ) ;
for ( int ec = 0 ; ec < englishCorrections . length ; ec + + ) {
if ( interestingSubstring . startsWith ( englishCorrections [ ec ] ) ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( i , i + 2 ) ,
TString . CORRECTION_START ) ) ;
al . add ( new TString ( s . substring ( i + 2 , realEnd ) ,
TString . LATIN ) ) ;
2003-08-17 01:45:55 +00:00
if ( s . charAt ( end - 1 ) = = '?' ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( end - 1 , end + 1 ) ,
TString . POSSIBLE_CORRECTION ) ) ;
2003-08-17 01:45:55 +00:00
} else {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( end , end + 1 ) ,
TString . PROBABLE_CORRECTION ) ) ;
2003-08-17 01:45:55 +00:00
}
foundOne = true ;
startOfString = end + 1 ;
i = startOfString - 1 ;
break ;
}
}
}
}
2003-08-16 16:13:53 +00:00
if ( foundOne )
break ;
2003-08-14 05:10:47 +00:00
}
if ( null ! = thingy ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( thingy ,
currentType ) ) ;
2003-08-14 05:10:47 +00:00
startOfString = i + thingy . length ( ) ;
i = startOfString - 1 ;
} else {
if ( i + 1 < sl ) {
char nextCh = s . charAt ( i + 1 ) ;
if ( '*' = = nextCh ) {
2003-10-04 01:22:59 +00:00
currentType = TString . CORRECTION_START ;
2003-08-14 05:10:47 +00:00
bracketTypeStack . push ( new Integer ( currentType ) ) ;
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( i , i + 2 ) ,
TString . CORRECTION_START ) ) ;
currentType = TString . ERROR ;
2003-08-17 01:45:55 +00:00
startOfString = i + 2 ;
i = startOfString - 1 ;
2003-08-14 05:10:47 +00:00
break ;
} else if ( '#' = = nextCh ) {
2003-10-04 01:22:59 +00:00
currentType = TString . COMMENT ;
2003-08-14 05:10:47 +00:00
bracketTypeStack . push ( new Integer ( currentType ) ) ;
break ;
}
}
2003-08-17 01:45:55 +00:00
// This is an error. Sometimes [COMMENTS APPEAR
// WITHOUT # MARKS]. Though "... [" could cause
// this too.
if ( waitingForMatchingIllegalClose ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " Found a truly unmatched open bracket, [ or {, prior to this current illegal open bracket. " ,
TString . ERROR ) ) ;
2003-08-17 01:45:55 +00:00
if ( null ! = errors ) {
2003-10-19 20:16:06 +00:00
errors . append ( " Offset " + i + ( ( numNewlines = = 0 ) ? " " : ( " or maybe " + ( i - numNewlines ) ) ) + " : "
2003-08-17 01:45:55 +00:00
+ " Found a truly unmatched open bracket, [ or {, prior to this current illegal open bracket. \ n " ) ;
}
2003-08-17 02:12:49 +00:00
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
2003-08-17 01:45:55 +00:00
}
waitingForMatchingIllegalClose = true ;
2003-08-16 16:13:53 +00:00
{
String inContext = s . substring ( i , i + Math . min ( sl - i , 10 ) ) ;
2003-08-17 01:45:55 +00:00
if ( inContext . indexOf ( " \ r " ) > = 0 ) {
inContext = inContext . substring ( 0 , inContext . indexOf ( " \ r " ) ) ;
} else if ( inContext . indexOf ( " \ n " ) > = 0 ) {
inContext = inContext . substring ( 0 , inContext . indexOf ( " \ n " ) ) ;
} else {
if ( sl - i > 10 ) {
inContext = inContext + " ... " ;
}
2003-08-16 16:13:53 +00:00
}
2003-10-04 01:22:59 +00:00
// Add the ERROR token and count the error even when the
// caller passes a null errors buffer, as the other error
// sites in this method do.
al . add ( new TString ( " Found an illegal open bracket (in context, this is " + inContext + " ). Perhaps there is a [#COMMENT] written incorrectly as [COMMENT], or a [*CORRECTION] written incorrectly as [CORRECTION], or an unmatched open bracket? " ,
TString . ERROR ) ) ;
2003-10-19 20:16:06 +00:00
if ( null ! = errors )
errors . append ( " Offset " + i + ( ( numNewlines = = 0 ) ? " " : ( " or maybe " + ( i - numNewlines ) ) ) + " : "
2003-08-17 01:45:55 +00:00
+ " Found an illegal open bracket (in context, this is " + inContext + " ). Perhaps there is a [#COMMENT] written incorrectly as [COMMENT], or a [*CORRECTION] written incorrectly as [CORRECTION], or an unmatched open bracket? \ n " ) ;
2003-08-17 02:12:49 +00:00
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
2003-08-16 16:13:53 +00:00
}
2003-08-14 05:10:47 +00:00
startOfString = i + 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-14 05:10:47 +00:00
}
2003-08-16 16:13:53 +00:00
break ; // end '[','{' case
2003-08-14 05:10:47 +00:00
case '@' :
// This definitely indicates a new token.
if ( startOfString < i ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
2003-08-14 05:10:47 +00:00
startOfString = i ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-14 05:10:47 +00:00
}
2003-08-17 01:45:55 +00:00
// We look for {@N{AB}, @NN{AB}, ..., @NNNNNN{AB}},
// {@[N{AB}], @[NN{AB}], ..., @[NNNNNN{AB}]},
// {@N{AB}.N, @NN{AB}.N, ..., @NNNNNN{AB}.N}, {@N,
// @NN, ..., @NNNNNN}, and {@{AB}N, @{AB}NN,
// ... @{AB}NNNNNN} only, that is from one to six
// digits. Each of these folio marker formats occurs
// in practice.
for ( int numdigits = 6 ; numdigits > = 1 ; numdigits - - ) {
// @NNN{AB} and @NNN{AB}.N cases:
2003-08-14 05:10:47 +00:00
if ( i + numdigits + 1 < sl
& & ( s . charAt ( i + numdigits + 1 ) = = 'A' | | s . charAt ( i + numdigits + 1 ) = = 'B' ) ) {
boolean allAreNumeric = true ;
for ( int k = 1 ; k < = numdigits ; k + + ) {
if ( ! isNumeric ( s . charAt ( i + k ) ) ) {
allAreNumeric = false ;
break ;
}
}
2003-08-17 01:45:55 +00:00
if ( allAreNumeric ) {
// Is this "@012B " or "@012B.3 "?
int extra ;
if ( i + numdigits + 2 < sl & & s . charAt ( i + numdigits + 2 ) = = '.' ) {
if ( ! ( i + numdigits + 4 < sl & & isNumeric ( s . charAt ( i + numdigits + 3 ) )
& & ! isNumeric ( s . charAt ( i + numdigits + 4 ) ) ) ) {
String inContext = s . substring ( i , i + Math . min ( sl - i , 10 ) ) ;
if ( inContext . indexOf ( " \ r " ) > = 0 ) {
inContext = inContext . substring ( 0 , inContext . indexOf ( " \ r " ) ) ;
} else if ( inContext . indexOf ( " \ n " ) > = 0 ) {
inContext = inContext . substring ( 0 , inContext . indexOf ( " \ n " ) ) ;
} else {
if ( sl - i > 10 ) {
inContext = inContext + " ... " ;
}
}
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " Found an illegal at sign, @ (in context, this is " + inContext + " ). This folio marker has a period, '.', at the end of it, which is illegal. " ,
TString . ERROR ) ) ;
2003-08-17 01:45:55 +00:00
if ( null ! = errors )
2003-10-19 20:16:06 +00:00
errors . append ( " Offset " + i + ( ( numNewlines = = 0 ) ? " " : ( " or maybe " + ( i - numNewlines ) ) ) + " : "
2003-08-17 01:45:55 +00:00
+ " Found an illegal at sign, @ (in context, this is " + inContext + " ). This folio marker has a period, '.', at the end of it, which is illegal. \ n " ) ;
2003-08-17 02:12:49 +00:00
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
2003-08-17 01:45:55 +00:00
startOfString = i + numdigits + 3 ;
i = startOfString - 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-17 01:45:55 +00:00
break ;
}
if ( i + numdigits + 4 < sl & & ( s . charAt ( i + numdigits + 4 ) = = '.' | | s . charAt ( i + numdigits + 4 ) = = 'A' | | s . charAt ( i + numdigits + 4 ) = = 'B' | | s . charAt ( i + numdigits + 4 ) = = 'a' | | s . charAt ( i + numdigits + 4 ) = = 'b' | | isNumeric ( s . charAt ( i + numdigits + 4 ) ) ) ) {
String inContext = s . substring ( i , i + Math . min ( sl - i , 10 ) ) ;
if ( inContext . indexOf ( " \ r " ) > = 0 ) {
inContext = inContext . substring ( 0 , inContext . indexOf ( " \ r " ) ) ;
} else if ( inContext . indexOf ( " \ n " ) > = 0 ) {
inContext = inContext . substring ( 0 , inContext . indexOf ( " \ n " ) ) ;
} else {
if ( sl - i > 10 ) {
inContext = inContext + " ... " ;
}
}
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " Found an illegal at sign, @ (in context, this is " + inContext + " ). This folio marker is not followed by whitespace, as is expected. " ,
TString . ERROR ) ) ;
2003-08-17 01:45:55 +00:00
if ( null ! = errors )
2003-10-19 20:16:06 +00:00
errors . append ( " Offset " + i + ( ( numNewlines = = 0 ) ? " " : ( " or maybe " + ( i - numNewlines ) ) ) + " : "
2003-08-17 01:45:55 +00:00
+ " Found an illegal at sign, @ (in context, this is " + inContext + " ). This folio marker is not followed by whitespace, as is expected. \ n " ) ;
2003-08-17 02:12:49 +00:00
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
2003-11-09 01:07:45 +00:00
startOfString = i + 1 ; // FIXME: skip over more? test this code.
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-17 01:45:55 +00:00
break ;
}
extra = 4 ;
} else {
extra = 2 ;
}
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( i , i + numdigits + extra ) ,
TString . FOLIO_MARKER ) ) ;
2003-08-17 01:45:55 +00:00
startOfString = i + numdigits + extra ;
i = startOfString - 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-17 01:45:55 +00:00
break ;
}
}
// @{AB}NNN case:
if ( i + numdigits + 1 < sl
& & ( s . charAt ( i + 1 ) = = 'A' | | s . charAt ( i + 1 ) = = 'B' ) ) {
boolean allAreNumeric = true ;
for ( int k = 1 ; k < = numdigits ; k + + ) {
if ( ! isNumeric ( s . charAt ( i + 1 + k ) ) ) {
allAreNumeric = false ;
break ;
}
}
2003-08-14 05:10:47 +00:00
if ( allAreNumeric ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( i , i + numdigits + 2 ) ,
TString . FOLIO_MARKER ) ) ;
2003-08-14 05:10:47 +00:00
startOfString = i + numdigits + 2 ;
2003-08-16 16:13:53 +00:00
i = startOfString - 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-16 16:13:53 +00:00
break ;
}
}
2003-08-17 01:45:55 +00:00
// @[NNN{AB}] case:
2003-08-16 16:13:53 +00:00
if ( i + numdigits + 3 < sl
& & s . charAt ( i + 1 ) = = '[' & & s . charAt ( i + numdigits + 3 ) = = ']'
& & ( s . charAt ( i + numdigits + 2 ) = = 'A' | | s . charAt ( i + numdigits + 2 ) = = 'B' ) ) {
boolean allAreNumeric = true ;
for ( int k = 1 ; k < = numdigits ; k + + ) {
if ( ! isNumeric ( s . charAt ( i + 1 + k ) ) ) {
allAreNumeric = false ;
break ;
}
}
if ( allAreNumeric ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( i , i + numdigits + 4 ) ,
TString . FOLIO_MARKER ) ) ;
2003-08-16 16:13:53 +00:00
startOfString = i + numdigits + 4 ;
i = startOfString - 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-14 05:10:47 +00:00
break ;
}
}
2003-08-17 01:45:55 +00:00
// This case, @NNN, must come after the @NNN{AB} case.
2003-08-23 22:03:37 +00:00
if ( i + numdigits + 1 < sl & & ( s . charAt ( i + numdigits + 1 ) = = ' '
| | s . charAt ( i + numdigits + 1 ) = = '\n'
| | s . charAt ( i + numdigits + 1 ) = = '\r' ) ) {
2003-08-17 01:45:55 +00:00
boolean allAreNumeric = true ;
for ( int k = 1 ; k < = numdigits ; k + + ) {
if ( ! isNumeric ( s . charAt ( i + k ) ) ) {
allAreNumeric = false ;
break ;
}
}
if ( allAreNumeric ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( i , i + numdigits + 1 ) ,
TString . FOLIO_MARKER ) ) ;
2003-08-17 01:45:55 +00:00
startOfString = i + numdigits + 1 ;
i = startOfString - 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-17 01:45:55 +00:00
break ;
}
}
2003-08-14 05:10:47 +00:00
}
if ( startOfString = = i ) {
2003-08-17 01:45:55 +00:00
String inContext = s . substring ( i , i + Math . min ( sl - i , 10 ) ) ;
if ( inContext . indexOf ( " \ r " ) > = 0 ) {
inContext = inContext . substring ( 0 , inContext . indexOf ( " \ r " ) ) ;
} else if ( inContext . indexOf ( " \ n " ) > = 0 ) {
inContext = inContext . substring ( 0 , inContext . indexOf ( " \ n " ) ) ;
} else {
if ( sl - i > 10 ) {
inContext = inContext + " ... " ;
}
}
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " Found an illegal at sign, @ (in context, this is " + inContext + " ). @012B is an example of a legal folio marker. " ,
TString . ERROR ) ) ;
2003-08-16 16:13:53 +00:00
if ( null ! = errors )
2003-10-19 20:16:06 +00:00
errors . append ( " Offset " + i + ( ( numNewlines = = 0 ) ? " " : ( " or maybe " + ( i - numNewlines ) ) ) + " : "
2003-08-17 01:45:55 +00:00
+ " Found an illegal at sign, @ (in context, this is " + inContext + " ). @012B is an example of a legal folio marker. \ n " ) ;
2003-08-17 02:12:49 +00:00
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
2003-08-14 05:10:47 +00:00
startOfString = i + 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-14 05:10:47 +00:00
}
break ; // end '@' case
case '/' :
// This definitely indicates a new token.
if ( startOfString < i ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
2003-08-14 05:10:47 +00:00
startOfString = i ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-14 05:10:47 +00:00
}
if ( startSlashIndex > = 0 ) {
2003-08-18 02:38:54 +00:00
if ( startSlashIndex + 1 = = i ) {
/ * //NYA\\ appears in ACIP input, and I think
* it means / NYA / . We warn about // for this
2003-11-09 01:07:45 +00:00
* reason . \ \ causes a tsheg - bar error . * /
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " Found //, which could be legal (the Unicode would be \\ u0F3C \\ u0F3D), but is likely in an illegal construct like //NYA \\ \\ . " ,
TString . ERROR ) ) ;
2003-08-18 02:38:54 +00:00
if ( errors ! = null ) {
2003-10-19 20:16:06 +00:00
errors . append ( " Offset " + i + ( ( numNewlines = = 0 ) ? " " : ( " or maybe " + ( i - numNewlines ) ) ) + " : "
2003-08-18 02:38:54 +00:00
+ " Found //, which could be legal (the Unicode would be \\ u0F3C \\ u0F3D), but is likely in an illegal construct like //NYA \\ \\ . \ n " ) ;
}
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
}
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( i , i + 1 ) ,
TString . END_SLASH ) ) ;
2003-08-14 05:10:47 +00:00
startOfString = i + 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-14 05:10:47 +00:00
startSlashIndex = - 1 ;
} else {
startSlashIndex = i ;
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( i , i + 1 ) ,
TString . START_SLASH ) ) ;
2003-08-14 05:10:47 +00:00
startOfString = i + 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-14 05:10:47 +00:00
}
break ; // end '/' case
case '(' :
case ')' :
// This definitely indicates a new token.
if ( startOfString < i ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
2003-08-14 05:10:47 +00:00
startOfString = i ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-14 05:10:47 +00:00
}
2003-08-17 01:45:55 +00:00
// We do not support nesting like (NYA (BA)).
2003-08-14 05:10:47 +00:00
if ( startParenIndex > = 0 ) {
2003-08-16 16:13:53 +00:00
if ( ch = = '(' ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " Found an illegal open parenthesis, (. Nesting of parentheses is not allowed. " ,
TString . ERROR ) ) ;
2003-08-16 16:13:53 +00:00
if ( null ! = errors )
2003-10-19 20:16:06 +00:00
errors . append ( " Offset " + i + ( ( numNewlines = = 0 ) ? " " : ( " or maybe " + ( i - numNewlines ) ) ) + " : "
2003-08-16 16:13:53 +00:00
+ " Found an illegal open parenthesis, (. Nesting of parentheses is not allowed. \ n " ) ;
2003-08-17 02:12:49 +00:00
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
2003-08-16 16:13:53 +00:00
} else {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( i , i + 1 ) , TString . END_PAREN ) ) ;
2003-08-14 05:10:47 +00:00
startParenIndex = - 1 ;
}
startOfString = i + 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-14 05:10:47 +00:00
} else {
2003-08-16 16:13:53 +00:00
if ( ch = = ')' ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " Unexpected closing parenthesis, ), found. " ,
TString . ERROR ) ) ;
2003-08-16 16:13:53 +00:00
if ( null ! = errors )
2003-10-19 20:16:06 +00:00
errors . append ( " Offset " + i + ( ( numNewlines = = 0 ) ? " " : ( " or maybe " + ( i - numNewlines ) ) ) + " : "
2003-08-16 16:13:53 +00:00
+ "Unexpected closing parenthesis, ), found.\n");
2003-08-17 02:12:49 +00:00
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
2003-08-16 16:13:53 +00:00
} else {
2003-08-14 05:10:47 +00:00
startParenIndex = i ;
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( i , i + 1 ) , TString . START_PAREN ) ) ;
2003-08-14 05:10:47 +00:00
}
startOfString = i + 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-14 05:10:47 +00:00
}
2003-08-16 16:13:53 +00:00
break ; // end '(',')' case
case '?' :
2003-08-17 01:45:55 +00:00
if ( bracketTypeStack . empty ( ) | | i + 1 > = sl
| | ( s . charAt ( i + 1 ) ! = ']' & & s . charAt ( i + 1 ) ! = '}' ) ) {
2003-08-16 16:13:53 +00:00
// The tsheg bar ends here; new token.
if ( startOfString < i ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
2003-08-16 16:13:53 +00:00
}
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( i , i + 1 ) ,
TString . QUESTION ) ) ;
2003-08-16 16:13:53 +00:00
startOfString = i + 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-16 16:13:53 +00:00
} // else this is [*TR'A ?] or the like.
break ; // end '?' case
2003-08-14 05:10:47 +00:00
2003-08-16 16:13:53 +00:00
case '.' :
// This definitely indicates a new token.
if ( startOfString < i ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
2003-08-16 16:13:53 +00:00
startOfString = i ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-16 16:13:53 +00:00
}
2003-08-17 01:45:55 +00:00
// . is used for a non-breaking tsheg, such as in
2003-08-24 06:40:53 +00:00
// {NGO.,} and {....,DAM}. We give a warning unless ,
2003-08-17 01:45:55 +00:00
// or ., or a newline, or [A-Za-z] follows '.'.
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( i , i + 1 ) ,
TString . TIBETAN_PUNCTUATION ) ) ;
2003-08-24 06:40:53 +00:00
if ( ! ( i + 1 < sl
& & ( s . charAt ( i + 1 ) = = '.' | | s . charAt ( i + 1 ) = = ','
| | ( s . charAt ( i + 1 ) = = '\r' | | s . charAt ( i + 1 ) = = '\n' )
| | ( s . charAt ( i + 1 ) > = 'a' & & s . charAt ( i + 1 ) < = 'z' )
| | ( s . charAt ( i + 1 ) > = 'A' & & s . charAt ( i + 1 ) < = 'Z' ) ) ) ) {
2003-10-04 01:22:59 +00:00
al.add(new TString("A non-breaking tsheg, '.', appeared, but not like \"...,\" or \".,\" or \".dA\" or \".DA\".",
                   TString.WARNING));
2003-08-16 16:13:53 +00:00
}
2003-08-17 01:45:55 +00:00
startOfString = i + 1 ;
2003-08-16 16:13:53 +00:00
break ; // end '.' case
2003-08-14 05:10:47 +00:00
// Classic tsheg bar enders:
case ' ' :
case '\t' :
case '\r' :
case '\n' :
case ',' :
case '*' :
case ';' :
case '`' :
case '#' :
2003-09-07 16:19:50 +00:00
case '%' :
case 'x' :
case 'o' :
2003-11-11 03:43:11 +00:00
case '^' :
2003-11-30 02:18:59 +00:00
case '&' :
2003-09-04 04:04:21 +00:00
2003-09-07 16:19:50 +00:00
boolean legalTshegBarAdornment = false ;
2003-08-14 05:10:47 +00:00
// The tsheg bar ends here; new token.
if ( startOfString < i ) {
2003-10-04 01:22:59 +00:00
if ( currentType = = TString . TIBETAN_NON_PUNCTUATION
2003-09-07 16:19:50 +00:00
& & isTshegBarAdornment ( ch ) )
legalTshegBarAdornment = true ;
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
2003-08-14 05:10:47 +00:00
}
2003-09-04 04:04:21 +00:00
// Insert a tsheg if necessary. ACIP files aren't careful, so
// "KA\r\n" and "GA\n" appear where
// "KA \r\n" and "GA \n" should appear.
if ( ( '\r' = = ch
2003-09-05 05:08:47 +00:00
| | ( '\n' = = ch & & i > 0 & & s . charAt ( i - 1 ) ! = '\r' ) )
2003-09-04 04:04:21 +00:00
& & ! al . isEmpty ( )
2003-11-11 03:43:11 +00:00
& & lastNonExceptionalThingWasNonPunctish ( al ) ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " " , TString . TIBETAN_PUNCTUATION ) ) ;
2003-09-05 05:08:47 +00:00
}
// "DANG,\nLHAG" is really "DANG, LHAG", so add a space after the comma,
// but not when a blank line follows, as in "MDO,\n\nKA...".
if ( ( '\r' = = ch
| | ( '\n' = = ch & & i > 0 & & s . charAt ( i - 1 ) ! = '\r' ) )
& & ! al . isEmpty ( )
2003-11-11 03:43:11 +00:00
& & lastNonExceptionalThingWasNonPunctish ( al )
2003-10-04 01:22:59 +00:00
& & ( ( TString ) al . get ( al . size ( ) - 1 ) ) . getText ( ) . equals ( " , " )
2003-09-05 05:08:47 +00:00
& & s . charAt ( i - 1 ) = = ','
& & ( i + ( ( '\r' = = ch ) ? 2 : 1 ) < sl
& & ( s . charAt ( i + ( ( '\r' = = ch ) ? 2 : 1 ) ) ! = ch ) ) ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " " , TString . TIBETAN_PUNCTUATION ) ) ;
2003-09-04 04:04:21 +00:00
}
2003-11-11 03:43:11 +00:00
if ( '^' = = ch ) {
// "^ GONG SA" is the same as "^GONG SA" or
// "^\r\nGONG SA". But "^\n\nGONG SA" is
// different -- that has a true line break in the
// output between ^ and GONG. We give an error if
// ^ isn't followed by an alphabetical character.
boolean bad = false ;
if ( i + 1 < sl & & isAlpha ( s . charAt ( i + 1 ) ) ) {
// leave i alone
} else if ( i + 2 < sl & & ( ' ' = = s . charAt ( i + 1 )
| | '\r' = = s . charAt ( i + 1 )
| | '\n' = = s . charAt ( i + 1 ) )
& & isAlpha ( s . charAt ( i + 2 ) ) ) {
+ + i ;
} else if ( i + 3 < sl & & '\r' = = s . charAt ( i + 1 )
& & '\n' = = s . charAt ( i + 2 )
& & isAlpha ( s . charAt ( i + 3 ) ) ) {
i + = 2 ;
} else {
bad = true ;
}
if ( ! bad )
al . add ( new TString ( " ^ " , TString . TIBETAN_PUNCTUATION ) ) ;
else
al . add ( new TString ( " The ACIP {^} must precede a tsheg bar. " , TString . ERROR ) ) ;
} else {
// Don't add in a "\r\n" or "\n" unless there's a
// blank line.
// rn: this '\n' completes "\r\n\r\n", i.e. a blank line with CRLF line endings.
boolean rn = ('\n' == ch && i >= 3
              && s.charAt(i - 3) == '\r' && s.charAt(i - 2) == '\n' && s.charAt(i - 1) == '\r');
// realNewline: this '\n' ends a blank line, in either CRLF or LF form.
boolean realNewline = rn || ('\n' == ch && i >= 1 && s.charAt(i - 1) == '\n');
if (('\n' != ch && '\r' != ch) || realNewline) {
for ( int h = 0 ; h < ( realNewline ? 2 : 1 ) ; h + + ) {
if ( isTshegBarAdornment ( ch ) & & ! legalTshegBarAdornment ) {
al . add ( new TString ( " The ACIP " + ch + " must be glued to the end of a tsheg bar, but this one was not " ,
TString . ERROR ) ) ;
} else {
al . add ( new TString ( rn ? s . substring ( i - 1 , i + 1 ) : s . substring ( i , i + 1 ) ,
( legalTshegBarAdornment
? TString . TSHEG_BAR_ADORNMENT
: TString . TIBETAN_PUNCTUATION ) ) ) ;
}
2003-09-07 16:19:50 +00:00
}
}
2003-11-11 03:43:11 +00:00
if ( '%' = = ch ) {
al . add ( new TString ( " The ACIP {%} is treated by this converter as U+0F35, but sometimes might represent U+0F14 in practice " ,
TString . WARNING ) ) ;
}
2003-11-09 23:15:58 +00:00
}
2003-08-14 05:10:47 +00:00
startOfString = i + 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-11-11 03:43:11 +00:00
break ; // end TIBETAN_PUNCTUATION | TSHEG_BAR_ADORNMENT case
2003-08-14 05:10:47 +00:00
default :
2003-08-16 16:13:53 +00:00
if ( ! bracketTypeStack . empty ( ) ) {
int stackTop = ( ( Integer ) bracketTypeStack . peek ( ) ) . intValue ( ) ;
2003-10-04 01:22:59 +00:00
if ( TString . CORRECTION_START = = stackTop & & '?' = = ch ) {
2003-08-16 16:13:53 +00:00
// allow it through...
break ;
}
}
2003-08-17 02:12:49 +00:00
if ( i + 1 = = sl & & 26 = = ( int ) ch )
2003-11-11 03:43:11 +00:00
// Silently allow the last character to be
// control-Z (sometimes printed as ^Z), which just
// marks end of file.
2003-08-17 02:12:49 +00:00
break ;
2003-08-14 05:10:47 +00:00
if ( ! ( isNumeric ( ch ) | | isAlpha ( ch ) ) ) {
if ( startOfString < i ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( startOfString , i ) ,
currentType ) ) ;
2003-08-14 05:10:47 +00:00
}
2003-08-23 22:03:37 +00:00
if ( ( int ) ch = = 65533 ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " Found an illegal, unprintable character. " ,
TString . ERROR ) ) ;
2003-08-23 22:03:37 +00:00
if ( null ! = errors )
2003-10-19 20:16:06 +00:00
errors . append ( " Offset " + i + ( ( numNewlines = = 0 ) ? " " : ( " or maybe " + ( i - numNewlines ) ) ) + " : "
2003-08-17 01:45:55 +00:00
+ "Found an illegal, unprintable character.\n");
2003-08-23 22:03:37 +00:00
} else if ( '\\' = = ch ) {
2003-11-29 22:56:18 +00:00
int x = - 1 ;
if ( ! ThdlOptions . getBooleanOption ( " thdl.tib.text.disallow.unicode.character.escapes.in.acip " )
& & i + 5 < sl & & 'u' = = s . charAt ( i + 1 ) ) {
try {
if ( ! ( ( x = Integer . parseInt ( s . substring ( i + 2 , i + 6 ) , 16 ) ) > = 0x0000 & & x < = 0xFFFF ) )
x = - 1 ;
} catch ( NumberFormatException e ) {
// Not four hex digits. x stays -1, so the
// backslash is reported as a virama below.
// (FIXME: warn.)
}
}
if ( x > = 0 ) {
al . add ( new TString ( new String ( new char [ ] { ( char ) x } ) ,
TString . UNICODE_CHARACTER ) ) ;
i + = " uXXXX " . length ( ) ;
startOfString = i + 1 ;
break ;
} else {
al . add ( new TString ( " Found a Sanskrit virama, \\ , but the converter currently doesn't treat these properly. Sorry! Please do complain to the maintainers. " ,
TString . ERROR ) ) ;
if ( null ! = errors )
errors . append ( " Offset " + i + ( ( numNewlines = = 0 ) ? " " : ( " or maybe " + ( i - numNewlines ) ) ) + " : "
+ "Found a Sanskrit virama, \\, but the converter currently doesn't treat these properly. Sorry! Please do complain to the maintainers.\n");
}
2003-08-23 22:03:37 +00:00
} else {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " Found an illegal character, " + ch + " , with ordinal " + ( int ) ch + " . " ,
TString . ERROR ) ) ;
2003-08-23 22:03:37 +00:00
if ( null ! = errors )
2003-10-19 20:16:06 +00:00
errors . append ( " Offset " + i + ( ( numNewlines = = 0 ) ? " " : ( " or maybe " + ( i - numNewlines ) ) ) + " : "
2003-08-17 01:45:55 +00:00
+ "Found an illegal character, " + ch + ", with ordinal " + (int) ch + ".\n");
}
2003-08-17 02:12:49 +00:00
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
2003-08-14 05:10:47 +00:00
startOfString = i + 1 ;
2003-10-04 01:22:59 +00:00
currentType = TString . ERROR ;
2003-08-14 05:10:47 +00:00
} else {
// Continue through the loop.
2003-10-04 01:22:59 +00:00
if ( TString . ERROR = = currentType )
currentType = TString . TIBETAN_NON_PUNCTUATION ;
2003-08-14 05:10:47 +00:00
}
break ; // end default case
}
}
if ( startOfString < sl ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( s . substring ( startOfString , sl ) ,
currentType ) ) ;
2003-08-16 16:13:53 +00:00
}
2003-08-17 01:45:55 +00:00
if ( waitingForMatchingIllegalClose ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " UNEXPECTED END OF INPUT " ,
TString . ERROR ) ) ;
2003-08-17 01:45:55 +00:00
if ( null ! = errors ) {
errors . append ( " Offset END: "
+ "Truly unmatched open bracket found.\n");
}
2003-08-17 02:12:49 +00:00
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
2003-08-17 01:45:55 +00:00
}
2003-08-16 16:13:53 +00:00
if ( ! bracketTypeStack . empty ( ) ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " Unmatched open bracket found. A " + ( ( TString . COMMENT = = currentType ) ? " comment " : " correction " ) + " does not terminate. " ,
TString . ERROR ) ) ;
2003-08-16 16:13:53 +00:00
if ( null ! = errors ) {
2003-08-24 06:40:53 +00:00
errors . append ( " Offset END: "
2003-10-04 01:22:59 +00:00
+ "Unmatched open bracket found. A " + ((TString.COMMENT == currentType) ? "comment" : "correction") + " does not terminate.\n");
2003-08-14 05:10:47 +00:00
}
2003-08-17 02:12:49 +00:00
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
2003-08-14 05:10:47 +00:00
}
2003-08-16 16:13:53 +00:00
if ( startSlashIndex > = 0 ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " Slashes are supposed to occur in pairs, but the input had an unmatched '/' character. " ,
TString . ERROR ) ) ;
2003-08-16 16:13:53 +00:00
if ( null ! = errors )
errors . append ( " Offset END: "
+ "Slashes are supposed to occur in pairs, but the input had an unmatched '/' character.\n");
2003-08-17 02:12:49 +00:00
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
2003-08-16 16:13:53 +00:00
}
if ( startParenIndex > = 0 ) {
2003-10-04 01:22:59 +00:00
al . add ( new TString ( " Parentheses are supposed to occur in pairs, but the input had an unmatched parenthesis. " ,
TString . ERROR ) ) ;
2003-08-16 16:13:53 +00:00
if ( null ! = errors )
errors . append ( " Offset END: "
+ "Unmatched open parenthesis, (, found.\n");
2003-08-17 02:12:49 +00:00
if ( maxErrors > = 0 & & + + numErrors > = maxErrors ) return null ;
2003-08-16 16:13:53 +00:00
}
2003-08-14 05:10:47 +00:00
return al ;
}
2003-08-16 16:13:53 +00:00
2003-08-14 05:10:47 +00:00
/** Returns true iff ch is an ASCII digit, '0' through '9'. */
private static boolean isNumeric ( char ch ) {
return ch > = '0' & & ch < = '9' ;
}
2003-08-17 01:45:55 +00:00
/** Returns true iff ch is a space, tab, carriage return, or line feed. */
private static boolean isWhitespace ( char ch ) {
return ch = = ' ' | | ch = = '\t' | | ch = = '\r' | | ch = = '\n' ;
}
2003-09-07 16:19:50 +00:00
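/** Returns true iff ch is one of the ACIP post-adornments '%', 'o', or 'x',
 *  which must be glued to the end of a tsheg bar. */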
private static boolean isTshegBarAdornment ( char ch ) {
return ( ch = = '%' | | ch = = 'o' | | ch = = 'x' ) ;
2003-11-11 03:43:11 +00:00
// ^ is a pre-adornment; these are post-adornments.
2003-09-07 16:19:50 +00:00
}
2003-08-14 05:10:47 +00:00
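/** Returns true iff ch can appear inside an ACIP tsheg bar: an uppercase
 *  letter other than X, Q, or F; one of the lowercase letters m, i, t, h,
 *  d, n, or s; or one of the characters ', :, -, or +. */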
private static boolean isAlpha ( char ch ) {
2003-08-16 16:13:53 +00:00
return ch = = '\'' // 23rd consonant
2003-08-14 05:10:47 +00:00
2003-08-16 16:13:53 +00:00
// combining punctuation, vowels:
2003-08-17 02:38:58 +00:00
| | ch = = 'm'
2003-08-16 16:13:53 +00:00
| | ch = = ':'
2003-11-09 01:07:45 +00:00
// FIXME: '\\' must be treated like a vowel, a special vowel that numerals can take on. Until then, warn. See bug 838588.
// || ch == '\\'
2003-08-16 16:13:53 +00:00
| | ch = = '-'
| | ch = = '+'
2003-08-17 02:38:58 +00:00
|| ((ch >= 'A' && ch <= 'Z') && ch != 'X' && ch != 'Q' && ch != 'F')
|| ch == 'i'
|| ch == 't'
|| ch == 'h'
|| ch == 'd'
|| ch == 'n'
|| ch == 's';
2003-08-14 05:10:47 +00:00
}
}
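
The backslash branch of the scanner's default case accepts a backslash followed by `u` and four hexadecimal digits as a Unicode character escape, and treats any other backslash as an (unsupported) Sanskrit virama. The following is a minimal, standalone sketch of just that recognition step. The class name `UnicodeEscapeSketch` and the method `parseEscape` are hypothetical and are not part of `ACIPTshegBarScanner`. The sketch also differs from the scanner in one respect: it checks each digit itself instead of calling `Integer.parseInt`, so a sign character in the four-digit field is rejected outright.

```java
// Hypothetical illustration only; not part of ACIPTshegBarScanner.
final class UnicodeEscapeSketch {
    /**
     * Returns the UTF-16 code unit named by a backslash, 'u', and four hex
     * digits starting at index i of s, or -1 if no well-formed escape
     * starts there.
     */
    static int parseEscape(String s, int i) {
        // The escape occupies indices i through i + 5 inclusive.
        if (i < 0 || i + 5 >= s.length()) return -1;
        if (s.charAt(i) != '\\' || s.charAt(i + 1) != 'u') return -1;
        int value = 0;
        for (int k = i + 2; k < i + 6; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            // A sign such as '-' is not a digit, so it is rejected here,
            // whereas Integer.parseInt would accept a leading sign.
            if (digit < 0) return -1;
            value = (value << 4) | digit;
        }
        return value;
    }

    public static void main(String[] args) {
        // A literal backslash in the input, followed by u0F40.
        String input = "KA\\u0F40 GA";
        int x = parseEscape(input, 2);
        System.out.println(x >= 0
                ? "escape for U+" + Integer.toHexString(x).toUpperCase()
                : "not an escape");
    }
}
```

A caller that finds a non-negative result would, like the scanner, advance its index past the five characters following the backslash before continuing.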