Fixed phonetics code to strip post-suffix d (bug 800167 in SourceForge).

  Implemented (somewhat kludgily) option for phonetics scheme to
  replace e with é iff it is the last letter of the last tsheg bar.
  This is required by the new THDL phonetics spec.
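
A minimal sketch of the final-é rule described above, not the code from this commit. It assumes a word has already been converted to a list of per-tsheg-bar phonetic strings; the function name `accent_final_e` is hypothetical.

```python
def accent_final_e(tsheg_bars):
    """Return a copy of the word in which a word-final plain 'e' becomes 'é'.

    `tsheg_bars` is assumed to hold the phonetic form of each tsheg bar
    of one word, in order. Only the last letter of the last tsheg bar is
    examined; an 'e' anywhere else is left alone.
    """
    result = list(tsheg_bars)
    if not result:
        return result
    last = result[-1]
    if last.endswith("e"):
        result[-1] = last[:-1] + "é"
    return result


if __name__ == "__main__":
    # Word-final e is accented; the e in "ge" of "gegen" is not.
    print("".join(accent_final_e(["dor", "je"])))
    print("".join(accent_final_e(["ge", "gen"])))
```

Because the decision depends on the position of the tsheg bar within the word, it cannot be written as a plain string-replacement rule, which is consistent with the commit handling it as a special option.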

  New algorithm, per new THDL phonetics spec, for ba->wa processing.
  The heuristic is that it applies only to the last tsheg bar in
  multi-tsheg-bar words.  (Previously, ba always generated "?ba/wa?",
  which is maybe more correct but less attractive.)  This heuristic
  fails on, e.g., "tsheg bar".  Oh well.
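
The heuristic above can be sketched as follows. This is an illustration under stated assumptions, not the committed implementation: the function name `apply_ba_to_wa` is hypothetical, the input is assumed to be one word as a list of transliterated tsheg bars, and treating bo like ba is inferred from test entries such as "jo bo > jowo" rather than stated in the commit message.

```python
def apply_ba_to_wa(tsheg_bars):
    """Rewrite initial b as w in the last tsheg bar of a multi-tsheg-bar word.

    Single-tsheg-bar words are returned unchanged. Only a final tsheg bar
    starting with "ba" or "bo" is affected (the "bo" case is an assumption).
    """
    result = list(tsheg_bars)
    if len(result) > 1 and result[-1].startswith(("ba", "bo")):
        result[-1] = "w" + result[-1][1:]
    return result


if __name__ == "__main__":
    print(apply_ba_to_wa(["lha", "sa", "ba"]))  # last tsheg bar is rewritten
    print(apply_ba_to_wa(["ba"]))               # single tsheg bar: unchanged
    print(apply_ba_to_wa(["tsheg", "bar"]))     # the known failure case
```

The last call shows the failure mentioned in the message: "tsheg bar" is a two-tsheg-bar word ending in "bar", so the heuristic rewrites it even though it should not.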

  Rationalized format of phonetics file: > is used as separator in exceptions
  as well as rules.  (Previously, : was used in exceptions only.)
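
With a single separator, one reader can handle both exception and rule lines. The sketch below is an assumption about how such a line could be parsed, not the parser in this commit; `parse_mapping_line` is a hypothetical name. It follows the file comments: a semicolon starts a comment, blank lines are allowed, and whitespace around the separator is not significant (the files contain entries such as "rta mgrin> tamdrin").

```python
def parse_mapping_line(line):
    """Parse 'transliteration > pronunciation ; comment' into a pair.

    Returns None for blank or comment-only lines. Raises ValueError if a
    non-empty line has no '>' separator.
    """
    body = line.split(";", 1)[0].strip()  # drop any trailing comment
    if not body:
        return None
    if ">" not in body:
        raise ValueError("missing '>' separator: %r" % line)
    left, right = body.split(">", 1)
    return left.strip(), right.strip()


if __name__ == "__main__":
    print(parse_mapping_line("rdo rje > dorjé"))
    print(parse_mapping_line("rta mgrin> tamdrin"))
    print(parse_mapping_line("; a comment line"))
```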
a1tsal 2004-02-20 09:37:23 +00:00
parent 3910d355f9
commit 2e9ea92a3a
5 changed files with 136 additions and 72 deletions


@ -96,20 +96,14 @@ x ng
; pronunciation (which may also contain spaces). A semicolon
; precedes a comment. Blank lines are OK.
ba : wa ; mind you, ba (pronounced ba) means cow. But that's much rarer than wa.
bo : wo
ba'i : wa'i
bo'i : wo'i
bar : ?bar/war? ; bar = "middle"; could be either, so supply both and let user sort it out
bor : ?bor/wor? ; bor = "cast away"; could be either, so supply both and let user sort it out
rdo rje : dorjé
mkha' 'gro : khandro
sku mnye : kumnyé
sprul sku : tulku
mtsho rgyal : tsogyèl
rta mgrin: tamdrin
dga' ldan : ganden
dge 'dun : gendün
a mdo : amdo
srid pa : sipa
pad ma : pèma
rdo rje > dorjé
mkha' 'gro > khandro
sku mnye > kumnyé
sprul sku > tulku
mtsho rgyal > tsogyèl
rta mgrin> tamdrin
dga' ldan > ganden
dge 'dun > gendün
a mdo > amdo
srid pa > sipa
pad ma > pèma


@ -40,7 +40,7 @@ klad pa > l
glog > log
le'u > lé'u
pa'ang > pa'ang
ba'i > wa'i
bar ba'i > barwa'i
rta mgrin > tamdrin
; Other tests, to exercise particular rules in the grammar that aren't covered in the rules above


@ -18,16 +18,26 @@
; compactly. For example, it would be difficult to capture the
; effects of preinitial consonants on tone (as in the scheme used
; in Joe Wilson's book, for instance).) Also note that not even the
; whole of the present scheme is implemented using these rules. In
; particular, the deletion of prefix and superscript consonants,
; whole of the present scheme is implemented using these rules. For
; example, the deletion of prefix and superscript consonants,
; and of wa-zur, are done in program code, not using the rules here.
; This makes e come out é only when the last letter in a "word" (*not*
; syllable). Our grammar engine is not nearly powerful enough to do
; this in a clean way.
<?Enable THDL final é kludge?>
; Miscellaneous prefix transformations
g. ; delete this (representing g prefix, used before root y only)
dby y ; must come before db->w, for dbyang
dbr r ; must come before db->w, for dbral
db w ; must come before by->j
; Removal of confusing 'h's
th t
ph p
tsh ts
; c and ch are both transcribed ch. To get this we need a kludge
; (involving x), because the rule c -> ch would apply recursively.
ch c
@ -42,10 +52,10 @@ my ny
; Retroflexes
kr tr
khr thr
khr tr
gr dr
pr tr
phr thr
phr tr
br dr
; Other bad behavior from R
@ -55,7 +65,7 @@ sr s
; Uniquely random case
zl d
; Umlaut of a, o, u followed by d, n, l, s
; Umlaut of a, o, u followed by d, n, l, s, and 'i
; Note: this must be done before suffix-stripping.
; Before actually doing the umlaut, we "hide" the n in ng, so that ng doesn't
; induce umlaut. This is gross; if we had a real grammar engine it wouldn't
@ -65,17 +75,24 @@ ad e
an en
al el
as e
a'i e
od ö
on ön
ol öl
os ö
o'i ö
ud ü
un ün
ul ül
us ü
u'i ü
; restore ng
x ng
; Stripping of 'i from e'i
; (It is stripped from a, o, u by umlaut rules, and from i by vowel-doubling rule.)
e'i e
; Stripping of suffix d, s, and ' from i and e
; Note: this has already been done by the umlaut rules for some cases,
; which don't need to be repeated here.
@ -112,22 +129,27 @@ ub up
; There is one exception per line. Each exception consists of
; the transliteration (which may be several syllables separated
; by spaces), followed by a space, a colon, a space, and the
; by spaces), followed by a space, a greater-than, a space, and the
; pronunciation (which may also contain spaces). A semicolon
; precedes a comment. Blank lines are OK.
ba : wa ; mind you, ba (pronounced ba) means cow. But that's much rarer than wa.
bo : wo
ba'i : wai
bo'i : woi
bar : ?bar/war? ; bar = "middle"; could be either, so supply both and let user sort it out
bor : ?bor/wor? ; bor = "cast away"; could be either, so supply both and let user sort it out
rdo rje : dorje
mkha' 'gro : khandro
sprul sku : tulku
rta mgrin: tamdrin
dga' ldan : ganden
dge 'dun : gendün
a mdo : amdo
blo bzang : lobzang
sbra nag zhol : banakzhöl
mkha' 'gro > khandro
sprul sku > tulku
rta mgrin > tamdrin
dga' ldan > ganden
dge 'dun > gendün
a mdo > amdo
bka' 'gyur > kangyur
rgyu 'bras > gyundré
ngos 'dzin > ngöndzin
chab mdo > chamdo
dpal ldan > penden
dpal 'bar > pembar
rig 'dzin > rindzin
skyabs 'gro > kyamdro
'bri ru > biru
sbra nag zhol > banakzhöl
rdo rje > dorje
o rgyan > orgyen
lha rje > lharjé
rgyal rtse > gyantsé


@ -1,72 +1,120 @@
;
; These examples come from the draft (8/21/03) THDL Phonetics document
; These examples mostly come from the THDL Phonetics document (Jan 2004 draft)
;
lha sa > lhasa
dag pa > dakpa
ring po > ringpo
rin chen > rinchen
lab > lap
dum bu > dumbu
dmar po > marpo
ril bu > rilbu
sa skya pa > sakyapa
blo bzang > lobzang
blo bzang > lozang
rnying ma pa > nyingmapa
rdo rje > dorje
rdo rje > dorjé
dge lugs pa > gelukpa
gzhis ka rtse > zhikatse
mar me > marme
gzhis ka rtse > zhikatsé
mar me > marmé
dge bshes > geshé
bcu > chu
lce > che
rin chen bzang po > rinchenzangpo
gcig pa > chikpa
nag chu > nakchu
bka' rgyud pa > kagyüpa
'phag pa > pakpa
gser thang > sertang
khang tshan > khangtsen
lce > ché
rin chen bzang po > rinchenzangpo
bka' rgyud > kagyü
bsod nams> sönam
thub bstan > thupten
yul > yül
dus tshod > dütsö
bon po > bönpo
sde dge > degé
brgyad > gyé
dge rgan > gegen
ral pa can > relpachen
tshe ring > tsering
byes > jé
bstan 'dzin > tendzin
'jam dpal dbyangs > jampelyang
dge legs > gelek
kha btags > khatak
sngags pa > ngakpa
byang chub > jangchup
thub bstan > tupten
tabs > tap
bka' shag > kashak
sbra nag zhol > banakzhöl
thabs > thap
thabs > tap
lha sa ba > lhasawa
jo bo > jowo
dpa' bo > pawo
spyan ras gzig > chenrezik
gsal bar > selwar
; nga'i deb > ngé dep -- can't do this one, it depends on word segmentation
bar ba > barwa
spyan ras gzig > chenrezik
phyag > chak
sbyin bdag > jindak
smyong > nyong
dmyal ba > nyelwa
sgrol ma > drölma
rten 'brel > tendrel
'bras spungs > drepung
'phrin las > thrinle
dbang > wang
dbral > rel
dbyar kha > yarkha
zla ba > dawa
'phrin las > trinlé
srung ma > sungma
rdzun smra ba > dzünmawa
klad pa > lepa
glog > lok
zla ba > dawa
lha sa > lhasa
lho phyogs > lhochok
lhun grub > lhündrup
dbang > wang
dbyar kha > yarkha
dbral > rel
le'u > leu
khyi'u > khyiu
pa'ang > pang
ba'i > wai
gri'i > dri
'gro ba'i > drowé
rgyal bu'i > gyelbü
rin po che'i > rinpoché
bdag po'i > dakpö
le'u'i > leü
rta mgrin > tamdrin
; Other tests, to exercise particular rules in the grammar that aren't covered in the rules above
g.yon > yön
phyag > chak
bkra shis > trashi
khros ma > thröma
khros ma > tröma
sprul > trül
mri tam ga > mitamga
srid pa > sipa
pad ma > pema
pan chen > penchen
ral pa can > relpachen
thun > thün
thun > tün
dus gsum > düsum
sbed > be
ces > che
pa'i > pai
che'i > chei
gri'i > dri
po'i > poi
le'u'i > leui
rdzogs > dzok
thug pa > thukpa
'debs > dep
sbed > bé
ces > ché
btsan dbang > tsenwang
tshong khang > tsongkhang
rdzong > dzong
stabs > tap
thug pa > tukpa
debs > dep
sib sib > sipsip
lobs pa > loppa
grub > drup
kla col > lachöl
spyan snga ba > chenngawa
sems dpa'i > sempé
bon po'i > bönpö
rdzogs > dzok
; Other random tests
phreng > treng
; Test of second-suffix d removal. Made-up word because I don't know real ones.
rand > ren
; Test that we don't spazz out on single-letter words.
a > a
ai > ai

Binary file not shown.