[CODEC] Beider Morse Phonetic Matching Bug and questions

Michael Tobias Wed, 11 Jun 2014 00:09:26 -0700

Does anybody have working knowledge of the Beider-Morse Phonetic Matching
(BMPM) implementation in the Apache Commons Codec?


 

My recent tests using Solr suggest there is a discrepancy between Steve
Morse and Alexander Beider's algorithm and the implementation currently
live in Solr (and hence in the Commons Codec).

 

I know that the BMPM source code issued by Steve has changed several times
over the years, and I thought at first that the version used in the Commons
Codec might be an old one that has since been superseded. Should the version
of the BMPM algorithm not be listed in the Commons Codec documentation? How
should version changes to the algorithm be implemented? The algorithm is
quite static now, so this is probably less important than it once was, but
surely the version should be documented.

 

My tests now indicate that the discrepancies are NOT a version problem:
testing against a very old version of the BMPM source code (2.00, issued on
18 June 2009) shows the same discrepancies.

 

Even with just a single test term the results are not good. The only saving
grace is that the most widely used configuration is

 

nameType="GENERIC" ruleType="APPROX"

 

and that is a close (but not perfect) match at least for this ONE test word.

 

For the name Abram, all with languageSet="auto"

 

GENERIC APPROX - fails - misses two tokens (Ybram, Ybrom)

Should create tokens: abram abrom avram avrom obram obrom ovram ovrom abran
abron obran obron Ybram Ybrom

Solr creates: abram abrom avram avrom obram obrom ovram ovrom abran abron
obran obron

 

GENERIC EXACT - good!

Should create tokens: abram abran

Solr creates: abram abran

 

ASHKENAZI APPROX: - fails dreadfully!

Should create tokens: abram abrom avram avrom obram obrom ovram ovrom Ybram
Ybrom ombram ombrom imbram imbrom

Solr creates: abrAm AvrAm BbrAm

 

ASHKENAZI EXACT: - good!

Should create tokens: abram

Solr creates: abram

 

SEPHARDIC APPROX: - good!

Should create tokens: abram bram abran bran avram vram

Solr creates: abram bram abran bran avram vram

 

SEPHARDIC EXACT: - good!

Should create tokens: abram abran avram

Solr creates: abram abran avram 
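
The comparisons above can be mechanized with a small set difference. What
follows is only a minimal sketch: the class and method names (TokenDiff,
tokens, missing) are hypothetical and are not part of Solr or the Commons
Codec, and the token lists are simply the GENERIC APPROX lists quoted above.
The comparison is case-sensitive, which matters for tokens such as Ybram.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class TokenDiff {
    /** Splits a whitespace-separated token list into an ordered set. */
    static Set<String> tokens(String list) {
        return new LinkedHashSet<>(Arrays.asList(list.trim().split("\\s+")));
    }

    /** Returns the tokens present in 'expected' but absent from 'actual'. */
    static Set<String> missing(String expected, String actual) {
        Set<String> result = tokens(expected);
        result.removeAll(tokens(actual));
        return result;
    }

    public static void main(String[] args) {
        // GENERIC APPROX lists for "Abram", copied from this message.
        String expected = "abram abrom avram avrom obram obrom ovram ovrom "
                + "abran abron obran obron Ybram Ybrom";
        String solr = "abram abrom avram avrom obram obrom ovram ovrom "
                + "abran abron obran obron";

        System.out.println("missing from Solr: " + missing(expected, solr));
        System.out.println("unexpected in Solr: " + missing(solr, expected));
    }
}
```

For the GENERIC APPROX lists this reports Ybram and Ybrom as missing and
nothing as unexpected.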

 

I would appreciate it if somebody with knowledge of the programming of this
functionality could investigate.

 

For the worst case I include below a debug trace of the calculation of the
Ashkenazi Approx tokens, taken straight from Steve Morse's implementation.
It looks as though some of the final rules are not being applied properly,
or at all. The language codes in square brackets vary from BMPM version to
version, but the resulting tokens have not changed from version 2.00 up to
the current 3.02.

 

Thanks

 

Michael

 

 

 

applying language rules from (rulesany) to abram using languages 2012

char codes = [#61]a [#62]b [#72]r [#61]a [#6d]m

applying rule #225
   pattern=a
   lcontext=
   rcontext=[bcdgkpstwzż]
   subst=(A|B[128])
   result=(A[2012]|B[128])

applying rule #229
   pattern=b
   lcontext=
   rcontext=
   subst=b
   result=(Ab[2012]|Bb[128])

applying rule #245
   pattern=r
   lcontext=
   rcontext=
   subst=r
   result=(Abr[2012]|Bbr[128])

applying rule #228
   pattern=a
   lcontext=
   rcontext=
   subst=A
   result=(AbrA[2012]|BbrA[128])

applying rule #240
   pattern=m
   lcontext=
   rcontext=
   subst=m
   result=(AbrAm[2012]|BbrAm[128])

after language rules: (AbrAm[2012]|BbrAm[128])


applying final rules from (exactapproxcommon plus approxcommon) to AbrAm[2012]
no rules match for phonetic item 0 at position 0: A
no rules match for phonetic item 0 at position 1: Ab
no rules match for phonetic item 0 at position 2: Abr
no rules match for phonetic item 0 at position 3: AbrA
no rules match for phonetic item 0 at position 4: AbrAm

applying final rules from (exactapproxcommon plus approxcommon) to BbrAm[128]
no rules match for phonetic item 1 at position 0: B
no rules match for phonetic item 1 at position 1: Bb
no rules match for phonetic item 1 at position 2: Bbr
no rules match for phonetic item 1 at position 3: BbrA
no rules match for phonetic item 1 at position 4: BbrAm

applying final rules from (approxany) to AbrAm[2012]
after applying final rule #97 to phonetic item #0 at position 0: (a[2012]|o[2012]|Y[16]) pattern=A lcontext= rcontext= subst=(a|o|Y[16])
after applying final rule #0 to phonetic item #0 at position 1: (ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16]) pattern=b lcontext= rcontext= subst=(b|v[1024])
no rules match for phonetic item 0 at position 2: (ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16])r
after applying final rule #93 to phonetic item #0 at position 3: (abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|Ybra[16]|Ybro[16]) pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
no rules match for phonetic item 0 at position 4: (abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|Ybra[16]|Ybro[16])m

applying final rules from (approxany) to BbrAm[128]
after applying final rule #22 to phonetic item #1 at position 0: (o[2012]|om[128]|im[128]) pattern=B lcontext= rcontext=[bp] subst=(o|om[128]|im[128])
after applying final rule #0 to phonetic item #1 at position 1: (ob[2012]|ov[1024]|omb[128]|imb[128]) pattern=b lcontext= rcontext= subst=(b|v[1024])
no rules match for phonetic item 1 at position 2: (ob[2012]|ov[1024]|omb[128]|imb[128])r
after applying final rule #93 to phonetic item #1 at position 3: (obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128]|imbro[128]) pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
no rules match for phonetic item 1 at position 4: (obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128]|imbro[128])m

 

 

 

resulting tokens:


(abram|abrom|avram|avrom|obram|obrom|ovram|ovrom|Ybram|Ybrom|ombram|ombrom|imbram|imbrom)
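
To make the (approxany) step of the trace easier to follow, here is a rough
sketch of how the alternatives for the AbrAm[2012] item appear to combine.
It rests on an assumption inferred only from the trace above, not from the
BMPM source or the Commons Codec: each alternative carries a language
bitmask, appending an alternative ANDs the two masks, and a combination
whose mask becomes 0 is dropped (which would explain why Yv never appears,
since 16 & 1024 = 0). All names here (PhonemeExpand, Phoneme, append,
expandAbrAm) are hypothetical, and untagged alternatives such as b are
modelled as "all languages". Only the AbrAm[2012] branch is modelled; the
BbrAm[128] branch is not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PhonemeExpand {
    /** Mask for an untagged alternative: no language restriction (assumption). */
    static final int ALL = -1;

    /** A phonetic fragment tagged with a language bitmask, e.g. av[1024]. */
    static final class Phoneme {
        final String text;
        final int langs;
        Phoneme(String text, int langs) { this.text = text; this.langs = langs; }
        @Override public String toString() { return text + "[" + langs + "]"; }
    }

    /** Appends every alternative to every prefix, keeping non-empty mask intersections. */
    static List<Phoneme> append(List<Phoneme> prefixes, Phoneme... alternatives) {
        List<Phoneme> out = new ArrayList<>();
        for (Phoneme p : prefixes) {
            for (Phoneme a : alternatives) {
                int langs = p.langs & a.langs;
                if (langs != 0) {
                    out.add(new Phoneme(p.text + a.text, langs));
                }
            }
        }
        return out;
    }

    /** Replays the substitutions the trace lists for AbrAm[2012]. */
    static List<Phoneme> expandAbrAm() {
        List<Phoneme> acc = Arrays.asList(new Phoneme("", 2012));
        acc = append(acc, new Phoneme("a", ALL), new Phoneme("o", ALL), new Phoneme("Y", 16)); // rule #97
        acc = append(acc, new Phoneme("b", ALL), new Phoneme("v", 1024));                      // rule #0
        acc = append(acc, new Phoneme("r", ALL));                                              // no rule matched
        acc = append(acc, new Phoneme("a", ALL), new Phoneme("o", ALL));                       // rule #93
        acc = append(acc, new Phoneme("m", ALL));                                              // no rule matched
        return acc;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (Phoneme p : expandAbrAm()) {
            if (sb.length() > 0) sb.append('|');
            sb.append(p.text);
        }
        // prints abram|abrom|avram|avrom|obram|obrom|ovram|ovrom|Ybram|Ybrom
        System.out.println(sb);
    }
}
```

Under that assumption the sketch yields the first ten of the fourteen
resulting tokens, in the same order as the trace; the remaining four
(ombram, ombrom, imbram, imbrom) come from the BbrAm[128] branch.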
