Does anybody have a working knowledge of the coding of the Beider Morse Phonetic Matching in the Apache Commons Codec?
My recent tests using Solr suggest there is a discrepancy between Steve Morse and Alexander Beider's algorithm and the algorithm currently live in Solr (and hence in the Commons Codec). I know that the source code for BMPM issued by Steve has changed several times over the years, and I thought at first that the version used in the Commons Codec might be an old one that has since been superseded.

Should the version of the BMPM algorithm not be listed in the Commons Codec documentation? How should version changes to the algorithm be implemented? The algorithm is quite static now, so this is probably less important than it once was, but surely it should be DOCUMENTED?

My tests now indicate that the discrepancies are NOT a version problem: testing against a very old version, 2.00 of the BMPM source code issued on 18 June 2009, still exhibits the same problem.

Using just a single test term, the results are not good. The only saving grace is that the most widely used configuration is nameType="GENERIC" ruleType="APPROX", and that is a close (but not perfect) match, at least for this ONE test word.

For the name Abram, all with languageSet="auto":

GENERIC APPROX - fails - misses a few tokens
  Should create tokens: abram abrom avram avrom obram obrom ovram ovrom abran abron obran obron Ybram Ybrom
  Solr creates:         abram abrom avram avrom obram obrom ovram ovrom abran abron obran obron

GENERIC EXACT - good!
  Should create tokens: abram abran
  Solr creates:         abram abran

ASHKENAZI APPROX - fails dreadfully!
  Should create tokens: abram abrom avram avrom obram obrom ovram ovrom Ybram Ybrom ombram ombrom imbram imbrom
  Solr creates:         abrAm AvrAm BbrAm

ASHKENAZI EXACT - good!
  Should create tokens: abram
  Solr creates:         abram

SEPHARDIC APPROX - good!
  Should create tokens: abram bram abran bran avram vram
  Solr creates:         abram bram abran bran avram vram

SEPHARDIC EXACT - good!
  Should create tokens: abram abran avram
  Solr creates:         abram abran avram

I would appreciate it if somebody with knowledge of the programming of this functionality could investigate. For the worst case, I attach here a debug trace of the calculation of the Ashkenazi Approx tokens straight from Steve Morse's implementation. It looks like some of the final rules are not being implemented properly, or at all. The language codes in parentheses vary from BMPM version to version, but the resulting tokens have not changed from version 2.00 up to the current 3.02.

Thanks
Michael

applying language rules from (rulesany) to abram using languages 2012
char codes = [#61]a [#62]b [#72]r [#61]a [#6d]m
applying rule #225 pattern=a lcontext= rcontext=[bcdgkpstwzż] subst=(A|B[128]) result=(A[2012]|B[128])
applying rule #229 pattern=b lcontext= rcontext= subst=b result=(Ab[2012]|Bb[128])
applying rule #245 pattern=r lcontext= rcontext= subst=r result=(Abr[2012]|Bbr[128])
applying rule #228 pattern=a lcontext= rcontext= subst=A result=(AbrA[2012]|BbrA[128])
applying rule #240 pattern=m lcontext= rcontext= subst=m result=(AbrAm[2012]|BbrAm[128])
after language rules: (AbrAm[2012]|BbrAm[128])

applying final rules from (exactapproxcommon plus approxcommon) to AbrAm[2012]
no rules match for phonetic item 0 at position 0: A
no rules match for phonetic item 0 at position 1: Ab
no rules match for phonetic item 0 at position 2: Abr
no rules match for phonetic item 0 at position 3: AbrA
no rules match for phonetic item 0 at position 4: AbrAm

applying final rules from (exactapproxcommon plus approxcommon) to BbrAm[128]
no rules match for phonetic item 1 at position 0: B
no rules match for phonetic item 1 at position 1: Bb
no rules match for phonetic item 1 at position 2: Bbr
no rules match for phonetic item 1 at position 3: BbrA
no rules match for phonetic item 1 at position 4: BbrAm

applying final rules from (approxany) to AbrAm[2012]
after applying final rule #97 to phonetic item #0 at position 0:
  (a[2012]|o[2012]|Y[16])
  pattern=A lcontext= rcontext= subst=(a|o|Y[16])
after applying final rule #0 to phonetic item #0 at position 1:
  (ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16])
  pattern=b lcontext= rcontext= subst=(b|v[1024])
no rules match for phonetic item 0 at position 2:
  (ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16])r
after applying final rule #93 to phonetic item #0 at position 3:
  (abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|Ybra[16]|Ybro[16])
  pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
no rules match for phonetic item 0 at position 4:
  (abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|Ybra[16]|Ybro[16])m

applying final rules from (approxany) to BbrAm[128]
after applying final rule #22 to phonetic item #1 at position 0:
  (o[2012]|om[128]|im[128])
  pattern=B lcontext= rcontext=[bp] subst=(o|om[128]|im[128])
after applying final rule #0 to phonetic item #1 at position 1:
  (ob[2012]|ov[1024]|omb[128]|imb[128])
  pattern=b lcontext= rcontext= subst=(b|v[1024])
no rules match for phonetic item 1 at position 2:
  (ob[2012]|ov[1024]|omb[128]|imb[128])r
after applying final rule #93 to phonetic item #1 at position 3:
  (obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128]|imbro[128])
  pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
no rules match for phonetic item 1 at position 4:
  (obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128]|imbro[128])m

resulting tokens:
  (abram|abrom|avram|avrom|obram|obrom|ovram|ovrom|Ybram|Ybrom|ombram|ombrom|imbram|imbrom)
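
P.S. In case it helps anyone reproduce the comparison, here is a minimal Python sketch that diffs the expected tokens against what Solr produced for each nameType/ruleType pair. The token strings are copied from the lists earlier in this message; the names CASES, diff_tokens and report are my own and purely illustrative. It does not call Solr or the Commons Codec, it only compares the two token sets as unordered sets.

```python
# Expected tokens (reference BMPM implementation) vs. tokens created by Solr
# for the name "Abram" with languageSet="auto", copied from this message.
CASES = {
    ("GENERIC", "APPROX"): (
        "abram abrom avram avrom obram obrom ovram ovrom "
        "abran abron obran obron Ybram Ybrom",
        "abram abrom avram avrom obram obrom ovram ovrom "
        "abran abron obran obron",
    ),
    ("GENERIC", "EXACT"): ("abram abran", "abram abran"),
    ("ASHKENAZI", "APPROX"): (
        "abram abrom avram avrom obram obrom ovram ovrom "
        "Ybram Ybrom ombram ombrom imbram imbrom",
        "abrAm AvrAm BbrAm",
    ),
    ("ASHKENAZI", "EXACT"): ("abram", "abram"),
    ("SEPHARDIC", "APPROX"): (
        "abram bram abran bran avram vram",
        "abram bram abran bran avram vram",
    ),
    ("SEPHARDIC", "EXACT"): ("abram abran avram", "abram abran avram"),
}


def diff_tokens(expected, actual):
    """Return (missing, unexpected) tokens, comparing as unordered sets."""
    exp, act = set(expected.split()), set(actual.split())
    return sorted(exp - act), sorted(act - exp)


def report(cases=CASES):
    """Build one summary line per (nameType, ruleType) pair."""
    lines = []
    for (name_type, rule_type), (expected, actual) in cases.items():
        missing, unexpected = diff_tokens(expected, actual)
        status = "ok" if not missing and not unexpected else "MISMATCH"
        lines.append(
            f"{name_type} {rule_type}: {status} "
            f"missing={missing} unexpected={unexpected}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Run on the data above, this flags only the two failing cases: GENERIC APPROX is missing Ybram and Ybrom, and ASHKENAZI APPROX shares no token at all with the expected set.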