For the record...

Gary Gregory
Senior Software Engineer
Rocket Software
3340 Peachtree Road, Suite 820 . Atlanta, GA 30326 . USA
Tel: +1.404.760.1560
Email: ggreg...@seagullsoftware.com
Web: seagull.rocketsoftware.com
> -----Original Message-----
> From: F Mue [mailto:webmas...@genealogie-konzepte.net]
> Sent: Tuesday, February 22, 2011 13:58
> To: Gary Gregory
> Subject: Re: [codec] Testing Cologne Phonetic
>
> Hi Gary,
>
> my understanding of applying the algorithm in Wikipedia to the word
> "deutsch":
>
> Step 1: Letter-by-letter coding
> d -> 2
> e -> 0
> u -> 0
> t -> 8 (D, T: before C, S, Z)
> s -> 8
> c -> 8 (C: after S, Z)
> h -> -
>
> Step 2: Removing multiple, consecutive digits
> 200888 -> 208
>
> Step 3: Removing all "0" digits except the leading one
> 208 -> 28
>
> So in my opinion the result of the Apache implementation is correct, and
> the PHP result is wrong.
>
> Well, the result is showing me that I can't trust both PHP
> implementations (magdev.de as well as the implementation I am using in
> my family name dataset - which also produces the result 288). This means
> I either have to rewrite the PHP implementation or wait for a reliable
> Apache Commons implementation :-)
>
>
> Franz
>
>
>
> On 22.02.2011 18:43, Gary Gregory wrote:
> >> -----Original Message-----
> >> From: F Mue [mailto:webmas...@genealogie-konzepte.net]
> >> Sent: Tuesday, February 22, 2011 10:37
> >> To: Gary Gregory
> >> Subject: Re: [codec] Testing Cologne Phonetic
> >>
> >> Hi Gary,
> >>
> >> I don't think a re-write or modification would be a big issue. Of course
> >> I would reuse the skeleton of the old code. The major part is about
> >> going through the algorithm and figuring out what rules to apply in what
> >> order. Maybe I have enough time left in March to do that.
> >>
> >> The real problem probably in my opinion is how to make sure the code is
> >> correct, i.e. find valid test data (including test results) ... the
> >> same problem you have :-)
> > Hi Franz,
> >
> > Yes, that's the problem, finding baseline data!
> >
> > For other encoders, I thought about using a database's SOUNDEX function
> > (for example) to generate some data for comparison. But I do not think any
> > DBs implement the Cologne Phonetic algorithm.
> >
> >> It might be easier for me to try the implementation in
> >> http://www.magdev.de/text_colognephonetic/
> >> But it's alpha code. I can't be sure it's producing correct code. Well,
> >> I could try to compare the results of that implementation to my code
> >> results of the current release 0.11.2 of my family names data...
> > This unit test data works for us except for "deutsch", where we get 28
> > instead of the PHP unit test which expects 288.
> >
> > It looks like a bug in our code. From my reading of the Wikipedia table,
> > the code should indeed be 288. Can you confirm that please?
> >
> > Thank you again,
> > Gary
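
The three steps Franz walks through above for "deutsch" can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the rule table is my reading of the Wikipedia description referenced in this thread, not the Commons Codec or PHP code; the function names (raw_codes, collapse_runs, strip_zeros, cologne_steps) are made up for illustration; and folding umlauts to their base vowels is likewise an assumption.

```python
def raw_codes(word):
    """Step 1: context-sensitive letter-by-letter coding (assumed rule table)."""
    # Assumption: fold umlauts to base vowels and ignore anything outside A-Z.
    folded = word.upper().translate(str.maketrans("ÄÖÜ", "AOU"))
    letters = [ch for ch in folded if "A" <= ch <= "Z"]
    out = []
    for i, ch in enumerate(letters):
        prev = letters[i - 1] if i > 0 else ""
        nxt = letters[i + 1] if i + 1 < len(letters) else ""
        if ch in set("AEIJOUY"):
            code = "0"
        elif ch == "H":
            code = ""  # H gets no code at all
        elif ch == "B":
            code = "1"
        elif ch == "P":
            code = "3" if nxt == "H" else "1"
        elif ch in set("DT"):
            code = "8" if nxt in set("CSZ") else "2"
        elif ch in set("FVW"):
            code = "3"
        elif ch in set("GKQ"):
            code = "4"
        elif ch == "C":
            if i == 0:
                code = "4" if nxt in set("AHKLOQRUX") else "8"
            elif prev in set("SZ"):
                code = "8"
            else:
                code = "4" if nxt in set("AHKOQUX") else "8"
        elif ch == "X":
            code = "8" if prev in set("CKQ") else "48"
        elif ch == "L":
            code = "5"
        elif ch in set("MN"):
            code = "6"
        elif ch == "R":
            code = "7"
        else:  # only S and Z remain
            code = "8"
        out.append(code)
    return "".join(out)


def collapse_runs(digits):
    """Step 2: reduce each run of identical consecutive digits to one digit."""
    out = []
    for d in digits:
        if not out or out[-1] != d:
            out.append(d)
    return "".join(out)


def strip_zeros(digits):
    """Step 3: remove every "0" except one in the leading position."""
    return digits[:1] + digits[1:].replace("0", "")


def cologne_steps(word):
    """Return the intermediate result after each of the three steps."""
    step1 = raw_codes(word)
    step2 = collapse_runs(step1)
    step3 = strip_zeros(step2)
    return step1, step2, step3


print(cologne_steps("deutsch"))
```

With this table, cologne_steps("deutsch") gives 200888, then 208, then 28, which matches the walk-through above rather than the 288 expected by the PHP unit test.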