FW: [codec] Testing Cologne Phonetic

Gary Gregory Tue, 22 Feb 2011 11:45:17 -0800

For the record...

Gary Gregory
Senior Software Engineer
Rocket Software
3340 Peachtree Road, Suite 820 . Atlanta, GA 30326 . USA
Tel: +1.404.760.1560
Email: ggreg...@seagullsoftware.com
Web: seagull.rocketsoftware.com




> -----Original Message-----
> From: F Mue [mailto:webmas...@genealogie-konzepte.net]
> Sent: Tuesday, February 22, 2011 13:58
> To: Gary Gregory
> Subject: Re: [codec] Testing Cologne Phonetic
> 
> Hi Gary,
> 
> My understanding of applying the Wikipedia algorithm to the word
> "deutsch" is as follows:
> 
> Step 1: Letter-by-letter coding
>     d -> 2
>     e -> 0
>     u -> 0
>     t -> 8 (D, T: before C, S, Z)
>     s -> 8
>     c -> 8 (C: after S, Z)
>     h -> -
> 
> Step 2: Removing multiple, consecutive digits
>     200888 -> 208
> 
> Step 3: Removing all "0" digits except a leading one
>     208 -> 28
> 
> So in my opinion the result of the Apache implementation is correct, and
> the PHP result is wrong.
> 
> Well, the result shows me that I can't trust either of the PHP
> implementations (magdev.de or the implementation I am using in my
> family name dataset, which also produces the result 288). This means
> I either have to rewrite the PHP implementation or wait for a reliable
> Apache Commons implementation :-)
> 
> 
> Franz
> 
> 
> 
> On 22.02.2011 18:43, Gary Gregory wrote:
> >> -----Original Message-----
> >> From: F Mue [mailto:webmas...@genealogie-konzepte.net]
> >> Sent: Tuesday, February 22, 2011 10:37
> >> To: Gary Gregory
> >> Subject: Re: [codec] Testing Cologne Phonetic
> >>
> >> Hi Gary,
> >>
> >> I don't think a re-write or modification would be a big issue. Of course
> >> I would reuse the skeleton of the old code. The major part is about
> >> going through the algorithm and figuring out what rules to apply in what
> >> order. Maybe I have enough time left in March to do that.
> >>
> >> The real problem, in my opinion, is probably how to make sure the code
> >> is correct, i.e. finding valid test data (including test results) ...
> >> the same problem you have :-)
> > Hi Franz,
> >
> > Yes, that's the problem, finding baseline data!
> >
> > For other encoders, I thought about using a database's SOUNDEX function
> > (for example) to generate some data for comparison. But I do not think any
> > DBs implement the Cologne Phonetic algorithm.
> >
> >> It might be easier for me to try the implementation in
> >>     http://www.magdev.de/text_colognephonetic/
> >> But it's alpha code, so I can't be sure it produces correct results. Well,
> >> I could try to compare the results of that implementation to my code
> >> results of the current release 0.11.2 of my family names data...
> > This unit test data works for us except for "deutsch", where we get 28
> > instead of the 288 that the PHP unit test expects.
> >
> > It looks like a bug in our code. From my reading of the Wikipedia table,
> > the code should indeed be 288. Can you confirm that, please?
> >
> > Thank you again,
> > Gary
> >
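
For reference, the three steps Franz walks through above can be sketched
in Java roughly as follows. This is only an illustration of the Wikipedia
rule table as I read it, not the Commons Codec code: the class name
ColognePhoneticSketch and the methods codeFor and encode are made up for
this example, and the handling of umlauts and non-letters (fold to
A/O/U, drop everything outside A-Z) is an assumption.

```java
import java.util.Locale;

// Hypothetical sketch of the Wikipedia Cologne Phonetic table; not the Commons Codec class.
final class ColognePhoneticSketch {

    // Step 1: code a single letter. prev/next are the neighbouring letters, or '\0' if none.
    static String codeFor(char prev, char c, char next) {
        switch (c) {
            case 'A': case 'E': case 'I': case 'J': case 'O': case 'U': case 'Y':
                return "0";
            case 'H':
                return "";                                   // H gets no code
            case 'B':
                return "1";
            case 'P':
                return next == 'H' ? "3" : "1";
            case 'D': case 'T':
                return "CSZ".indexOf(next) >= 0 ? "8" : "2"; // 8 before C, S, Z
            case 'F': case 'V': case 'W':
                return "3";
            case 'G': case 'K': case 'Q':
                return "4";
            case 'C':
                if (prev == '\0') {                          // initial C
                    return "AHKLOQRUX".indexOf(next) >= 0 ? "4" : "8";
                }
                if (prev == 'S' || prev == 'Z') {            // C after S, Z
                    return "8";
                }
                return "AHKOQUX".indexOf(next) >= 0 ? "4" : "8";
            case 'X':
                return "CKQ".indexOf(prev) >= 0 ? "8" : "48";
            case 'L':
                return "5";
            case 'M': case 'N':
                return "6";
            case 'R':
                return "7";
            case 'S': case 'Z':
                return "8";
            default:
                return "";
        }
    }

    static String encode(String word) {
        // Assumption: fold umlauts to plain vowels and drop anything that is not A-Z.
        String s = word.toUpperCase(Locale.GERMAN)
                .replace('\u00C4', 'A').replace('\u00D6', 'O').replace('\u00DC', 'U')
                .replaceAll("[^A-Z]", "");

        // Step 1: letter-by-letter coding.
        StringBuilder raw = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char prev = i > 0 ? s.charAt(i - 1) : '\0';
            char next = i + 1 < s.length() ? s.charAt(i + 1) : '\0';
            raw.append(codeFor(prev, s.charAt(i), next));
        }

        // Step 2: collapse runs of the same digit into one.
        StringBuilder collapsed = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            if (i == 0 || raw.charAt(i) != raw.charAt(i - 1)) {
                collapsed.append(raw.charAt(i));
            }
        }

        // Step 3: remove every "0" except one in the leading position.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < collapsed.length(); i++) {
            char d = collapsed.charAt(i);
            if (d != '0' || i == 0) {
                out.append(d);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("deutsch")); // prints 28
    }
}
```

With these rules "deutsch" goes 200888 -> 208 -> 28, which agrees with
Franz's hand calculation rather than with the 288 expected by the PHP
unit test.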

