> -----Original Message-----
> From: Oliver Heger [mailto:oliver.he...@oliver-heger.de]
> Sent: Tuesday, January 25, 2011 15:19
> To: Commons Developers List
> Subject: Re: [codec] Large test data set!
>
> On 25.01.2011 21:01, Gary Gregory wrote:
> > Hi All:
> >
> > I just found a data set that I would like to integrate with [codec] to
> > test the language package:
> >
> > http://sourceforge.net/projects/familynamephon/
> >
> > The test data file contains 837K German names (37MB) in a text file and
> > encodings in Cham (?) phonetics, Cologne phonetics, Metaphone, and Soundex.
> >
> > I have no idea how long it would take to run a test for our language
> > encoders on this but I imagine making it an optional unit test. How do you
> > do THAT in Maven?
> >
> > The data is covered (I think, I do not read German) by this license:
> > http://www.opendatacommons.org/licenses/odbl/1.0/
>
> Being a native German speaker I can confirm that the license is actually
> the Open Database License which can be found at the URL you provided.

Can we include the data file in our tests? The PDF describing the file?

Thank you,
Gary

> Cham phonetics seems to be a special algorithm for encoding names. [1]
> contains more background information about it (unfortunately also in
> German). According to this page the name stems from a region in Bavaria.
> You can find a PHP implementation of this algorithm in [2].
>
> HTH
> Oliver
>
> [1] http://www.genealogie-konzepte.net/chamer-phonetik
> [2] http://www.genealogie-konzepte.net/chamer-phonetik/implementierung
>
> > Thoughts?
> >
> > Gary Gregory
> > Senior Software Engineer
> > Rocket Software
> > 3340 Peachtree Road, Suite 820 * Atlanta, GA 30326 * USA
> > Tel: +1.404.760.1560
> > Email: ggreg...@seagullsoftware.com<mailto:ggreg...@seagullsoftware.com>
> > Web: seagull.rocketsoftware.com<http://www.seagull.rocketsoftware.com/>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org