2014-10-27 12:32 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
> Hi Benedikt!
>
> > Just let me know if you need help with the bootstrapping of the new
> > project.
>
> Yes, please :)

I'll give folks some more time to share their thoughts about this and
create the new project then.

> > Maybe we should even announce this on announce@. There may be other
> > projects interested in a library like this (for example Apache Tika [1]).
>
> Good idea! Should we drop a note there once the project has been created
> or after we already have some code in there?

The latter seems appropriate to me.

> Thanks!
> Bruno
>
> From: Benedikt Ritter <brit...@apache.org>
> To: Commons Developers List <dev@commons.apache.org>; Bruno P. Kinoshita
> <brunodepau...@yahoo.com.br>
> Sent: Monday, October 27, 2014 5:45 AM
> Subject: Re: [sandbox] New sandbox component
>
> No objections from my side. I think this is a good idea. Just let me know
> if you need help with the bootstrapping of the new project. Maybe we should
> even announce this on announce@. There may be other projects interested in
> a library like this (for example Apache Tika [1]).
>
> Benedikt
>
> [1] http://tika.apache.org/
>
> 2014-10-27 0:41 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
>
> > Hello all,
> >
> > At the moment I'm working with data matching and record linkage, and had
> > to port some existing string comparison algorithms found in several open
> > source projects (fuzzy-search-tools, simmetrics, lingpipe, [lang], [codec]).
> >
> > At that time I noticed LANG-591 [1], which suggests a more complex
> > levenshtein distance algorithm. There are several other algorithms too
> > (damerau-levenshtein, jaro, jaro-winkler, jaccard, bitap, q-gram, soundex,
> > metaphone). Instead of trying to put them all in, say, [lang], I'd like to
> > experiment with a new [text] component in the sandbox, if there are no
> > objections.
> >
> > I will take a look at the existing code and its license, but most of these
> > algorithms have good Wiki pages with pseudocode available, as well as
> > academic papers.
> > Maybe this component could be useful for other projects like [lang],
> > Lucene, larsga/Duke, and Talend Open Studio. And even though my initial use
> > case for this would be string comparison, I think it could support other
> > use cases too.
> >
> > Thoughts on this? Anyone else interested in such a component?
> >
> > Thanks!
> > Bruno
> >
> > [1] https://issues.apache.org/jira/browse/LANG-591

--
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter