On 27/10/2014 08:45, Benedikt Ritter wrote:
> No objections from my side. I think this is a good idea. Just let me know
> if you need help with the bootstrapping of the new project. Maybe we should
> even announce this on announce@. There may be other projects interested in a
> library like this (for example Apache Tika [1]).
>
> Benedikt
>
> [1] http://tika.apache.org/
>
> 2014-10-27 0:41 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
>
>> Hello all,
>> At the moment I'm working with data matching and record linkage, and had
>> to port some existing string comparison algorithms found in several open
>> source projects (fuzzy-search-tools, simmetrics, lingpipe, [lang], [codec]).

There is also an implementation of the Myers diff algorithm in [collections],
in the package org.apache.commons.collections4.sequence.

best regards,
Luc

>> At that time I noticed LANG-591 [1], which suggests a more complex
>> Levenshtein distance algorithm. There are several other algorithms too
>> (damerau-levenshtein, jaro, jaro-winkler, jaccard, bitap, q-gram, soundex,
>> metaphone). Instead of trying to put them all in, say, [lang], I'd like to
>> experiment with a new [text] component in the sandbox, if there are no
>> objections.
>> I will take a look at the existing code and its license, but most of these
>> algorithms have good wiki pages with pseudocode available, as well as
>> academic papers.
>> Maybe this component could be useful for other projects like [lang],
>> Lucene, larsga/Duke, and Talend Open Studio. And even though my initial use
>> case for this would be string comparison, I think it could support other
>> use cases too.
>> Thoughts on this? Anyone else interested in such a component?
>> Thanks!
>> Bruno
>> [1] https://issues.apache.org/jira/browse/LANG-591