Hello all, 
At the moment I'm working with data matching and record linkage, and had to 
port some existing string comparison algorithms found in several open source 
projects (fuzzy-search-tools, simmetrics, lingpipe, [lang], [codec]).
At that time I noticed LANG-591 [1], which suggests a more complex levenshtein 
distance algorithm. There are several other algorithms too 
(damerau-levenshtein, jaro, jaro-winkler, jaccard, bitap, q-gram, soundex, 
metaphone). Instead of trying to put them all in, say, [lang], I'd like to 
experiment with a new [text] component in the sandbox, if there are no 
objections. 
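
To give an idea of what would go in, here is a rough sketch of the classic 
two-row Levenshtein distance in Java (the class and method names are just 
placeholders for illustration, not a proposed API):

    public final class LevenshteinDistance {

        /** Number of single-character edits needed to turn s into t. */
        public static int distance(CharSequence s, CharSequence t) {
            int[] prev = new int[t.length() + 1];
            int[] curr = new int[t.length() + 1];

            // Distance from the empty prefix of s to each prefix of t.
            for (int j = 0; j <= t.length(); j++) {
                prev[j] = j;
            }
            for (int i = 1; i <= s.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= t.length(); j++) {
                    int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev;
                prev = curr;
                curr = tmp;
            }
            return prev[t.length()];
        }
    }

So distance("kitten", "sitting") would be 3, and the more elaborate variants 
(thresholds, weights, Damerau transpositions) would build on the same idea.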
I will take a look at the existing code and its licenses, but most of these 
algorithms have good wiki pages with pseudocode available, as well as academic 
papers. 
Maybe this component could be useful for other projects like [lang], Lucene, 
larsga/Duke, and Talend Open Studio. And even though my initial use case for 
this would be string comparison, I think it could support other use cases too.
Thoughts on this? Anyone else interested in such a component? 
Thanks!
Bruno
[1] https://issues.apache.org/jira/browse/LANG-591 
