Hi Benedikt!
> Just let me know if you need help with the bootstrapping of the new project.
Yes, please :)
> Maybe we should even announce this on announce@. There may be other projects
> interested in a library like this (for example Apache Tika [1]).
Good idea! Should we drop a note there once the project has been created or
after we already have some code in there?
Thanks!
Bruno
From: Benedikt Ritter <[email protected]>
To: Commons Developers List <[email protected]>; Bruno P. Kinoshita
<[email protected]>
Sent: Monday, October 27, 2014 5:45 AM
Subject: Re: [sandbox] New sandbox component
No objections from my side. I think this is a good idea. Just let me know if
you need help with the bootstrapping of the new project. Maybe we should even
announce this on announce@. There may be other projects interested in a library
like this (for example Apache Tika [1]).
Benedikt
[1] http://tika.apache.org/
2014-10-27 0:41 GMT+01:00 Bruno P. Kinoshita <[email protected]>:
Hello all,
At the moment I'm working with data matching and record linkage, and had to
port some existing string comparison algorithms found in several open source
projects (fuzzy-search-tools, simmetrics, lingpipe, [lang], [codec]).
At that time I noticed LANG-591 [1], which suggests a more complex Levenshtein
distance algorithm. There are several other algorithms too
(Damerau-Levenshtein, Jaro, Jaro-Winkler, Jaccard, Bitap, q-gram, Soundex,
Metaphone). Instead of trying to put them all in, say, [lang], I'd like to
experiment with a new [text] component in the sandbox, if there are no
objections.
I will take a look at the existing code and its license, but most of these
algorithms have good wiki pages with pseudocode available, as well as academic
papers.
Maybe this component could be useful for other projects like [lang], Lucene,
larsga/Duke, and Talend Open Studio. And even though my initial use case for
this would be string comparison, I think it could support other use cases too.
Thoughts on this? Anyone else interested in such a component?
Thanks!
Bruno
[1] https://issues.apache.org/jira/browse/LANG-591
--
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter