Hi all, as discussed, we'd like to create a new component focused on algorithms for string/text processing.
We (= Bruno and I) would like to create this new component with git as its primary VCS right away, which will make Commons Text the second Commons component to use git. Please let me know if you have any objections to this. I'll open an INFRA ticket for the new git repo this weekend.

Thanks!
Benedikt

2014-10-27 12:57 GMT+01:00 Benedikt Ritter <brit...@apache.org>:

> 2014-10-27 12:32 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
>
>> Hi Benedikt!
>>
>> > Just let me know if you need help with the bootstrapping of the new
>> > project.
>>
>> Yes, please :)
>
> I'll give folks some more time to share their thoughts about this and
> create the new project then.
>
>> > Maybe we should even announce this on announce@. There may be other
>> > projects interested in a library like this (for example Apache Tika [1])
>>
>> Good idea! Should we drop a note there once the project has been created
>> or after we already have some code in there?
>
> The latter seems appropriate to me.
>
>> Thanks!
>> Bruno
>>
>> From: Benedikt Ritter <brit...@apache.org>
>> To: Commons Developers List <dev@commons.apache.org>; Bruno P.
>> Kinoshita <brunodepau...@yahoo.com.br>
>> Sent: Monday, October 27, 2014 5:45 AM
>> Subject: Re: [sandbox] New sandbox component
>>
>> No objections from my side. I think this is a good idea. Just let me know
>> if you need help with the bootstrapping of the new project. Maybe we should
>> even announce this on announce@. There may be other projects interested
>> in a library like this (for example Apache Tika [1])
>>
>> Benedikt
>>
>> [1] http://tika.apache.org/
>>
>> 2014-10-27 0:41 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
>>
>> Hello all,
>>
>> At the moment I'm working with data matching and record linkage, and had
>> to port some existing string comparison algorithms found in several open
>> source projects (fuzzy-search-tools, simmetrics, lingpipe, [lang], [codec]).
>> At that time I noticed LANG-591 [1], which suggests a more complex
>> Levenshtein distance algorithm. There are several other algorithms too
>> (damerau-levenshtein, jaro, jaro-winkler, jaccard, bitap, q-gram, soundex,
>> metaphone). Instead of trying to put them all in, say, [lang], I'd like to
>> experiment with a new [text] component in the sandbox, if there are no
>> objections.
>>
>> I will take a look at the existing code and its license, but most of
>> these algorithms have good wiki pages with pseudocode available, as well
>> as academic papers.
>>
>> Maybe this component could be useful for other projects like [lang],
>> Lucene, larsga/Duke, and Talend Open Studio. And even though my initial use
>> case for this would be string comparison, I think it could support other
>> use cases too.
>>
>> Thoughts on this? Is anyone else interested in such a component?
>>
>> Thanks!
>> Bruno
>>
>> [1] https://issues.apache.org/jira/browse/LANG-591
>>
>> --
>> http://people.apache.org/~britter/
>> http://www.systemoutprintln.de/
>> http://twitter.com/BenediktRitter
>> http://github.com/britter
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter