Hi,

the git repo for [text] is ready and I've already done the initial bootstrapping. I've also created a new component in the SANDBOX Jira project. The first issue is to extract algorithms from [lang] [1]. I remember people saying that there is code in codec too. Please feel free to create tickets for this.

Bruno already has some code that may fit into [text] [2]. I've given it a brief review, and here are a few things that caught my eye:

- Inclusion of Talend code into [text] is not possible (the code is licensed by www.talend.com).
- spellchecker package: nice idea, which I hadn't thought about before. Furthermore, I could imagine a hyphenation package. Both should be locale dependent.
- Looking at EditDistance [3], I'm not sure we need T extends Number if the only possible values for T are Integer and Double. Maybe we only need an IntegerEditDistance and a DoubleEditDistance.

Regarding the last point: I'm currently not fond of the idea of a common interface for edit distance algorithms. For example, Levenshtein has the optional threshold parameter, which Jaro-Winkler does not (at least judging from the implementation in [lang]). Fuzzy Distance needs a locale for uncapitalizing. I think finding an interface that fits them all will be difficult to accomplish... But we'll see :-)

Regards,
Benedikt

[1] https://issues.apache.org/jira/browse/SANDBOX-483
[2] https://github.com/kinow/text/tree/master/src/main/java/text/string_metric
[3] https://github.com/kinow/text/blob/master/src/main/java/text/string_metric/EditDistance.java

--
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter
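
P.S. To make the interface concern above concrete, here is a rough sketch. It is not the code from [2] or [lang]; every name except IntegerEditDistance and DoubleEditDistance is hypothetical, the "-1 when above the threshold" convention is an assumption, and the locale-aware class is only a stand-in for something like Fuzzy Distance, not that algorithm. It shows two non-generic interfaces, a threshold that ends up as an extra method outside the interface, and a locale that has to be passed in through the constructor.

```java
import java.util.Locale;

/** Metric with a whole-number result (replaces EditDistance<Integer>). */
interface IntegerEditDistance {
    int distance(CharSequence left, CharSequence right);
}

/** Metric with a fractional result (replaces EditDistance<Double>). */
interface DoubleEditDistance {
    double distance(CharSequence left, CharSequence right);
}

/** Plain two-row Levenshtein; the class name is hypothetical. */
final class LevenshteinSketch implements IntegerEditDistance {
    @Override
    public int distance(CharSequence left, CharSequence right) {
        int[] prev = new int[right.length() + 1];
        int[] curr = new int[right.length() + 1];
        for (int j = 0; j <= right.length(); j++) {
            prev[j] = j; // distance from the empty prefix of left
        }
        for (int i = 1; i <= left.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[right.length()];
    }

    /**
     * The optional threshold does not fit the two-argument interface,
     * so it becomes an extra method. Assumed convention: return -1
     * when the distance exceeds the threshold.
     */
    public int distance(CharSequence left, CharSequence right, int threshold) {
        if (threshold < 0) {
            throw new IllegalArgumentException("threshold must not be negative");
        }
        int d = distance(left, right);
        return d <= threshold ? d : -1;
    }
}

/**
 * Stand-in for a locale-dependent metric: the fraction of positions
 * that match after lower-casing with the given locale.
 */
final class LocaleAwareMatchRatioSketch implements DoubleEditDistance {
    private final Locale locale;

    LocaleAwareMatchRatioSketch(Locale locale) {
        this.locale = locale; // the locale cannot come through distance(a, b)
    }

    @Override
    public double distance(CharSequence left, CharSequence right) {
        String a = left.toString().toLowerCase(locale);
        String b = right.toString().toLowerCase(locale);
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty inputs are treated as a full match
        }
        int matches = 0;
        for (int i = 0; i < Math.min(a.length(), b.length()); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        return (double) matches / longest;
    }
}
```

Callers that only know IntegerEditDistance cannot reach the threshold variant, and callers of DoubleEditDistance have to know about the locale at construction time, which is the mismatch described above.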