Hi,

the git repo for [text] is ready and I've already done the initial
bootstrapping. I've also created a new component in the SANDBOX JIRA
project. The first issue is to extract algorithms from [lang] [1]. I
remember people saying that there is code in codec too. Please feel free to
create tickets for this.

Bruno already has some code that may fit into [text] [2]. I've given it a
brief review, and here are a few things that caught my eye:

- Inclusion of Talend code into [text] is not possible (the code is
licensed by www.talend.com)
- spellchecker package: nice idea, which I hadn't thought about before.
Furthermore, I could imagine a hyphenation package. Both should be locale
dependent.
- Looking at EditDistance [3] I'm not sure we need T extends Number, if the
only possible values for T are Integer and Double. Maybe we only need an
IntegerEditDistance and a DoubleEditDistance.

Regarding the last point: I'm currently not fond of having a common
interface for edit distance algorithms. For example, Levenshtein has the
optional threshold parameter, which Jaro-Winkler does not (at least judging
from the implementation in [lang]). Fuzzy Distance needs a locale for
uncapitalizing. I think finding an interface that fits them all will be
difficult to accomplish... But we'll see :-)
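
To make this a bit more concrete, here is a rough sketch of one possible
shape. All type names below are hypothetical; none of them exist in [lang]
or in Bruno's repository. It assumes the algorithm-specific settings
(threshold, locale) move into constructors so that the method itself stays
two-argument. The locale-aware class is only a stand-in for a
locale-dependent metric, not the Fuzzy Distance algorithm, and returning -1
above the threshold is an assumption borrowed from how the thresholded
Levenshtein in [lang] behaves.

```java
import java.util.Locale;

// Hypothetical interfaces: one per result type instead of <T extends Number>.
interface IntegerEditDistance {
    int apply(CharSequence left, CharSequence right);
}

// Would suit metrics with fractional scores; not implemented in this sketch.
interface DoubleEditDistance {
    double apply(CharSequence left, CharSequence right);
}

// Levenshtein with an optional threshold passed through the constructor.
final class LevenshteinSketch implements IntegerEditDistance {
    private final int threshold; // negative means "no threshold"

    LevenshteinSketch() {
        this(-1);
    }

    LevenshteinSketch(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public int apply(CharSequence left, CharSequence right) {
        int[] prev = new int[right.length() + 1];
        int[] curr = new int[right.length() + 1];
        for (int j = 0; j <= right.length(); j++) {
            prev[j] = j; // distance from the empty prefix of left
        }
        for (int i = 1; i <= left.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        int distance = prev[right.length()];
        // Assumption: report -1 when the distance exceeds the threshold.
        return (threshold >= 0 && distance > threshold) ? -1 : distance;
    }
}

// Stand-in for a locale-dependent metric: lower-cases both inputs with the
// given locale, then delegates to another integer distance.
final class LocaleAwareSketch implements IntegerEditDistance {
    private final Locale locale;
    private final IntegerEditDistance delegate;

    LocaleAwareSketch(Locale locale, IntegerEditDistance delegate) {
        this.locale = locale;
        this.delegate = delegate;
    }

    @Override
    public int apply(CharSequence left, CharSequence right) {
        return delegate.apply(left.toString().toLowerCase(locale),
                              right.toString().toLowerCase(locale));
    }
}
```

Whether this holds up once the remaining algorithms are moved over is an
open question; it only shows that the threshold and the locale do not
necessarily have to appear in the interface method.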

Regards,
Benedikt

[1] https://issues.apache.org/jira/browse/SANDBOX-483
[2]
https://github.com/kinow/text/tree/master/src/main/java/text/string_metric
[3]
https://github.com/kinow/text/blob/master/src/main/java/text/string_metric/EditDistance.java

-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter
