Great! Thanks for taking care of this Benedikt! There are four open issues.
TEXT-2: Add Jaccard Index and Jaccard DistanceTEXT-10: A more complex Levenshtein distanceTEXT-14: Create a generic class that calculates a distance based on a similarity scoreTEXT-16: Improve HumanNameParser I've collected a list of features that I think could be a good contribution to [text], but I'll re-read and filter the list before creating new issues. Things like working with syllables, different ways to convert text to vectors (useful for NLP and text analytics), text alignment (left, right, justify), Burrows–Wheeler transform, etc. But I think prior to a 1.0 release, TEXT-16 is the only pending issue. What do you think? CheersBruno From: Benedikt Ritter <brit...@apache.org> To: Commons Developers List <dev@commons.apache.org> Sent: Sunday, April 26, 2015 10:14 PM Subject: Re: [TEXT] Review of the code base 2015-04-26 11:52 GMT+02:00 Benedikt Ritter <brit...@apache.org>: > > > 2015-04-19 17:29 GMT+02:00 Benedikt Ritter <brit...@apache.org>: > >> >> >> 2015-04-19 17:24 GMT+02:00 Benedikt Ritter <brit...@apache.org>: >> >>> Hello Bruno, >>> >>> 2015-04-19 12:15 GMT+02:00 Bruno P. Kinoshita < >>> brunodepau...@yahoo.com.br>: >>> >>>> Thanks for the thorough code review Benedikt! Comments inline. >>>> >>>> > 1. IP clearance! There are several comments talking about code >>>> adapted from >>>> >PHP libraries. This needs to be listed in NOTICE.txt. For an example >>>> see >>>> >the NOTICE.txt in [lang]. >>>> Filed https://issues.apache.org/jira/browse/SANDBOX-497 >>>> And first try to fix it: >>>> >>>> >>>> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=commitdiff;h=refs/heads/SANDBOX-497 >>>> Does that look correct? I also checked with the author before porting >>>> it to the initial Java version, and told him about the [text] component - >>>> https://github.com/jasonpriem/HumanNameParser.php/issues/11 >>> >>> >>> I've adjusted the changes a bit and merged it into master. >>> >>> >>>> >>>> > 2. The names package needs work. >>>> Completely agree. We can try to make it thread safe too. This will take >>>> a little longer. >>>> File new issue for tracking it >>>> https://issues.apache.org/jira/browse/SANDBOX-498 >>> >>> >>> I've pushed a new branch containing a proposal how to fix this. Please >>> review. >>> >>> >>>> >>>> >>>> > 3. There are a some useless JavaDocs, which can be dropped. For >>>> example: >>>> > (...)> Better rename the variable to middleName and drop the comment. >>>> What about renaming the variables but leaving the comments? >>>> >>> >>> For I've corrected this for the names package in the SANDBOX-498 branch. >>> >>> >>>> >>>> > 4. Several classes have no tests. The overall test coverage looks >>>> good, so >>>> >this may not be a problem. >>>> Before the initial release I can start a development cycle just to add >>>> new tests for parts that seem important - like code with somewhat high [+5] >>>> cyclomatic complexity and uncovered branches. Just to make sure that the >>>> initial release will have an acceptable quality and less hidden bugs :-) >>>> >>> >>> Nice! >>> >>> >>>> > 5. I don't think the shade configuration is correct. To me it looks >>>> like >>>> >commons-text depends on [lang] and also shades some classes. Do we >>>> really >>>> >need the explicit dependency? Furthermore, why do we need to shade so >>>> many >>>> >classes? Why do we need anything beside StringUtils? >>>> My mistake. I copied the [functor] configuration without much thinking. >>>> IIRC, this dependency was added because of the names package. But it >>>> shouldn't be too complicate to replace it. It is mainly used for checking >>>> empty/blank values. >>>> >>>> So let's drop the [lang] dependency and use something like `$VAR != >>>> null && ! "".equals($VAR)` instead? >>>> >>> >>> I think we also need StringUtils in the similarity package. We should >>> handle this after we have merged the changes made to the names package. >>> >>> >>>> Thanks again for reviewing the code! >>>> >>> >>> No problem, thanks for pushing this forward. >>> >>> Here are some more things we need to take care of before 1.0: >>> - JIRA project >>> >> >> Requested a JIRA project: >> https://issues.apache.org/jira/servicedesk/customer/portal/1/INFRA-9477 >> > > Here is the JIRA project [1]. I'll find out, how we can move the Commons > Text issues from the Sandbox project to the new project. > I've migrated all issues from the Sandbox project to the new Commons Text project. I've also updated the project documentation. > > Benedikt > > [1] https://issues.apache.org/jira/browse/TEXT > > >> >> >>> - Rename similarity package to distance package? We have five Distance >>> implementations but only one Similarity. Furthermore we have FuzzyScore. Is >>> it a distance or a similarity? It should be renamed accordingly. >>> - ConsineSimilariy uses Vectors modeled as Map<CharSequence, Integer>. >>> Does it make sense to introduce a new class called Vector? >>> >>> Benedikt >>> >>> >>>> >>>> All the best,Bruno >>>> >>>> >>>> From: Benedikt Ritter <brit...@apache.org> >>>> To: Commons Developers List <dev@commons.apache.org> >>>> Sent: Sunday, April 19, 2015 9:01 PM >>>> Subject: [TEXT] Review of the code base >>>> >>>> Hi all, >>>> >>>> I've looked through the code base of [text] and I already did some clean >>>> up. However there are a few things, that need more work IMHO: >>>> >>>> 1. IP clearance! There are several comments talking about code adapted >>>> from >>>> PHP libraries. This needs to be listed in NOTICE.txt. For an example see >>>> the NOTICE.txt in [lang]. >>>> >>>> 2. The names package needs work. Currently the HumanNameParser takes a >>>> Name >>>> object (which is just a wrapper around a String) as parameter. The parse >>>> result is then saved in fields (e.g. firstName). This makes the parser >>>> unusable after it has parsed a name. I think the API should be reworked >>>> such as: >>>> - The constructor of the parser takes configuration options which can be >>>> reused for several names to parse >>>> - the parse method takes a string as parameter, containing a name >>>> - the parse method returns an immutable Name objects which has getters >>>> for >>>> firstName, lastName etc. >>>> >>>> 3. There are a some useless JavaDocs, which can be dropped. For example: >>>> /** >>>> * Middle name. >>>> */ >>>> private String middle; >>>> >>>> Better rename the variable to middleName and drop the comment. >>>> >>>> 4. Several classes have no tests. The overall test coverage looks good, >>>> so >>>> this may not be a problem. >>>> >>>> 5. I don't think the shade configuration is correct. To me it looks like >>>> commons-text depends on [lang] and also shades some classes. Do we >>>> really >>>> need the explicit dependency? Furthermore, why do we need to shade so >>>> many >>>> classes? Why do we need anything beside StringUtils? >>>> >>>> Nevertheless, this library looks very good. Kudos to Bruno for pushing >>>> this >>>> forward! >>>> >>>> Regards, >>>> Benedikt >>>> >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> http://people.apache.org/~britter/ >>>> http://www.systemoutprintln.de/ >>>> http://twitter.com/BenediktRitter >>>> http://github.com/britter >>>> >>>> >>>> >>>> >>>> >>> >>> >>> >>> -- >>> http://people.apache.org/~britter/ >>> http://www.systemoutprintln.de/ >>> http://twitter.com/BenediktRitter >>> http://github.com/britter >>> >> >> >> >> -- >> http://people.apache.org/~britter/ >> http://www.systemoutprintln.de/ >> http://twitter.com/BenediktRitter >> http://github.com/britter >> > > > > -- > http://people.apache.org/~britter/ > http://www.systemoutprintln.de/ > http://twitter.com/BenediktRitter > http://github.com/britter > -- http://people.apache.org/~britter/ http://www.systemoutprintln.de/ http://twitter.com/BenediktRitter http://github.com/britter