This all sounds reasonable. G
-------- Original message --------
From: Benedikt Ritter <benerit...@gmail.com>
Date: 01/22/2014 05:15 (GMT-05:00)
To: Commons Developers List <dev@commons.apache.org>
Subject: Re: [LANG] New class called StringAlgorithms?

Hello,

2014/1/22 Henri Yandell <flame...@gmail.com>

> On Mon, Jan 20, 2014 at 8:01 AM, Benedikt Ritter <brit...@apache.org>
> wrote:
>
> > 2014/1/18 Oliver Heger <oliver.he...@oliver-heger.de>
> >
> > > On 18.01.2014 17:40, Emmanuel Bourg wrote:
> > > > On 18/01/2014 16:04, Benedikt Ritter wrote:
> > > >
> > > > > About putting this into codec: I still don't think this is a good
> > > > > fit for this contribution. Codec is about, well, decoding and
> > > > > encoding stuff. Jaro Winkler and Levenshtein Distance are more like
> > > > > scores or metrics that help in comparing strings.
> > > >
> > > > The point is, string metrics and Soundex algorithms are often used to
> > > > find similarities between words. It's a bit odd to have them in
> > > > separate packages. That being said, string metrics don't look like a
> > > > good fit for codec since they don't encode anything.
> > >
> > > From a logic PoV I agree with Emmanuel that a separate Text component
> > > would make sense. It could also contain other stuff like special search
> > > algorithms or trie implementations.
> > >
> > > From an organizational PoV I also understand Gary: It is unlikely that
> > > we have the energy and manpower to keep such a new component alive -
> > > unless someone steps up now?
> > >
> > > So I am on the fence. In the past we have always tried to keep [lang]
> > > very focused and lean.
> > >
> >
> > Well, these string distance metrics could be seen as an addition to
> > java.lang.String. In this regard a StringDistanceMetrics class would fit
> > into [lang].
>
> I don't recall why we sent things like Soundex and Metaphone from Lang to
> Codec but not Levenshtein. There was lots of debate and I'm guessing it was
> because of the API not being transformative on the input but instead
> comparative. I think that still holds.
>

Makes sense.

> My thinking - keep it simple for 3.3, figure out bigger picture for 4.0 if
> simple was too simple.
>
> What I'm tempted to think about is splitting up StringUtils in 4.0. Make it
> more manageable and easier to find methods in. At 188 methods I think this
> is worth considering.
>

Makes sense.

> I would be tempted by "StringCompare.getLevensteinDistance(...)".
> countMatches(String, String) would join them. Maybe all the
> startsWith/endsWith methods. Thinking out loud. Premature though for 3.3 :)
>
> For now I'm in favour of putting jaroWinkler in StringUtils and putting off
> the bigger question of StringUtils being so big. Removing the two
> Levenshtein methods will see a change of 188 to 186 methods - no real impact
> to anybody.
>

Yes. I'd prefer this solution, since I want to give contributors the feeling
that their contributions end up in trunk and are ready for use in upcoming
releases. If you contribute stuff that never ends up in a release, that will
frustrate you. So keeping things simple and figuring out the big picture for
4.0 is a good idea.

Benedikt

>
> Hen
>