2014-02-26 7:10 GMT-03:00 Norbert Hartl <norb...@hartl.name>:

> On 26.02.2014 at 09:50, Pharo4Stef <pharo4s...@free.fr> wrote:
>
> We can have an information retrieval API for approximate string matching,
> e.g. Levenshtein distance (already implemented, various versions) and
> Hamming distance; these two are the most used and simplest edit distances.
> Then you have longest common subsequence and longest common substring (they
> are implemented in a package called "Fuzz", #longestCommonSubsequenceWith:).
> There is also shift-or adapted for approximate matches (also
> implemented); fuzzy phrasing is another world as well. Many applications use
> Damerau edit distance. Bioinformatics uses Needleman-Wunsch and
> Smith-Waterman, but they call them "aligners" :) but you don't want to code
> the optimized version in Smalltalk, some say it could take years.
> All edit distances out there have specific requirements and no one is
> better than another for all cases. For example, Jaro-Winkler is useful for
> short one-word strings.
>
> I’m not sure that all these edit distances should be part of the String
> core API.
> Now what would be good is to have a chapter describing them. This chapter
> would work well with the bioSmalltalk one :)
>
> I’m pretty sure they shouldn’t. Most of these are most likely for special
> applications. So a perfect candidate for a string extension package. A real
> modular entity that could load each of them individually would be perfect,
> but we don’t have the proper tools yet. Unless of course each of those
> algorithms is composed of multiple classes and would fit naturally in a
> package.
>
Absolutely, I am for a separate package for information retrieval algorithms.
From what I've seen, some algorithms require optimization through dynamic
programming (automata, matrices, etc.) and that would lead to multiple
classes, assuming you don't want to clutter the String class.

> But the most important prerequisite would be to make a separate package
> out of it. Did I understand correctly that those are part of bioSmalltalk?

No. Those algorithms are spread over different packages in repositories like
SqueakSource, Cincom Store, etc.

Hernán
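
For readers unfamiliar with the two simplest edit distances named in this thread, here is a minimal sketch of Levenshtein and Hamming distance. It is written in Python only for brevity; it is not the code of the "Fuzz" package or of any existing Pharo implementation, and the function names `levenshtein` and `hamming` are chosen for illustration. The Levenshtein version uses the usual dynamic-programming table, keeping just one previous row at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # Before processing row i, previous[j] is the distance
    # between a[:i-1] and b[:j]; row 0 is simply 0..len(b).
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))
```

Even this small case needs a working table and a helper per distance, which hints at why the more elaborate algorithms fit better in their own classes than as extra methods on String.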