--- Begin Message ---
What fuzzy-string matching tools & packages are available today?

-cam

On Wed, Feb 26, 2014 at 9:09 AM, Hernán Morales Durand <
hernan.mora...@gmail.com> wrote:

>
>
>
> 2014-02-26 7:10 GMT-03:00 Norbert Hartl <norb...@hartl.name>:
>
>>
>> On 26.02.2014 at 09:50, Pharo4Stef <pharo4s...@free.fr> wrote:
>>
>>
>> We can have an information retrieval API for approximate string matching,
>> e.g. Levenshtein distance (already implemented, in various versions) and
>> Hamming distance, which are the most widely used and simplest edit distances.
>> Then you have longest common subsequence and longest common substring (both
>> are implemented in a package called "Fuzz", see
>> #longestCommonSubsequenceWith:). There is also the shift-or algorithm
>> adapted for approximate matches (also implemented); fuzzy phrasing is
>> another world altogether. Many applications use the Damerau edit distance.
>> Bioinformatics uses Needleman-Wunsch and Smith-Waterman, although there
>> they are called "aligners" :) You don't want to code the optimized versions
>> in Smalltalk, though; some say it could take years.
>> Every edit distance out there has specific requirements, and no single one
>> is better than the others in all cases. For example, Jaro-Winkler is useful
>> for short, one-word strings.
>>
>>
>> I’m not sure that all these edit distances should be part of the String
>> core api.
>> Now what would be good is to have a chapter describing them. This chapter
>> would work well with the bioSmalltalk one :)
>>
>> I’m pretty sure they shouldn’t. Most of these are most likely for special
>> applications, so they are a perfect candidate for a string extension
>> package. A truly modular entity that could load each of them individually
>> would be perfect, but we don’t have the proper tools yet. Unless, of course,
>> each of those algorithms is composed of multiple classes and would fit
>> naturally in a package of its own.
>>
>
> I am absolutely in favor of a separate package for information retrieval
> algorithms. From what I've seen, some algorithms require optimization through
> dynamic programming (automata, matrices, etc.), and that would lead to
> multiple classes, assuming you don't want to clutter the String class.
>
>
>> But the most important prerequisite would be to make a separate package
>> out of it. Did I understand correctly that those are part of bioSmalltalk?
>>
>
> No. Those algorithms are spread over different packages in repositories
> like SqueakSource, Cincom Store, etc.
>
> Hernán
>
>
>

--- End Message ---
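
For readers unfamiliar with the two edit distances the thread calls the simplest, here is a minimal sketch of Hamming and Levenshtein distance. It is written in Python purely for illustration and is not the code of any Smalltalk package mentioned above; the function names `hamming` and `levenshtein` are hypothetical, and the Levenshtein version is the plain dynamic-programming form, not an optimized one.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # previous[j] is the distance between the prefix of a processed
    # so far (excluding the current character) and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning i characters of a into the empty string
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x from a
                current[j - 1] + 1,          # insert y into a
                previous[j - 1] + (x != y),  # substitute x with y, or match
            ))
        previous = current
    return previous[-1]
```

Hamming distance only applies to strings of equal length, whereas Levenshtein handles any pair at a cost of O(len(a) * len(b)) time. That difference is one concrete case of the point made in the thread that each distance has its own requirements.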
