On Thu, 5 Oct 2006, Bowie Bailey wrote:

> I would think the problem with computing distances is preventing
> false matches on normal words.  Consider these:
>
>     hunt
>     shot
>     dice
>     fits
>
> These are all a distance of 1 from words you might want to look for.
Heh, good examples. While that's true, you probably wouldn't score
something based on just one or two hits. It'd be sort of like weighted
Bayes -- each hit would get scored based on the length of the target
word and the closeness of the hit, and those word scores would add up
to a composite score for the message. You would then do SA scoring
based on this composite.

So, to use a completely spurious example, let's assume that each word
scores points equal to (length of target / distance). Then this
message:

    "We went on a hunt, and then shot some dice."

might have three hits: hunt, shot, and dice. (Although I'm not sure
you'd want to match either shot or dice.) That would give it a score
of 12. The example message earlier had at least six misspelled words:
orrgy, pornn, bitchees, annal, orgiees, and ssex. That would give a
composite score of 26. Your breaking point would be somewhere between
12 and 26.

In reality, I think you'd want the length to count more than linearly,
so that a six-letter word is worth considerably more than a
five-letter word. If we did that -- word score = (length of target ^ 2
/ distance) -- then the innocent sample message would score a 48 and
the guilty sample a 118, a larger gap, magnitude-wise. (There's a
rough sketch of both scoring schemes at the end of this message.)

Another option would be to use a combination of Levenshtein distance
and an algorithm like Metaphone for representing the pronunciation of
a word. So levenshtein(metaphone("orgy"), metaphone("orrgy")) == 0,
but levenshtein(metaphone("dice"), metaphone("dick")) == 1. (A second
sketch below shows that combination.)

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University
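Just to make the arithmetic concrete, here's a rough Python sketch of
that composite scoring. The target list, the distance-1 cutoff, the
tokenization, and the skipping of exact matches are all my own
assumptions for illustration, so the totals it prints depend on those
choices and won't necessarily match the back-of-the-envelope figures
above:

    import re

    # Hypothetical target list and closeness cutoff -- both would
    # really be tuning decisions; these are just for illustration.
    TARGETS = ["orgy", "porn", "bitches", "anal", "orgies", "sex", "dick"]
    MAX_DIST = 1

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance, one row at a time."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def word_score(word, exponent=1):
        """(target length ** exponent) / distance for the closest target
        within MAX_DIST; 0 otherwise.  Exact matches (distance 0) are
        skipped on the assumption that ordinary body rules catch them."""
        best = 0.0
        for target in TARGETS:
            dist = levenshtein(word, target)
            if 0 < dist <= MAX_DIST:
                best = max(best, len(target) ** exponent / dist)
        return best

    def composite_score(message, exponent=1):
        """Sum the word scores over a crude lowercase tokenization."""
        words = re.findall(r"[a-z]+", message.lower())
        return sum(word_score(w, exponent) for w in words)

    innocent = "We went on a hunt, and then shot some dice."
    guilty = "orrgy pornn bitchees annal orgiees ssex"

    print(composite_score(innocent), composite_score(guilty))        # linear
    print(composite_score(innocent, 2), composite_score(guilty, 2))  # squared

The actual SA glue -- turning the composite into rule scores -- is
left out; the point is just the shape of the computation.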
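And a sketch of the Metaphone variation, using the third-party
jellyfish library (one Metaphone/Levenshtein implementation among
several; that particular choice is an assumption, not a requirement):

    # pip install jellyfish
    import jellyfish

    def phonetic_distance(a, b):
        """Levenshtein distance between the Metaphone codes of two
        words, so spelling variants that sound alike compare equal."""
        return jellyfish.levenshtein_distance(
            jellyfish.metaphone(a), jellyfish.metaphone(b))

    print(phonetic_distance("orgy", "orrgy"))  # expect 0: same code
    print(phonetic_distance("dice", "dick"))   # expect 1: one symbol apart

You could then drop phonetic_distance() in for levenshtein() in the
word_score() sketch above to score on pronunciation rather than
spelling.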