Chris St. Pierre wrote:
> One thing I've wondered about is using the Levenshtein distance
> between the words in an email and a list of spam words (ideally
> pulled from the bayes db). In this case, all of the misspelled words
> in that sample have an L-distance of 1 from the real word -- in
> other words, they're *very* close.
>
> I think the problem would be that this would consume tons of
> resources. Anything else, though, would be susceptible to other typo
> attacks. For instance, if you took each email and replaced all
> doubled letters with single letters, it wouldn't be long before you
> were getting spam advertising "analr bictches" or the like.
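[A minimal sketch of the approach described above, in Python. The SPAM_WORDS set and the sample strings are hypothetical stand-ins; the post suggests the real list would come from the bayes db. The length pre-filter is one cheap nod to the resource concern.]

```python
SPAM_WORDS = {"viagra", "mortgage", "refinance"}  # hypothetical list

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a                       # keep b as the shorter string
    prev = list(range(len(b) + 1))        # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def flag_words(text: str, max_dist: int = 1):
    """Return (word, spam_word) pairs within max_dist edits of each other."""
    hits = []
    for raw in text.lower().split():
        word = raw.strip('.,;:!?"\'()')
        for spam in SPAM_WORDS:
            # Length difference is a lower bound on edit distance, so
            # skip the O(len*len) computation when it cannot match.
            if (abs(len(word) - len(spam)) <= max_dist
                    and levenshtein(word, spam) <= max_dist):
                hits.append((word, spam))
    return hits

print(flag_words("Cheap v1agra and m0rtgage rates!"))
# [('v1agra', 'viagra'), ('m0rtgage', 'mortgage')]
```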
I would think the problem with computing distances is preventing false matches on normal words. Consider these:

hunt
shot
dice
fits

These are all a distance of 1 from words you might want to look for.

-- Bowie
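[Bowie's point is easy to verify with the sketch above. The target words here are a guess at what he left implied, but each pair really is a single substitution apart, so at a threshold of 1 these everyday words would all be flagged.]

```python
# Presumed targets for Bowie's examples -- each pair differs by one letter.
for clean, target in [("hunt", "cunt"), ("shot", "shit"),
                      ("dice", "dick"), ("fits", "tits")]:
    assert levenshtein(clean, target) == 1
```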