Chris St. Pierre wrote:
> One thing I've wondered/thought about is using the Levenshtein
> distance between the words in an email and a list of spam words
> (ideally pulled from the bayes db).  In this case, all of the
> misspelled words in that sample have an L-distance of 1 from the real
> word -- in other words, they're *very* close.
> 
> I think the problem would be that this would consume tons of
> resources.  Anything else, though, would be susceptible to other typo
> attacks.  For instance, if you took each email and replaced all
> doubled letters with single letters, it wouldn't be long before you
> were getting spam advertising "analr bictches" or the like.

I would think the problem with computing distances is preventing false
matches on normal words.  Consider these:

    hunt
    shot
    dice
    fits

These are all a distance of 1 from words you might want to look for.
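To illustrate the false-match risk, here's a minimal sketch of the edit-distance check being discussed (the word pairs below are neutral stand-ins, not an actual spam list pulled from a bayes db):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance:
    minimum number of insertions, deletions, and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Perfectly ordinary words sit at distance 1 from other
# ordinary words, so a distance-1 match against a "bad word"
# list would flag innocent text constantly.
for innocent, flagged in [("hunt", "hung"), ("dice", "dime"), ("fits", "bits")]:
    print(innocent, flagged, levenshtein(innocent, flagged))
```

A distance threshold of 1 treats every one of these pairs as a match, which is exactly the false-positive problem with normal words.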

-- 
Bowie
