Chris St. Pierre wrote:
> One thing I've wondered about is using the Levenshtein distance
> between the words in an email and a list of spam words (ideally
> pulled from the bayes db). In this case, all of the misspelled words
> in that sample have an L-distance of 1 from the real word -- in
> other words, they're *very* close.
>
> I think the problem would be that this would consume tons of
> resources. Anything else, though, would be susceptible to other typo
> attacks. For instance, if you took each email and replaced all
> doubled letters with single letters, it wouldn't be long before you
> were getting spam advertising "analr bictches" or the like.
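[A minimal sketch of the approach described above, in Python. The SPAM_WORDS set and the sample strings are hypothetical stand-ins; the post suggests the real list would come from the bayes db. The length pre-filter is one cheap nod to the resource concern.]

```python
SPAM_WORDS = {"viagra", "mortgage", "refinance"}  # hypothetical list

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a                       # keep b as the shorter string
    prev = list(range(len(b) + 1))        # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def flag_words(text: str, max_dist: int = 1):
    """Return (word, spam_word) pairs within max_dist edits of each other."""
    hits = []
    for raw in text.lower().split():
        word = raw.strip('.,;:!?"\'()')
        for spam in SPAM_WORDS:
            # Length difference is a lower bound on edit distance, so
            # skip the O(len*len) computation when it cannot match.
            if (abs(len(word) - len(spam)) <= max_dist
                    and levenshtein(word, spam) <= max_dist):
                hits.append((word, spam))
    return hits

print(flag_words("Cheap v1agra and m0rtgage rates!"))
# [('v1agra', 'viagra'), ('m0rtgage', 'mortgage')]
```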
I would think the problem with computing distances is preventing false matches on normal words. Consider these:

hunt
shot
dice
fits

These are all a distance of 1 from words you might want to look for.

-- Bowie
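[Bowie's point is easy to verify with the sketch above. The target words here are a guess at what he left implied, but each pair really is a single substitution apart, so at a threshold of 1 these everyday words would all be flagged.]

```python
# Presumed targets for Bowie's examples -- each pair differs by one letter.
for clean, target in [("hunt", "cunt"), ("shot", "shit"),
                      ("dice", "dick"), ("fits", "tits")]:
    assert levenshtein(clean, target) == 1
```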