If anyone's curious: I did some follow-up research on the ideas below and found them to be, generally, totally infeasible.
I downloaded the TREC corpus and generated a list of words that commonly appeared in spam: the 1000 most common words of more than four letters in the TREC spam that were NOT among the 1000 most common words of more than four letters in the TREC ham. I then ran two sets of tests on a few sample hams and spams, and the results convinced me that it wasn't even necessary to run the tests on the whole corpus.

For each message, I compared every word of more than four letters against each word in my spam wordlist using the Wagner-Fischer distance, a slightly modified Levenshtein distance. With W-F, I was able to give greater weight to letter replacements, so "viagna" would be further from "viagra" than, say, "viagrra." I also compared the Metaphone representation of each word of more than four letters against the Metaphone hashes of each word in my spam wordlist, again with Wagner-Fischer. I discarded the distances that were too high and then computed a score for each message with the following formula:

    <metaphone_length> ^ 2 / (<metaphone_distance> + 1) + <word_length> ^ 2 / (<distance> + 1)

I ran this on the first ten spams and hams in the corpus. The mean score for spams was 365.7 and the median was 12.5; the mean score for hams was 3715.565 and the median was 1103.6. More than anything, the results seem to track the length of the message rather than its spamminess.

Processor time was also a problem: the largest message scanned took over 23 minutes to process. The quickest was under 3 seconds, but the average was around 45 seconds, with ham taking much longer to process than spam. Running either test individually -- the plain-text W-F distance or the Metaphone W-F distance -- did not appreciably improve the accuracy of the algorithm, although the processing time did improve.

It's too bad this won't work, but if someone else wants to take a crack at it, I'd be happy to share my code, word lists, etc.

Chris St.
Pierre
Unix Systems Administrator
Nebraska Wesleyan University

On Thu, 5 Oct 2006, Chris St. Pierre wrote:

>One thing I've wondered/thought about is using the Levenshtein
>difference between the words in an email and a list of spam words
>(ideally pulled from the bayes db). In this case, all of the
>misspelled words in that sample have a L-distance of 1 from the real
>word -- in other words, they're *very* close.
>
>I think the problem would be that this would consume tons of
>resources. Anything else, though, would be susceptible to other typo
>attacks. For instance, say you took each email, and replaced all
>doubled letters with single letters, it wouldn't be long before you
>were getting spam advertising "analr bictches" or the like.
>
>Chris St. Pierre
>Unix Systems Administrator
>Nebraska Wesleyan University
>
>On Wed, 4 Oct 2006, Eric A. Hall wrote:
>
>>On 10/4/2006 5:57 PM, Richard Doyle wrote:
>>> I've been getting lots of porn site spam containing words with doubled
>>> letters, like this one:
>>
>>> Can anybody suggest a rule or ruleset to catch these double-letter
>>> obfuscations? I'm using Spamassassin 3.1.4.
>>
>>You'd probably need to write a plug-in that used some kind of
>>typo-matching logic to find porno words.
>>
>>Would be a good plug-in actually. Get busy :)
>>
>>--
>>Eric A. Hall                  http://www.ehsco.com/
>>Internet Core Protocols       http://www.oreilly.com/catalog/coreprot/
>>
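
For anyone who wants to take a crack at this, here's a minimal sketch in Python of the two pieces described above: a Wagner-Fischer distance that weights letter replacements more heavily than insertions/deletions, and the plain-text half of the per-word score. The cost weights, distance cutoff, and sample wordlist are illustrative guesses, not the values I actually used, and the Metaphone pass (which needs a phonetic-encoding library) is omitted.

```python
def wagner_fischer(a, b, sub_cost=2, indel_cost=1):
    """Edit distance with substitutions weighted more heavily than
    insertions/deletions, so "viagna" (one substitution) ends up
    further from "viagra" than "viagrra" (one insertion)."""
    m, n = len(a), len(b)
    # d[i][j] = cost of transforming a[:i] into b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel_cost
    for j in range(1, n + 1):
        d[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + indel_cost,   # deletion
                          d[i][j - 1] + indel_cost,   # insertion
                          d[i - 1][j - 1] + cost)     # substitution
    return d[m][n]

def score_word(word, spam_words, max_dist=3):
    """Per-word contribution: len(word)**2 / (distance + 1), summed
    over wordlist entries within max_dist.  This is the plain-text
    term of the formula in the post; the Metaphone term has the
    same shape, applied to phonetic encodings instead of words."""
    total = 0.0
    for sw in spam_words:
        dist = wagner_fischer(word, sw)
        if dist <= max_dist:  # discard distances that are too high
            total += len(word) ** 2 / (dist + 1)
    return total

# The substitution weighting in action:
print(wagner_fischer("viagra", "viagna"))   # one replacement -> 2
print(wagner_fischer("viagra", "viagrra"))  # one insertion   -> 1
print(score_word("viagna", ["viagra"]))     # 6**2 / (2 + 1)  -> 12.0
```

The quadratic term in the score is what makes long messages dominate: every qualifying word adds len(word)**2 scaled only by distance, with no normalization by message length, which is consistent with the observation that the results tracked message length more than spamminess.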