If anyone's curious: I did some follow-up research on the ideas below and found them to be, generally, totally infeasible.
I downloaded the TREC corpus and generated a list of words that commonly appeared in spam: the 1000 most common words of more than four letters in the TREC spam that were NOT among the 1000 most common words of more than four letters in the TREC ham. I then ran two sets of tests on a few sample hams and spams, and the results convinced me that it wasn't even necessary to run the tests on the whole corpus.

For each message, I compared every word of more than four letters against each word in my spam wordlist using the Wagner-Fischer distance, a slightly modified Levenshtein distance. With W-F, I was able to give greater weight to letter replacements, so "viagna" would be further from "viagra" than, say, "viagrra." I also compared the Metaphone representation of each word of more than four letters against the Metaphone hashes of each word in my spam wordlist, again with Wagner-Fischer. I discarded the distances that were too high and then computed a score for each message with the following formula:

    <metaphone_length> ^ 2 / (<metaphone_distance> + 1) + <word_length> ^ 2 / (<distance> + 1)

I ran this on the first ten spams and hams in the corpus. The mean score for spams was 365.7 and the median was 12.5; the mean score for hams was 3715.565 and the median was 1103.6. More than anything, the results seem to track the length of the message rather than its spamminess.

Processor time was also a problem: the largest message scanned took over 23 minutes to process. The quickest was under 3 seconds, but the average was around 45 seconds, with ham taking much longer to process than spam. Running either test individually -- the plain-text W-F distance or the Metaphone W-F distance -- did not appreciably improve the accuracy of the algorithm, although the processing time did improve.

It's too bad this won't work, but if someone else wants to take a crack at it, I'd be happy to share my code, word lists, etc.

Chris St.
Pierre
Unix Systems Administrator
Nebraska Wesleyan University

On Thu, 5 Oct 2006, Chris St. Pierre wrote:

>One thing I've wondered/thought about is using the Levenshtein
>difference between the words in an email and a list of spam words
>(ideally pulled from the bayes db). In this case, all of the
>misspelled words in that sample have a L-distance of 1 from the real
>word -- in other words, they're *very* close.
>
>I think the problem would be that this would consume tons of
>resources. Anything else, though, would be susceptible to other typo
>attacks. For instance, say you took each email, and replaced all
>doubled letters with single letters, it wouldn't be long before you
>were getting spam advertising "analr bictches" or the like.
>
>Chris St. Pierre
>Unix Systems Administrator
>Nebraska Wesleyan University
>
>On Wed, 4 Oct 2006, Eric A. Hall wrote:
>
>>On 10/4/2006 5:57 PM, Richard Doyle wrote:
>>> I've been getting lots of porn site spam containing words with doubled
>>> letters, like this one:
>>
>>> Can anybody suggest a rule or ruleset to catch these double-letter
>>> obfuscations? I'm using Spamassassin 3.1.4.
>>
>>You'd probably need to write a plug-in that used some kind of
>>typo-matching logic to find porno words.
>>
>>Would be a good plug-in actually. Get busy :)
>>
>>--
>>Eric A. Hall                  http://www.ehsco.com/
>>Internet Core Protocols       http://www.oreilly.com/catalog/coreprot/
>>
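
For anyone who wants to take a crack at this, here's a minimal sketch in Python of the two pieces described above: a Wagner-Fischer distance that weights letter replacements more heavily than insertions/deletions, and the plain-text half of the per-word score. The cost weights, distance cutoff, and sample wordlist are illustrative guesses, not the values I actually used, and the Metaphone pass (which needs a phonetic-encoding library) is omitted.

```python
def wagner_fischer(a, b, sub_cost=2, indel_cost=1):
    """Edit distance with substitutions weighted more heavily than
    insertions/deletions, so "viagna" (one substitution) ends up
    further from "viagra" than "viagrra" (one insertion)."""
    m, n = len(a), len(b)
    # d[i][j] = cost of transforming a[:i] into b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel_cost
    for j in range(1, n + 1):
        d[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + indel_cost,   # deletion
                          d[i][j - 1] + indel_cost,   # insertion
                          d[i - 1][j - 1] + cost)     # substitution
    return d[m][n]

def score_word(word, spam_words, max_dist=3):
    """Per-word contribution: len(word)**2 / (distance + 1), summed
    over wordlist entries within max_dist.  This is the plain-text
    term of the formula in the post; the Metaphone term has the
    same shape, applied to phonetic encodings instead of words."""
    total = 0.0
    for sw in spam_words:
        dist = wagner_fischer(word, sw)
        if dist <= max_dist:  # discard distances that are too high
            total += len(word) ** 2 / (dist + 1)
    return total

# The substitution weighting in action:
print(wagner_fischer("viagra", "viagna"))   # one replacement -> 2
print(wagner_fischer("viagra", "viagrra"))  # one insertion   -> 1
print(score_word("viagna", ["viagra"]))     # 6**2 / (2 + 1)  -> 12.0
```

The quadratic term in the score is what makes long messages dominate: every qualifying word adds len(word)**2 scaled only by distance, with no normalization by message length, which is consistent with the observation that the results tracked message length more than spamminess.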