On Mon, 15 Dec 2003 14:38:59 -0600, Brad Koehn <[EMAIL PROTECTED]> writes:

> Any spammer worth his salt runs his message through SA and other
> popular anti-spam tools as best he can. Most of SA is relatively
> static and slow to respond to changes in message content. The problem
> comes in a few areas like checks against Received headers (since the
> spammer may be using thousands of zombies to send the message to your
> MTA) and Bayes filters, which are tailored to each recipient and the
> spammer cannot access.
> 
> 
> If a smart spammer wanted to try to make it past the Bayes filters,
> he'd set up a spamtrap, gather spam, and run his message against the
> Bayes tokens it gathers. Of course, that only partially reduces the
> likelihood of making it in (since it doesn't know which good tokens to
> use for each recipient), but at least he can avoid the same tokens
> other spammers are using.

There are only so many ways to sell a printer or toner cartridges. It's
also hard to sell drugs without either giving their name or giving
some description of what they do. If SA can cover them, the spammers
are pretty much stuck with circumlocutions that will turn into
nonsense, or giving a URL/image and little else.

> Of course, running it through SA won't help against the dynamic
> collaborative filters like razor, pyzor, dcc, etc. without varying the
> message. I suppose the spammer could send the message to his own spam
> trap (after the dynamic collaborators have crunched it) and see how it
> came out. He could also tweak the message after it's been in flight
> for a while, hopefully breaking the signatures of the dynamic
> collaborators.

I have ideas on that. Here are a few for making it hard to modify a
message. Take the original email and first remove all non-words,
non-dictionary words, and words less than 5 characters long. Then
concatenate the remaining words together. Now apply a rolling-hash
algorithm to break the result into pieces that are, say, on average 50
characters long. Hash the pieces individually to form the signature.
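
To make the preprocessing step concrete, here is a minimal sketch in
Python. Nothing like this exists in SA; the function name normalize(),
the tiny example dictionary, and the choice to strip surrounding
punctuation before testing a token are all my assumptions. In practice
the dictionary would be a real word list loaded from a file.

```python
import string

def normalize(body, dictionary, min_len=5):
    """Keep only dictionary words of at least min_len letters, concatenated.

    `dictionary` is a hypothetical set of lowercase words supplied by the
    caller; it stands in for a real word list.
    """
    kept = []
    for token in body.split():
        # Drop punctuation stuck to the token, then lowercase it.
        word = token.strip(string.punctuation).lower()
        # Non-words (digits, mixed junk) and short words are discarded.
        if word.isalpha() and len(word) >= min_len and word in dictionary:
            kept.append(word)
    return "".join(kept)

words = {"cheap", "printer", "toner", "cartridges", "now"}
print(normalize("Cheap PRINTER toner, cartridges!! xq9z buy now", words))
# prints: cheapprintertonercartridges
```

Random filler like "xq9z" and short words like "buy" never reach the
hashing stage, which is the point of doing this first.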

The idea of a rolling hash is that whenever hash(substring(i-30,i)) % 50
== 0, you split the string at point i. Then take each of the resulting
pieces and record its length and truncated MD5 hash; the signature is
just the list of those (length, MD5) pairs.
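
A rough sketch of the splitting and signature step, in Python. This is
an illustration under assumptions, not a finished design: crc32 stands
in for the unspecified hash function, it is recomputed for every window
instead of being rolled incrementally, and the names cut_points() and
signature() and the 8-character MD5 truncation are mine.

```python
import hashlib
import zlib

WINDOW = 30   # characters fed to the windowed hash
MODULUS = 50  # a cut happens at roughly 1 position in 50

def cut_points(text):
    """Positions i where hash(text[i-WINDOW:i]) % MODULUS == 0."""
    points = []
    for i in range(WINDOW, len(text)):
        window = text[i - WINDOW:i].encode("utf-8")
        # crc32 is an assumed stand-in hash, recomputed per position for
        # clarity; a true rolling hash would update it as the window slides.
        if zlib.crc32(window) % MODULUS == 0:
            points.append(i)
    return points

def signature(text, digest_chars=8):
    """List of (length, truncated MD5) pairs, one per piece."""
    sig = []
    start = 0
    for end in cut_points(text) + [len(text)]:
        piece = text[start:end]
        if piece:
            md5 = hashlib.md5(piece.encode("utf-8")).hexdigest()
            sig.append((len(piece), md5[:digest_chars]))
        start = end
    return sig
```

Because a cut at position i depends only on the 30 characters before
it, editing one character cannot move any cut point more than 30
characters further along.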

This has the property that any localized change at one position in the
email can affect at most the piece it is in and the next piece, and
thus changes only a limited number of entries in the signature.
Furthermore, since the split algorithm produces pieces that average 50
characters, there is a good chance that any 120-character sequence
shared by two spams, wherever it occurs, contains a whole piece and so
yields a match in one of the components of the signature.
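
Comparing two signatures then comes down to counting the entries they
share. A small Python sketch follows; the function names and the
threshold of one shared piece are assumptions for illustration, and a
signature is taken to be a list of (length, truncated MD5) pairs as
described above.

```python
from collections import Counter

def shared_pieces(sig_a, sig_b):
    """Number of (length, digest) entries the two signatures have in common."""
    common = Counter(sig_a) & Counter(sig_b)  # multiset intersection
    return sum(common.values())

def looks_related(sig_a, sig_b, min_shared=1):
    """True if the messages share at least min_shared pieces (assumed threshold)."""
    return shared_pieces(sig_a, sig_b) >= min_shared

# Hand-made signatures: the two messages differ at both ends but share
# two pieces in the middle.
a = [(48, "aa11"), (52, "bb22"), (61, "cc33")]
b = [(40, "zz99"), (52, "bb22"), (61, "cc33"), (30, "yy88")]
print(shared_pieces(a, b))  # prints: 2
```

Position in the list is deliberately ignored, so a shared passage is
found no matter where it sits in either message.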

Also, because of the preprocessing, this scheme should be robust
against signature poisoning: they'd practically have to add a random
longish English word every other line, or write each email
individually.

It's not implemented yet, but it's on my queue to do after the
automata matcher. When I do it, I'm planning on running it over my
existing spam archive just to see if it can identify any unknown but
popular phrases or patterns.

> A really smart spammer would examine the algorithms, and design
> algorithms of his own to morph the message enough to defeat them. 

Exactly. SA is itself a recipe for messages that can bypass it.

Scott


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
