On Feb 12, 2014, at 1:15 PM, John Hardin <jhar...@impsec.org> wrote:

> Bayes.

Well, yes and no.  Bayes isn't very good at detecting this kind of thing per 
se, because the filler is full of random crap... in fact, the spammers 
specifically pull text from innocuous things like web reviews, movie reviews, 
news articles, etc. in the hope that it contains a lot of hammy tokens that 
will negate the spammy ones.  On the other hand, there's no really good way of 
detecting "lots of garbage filler text" without a natural language algorithm 
that could heuristically determine whether the primary content (as determined 
by subject, etc.) is related to the filler... and I don't think any such 
algorithms exist.  Bayes provides a way of distilling the garbage into tokens 
and sifting through it objectively, so it's the best option, but I wouldn't 
say it's a method of "detecting" this kind of thing.

That said, this particular spam template is interspersed with some sort of 
hashcode which is repeated a number of times.  It might be possible to write a 
rule that matches a long (20-30 chars) alphanumeric string and counts 
repetitions; if the same long string is repeated more than (say) 10 times, 
it's a good bet it's an embedded spammy hashcode.

I'd write an example rule but I don't know how to store regexp matches from one 
test to see if they match another test... that is, writing a regexp and using 
tflags multiple on it would be fine if we wanted it to hit on 10 or more long 
strings even if those strings don't match, but if we want to see if there are 
10 or more repeated long strings that are identical, we have to store it 
somehow, and I don't know how to do that with SA.

If SA allows backreferences (since Perl does) then something like the following 
MIGHT work, though I suspect it would be a horrible CPU hog:

rawbody AC_REPEATED_HASHCODE /(\s[A-Za-z0-9]{25,}\s)(?:(?:\s*\w+)+\1){10}/

This looks for a whitespace-delimited string of 25 or more alphanumeric 
characters, followed by 10 more repetitions of that same string, each preceded 
by one or more words.  It's untested, so I don't know for sure that it'll 
work.  The previous method of matching a string, storing it, and looking for 
repetitions of that string would be preferable, but I don't know how to do 
that with SA.
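
Outside of SA, the match/store/count idea is easy enough to sketch.  The 
following is a standalone Python illustration only, not SA rule or plugin 
code; the function name and both thresholds are my own assumptions (the 
numbers just echo the ones floated above), and inside SA the same logic would 
presumably have to live in a plugin:

```python
import re
from collections import Counter

# Assumed thresholds, echoing the numbers discussed above; tune as needed.
MIN_TOKEN_LEN = 25   # shortest alphanumeric run treated as a possible hashcode
MIN_REPEATS = 10     # how many identical copies count as suspicious

# A run of MIN_TOKEN_LEN or more alphanumerics that is not part of a longer run.
LONG_TOKEN = re.compile(
    r"(?<![A-Za-z0-9])[A-Za-z0-9]{%d,}(?![A-Za-z0-9])" % MIN_TOKEN_LEN
)

def repeated_hashcodes(body, min_repeats=MIN_REPEATS):
    """Return {token: count} for long tokens seen at least min_repeats times."""
    # "Store" every long token in a counter, then keep only the frequent ones.
    counts = Counter(LONG_TOKEN.findall(body))
    return {tok: n for tok, n in counts.items() if n >= min_repeats}
```

This makes a single pass over the body and never backtracks across the filler 
words, so it should avoid the CPU cost I'd worry about with the backreference 
version; a non-empty result would be the "hit".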

--- Amir

