On Feb 12, 2014, at 1:15 PM, John Hardin <jhar...@impsec.org> wrote:
> Bayes.

Well, yes and no. Bayes isn't very good at detecting this kind of thing per se, because the spam is full of random crap. In fact, the spammers deliberately pull text from innocuous sources like web reviews, movie reviews, news articles, etc. in the hope that it contains a lot of hammy tokens that will negate the spammy ones.

On the other hand, there's no really good way of detecting "lots of garbage filler text" without a natural language algorithm that could heuristically determine whether the primary content (as determined by subject, etc.) is related to the filler, and I don't think any such algorithm exists. Bayes provides a way of distilling the garbage into tokens and sifting through it objectively, so it's the best option, but I wouldn't say it's a method of "detecting" this kind of thing.

That said, this particular spam template is interspersed with some sort of hashcode which is repeated a number of times. It might be possible to write a rule that matches a long (20-30 chars) alphanumeric string and counts repetitions; if the same long string is repeated more than (say) 10 times, it's a good bet that it's an embedded spammy hashcode.

I'd write an example rule, but I don't know how to store a regexp match from one test to see whether it matches in another test. That is, writing a regexp and using tflags multiple on it would be fine if we wanted it to hit on 10 or more long strings even when those strings differ from each other. But if we want to know whether there are 10 or more long strings that are identical, we have to store the first match somehow, and I don't know how to do that with SA.

If SA allows backreferences (since Perl does), then something like the following MIGHT work:

    rawbody AC_REPEATED_HASHCODE /(\s[A-Za-z0-9]{25,}\s)(?:(?:\s*\w+)+\1){10}/

This looks for a string of 25 or more alphanumeric characters set off by whitespace, followed by 10 more repetitions of that same string, each preceded by one or more words. I suspect it would be a horrible CPU hog, though, since the nested quantifier in (?:\s*\w+)+ invites a lot of backtracking.

The regexp rule is untested, so I don't know for sure whether it will work, and I doubt it would be very friendly to the CPU. Matching a string, storing it, and then looking for repetitions of that exact string, as described earlier, would be preferable, but I don't know how to do that with SA.

--- Amir
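
P.S. To make the "match a string, store it, count repetitions" idea concrete, here is a minimal sketch in plain Python. It is not an SA rule or plugin, and it does not touch SA's API at all; it only illustrates the counting logic. The names LONG_TOKEN and repeated_long_tokens and both thresholds are my own assumptions, taken from the numbers floated above (25+ characters, 10 or more occurrences).

```python
import re
from collections import Counter

# Assumed thresholds, borrowed from the numbers discussed in this message.
MIN_TOKEN_LEN = 25   # shortest alphanumeric run treated as a possible hashcode
MIN_REPEATS = 10     # occurrences of the same run needed to flag it

# One pass, no backreferences: grab every long alphanumeric run.
# The greedy quantifier always consumes the whole run.
LONG_TOKEN = re.compile(r"[A-Za-z0-9]{%d,}" % MIN_TOKEN_LEN)


def repeated_long_tokens(body, min_repeats=MIN_REPEATS):
    """Return {token: count} for long tokens seen at least min_repeats times."""
    counts = Counter(LONG_TOKEN.findall(body))
    return {tok: n for tok, n in counts.items() if n >= min_repeats}


if __name__ == "__main__":
    hashcode = "a1B2c3D4e5F6g7H8i9J0k1L2m3"  # made-up 26-character example
    body = " some filler text ".join([hashcode] * 12)
    print(repeated_long_tokens(body))
```

Because each token is stored and counted in a dictionary rather than re-matched by a backreference, this makes a single linear pass over the body and sidesteps the backtracking worry. Whether the same thing can be done from inside SA is still the open question.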