Re: SpamAssassin Ruleset Generation

Matt Kettler Tue, 06 Oct 2009 18:47:49 -0700

poifgh wrote:
> I have a question about - understanding how are rulesets generated for
> spamassassin.
>
> For example - consider the rule in 20_drugs.cf : 
> header SUBJECT_DRUG_GAP_C       Subject =~
> /\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b/i
> describe SUBJECT_DRUG_GAP_C     Subject contains a gappy version of 'cialis'
>
> Who generated the regular expression
> "/\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b/i"
>   
Man, that's a good question. I wrote a large chunk of the rules in
20_drugs.cf, but not that one. (I wrote the stuff near the bottom that
uses meta rules, i.e. __DRUGS_ERECTILE1 through DRUGS_MANYKINDS,
originally distributed as a separate set called antidrug.cf.) As I
recall, there were 2 other people making drug rules, but it's been a
LONG time, and I forget who did it. Those rules were written in the
2004-2006 time frame when pharmacy spams were just hammering the heck
outa everyone.


> a. Is it done manually with people writing regex to see how efficiently they
> capture spams?
>   
Yes. Many hours of reading spams, studying them, testing various regex
tweaks, checking for false positives, etc, etc.

mass-check is your friend for this kind of stuff.
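
To make that test-and-check loop concrete, here is a minimal Python sketch that counts how often a candidate regex hits a few spam and ham subjects. It is a toy stand-in for the idea only and says nothing about how mass-check itself works. The sample subjects and the hit_counts helper are made up for illustration, and Python's re module is assumed to treat this simple pattern the same way Perl does.

```python
import re

def hit_counts(pattern, spam, ham):
    """Return (spam_hits, ham_hits) for a compiled regex over two lists of strings."""
    spam_hits = sum(1 for s in spam if pattern.search(s))
    ham_hits = sum(1 for s in ham if pattern.search(s))
    return spam_hits, ham_hits

# Made-up sample subjects, purely for illustration (not from a real corpus).
SPAM = ["Buy c-i-a-l-i-s now", "C.I.A.L.I.S cheap", "ciaalis here"]
HAM = ["CIA lies exposed", "Specialist appointment", "Lunch on Friday?"]

# The SUBJECT_DRUG_GAP_C regex quoted in the question.
GAPPY_CIALIS = re.compile(r"\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b", re.I)

print(hit_counts(GAPPY_CIALIS, SPAM, HAM))  # (3, 1)
```

The single ham hit is "CIA lies exposed", which is the kind of false positive you only find by running a rule against real ham.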

One post from when I was developing this as a stand-alone set:

http://mail-archives.apache.org/mod_mbox/spamassassin-users/200404.mbox/%3c6.0.0.22.0.20040428132346.029d9...@opal.evi-inc.com%3e

Note: the comcast link mentioned in that message should be considered
DEAD. The antidrug set is no longer maintained separately from the
mainline ruleset, and hasn't been for years.


If you want to break the rules down a bit, here are some tips:

The rules are in general designed to detect common methods of obscuring
text by inserting spaces, punctuation, etc. between letters, and possibly
substituting similar-looking characters for some of the letters
(W4R3Z style, etc.).

The simple format would be to think of it in groupings. You end up using
a repeating pattern of (some representation of a character)(some kind of
"gap" sequence)(character)(gap)...etc.

.{0,2} is a "gap sequence", although not one I prefer. I prefer
[_\W]{0,3} in most cases because it's a bit less FP-prone (fewer false
positives), but it risks missing spam that uses small lower-case letters
as the gap filler.
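
The (character)(gap)(character)(gap) idea and the trade-off between the two gap styles can be sketched in Python. This is only an illustration: the gappy helper and the test strings are hypothetical, and Python's re is assumed to be close enough to Perl's regex engine for these patterns.

```python
import re

def gappy(word, gap):
    """Build a word-boundary-anchored regex with `gap` between each letter of `word`."""
    body = gap.join(re.escape(ch) for ch in word)
    return re.compile(r"\b" + body + r"\b", re.I)

DOT_GAP = gappy("cialis", r".{0,2}")          # gap used by SUBJECT_DRUG_GAP_C
NONWORD_GAP = gappy("cialis", r"[_\W]{0,3}")  # the gap style preferred above

SAMPLES = [
    "c-i-a-l-i-s",               # punctuation gaps: both styles match
    "cxixaxlxixs",               # lower-case letters as gaps: only .{0,2} matches
    "Chiba lies east of Tokyo",  # innocent text: only .{0,2} matches (an FP)
]

for text in SAMPLES:
    print(text, bool(DOT_GAP.search(text)), bool(NONWORD_GAP.search(text)))
```

The second sample is the miss that the non-word gap risks, and the third is the sort of false positive it avoids.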

You also get replacements for characters in some of those, like [A4]
instead of just A. Or, more elaborately..  [a4\xe0-\...@]

So this mess:

body __DRUGS_ERECTILE1  
/(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[ij1!|l\xEC\xED\xEE\xEF][_\W]{0,3}[a40\xe0-\...@][_\W]{0,3}[xyz]?[gj][_\W]{0,3}r[_\W]{0,3}[a40\xe0-\...@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)/i


Could be broken down:

(?:\b|\s)   - preamble, detecting space or word boundary.
[_\W]{0,3}   - gap
(?:\\\/|V)   - V
[_\W]{0,3}   - gap
[ij1!|l\xEC\xED\xEE\xEF] - I
[_\W]{0,3}   - gap
[a40\xe0-\...@]   - A
[_\W]{0,3}   - gap
[xyz]?[gj]   - G (with optional extra garbage before it)
[_\W]{0,3}   - gap
r            - just R :-)
[_\W]{0,3}   - gap
[a40\xe0-\...@]   - A
[_\W]{0,3}   - gap
x?           - optional garbage
[_\W]{0,3}   - gap
(?:\b|\s)    - suffix, detecting space or word boundary.

Which detects weird spacings and substitutions in the word Viagra.
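
The breakdown above can be reassembled piece by piece, which makes it easier to try variations. The following Python sketch is a hedged approximation and not the shipped rule: the "A" class is simplified to [a4@] because the accented-character range is truncated in the archived message, the names GAP, EDGE, A, PIECES and VIAGRA are hypothetical, and Python matches on Unicode strings where SpamAssassin's Perl regexes see the message body differently.

```python
import re

GAP = r"[_\W]{0,3}"   # up to three underscores / non-word characters
EDGE = r"(?:\b|\s)"   # preamble / suffix: whitespace or word boundary
A = r"[a4@]"          # ASSUMPTION: simplified stand-in for the truncated "A" class

PIECES = [
    EDGE, GAP,
    r"(?:\\\/|V)", GAP,                 # V, or \/ drawn with two slashes
    r"[ij1!|l\xEC\xED\xEE\xEF]", GAP,   # I and look-alikes
    A, GAP,
    r"[xyz]?[gj]", GAP,                 # G, with optional garbage before it
    r"r", GAP,
    A, GAP,
    r"x?", GAP,                         # optional trailing garbage
    EDGE,
]
VIAGRA = re.compile("".join(PIECES), re.I)

for text in ["V-I-A-G-R-A", "buy v1agra today", r"cheap \/iagra",
             "via grapevine", "viaduct"]:
    print(repr(text), bool(VIAGRA.search(text)))
```

The first three samples match and the last two do not. "via grapevine" is rejected only because of the suffix piece, which is why the preamble and suffix are worth keeping.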


> But how are the rules generated themselves? 
>   
Mostly meatware, except the sought rules others have mentioned.
> Thnx
>
