poifgh wrote:
> I have a question about - understanding how are rulesets generated for
> spamassassin.
>
> For example - consider the rule in 20_drugs.cf :
> header SUBJECT_DRUG_GAP_C Subject =~
> /\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b/i
> describe SUBJECT_DRUG_GAP_C Subject contains a gappy version of 'cialis'
>
> Who generated the regular expression
> "/\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b/i"
>

Man, that's a good question. I wrote a large chunk of the rules in 20_drugs.cf, but not that one. (I wrote the stuff near the bottom that uses meta rules, i.e. __DRUGS_ERECTILE1 through DRUGS_MANYKINDS, originally distributed as a separate set called antidrug.cf.)

As I recall, there were 2 other people making drug rules, but it's been a LONG time, and I forget who did it. Those rules were written in the 2004-2006 time frame when pharmacy spams were just hammering the heck outa everyone.

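For anyone who wants to poke at the quoted regex directly, here is a minimal sketch. SpamAssassin rules are Perl regexes; Python's re module is used here only as a stand-in, which is fine for a pattern this simple. The Subject lines are made up for illustration and are not from any corpus.

```python
import re

# Regex copied from the quoted SUBJECT_DRUG_GAP_C rule.
# The trailing /i in the rule corresponds to re.IGNORECASE here.
GAPPY_CIALIS = re.compile(
    r"\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b", re.IGNORECASE
)

# Hypothetical Subject lines, invented for this example.
subjects = [
    "Cheap C.i.a.l.i.s here",
    "c i a l i s today",
    "CIALIS",
    "Meeting notes for Tuesday",
]

for subject in subjects:
    hit = GAPPY_CIALIS.search(subject) is not None
    print(hit, "-", subject)
```

The first three subjects hit and the last one does not. This only shows what the pattern matches; it says nothing about how the rule scores against real mail, which is what mass-check is for.
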
> a. Is it done manually with people writing regex to see how efficiently they
> capture spams?
>

Yes. Many hours of reading spams, studying them, testing various regex tweaks, checking for false positives, etc. mass-check is your friend for this kind of stuff.

One post from when I was developing this as a stand-alone set:
http://mail-archives.apache.org/mod_mbox/spamassassin-users/200404.mbox/%3c6.0.0.22.0.20040428132346.029d9...@opal.evi-inc.com%3e

Note: the comcast link mentioned in that message should be considered DEAD. The antidrug set is no longer maintained separately from the mainline ruleset, and hasn't been for years.

If you want to break the rules down a bit, here are some tips:

The rules are in general designed to detect common methods of obscuring text by inserting spaces, punctuation, etc. between letters, and possibly substituting similar-looking characters for some of the letters (W4R3Z style, etc.).

The simple way is to think of it in groupings. You end up using a repeating pattern of (some representation of a character)(some kind of "gap" sequence)(character)(gap)...etc.

.{0,2} is a "gap sequence", although not one I prefer. I prefer [_\W]{0,3} in most cases because it's a bit less FP-prone, but it risks missing spams that use small lower-case letters as the gap.

You also get replacements for characters in some of those, like [A4] instead of just A. Or, more elaborately: [a4\xe0-\...@]

So this mess:

body __DRUGS_ERECTILE1 /(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[ij1!|l\xEC\xED\xEE\xEF][_\W]{0,3}[a40\xe0-\...@][_\w]{0,3}[xyz]?[gj][_\W]{0,3}r[_\W]{0,3}[a40\xe0-\...@][_\w]{0,3}x?[_\W]{0,3}(?:\b|\s)/i

Could be broken down:

(?:\b|\s) - preamble, detecting space or word boundary.
[_\W]{0,3} - gap
(?:\\\/|V) - V
[_\W]{0,3} - gap
[ij1!|l\xEC\xED\xEE\xEF] - I
[_\W]{0,3} - gap
[a40\xe0-\...@] - A
[_\W]{0,3} - gap
[xyz]?[gj] - G (with optional extra garbage before it)
[_\W]{0,3} - gap
r - just R :-)
[_\W]{0,3} - gap
[a40\xe0-\...@] - A
[_\W]{0,3} - gap
x? - optional garbage
[_\W]{0,3} - gap
(?:\b|\s) - suffix, detecting space or word boundary.

Which detects weird spacings and substitutions in the word Viagra.

> But how are the rules generated themselves?
>

Mostly meatware, except the sought rules others have mentioned.

> Thnx
>
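
The (character)(gap)(character)(gap) grouping described above can be sketched in a few lines. This is a rough illustration, not how the actual rules were produced: the gappy() helper and the test strings are hypothetical, Python's re stands in for Perl regexes, and the substitution classes are small ASCII-only subsets of the ones in the real rule. It also shows the trade-off between the two gap styles.

```python
import re

def gappy(chars, gap):
    """Join per-character sub-patterns with a gap sub-pattern between them.

    Hypothetical helper for illustration; the real rules were written by hand.
    """
    return re.compile(r"\b" + gap.join(chars) + r"\b", re.IGNORECASE)

# Same word, two gap styles. The first reproduces the SUBJECT_DRUG_GAP_C regex.
dot_gap = gappy(list("cialis"), r".{0,2}")
nonword_gap = gappy(list("cialis"), r"[_\W]{0,3}")

# Made-up test strings, not taken from any corpus.
samples = [
    "c_i_a_l_i_s",            # punctuation gaps: both styles should hit
    "cxixaxlxixs",            # lower-case letters as gaps: only .{0,2} hits
    "Being cynical is easy",  # ordinary text: .{0,2} hits "cynical is" (an FP)
]

for text in samples:
    print(bool(dot_gap.search(text)), bool(nonword_gap.search(text)), "-", text)

# Character substitutions: swap a plain letter for a class of lookalikes.
subst = gappy(["c", "[ij1!|l]", "[a4]", "l", "[ij1!|l]", "s"], r"[_\W]{0,3}")
print(bool(subst.search("C-1-4-L-1-S")))
```

The "cynical is" case is the kind of false positive that makes .{0,2} riskier: any character counts as a gap, so ordinary words can fill the holes. [_\W]{0,3} avoids that, at the cost of missing the "cxixaxlxixs" style.
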