Giampaolo Tomassoni writes:
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, June 18, 2008 12:10 PM
> > To: John GALLET
> > Cc: users@spamassassin.apache.org
> > Subject: Re: [Rule Set proposal] French Rules
> > 
> > ...omitted...
> >
> > by the way, if you're reasonably Perl-capable, it might be worthwhile
> > using the algorithm I use to generate the JM_SOUGHT ruleset for English
> > spam: http://taint.org/tag/rule-discovery
> > 
> > you just give it a corpus of spam samples and it generates the rules for
> > you.  The code is in SpamAssassin SVN.
> > 
> > --j.
> 
> Nah, that's great!
> 
> I regret I can only occasionally read interesting messages due to my own
> time constraints. I could have read about this set of scripts weeks ago,
> otherwise...
> 
> How is this code supposed to be used? I see these scripts in rule-dev:
> maildir-scan-headers, seek-phrases-in-corpus, seek-phrases-in-log and
> strip-high-scorers-from-log.
> 
> Could you give us a brief description of what they do and how to use them?

Basically, you collect two corpora:

1. a big corpus of ham samples, i.e. mail that the rules must not match.

2. a smaller corpus of spam samples.

You run "seek-phrases-in-corpus" over the two corpora, and it'll spit out
the patterns; you can then write rules based on these.
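
To make the two-corpora idea concrete, here is a minimal Python sketch. It is
not the algorithm "seek-phrases-in-corpus" implements (that is Perl code in
SpamAssassin SVN and is not reproduced here); it only illustrates the general
principle of keeping phrases that recur in the spam corpus and never appear
in the ham corpus. The function names, the fixed phrase length and the
"min_spam_hits" threshold are all assumptions made for the example.

```python
import re
from collections import Counter


def phrases(text, n=3):
    """Return the set of n-word phrases in one message body (lowercased)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seek_phrases(spam_corpus, ham_corpus, n=3, min_spam_hits=2):
    """Hypothetical helper: phrases found in at least min_spam_hits spam
    messages and in no ham message at all."""
    ham_seen = set()
    for msg in ham_corpus:
        ham_seen |= phrases(msg, n)

    spam_hits = Counter()
    for msg in spam_corpus:
        # phrases() returns a set, so this counts messages, not occurrences.
        spam_hits.update(phrases(msg, n))

    keep = [p for p, hits in spam_hits.items()
            if hits >= min_spam_hits and p not in ham_seen]
    # Most frequent first, ties broken alphabetically.
    return sorted(keep, key=lambda p: (-spam_hits[p], p))


spam = [
    "Buy cheap pills online today",
    "You can buy cheap pills online now",
]
ham = [
    "Shall we buy lunch online today?",
]
print(seek_phrases(spam, ham))
# prints ['buy cheap pills', 'cheap pills online']
```

The ham corpus acts purely as a veto here, which is why it pays to make it
large: every extra ham message can only remove candidate phrases.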

Alternatively, run "mass-check" and "seek-phrases-in-log" directly, as
"seek-phrases-in-corpus" itself does, to get a bit more control (and to
generate real SpamAssassin rules).  That's what the JM_SOUGHT scripts do.
See below:

  http://taint.org/x/2008/seekrules_run

The seekrules_run script also calls "mk_meta_rule", which is here:
http://taint.org/x/2008/mk_meta_rule
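
As a rough illustration of what "real SpamAssassin rules" built from such
patterns can look like, the Python sketch below turns a list of phrases into
unscored "body" sub-rules (names starting with "__") plus one scored "meta"
rule that fires if any of them hits. The output format of the actual
seekrules_run and mk_meta_rule scripts is not shown in this message, so the
rule names, the description text and the score are placeholders, not what
JM_SOUGHT uses.

```python
import re


def make_rules(patterns, prefix="__EXAMPLE_SEEK", meta_name="EXAMPLE_SOUGHT",
               score=1.0):
    """Hypothetical helper: render phrases as SpamAssassin rule lines."""
    if not patterns:
        raise ValueError("need at least one pattern to build a meta rule")

    lines = []
    sub_names = []
    for i, pat in enumerate(patterns, start=1):
        name = f"{prefix}_{i}"
        sub_names.append(name)
        # Backslash-escape punctuation (including '/') so the phrase is
        # matched literally inside the /.../ regex delimiters.
        regex = re.sub(r"[^\w\s]", lambda m: "\\" + m.group(0), pat)
        lines.append(f"body {name} /{regex}/")

    # One scored meta rule that is true if any sub-rule matched.
    lines.append(f"meta {meta_name} ({' || '.join(sub_names)})")
    lines.append(f"describe {meta_name} Body contains a phrase found by corpus scan")
    lines.append(f"score {meta_name} {score:.1f}")
    return lines


rules = make_rules(["buy cheap pills", "50% off"])
print("\n".join(rules))
```

Keeping the per-phrase rules as "__" sub-rules means only the single meta
rule carries a score, so a message matching several phrases is not counted
several times.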

--j.
