On Thu, 5 Mar 2009, decoder wrote:
John Hardin wrote:
Would there be any benefit to having an offline version - i.e.
something that evaluates the log or a corpus to generate new meta
rules, that could be added onto the default ruleset? For instance:
cron @ 0200:
sa_meta_eval > /etc/mail/spamassassin/metarules.cf
/etc/init.d/spamassassin restart
This is definitely a good idea. You can create the SVM model offline from
a logfile alone, as long as it includes the rules that hit and the ham/spam
status.
From my /var/log/maillog:
Mar 1 04:22:22 ga spamd[30536]: spamd: result: Y 46 - BAYES_99,BAYES_POISON_02,DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_POST,FORGED_MUA_OUTLOOK,FORGED_OUTLOOK_HTML,FORGED_OUTLOOK_TAGS,FORGED_RCVD_HELO,FREEMAIL_FROM,FROM_ILLEGAL_CHARS,HTML_40_50,HTML_FONT_INVISIBLE,HTML_MESSAGE,L_SOME_STD_PROBS,MIME_BOUND_DD_DIGITS,MIME_HTML_ONLY,MIME_HTML_ONLY_MULTI,MISSING_MIMEOLE,RBL_PSBL_01,RCVD_BY_IP,RCVD_DOUBLE_IP_SPAM,RCVD_HELO_IP_MISMATCH,RCVD_NUMERIC_HELO,SARE_RECV_IP_FROMIP1,SPF_SOFTFAIL,SUBJ_ILLEGAL_CHARS,UNPARSEABLE_RELAY,UPPERCASE_50_75 scantime=9.4,size=3150,user=root,uid=99,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=40282,mid=<KGNPKZIWNMHBPXAQXUKDUCk.uqkkajgreg_ji...@msn.com>,bayes=1,autolearn=disabled
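For what it's worth, here is a rough Python sketch of how a result line like that could be turned into a training record (spam/ham flag, score, rules hit, message ID). It assumes each result sits on a single line in the log and that the flag is "Y" for spam and "." for ham. The function name and the sample line (including its message ID) are made up for illustration.

```python
import re

# Matches spamd result lines of the form:
#   spamd: result: Y 46 - RULE_A,RULE_B scantime=9.4,...,mid=<...>,...
# Assumption: the flag is "Y" for spam and "." for ham.
RESULT_RE = re.compile(
    r"spamd: result: (?P<flag>[Y.]) (?P<score>-?\d+) - "
    r"(?P<rules>\S*) (?P<fields>\S+)"
)


def parse_result_line(line):
    """Return a dict describing one spamd result line, or None if no match."""
    m = RESULT_RE.search(line)
    if m is None:
        return None
    # The rule list is comma-separated and may be empty.
    rules = [r for r in m.group("rules").split(",") if r]
    mid = re.search(r"mid=(<[^>]*>)", m.group("fields"))
    return {
        "is_spam": m.group("flag") == "Y",
        "score": int(m.group("score")),
        "rules": rules,
        "mid": mid.group(1) if mid else None,
    }


# Shortened, made-up sample line for illustration only.
sample = (
    "Mar 1 04:22:22 ga spamd[30536]: spamd: result: Y 46 - "
    "BAYES_99,HTML_MESSAGE,SPF_SOFTFAIL "
    "scantime=9.4,size=3150,user=root,mid=<example@example.invalid>,bayes=1"
)
record = parse_result_line(sample)
```

Lines that are not result lines simply come back as None, so the whole maillog can be fed through it.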
Unfortunately, using only the log won't let you address FPs and FNs, so in
addition to the log you'd need to be able to scan corpora. I'd suggest that
you do both, and have it prefer the per-message spam/ham status from the
corpora over the spam/ham status from the log (matching by message ID, of
course).
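A minimal sketch of that preference, assuming the log has already been reduced to records with "mid", "is_spam" and "rules" keys and the corpus scan yields a message-ID-to-label mapping. Both data shapes and the function name are my own assumptions, not anything SA provides.

```python
def merge_labels(log_records, corpus_labels):
    """Pair each record's rule list with its best-known spam/ham label.

    log_records:   iterable of dicts with "mid", "is_spam", "rules"
                   (hypothetical shape).
    corpus_labels: dict mapping message ID -> True (spam) / False (ham),
                   taken from hand-sorted corpora.
    """
    merged = []
    for rec in log_records:
        # The corpus label wins when present; otherwise trust the log.
        label = corpus_labels.get(rec["mid"], rec["is_spam"])
        merged.append((rec["rules"], label))
    return merged


# Made-up example: the log scored <b@example.invalid> as ham, but the
# corpus says it was really spam (a false negative).
log_records = [
    {"mid": "<a@example.invalid>", "is_spam": True, "rules": ["BAYES_99"]},
    {"mid": "<b@example.invalid>", "is_spam": False, "rules": ["HTML_MESSAGE"]},
]
corpus_labels = {"<b@example.invalid>": True}
training = merge_labels(log_records, corpus_labels)
```

That way a corrected FP or FN in the corpus overrides whatever spamd decided at scan time.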
However, you cannot generate metarules with SVMs; for that purpose you
need a different learning algorithm (for example Bayes, or decision
trees).
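To make the offline generator idea concrete, here is a deliberately naive sketch. It is plain co-occurrence counting, not Bayes or a decision tree: it emits a meta rule for every pair of rules that hit together in at least min_spam_hits spam messages and never together in ham. The AUTO_META_n names, the threshold and the 0.5 score are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def generate_meta_rules(samples, min_spam_hits=2, score=0.5):
    """Return SpamAssassin config lines for rule pairs seen only in spam.

    samples: iterable of (rules_hit, is_spam) tuples.
    """
    spam_pairs, ham_pairs = Counter(), Counter()
    for rules, is_spam in samples:
        # Count each unordered pair of distinct rules once per message.
        pairs = combinations(sorted(set(rules)), 2)
        (spam_pairs if is_spam else ham_pairs).update(pairs)

    lines = []
    n = 0
    for (a, b), count in sorted(spam_pairs.items()):
        if count >= min_spam_hits and ham_pairs[(a, b)] == 0:
            n += 1
            name = f"AUTO_META_{n}"  # hypothetical naming scheme
            lines.append(f"meta {name} ({a} && {b})")
            lines.append(f"score {name} {score}")
    return lines


# Made-up training data for illustration.
samples = [
    (["A", "B"], True),
    (["A", "B", "C"], True),
    (["A", "C"], False),
]
meta_lines = generate_meta_rules(samples)
```

A real generator would want a proper learner and some care about overlapping scores, but the output would still be ordinary meta and score lines that any SA install can load.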
However, SVM classification is very cheap, so once you have created the
model offline, you can use it online really quickly with a plugin.
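To illustrate why classification is cheap, assume a linear kernel (the kernel is not specified here, so that is my assumption) and binary features, one per rule. Classifying a message then reduces to summing the learned weights of the rules that hit and adding a bias. The weights and bias below are invented numbers, not a trained model.

```python
def svm_is_spam(hit_rules, weights, bias):
    """Linear SVM decision: sign of (w . x + b) with x a 0/1 rule vector."""
    # Only rules that hit contribute, so the cost is one lookup per hit.
    decision = bias + sum(weights.get(rule, 0.0) for rule in hit_rules)
    return decision > 0


# Hypothetical weights, as if exported from an offline training run.
weights = {"BAYES_99": 1.5, "HTML_MESSAGE": 0.5, "SPF_PASS": -1.0}
bias = -1.0

spam_verdict = svm_is_spam(["BAYES_99", "HTML_MESSAGE"], weights, bias)
ham_verdict = svm_is_spam(["HTML_MESSAGE"], weights, bias)
```

With a non-linear kernel the evaluation costs more, since it involves the support vectors rather than a single weight vector.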
Then perhaps we're looking at two different but related tools, a plugin
for SVMs and an offline static meta rule generator. They may be
complementary, or they may be different ways to achieve similar results.
Personally I know I'd be more comfortable (at least at this point) running
an offline metarule generator as part of my nightly bayes training script
than I would be in adding another plugin, which is why I brought it up.
Add to that, the offline meta rule generator would be useful in older SA
installs that might not support a plugin written to the current API...
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79