[SAtalk] bayes, spamd, and future of per-user/per-system bayes

Liudvikas Bukys Wed, 30 Oct 2002 12:50:11 -0800

Bug?  The bayes code in 2.50 doesn't get invoked from spamd because there
is no hook from handle_user to [re]open the bayes databases.
I have to think this is an oversight, but I thought I'd better ask.
* Should spamd do this?


The learn code is a bit slow and if the authors are open to code
submissions I'll submit something to help.  Having experimented heavily
with Graham-inspired code, I have to say it's really fast, works
really well (for me) at having tolerable false negatives but NO false
positives (but then my corpora have grown to 10000 spam and 10000 ham),
and it also fills two important gaps for me:
        - fixes the persistent false positives I get
          (e.g. all ietf-announce traffic scores 6.6 consistently)
        - has a comprehensible end-user interface (just provide examples)
* Conclusion: Bayes -- good.

Regarding the future, I am hoping that this is heading toward support
for multiple layers of bayes databases: a system-wide one (probably
large corpus, hence large database), PLUS per-user ones (probably
small corpus, but important to the user).
* Any intention here among the main developers?  Open to new code?
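
A rough sketch of the layering idea, in Python rather than SA's Perl.
Everything here is hypothetical: the names (layered_counts,
token_spam_prob), the in-memory dicts standing in for the databases,
and the user_weight parameter are made up for illustration and are not
part of SpamAssassin.  It also ignores normalizing for corpus size.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for a bayes database: token -> (spam_count, ham_count)
Counts = Dict[str, Tuple[int, int]]


def layered_counts(token: str, system_db: Counts, user_db: Counts,
                   user_weight: float = 5.0) -> Tuple[float, float]:
    """Merge system-wide and per-user counts for one token.

    user_weight is an assumed knob: each per-user example counts as
    this many system-wide examples, so a small personal corpus can
    still override a large shared one.
    """
    sys_spam, sys_ham = system_db.get(token, (0, 0))
    usr_spam, usr_ham = user_db.get(token, (0, 0))
    return (sys_spam + user_weight * usr_spam,
            sys_ham + user_weight * usr_ham)


def token_spam_prob(token: str, system_db: Counts, user_db: Counts,
                    user_weight: float = 5.0, unknown: float = 0.5) -> float:
    """Naive per-token spam probability from the merged counts."""
    spam, ham = layered_counts(token, system_db, user_db, user_weight)
    total = spam + ham
    if total == 0:
        return unknown  # token seen in neither layer
    return spam / total


if __name__ == "__main__":
    system_db = {"ietf": (8, 2)}   # looks spammy system-wide
    user_db = {"ietf": (0, 4)}     # but this user filed it as ham
    print(token_spam_prob("ietf", system_db, {}))
    print(token_spam_prob("ietf", system_db, user_db))
```

With the per-user layer in place, a token that looks spammy in the
system-wide database can be pulled back below 0.5 by a handful of
examples from the user, which is the behavior I am after.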

Regarding Bayes scoring, if Graham's formula becomes an option, you
might as well have only two possible scores, as the output is extremely
bimodal.  Robinson's formula spreads the probability distribution more
widely.  Score extrema of +/-4.0 seem really high, but I haven't seen
any bad results.  You'd have to treat Robinson and Graham formulas as
different tests altogether for GA score determination.  If the training
corpus is loaded up with a lot of cases that SA gets wrong, then the GA
may train the Bayes scores to the right levels.  If the training
corpus is too easy, then it will score Bayes too low because it'll seem
that Bayes has little to contribute.  I'll bet that leaving Bayes out
of GA is the right answer for now, but it would be interesting to see
what it comes up with.
* That's my two cents.
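
To illustrate the bimodal-versus-spread point, here is a minimal
Python sketch of the two combining rules as they are commonly
described: Graham's naive-Bayes product and Robinson's geometric-mean
form.  This is not the 2.50 code; the function names and the 0.01/0.99
clamp are my assumptions, and the token probabilities are made up.

```python
import math


def clamp(p, lo=0.01, hi=0.99):
    """Keep probabilities away from 0 and 1 (assumed limits)."""
    return min(hi, max(lo, p))


def graham_combine(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p))."""
    ps = [clamp(p) for p in probs]
    spam = math.prod(ps)
    ham = math.prod(1.0 - p for p in ps)
    return spam / (spam + ham)


def robinson_combine(probs):
    """Robinson-style combining using geometric means.

    P = 1 - (prod(1 - p)) ** (1/n)
    Q = 1 - (prod(p)) ** (1/n)
    S = (P - Q) / (P + Q), rescaled from [-1, 1] to [0, 1].
    Assumes a non-empty list of token probabilities.
    """
    ps = [clamp(p) for p in probs]
    n = len(ps)
    P = 1.0 - math.prod(1.0 - p for p in ps) ** (1.0 / n)
    Q = 1.0 - math.prod(ps) ** (1.0 / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0


if __name__ == "__main__":
    # Made-up token probabilities for a mildly spammy message.
    probs = [0.9, 0.8, 0.7, 0.6, 0.4]
    print("graham  :", round(graham_combine(probs), 3))
    print("robinson:", round(robinson_combine(probs), 3))
```

For these five mildly spammy tokens the Graham form comes out at about
0.99 while the Robinson form is about 0.68, which is why the two would
need separate GA scores.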


P.S.  Food for thought.  I have noticed some recent spam including
unusual random vocabulary words in the message; the spammers must be
watching and experimenting too.  With a lot of this, simple
vocabulary-based Bayes will weaken (especially if spammers draw their
extreme words from a default distribution!).  I think the next ply will
include phrases or other more semantic features, but there are space
tradeoffs.

