Re: [SAtalk] bayes, spamd, and future of per-user/per-system bayes

Justin Mason Wed, 30 Oct 2002 14:10:45 -0800

Liudvikas Bukys said:

> Bug?  The bayes code in 2.50 doesn't get invoked from spamd because there
> is no hook from handle_user to [re]open the bayes databases.
> I have to think this is an oversight, but I thought I'd better ask.
> * Should spamd do this?


This should now be fixed in CVS...

> The learn code is a bit slow and if the authors are open to code
> submissions I'll submit something to help.  Having experimented heavily
> with Graham-inspired code, I have to say it's really fast, works
> really well (for me) at having tolerable false negatives but NO false
> positives (but then my corpora have grown to 10000 spam and 10000 ham),
> and it also fills two important gaps for me:
>       - fixes the persistent false positives I get
>         (e.g. all ietf-announce traffic scores 6.6 consistently)

So that would be 2.6 with the BAYES_0* score added  ;)

>       - has a comprehensible end-user interface (just provide examples)
> * Conclusion: Bayes -- good.

Well, the code is currently optimised for fast checking, at the cost of
slow learning.

Actually I think the recompute_all_probabilities() step is what takes so
long, and there are a few places where I call that implicitly -- like when
the learner process finishes.  Those calls could be made explicit; then the
instructions to users would be:

        - train on your ham collection

        - train on your spam

        - and run sa-rebuild-probs (or whatever)
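
A rough illustration of that workflow, as a hypothetical Python sketch rather
than SpamAssassin's actual code (the TokenStore class, its methods and the
simple probability formula are all assumptions made for the example):
learning only bumps token counts, and the expensive probability pass runs
once, explicitly, at the end.

```python
from collections import Counter


class TokenStore:
    """Hypothetical token store: cheap learning, explicit probability rebuild."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.nspam = 0
        self.nham = 0
        self.probs = {}      # filled only by rebuild_probs()
        self.dirty = False   # True when counts changed since the last rebuild

    def learn(self, tokens, is_spam):
        # Learning only updates counts; no probabilities are recomputed here.
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(set(tokens))
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        self.dirty = True

    def rebuild_probs(self):
        # The slow step, run once after all training (the "sa-rebuild-probs" idea).
        self.probs = {}
        for tok in set(self.spam_counts) | set(self.ham_counts):
            s = self.spam_counts[tok] / max(self.nspam, 1)
            h = self.ham_counts[tok] / max(self.nham, 1)
            self.probs[tok] = s / (s + h)
        self.dirty = False


store = TokenStore()
store.learn(["meeting", "agenda"], is_spam=False)   # train on ham
store.learn(["viagra", "agenda"], is_spam=True)     # train on spam
store.rebuild_probs()                               # explicit final step
```

The dirty flag lets a checker notice that counts have changed since the last
rebuild, so it can warn or rebuild instead of silently using stale
probabilities.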

> Regarding the future, I am hoping that where this is going is the
> ability to support multiple layers of bayes databases; a system-wide
> one (probably large corpus, hence large database), PLUS per-user ones
> (probably small corpus, but important to the user).
> * Any intention here among the main developers?  Open to new code?

Hmm -- that sounds like a very good idea.  Yes, I'd be open to that; that
would also allow a separate Graham-formula object to be used at the same
time.  Matt noted good results from the Graham method, too.
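
One hypothetical way such layering could work, sketched in Python (nothing
here is existing SpamAssassin code; the function name, the (spam, ham) count
tables and the user_weight knob are assumptions for illustration): look a
token up in both databases and let the per-user counts weigh more heavily
than the system-wide ones.

```python
def layered_prob(token, system_db, user_db, user_weight=5.0):
    """Spam probability for a token from two {token: (spam, ham)} tables.

    user_weight is an assumed tuning knob: each per-user observation counts
    as much as user_weight system-wide observations. Raw counts are used for
    brevity; a real implementation would normalise by corpus size.
    Returns None if neither layer has seen the token.
    """
    sys_spam, sys_ham = system_db.get(token, (0, 0))
    usr_spam, usr_ham = user_db.get(token, (0, 0))
    spam = sys_spam + user_weight * usr_spam
    ham = sys_ham + user_weight * usr_ham
    if spam + ham == 0:
        return None
    return spam / (spam + ham)


# System-wide corpus thinks "ietf" is spammy; this user receives it as ham.
system_db = {"ietf": (30, 10)}
user_db = {"ietf": (0, 20)}
p = layered_prob("ietf", system_db, user_db)
```

With these made-up counts the per-user ham evidence pulls the token's
probability well below the system-wide value of 0.75.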

> Regarding Bayes scoring, if Graham's formula becomes an option, you
> might as well have only two possible scores, as the output is extremely
> bimodal.  Robinson's formula spreads the probability distribution more
> widely.  Extrema of +-4.0 score seem really high, but I haven't seen
> any bad results.  You'd have to treat Robinson and Graham formulas as
> different tests altogether for GA score determination.

Yes -- Graham scores would have to be lower, IMO.
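
To see why the two formulas need separate scores, here is a small Python
sketch of both combining steps. It assumes the commonly described forms
(Graham's naive-Bayes product and Robinson's geometric-mean P/Q combination)
and is not taken from the SpamAssassin source; the function names and the
sample token probabilities are made up.

```python
import math


def graham_combine(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def robinson_combine(probs):
    """Robinson-style combining via geometric means, rescaled to 0..1."""
    n = len(probs)
    big_p = 1.0 - math.prod(1.0 - p for p in probs) ** (1.0 / n)
    big_q = 1.0 - math.prod(probs) ** (1.0 / n)
    s = (big_p - big_q) / (big_p + big_q)
    return (1.0 + s) / 2.0


# Four mildly spammy tokens (hypothetical values).
token_probs = [0.9, 0.8, 0.7, 0.6]
graham = graham_combine(token_probs)
robinson = robinson_combine(token_probs)
```

For this input the Graham result lands above 0.99 while the Robinson result
stays near 0.75, which is the bimodal versus spread-out behaviour described
above.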

Also I agree that for now, evolving the bayes scores with GA is probably
not a good idea, since they seem to skew in various directions based on
how much you've trained and what your corpus is like.

In my tests, the GA gives the Bayes rules quite low scores despite their
good results, which I haven't figured out yet.  I think the GA is not
letting high-confidence rules get high enough scores...

> P.S.  Food for thought.  I have noticed some recent spam including
> unusual random vocabulary words in the message -- the spammers must be
> watching and experimenting too?  A lot of this, and simple vocabulary
> based Bayes will weaken (especially if spammers get their extreme-words
> from a default distribution!) -- I think the next ply will include
> phrases or other more semantic features -- but there are space tradeoffs.

Yes, definitely; some mails have included a paragraph of text from nonspam
sources appended to the bottom.  This is definitely an attempt to
bayes-bust (to coin a phrase ;)

But note that Bayes-scanning some headers will pick up some very useful
spamsigns; in particular, spam relayed through your secondary MX is common,
but this is not common in ham [*], so it's a good spamsign which
SpamAssassin cannot detect heuristically (without configuration).  Also,
there are some spamsign headers that it catches, like so:

  0.994       58        1   H:Received:smtp.easydns.com
  0.995       43        0   H:Received:TIPUTIL2
  0.995       49        0   N:H:X-Info:NNNNNN
  0.994       73        2   N:H:X-MailingID:NNNNN

[*]: although it's not so good when your primary MX *is* actually unreachable
;).
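
Reading the columns in that listing as probability, spam count, ham count and
token is an assumption, as is the tokenisation sketched below in Python: each
header word becomes an "H:<header>:<word>" token, and words containing digits
also get an "N:" variant with every digit replaced by N, so that IDs which
differ only in their numbers share one token. The header_tokens function is
hypothetical, not SpamAssassin's own code.

```python
import re


def header_tokens(name, value):
    """Return Bayes-style tokens for one header (hypothetical scheme)."""
    tokens = []
    for word in value.split():
        # Always keep the literal token.
        tokens.append("H:%s:%s" % (name, word))
        if re.search(r"\d", word):
            # Add a digit-normalised variant so varying IDs collapse together.
            tokens.append("N:H:%s:%s" % (name, re.sub(r"\d", "N", word)))
    return tokens


mailing_id = header_tokens("X-MailingID", "48213")
relay = header_tokens("Received", "smtp.easydns.com")
```

Under this scheme two mails with different X-MailingID numbers of the same
length both yield N:H:X-MailingID:NNNNN, which is how a token like that could
accumulate dozens of spam hits.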

--j.

