Liudvikas Bukys said:

> Bug? The bayes code in 2.50 doesn't get invoked from spamd because there
> is no hook from handle_user to [re]open the bayes databases.
> I have to think this is an oversight, but I thought I'd better ask.
> * Should spamd do this?
This should now be fixed in CVS...

> The learn code is a bit slow and if the authors are open to code
> submissions I'll submit something to help. Having experimented heavily
> with Graham-inspired code, I have to say it's really fast, works
> really well (for me) at having tolerable false negatives but NO false
> positives (but then my corpora have grown to 10000 spam and 10000 ham),
> and it also fills two important gaps for me:
> - fixes the persistent false positives I get
>   (e.g. all ietf-announce traffic scores 6.6 consistently)

so 2.6 with BAYES_0* score added ;)

> - has a comprehensible end-user interface (just provide examples)
> * Conclusion: Bayes -- good.

Well, the code is currently optimised towards slow learning, but fast
checking. Actually I think the recompute_all_probabilities() step is
what takes so long, and there's a few places I call that implicitly --
like when the learner process finishes. They could be made explicit;
then the instructions to users would be

  - train on your ham collection
  - train on your spam
  - and run sa-rebuild-probs (or whatever)

> Regarding the future, I am hoping that where this is going is the
> ability to support multiple layers of bayes databases; a system-wide
> one (probably large corpus, hence large database), PLUS per-user ones
> (probably small corpus, but important to the user).
> * Any intention here among the main developers? Open to new code?

Hmm -- that sounds like a very good idea. Yes, I'd be open to that;
that would also allow a separate Graham-formula object to be used at
the same time. Matt noted good results from the Graham method, too.

> Regarding Bayes scoring, if Graham's formula becomes an option, you
> might as well have only two possible scores, as the output is extremely
> bimodal. Robinson's formula spreads the probability distribution more
> widely. Extrema of +-4.0 score seem really high, but I haven't seen
> any bad results.
> You'd have to treat Robinson and Graham formulas as
> different tests altogether for GA score determination.

Yes -- Graham scores would have to be lower, IMO. Also I agree that
for now, evolving the bayes scores with GA is probably not a good idea,
since they seem to skew in various directions based on how much you've
trained and what your corpus is like.

In my tests, the GA gives it quite low scores despite its good results,
which I haven't figured out yet. I think the GA is not letting
high-confidence rules get high enough scores...

> P.S. Food for thought. I have noticed some recent spam including
> unusual random vocabulary words in the message -- the spammers must be
> watching and experimenting too? A lot of this, and simple vocabulary
> based Bayes will weaken (especially if spammers get their extreme-words
> from a default distribution!) -- I think the next ply will include
> phrases or other more semantic features -- but there are space tradeoffs.

yes, definitely; some mails have included a paragraph of text from
nonspam sources appended to the bottom. This is definitely an attempt
to bayes-bust (to coin a phrase ;)

But note that Bayes-scanning some headers will pick up some very useful
spamsigns; in particular, spam relayed through your secondary MX is
common, but this is not common in ham [*], so it's a good spamsign which
SpamAssassin cannot detect heuristically (without configuration).

also there's some spamsign headers that it catches, like so:

  0.994  58  1  H:Received:smtp.easydns.com
  0.995  43  0  H:Received:TIPUTIL2
  0.995  49  0  N:H:X-Info:NNNNNN
  0.994  73  2  N:H:X-MailingID:NNNNN

[*]: although it's not so good when your primary MX *is* actually
unreachable ;).

--j.
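To make the "bimodal versus spread out" point concrete, here is a minimal sketch of the two combining rules as they are commonly described: Graham's naive-Bayes product and Robinson's geometric-mean form. The function names are made up for illustration, the per-token probabilities are invented, and nothing here is taken from the SpamAssassin source, so treat it as a rough comparison rather than what the 2.50 code does.

```python
import math

def graham_combine(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p)).

    Assumes every p is strictly between 0 and 1 (Graham clamps
    token probabilities for the same reason).
    """
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)

def robinson_combine(probs):
    """Robinson-style combining using geometric means, rescaled to 0..1."""
    n = len(probs)
    p = 1.0 - math.prod(1.0 - x for x in probs) ** (1.0 / n)
    q = 1.0 - math.prod(probs) ** (1.0 / n)
    s = (p - q) / (p + q)
    return (1.0 + s) / 2.0

# Fifteen tokens that each lean only mildly towards spam (made-up values).
mild = [0.6] * 15
print("graham:  ", graham_combine(mild))
print("robinson:", robinson_combine(mild))
```

With these mild inputs the Graham-style result comes out at about 0.998 while the Robinson-style result stays at about 0.6, which is the behaviour described above: the product form saturates towards 0 or 1 quickly, and the geometric-mean form keeps weak evidence looking weak.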
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
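The "train first, rebuild probabilities explicitly" workflow proposed in the message can be sketched roughly as below. This is a toy model under stated assumptions: the class name `ToyBayesStore`, the `rebuild_probs` method, and the simple per-token ratio are all hypothetical and stand in for whatever recompute_all_probabilities() really computes.

```python
from collections import Counter

class ToyBayesStore:
    """Hypothetical token store; not SpamAssassin's actual database layout."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.nspam = 0
        self.nham = 0
        self.probs = {}
        self.dirty = False  # True when counts changed since the last rebuild

    def learn(self, tokens, is_spam):
        # Cheap step: only bump counts, once per message per token.
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(set(tokens))
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        self.dirty = True

    def rebuild_probs(self):
        # Expensive step: one pass over every known token, run explicitly.
        # The ratio below is a simplification chosen for the sketch.
        self.probs = {}
        for tok in set(self.spam_counts) | set(self.ham_counts):
            s = self.spam_counts[tok] / max(self.nspam, 1)
            h = self.ham_counts[tok] / max(self.nham, 1)
            self.probs[tok] = s / (s + h)
        self.dirty = False

store = ToyBayesStore()
store.learn(["meeting", "agenda"], is_spam=False)  # train on ham
store.learn(["viagra", "free"], is_spam=True)      # train on spam
store.learn(["free", "agenda"], is_spam=True)
store.rebuild_probs()                              # the explicit rebuild step
```

The point of the split is that learning many messages costs only counter updates, and the full pass over the token table happens once at the end instead of after every learner run.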