On Mon, 2009-05-04 at 02:02 -0400, Micah Anderson wrote:
> Dave Walker <davewal...@ubuntu.com> writes:
> > Micah Anderson wrote:
> > > I got a phish message that was understood by bayes as:
> > >
> > > -2.6 BAYES_00               BODY: Bayesian spam probability is 0 to 1%
> > >                             [score: 0.0000]
> > >
> > > So I trained with spamc -L spam but even after that I am still getting
> > > BAYES_00. Shouldn't the training have bumped that score up?

Yeah, generally speaking. However, every now and then, I too see some
particular spam that resists any training. That is, later spam still
scores (regarding Bayes) way too neutral for my taste. ;)


> > To see what is really going on run "$ spamassassin -D <
> > /path/to/the/email > /dev/null", and see if you can learn anything as to
> > why it's not working as expected.
> 
> Indeed, when I do this, I find these bayes related log entries:
> 
> [13244] dbg: bayes: corpus size: nspam = 6798614, nham = 19136735
> [13244] dbg: bayes: tok_get_all: token count: 175
> [13244] dbg: bayes: score = 0

Use -D bayes rather than plain -D and check the entire output. I just
hope that your (Pg?)SQL BayesStore backend dumps the tokens just like
the DBM one does.

Since it is phish -- specifically targeted at your site? Any chance they
managed to use a lot of internal, really hammy looking tokens? Possibly
even originating from inside your network?
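
To illustrate the hammy-token point, here is a minimal sketch. It assumes a
simplified Graham-style per-token estimate and is not SpamAssassin's actual
code, which additionally smooths the estimates and combines many tokens. The
corpus sizes are the ones from the dump quoted in this thread; the token
counts (5000 ham sightings) and the name token_prob are made up for
illustration.

```python
# Simplified per-token spam probability (Graham-style). Illustration only,
# NOT SpamAssassin's actual implementation.

NSPAM = 6798614   # spam messages learned, from the dump quoted in this thread
NHAM = 19136753   # ham messages learned, from the same dump


def token_prob(spam_hits, ham_hits, nspam=NSPAM, nham=NHAM):
    """Estimate P(spam | token) from per-corpus token frequencies."""
    spam_ratio = spam_hits / nspam
    ham_ratio = ham_hits / nham
    if spam_ratio + ham_ratio == 0:
        return 0.5  # token never seen: treat as neutral
    return spam_ratio / (spam_ratio + ham_ratio)


# Hypothetical internal token seen in 5000 hams and never in spam.
before = token_prob(0, 5000)
# The same token after learning the phish once as spam.
after = token_prob(1, 5000)

print(f"before: {before:.6f}  after: {after:.6f}")
```

Under these assumptions the token stays far below 1% even after the spam
learn, so a message made up mostly of such tokens would keep hitting
BAYES_00.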


> This shows me that I have no idea what these magic things are :) Does
> this tell you anything useful? 

> 0.000          0    6798614          0  non-token data: nspam
> 0.000          0   19136753          0  non-token data: nham

That's quite a lot of ham compared to the spam... Does that really
reflect your incoming mail stream?


19 M hams learned and an SQL Bayes storage backend. Site wide. Do you
trust your users? Any chance some of them are training badly? At worst
even "rescuing" phish from the spam folder, falling for it, responding
with their credentials and *learning* the phish, believing it was
classified incorrectly?

The latter would certainly explain why you are having a hard time
re-training *one* sample as spam -- if it got learned more than once as ham...
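
A rough sketch of that arithmetic, assuming a plain frequency-ratio model in
which a token leans spammy once its spam frequency exceeds its ham frequency.
This is not SpamAssassin's real code, and it ignores that the corpus sizes
grow with every learn. The helper name spam_learns_needed and the sample
counts are hypothetical; the corpus sizes come from the dump quoted above.

```python
import math

NSPAM = 6798614   # spam messages learned, from the dump quoted in this thread
NHAM = 19136753   # ham messages learned, from the same dump


def spam_learns_needed(ham_hits, nspam=NSPAM, nham=NHAM):
    """Smallest spam count s such that s/nspam > ham_hits/nham."""
    return math.floor(ham_hits * nspam / nham) + 1


# How many spam learns a token needs, given how often it was learned as ham.
for ham_hits in (1, 3, 10, 100):
    print(ham_hits, spam_learns_needed(ham_hits))
```

With these numbers a single spam learn only outweighs a single ham learn;
tokens learned as ham three or more times need more than one spam learn.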


-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
