On Mon, 2009-05-04 at 02:02 -0400, Micah Anderson wrote:
> Dave Walker <davewal...@ubuntu.com> writes:
> > Micah Anderson wrote:
> > > I got a phish message that was understood by bayes as:
> > >
> > > -2.6 BAYES_00    BODY: Bayesian spam probability is 0 to 1%
> > >                  [score: 0.0000]
> > >
> > > So I trained with spamc -L spam but even after that I am still getting
> > > BAYES_00. Shouldn't the training have bumped that score up?

Yeah, generally speaking. However, every now and then, I too see some
particular spam that resists any training. That is, later spam still
scores (regarding Bayes) way too neutral for my taste. ;)

> > To see what is really going on run "$ spamassassin -D <
> > /path/to/the/email > /dev/null", and see if you can learn anything as to
> > why it's not working as expected.
>
> Indeed, when I do this, I find these bayes related log entries:
>
> [13244] dbg: bayes: corpus size: nspam = 6798614, nham = 19136735
> [13244] dbg: bayes: tok_get_all: token count: 175
> [13244] dbg: bayes: score = 0

Use -D bayes rather than plain -D and check the entire output. I just
hope that your (Pg?)SQL BayesStore backend dumps the tokens just like the
DBM one does.

Since it is phish -- specifically targeted at your site? Any chance they
managed to use a lot of internal, really hammy looking tokens? Possibly
even originating from inside your network?

> This shows me that I have no idea what these magic things are :) Does
> this tell you anything useful?

> 0.000          0    6798614          0  non-token data: nspam
> 0.000          0   19136753          0  non-token data: nham

That's quite a lot of ham compared to the spam... Does that really
reflect your mail instream?

19 M hams learned and an SQL Bayes storage backend. Site wide. Do you
trust your users? Any chance some of them are training badly? At worst
even "rescuing" phish from the spam folder, falling for it, responding
with their credentials and *learning* the phish, believing it was
classified incorrectly?

The latter sure will explain why you got a hard time re-training *one*
sample as spam -- if you got it learned >1 times as ham...

-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
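
P.S. To illustrate why a single "learn as spam" may barely move tokens that
have already been seen many times in ham, here is a rough Python sketch. It
uses a Robinson-style smoothed token probability, which is only an
approximation of what a Bayes classifier like SpamAssassin's does and not
its actual code. The function name, the constants s and x, and the count of
500 ham hits are assumptions made up for the example. Only the corpus sizes
are taken from the debug output quoted above.

```python
def token_spam_probability(spam_count, ham_count, nspam, nham, s=1.0, x=0.5):
    """Robinson-style smoothed estimate of P(spam | token).

    s (strength of the prior) and x (prior probability for unknown tokens)
    are illustrative values, not SpamAssassin's real settings.
    """
    # Normalise the raw hit counts by corpus size, so a large ham corpus
    # does not dominate merely because it is large.
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    if spam_ratio + ham_ratio == 0:
        return x  # token never seen: fall back to the prior
    p = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    # Pull p towards the prior x; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


# Corpus sizes from the quoted "corpus size" debug line.
NSPAM = 6798614
NHAM = 19136735

# Hypothetical: an internal, hammy-looking token seen in 500 learned hams.
HAM_HITS = 500

before = token_spam_probability(0, HAM_HITS, NSPAM, NHAM)
# Learn one message containing the token as spam.
after = token_spam_probability(1, HAM_HITS, NSPAM + 1, NHAM)

print(f"before training: {before:.5f}")
print(f"after training:  {after:.5f}")
```

The probability does rise after the one spam training, but under these
assumed numbers it stays far below neutral. If most of the 175 tokens in
the message look like that, the combined score would be expected to stay
low, which is why the actual token dump from -D bayes is worth reading.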