On Tue, 2009-03-31 at 06:20 -0700, John Hardin wrote:
> On Tue, 31 Mar 2009, Lucio Chiappetti wrote:
>
> > users MAY forward spam which goes through to an area where a daily
> > crontab picks it up for sa-learn ... we've been happy with the entire
> > arrangement since a couple of years)

You should teach all your users to at least dump spam that slipped
through into the training spool. Simply shrugging it off and deleting
the messages is harmful, since it might even result in those messages
being *learned* as ham. That's almost guaranteed to bias your Bayes.

> > What looks suspicious to me is BAYES_00. Most other spam has BAYES_99.
>
> Indeed. That is likely the primary cause of your problem.
>
> > and a slightly variable text in (bad) italian (with spelling and grammar
> > errors) stating that "80% of the people in your [country|city|region...]
> > is unhappy with their monthly income" and offering a job for internet
> > advertising.
>
> That should be excellent bayes fodder. You're getting lots of them. Train 'em.
>
> > 0.000 0 31125 0 non-token data: nspam
> > 0.000 0 239162 0 non-token data: nham
> > 0.000 0 310271 0 non-token data: ntokens
> >
> > I'm not sure how to interpret those numbers.
>
> Your bayes is trained with a strong bias towards ham. It should be more
> the other way, since the raw volume of email is biased towards spam.

Agreed, I guess you should train more spam.

> > But then what is the best way to force bayes to "change its mind" from
> > 00 to 99 (or at least above 50) on this sort of spam, other than waiting
> > it catches up on the few user submissions (myself, I won't be doing
> > other submission since my procmail filter diverts them to /dev/null) ?

Bad attitude. :)  You are catching these en masse with the procmail
recipe. Don't discard them, but rather dump them into a dedicated
folder. Quick review, then submit for training.

Discarding them won't help. Not you, not anyone else, and not Bayes in
getting off that fugly low score. If you can identify them easily, why
not benefit from that?

Also, since you are able to write a procmail recipe for them, writing a
custom SA rule is just as easy. Score it a point or two...

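As a rough illustration of how lopsided the nspam/nham figures quoted
above are, here is a small Python sketch. It assumes the count sits in
the third column, as it does in the three quoted lines; the function
names are made up for this example and are not part of SpamAssassin.

```python
import re

# The three lines quoted above, copied verbatim.
DUMP = """\
0.000 0 31125 0 non-token data: nspam
0.000 0 239162 0 non-token data: nham
0.000 0 310271 0 non-token data: ntokens
"""

def parse_magic(text):
    """Return a dict such as {'nspam': 31125, ...} from the quoted lines.

    Assumption: the third whitespace-separated column holds the count.
    """
    counts = {}
    for line in text.splitlines():
        m = re.match(
            r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data:\s+(\w+)", line
        )
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts

def ham_per_spam(counts):
    """How many ham messages were learned for each spam message."""
    return counts["nham"] / counts["nspam"]

counts = parse_magic(DUMP)
print(f"ham learned per spam learned: {ham_per_spam(counts):.1f}")
```

With the numbers above this comes out at roughly 7.7 ham messages
learned for every spam message, which is the bias John is pointing at.
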
  guenther


-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}