On Fri, 2011-01-28 at 18:10 +0000, Dominic Benson wrote:
> Recently, in order to balance the ham/spam ratio given to sa-learn, I
> have started to pass mail submitted by authenticated users to sa-learn
> --ham.
> The thinking here is that users would generally want to receive mail
> that they send, and many messages will either be replies or replied to,
> so this is likely to have a fair amount in common with legitimate mail
> coming in.
> The existing bayes training was from auto-learn, on 60k ham and 360k
> spam; since starting to do this, nearly twice as much ham as spam has
> been learned.
>
> I haven't seen any mention of this strategy on-list or on the web, so
> I'm interested in whether (a) anyone else does this, and (b) is there a
> good reason not to do it that I haven't thought of?
This topic does come up occasionally. It has been discussed, and some caveats to be aware of have been mentioned -- but IIRC no one ever came back to report any substantial change, or whether it worked for them.

Besides some good points (err, caveats) already raised...

Given your numbers of ham and spam, you seem to be under the impression that the ratio should be 1:1 for best results. While there are no hard numbers I know of, that most likely is not the best ratio to aim for -- though I do see how the docs might imply that.

The commonly advised training ratio is *not* 1:1 spam to ham, but rather to keep both numbers in the neighborhood of your actual in-stream. If you receive 10 times more spam than ham, this means you can (and probably should) learn more spam. Personally, I have seen ratios of 50:1 or even higher that just work perfectly.

Why is that? Probably because there is a rather limited set of hammy tokens, but a far larger set of spammy tokens. The latter grows rapidly when obfuscation techniques enter the scene. Also, spam changes *much* faster over time than ham; ham can be assumed to be almost static over a couple of years.

In other words, there likely is nothing wrong with your initial spam-to-ham ratio of 6:1, or even 12:1 later. Unless you notice a significant rise in your ham's Bayes scores, there's probably no need for such a proactive countermeasure. I guess in most cases it won't hurt -- but in most cases it isn't worth the effort, either.

Since you mentioned replies -- sure, true, they most likely are ham. ;)  However, odds are the relevant tokens already score hammy, so the additional training on sent mail is unlikely to make much of a difference. And most of the replies themselves are likely to be auto-learned as ham anyway, no?

Hmm, this turned out longer than the "additional note" I anticipated...

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
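P.S. If you want to see where your counts actually stand before (or after) changing the training mix, sa-learn can dump the Bayes database totals. A minimal sketch -- the mbox path is just a placeholder, and this assumes you run it as the user owning the Bayes db:

  # current nspam / nham totals straight from the Bayes db
  sa-learn --dump magic | grep -E 'nspam|nham'

  # feed a mailbox of known-good (sent) mail as ham, mbox format assumed
  sa-learn --ham --mbox /path/to/sent-mail.mbox

Comparing the two totals over time tells you whether the extra ham training is actually shifting your ratio.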