On Fri, 2009-05-15 at 19:27 +0100, Jeremy Morton wrote:
> Karsten Bräckelmann wrote:
> > Backscatter. These types of arbitrarily phrased "I changed my email
> > address" auto-responses are pretty much impossible to catch.
>
> I feared as much.
>
> > Since BAYES_00 is a strong sign for ham, I would have at least given it
> > a low negative score, not positive. This is particular important, since
> > you severely lowered the required_score to a mere 3.0.
>
> It may be a strong sign for ham, but it's also giving way too much
> credit to a lot of spam (or at least unwanted backscatter) I'm getting
> that would otherwise be rejected. I'll move it to -0.1 but I don't want
> it being a strong indicator of ham.

On the topic of backscatter, generally, not specific to this sample but
the bulk of non-delivery notices and stuff: You are using vbounce? You
did read those files, in particular, how to handle them?

It is strongly recommended to filter vbounce-identified backscatter into
a dedicated folder, rather than raising the score in the hope of treating
'em as spam.

> > Moreover, this is not spam. Thus I recommend you pretty much ignore the
> > Bayes score here. Don't change the rule's score based on backscatter,
> > but ham and spam hits, if need be.

Let me stress this point. Do NOT customize your Bayes scores based on
backscatter, but exclusively on ham and spam.

> It's unwanted e-mail, so it's pretty close to spam in my book. Just
> because it's some moron who bounced a message instead of someone
> explicitly spamming me doesn't make it much better.

So is malware spreading through email. Yet it isn't spam, however close.
And there are better tools, specifically designed to catch it.

> > Your Bayes *might* be skewed. Hard to tell from that sample. Do you
> > train it, manually? Would Spanish be a language you do get in ham?
>
> No and no.

Please do not complain about bad Bayes results, if you don't take care
of it and train it properly.
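
As a rough sketch only, the configuration points raised so far might look
like this in a local.cf. It assumes the VBounce plugin is loaded; the relay
hostname is a placeholder, and the values are illustrations taken from this
thread, not a tuned recommendation.

```
# local.cf sketch (illustrative; adjust to your own setup)

# Stock default threshold. The 3.0 mentioned above is far stricter.
required_score 5.0

# Keep BAYES_00 on the negative side. -0.1 is the value proposed in the
# quoted reply; it is not a recommended score.
score BAYES_00 -0.1

# VBounce: name the relays that legitimately send your outgoing mail, so
# genuine bounces to mail you really sent are not flagged as backscatter.
# "mail.example.org" is a hypothetical hostname.
whitelist_bounce_relays mail.example.org
```

Filing the backscatter into its own folder then happens in the delivery
agent, by matching the ANY_BOUNCE_MESSAGE rule name in the X-Spam-Status
header, and manual Bayes training is done with sa-learn --ham and
sa-learn --spam on sorted mail.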
:)

> But all the character glyphs in the message could be used in
> English or French which I might get ham messages in, so it can't be
> ruled out on those grounds.

We're talking Bayes here, so we're talking tokens. Not chars. Think of
them as words.

> > Also, again -- you are suffering from your catch-all! See my previous
> > post (in one of your various threads) for some thoughts regarding this.
>
> Yeah, I'm the greatest lamenter of my decision to catch-all years ago,
> but there's really no realistic way I'm gonna be able to go back on
> that. I've probably registered with various sites using over 100
> usernames now. I'm just gonna have to live with that.

Well, I've given hints for custom rules, to catch the catch-all bulk not
possibly used by you. Take it or leave it. *shrug*

I just noted that all your recent samples are backscatter to never ever
used addresses, consisting of arbitrary auto-response strings in various
foreign languages.

IMHO, frankly, it appears to me you are trying to use SA to combat a
design-inherent problem that would be better solved on a much lower level.

-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}