On Fri, 2007-12-07 at 23:36 +0100, Stefan Jakobs wrote:
> On Friday 07 December 2007 20:42, Karsten Bräckelmann wrote:
> > Let's assume, one of them happens to be Swedish. And even though the
> > entire communication is English, that ignorant bastard dares to have his
> > real name at the bottom of his mail -- which includes Swedish chars.
> >
> > Do you hear that flushing sound of catching spam?
>
> Do you mean: If I have one false positive I should throw my spam filter in a
> trash can?

No. And I am not talking about a single FP either. My point is that the
above approach is prone to hit hard on a lot of totally legitimate mail.

There is a *huge* difference between Cyrillic or even Chinese or Japanese
symbols -- and subtypes of Latin.

> > Swedish chars are a superset of English chars. As are German and many
> > others. To see that this is not an artificial, made up example please
> > have a look at my real name. :)
>
> Ok. My fault I mistook charsets with country codes. But replace se with ru or
> ch or greek7. The result is the same. You want one charset to be considered
> as "not ham" and you have to give the whole list to the parameter. And I
> think it is a long and ugly to read list (see:
> http://www.iana.org/assignments/character-sets)

Yes, that list indeed is ugly. However, that is *not* what we are talking
about. The list of valid locales for ok_locales can be found in the docs --
and totals 6, including en...

> I only want to say that there can be a situation in which you only know that
> you don't want to consider the XXX charset as an indicator for ham.

Despite its name, ok_locales is *not* about certain charsets being "an
indicator for ham". The opposite is true. It does not assign a negative
score. All it does is assign a positive score for charsets "not in the ok
list".

> > Anyway, this whole example is non-realistic as is. As Matt pointed out
> > in a later post, we are talking character sets here, not languages. In
> > the world of ok_locales, there is no distinction between en and se,
> > which is just en to ok_locales...
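
To make the ok_locales point concrete, here is a minimal local.cf sketch.
It assumes a site that expects only Western (Latin) charsets. The
commented-out score line is purely illustrative and not a recommendation.

```
# Sketch of a local.cf fragment (assumption: only Latin-charset mail is expected).
# Charsets outside the listed locales become eligible for the CHARSET_FARAWAY
# family of rules (positive score). Listed locales get no negative score.
ok_locales en

# Optional, illustrative only: disable the rule if it hits too much
# legitimate mail at your site.
# score CHARSET_FARAWAY 0
```

Note that there is no "se" or "de" here. As far as ok_locales is
concerned, those are simply covered by en.
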
>
> As I say I got confused with it (and be it maybe still).
> Other question: How does Spamassassin know which charset it should use.
> Provides it a list of all charsets and compares or does it try it to find the
> information in the header of the mail or ...?

Unfortunately, I don't know either. Although I'd like to...

As per my counter example above, I do not want CHARSET_FARAWAY and friends
to score on mail, just because a fellow hacker happens to have his original
name in his sig or From: header. And it probably doesn't come as a surprise,
that the example actually is real life. ;)

Maybe the devs can briefly explain how the charset is being determined. Or
at least, where exactly in the code one could find it...

  guenther  - who is too lazy to dig through all the code right now :)


-- 
char *t="[EMAIL PROTECTED]";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}