On Fri, 2007-12-07 at 23:36 +0100, Stefan Jakobs wrote:
> On Friday 07 December 2007 20:42, Karsten Bräckelmann wrote:

> > Let's assume, one of them happens to be Swedish. And even though the
> > entire communication is English, that ignorant bastard dares to have his
> > real name at the bottom of his mail -- which includes Swedish chars.
> >
> > Do you hear that flushing sound of catching spam?
> 
> Do you mean: If I have one false positive I should throw my spam filter in a 
> trash can?

No. And I am not talking about a single FP either.

My point is that the above approach is prone to hit hard on a lot of
totally legitimate mail. There is a *huge* difference between Cyrillic
or even Chinese or Japanese symbols -- and subtypes of Latin.


> > Swedish chars are a superset of English chars. As are German and many
> > others. To see that this is not an artificial, made up example please
> > have a look at my real name. :)
> 
> Ok. My fault I mistook charsets with country codes. But replace se with ru or 
> ch or greek7. The result is the same. You want one charset to be considered 
> as "not ham" and you have to give the whole list to the parameter. And I 
> think it is a long and ugly to read list (see: 
> http://www.iana.org/assignments/character-sets)

Yes, that list indeed is ugly. However, that is *not* what we are
talking about. The list of valid locales for ok_locales can be found in
the docs -- and totals 6, including en...


> I only want to say that there can be a situation in which you only know that 
> you don't want to consider the XXX charset as an indicator for ham.

Despite its name, ok_locales is *not* about certain charsets being "an
indicator for ham". The opposite is true. It does not assign a negative
score. All it does is assign a positive score for charsets "not in
the ok list".
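
To make the direction of the scoring concrete, here is a minimal Python
sketch of the behaviour described above. It is not SpamAssassin's code:
the charset-to-locale table, the function name locale_penalty and the
score value are all hypothetical, chosen only for illustration.

```python
# Hypothetical model of the ok_locales behaviour described above.
# NOT SpamAssassin's implementation; the table and score are made up.

# Assumed mapping from a declared charset to a coarse locale group.
CHARSET_LOCALE = {
    "us-ascii": "en",
    "iso-8859-1": "en",   # Western European; covers Swedish and German chars
    "iso-8859-15": "en",
    "koi8-r": "ru",
    "iso-2022-jp": "ja",
    "euc-kr": "ko",
    "big5": "zh",
    "tis-620": "th",
}

FARAWAY_SCORE = 3.0  # illustrative value only


def locale_penalty(charset, ok_locales):
    """Return a positive score if the charset's locale is not in ok_locales.

    The result is never negative: a charset in the ok list earns nothing,
    it merely avoids the penalty.
    """
    locale = CHARSET_LOCALE.get(charset.lower())
    # Assumption: charsets missing from the table are simply not scored.
    if locale is None or locale in ok_locales:
        return 0.0
    return FARAWAY_SCORE
```

In this model, with only en in the ok list, a Swedish sender writing in
iso-8859-1 is not penalised, while a koi8-r message is. Nothing ever
gets a bonus for being in the list.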


> > Anyway, this whole example is non-realistic as is. As Matt pointed out
> > in a later post, we are talking character sets here, not languages. In
> > the world of ok_locales, there is no distinction between en and se,
> > which is just en to ok_locales...
> 
> As I say I got confused with it (and be it maybe still).

> Other question: How does Spamassassin know which charset it should use. 
> Provides it a list of all charsets and compares or does it try it to find the 
> information in the header of the mail or ...?

Unfortunately, I don't know either. Although I'd like to...
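
What can be said without reading the source is where a message
*declares* its charset: in the charset parameter of Content-Type, and
inside RFC 2047 encoded words in headers. The Python sketch below only
reads those declarations from a made-up sample message. Whether
SpamAssassin relies on them, and to what extent, is exactly the open
question here, so treat this as an illustration and nothing more.

```python
# Illustration only: where a message *declares* its charset.
# Makes no claim about how SpamAssassin actually determines it.
from email import message_from_string
from email.header import decode_header

# Made-up sample message; the name and address are hypothetical.
RAW = (
    "From: =?ISO-8859-1?Q?Bj=F6rn?= <sender@example.invalid>\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="iso-8859-1"\n'
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "body\n"
)

msg = message_from_string(RAW)

# Charset declared for the body, taken from the Content-Type header.
body_charset = msg.get_content_charset()

# Charsets declared inside RFC 2047 encoded words of the From: header.
# Plain (unencoded) parts carry no charset and are skipped.
header_charsets = [cs for _, cs in decode_header(msg["From"]) if cs]
```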

As per my counterexample above, I do not want CHARSET_FARAWAY and
friends to score on mail just because a fellow hacker happens to have
his original name in his sig or From: header. And it probably doesn't
come as a surprise that the example is actually from real life. ;)


Maybe the devs can briefly explain how the charset is being determined.
Or at least, where exactly in the code one could find it...

  guenther  - who is too lazy to dig through all the code right now :)


-- 
char *t="[EMAIL PROTECTED]";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
