On Sat, 2007-12-08 at 02:05 +0100, Stefan Jakobs wrote:
> On Saturday 08 December 2007 01:15, Karsten Bräckelmann wrote:
> > > Ok. My fault, I mistook charsets for country codes. But replace se with
> > > ru or ch or greek7. The result is the same. You want one charset to be
> > > considered as "not ham" and you have to give the whole list to the
> > > parameter. And I think it is a long list that is ugly to read (see:
> > > http://www.iana.org/assignments/character-sets)
> >
> > Yes, that list indeed is ugly. However, that is *not* what we are
> > talking about. The list of valid locales for ok_locales can be found in
> > the docs -- and totals 6, including en...
>
> Only 6? Yes, I found it in the docs. (Yeah, I know: RTFM before you ask
> around). I apologize, with only 6 charsets it is not useful to have a
> not_ok_locales option.

You just looked at the wrong docs... ;)

Basically, the coarse distinction ok_locales boils down to from a user's
point of view is "can I decipher that?". As in, I don't speak Chinese,
and I have a hard time telling Chinese apart from Japanese. I don't speak
Swedish either, but I do recognize the symbols. And with some luck, I'll
even understand a couple of words... [1]

> > > I only want to say that there can be a situation in which you only know
> > > that you don't want to consider the XXX charset as an indicator for ham.
> >
> > Despite its name, ok_locales is *not* about certain charsets being "an
> > indicator for ham". The opposite is true. It does not assign a negative
> > score. All it does is assign a positive score for charsets "not in
> > the ok list".
>
> Maybe I should have said: "an indicator for NOT spam" ? Sh.., there are too
> many double negations and I'm too tired for that.

not spam == ham

Do you actually mean "not an indicator for ham/spam/anything"? 'Cause
that's what ok_locales is -- whatever is in that list is treated as
neutral, taken as an indicator for neither ham nor spam. Anything that
is *not* in that list, however, is an indicator for spam.

It's a rather twisted logic.
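
To make the neutral-versus-penalized distinction concrete, here is a minimal sketch of that logic. It is not SpamAssassin's actual code: the function name locale_score, the penalty value, and the idea of handing in an already-determined locale are all assumptions made purely for illustration. How the locale is derived from the charset is deliberately left out.

```python
# Illustrative only: NOT SpamAssassin's implementation.
# locale_score, penalty and the pre-determined message_locale
# are hypothetical names chosen for this sketch.

def locale_score(message_locale, ok_locales, penalty=1.0):
    """Return the score contribution of a message's locale.

    Locales in the ok list are neutral (0.0): they count neither
    towards ham nor towards spam. Anything else adds a positive
    (spam-leaning) score. No negative score is ever returned.
    """
    if message_locale in ok_locales:
        return 0.0
    return penalty


ok = {"en"}
print(locale_score("en", ok))  # 0.0, listed locale is neutral
print(locale_score("ru", ok))  # 1.0, unlisted locale is penalized
```

Note that nothing in this sketch ever rewards a message, which is why the option cannot serve as a whitelist.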
You don't define what's good or bad (that again would be a
black/whitelist), you leave out what's bad...

> > Maybe the devs can briefly explain how the charset is being determined.
> > Or at least, where exactly in the code one could find it...

Matt, also, I have a feeling that this logic is what the OP is actually
after. He does not want to leave out what he wants to be scored on, but
to (positively) define it.

  guenther


[1] As someone who has dealt extensively with user-filed bug reports in
    bugzilla, I know there is a chance to grok the general topic even
    if you don't know the language.

-- 
char *t="[EMAIL PROTECTED]";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}