On Thu, 2009-03-26 at 17:22 -0700, Kenneth Porter wrote:
> I'd like to score anything in Windows-1251 fairly high, as I don't expect
> to get anything legitimate in that charset. How can I read the charset
> declared in a Subject header, or in a MIME part, for matching in a rule?

  ok_locales en  # all Western char sets in general

> The only tools I see are ok_locales and CHARSET_FARAWAY, but those seem
> like heavy hammers as they blacklist everything and then require me to
> whitelist what I want. I'd rather the reverse: let me list which codepages
> to reject. There aren't many.

Can you read any but the western ones? If so, add them. Oh, and yes,
western includes all the language-specific characters, like German,
French, Finnish, etc.

> I tried this rule but it's not firing and I'm not sure why:
>
> describe KP_CYRILLIC Cyrillic code page
> header KP_CYRILLIC Subject =~ /Windows-1251/
> score KP_CYRILLIC 0.1

Even with the :raw suffix, this will NOT trigger on the encoding only, but
ALSO when the Subject is merely talking about code pages...

-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
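
For reference, a minimal sketch of how the quoted rule might be narrowed so
that it matches the charset declaration rather than any mention of the code
page. By default a header rule sees the decoded Subject, where the charset
label no longer appears; the :raw suffix matches the undecoded header. The
pattern below is anchored to the RFC 2047 encoded-word syntax
(=?charset?B?... or =?charset?Q?...). The rule names KP_CYRILLIC_RAW and
KP_CYRILLIC_MIME are hypothetical, the scores are placeholders for testing,
and the second rule assumes the MIMEHeader plugin is loaded. Untested;
check it against your own corpus before raising the scores.

```
# Assumption: match the charset label inside an encoded-word in the raw Subject
header   KP_CYRILLIC_RAW   Subject:raw =~ /=\?windows-1251\?[bq]\?/i
describe KP_CYRILLIC_RAW   Subject encoded as Windows-1251
score    KP_CYRILLIC_RAW   0.1

# Assumption: Mail::SpamAssassin::Plugin::MIMEHeader is loaded
mimeheader KP_CYRILLIC_MIME  Content-Type =~ /charset="?windows-1251/i
describe   KP_CYRILLIC_MIME  MIME part declared as Windows-1251
score      KP_CYRILLIC_MIME  0.1
```

A Subject that only discusses "Windows-1251" in plain text does not contain
the =?...?= wrapper, so the first rule should not fire on it.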