Re: Blacklisting Cyrillic

Karsten Bräckelmann Thu, 26 Mar 2009 18:30:36 -0700

On Thu, 2009-03-26 at 17:22 -0700, Kenneth Porter wrote:
> I'd like to score anything in Windows-1251 fairly high, as I don't expect 
> to get anything legitimate in that charset. How can I read the charset 
> declared in a Subject header, or in a MIME part, for matching in a rule?


ok_locales en    # all Western char sets in general

> The only tools I see are ok_locales and CHARSET_FARAWAY, but those seem 
> like heavy hammers as they blacklist everything and then require me to 
> whitelist what I want. I'd rather the reverse: let me list which codepages 
> to reject.

There aren't many. Can you read any but the western ones? Then add them.
Oh, and yes, western includes all that language-specific stuff like
German, French, Finnish, etc. chars.
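
A sketch of what adding one looks like. As far as I recall from the
Mail::SpamAssassin::Conf documentation, the accepted codes are en, ja, ko,
ru, th and zh (plus "all"); treat that list as an assumption and check it
against your installed version.

# Western plus Japanese. Any other charset becomes eligible for the
# CHARSET_FARAWAY* rules.
ok_locales en ja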

> I tried this rule but it's not firing and I'm not sure why:
> 
> describe KP_CYRILLIC Cyrillic code page
> header   KP_CYRILLIC Subject =~ /Windows-1251/
> score    KP_CYRILLIC 0.1

Even with the :raw suffix, this will NOT trigger on the encoding only,
but ALSO when talking about code pages...
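
A minimal sketch of a narrower variant, untested and with hypothetical rule
names: anchor the match on the RFC 2047 encoded-word syntax (=?charset?...)
in the raw header rather than on the bare string, and use mimeheader for the
charset declared on MIME parts. This assumes the MIMEHeader plugin is
loaded, which I believe is the default in recent versions.

# Subject: look at the raw, undecoded header and require the "=?" prefix,
# so a plain-text mention of "Windows-1251" does not match.
describe KP_CYRILLIC_SUBJ Subject encoded in Windows-1251
header   KP_CYRILLIC_SUBJ Subject:raw =~ /=\?windows-1251\?/i
score    KP_CYRILLIC_SUBJ 0.1

# MIME parts: match the charset parameter of each part's Content-Type.
describe KP_CYRILLIC_PART MIME part declared as Windows-1251
mimeheader KP_CYRILLIC_PART Content-Type =~ /charset\s*=\s*"?windows-1251/i
score    KP_CYRILLIC_PART 0.1

The /i matters, since charset names are case-insensitive and show up in
both "Windows-1251" and "windows-1251" spellings.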


-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
