On Fri, 2013-09-20 at 14:20 -0400, Kevin A. McGrail wrote:
> > > Anyone have some examples of rules designed to catch words by content in
> > > UTF-8 encoded messages?  I'm doing some work on improving this.

> Right now, I'm just having problems with really putting a nail in the 
> coffin of spams using UTF8 from and Subjects.

Using UTF-8 encoded headers (or body) is absolutely no sign of spam
whatsoever. Have a look at this mail's headers. I know you know, but
your wording was just unfortunate.


> From: "=?utf-8?B?RNGWcmVjdCDOknV5?=" <wholes...@wholesalefirst-munged.co>
> Subject: =?utf-8?B?VG9wIM6ScmFuZHMgQXQgV2hvbGVzYWxlIM6hctGWY9GWbmc=?=

What exactly is your problem? These match your sample.

  header FOO_FROM  From =~ /Dіrect Βuy/
  header FOO_SUBJ  Subject =~ /Top Βrands At Wholesale Ρrіcіng/

This one, though, doesn't.

  header BAR_FROM  From =~ /Direct Buy/

Confused yet? The From header rules look identical, you say?

Indeed, they do. Look identical. They aren't. The patterns are UTF-8
encoded. The latter one I typed in manually based on "what I see"; the
first set of patterns is straight *copied* from a UTF-8 capable MUA.
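
To reproduce that outside of SA, here is a minimal Python sketch. It
decodes the display name of the sample From header and tries both
patterns against it. The helper name decode_words is made up for this
example, and Python's re module only stands in for SA's rule matching,
it is not how SA evaluates header rules.

```python
import re
from email.header import decode_header, make_header

# Encoded-word copied verbatim from the sample From header quoted above.
ENCODED_FROM = "=?utf-8?B?RNGWcmVjdCDOknV5?="


def decode_words(value):
    """Decode RFC 2047 encoded-words into a Unicode string (hypothetical helper)."""
    return str(make_header(decode_header(value)))


decoded_from = decode_words(ENCODED_FROM)

# Pattern typed by hand from "what I see": pure ASCII.
typed = re.compile("Direct Buy")

# Pattern as copied from the MUA. The escapes only make the look-alike
# characters visible: U+0456 (Cyrillic i) and U+0392 (Greek Beta).
copied = re.compile("D\u0456rect \u0392uy")

print(bool(typed.search(decoded_from)))   # False
print(bool(copied.search(decoded_from)))  # True
```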

A hex dump visualizes the differences.

  00000000   44 D1 96 72  65 63 74 20  CE 92 75 79               D..rect ..uy

  00000000   54 6F 70 20  CE 92 72 61  6E 64 73 20  41 74 20 57  Top ..rands At W
  00000010   68 6F 6C 65  73 61 6C 65  20 CE A1 72  D1 96 63 D1  holesale ..r..c.
  00000020   96 6E 67                                            .ng

So yeah, the strings contain UTF-8 encoded chars that are not part of
ASCII, but look identical to the ASCII chars the reader takes them for.

  D196  CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I (U+0456)
  CE92  GREEK CAPITAL LETTER BETA (U+0392)
  CEA1  GREEK CAPITAL LETTER RHO (U+03A1)
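
If you don't have a hex editor at hand, the same table can be produced
with a few lines of Python. This is just a sketch; describe_non_ascii
is a name made up for this example, and the sample string is the
decoded Subject with the look-alikes written as escapes.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (utf8_hex, code_point, name) for each distinct non-ASCII char."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return [
        (ch.encode("utf-8").hex().upper(), f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in seen
    ]


# Decoded Subject of the sample, look-alike chars given as escapes.
sample = "Top \u0392rands At Wholesale \u03a1r\u0456c\u0456ng"

for utf8_hex, code_point, name in describe_non_ascii(sample):
    print(utf8_hex, code_point, name)
```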


That analysis done -- again, what exactly is your problem?

Matching a fixed string? Be sure to copy the UTF-8 encoded non-ASCII
chars, rather than typing in visually equivalent chars. SA can handle
UTF-8 strings in rules at least since SA 3.2 on Perl 5.8.x.

Matching specific words with either ASCII or non-ASCII chars? Hardcoded
custom rules, or better, Mail::SpamAssassin::Plugin::ReplaceTags rules.
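
The idea behind such rules is to let each letter stand for a small
class of look-alikes. A rough Python sketch of that idea follows. This
is not ReplaceTags syntax, the LOOKALIKES table is a tiny illustrative
assumption and nowhere near complete, and confusable_pattern is a
made-up helper name.

```python
import re

# Illustrative, incomplete table: ASCII letter -> itself plus look-alikes.
LOOKALIKES = {
    "B": "B\u0392\u0412",  # Latin B, Greek Beta, Cyrillic Ve
    "P": "P\u03a1\u0420",  # Latin P, Greek Rho, Cyrillic Er
    "i": "i\u0456",        # Latin i, Cyrillic Byelorussian-Ukrainian i
}


def confusable_pattern(word):
    """Build a regex matching word with any listed look-alike substituted."""
    parts = []
    for ch in word:
        alternatives = LOOKALIKES.get(ch)
        if alternatives:
            parts.append("[" + alternatives + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


pattern = confusable_pattern("Direct Buy")

print(bool(pattern.search("Direct Buy")))              # True
print(bool(pattern.search("D\u0456rect \u0392uy")))    # True
print(bool(pattern.search("Direct Guy")))              # False
```

One pattern then covers both the plain ASCII spelling and the obfuscated
one, instead of hardcoding every variant by hand.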

Matching *any* UTF-8 non-ASCII char? Not a good idea. (I know you never
would, just for completeness in this post.)


> As of yet, I'm not using normalize_charset and researching what hits 
> things the best.  Most of these still look REALLY spammy from a pathway 
> analysis though.

Never used normalize_charset myself. But from a glimpse at the docs,

  "detect character sets and normalize message content to Unicode."

it appears that option would only make sense for non-ASCII content that
is NOT UTF-8 encoded, so that UTF-8 encoded rules can match it.
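
To illustrate that reading (and only that, I have not checked how SA
implements the option), here is a Python sketch. A UTF-8 encoded
pattern does not match the same word in KOI8-R bytes, but does once the
content has been transcoded. The function to_utf8 and the sample word
are assumptions made up for this example.

```python
def to_utf8(raw, declared_charset):
    """Transcode raw bytes from the declared charset to UTF-8 (hypothetical helper)."""
    return raw.decode(declared_charset, errors="replace").encode("utf-8")


# A Cyrillic sample word, written with escapes to keep the source ASCII.
word = "\u0446\u0435\u043d\u0430"

rule_bytes = word.encode("utf-8")   # what a UTF-8 encoded rule would contain
raw_body = word.encode("koi8_r")    # the same word as sent in a KOI8-R message

print(rule_bytes in raw_body)                     # False
print(rule_bytes in to_utf8(raw_body, "koi8_r"))  # True
```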


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
