On Jul 23, 2014, at 11:45 AM, Amir 'CG' Caspi <ceph...@3phase.com> wrote:

> On 2014-07-02 15:04, Amir Caspi wrote:
>> For what it's worth, I just received a spam that basically is the same
>> as what Philip complained about.  I've posted a spample here:
>> http://pastebin.com/Y2YGwL49
> [...]
>> I'm wondering if we shouldn't write a rule looking for lots of
>> &#x0[0-9]{3}; patterns... say, 500 of them in one email.  Or, would we
>> expect legitimate emails to have these?
> 
> So, to follow up on this... over the past couple of weeks I've been getting a 
> lot more false negatives (FNs) than normal, and almost every one of them is an 
> "encoded character" spam like the example above.  Bayes training does appear 
> to work, in that many of these FNs already hit BAYES_999... but not enough 
> other rules fire for them to cross the 5.0 threshold.  (Other, similar spams 
> do cross the threshold, usually due to RAZOR and/or PYZOR hits.)
> 
> Since these are basically Unicode character encodings, is there a move to 
> translate all charsets to UTF-8 (or some other fixed standard) before 
> applying body and/or URI rules?  That would, presumably, help with catching 
> these.
> 
> I'm considering writing a rule to catch &#x0[0-9]{3}; patterns (a sketch 
> follows the quoted message below).  I'm worried it could cause false 
> positives (FPs), but are there common circumstances where legitimate emails 
> would include dozens to hundreds of these?  (The latest FNs only include a 
> few dozen, not the hundreds seen in the spample above.)
> 
> Otherwise, I'm not sure what "template" rule I could write to catch these 
> things, and they're increasing in frequency (with more and more being missed 
> as FNs).
> 
> Thanks.
> 
> -- Amir
> 
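
For reference, a minimal sketch of the counting rule described above (rule 
names and score are hypothetical placeholders; tflags multiple maxhits=N 
needs SpamAssassin 3.3+, and any real score should be tuned against a 
ham/spam corpus first):

    # Match raw numeric character references like &#x0441; in body text.
    body      __MANY_HEX_ENTITIES   /&\#x0[0-9]{3};/
    # Count every occurrence, capped at 50 to bound matching cost.
    tflags    __MANY_HEX_ENTITIES   multiple maxhits=50
    # Fire only on dozens of hits, matching the FN counts described above.
    meta      MANY_HEX_ENTITIES     __MANY_HEX_ENTITIES > 40
    describe  MANY_HEX_ENTITIES     Dozens of numeric character entities in body
    score     MANY_HEX_ENTITIES     2.0

As for charset translation: the normalize_charset option, if enabled, 
converts decoded body text to UTF-8 before body rules run, which may bear on 
the question above.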


In text/plain with a Content-Transfer-Encoding (CTE) of ‘7bit’ or ‘8bit’ it’s 
meaningless to use Unicode HTML entity encodings.  The content is obviously 
not HTML.

If you want Unicode in text/plain, it should be in base64 or quoted-printable 
CTE.
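
A sketch keying on exactly that mismatch (hypothetical rule names; assumes 
the stock MIMEHeader plugin and reuses the __MANY_HEX_ENTITIES subrule from 
the sketch above; note the conditions are combined message-wide, so they are 
not guaranteed to hit the same MIME part):

    loadplugin Mail::SpamAssassin::Plugin::MIMEHeader

    # A MIME part declares a 7bit/8bit Content-Transfer-Encoding...
    mimeheader __CTE_7OR8BIT    Content-Transfer-Encoding =~ /^[78]bit/i
    # ...and is text/plain, where HTML entities have no business appearing.
    mimeheader __CT_TEXT_PLAIN  Content-Type =~ /^text\/plain/i
    meta      ENTS_IN_7BIT_PLAIN (__CT_TEXT_PLAIN && __CTE_7OR8BIT && __MANY_HEX_ENTITIES)
    describe  ENTS_IN_7BIT_PLAIN Numeric character entities in 7bit/8bit text/plain
    score     ENTS_IN_7BIT_PLAIN 1.5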

-Philip
