On Jul 23, 2014, at 11:45 AM, Amir 'CG' Caspi <ceph...@3phase.com> wrote:
> On 2014-07-02 15:04, Amir Caspi wrote:
>> For what it's worth, I just received a spam that basically is the same
>> as what Philip complained about. I've posted a spample here:
>> http://pastebin.com/Y2YGwL49
> [...]
>> I'm wondering if we shouldn't write a rule looking for lots of
>> &#[0-9]{3}; patterns... say, 500 of them in one email. Or, would we
>> expect legitimate emails to have these?
>
> So, to follow up on this... over the past couple of weeks I've been getting a
> lot more FNs than normal, and almost every single one of these is an "encoded
> character" spam like the example above. Bayes training does appear to work,
> in that many of these FNs are already at BAYES_999... but there aren't enough
> other rules hit to cause the FNs to cross the 5.0 threshold. (Other, similar
> spams do cross the threshold, usually due to RAZOR and/or PYZOR hits.)
>
> Since these are basically Unicode character encodings, is there a move to
> translate all charsets to UTF-8 (or some other fixed standard) before
> applying body and/or URI rules? That would, presumably, help with trying to
> catch these.
>
> I'm definitely considering writing a rule to catch &#[0-9]{3}; patterns.
> I'm definitely worried it could cause FPs, but are there common circumstances
> where legitimate emails would include dozens to hundreds of these? (The
> latest FNs only include a few dozen, not the hundreds seen in the spample
> above.)
>
> Otherwise, I'm not sure what "template" rule I could write to catch these
> things, and they're increasing in frequency (with more and more being missed
> as FNs).
>
> Thanks.
>
> -- Amir

In text/plain with CTE of '7bit' or '8bit' it's meaningless to use Unicode
HTML entity encodings. It's obviously not HTML. If you want Unicode in
text/plain, it should be in base64 or quoted-printable CTE.

-Philip
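For what it's worth, the detection idea from the thread — count decimal HTML
character references like &#233; in a message body and flag bodies that contain
an unusual number of them — can be sketched roughly as below. This is not a
SpamAssassin rule, just an illustration of the matching logic in Python; the
threshold of 50 is a guess based on the "dozens to hundreds" figure mentioned
above, and the function names are hypothetical.

```python
import re

# Mirrors the &#[0-9]{3}; pattern discussed in the thread. Decimal character
# references can be 2-4 digits, but the spample pattern used three.
ENTITY_RE = re.compile(r"&#[0-9]{3};")

def count_numeric_entities(body: str) -> int:
    """Count decimal HTML character references (e.g. &#233;) in a body."""
    return len(ENTITY_RE.findall(body))

def looks_like_entity_spam(body: str, threshold: int = 50) -> bool:
    """Flag bodies with an unusually large number of numeric entities.

    The threshold is an assumption: legitimate plain-text mail should
    rarely contain more than a handful of literal &#NNN; sequences,
    while the spams described here contain dozens to hundreds.
    """
    return count_numeric_entities(body) >= threshold
```

In SpamAssassin terms this would presumably be a body rule combined with a
hit-count requirement, so that a stray entity or two in legitimate mail does
not fire the rule — which is exactly the FP concern raised above.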