Re: More text/plain questions

Amir Caspi Wed, 02 Jul 2014 14:05:28 -0700

On Jul 2, 2014, at 12:58 PM, David F. Skoll <d...@roaringpenguin.com> wrote:


> I don't think so.  Any MUA that tried to convert "&#x0435;" to a
> Unicode character in a text/plain part with implicit US-ASCII charset
> and 7bit content transfer encoding is broken.  An MUA should display
> exactly "&#x0435;" in this situation.  It's a different story for
> text/html parts, of course.

For what it's worth, I just received a spam that is basically the same as what
Philip complained about.  I've posted a spample here:

http://pastebin.com/Y2YGwL49

There _is_ a text/html part, and that's what's displaying in my MUA (Apple 
Mail).

Sadly, as can be seen from the spample, the score doesn't quite reach 5.0.
Bayes training could help, since it only scored BAYES_50, but I'm wondering if
this character encoding is designed to sidestep Bayes.  How does Bayes
tokenize these?  If you randomize which characters get replaced (from
plaintext to encoded), then there are lots of combinations for any given word,
and each combination is a different token, no?  I don't know if spammers are
taking the "care" to randomize the letter replacement, but if so, does this
scheme actually "foil" Bayes because each permutation is considered a
different token?  If so, is there a way to mitigate that?
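
To make the combinatorics concrete, here is a minimal Python sketch.  It
assumes a tokenizer that sees the raw, undecoded text, which may not be how
Bayes actually works.  The HOMOGLYPHS table and the function names are
hypothetical, made up for illustration.  It enumerates every spelling of one
word when each replaceable letter is either left alone or swapped for a hex
character reference, then shows one possible mitigation: decode the
references and fold known lookalikes back to Latin before tokenizing.

```python
import html
from itertools import product

# Hypothetical lookalike table: Latin letter -> Cyrillic homoglyph code point.
HOMOGLYPHS = {"a": 0x0430, "e": 0x0435, "o": 0x043E, "p": 0x0440, "c": 0x0441}

# Reverse table used to fold decoded lookalikes back to Latin.
FOLD = {chr(cp): latin for latin, cp in HOMOGLYPHS.items()}


def obfuscated_variants(word):
    """Yield every spelling of `word` in which each replaceable letter is
    either kept as-is or swapped for its hex character reference."""
    choices = []
    for ch in word:
        if ch in HOMOGLYPHS:
            choices.append((ch, "&#x%04x;" % HOMOGLYPHS[ch]))
        else:
            choices.append((ch,))
    for combo in product(*choices):
        yield "".join(combo)


def normalize(token):
    """Decode character references, then fold known lookalikes to Latin."""
    decoded = html.unescape(token)
    return "".join(FOLD.get(ch, ch) for ch in decoded)


variants = list(obfuscated_variants("cheap"))
print(len(set(variants)))                  # distinct raw spellings
print({normalize(v) for v in variants})    # spellings after normalization
```

With four replaceable letters in "cheap" there are 2**4 raw spellings, so a
tokenizer working on undecoded text would see each one as a separate token.
After normalization they collapse to one.  The fold table would need to be
far larger in practice, and it would also mangle legitimate Cyrillic text, so
treat this as an illustration of the idea rather than a proposed fix.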

I'm wondering if we shouldn't write a rule looking for lots of &#x0[0-9]{3}; 
patterns... say, 500 of them in one email.  Or, would we expect legitimate 
emails to have these?
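
A rough sketch of that count in Python, just to show the idea outside of any
particular rule syntax.  One caveat: hex references can contain a-f (the
Cyrillic lookalike for "o" is U+043E, for instance), so this sketch matches
[0-9a-fA-F] rather than [0-9].  That widening is my assumption about what
we'd want to catch.  The function names are hypothetical and the threshold of
500 is simply the figure floated above, not a tested value.

```python
import re

# Four-hex-digit character references starting with 0, e.g. "&#x0435;".
# The class is widened to hex digits so that "&#x043e;" is counted too.
HEX_ENTITY = re.compile(r"&#x0[0-9a-fA-F]{3};")


def count_hex_entities(body):
    """Return how many matching hex character references appear in `body`."""
    return len(HEX_ENTITY.findall(body))


def looks_obfuscated(body, threshold=500):
    """Flag a message body whose reference count reaches the threshold."""
    return count_hex_entities(body) >= threshold


sample = "Ch&#x0435;&#x0430;p w&#x0430;tch&#x0435;s &amp; m&#x043e;re"
print(count_hex_entities(sample))
print(looks_obfuscated(sample))
```

Legitimate HTML mail in non-Latin scripts can contain many such references,
so a raw count like this would want testing against ham before it carries
much score.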

Is there also a rule for a UTF-8-encoded Subject line?  If so, it didn't pop.

--- Amir
