On Jul 2, 2014, at 12:37 PM, John Hardin <jhar...@impsec.org> wrote:

> On Wed, 2 Jul 2014, Philip Prindeville wrote:
> 
>> Given that it’s text/plain with an implicit charset=“us-ascii” and an 
>> implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4} 
>> wouldn’t really parse into a 16-bit character, would it? Only a broken 
>> MUA would make such a leap...
> 
> Nope. The content-transfer-encoding is only for the *transfer* part of the 
> process. Once the content reaches the MUA that content can be further parsed 
> by the MUA according to other encoding rules, such as these escape sequences 
> for Unicode characters. That's perfectly valid. How else would you send, for 
> example, a c-cedilla in Spanish text via a 7-bit-clean channel?

This is a trick question, right?

You do that with base64 or quoted-printable, which are the interoperable 
standards.

You don’t pick some implicit encoding which no one else has agreed upon.


> 
>> Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather 
>> than the UTF-16 or UTF-8 character with that hex value?
> 
> I'd only expect that in a very old MUA (i.e. that does not support Unicode), 
> or display of the raw message content at user request.


How is the MUA supposed to guess what the encoding implicitly means? We have 
the MIME spec so that all of this is formally specified.


> 
>> I wouldn’t want a message where someone gives a couple of examples of 
>> encoding &#x0400 for instance being flagged as SPAM, but if the text is 20% 
>> or more of these sequences then I would say that’s SPAM-sign.
> 
> That's valid 7-bit encoding for transfer. It's relying on the user's MUA to 
> convert the encoded Unicode values to glyphs for display.

No, 7-bit CTE means it’s 7-bit content. Period.

If you want 8-bit or 16-bit or 32-bit content over a 7-bit CHANNEL, you use a 
7-bit safe encoding like base64 or quoted-printable.
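
For illustration, here is a minimal sketch of what that looks like when a
message is built with Python's standard email package. The helper name
build_message and the sample body are hypothetical, chosen only for this
example; the point is that the charset and the transfer encoding are both
declared explicitly, and the bytes on the wire stay within the 7-bit range.

```python
from email.message import EmailMessage

# Hypothetical sample body containing a c-cedilla (U+00E7).
BODY = "Le gar\u00e7on est arriv\u00e9.\n"


def build_message(body, cte):
    """Build a text/plain message with an explicit charset and CTE."""
    msg = EmailMessage()
    msg["Subject"] = "7-bit-safe example"
    # Declare both the charset and the transfer encoding explicitly.
    msg.set_content(body, charset="utf-8", cte=cte)
    return msg


for cte in ("quoted-printable", "base64"):
    msg = build_message(BODY, cte)
    wire = msg.as_bytes()
    # Every byte that would go over the channel is 7-bit clean.
    assert all(b < 0x80 for b in wire)
    # The receiver recovers the text from the declared headers alone.
    assert msg.get_content() == BODY
    print(msg["Content-Type"], "/", msg["Content-Transfer-Encoding"])
```

Nothing here has to be guessed by the receiving side: the headers say how
to undo the transfer encoding and which charset the resulting bytes are in.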

Citing RFC-2045:

6.  Content-Transfer-Encoding Header Field

   Many media types which could be usefully transported via email are
   represented, in their "natural" format, as 8bit character or binary
   data.  Such data cannot be transmitted over some transfer protocols.
   For example, RFC 821 (SMTP) restricts mail messages to 7bit US-ASCII
   data with lines no longer than 1000 characters including any trailing
   CRLF line separator.

   It is necessary, therefore, to define a standard mechanism for
   encoding such data into a 7bit short line format.  Proper labelling
   of unencoded material in less restrictive formats for direct use over
   less restrictive transports is also desireable.  This document
   specifies that such encodings will be indicated by a new "Content-
   Transfer-Encoding" header field.  This field has not been defined by
   any previous standard.

…

6.2.  Content-Transfer-Encodings Semantics

   …

   The quoted-printable and base64 encodings transform their input from
   an arbitrary domain into material in the "7bit" range, thus making it
   safe to carry over restricted transports.  The specific definition of
   the transformations are given below.

   The proper Content-Transfer-Encoding label must always be used.
   Labelling unencoded data containing 8bit characters as "7bit" is not
   allowed, nor is labelling unencoded non-line-oriented data as
   anything other than "binary" allowed.

   …

   NOTE ON THE RELATIONSHIP BETWEEN CONTENT-TYPE AND CONTENT-TRANSFER-
   ENCODING: It may seem that the Content-Transfer-Encoding could be
   inferred from the characteristics of the media that is to be encoded,
   or, at the very least, that certain Content-Transfer-Encodings could
   be mandated for use with specific media types.  There are several
   reasons why this is not the case. First, given the varying types of
   transports used for mail, some encodings may be appropriate for some
   combinations of media types and transports but not for others.  (For
   example, in an 8bit transport, no encoding would be required for text
   in certain character sets, while such encodings are clearly required
   for 7bit SMTP.)

So you can’t infer the content-type from the content-transfer-encoding or 
vice-versa.

And RFC-2046:

4.1.2.  Charset Parameter

   A critical parameter that may be specified in the Content-Type field
   for "text/plain" data is the character set.  This is specified with a
   "charset" parameter, as in:

     Content-type: text/plain; charset=iso-8859-1

   Unlike some other parameter values, the values of the charset
   parameter are NOT case sensitive.  The default character set, which
   must be assumed in the absence of a charset parameter, is US-ASCII.

So you can’t render Unicode or UTF-8 or ISO-8859-X characters, because the 
charset is implicitly US-ASCII, which has no characters beyond 01111111 
binary (0x7F).

In short, it’s not Unicode unless it EXPLICITLY SAYS UNICODE.

And see also RFC-2152, which I won’t quote here.

Lastly, RFC-3629:

8.  MIME registration


   This memo serves as the basis for registration of the MIME charset
   parameter for UTF-8, according to [RFC2978].  The charset parameter
   value is "UTF-8". 

and again, as there is no explicit charset parameter, it is implied to be 
US-ASCII.
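
As a rough sketch of those defaults in practice, this is how Python's
standard email package treats a message that carries no MIME headers at
all. The sample message is made up for the example, and this only shows
one library's behaviour, not what any particular MUA does. Note that the
"us-ascii" and "7bit" fallbacks are supplied by the caller here, mirroring
the defaults in RFC 2045 and RFC 2046.

```python
from email import message_from_string

# Hypothetical message with no Content-Type and no Content-Transfer-Encoding.
RAW = (
    "From: sender@example.com\n"
    "Subject: no MIME headers at all\n"
    "\n"
    "Example escape: &#x0400;\n"
)

msg = message_from_string(RAW)

# With no Content-Type header, the library falls back to text/plain.
content_type = msg.get_content_type()
# No charset parameter: the caller passes the RFC 2046 default as fallback.
charset = msg.get_content_charset("us-ascii")
# No C-T-E header: the caller passes the RFC 2045 default as fallback.
cte = msg.get("Content-Transfer-Encoding", "7bit")
# The body is returned as the literal ASCII characters; nothing is unescaped.
body = msg.get_payload()

print(content_type, charset, cte)
print(repr(body))
```

The escape sequence comes back as the plain characters '&', '#', 'x' and
so on, since nothing in the headers says to treat it as anything else.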



> 
> I would say that's more a case of those characters shouldn't be present if 
> the language is en-us than an encoding issue. The presence of lots of those 
> is either a sign that the text isn't English, or is obfuscated. How do you 
> reliably tell the language of the message?


A lot of MUAs leave out the language, whereas none should omit a CHARSET or 
C-T-E.


> 
> It would probably be a good idea to add those sequences to the replacetags 
> letter REs so that the FUZZY rules will catch them.
