On Jul 2, 2014, at 12:37 PM, John Hardin <jhar...@impsec.org> wrote:
> On Wed, 2 Jul 2014, Philip Prindeville wrote:
>
>> Given that it’s text/plain with an implicit charset=“us-ascii” and an
>> implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4}
>> doesn’t really parse into a 16-bit character, would it? That would be a
>> broken MUA that made such a leap...
>
> Nope. The content-transfer-encoding is only for the *transfer* part of the
> process. Once the content reaches the MUA that content can be further parsed
> by the MUA according to other encoding rules, such as these escape sequences
> for Unicode characters. That's perfectly valid. How else would you send, for
> example, a c-cedille in spanish text via a 7-bit-clean channel?

This is a trick question, right?

You do that with base64 or quoted-printable, which are the interoperable standards. You don’t pick some implicit encoding which no one else has agreed upon.

>
>> Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather
>> than the unicode16 or UTF-8 character with that hex value?
>
> I'd only expect that in a very old MUA (i.e. that does not support Unicode),
> or display of the raw message content at user request.

How is it supposed to guess what the encoding implicitly means?

We have the MIME spec so that all of this is formally specified.

>
>> I wouldn’t want a message where someone gives a couple of examples of
>> encoding &#x0400; for instance being flagged as SPAM, but if the text is 20%
>> or more of these sequences then I would say that’s SPAM-sign.
>
> That's valid 7-bit encoding for transfer. It's relying on the user's MUA to
> convert the encoded Unicode values to glyphs for display.

No, 7-bit CTE means it’s 7-bit content. Period.

If you want 8-bit or 16-bit or 32-bit content over a 7-bit CHANNEL, you use a 7-bit safe encoding like base64 or quoted-printable.

Citing RFC-2045:

   6.  Content-Transfer-Encoding Header Field

   Many media types which could be usefully transported via email are
   represented, in their "natural" format, as 8bit character or binary
   data.  Such data cannot be transmitted over some transfer protocols.
   For example, RFC 821 (SMTP) restricts mail messages to 7bit US-ASCII
   data with lines no longer than 1000 characters including any
   trailing CRLF line separator.

   It is necessary, therefore, to define a standard mechanism for
   encoding such data into a 7bit short line format.  Proper labelling
   of unencoded material in less restrictive formats for direct use
   over less restrictive transports is also desireable.  This document
   specifies that such encodings will be indicated by a new
   "Content-Transfer-Encoding" header field.  This field has not been
   defined by any previous standard.

   …

   6.2.  Content-Transfer-Encodings Semantics

   …

   The quoted-printable and base64 encodings transform their input from
   an arbitrary domain into material in the "7bit" range, thus making
   it safe to carry over restricted transports.  The specific
   definition of the transformations are given below.

   The proper Content-Transfer-Encoding label must always be used.
   Labelling unencoded data containing 8bit characters as "7bit" is not
   allowed, nor is labelling unencoded non-line-oriented data as
   anything other than "binary" allowed.

   …

   NOTE ON THE RELATIONSHIP BETWEEN CONTENT-TYPE AND
   CONTENT-TRANSFER-ENCODING: It may seem that the
   Content-Transfer-Encoding could be inferred from the characteristics
   of the media that is to be encoded, or, at the very least, that
   certain Content-Transfer-Encodings could be mandated for use with
   specific media types.  There are several reasons why this is not the
   case.  First, given the varying types of transports used for mail,
   some encodings may be appropriate for some combinations of media
   types and transports but not for others.  (For example, in an 8bit
   transport, no encoding would be required for text in certain
   character sets, while such encodings are clearly required for 7bit
   SMTP.)

So you can’t infer the content-type from the content-transfer-encoding or vice-versa.

And RFC-2046:

   4.1.2.  Charset Parameter

   A critical parameter that may be specified in the Content-Type field
   for "text/plain" data is the character set.  This is specified with
   a "charset" parameter, as in:

     Content-type: text/plain; charset=iso-8859-1

   Unlike some other parameter values, the values of the charset
   parameter are NOT case sensitive.  The default character set, which
   must be assumed in the absence of a charset parameter, is US-ASCII.

so you can’t render Unicode or UTF-8 or ISO-8859-X characters because the charset is implicitly US-ASCII and doesn’t have any characters beyond 01111111 binary.

In short, it’s not Unicode unless it EXPLICITLY SAYS UNICODE.

And see also RFC-2152, which I won’t quote here.

Lastly, RFC-3629:

   8.  MIME registration

   This memo serves as the basis for registration of the MIME charset
   parameter for UTF-8, according to [RFC2978].  The charset parameter
   value is "UTF-8".

and again, as there is no explicit charset parameter, it is implied to be US-ASCII.

>
> I would say that's more a case of those characters shouldn't be present if
> the language is en-us than an encoding issue. The presence of lots of those
> is either a sign that the text isn't English, or is obfuscated. How do you
> reliably tell the language of the message?

A lot of MUAs leave out the language, whereas none should omit a CHARSET or C-T-E.

>
> It would probably be a good idea to add those sequences to the replacetags
> letter REs so that the FUZZY rules will catch them.
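
As a rough illustration of the 20% idea mentioned earlier in this message, here is a minimal Python sketch. It is not a SpamAssassin rule and not anything that exists in the rule set today. The function names (effective_labels, ncr_fraction, looks_obfuscated), the optional trailing ';' in the pattern, and the choice to measure the ratio over non-whitespace characters are all assumptions made for the example. It only looks at single-part text/plain messages whose charset and C-T-E resolve to the MIME defaults, since that is the case argued about above.

```python
import email
import re

# The sequence discussed in the thread; the optional trailing ';' is an
# assumption, since the pattern quoted above does not include it.
NCR = re.compile(r"&#x[0-9A-Fa-f]{4};?")


def effective_labels(msg):
    """Return (charset, cte), filling in the RFC 2045/2046 defaults."""
    charset = (msg.get_content_charset() or "us-ascii").lower()
    cte = (msg.get("Content-Transfer-Encoding") or "7bit").strip().lower()
    return charset, cte


def ncr_fraction(text):
    """Share of non-whitespace characters that sit inside &#xNNNN sequences."""
    visible = sum(1 for ch in text if not ch.isspace())
    if visible == 0:
        return 0.0
    covered = sum(len(m.group(0)) for m in NCR.finditer(text))
    return covered / visible


def looks_obfuscated(raw_message, threshold=0.20):
    """Hypothetical check: True if a default-labelled text/plain body is
    at least `threshold` numeric character references by character count."""
    msg = email.message_from_string(raw_message)
    if msg.is_multipart() or msg.get_content_type() != "text/plain":
        return False
    charset, cte = effective_labels(msg)
    if charset != "us-ascii" or cte != "7bit":
        # Explicitly labelled content is a different discussion.
        return False
    return ncr_fraction(msg.get_payload()) >= threshold
```

A message that merely cites one such sequence as an example stays well under the threshold, while a body made up mostly of them goes over it. Whether 20% is the right cut-off is a separate question that this sketch does not answer.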