Re: Numbered SGML entities in header addresses

Derek Martin Thu, 11 Apr 2019 17:14:14 -0700

On Sun, Apr 07, 2019 at 11:13:53PM +0200, felixs wrote:
> On Fri, Apr 05, 2019 at 11:24:26AM -0700, Ian Zimmerman wrote:
> > I think this is the first time I got hit by the next stage of
> > browserisation: on a mailing list, a From: line that looks like
> > 
> > From: "Foo Bari&#236;" <foo-ba...@gmail.com>
>
> > where the entity refers to the character U0107 in Unicode code point


FWIW, the quoted entity is the Latin-1 character 'ì' (U+00EC, decimal
236), not the character 'ć' (U+0107, decimal 263).  The latter would be
&#263;, not &#236;...  Seems the last two digits were transposed somehow.
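
For anyone who wants to check the arithmetic, here is a minimal sketch
using Python's standard html module.  The variable names are mine and
purely illustrative; the only thing it shows is which character each
numeric reference decodes to.

```python
import html

# A numeric character reference holds the code point in decimal:
# 236 is U+00EC, and 263 is U+0107.
quoted = html.unescape("&#236;")    # what the From: line actually contained
intended = html.unescape("&#263;")  # what U+0107 would have looked like

print(quoted, hex(ord(quoted)))
print(intended, hex(ord(intended)))
```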

> And if you add
> 
> set charset="utf-8"

You should simply never set charset.  Ever.  If you need to, it's
either because your system is misconfigured (so fix that instead), or
your multi-lingual text input configuration is sufficiently
complicated that you already know plenty enough about it to ignore
what I just said.  =8^)  Setting charset is vastly more likely to
cause problems, because setting it almost guarantees that you don't
know what you're doing, and you're Doing It Wrong™.

[I've long been tempted to lobby for the removal of this variable
since it causes more confusion than it solves, and recommendations
I've seen on this list over the years to set it have been universally
wrong (or at least completely ineffectual), no exceptions.]  

But clearly it won't help at all in this case.  The problematic string
isn't a binary representation of a Unicode character.  It's an HTML
entity, and HTML entities in recipient headers are not supported by any
of the RFCs, AFAIK (although new ones are added all the time, so it's
hard to be sure)...  So the fact that it's there is because some
misguided web-based e-mail software thinks ignoring e-mail RFCs is
cool (or more likely, just does not understand i18n).

At any rate, nothing will fix this short of Mutt providing explicit
support for it, which IMO it should not do, or writing a script that
can convert it, to be used as a display filter.  This is bound to be
more trouble than it's worth...  I'm guessing the least obnoxious
approach would be to find a script that converts plain text into
minimally formatted HTML, and then view the resulting thing in w3m or
some such.  But such a thing would likely escape, in some fashion, the
HTML entities it found in the text, since it assumes its input is
plain text...  Alternatively you'd have to parse the whole file
looking for HTML entities, and then convert them to the appropriate
character for the locale you're using.  Blech.
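
To give an idea of what the "parse and convert" variant might look
like, here is a rough sketch in Python.  It is not something Mutt
ships; the function names and the sample message are made up for
illustration.  It assumes the filter sees the headers first, then a
blank line, then the body, and it only touches numeric references
(&#NNN; and &#xHH;) in the header block, leaving the body alone.

```python
import html
import io
import re

# Decimal (&#236;) and hexadecimal (&#xEC;) numeric character references.
# Named entities such as &amp; are deliberately left untouched.
NUMERIC_REF = re.compile(r"&#(?:[0-9]+|[xX][0-9a-fA-F]+);")


def decode_numeric_refs(text):
    """Replace each numeric character reference with its character."""
    return NUMERIC_REF.sub(lambda m: html.unescape(m.group(0)), text)


def filter_message(src, dst):
    """Copy src to dst, decoding references in the header block only."""
    in_headers = True
    for line in src:
        # Assumption: the first blank line ends the headers.
        if in_headers and line.strip() == "":
            in_headers = False
        dst.write(decode_numeric_refs(line) if in_headers else line)


# Made-up sample message, only to exercise the filter.
sample = io.StringIO(
    'From: "Foo Bari&#236;" <foo@example.invalid>\n'
    "\n"
    "Body keeps &#236; as is.\n"
)
out = io.StringIO()
filter_message(sample, out)
print(out.getvalue(), end="")
```

To try it as a display filter you would call
filter_message(sys.stdin, sys.stdout) instead of using the sample, save
it somewhere, and point $display_filter at it.  Whether the result comes
out right still depends on your locale being able to represent the
character, which is the part that makes this more trouble than it's
worth.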

This may be one of the rarer cases where it's actually easier to
contact the sender and get them to do something more reasonable (and
standards-compliant) instead of working around their brokenness.

-- 
Derek D. Martin    http://www.pizzashack.org/   GPG Key ID: 0xDFBEAD02
-=-=-=-=-
This message is posted from an invalid address.  Replying to it will result in
undeliverable mail due to spam prevention.  Sorry for the inconvenience.
