On Thu, Apr 11, 2019 at 07:12:57PM -0500, Derek Martin wrote:
> On Sun, Apr 07, 2019 at 11:13:53PM +0200, felixs wrote:
> > On Fri, Apr 05, 2019 at 11:24:26AM -0700, Ian Zimmerman wrote:
> > > I think this is the first time I got hit by the next stage of
> > > browserisation: on a mailing list, a From: line that looks like
> > >
> > > From: "Foo Bari&#236;" <foo-ba...@gmail.com>
> > >
> > where the entity refers to the character U0107 in Unicode code point
>
> FWIW, the quoted entity is the latin-1 character 'ì', not the
> character 'ć'. The latter would be &#263;, not
> &#236;... Seems the last two digits were transposed somehow.
>
> > And if you add
> >
> > set charset="utf-8"
>
> You should simply never set charset. Ever. If you need to, it's
> either because your system is misconfigured (so fix that instead), or
> your multi-lingual text input configuration is sufficiently
> complicated that you already know plenty enough about it to ignore
> what I just said. =8^) Setting charset is vastly more likely to
> cause problems, because setting it almost guarantees that you don't
> know what you're doing, and you're Doing It Wrong™.

(...)
Thanks. I had already posted a follow-up to my first message.

> But clearly it won't help at all in this case. The problematic string
> isn't a binary representation of a unicode character. It's an HTML
> entity, and HTML entities in recipient headers is not supported by any
> of the RFCs, AFAIK (although new ones are added all the time, so it's
> hard to be sure)... So the fact that it's there is because some
> misguided web-based e-mail software thinks ignoring e-mail RFCs is
> cool (or more likely, just does not understand i18n).

Even so, call them HTML entities or call them something else, they
consist of ASCII characters, and ASCII is a subset of UTF-8. That is
the very reason why mutt displays them literally, exactly as they
arrive. Who said that they are binary representations? I talked about
a hexadecimal representation being converted into an integer, in order
to use chr() in my Python function example. Maybe I am not following
you here...

>
> At any rate, nothing will fix this short of Mutt providing explicit
> support for it, which IMO it should not do, or writing a script that
> can convert it, to be used as a display filter. This is bound to be
> more trouble than it's worth... I'm guessing the least obnoxious
> approach would be to find a script that converts plain text into
> minimally formatted HTML, and then view the resulting thing in w3m or
> some such. But such a thing would likely escape the HTML entities it
> found in the text, in some fashion, since it's assuming that it's
> plain text... Alternatively you'd have to parse the whole file
> looking for HTML entities, and then convert them to the appropriate
> character for the locale you're using. Blech.

Sure, a waste of time.

(...)

Cheers,
felixs
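
P.S. For anyone who wants to try the conversion anyway, here is a
minimal sketch of the idea discussed above: find numeric character
references, turn the digits into an integer, and pass that to chr().
It is not the function from my earlier message. The names
decode_numeric_refs and filter_lines are made up for illustration, and
only numeric references (decimal or hexadecimal) are handled, on the
assumption that this is all the broken headers contain.

```python
import re

# Numeric character references: decimal (&#263;) or hexadecimal (&#x107;).
# Named entities such as &amp; are deliberately left alone in this sketch.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text):
    """Replace each numeric character reference in text with its character."""

    def repl(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: keep the original text.
            return match.group(0)

    return NUMERIC_REF.sub(repl, text)


def filter_lines(lines):
    """Decode references line by line (hypothetical helper for a filter)."""
    return (decode_numeric_refs(line) for line in lines)
```

To use it as a filter, one could end the script with
sys.stdout.writelines(filter_lines(sys.stdin)) after importing sys, and
point mutt's display_filter at it. If named entities matter as well,
html.unescape() from the Python 3 standard library covers both kinds,
at the price of also rewriting things like &amp; in ordinary text.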