On Wed, 29 Jun 2016 12:13 am, Random832 wrote:

> On Tue, Jun 28, 2016, at 00:31, Rustom Mody wrote:
>> GG downgrades posts containing unicode if it can, thereby increasing
>> reach to recipients with unicode-broken clients
>
> That'd be entirely reasonable, except for the excessively broad
> application of "if it can".
>
> Certainly it _can_ do it all the time. Just replace anything that
> doesn't fit with question marks or hex notation or \N{NAME} or some
> human readable pseudo-representation a la unidecode. It could have done
> any of those with the Hindi that you threw in to try to confound it, (or
> it could have chosen ISCII, which likewise lacks arrow characters, as
> the encoding to downgrade to).
Are you suggesting that email clients and newsreaders should silently mangle
the text of your message behind your back? Because that's what it sounds
like you're saying.

I understand technical limitations. If I'm using a client that can't cope
with anything but (say) ISCII or Latin-1, then I'm all out of luck if I want
to write an email containing Greek or Cyrillic. I get that. But if the
client allows me to type Greek or Cyrillic into the editor, and then accepts
that message for sending, and then mangles it into "question marks or hex
notation or \N{NAME}" (for example), that's a disgrace and completely
unacceptable.

Yes, software *is capable of doing so*, in the same way that software is
capable of deleting all the vowels from your post, or replacing the word
"medieval" with "medireview":

http://northernplanets.blogspot.com.au/2007/01/medireview.html

This is not a good idea.

> It should pick an encoding which it expects recipients to support and
> which contains *all* of the characters in the message,

That would be UTF-8. That's a no-brainer. Why would you use any other
encoding? If you use UTF-8, it just works. It supports the *entire* Unicode
character set, which is a superset of virtually every code page and encoding
you are likely to encounter in practice. (No, your software probably isn't
running on a 1980s vintage Atari, and if you're in Japan using TRON you've
got your own software.)

And your text widget or editor surely supports Unicode, because if it
didn't, the user couldn't type those Hindi or Greek letters. So there's an
obvious, sensible algorithm:

- take the user's Unicode text, and encode it to UTF-8.

In pseudo-code:

    content = text.encode('utf-8')

And then there's the actual algorithm used by mail clients and newsreaders:

- take the user's Unicode text, and try encoding it with a variety of
  different encodings (US-ASCII, Latin-1, maybe a few others); only if they
  all fail, fall back to UTF-8.

Or in pseudo-code:

    list_of_encodings = ['US-ASCII', 'Latin-1', ...]
    for encoding in list_of_encodings:
        try:
            content = text.encode(encoding)
            break
        except UnicodeEncodeError:
            pass
    else:
        content = text.encode('utf-8')

Why would you write the second instead of the first? It's just *dumb code*.
Maybe 20-year-old applications could be excused for thinking that this
newfangled Unicode thing should be the last resort instead of the code page
system, but it's 2016 now and code pages are just holding us back.

This is *especially* egregious since UTF-8 text containing only ASCII
characters is (by design) indistinguishable from US-ASCII, so even if there
is some application out there from 1980 that can only cope with ASCII, your
UTF-8 email will be perfectly readable to the degree that it only uses
"plain text".

> as proper
> characters and not as pseudo-representations, and downgrade to that if
> and only if such an encoding can be found. For most messages, it can use
> US-ASCII. For most of the remainder it can use some ISO-8859 or
> Windows-125x encoding.

There's never any need to downgrade to a non-Unicode encoding, at least not
by default. Well, maybe in Asia; I don't know how well Asian software
supports Unicode.

-- 
Steven

“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list
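P.S. The "indistinguishable from US-ASCII" point is easy to check for
yourself in any Python 3 (the string literals are just illustrative):

```python
# Pure-ASCII text encodes to the same bytes under UTF-8 as under
# US-ASCII, so an ASCII-only reader can consume a UTF-8 message unchanged.
text = "plain text, nothing fancy"
assert text.encode('utf-8') == text.encode('ascii')

# Non-ASCII text cannot be encoded as ASCII at all, but it round-trips
# losslessly through UTF-8: no question marks, no \N{NAME} substitutions.
greek = "καλημέρα"
try:
    greek.encode('ascii')
except UnicodeEncodeError:
    pass  # expected: ASCII cannot represent Greek
assert greek.encode('utf-8').decode('utf-8') == greek
```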