On Mon, Jul 16, 2018 at 08:56:11PM +0100, Rhodri James wrote:
> The problem everyone is having with you, Marko, is that you are
> using the terminology incorrectly. [...] When you call UTF-32 a
> variable-width encoding, you are incorrect.
But please don't overlook that the "terminology" is in fact rather specialized jargon, far less common than even most computer jargon. Unless you're uncommonly familiar with the subject matter, you simply don't have this vocabulary. Under the circumstances it seems not horribly unreasonable for such a person to think of the bytes required to represent a glyph as an encoding's "width," and you as "experts" should rightly expect lay people to make this mistake and adjust for it, or politely correct it, without the condescension.

> You are of course welcome to use whatever terminology you personally
> like, like Humpty Dumpty. However when you point to a duck and say
> "That's a gnu," people are likely to stop taking you seriously.

Shouldn't experts "be generous in what they accept, but conservative in what they emit"? If your goal here is to educate and come to a common understanding, rather than simply to prove how superior (the generic) you are, then perhaps both you and the community would be better served if you strove to understand Marko's points rather than just point out how horribly wrong he is. The tone here is often extremely adversarial, which I think mostly serves to incite others to respond adversarially. I certainly know I've fallen into that trap more than once myself.

I work primarily in Unix environments, and I daresay the way Unix treats text as bytes--barring certain very specialized applications that require knowing which bytes correspond to which units of linguistic representation, like reversing strings (which FWIW I've never found a use for, other than academic ones)--works just fine. You can--and I do, or have, at least--write non-ASCII Unicode strings as bytes in your Python 2.7 code, or read them from a file, or whatever other input your program takes, and send them to whatever terminal or GUI program you want, and they will appear as they should to the user, provided the system is configured appropriately (which these days mostly means configured to use UTF-8, and which these days is generally the case). It's reasonable to assume users either know what encoding their systems are using, or don't have a clue and won't change it, so it will effectively always be "right."

And if you sensibly used UTF-8 encoded byte strings in your program but the system happens to be configured for some other encoding, it's a fairly trivial matter to use iconv to convert to the system's encoding (which I have also done, though perhaps not in Python--I can't recall), assuming the data can be converted at all (and if not, you're kinda screwed anyway).

In the overwhelming majority of cases this gets you everything you need, and the language internally understanding Unicode (especially if that understanding requires more work from the programmer to deal with it) mostly gets you very little. Yes, of course there are specific applications for which that intelligence is necessary, and in those cases it should be made use of. The rest of the time--the overwhelming majority of the time--it's just superfluous complexity. So, sure, in uncommon cases knowing about Unicode may reduce (but not eliminate) the complications of dealing with different languages, but in the common cases it may only serve to make more work for the programmer. I don't know about you, but I prefer to do less, if less is required.
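For what it's worth, something like this is all it usually takes. It's only a rough sketch of what I mean, assuming Python 2.7 and a source file saved as UTF-8; the sample string and the locale check are purely illustrative:

    # -*- coding: utf-8 -*-
    # Byte-oriented text output: the literal below is simply a str of
    # UTF-8 bytes, written to whatever stdout happens to be attached to.
    import locale
    import sys

    data = "héllo, wörld\n"   # a plain byte str in Python 2.7

    enc = locale.getpreferredencoding() or "UTF-8"
    if enc.upper() in ("UTF-8", "UTF8"):
        # The common case: a UTF-8 terminal renders the bytes as-is.
        sys.stdout.write(data)
    else:
        # The uncommon case: transcode to the system's encoding,
        # much as you would with iconv(1) in a shell pipeline.
        sys.stdout.write(data.decode("utf-8").encode(enc, "replace"))

On a typical UTF-8 system only the first branch ever runs and the bytes pass straight through untouched; the else branch is the moral equivalent of piping the output through iconv.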
If these features exist because Windows needs them in order to reliably get the common cases right, then maybe, just maybe, Unix really did get it right after all.