On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:

> Evidently (and completely inadvertently) this exchange has just
> illustrated one of the inadmissable assumptions:
>
> "unicode as a medium is universal in the same way that ASCII used to be"
Ironically, your post was not Unicode.

Seriously. I am 100% serious.

Your post was sent using a legacy encoding, Windows-1252, also known as CP-1252, which is most certainly *not* Unicode. Whatever software you used to send the message correctly flagged it with a charset header:

Content-Type: text/plain; charset=windows-1252

Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle encodings correctly (or at all!): it screws up the encoding and then sends a reply with no charset line at all. This is one bug that cannot be blamed on Google Groups -- or on Unicode.

> I wrote a number of ellipsis characters ie codepoint 2026 as in:

Actually you didn't. You wrote a number of ellipsis characters, hex byte \x85 (decimal 133), in the CP1252 charset. That byte happens to be mapped to code point U+2026 in Unicode, but the two are as distinct as ASCII and EBCDIC.

> Somewhere between my sending and your quoting those ellipses became the
> replacement character FFFD

Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about encodings and character sets. It doesn't just assume things are ASCII; it makes a half-hearted attempt to be charset-aware, and does it badly. I can only imagine that it was written back in the Dark Ages when there were a lot of different charsets in use but no conventions for specifying which charset was in use. Or perhaps the author was smoking crack while coding.

> Leaving aside whose fault this is (very likely buggy google groups),
> this mojibaking cannot happen if the assumption "All text is ASCII" were
> to uniformly hold.

This is incorrect. People forget that ASCII has evolved since the first version of the standard in 1963. There have actually been five versions of the ASCII standard, plus one unpublished version. (And that's not including the things which are frequently called ASCII but aren't.)

ASCII-1963 didn't even include lowercase letters. It was also missing some graphic characters, like braces, and it included at least two characters no longer used, the up-arrow and the left-arrow. The control characters were also significantly different from today's.

ASCII-1965 was unpublished and unused. I don't know the details of what it changed.

ASCII-1967 is a lot closer to the ASCII in use today. It made considerable changes to the control characters, moving, adding, removing, or renaming at least half a dozen of them. It officially added lowercase letters, braces, and some other characters. It replaced the up-arrow character with the caret and the left-arrow with the underscore. It was also ambiguous, allowing variations and substitutions, e.g.:

- character 33 was permitted to be either the exclamation mark ! or the logical OR symbol |
- consequently, character 124 (vertical bar) was always displayed as a broken bar ¦, which explains why even today many keyboards show it that way
- character 35 was permitted to be either the number sign # or the pound sign £
- character 94 could be either a caret ^ or a logical NOT ¬

Even the humble comma could be pressed into service as a cedilla.

ASCII-1968 didn't change any characters, but it allowed the use of LF on its own. Previously, you had to use either LF/CR or CR/LF as the newline.

ASCII-1977 removed the ambiguities of the 1967 standard.

The most recent version is ASCII-1986 (also known as ANSI X3.4-1986). Unfortunately I haven't been able to find out what changes were made -- I presume they were minor, and didn't affect the character set.

So as you can see, even with actual ASCII, you can have mojibake. It's just not normally called that.
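For the record, the distinction is easy to see from Python 3 (a quick interactive sketch, nothing more):

>>> b'\x85'.decode('windows-1252')   # the byte means ellipsis only under CP1252
'…'
>>> '%04X' % ord(b'\x85'.decode('windows-1252'))
'2026'
>>> b'\x85'.decode('utf-8', errors='replace')   # same byte, wrong charset
'\ufffd'

Decode the byte under the charset it was actually written in and you get U+2026; decode it under the wrong one and, at best, you get the U+FFFD replacement character that ended up in the quoted text.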
But if you are given an arbitrary ASCII file of unknown age, containing code 94, how can you be sure it was intended as a caret rather than a logical NOT symbol? You can't.

Then there are at least 30 official variations of ASCII, strictly speaking part of ISO-646. These 7-bit codes were commonly called "ASCII" by their users, despite the differences, e.g. replacing the dollar sign $ with the international currency sign ¤, or the left brace { with the letter s with caron š. One consequence of this is that the MIME charset for ASCII text is called "US-ASCII", despite the redundancy, because many people expect "ASCII" alone to mean whatever national variation they are used to.

But it gets worse: there are proprietary variations on ASCII which are commonly called "ASCII" but aren't, including dozens of 8-bit so-called "extended ASCII" character sets, which is where the problems *really* pile up. Back in the 1980s and early 1990s people invariably called these "ASCII", no matter that they used 8 bits and contained anything up to 256 characters.

Just because somebody calls something "ASCII" doesn't make it so; even if it is ASCII, that doesn't mean you know which version of ASCII; and even if you know which version, that doesn't mean you know how to interpret certain codes. It is simply *wrong* to think that "good ol' plain ASCII text" is unambiguous and devoid of problems.

> With unicode there are in-memory formats, transportation formats eg
> UTF-8,

And the same applies to ASCII.

ASCII is a *seven-bit code*. It will work fine on computers where the word size is seven bits. If the word size is eight bits, or more, you have to pad the ASCII code. How do you do that? Pad the most-significant end or the least-significant end? That's one choice. Do you pad with a zero or a one? That's another choice. If your word size is more than eight bits, you might even pad *both* ends.

In C, a char is defined as the smallest addressable unit of the machine that can hold the basic character set, which is not necessarily eight bits. Implementations of C and C++ have variously reserved 8, 9, 16, 32, or 36 bits as a "byte" and/or char. Your in-memory representation of ASCII "a" could easily end up as the bits 001100001 or 0000000001100001.

And then there is the question of whether ASCII characters should be big-endian or little-endian. I'm referring here to bit endianness rather than bytes: should the character 'a' be represented as the bits 1100001 (most significant bit on the left) or 1000011 (least significant bit on the left)? This may be relevant with certain networking protocols. Not all networking protocols are big-endian, nor are all processors. The Ada programming language even supports both bit orders.

When transmitting ASCII characters, the networking protocol may add start bits, stop bits, and parity bits, so a single 7-bit ASCII character might be anything up to 12 bits in length on the wire. It is simply naive to imagine that the transmission of ASCII codes is the same as the in-memory or on-disk storage of ASCII.

You're lucky to be active at a time when most common processors have standardized on a single bit order, and when most (but not all) network protocols have done the same. But that doesn't mean these issues don't exist for ASCII. If you get a message that purports to be ASCII text but looks like this:

"\tS\x1b\x1b{\x02u{'\x1b\x13B"

you should strongly suspect that it is "Hello World!" which has been accidentally bit-reversed by some rogue piece of hardware.
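If you want to check that claim, here is a minimal Python sketch (reverse7 is just a throwaway name for this post) that reverses the seven bits of each character's code:

def reverse7(ch):
    # Reverse the 7 low-order bits of an ASCII character's code.
    code, out = ord(ch), 0
    for _ in range(7):
        out = (out << 1) | (code & 1)
        code >>= 1
    return chr(out)

print(repr("".join(reverse7(c) for c in "Hello World!")))
# prints: "\tS\x1b\x1b{\x02u{'\x1b\x13B"

Run it again over the garbled bytes and "Hello World!" comes straight back, since reversing the bits twice is a no-op.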
--
Steven

--
https://mail.python.org/mailman/listinfo/python-list