On Friday 06 December 2013 14:30:06 Steven D'Aprano did opine:

> On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:
> > Evidently (and completely inadvertently) this exchange has just
> > illustrated one of the inadmissible assumptions:
> >
> > "unicode as a medium is universal in the same way that ASCII used to
> > be"
>
> Ironically, your post was not Unicode.
>
> Seriously. I am 100% serious.
>
> Your post was sent using a legacy encoding, Windows-1252, also known as
> CP-1252, which is most certainly *not* Unicode. Whatever software you
> used to send the message correctly flagged it with a charset header:
>
> Content-Type: text/plain; charset=windows-1252
>
> Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle
> encodings correctly (or at all!), it screws up the encoding then sends a
> reply with no charset line at all. This is one bug that cannot be blamed
> on Google Groups -- or on Unicode.
>
> > I wrote a number of ellipsis characters ie codepoint 2026 as in:
>
> Actually you didn't. You wrote a number of ellipsis characters, hex byte
> \x85 (decimal 133), in the CP1252 charset. That happens to be mapped to
> code point U+2026 in Unicode, but the two are as distinct as ASCII and
> EBCDIC.
>
> > Somewhere between my sending and your quoting those ellipses became
> > the replacement character FFFD
>
> Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about
> encodings and character sets. It doesn't just assume things are ASCII,
> but makes a half-hearted attempt to be charset-aware, but badly. I can
> only imagine that it was written back in the Dark Ages where there were
> a lot of different charsets in use but no conventions for specifying
> which charset was in use. Or perhaps the author was smoking crack while
> coding.
>
> > Leaving aside whose fault this is (very likely buggy google groups),
> > this mojibaking cannot happen if the assumption "All text is ASCII"
> > were to uniformly hold.
>
> This is incorrect.
> People forget that ASCII has evolved since the first
> version of the standard in 1963. There have actually been five versions
> of the ASCII standard, plus one unpublished version. (And that's not
> including the things which are frequently called ASCII but aren't.)
>
> ASCII-1963 didn't even include lowercase letters. It is also missing
> some graphic characters like braces, and included at least two
> characters no longer used, the up-arrow and left-arrow. The control
> characters were also significantly different from today.
>
> ASCII-1965 was unpublished and unused. I don't know the details of what
> it changed.
>
> ASCII-1967 is a lot closer to the ASCII in use today. It made
> considerable changes to the control characters, moving, adding,
> removing, or renaming at least half a dozen control characters. It
> officially added lowercase letters, braces, and some others. It
> replaced the up-arrow character with the caret and the left-arrow with
> the underscore. It was ambiguous, allowing variations and
> substitutions, e.g.:
>
> - character 33 was permitted to be either the exclamation
>   mark ! or the logical OR symbol |
>
> - consequently character 124 (vertical bar) was always
>   displayed as a broken bar ¦, which explains why even today
>   many keyboards show it that way
>
> - character 35 was permitted to be either the number sign # or
>   the pound sign £
>
> - character 94 could be either a caret ^ or a logical NOT ¬
>
> Even the humble comma could be pressed into service as a cedilla.
>
> ASCII-1968 didn't change any characters, but allowed the use of LF on
> its own. Previously, you had to use either LF/CR or CR/LF as newline.
>
> ASCII-1977 removed the ambiguities from the 1967 standard.
>
> The most recent version is ASCII-1986 (also known as ANSI X3.4-1986).
> Unfortunately I haven't been able to find out what changes were made --
> I presume they were minor, and didn't affect the character set.
>
> So as you can see, even with actual ASCII, you can have mojibake. It's
> just not normally called that. But if you are given an arbitrary ASCII
> file of unknown age, containing code 94, how can you be sure it was
> intended as a caret rather than a logical NOT symbol? You can't.
>
> Then there are at least 30 official variations of ASCII, strictly
> speaking part of ISO-646. These 7-bit codes were commonly called "ASCII"
> by their users, despite the differences, e.g. replacing the dollar sign
> $ with the international currency sign ¤, or replacing the left brace
> { with the letter s with caron š.
>
> One consequence of this is that the MIME charset for ASCII text is called
> "US-ASCII", despite the redundancy, because many people expect "ASCII"
> alone to mean whatever national variation they are used to.
>
> But it gets worse: there are proprietary variations on ASCII which are
> commonly called "ASCII" but aren't, including dozens of 8-bit so-called
> "extended ASCII" character sets, which is where the problems *really*
> pile up. Invariably back in the 1980s and early 1990s people used to
> call these "ASCII" no matter that they used 8-bits and contained
> anything up to 256 characters.
>
> Just because somebody calls something "ASCII", doesn't make it so; even
> if it is ASCII, doesn't mean you know which version of ASCII; even if
> you know which version, doesn't mean you know how to interpret certain
> codes. It simply is *wrong* to think that "good ol' plain ASCII text"
> is unambiguous and devoid of problems.
>
> > With unicode there are in-memory formats, transportation formats eg
> > UTF-8,
>
> And the same applies to ASCII.
>
> ASCII is a *seven-bit code*. It will work fine on computers where the
> word-size is seven bits. If the word-size is eight bits, or more, you
> have to pad the ASCII code. How do you do that? Pad the most-significant
> end or the least significant end? That's a choice there. How do you pad
> it, with a zero or a one?
> That's another choice. If your word-size is
> more than eight bits, you might even pad *both* ends.
>
> In C, a char is defined as the smallest addressable unit of the machine
> that can contain the basic character set, not necessarily eight bits.
> Implementations of C and C++ sometimes reserve 8, 9, 16, 32, or 36 bits
> as a "byte" and/or char. Your in-memory representation of ASCII "a"
> could easily end up as bits 001100001 or 0000000001100001.
>
> And then there is the question of whether ASCII characters should be Big
> Endian or Little Endian. I'm referring here to bit endianness, rather
> than bytes: should character 'a' be represented as bits 1100001 (most
> significant bit to the left) or 1000011 (least significant bit to the
> left)? This may be relevant with certain networking protocols. Not all
> networking protocols are big-endian, nor are all processors. The Ada
> programming language even supports both bit orders.
>
> When transmitting ASCII characters, the networking protocol could
> include various start and stop bits and parity codes. A single 7-bit
> ASCII character might be anything up to 12 bits in length on the wire.
> It is simply naive to imagine that the transmission of ASCII codes is
> the same as the in-memory or on-disk storage of ASCII.
>
> You're lucky to be active in a time when most common processors have
> standardized on a single bit-order, and when most (but not all) network
> protocols have done the same. But that doesn't mean that these issues
> don't exist for ASCII. If you get a message that purports to be ASCII
> text but looks like this:
>
> "\tS\x1b\x1b{\x02u{'\x1b\x13B"
>
> you should suspect strongly that it is "Hello World!" which has been
> accidentally bit-reversed by some rogue piece of hardware.
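
Two of the claims quoted above are easy to check at a Python prompt. What follows is only a minimal sketch, assuming Python 3 and its standard codecs. The helper names `reverse7` and `bit_reverse_text` are made up for illustration and do not come from the quoted post. It reverses the bit order of each 7-bit code, as in the 'a' example (1100001 versus 1000011), and decodes the single byte \x85 under three different charset assumptions.

```python
# Sketch only: the helper names below are hypothetical, not from the quoted post.

def reverse7(code: int) -> int:
    """Return `code` with the order of its 7 bits reversed."""
    if not 0 <= code < 128:
        raise ValueError("not a 7-bit ASCII code: %r" % code)
    # Zero-pad to exactly 7 bits first; otherwise codes below 64
    # (space, digits, some punctuation) would be reversed as 6 bits or fewer.
    return int(format(code, "07b")[::-1], 2)


def bit_reverse_text(text: str) -> str:
    """Bit-reverse every character of an ASCII string."""
    return "".join(chr(reverse7(ord(ch))) for ch in text)


# The same byte means different things under different charsets.
ELLIPSIS_BYTE = b"\x85"
as_cp1252 = ELLIPSIS_BYTE.decode("cp1252")                  # U+2026 HORIZONTAL ELLIPSIS
as_latin1 = ELLIPSIS_BYTE.decode("latin-1")                 # U+0085, a C1 control code
as_ascii = ELLIPSIS_BYTE.decode("ascii", errors="replace")  # U+FFFD REPLACEMENT CHARACTER

if __name__ == "__main__":
    print(format(reverse7(ord("a")), "07b"))   # 1000011
    scrambled = bit_reverse_text("Hello World!")
    print(repr(scrambled))
    print(bit_reverse_text(scrambled))         # Hello World!
    print([hex(ord(s)) for s in (as_cp1252, as_latin1, as_ascii)])
```

Reversing twice gives back the original text, so a message that looks like a pile of control characters is worth running through once before giving up on it. The last line of output shows one byte turning into three different code points, depending only on which charset the reader assumes.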

You can lay a lot of the ASCII ambiguity on D.E.C. and their vt series
terminals; anything newer than a vt100 made liberal use of the msbit in
a character. Having written an emulator for the vt-220, I can testify
that really getting it right was a right pain in the ass. And then I
added zmodem triggers and detections.

Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

Mother Earth is not flat!
A pen in the hand of this president is far more dangerous than 200
million guns in the hands of law-abiding citizens.