Re: [Tutor] close, but no cigar

Steven D'Aprano Tue, 23 Jul 2013 07:49:09 -0700

On 23/07/13 04:27, Jim Mooney wrote:

> Okay, I'm getting there, but this should be translating A umlaut to an old
> DOS box character, according to my ASCII table,

I understand what you mean, but I should point out that what you say is
*literally impossible*, since neither Ä nor any box-drawing characters are part
of ASCII. What you are saying is figuratively equivalent to this:

...should be translating Москва to モスクワ according to my Latin to French
dictionary...

Even if the ancient Romans knew of the city of Moscow, they didn't write it in
Cyrillic and you certainly can't get Japanese characters by translating it to
French.

Remember that ASCII has only 128 characters, and *everything else* is
non-ASCII, whether it is a line-drawing character, a European accented letter,
an East Asian character, an emoticon, or an ancient Egyptian hieroglyph. People
who talk about "extended ASCII" are confused, and all you need to do to show up
their confusion is to ask "which extended ASCII do you mean?" There are dozens.

For example, ordinal value 0xC4 (hex, = 196 in decimal) has the following meaning
depending on the version of "extended ASCII" you use:

Ä LATIN CAPITAL LETTER A WITH DIAERESIS
◌ִ HEBREW POINT HIRIQ
Δ GREEK CAPITAL LETTER DELTA
ؤ ARABIC LETTER WAW WITH HAMZA ABOVE
─ BOX DRAWINGS LIGHT HORIZONTAL
ƒ LATIN SMALL LETTER F WITH HOOK

using encodings Latin1, CP1255, ISO-8859-7, ISO-8859-6, IBM866, and MacRoman,
in that order. And there are many others.
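
If you want to check the list above for yourself, something like the following
should do it. It is a sketch assuming Python 3; the codec names are Python's
spellings of the six encodings ("cp866" is Python's name for IBM866 and
"mac_roman" for MacRoman).

```python
# Sketch only, Python 3 assumed: decode the single byte 0xC4 six different ways.
import unicodedata

BYTE = bytes([0xC4])
ENCODINGS = ["latin-1", "cp1255", "iso8859-7", "iso8859-6", "cp866", "mac_roman"]

# Same byte, six different characters, depending on which codec is applied.
decoded = {enc: BYTE.decode(enc) for enc in ENCODINGS}
for enc, ch in decoded.items():
    print(f"{enc:10} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The byte never changes; only the table used to interpret it does.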

So the question is, if you have a file name with byte 196 in it, which
character is intended? In isolation, you cannot possibly tell. As an English
speaker, I've used at least four of the above six, although only three in file
names. A single-byte encoding is limited to a mere 256 characters (128 of
which are already locked down to the ASCII charset[1]), so you can't have all
of the above at once except by using Unicode[2].
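
As a rough illustration of that last point, the sketch below (Python 3 assumed;
can_encode is a made-up helper name) counts how many of the six characters each
legacy encoding can represent, and then round-trips all six through UTF-8, one
of the Unicode encodings.

```python
# Sketch only, Python 3 assumed. can_encode is a hypothetical helper name.
# The six characters that byte 0xC4 stands for in the six encodings:
# A-diaeresis, Hebrew hiriq, Greek delta, Arabic waw-hamza, box line, f-hook.
CHARS = "\u00c4\u05b4\u0394\u0624\u2500\u0192"
ENCODINGS = ["latin-1", "cp1255", "iso8859-7", "iso8859-6", "cp866", "mac_roman"]

def can_encode(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# How many of the six characters does each single-byte encoding cover?
coverage = {enc: sum(can_encode(ch, enc) for ch in CHARS) for enc in ENCODINGS}
for enc, count in coverage.items():
    print(f"{enc:10} covers {count} of {len(CHARS)}")

# UTF-8 handles all six at once and gets them back unchanged.
round_trip = CHARS.encode("utf-8").decode("utf-8")
print("UTF-8 round trip intact?", round_trip == CHARS)
```

None of the six legacy encodings covers the whole set, which is the reason a
file name containing byte 196 is ambiguous until you know the encoding.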

The old "code pages" technology is sheer chaos, and sadly we'll be living with
it for years to come. But eventually, maybe in another 30 years or so, everyone will use
Unicode all the time, except for specialist and legacy needs, and gradually we'll get
past this nonsense of dozens of encodings and mojibake and other crap.

[1] Not all encodings are ASCII-compatible, but most of them are.

[2] Or something like it. In Japan, there is a proprietary charset called TRON
which includes even more characters than Unicode. Both TRON and Unicode aim to
include every human character which has ever been used, but they disagree as to
what counts as distinct characters. In a nutshell, there are some tens of
thousands of characters which are written the same way in Chinese, Japanese
and Korean, but used differently. Unicode's policy is that you can tell from
context which is meant, and gives them a single code-point each, while TRON
gives them three code-points. This is not quite as silly as saying that an
English E, a German E and a French E should be considered three distinct
characters, but (in my opinion) not far off it.

--
Steven
