On 23/07/13 04:27, Jim Mooney wrote:
> Okay, I'm getting there, but this should be translating A umlaut to an old
> DOS box character, according to my ASCII table,


I understand what you mean, but I should point out that what you say is 
*literally impossible*, since neither Ä nor any box-drawing characters are part 
of ASCII. What you are saying is figuratively equivalent to this:

...should be translating Москва to モスクワ according to my Latin to French 
dictionary...

Even if the ancient Romans knew of the city of Moscow, they didn't write it in 
Cyrillic and you certainly can't get Japanese characters by translating it to 
French.

Remember that ASCII only has 128 characters, and *everything else* is non-ASCII, whether it is a 
line-drawing character, a European accented letter, an East Asian character, an emoticon, or an ancient 
Egyptian hieroglyph. People who talk about "extended ASCII" are confused, and all you need to do to 
show up their confusion is to ask "which extended ASCII do you mean?" There are dozens.
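
To make the 128-character limit concrete, here is a minimal sketch in Python 3. The helper name is_ascii is my own, purely for illustration; it only relies on the standard "ascii" codec refusing anything outside code points 0 to 127.

```python
def is_ascii(text):
    """Return True if every character in text is one of the 128 ASCII characters."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character has a code point above 127.
        return False
    return True

print(is_ascii("A"))        # plain letter
print(is_ascii("\u00c4"))   # A with diaeresis (umlaut)
print(is_ascii("\u2500"))   # light horizontal box-drawing character
```

Only the first call should report True; neither the A umlaut nor the box-drawing character can be ASCII.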

For example, ordinal value 0xC4 (hex, = 196 in decimal) has the following meanings, 
depending on the version of "extended ASCII" you use:

Ä LATIN CAPITAL LETTER A WITH DIAERESIS
 HEBREW POINT HIRIQ
Δ GREEK CAPITAL LETTER DELTA
ؤ ARABIC LETTER WAW WITH HAMZA ABOVE
─ BOX DRAWINGS LIGHT HORIZONTAL
ƒ LATIN SMALL LETTER F WITH HOOK


using encodings Latin1, CP1255, ISO-8859-7, ISO-8859-6, IBM866, and MacRoman, 
in that order. And there are many others.
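
You can check the list above yourself. This is a rough sketch for Python 3, assuming the codec names below (as spelled in the standard library) correspond to the six encodings just mentioned; the function name interpretations is made up for this example.

```python
import unicodedata

# Python codec names assumed to match Latin1, CP1255, ISO-8859-7,
# ISO-8859-6, IBM866 and MacRoman respectively.
ENCODINGS = ["latin-1", "cp1255", "iso-8859-7", "iso-8859-6", "cp866", "mac_roman"]

def interpretations(byte_value, encodings=ENCODINGS):
    """Map each encoding name to the Unicode name of the character that
    a single byte decodes to under that encoding."""
    raw = bytes([byte_value])
    return {enc: unicodedata.name(raw.decode(enc)) for enc in encodings}

for enc, name in interpretations(0xC4).items():
    print(f"{enc:12} {name}")
```

The same byte goes in each time; only the encoding you decode with changes the answer.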

So the question is, if you have a file name with byte 196 in it, which 
character is intended? In isolation, you cannot possibly tell. As an English 
speaker, I've used at least four of the above six, although only three in file 
names. A single-byte encoding is limited to a mere 256 characters (128 of 
which are already locked down to the ASCII charset[1]), so no one such encoding 
can hold all of the above at once. For that you need Unicode[2].
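
As a rough illustration in Python 3: the six characters listed earlier fit in one Unicode string and survive a round trip through UTF-8, while none of the six single-byte codecs can encode the whole string. The helper can_encode and the codec spellings are my own choices for this sketch.

```python
# The six characters that byte 0xC4 can stand for, as one Unicode string.
text = "\u00c4\u05b4\u0394\u0624\u2500\u0192"

SINGLE_BYTE = ["latin-1", "cp1255", "iso-8859-7", "iso-8859-6", "cp866", "mac_roman"]

def can_encode(s, encoding):
    """Return True if every character of s exists in the given encoding."""
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# UTF-8 handles all six characters and gives back the same string.
encoded = text.encode("utf-8")
print(encoded.decode("utf-8") == text)

# Each single-byte encoding is missing at least one of them.
for enc in SINGLE_BYTE:
    print(enc, can_encode(text, enc))
```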

The old "code pages" technology is sheer chaos, and sadly we'll be living with 
it for years to come. But eventually, maybe in another 30 years or so, everyone will use 
Unicode all the time, except for specialist and legacy needs, and gradually we'll get 
past this nonsense of dozens of encodings and mojibake and other crap.





[1] Not all encodings are ASCII-compatible, but most of them are.
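
A quick way to see the difference, sketched in Python 3. The function name is_ascii_compatible is hypothetical, and cp037 (an EBCDIC code page) and UTF-16 are simply two examples I picked of encodings that do not keep ASCII's byte values.

```python
def is_ascii_compatible(encoding):
    """Return True if all 128 ASCII characters encode to the same single
    byte under the given encoding as they do under ASCII itself."""
    for code in range(128):
        ch = chr(code)
        try:
            if ch.encode(encoding) != bytes([code]):
                return False
        except UnicodeEncodeError:
            return False
    return True

for enc in ["utf-8", "latin-1", "cp866", "cp037", "utf-16-le"]:
    print(enc, is_ascii_compatible(enc))

# Under EBCDIC (cp037) the letter A is byte 0xC1 rather than 0x41.
print("A".encode("cp037"))
```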

[2] Or something like it. In Japan, there is a proprietary charset called TRON 
which includes even more characters than Unicode. Both TRON and Unicode aim to 
include every human character which has ever been used, but they disagree as to 
what counts as distinct characters. In a nutshell, there are tens of thousands 
of characters which are written the same way in Chinese, Japanese and Korean, 
but used differently. Unicode's policy is that you can tell from 
context which is meant, and gives them a single code-point each, while TRON 
gives them three code-points. This is not quite as silly as saying that an 
English E, a German E and a French E should be considered three distinct 
characters, but (in my opinion) not far off it.


--
Steven
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
