[issue20906] Issues in Unicode HOWTO

Graham Wideman Wed, 19 Mar 2014 16:51:06 -0700

Graham Wideman added the comment:

Antoine:


Thanks for your comments -- this is slippery stuff.

> It's better, but how about simply "In this article"?

I was hoping to tell the reader that these hex representations appear in 
many articles and are not specific to this one.

> [ showing the glyph ]

Agreed -- it would be good to show the glyphs mentioned, but in a way that 
isn't confusing if the user's web browser doesn't render them correctly.

> For all intents and purposes, iso-8859-1 and friends *are* encodings 
> (and this is how Python actually names them).

I am still mulling this over. iso-8859-1 is most literally an "encoding" in the 
old sense of the word (character <--> byte representation), and is not, per se, 
a unicode-related concept. 

I think part of the ambiguity is that two subtly but importantly different 
ideas are in play:

1. Python string (capable of representing any unicode text) --> some 
full-fidelity and industry recognized unicode byte stream, like utf-8, or 
utf-32. I think this is legitimately described as an "encoding" of the unicode 
string.

versus:

2. Python string --> some other code system, such as ASCII, cp1250, etc. The 
destination code system doesn't necessarily have anything to do with Unicode, 
and whole ranges of Unicode's characters either result in an exception or get 
translated as escape sequences. I.e., this is more usefully seen as a 
translation operation than as "merely" encoding.

In 1, the encoding process results in data that stays within concepts defined 
within Unicode. In 2, encoding produces data that would be described by some 
code system outside of Unicode.
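
A rough illustration of the two cases, as a sketch only: the sample string and 
the try_encode helper are made up for this example, and cp1250 is just one 
legacy code page picked from the list above.

```python
# Sample text: 'cafe' with e-acute, a space, and one CJK character (U+4E2D).
text = "caf\u00e9 \u4e2d"

# Case 1: Unicode encoding forms round-trip any str without loss.
for codec in ("utf-8", "utf-32"):
    assert text.encode(codec).decode(codec) == text


def try_encode(s, codec):
    """Hypothetical helper: return the bytes, or None if codec cannot represent s."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError:
        return None


# Case 2: the target code system covers only part of Unicode.
ascii_result = try_encode(text, "ascii")      # fails on U+00E9 and U+4E2D
cp1250_result = try_encode(text, "cp1250")    # has U+00E9, but not U+4E2D

# With an error handler, the unrepresentable characters become escape
# sequences instead of raising, so the result is a translation of the text.
escaped = text.encode("ascii", errors="backslashreplace")
print(ascii_result, cp1250_result, escaped)
```

In case 2 the bytes produced with backslashreplace no longer decode back to the 
original string, which is the sense in which it behaves more like a translation.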

At the moment I think Python muddles these two ideas together, and I'm not sure 
how to clarify this. 

> So it should say "16-bit code points" instead, right?

I don't think Unicode code points should ever be described as having a 
particular number of bits. I think this is a core concept: Unicode separates 
the character <--> code point, and code point <--> bits/bytes mappings. 

At most, one might want to distinguish different ranges of unicode code points. 
Even if there is a need to distinguish code points <= 65535, I don't think this 
should be described as "16-bit", as it muddies the distinction between 
Unicode's two mappings.
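
To make the two mappings concrete, here is a small sketch (the sample 
characters are my own choice). Each character has exactly one code point, an 
abstract integer, while the number of bytes depends entirely on which encoding 
is applied afterwards.

```python
# One character each from four different ranges of code points.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# The "-le" codec variants are used so that no byte order mark is counted.
codecs = ("utf-8", "utf-16-le", "utf-32-le")

sizes = {}
for ch in samples:
    code_point = ord(ch)  # mapping 1: character <--> code point
    # mapping 2: code point <--> bytes, which differs per encoding
    sizes[code_point] = tuple(len(ch.encode(c)) for c in codecs)

for code_point, byte_counts in sizes.items():
    print(f"U+{code_point:04X}", dict(zip(codecs, byte_counts)))
```

U+20AC, for instance, is below 65536 yet takes three bytes in utf-8, two in 
utf-16 and four in utf-32, so no single bit width belongs to the code point 
itself.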

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue20906>
_______________________________________