Graham Wideman added the comment:

Marc-Andre:

Thanks for commenting:

> > 2. 1. Python string --> some other code system, such as 
> > ASCII, cp1250, etc. The destination code system doesn't 
> > necessarily have anything to do with unicode, and whole 
> > ranges of unicode's characters either result in an 
> > exception, or get translated as escape sequences. 
> > Ie: This is more usefully seen as a translation 
> > operation, than "merely" encoding.

> Those are encodings as well. The operation going from Unicode to one of
> these encodings is called "encode" in Python.

Yes, I am certainly aware that in Python parlance these are also called "encode" 
(and achieved with encode()), which, I am arguing, is one reason we have 
confusion. These operations do not encode into a Unicode-defined byte 
stream. They entail translating and filtering into the allowed character set of 
a different code system, and then encoding into that code system's byte 
representation.
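
To make the difference concrete, here is a minimal sketch, assuming Python 3 
(where str is Unicode text). The sample string and variable names are just 
for illustration.

```python
text = "caf\u00e9 \u20ac \u4e2d"  # 'café', EURO SIGN, and a CJK character

# UTF-8 can represent every code point, so the round trip is lossless.
utf8_bytes = text.encode("utf-8")
round_trip_ok = utf8_bytes.decode("utf-8") == text

# ASCII has no slot for any of the three non-ASCII characters,
# so the default (strict) error handling raises an exception.
try:
    text.encode("ascii")
    ascii_strict_failed = False
except UnicodeEncodeError:
    ascii_strict_failed = True

# With an error handler, unrepresentable characters become escape sequences.
ascii_escaped = text.encode("ascii", errors="backslashreplace")

# cp1250 covers some of them but not the CJK character, which is
# replaced with '?' here, i.e. the text is filtered, not just encoded.
cp1250_partial = text.encode("cp1250", errors="replace")

print(round_trip_ok, ascii_strict_failed)
print(ascii_escaped)
print(cp1250_partial)
```

The same encode() call either preserves everything (UTF-8) or has to reject, 
escape, or replace characters (ASCII, cp1250), depending on the target.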

> > In 1, the encoding process results in data that stays within concepts 
> > defined within Unicode. In 2, encoding produces data that would be 
> > described by some code system outside of Unicode.
> > At the moment I think Python muddles these two ideas together, 
> > and I'm not sure how to clarify this. 

> An encoding is a mapping of characters to ordinals, nothing more or less.

In Unicode, the mapping from characters to ordinals (code points) is not the 
encoding. The encoding is the mapping from code points to bytes. While 
I wish this were a distinction reserved for pedants, unfortunately it is an 
aspect that users of Unicode need to understand in order to make 
sense of how it works, and of what the literature and the web say (correct and 
otherwise).

> You are viewing all this from a Unicode point of view, but please
> realize that Unicode is rather new in the business and the many
> other encodings Python supports have been around for decades.

I'm advocating that the concepts be clear enough to understand that Unicode 
(UTF-whatever) works differently (two mappings) than non-Unicode systems 
(single mapping), so that users have some hope of understanding what happens in 
moving from one to the other.
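
A small sketch of the two-mappings versus one-mapping distinction, assuming 
Python 3. The choice of the euro sign and of cp1250 is only an example.

```python
ch = "\u20ac"  # EURO SIGN

# Unicode, mapping 1: character -> code point (an abstract integer).
code_point = ord(ch)

# Unicode, mapping 2: code point -> bytes. The result depends on which
# encoding form is chosen, while the code point stays the same.
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")
as_utf32 = ch.encode("utf-32-be")

# Legacy code page: a single table takes the character straight to a
# byte value, with no separate code point layer in between.
as_cp1250 = ch.encode("cp1250")

print(hex(code_point))
print(as_utf8.hex(), as_utf16.hex(), as_utf32.hex())
print(as_cp1250.hex())
```

One code point (0x20AC) yields three different byte sequences under the three 
UTFs, whereas cp1250 simply assigns the character the single byte 0x80.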

> > > So it should say "16-bit code points" instead, right?
 
> > I don't think Unicode code points should ever be described as 
> > having a particular number of bits. I think this is a 
> > core concept: Unicode separates the character <--> code point, 
> > and code point <--> bits/bytes mappings. 

> You have UCS-2 and UCS-4. UCS-2 representable in 16 bits, UCS-4
> needs 21 bits, but is typically stored in 32-bit. Still,
> you're right: it's better to use the correct terms UCS-2 vs. UCS-4
> rather than refer to the number of bits.

I think mixing in UCS just adds confusion here. The Unicode Consortium has declared 
"UCS-2" obsolete, and even wants people to stop using that term:
http://www.unicode.org/faq/utf_bom.html
"UCS-2 is obsolete terminology... the term should now be avoided."
(That's a somewhat silly position -- we must still use the term to talk about 
legacy stuff. But probably not necessary here.)

So my point wasn't about UCS. It was about referring to code points as having a 
particular bit width. Fundamentally, code points are numbers, without regard to 
any particular computer number format. It is a separate matter that they can 
be encoded using 8-, 16- or 32-bit code units (UTF-8, UTF-16, UTF-32), and the 
code unit size is independent of the magnitude of the code point number. 

It _is_ the case that some code points are large enough integers that when 
encoded they _require_, say, 3 bytes in UTF-8, or two 16-bit code units in UTF-16 
and so on. But the number of bits used in the encoding does not necessarily 
correspond to the number of bits that would be required to represent the 
integer code point number in plain binary. (Only in UTF-32 is the encoded value 
simply the binary version of the code point value.)
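
As a rough illustration, assuming Python 3, the following compares the number 
of bits needed for the code point as a plain integer with the number of bytes 
each UTF actually uses. The sample characters are arbitrary picks.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

rows = []
for ch in samples:
    cp = ord(ch)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    utf32 = ch.encode("utf-32-be")
    rows.append({
        "code_point": cp,
        # Bits needed to write the code point as a plain binary integer.
        "bits": cp.bit_length(),
        "utf8_bytes": len(utf8),
        "utf16_bytes": len(utf16),
        "utf32_bytes": len(utf32),
        # Does reading the encoded bytes as one integer give back the code point?
        "utf8_is_plain_binary": int.from_bytes(utf8, "big") == cp,
        "utf32_is_plain_binary": int.from_bytes(utf32, "big") == cp,
    })

for row in rows:
    print(
        f"U+{row['code_point']:04X} bits={row['bits']:2d} "
        f"utf8={row['utf8_bytes']} utf16={row['utf16_bytes']} "
        f"utf32={row['utf32_bytes']} "
        f"utf32_plain={row['utf32_is_plain_binary']}"
    )
```

U+20AC, for instance, fits in 14 bits as an integer but takes 3 bytes in UTF-8, 
and U+1F600 takes two 16-bit code units in UTF-16.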

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue20906>
_______________________________________