Graham Wideman added the comment:

Marc-Andre: Thanks for your latest comments. 

> We could also have called encodings: "character set", "code page",
> "character encoding", "transformation", etc.

I concur with you that things _could_ be called all sorts of names, and the 
choices may be arbitrary. However, creating a clear explanation requires 
figuring out the distinct things of interest in the domain, picking terms for 
those things that are distinct, and then using those terms rigorously. (Usage 
in the field may vary, which in itself may warrant comment.)

I read your slide deck (something of a time capsule from 2002) with interest, on a number of 
points. (I realize that you were involved in the Python 2.x implementation of 
Unicode; I'm not sure about 3.x.)

Page 8 "What is a Character?" is lovely, showing very explicitly Unicode's two 
levels of mapping, and giving names to the separate parts. It strongly suggests 
this HOWTO page needs a similar figure.

That said, there are a few notes to make on that slide, useful in trying to 
arrive at consistent terms: 

1. The figure shows a more precise word for "what users regard as a character", 
namely "grapheme". I'd forgotten that.

2. It shows e-accent-acute to demonstrate a pair of code points representing a 
single grapheme. That's important, but the figure should avoid suggesting that 
this is the only way to form e-accent-acute: the precomposed code point U+00E9 
is canonically equivalent.

3. The illustration identifies the series of code points (the middle row) as 
"the Unicode encoding of the string". That is, the grapheme-to-code-points 
mapping is described as an encoding. Not a wrong use of general language, but 
inconsistent with the mapping that encode() performs. (And I don't think that 
the code-points-to-grapheme transform is ever called "decoding", but I could 
be wrong.)

4. The illustration of Code Units (the third row) shows graphemes for the 
code-unit byte values. This confusingly glosses over the fact that those 
graphemes are what you would see if you _decoded_ those byte values using 
CP1252 or ISO 8859-1, and it suggests that the result is reasonable or useful. 
People certainly do this, deliberately or accidentally, but it is a misuse of 
the data, and should be warned against, or at least explained as a source of 
confusion.
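Points 2 and 4 above are easy to demonstrate concretely. A quick sketch (plain standard-library Python; the literal strings are just illustrative values I've chosen):

```python
import unicodedata

# Point 2: two ways to form e-accent-acute; they are canonically equivalent.
precomposed = "\u00e9"   # e-acute as a single code point, U+00E9
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(precomposed == decomposed)   # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True after normalization

# Point 4: decoding UTF-8 code units with Latin-1 yields mojibake.
data = precomposed.encode("utf-8")   # b'\xc3\xa9'
print(data.decode("latin-1"))        # two plausible-looking graphemes, wrong text
```

The Latin-1 decode "succeeds" (every byte maps to some code point), which is exactly why showing graphemes for code units without comment is misleading.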

Returning to your most recent message:

> In Python keep it simple: you have Unicode (code points) and 
> 8-bit strings or bytes (code units).

I wish it _were_ that simple. I agree that, in principle (assuming Python 3+), 
there should be an "inside your program", where you have the str type, which 
always acts as a sequence of Unicode code points and has string functions; and 
an "outside your program", where text is represented by sequences of bytes 
that specify or imply some encoding. Your program should use the supplied 
library functions to convert, mostly automatically, on the way in and on the 
way out.
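That inside/outside boundary is just encode() on the way out and decode() on the way in; a minimal round trip (example string is mine):

```python
text = "Grüße"                  # str: a sequence of Unicode code points
data = text.encode("utf-8")     # bytes: code units, suitable for "outside"
print(type(data))               # <class 'bytes'>
print(data.decode("utf-8") == text)   # True: decode on the way back in
```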

But there are enough situations where a Python programmer, having adopted 
Python 3's string-equals-Unicode approach, sees unexpected results. That 
prompts reading this page, which is then called upon to make the fine 
distinctions needed to figure out what's going on.

I'm not sure what you mean by "8-bit strings", but I'm pretty sure that's not 
an available type in Python 3+. That is, some functions (e.g. encode()) 
produce sequences of bytes, but those don't work entirely like strs.
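One concrete difference, for illustration: indexing a str yields a length-1 str, while indexing bytes yields an int.

```python
s = "abc"
b = s.encode("utf-8")
print(s[0])      # 'a'  -- a length-1 str
print(b[0])      # 97   -- an int, the byte value
print(list(b))   # [97, 98, 99]
```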

-----------
This discussion to try to revise the article piecemeal has become pretty 
diffuse, with perhaps competing notions of purpose, and what level of detail 
and precision are needed etc. I will try to suggest something productive in a 
subsequent message.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue20906>
_______________________________________