A coding scheme, Unicode or not, works with three sets: a *unique* set of CHARACTERS, a *unique* set of CODE POINTS and a *unique* set of ENCODED CODE POINTS.
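(Just to fix the ideas, a minimal Python 3 sketch; the character 'é' and the UTF-8 transformation are arbitrary choices here, the mechanism itself is discussed below.)

    character  = "é"                        # element of the set of CHARACTERS
    code_point = ord(character)             # element of the set of CODE POINTS
    encoded    = character.encode("utf-8")  # element of the set of ENCODED CODE POINTS

    print(hex(code_point))   # 0xe9
    print(encoded)           # b'\xc3\xa9'  -> two 8-bit units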
The relation between the set of characters and the set of code points is a *human* table, created with a sheet of paper and a pencil: a deliberate choice of characters with integers as "labels". The relation between the set of code points and the set of encoded code points is a *mathematical* operation.

In the case of an "8-bit" coding scheme, like iso-XXX, this operation is a no-op; the relation is an identity. Shortly: set of code points == set of encoded code points.

In the case of Unicode, the Unicode Consortium endorses three such mathematical operations, called UTF-8, UTF-16 and UTF-32, where UTF means Unicode Transformation Format, a confusing wording that denotes at the same time the process and the result of the process. This Unicode transformation does not produce bytes; it produces words/chunks/tokens of *bits* with lengths 8, 16 or 32, called Unicode Transformation Units (hence the names UTF-8, -16, -32). At this level, only a structure has been defined (there is no computing). Very important: a healthy coding scheme works conceptually only with this *unique* set of encoded code points, not with bytes, characters or code points.

The last step is the machine implementation: it is up to the processor, the compiler, the language to implement all these Unicode Transformation Units, with of course their related specificities: char, w_char, int, long, endianness, rune (Go language), ...

Not too over-simplified, not too over-complicated, and enough to understand one, if not THE, design mistake of the flexible string representation.

jmf
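PS. For what it is worth, the identity of the 8-bit case and the units produced by the three transformations can be checked directly in Python 3; iso-8859-1 stands here for any iso-XXX scheme and the code point U+1F600 is an arbitrary example, chosen outside the BMP on purpose:

    # 8-bit scheme (iso-8859-1 as an example): the "transformation" is an
    # identity, the encoded code point *is* the code point.
    assert "é".encode("iso-8859-1")[0] == ord("é")    # 0xE9 == 0xE9

    cp = 0x1F600   # arbitrary code point, outside the BMP

    u8  = chr(cp).encode("utf-8")       # four  8-bit  units
    u16 = chr(cp).encode("utf-16-be")   # two  16-bit  units (a surrogate pair)
    u32 = chr(cp).encode("utf-32-be")   # one  32-bit  unit

    print(len(u8),       u8)    # 4 b'\xf0\x9f\x98\x80'
    print(len(u16) // 2, u16)   # 2 b'\xd8=\xde\x00'
    print(len(u32) // 4, u32)   # 1 b'\x00\x01\xf6\x00'

    # Endianness only shows up at the machine/implementation level:
    print(chr(cp).encode("utf-16-le"))  # b'=\xd8\x00\xde'
    print(chr(cp).encode("utf-16-be"))  # b'\xd8=\xde\x00'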