On Apr 14, 11:59 am, Paul Rubin <http://[EMAIL PROTECTED]> wrote:
> "Rhamphoryncus" <[EMAIL PROTECTED]> writes:
> > Nope, it's pretty fundamental to working with text, unicode only being
> > an extreme example: there's a wide number of ways to break down a
> > chunk of text, making the odds of "e" being any particular one fairly
> > low. Python's unicode type only makes this slightly worse, not
> > promising any particular one is available.
>
> I don't understand this. I thought that unicode was a character
> coding system like ascii, except with an enormous character set
> combined with a bunch of different algorithms for encoding unicode
> strings as byte sequences. But I've thought of those algorithms
> (UTF-8 and so forth) as basically being kludgy data compression
> schemes, and unicode strings are still just sequences of code points.
Indexing cost, memory efficiency, and canonical representation: pick
two. You can't use a canonical representation (scalar values) without
either some sort of costly search when indexing (probably O(log n)) or
expanding to the worst-case size (UTF-32). Python has taken the
approach of always providing efficient indexing (O(1)), but you can
compile it with either UTF-16 (better memory efficiency) or UTF-32
(canonical representation).

As an aside, I feel the need to clarify the terms "code points" and
"scalar values". The only difference is that "code points" includes
the surrogates, whereas "scalar values" does not. As the surrogates
are just an encoding detail of UTF-16, I feel this makes "scalar
values" the more canonical term. It's all quite confusing though x_x.

--
Adam Olsen, aka Rhamphoryncus
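For illustration, a minimal sketch of the narrow/wide build difference
described above (assumptions: sys.maxunicode is used to tell the two
builds apart, and U+1D11E, the musical G clef, is just an arbitrary
character outside the Basic Multilingual Plane):

import sys

# One scalar value outside the Basic Multilingual Plane.
s = u"\U0001D11E"

if sys.maxunicode == 0xFFFF:
    # UTF-16 ("narrow") build: the character is stored as a surrogate
    # pair, so indexing is O(1) but hands back surrogate code points
    # rather than the scalar value.
    print(len(s))       # 2
    print(repr(s[0]))   # high surrogate, u'\ud834'
    print(repr(s[1]))   # low surrogate, u'\udd1e'
else:
    # UTF-32 ("wide") build: one scalar value, still O(1) to index.
    print(len(s))       # 1
    print(repr(s[0]))   # the G clef character itself

On a narrow build the surrogates show up in len() and indexing, which
is exactly where "code points" and "scalar values" diverge.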