On Apr 15, 1:55 am, Paul Rubin <http://[EMAIL PROTECTED]> wrote:
> "Rhamphoryncus" <[EMAIL PROTECTED]> writes:
> > Indexing cost, memory efficiency, and canonical representation: pick
> > two. You can't use a canonical representation (scalar values) without
> > some sort of costly search when indexing (O(log n) probably) or by
> > expanding to the worst-case size (UTF-32). Python has taken the
> > approach of always providing efficient indexing (O(1)), but you can
> > compile it with either UTF-16 (better memory efficiency) or UTF-32
> > (canonical representation).
>
> I still don't get it. UTF-16 is just a data compression scheme, right?
> I mean, s[17] isn't the 17th character of the (unicode) string regardless
> of which memory byte it happens to live at? It could be that accessing
> it takes more than constant time, but that's hidden by the implementation.
>
> So where does the invariant c==s[s.index(c)] fail, assuming s contains c?

On Linux (UTF-32):

>>> c = u'\U0010FFFF'
>>> c
u'\U0010ffff'
>>> list(c)
[u'\U0010ffff']

On Windows (UTF-16):

>>> c = u'\U0010FFFF'
>>> c
u'\U0010ffff'
>>> list(c)
[u'\udbff', u'\udfff']

The unicode type's repr hides the distinction, but you can see it with
list. On the UTF-16 build your "single character" is actually two
surrogate code points, so s[s.index(c)] gives you only the first
surrogate, which is not equal to c.

-- 
Adam Olsen, aka Rhamphoryncus
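
P.S. The failing invariant can also be reproduced without a UTF-16 build. What follows is only a rough sketch, assuming Python 3.3 or later, where str always indexes by code point and the UTF-16 view has to be rebuilt by hand. The helper utf16_units and the sample string "ab" + c are made up for illustration; they are not part of Python or of the sessions above.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

c = "\U0010FFFF"
s = "ab" + c  # illustrative string that contains c

units = utf16_units(s)    # what a UTF-16 build would index into
c_units = utf16_units(c)  # c occupies two code units there

# Emulate s.index(c) over code units: the offset where c's units begin.
i = next(k for k in range(len(units))
         if units[k:k + len(c_units)] == c_units)

print([hex(u) for u in c_units])  # the surrogate pair for c
print([units[i]] == c_units)      # one unit at index i is not all of c
```

Indexing by code unit returns only the high surrogate at position i, which is why c == s[s.index(c)] cannot hold once c lies outside the BMP.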