On Apr 15, 8:56 am, Roel Schroeven <[EMAIL PROTECTED]> wrote:
> Paul Rubin schreef:
>
> > "Rhamphoryncus" <[EMAIL PROTECTED]> writes:
> >> Indexing cost, memory efficiency, and canonical representation: pick
> >> two. You can't use a canonical representation (scalar values) without
> >> some sort of costly search when indexing (O(log n) probably) or by
> >> expanding to the worst-case size (UTF-32). Python has taken the
> >> approach of always providing efficient indexing (O(1)), but you can
> >> compile it with either UTF-16 (better memory efficiency) or UTF-32
> >> (canonical representation).
>
> > I still don't get it. UTF-16 is just a data compression scheme, right?
> > I mean, s[17] isn't the 17th character of the (unicode) string regardless
> > of which memory byte it happens to live at? It could be that that accessing
> > it takes more than constant time, but that's hidden by the implementation.
>
> > So where does the invariant c==s[s.index(c)] fail, assuming s contains c?
>
> I didn't get it either, but now I understand. Like you, I thought Python
> Unicode strings contain a canonical representation (in interface, not
> necessarily in implementation) but apparently that is not true; see
> Neil's post and the reference manual
> (http://docs.python.org/ref/types.html#l2h-22).
>
> A simple example on my Python installation, apparently compiled to use
> UTF-16 (sys.maxunicode == 65535):
>
> >>> s = u'\u1d400'

You're confusing \u, which is followed by exactly four hex digits, with
\U, which is followed by exactly eight:

>>> list(u'\u1d400')
[u'\u1d40', u'0']
>>> list(u'\U0001d400')
[u'\U0001d400']  # UTF-32 output, sys.maxunicode == 1114111
[u'\ud835', u'\udc00']  # UTF-16 output, sys.maxunicode == 65535

--
Adam Olsen, aka Rhamphoryncus
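
For anyone who wants to reproduce the difference without a narrow build, here is a minimal sketch. It assumes Python 3.3 or later, where the interpreter no longer comes in UTF-16 and UTF-32 flavours and a str is always indexed by code point, so the surrogate pair only shows up once the string is explicitly encoded as UTF-16. The helper name utf16_code_units is made up for illustration.

```python
import struct

# '\u' consumes exactly four hex digits, so the trailing '0' is a
# separate character: this literal is U+1D40 followed by '0'.
short_escape = '\u1d400'

# '\U' consumes exactly eight hex digits and names one code point.
long_escape = '\U0001d400'


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    Hypothetical helper: it approximates what a UTF-16 build stored
    internally by encoding to big-endian UTF-16 and unpacking the
    result as unsigned 16-bit values.
    """
    raw = text.encode('utf-16-be')
    return list(struct.unpack('>%dH' % (len(raw) // 2), raw))


print([hex(ord(ch)) for ch in short_escape])            # ['0x1d40', '0x30']
print(hex(ord(long_escape)))                            # 0x1d400
print([hex(u) for u in utf16_code_units(long_escape)])  # ['0xd835', '0xdc00']
```

The last line matches the UTF-16 output quoted above: one scalar value, two 16-bit code units, which is why indexing on a UTF-16 build can land on half of a character.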