On 05.09.2014 at 20:25, Chris "Kwpolska" Warrick <kwpol...@gmail.com> wrote:
> On Sep 5, 2014 7:57 PM, "Kurt Mueller" <kurt.alfred.muel...@gmail.com> wrote:
> > Could someone please explain the following behavior to me:
> > Python 2.7.7, MacOS 10.9 Mavericks
> >
> > >>> import sys
> > >>> sys.getdefaultencoding()
> > 'ascii'
> > >>> [ord(c) for c in 'AÄ']
> > [65, 195, 132]
> > >>> [ord(c) for c in u'AÄ']
> > [65, 196]
> >
> > My obviously wrong understanding:
> > 'AÄ' in 'ascii' are two characters,
> > one with ord A=65 and
> > one with ord Ä=196 (ISO8859-1, <depends on code table>)
> > --> why [65, 195, 132]?
> > u'AÄ' is a Unicode string
> > --> why [65, 196]?
> >
> > It is just the other way round from what I would expect.
>
> Basically, the first string is just a bunch of bytes, as provided by your
> terminal, which sounds like UTF-8 (perfectly logical in 2014). The second
> one is converted into a real Unicode representation. The codepoint for Ä is
> U+00C4 (196 decimal). It's just a coincidence that it also matches latin1 aka
> ISO 8859-1, as Unicode starts with all 256 latin1 codepoints. Please kindly
> forget encodings other than UTF-8.
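A quick check at the interpreter seems to confirm this (a sketch, assuming the same Python 2.7 on a UTF-8 terminal; the reprs could differ with another terminal encoding):

>>> 'AÄ'                    # byte string: the raw UTF-8 bytes the terminal delivers
'A\xc3\x84'
>>> 'AÄ'.decode('utf-8')    # decoding those bytes yields the unicode string
u'A\xc4'
>>> u'AÄ'                   # unicode literal: 2 code points, U+0041 and U+00C4
u'A\xc4'
>>> u'AÄ'.encode('utf-8')   # encoding back to UTF-8 reproduces the 3 bytes
'A\xc3\x84'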
So:
'AÄ' is a UTF-8 byte string represented by 3 bytes:
  A -> 41    -> 65           first byte, decimal
  Ä -> c3 84 -> 195 and 132  second and third byte, decimal
u'AÄ' is a Unicode string represented by 2 bytes?:
  A -> U+0041 -> 65   first byte, decimal; is the 00 omitted, or just not yielded by ord()?
  Ä -> U+00C4 -> 196  second byte, decimal; is the 00 omitted, or just not yielded by ord()?
(see the PS below for a quick check)

> BTW: ASCII covers only the first 128 bytes.

ACK

-- 
Kurt Mueller, kurt.alfred.muel...@gmail.com
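PS: On the "00 omitted?" question, a small check (again assuming Python 2.7 on a UTF-8 terminal) suggests that ord() deals in integer code points rather than bytes, so there is no leading 00 to drop:

>>> ord(u'\xc4')           # ord() on a unicode character returns its code point as an int
196
>>> hex(ord(u'\xc4'))      # the same value in hex: U+00C4
'0xc4'
>>> len(u'AÄ'), len('AÄ')  # 2 code points in the unicode string, 3 bytes in the UTF-8 byte string
(2, 3)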