On Wed, May 11, 2011 at 3:37 PM, harrismh777 <harrismh...@charter.net> wrote:
> hi folks,
> I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)

The `unicode' class was renamed to `str', and a stripped-down version of
the 2.X `str' class was renamed to `bytes'.

> I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?

Mainly, Python 3 no longer does implicit conversion between bytes and
unicode, so the programmer has to be explicit about such conversions.
If you have Python 2 code that is sloppy about this, you may get
TypeErrors or Unicode encode/decode errors when trying to run the same
code in Python 3. The 2to3 tool can help somewhat with this, but it
can't prevent all problems.

> On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly?

I think that UCS-2 has always been the default unicode width for CPython,
although the exact representation used internally is an implementation
detail.

> The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?

If you open a file in binary mode, the result is a non-decoded byte
stream. If you open a file in text mode and do not specify an encoding,
then the result of locale.getpreferredencoding() is used for decoding,
and the result is a unicode stream.

> If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?

You mean 0x7F (ASCII is 7-bit), and yes, it probably still matters,
because you still need to encode and decode explicitly.
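
To make the explicit-conversion point concrete, here is a minimal sketch
for Python 3. The sample string, the file name demo.txt and the choice of
UTF-8 are my own assumptions for illustration, not anything from your
setup.

```python
import locale
import os
import tempfile

# str holds text (code points); bytes holds raw 8-bit data.
text = "caf\u00e9"                 # hypothetical sample text
data = text.encode("utf-8")        # str -> bytes, always explicit
roundtrip = data.decode("utf-8")   # bytes -> str, always explicit

# Python 3 refuses to mix the two types implicitly.
try:
    text + data
except TypeError as exc:
    mixing_error = exc
else:
    mixing_error = None

# Even pure-ASCII data has to be decoded before it becomes a str.
ascii_text = b"abc".decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")   # hypothetical file name

    # Binary mode: bytes in, bytes out, nothing is decoded.
    with open(path, "wb") as f:
        f.write(data)
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode with an explicit encoding: the result is a str.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()

# The fallback described above for text mode without an encoding
# argument; the value depends on the platform and locale settings.
default_encoding = locale.getpreferredencoding()
```

Passing encoding= explicitly, as in the last open() call, avoids
depending on whatever the locale happens to report on a given machine.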