willie wrote:
>
> Thanks for the thorough explanation. One last question
> about terminology then I'll go away :)
> What is the proper way to describe "ustr" below?
>
> >>> ustr = buf.decode('UTF-8')
> >>> type(ustr)
> <type 'unicode'>
>
> Is it a "unicode object that contains a UTF-8 encoded
> string object?"

No. It is a Python unicode object, period.

1. If it did contain another object, you would be (quite justifiably)
   screaming your peripherals off about the waste of memory :-)

2. You don't need to concern yourself with the internals of a unicode
   object; however, rest assured that it is *not* stored as UTF-8, so
   if you are hoping for a quick "number of UTF-8 bytes without actually
   producing a str object" method, you are out of luck.

Consider this example: you have a str object which contains some
Russian text, encoded in cp1251.

    str1 = russian_text
    unicode1 = str1.decode('cp1251')
    str2 = unicode1.encode('utf-8')
    unicode2 = str2.decode('utf-8')

Then unicode2 == unicode1 and repr(unicode2) == repr(unicode1); there
is no way (without the above history) of determining how it was
created, and you don't need to care how it was created.

HTH,
John
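
P.S. Here is a self-contained version of the round trip, as a minimal sketch only. The sample text and the names `sample`, `same_value`, `same_repr`, `utf8_byte_count` and `char_count` are made up for illustration and are not part of the example in the post. It uses only `encode` and `decode`, so it should run unchanged on Python 2 and on Python 3.3 or later (where the unicode type is called `str` and byte strings are `bytes`).

```python
# Hypothetical sample text (an assumption for this sketch): a six-letter
# Russian word, written with escapes so the source file stays pure ASCII.
sample = u'\u041f\u0440\u0438\u0432\u0435\u0442'

# Byte string holding the text encoded in cp1251; this stands in for
# the russian_text value in the example.
str1 = sample.encode('cp1251')

unicode1 = str1.decode('cp1251')   # cp1251 bytes -> unicode object
str2 = unicode1.encode('utf-8')    # unicode object -> UTF-8 bytes
unicode2 = str2.decode('utf-8')    # UTF-8 bytes -> unicode object

# The two unicode objects carry no trace of the encoding they came from.
same_value = (unicode2 == unicode1)
same_repr = (repr(unicode2) == repr(unicode1))

# Counting UTF-8 bytes means actually producing the encoded byte string.
utf8_byte_count = len(unicode1.encode('utf-8'))
char_count = len(unicode1)
```

For this particular sample, `char_count` is 6, `str1` is 6 bytes long and `utf8_byte_count` is 12, since each of these Cyrillic letters takes one byte in cp1251 and two bytes in UTF-8.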