On Jun 19, 8:56 am, Paul Rubin <http://[EMAIL PROTECTED]> wrote:
> Python 2.5 (r25:51908, Oct 6 2006, 15:24:43)
> [GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import StringIO, cStringIO
> >>> StringIO.StringIO('a').getvalue()
> 'a'
> >>> cStringIO.StringIO('a').getvalue()
> 'a'
> >>> StringIO.StringIO(u'a').getvalue()
> u'a'
> >>> cStringIO.StringIO(u'a').getvalue()
> 'a\x00\x00\x00'
> >>>
>
> I would have thought StringIO and cStringIO would return the
> same result for this ascii-encodeable string.

Looks like a bug to me.

> Worse:
>
> >>> StringIO.StringIO(u'a').getvalue().encode('utf-8').decode('utf-8')
> u'a'
>
> does the right thing, but
>
> >>> cStringIO.StringIO(u'a').getvalue().encode('utf-8').decode('utf-8')
> u'a\x00\x00\x00'
>
> looks bogus. Am I misunderstanding something?

Not worse, no more bogus than before. Note that an explicit design
feature of utf8 is that ASCII characters (ord(c) < 128) are unchanged
by the transformation.

>>> 'a\x00\x00\x00'.encode('utf-8')  # IMPLICIT conversion to unicode (effectively .decode('ascii')), then encoding as utf8
'a\x00\x00\x00'  # no change to original buggy result
>>>
>>> 'a\x00\x00\x00'.decode('utf-8')
u'a\x00\x00\x00'  # as expected
>>>
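
For anyone who wants to check the point about UTF-8 leaving ASCII alone, here is a minimal sketch. It is written for Python 3, not the Python 2.5 used in the session above, so the byte string is spelled with a b prefix. The helper name utf8_preserves_ascii is made up for illustration. The last step rests on an assumption the thread does not state: that the odd bytes are simply the little-endian 4-byte (UCS-4) form of u'a', as a wide build on x86 might store it internally.

```python
# Python 3 sketch; the thread itself uses Python 2.5, where a plain
# string literal is already a byte string.

def utf8_preserves_ascii(data: bytes) -> bool:
    """Hypothetical helper: True if a UTF-8 decode/encode round trip
    leaves ASCII-only bytes unchanged."""
    if any(b >= 128 for b in data):
        raise ValueError("expects ASCII-only input")
    return data.decode("utf-8").encode("utf-8") == data

# The bytes cStringIO returned in the quoted session.
buggy = b"a\x00\x00\x00"

# Every byte is below 128, so UTF-8 maps each one to itself.
roundtrip_ok = utf8_preserves_ascii(buggy)

# Assumption (not stated in the thread): the same four bytes are what
# you get by encoding 'a' as little-endian UTF-32, with no BOM.
ucs4_le = "a".encode("utf-32-le")

print(roundtrip_ok)
print(ucs4_le == buggy)
```

If that assumption holds, the encode/decode chain in the second example is just passing the already-wrong bytes through untouched, which fits the "no more bogus than before" reading.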