Hallvard B Furuseth <h.b.furus...@usit.uio.no> writes:
> I've been playing a bit with Python 3.2a2, and frankly its charset
> handling looks _less_ safe than in Python 2.
>
> The offender is bytes.__str__: str(b'foo') == "b'foo'".
> It's often not clear from looking at a piece of code whether some data
> is treated as strings or bytes, particularly when translating from old
> code. Which means one cannot see from context whether str(s) or
> "%s" % s will produce garbage.
>
> With 2.<late> Unicode <-> string conversion, the equivalent operation
> did not silently produce garbage: it raised UnicodeError instead. With
> old raw Python strings that was not a problem in applications which did
> not need to convert any charsets; with Python 3 they can break.
>
> I really wish bytes.__str__ would at least fail by default.
I think you misunderstand the purpose of str(). It is to provide a
(unicode) string representation of an object and has nothing to do with
converting it to unicode:

>>> b = b"\xc2\xa3"
>>> str(b)
"b'\\xc2\\xa3'"

If you want to *decode* a bytes string, use its decode method and you
get a unicode string (if your bytes string is a valid encoding):

>>> b = b"\xc2\xa3"
>>> b.decode('utf8')
'£'
>>> b.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

If you want to *encode* a (unicode) string, use its encode method and
you get a bytes string (provided your string can be encoded using the
given encoding):

>>> s = "€"
>>> s.encode('utf8')
b'\xe2\x82\xac'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 0: ordinal not in range(128)

-- 
Arnaud

-- 
http://mail.python.org/mailman/listinfo/python-list
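[Editor's note: the "fail by default" behaviour the original poster wishes for is available as an opt-in in CPython: the -b command-line flag emits a BytesWarning for str() on a bytes instance, and -bb turns that warning into an error. A minimal sketch (not part of the original thread) demonstrating this via a subprocess:]

```python
import subprocess
import sys

# Run str() on a bytes instance under -bb: CPython's -b flag emits a
# BytesWarning for str(bytes), and doubling it (-bb) promotes the
# warning to an error, so the call raises instead of yielding "b'foo'".
proc = subprocess.run(
    [sys.executable, "-bb", "-c", "str(b'foo')"],
    capture_output=True,
    text=True,
)

print(proc.returncode != 0)            # True: str(b'foo') raised under -bb
print("BytesWarning" in proc.stderr)   # True: the traceback names BytesWarning
```

Without -b/-bb the same call silently returns the repr string, which is exactly the silent-garbage behaviour being complained about.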