David Hughes wrote:
> I used this function successfully with Python 2.4 to alter the encoding
> of a set of database records from latin-1 to utf-8, but the same
> program raises an exception using Python 2.5. This small example shows
> the problem:
>
> import codecs
> fo = open('test.dat', 'w')
> fo.write('G\xe2teaux')
> fo.close()
>
> fi = open("test.dat",'r')
> fx = codecs.EncodedFile(fi, 'utf-8', 'latin-1')
> astring = fx.readline()
> print astring
> ustring = unicode(astring, 'utf-8' )
> print repr(ustring)
> print ustring.encode('latin-1')
> print ustring.encode('utf-8')
>
> Python 2.4 gives:
>
> Gâteaux
> u'G\xe2teaux'
> Gâteaux
> Gâteaux
>
> which I believe is correct, while 2.5 produces
>
> Traceback (most recent call last):
>   File "test_codec.py", line 8, in <module>
>     astring = fx.readline()
>   File "C:\Python25\lib\codecs.py", line 709, in readline
>     data = self.reader.readline()
>   File "C:\Python25\lib\codecs.py", line 471, in readline
>     data = self.read(readsize, firstline=True)
>   File "C:\Python25\lib\codecs.py", line 418, in read
>     newchars, decodedbytes = self.decode(data, self.errors)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3:
> invalid data
>
> Is there a genuine problem here, or have I been misusing this function?

This is indeed a bug in Python 2.5. Fixed in subversion.

http://svn.python.org/view/python/trunk/Lib/codecs.py?rev=52517&view=log

Peter
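
For anyone who cannot pick up the fix yet, one possible workaround is to do the recoding by hand instead of going through codecs.EncodedFile. The following is only a minimal sketch: `latin1_to_utf8` is a hypothetical helper name (not part of the codecs module), and it assumes each record is available as a byte string read from the file. Since it uses only the plain decode/encode string methods, it should not touch the readline code path shown in the traceback.

```python
def latin1_to_utf8(raw):
    """Transcode a Latin-1 byte string to UTF-8 bytes.

    Hypothetical helper for illustration; not part of the codecs module.
    """
    # Decode the Latin-1 bytes to a unicode string, then re-encode as UTF-8.
    return raw.decode('latin-1').encode('utf-8')


# Same sample data as in the original report.
sample = b'G\xe2teaux'
converted = latin1_to_utf8(sample)
```

Here `converted` holds the UTF-8 bytes for the record; decoding them as 'utf-8' should give back the same unicode string as in the Python 2.4 output quoted above.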