Simon Willison wrote:
> Hello,
>
> I'm using ElementTree to parse an XML file which includes some data
> encoded as cp1252, for example:
>
> <name>Bob\x92s Breakfast</name>
>
> If this was a regular bytestring, I would convert it to utf8 using the
> following:
>
> >>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
> Bob's Breakfast
>
> But ElementTree gives me back a unicode string, so I get the following
> error:
>
> >>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
>     return codecs.charmap_decode(input,errors,decoding_table)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 3: ordinal not in range(128)
>
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?

I don't get your problem. You get a unicode object, which means that
ET has already decoded it for you, as any XML parser must do. So why
don't you get rid of that .decode('cp1252') and happily encode it to
utf-8?

Diez
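
For illustration, here is a minimal sketch of the two options discussed in this thread. It is written for Python 3, where str is the unicode type and bytes is the bytestring type; the thread itself uses Python 2.5. The helper names to_utf8 and repair_cp1252 are hypothetical, not part of ElementTree. The second helper rests on an assumption: that the parsed text holds the code point U+0092 because cp1252 bytes were read as if they were latin-1. If that is not what happened to your data, only the first helper applies.

```python
# Python 3 sketch: str is already Unicode text, bytes is the raw form.

def to_utf8(text: str) -> bytes:
    """Encode already-decoded text as UTF-8. No .decode() step is needed."""
    return text.encode("utf-8")


def repair_cp1252(text: str) -> str:
    """Hypothetical helper: reinterpret the code points as cp1252 bytes.

    Assumes every character is in the range U+0000 to U+00FF, i.e. the
    text was decoded as latin-1 when it was really cp1252. The encode step
    raises UnicodeEncodeError on characters above U+00FF, and the decode
    step raises UnicodeDecodeError on bytes that cp1252 leaves undefined.
    """
    return text.encode("latin-1").decode("cp1252")


parsed = "Bob\x92s Breakfast"   # the string from the question
fixed = repair_cp1252(parsed)   # U+0092 becomes U+2019 under the assumption above

print(ascii(fixed))             # ascii() keeps the output safe on any terminal
print(to_utf8(fixed))
```

Encoding the parsed string directly, as suggested above, always succeeds, but it keeps U+0092 as it is. The latin-1 round trip is what turns it into the apostrophe that cp1252 byte 0x92 stands for.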