On 05/15/10 10:27, Adam Tauno Williams wrote:
> I'm trying to process OpenStep plist files in Python. I have a parser
> which works, but only for strict ASCII. However plist files may contain
> accented characters - equivalent to ISO-8859-2 (I believe). For example
> I read in the line:
>
> >>> handle = open('file.txt', 'rb')
> >>> data = handle.read()
> >>> handle.close()
> >>> data
> ' "skyp4_filelist_10201/localit\xc3\xa0 termali_sortfield" = NSFileName;\n'

I presume you're using Python 2.x.

> What is the correct way to re-encode this data into UTF-8 so I can use
> unicode strings, and then write the output back to ISO8859-?
>
> I can read the file using codecs as ISO8859-2, but it still doesn't seem
> correct.
>
> >>> handle = codecs.open('file.txt', 'rb', encoding='iso8859-2')
> >>> data = handle.read()
> >>> handle.close()
> >>> data
> u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" = NSFileName;\n'

When you evaluate an expression in the interactive interpreter, Python
displays its repr() by default. If you want the str() form, use
"print data" (note that your terminal must support printing Unicode
characters). An escape such as '\u0102' is only how repr() displays a
non-ASCII character; the unicode object in memory holds the character
itself.

In this case, though, the decoded character is not the one you want. The
bytes '\xc3\xa0' in your first example are the UTF-8 encoding of 'à', so
the file appears to be UTF-8 already; decoding it as ISO-8859-2 is what
turns those two bytes into u'\u0102\xa0'. Try encoding='utf-8' when
reading.

To convert a file from one encoding to another, open the input with its
actual encoding and the output with the encoding you want:

    import codecs

    # Use the input file's actual encoding here.
    f = codecs.open("in.txt", 'rb', encoding="iso8859-2")
    f2 = codecs.open("out.txt", 'wb', encoding="utf-8")
    s = f.read()    # bytes are decoded into a unicode string
    f2.write(s)     # the unicode string is encoded as UTF-8 on write
    f.close()
    f2.close()
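
If it helps, here is a rough sketch of how one might check which decoding
fits the sample line and then do the conversion with a small function. It
assumes Python 2.6 or later (it should also run unchanged on Python 3), and
"transcode" is a hypothetical helper name of mine, not something from the
standard library or from the original post.

```python
import io

# The accented part of the sample line from the first example, as raw bytes.
raw = b'localit\xc3\xa0 termali'

# Decoding as UTF-8 yields a single character, U+00E0 (a with grave accent).
as_utf8 = raw.decode('utf-8')

# Decoding as ISO-8859-2 yields two characters, U+0102 and U+00A0,
# which is the output shown in the second example.
as_latin2 = raw.decode('iso8859-2')


def transcode(src_path, dst_path, src_encoding, dst_encoding):
    """Hypothetical helper: copy src_path to dst_path in another encoding."""
    # newline='' keeps the original line endings untouched in both directions.
    with io.open(src_path, 'r', encoding=src_encoding, newline='') as src:
        text = src.read()
    with io.open(dst_path, 'w', encoding=dst_encoding, newline='') as dst:
        dst.write(text)
```

For example, transcode('file.txt', 'out.txt', 'utf-8', 'iso8859-1') would
write the sample line back out as ISO-8859-1. Note that ISO-8859-2 has no
'à', so choosing it as the output encoding for this particular text would
raise UnicodeEncodeError.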