On Sat, 2010-05-15 at 20:30 +1000, Lie Ryan wrote:
> On 05/15/10 10:27, Adam Tauno Williams wrote:
> > I'm trying to process OpenStep plist files in Python. I have a parser
> > which works, but only for strict ASCII. However plist files may contain
> > accented characters - equivalent to ISO-8859-2 (I believe). For example
> > I read in the line:
> > >>> handle = open('file.txt', 'rb')
> > >>> data = handle.read()
> > >>> handle.close()
> > >>> data
> > ' "skyp4_filelist_10201/localit\xc3\xa0 termali_sortfield" = NSFileName;\n'
>
> I presume you're using Python 2.x.

Yes. But the days of all-unicode-strings will be wonderful when it
comes. :)

> > What is the correct way to re-encode this data into UTF-8 so I can use
> > unicode strings, and then write the output back to ISO8859-?
> > I can read the file using codecs as ISO8859-2, but it still doesn't seem
> > correct.
> > >>> handle = codecs.open('file.txt', 'rb', encoding='iso8859-2')
> > >>> data = handle.read()
> > >>> handle.close()
> > >>> data
> > u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" = NSFileName;\n'
>
> When printing in the interactive interpreter, python uses __repr__
> representation by default. If you want to use __str__ representation use
> "print data" (note, your terminal must support printing unicode
> characters);

Using GNOME Terminal, so Unicode characters should display correctly.
And I do see the characters when I 'cat' the file.

> either way, even though the string looks like '\u0102' when
> printed on the terminal, the binary pattern inside the memory should
> correctly represents the accented character.

Yep. But in the interpreter both unicode() and repr() produce the same
output. Nothing displays the accented character.

h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
data = h.read()
h.close()
str(data)
'ascii' codec can't encode characters in position 33-34: ordinal not in range(128)
unicode(data)
u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" = NSFileName;\n'
repr(data)
'u\' "skyp4_filelist_10201/localit\\u0102\\xa0 termali_sortfield" = NSFileName;\\n\''

I think I'm getting close. Parsing the file seems to work, and while
writing it out does not error, rereading my own output fails. :(
Possibly I'm 'accidentally' writing the output as UTF-8 and not
ISO8859-2. I need the internal data to be UTF-8 but read as ISO8859-2
and rewritten back to ISO8859-2 [at least that is what I believe from
the OpenStep files I'm seeing].

What is the 'official' way to encode something from UTF-8 to another
code page.
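
The two interpreter sessions above show the same two bytes, \xc3\xa0, read in two different ways. As a small sketch (not something stated in the thread), the snippet below decodes just that fragment both as ISO-8859-2 and as UTF-8 so the results can be compared. The variable names are illustrative, and the idea that the file might actually be UTF-8 is only an assumption suggested by the bytes, not a confirmed fact about OpenStep plists.

```python
# The accented fragment of the raw bytes shown in the first session.
raw = b'localit\xc3\xa0'

# Decoding as ISO-8859-2 maps each byte to one character:
# 0xC3 -> U+0102 (A with breve), 0xA0 -> U+00A0 (no-break space).
as_latin2 = raw.decode('iso8859-2')

# Decoding as UTF-8 treats the two bytes as one character: U+00E0 (a with grave).
as_utf8 = raw.decode('utf-8')

# U+00E0 is not part of ISO-8859-2, so the UTF-8 reading cannot be
# written back out with that codec.
try:
    as_utf8.encode('iso8859-2')
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

If the UTF-8 reading is the intended one, the text would have to be written back as UTF-8 as well, since ISO-8859-2 has no slot for that character.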

I *assumed* that if I wrote a unicode stream back through:

h = codecs.open(output_filename, 'wb', encoding='iso8859-2')
data = writer.store(defaults)
h.write(data)
h.close()

that it would be re-encoded [word?]. But maybe not?

> f = codecs.open("in.txt", 'rb', encoding="iso8859-2")
> f2 = codecs.open("out.txt", 'wb', encoding="utf-8")
> s = f.read()
> f2.write(s)
> f.close()
> f2.close()

-- 
Adam Tauno Williams <awill...@whitemice.org> LPIC-1, Novell CLA
<http://www.whitemiceconsulting.com>
OpenGroupware, Cyrus IMAPd, Postfix, OpenLDAP, Samba
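
One way to check whether a write really was re-encoded is to read the raw bytes back and compare them with an explicit encode of the same string. The sketch below is a minimal round trip under a few assumptions: it uses io.open (available in Python 2.6+ and 3) in place of the codecs.open calls in the thread, the file name out.plist is hypothetical, and the text is the unicode value shown in the earlier session rather than real writer output.

```python
import io
import os
import tempfile

# The unicode string produced by reading the sample line as ISO-8859-2.
text = u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" = NSFileName;\n'

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'out.plist')  # hypothetical file name

# A text-mode handle with an encoding encodes unicode on the way out.
# newline='' disables newline translation so the bytes are comparable.
with io.open(path, 'w', encoding='iso8859-2', newline='') as h:
    h.write(text)

# Read the raw bytes exactly as stored on disk.
with io.open(path, 'rb') as h:
    raw = h.read()

# Read again through the same codec to get the unicode string back.
with io.open(path, 'r', encoding='iso8859-2', newline='') as h:
    reread = h.read()

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmpdir)
```

Comparing raw against text.encode('iso8859-2') shows whether the handle wrote the expected code page; a mismatch would point at the data being encoded somewhere before it reached the handle.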
"skyp4_filelist_10201/località termali_sortfield" = NSFileName;
-- http://mail.python.org/mailman/listinfo/python-list