Fredrik Lundh wrote:
> Anton Vredegoor wrote:
>
>> I'm trying to import text from an open office document (save as .sxw and
>> read the data from content.xml inside the sxw-archive using
>> elementtree and such tools).
>>
>> The encoding that gives me the least problems seems to be cp1252,
>> however it's not completely perfect because there are still characters
>> in it like \93 or \94. Has anyone handled this before?
>
> this might help:
>
> http://effbot.org/zone/unicode-gremlins.htm

Thanks a lot! The code below not only made the strange chars go away,
but it also fixed the xml-parsing errors ...

Maybe it's useful to someone else too; use at your own risk though.

Anton

from gremlins import kill_gremlins
from zipfile import ZipFile, ZIP_DEFLATED

def repair(infn, outfn):
    # copy every archive member to the new file, cleaning only content.xml
    zin = ZipFile(infn, 'r')
    zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
    for x in zin.namelist():
        data = zin.read(x)
        if x == 'content.xml':
            zout.writestr(x, kill_gremlins(data).encode('cp1252'))
        else:
            zout.writestr(x, data)
    zout.close()
    zin.close()

def test():
    infn = "xxxx.sxw"
    outfn = 'dg.sxw'
    repair(infn, outfn)

if __name__ == '__main__':
    test()
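
For anyone reading this later without the gremlins module at hand, here is a rough Python 3 sketch of the same idea. It is not the effbot code: fix_c1 and repair_sxw are hypothetical names, and the sketch assumes content.xml is UTF-8 (so it writes UTF-8 back rather than cp1252 as in the script above). The mapping itself comes from Python's own cp1252 codec, so characters in the range \x80-\x9f such as \x93 and \x94 become the curly quotes they were meant to be.

```python
import re
import zipfile

# Code points U+0080..U+009F are C1 control characters; in text that was
# really cp1252 they usually stand for quotes, dashes and similar symbols.
_C1 = re.compile("[\x80-\x9f]")

def fix_c1(text):
    """Replace stray C1 characters with their cp1252 meaning (hypothetical helper)."""
    def fixup(match):
        ch = match.group(0)
        try:
            return bytes([ord(ch)]).decode("cp1252")
        except UnicodeDecodeError:
            # a few bytes (e.g. 0x81) are undefined in cp1252: leave them alone
            return ch
    return _C1.sub(fixup, text)

def repair_sxw(src, dst, member="content.xml"):
    """Copy a zip archive, cleaning one member (assumed here to be UTF-8 XML)."""
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == member:
                data = fix_c1(data.decode("utf-8")).encode("utf-8")
            zout.writestr(info.filename, data)
```

Both src and dst may be file names or file-like objects, e.g. repair_sxw("xxxx.sxw", "dg.sxw"). Same caveat as before: try it on a copy first.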