On Fri, Oct 16, 2009 at 5:07 PM, Stef Mientki <stef.mien...@gmail.com> wrote:
Unfortunately, there is no simple answer to these questions.

> Thanks guys,
> I didn't know the codecs module,
> and the codecs seems to be a good solution,
> at least it can safely write a file.
> But now I have to open that file in Excel 2000 ... 2007,
> and I get something completely wrong.
> After changing codecs to latin-1 or windows-1252,
> everything works fine.
>
> Which of the 2 should I use, latin-1 or windows-1252?

You should use the encoding that the file is expected to be in; it was saved in a certain, explicit encoding. You may be able to find out what that is on the web, but you'll have to find it and use it. Every file may be different. There's no universal right answer, and there's not really any way to tell what the answer SHOULD be, short of trying various encodings until one works. Doing research to find out what this other program saves or expects to open is all you can do.

It wouldn't surprise me if Excel used cp1252 by default; that's vaguely sorta like ISO-8859-1 (also known as latin-1), except in the high byte range. The two are similar enough that they get confused in a lot of software, with odd results.

The thing is, I'd be VERY surprised (nay, shocked!) if Excel can't open a file that is in UTF-8 -- it just might need to be TOLD that it's UTF-8 when you go and open the file, as UTF-8 looks just like ASCII -- until it contains characters that can't be expressed in ASCII. But I don't know what type of file it is you're saving.

> And a more general question, how should I organize my Python programs?
> In general I've data coming from Excel, Delphi, SQLite.
> In Python I always use wxPython, so I'm forced to use unicode.
> My output often needs to be exported to Excel, SPSS, SQLite.
> So would this be a good design?

I have no idea what SPSS is, but in general the way I handle these issues is by following these rules:

- Convert to unicode from the earliest possible point; the moment data gets into my code, I convert it to unicode.
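To make that first rule concrete, here's a minimal sketch of a decode-at-the-boundary helper. The helper name and the UTF-8-then-cp1252 fallback heuristic are mine, not from the original post; note also that the post's Python 2 `unicode(data, enc)` call is spelled `data.decode(enc)` in modern Python 3, where the unicode type is simply `str`:

```python
# Hypothetical helper: decode raw bytes the moment they enter the program.
def to_unicode(data, source_encoding=None):
    if isinstance(data, str):          # already unicode -- nothing to do
        return data
    if source_encoding is not None:    # we *know* the encoding: use it
        return data.decode(source_encoding)
    # Last-resort heuristic: try UTF-8 first (it fails loudly on most
    # non-UTF-8 byte sequences), then fall back to cp1252, which maps
    # nearly every byte to *some* character.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")

print(to_unicode(b"caf\xc3\xa9"))   # valid UTF-8 -> 'café'
print(to_unicode(b"caf\xe9"))       # not valid UTF-8; cp1252 fallback -> 'café'
```

The explicit `source_encoding` path is the one you should be on whenever possible; the fallback exists only because, as described below, sometimes you genuinely can't know.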
- There have to be some heuristics / intelligent tests to determine HOW you convert it to unicode: you really have to /know/ beforehand what the data was encoded in, in order to do so. You can often assume it's ASCII, but unfortunately, that only works until the moment it's not. And it will eventually be not, guaranteed; you will have to base this decision on the type of source you're getting the data from. Is it from a file, and if so, what kind of file? Does the program which produced it always write out a certain encoding? Or is it variable? Is it something user-specified (e.g., in an environment variable or preference)? Regardless, the moment you get data, convert it into unicode, with unicode(data, "<original-encoding>") ... the original-encoding is whatever you determine the encoding of the data was before you got it.

- All private storage should be stored as unicode, encoded as UTF-8. Private meaning: other programs you don't control don't mess with it. This should include data files AND databases -- you should be storing unicode as UTF-8 in SQLite. See the 'pragma encoding' instruction at http://www.sqlite.org/pragma.html

- Do all processing in your program in unicode.

- Encode the data at the last possible moment, during the output process, according to whatever it needs to be; if at all possible, encode at output as UTF-8 if other programs can handle it, as life will one day be better when all programs can be on the same page here. But that's not always possible: when it's not, be sure the decision of what encoding to use when writing the data out is something your program remembers or can determine at a later point -- so that when/if you need to read it back in, you know what encoding it was written out in.

- Be prepared, when writing data out, to experience an error if the internal unicode data contains a character which can't be expressed in the limited output encoding, if you're forced to use something non-UTF-8 like latin-1 or cp1252.
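That last failure mode is easy to demonstrate. A small sketch (the euro-sign sample text is my own illustration, not from the post; the euro sign exists in cp1252 but not in latin-1, which makes it a handy test case):

```python
# What happens when unicode text hits a too-small output encoding.
line = "price: 100 \u20ac"          # U+20AC EURO SIGN -- not in latin-1

try:
    line.encode("latin1")           # strict (default) mode: raises
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)

# Error handlers, as discussed below:
print(line.encode("latin1", "ignore"))             # offending char dropped
print(line.encode("latin1", "xmlcharrefreplace"))  # b'price: 100 &#8364;'
```

The same `errors` argument works on any codec, so the handler choice is independent of which legacy encoding you're stuck with.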
If you're constrained to having to support these limited character sets, then you're going to have to make sure you handle that situation gracefully -- either by using an error handler when you encode (e.g., line.encode("latin1", "ignore"), which will exclude any characters that can't be handled in latin-1, or -- what I usually prefer -- line.encode("latin1", "xmlcharrefreplace"), which will replace the non-working characters with &#1234;-style numeric character references), or by including validators in your UI which reject characters that can't fit into the desired encoding. Even with that validator, use unicode strings internally.

- Mourn for the good ol' days when you could actually imagine the pleasant fiction of something such as 'plain text' existing -- it never really has existed. :)

So: I'd use unicode entirely -- in wxPython, stored in your database, and you should be able to use it in Delphi too, I believe? It's been years since I used Delphi, though. At the point where there's a barrier between 'your' stuff and 'other' stuff, you convert from unicode into an encoding -- if you must.

Anyways. That's just how I handle it, and I rarely run into problems -- only really when dealing with some new file format from strange old systems, where it's not obvious what encoding it's in due to poor documentation and naïve implementations.

--
Stephen Hansen
Development
Advanced Prepress Technology
shan...@advpubtech.com
(818) 748-9282
-- http://mail.python.org/mailman/listinfo/python-list