Stephen Hansen wrote:
On Thu, Oct 15, 2009 at 4:43 PM, Stef Mientki <stef.mien...@gmail.com> wrote:

    hello,

    By writing the following unicode string (I hope it can be sent on
    this mailing list)

      Bücken

    to a file

       fh.write(line)

    I get the following error:

     UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc'
    in position 9: ordinal not in range(128)

    How should I write such a string to a file?


First, you have to understand that a file never really contains unicode -- not in the way that it exists in memory / in Python when you type line = u'Bücken'. It contains a series of bytes that are an encoded form of that abstract unicode data.

There are various encodings you can use -- UTF-8 and UTF-16 are in my experience the most common. UTF-8 is an ASCII superset, and it's the one I see most often.
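To make the difference concrete, here is a minimal sketch showing that the same unicode string yields different byte sequences depending on the encoding chosen (using the string from the question):

```python
# The same unicode string, encoded two ways, yields different bytes.
text = u'B\xfccken'  # "Bücken", the string from the question

utf8_bytes = text.encode('utf-8')    # 'ü' (U+00FC) becomes the two bytes 0xC3 0xBC
utf16_bytes = text.encode('utf-16')  # two bytes per character, preceded by a BOM

assert utf8_bytes == b'B\xc3\xbccken'
assert len(utf8_bytes) == 7 and len(utf16_bytes) == 14
```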

So, you can do:

  import codecs
  f = codecs.open('filepath', 'w', 'utf-8')
  f.write(line)

To read such a file, you'd do codecs.open as well, just with a 'r' mode and not a 'w' mode.
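Putting the write and read halves together, a minimal round trip might look like this (the file path is made up for the demonstration):

```python
import codecs
import os
import tempfile

# Hypothetical path, used only for this demonstration.
path = os.path.join(tempfile.gettempdir(), 'unicode_demo.txt')

line = u'B\xfccken'

# Write: codecs.open encodes the unicode string to UTF-8 bytes on the way out.
f = codecs.open(path, 'w', 'utf-8')
f.write(line)
f.close()

# Read: the same codec decodes the bytes back into a unicode string.
f = codecs.open(path, 'r', 'utf-8')
round_tripped = f.read()
f.close()

assert round_tripped == line
```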
Thanks guys,
I didn't know about the codecs module,
and codecs seems to be a good solution,
at least it can safely write a file.
But now I have to open that file in Excel 2000 ... 2007,
and I get something completely wrong.
After changing the codec to latin-1 or windows-1252,
everything works fine.

Which of the two should I use, latin-1 or windows-1252?
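For what it's worth, the two encodings agree on this particular string: windows-1252 differs from latin-1 only in the 0x80-0x9F range, where it replaces C1 control codes with printable characters such as the euro sign. A small sketch of the difference:

```python
text = u'B\xfccken'

# 'ü' (0xFC) has the same byte value in both encodings,
# so for this string the two codecs are interchangeable.
assert text.encode('latin-1') == text.encode('windows-1252')

# The euro sign exists in windows-1252 (byte 0x80) but not in latin-1.
euro = u'\u20ac'
assert euro.encode('windows-1252') == b'\x80'
try:
    euro.encode('latin-1')
except UnicodeEncodeError:
    pass  # expected: latin-1 has no euro sign
```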

And a more general question: how should I organize my Python programs?
In general I have data coming from Excel, Delphi, SQLite.
In Python I always use wxPython, so I'm forced to use unicode.
My output often needs to be exported to Excel, SPSS, SQLite.
So would this be a good design?

Excel    |      convert        wxPython      convert        Excel
Delphi   |===>    to      ===>   in     ===>   to      ===> SQLite
SQLite   |      unicode        unicode       latin-1        SPSS

thanks,
Stef Mientki


Now, that uses a file object created with the "codecs" module, which operates on unicode streams. It will automatically take any passed-in unicode strings, encode them in the specified encoding (utf-8), and write the resulting bytes out.

You can also do that manually with a regular file object, via:

  f.write(line.encode("utf8"))

If you are reading such a file later with a normal file object (i.e., not one created with codecs.open), you would do:

  f = open('filepath', 'rb')
  byte_data = f.read()
  uni_data = byte_data.decode("utf8")

That will convert the byte-encoded data back to real unicode strings. Be sure to do this, even if it doesn't seem like you need to, if the file contains encoded unicode data (a thing you can only know based on the documentation of whatever produced that file). For example, a UTF-8 encoded file might look and work like a completely normal ASCII file, but if it's really UTF-8, eventually your code will break that one time someone puts in a non-ASCII character. Since UTF-8 is an ASCII superset, it's indistinguishable from ASCII until it contains a non-ASCII character.
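That last point can be demonstrated directly: ASCII-only text produces identical bytes under both codecs, and decoding UTF-8 bytes as ASCII only fails once a non-ASCII character shows up.

```python
# Pure ASCII text encodes to the same bytes under ASCII and UTF-8,
# so the two encodings are indistinguishable at this point.
ascii_text = u'Becken'
assert ascii_text.encode('ascii') == ascii_text.encode('utf-8')

# Once a non-ASCII character appears, decoding as ASCII breaks:
data = u'B\xfccken'.encode('utf-8')
try:
    data.decode('ascii')
except UnicodeDecodeError:
    pass  # the 0xC3 byte is outside range(128)

# Decoding with the correct codec recovers the original string.
assert data.decode('utf-8') == u'B\xfccken'
```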


HTH,

--S

--
http://mail.python.org/mailman/listinfo/python-list
