[EMAIL PROTECTED] wrote:
>
> I am having great problems writing norwegian characters æøå to file
> from a python application. My (simplified) scenario is as follows:
>
> 1. I have a web form where the user can enter his name.
>
> 2. I use the cgi module to get to the input from the user:
>    ....
>    name = form["name"].value
The cgi module should produce plain strings, not Unicode objects, which
makes some of the later behaviour quite "interesting".

> 3. The name is stored in a file
>
>    fileH = open(namefile, "a")
>    fileH.write("name:%s \n" % name)
>    fileH.close()
>
> Now, this works very well indeed as long as the users have 'ascii' names,
> however when someone enters a name with one of the norwegian characters
> æøå - it breaks at the write() statement.
>
>    UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position

This is odd, since writing plain strings to files shouldn't involve any
Unicode conversions. If you received a plain string from the cgi module,
the text you write to the file should still be a plain string - you are
merely obtaining a sequence of bytes and passing them around. Perhaps
your Python configuration is different in some non-standard way,
although I wouldn't want to point the finger at anything in particular
(sys.getdefaultencoding might suggest something, though).

> Now - I understand that the ascii codec can't be used to decode the
> particular characters, however my attempts at specifying an alternative
> encoding have all failed.
>
> I have tried variants along the lines of:
>
>    fileH = codecs.open(namefile, "a", "latin-1") / fileH = open(namefile, "a")
>    fileH.write(name) / fileH.write(name.encode("latin-1"))
>
> It seems *whatever* I do the Python interpreter fails to see my plea
> for an alternative encoding, and fails with the dreaded
> UnicodeDecodeError.

To use a file opened through codecs.open, you really should present
Unicode objects to the write method. Otherwise, the method will try to
convert the plain string (which the name object supposedly is) to
Unicode automatically, and this conversion will assume that the string
only contains ASCII characters (as is Python's default behaviour), thus
causing the error you are seeing.
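A quick sketch of that implicit-conversion failure, written for modern
Python where the bytes/text distinction is explicit (the name 'Bjørn' is
just an illustrative stand-in for your form data):

```python
data = b"Bj\xf8rn"  # 'Bjørn' as latin-1 bytes; 0xf8 is outside ASCII

# This is what the implicit conversion amounts to - and why it fails:
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xf8 ...

# Stating the encoding explicitly succeeds:
print(data.decode("latin-1"))
```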
Only after getting the text as a Unicode object will the method then try
to encode the text in the specified encoding in order to write it to the
file. In other words, you'll see this behaviour:

    name (plain string) -> Unicode object -> encoded text (written to file)

Or rather, in the failure case:

    name (plain string) -> error! (couldn't produce the Unicode object)

As Peter Otten suggests, you could first make the Unicode object
yourself, stating explicitly that the name object contains "latin-1"
characters. In other words:

    name (plain string) -> Unicode object

Then, the write method has an easier time:

    Unicode object -> encoded text (written to file)

All this seems unnecessary for your application, I suppose, since you
know (or believe) that the form values only contain "latin-1"
characters. However, as is the standard advice on such matters, you may
wish to embrace Unicode more eagerly, converting plain strings to
Unicode as soon as possible and only converting them back to text in
various encodings when writing them out.

In some Web framework APIs, the values of form fields are immediately
available as Unicode without any (or much) additional work. WebStack
returns Unicode objects for form fields, as does the Java Servlet API,
but I'm not particularly aware of many other Python frameworks which
enforce or promote such semantics.

Paul

-- 
http://mail.python.org/mailman/listinfo/python-list
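To make the "decode early, encode late" flow concrete, here is a minimal
sketch in modern Python terms (the raw bytes and the "names.txt"
filename are hypothetical stand-ins for your form value and namefile):

```python
import codecs

raw_name = b"\xe6\xf8\xe5"         # 'æøå' as latin-1 bytes, e.g. from the form
name = raw_name.decode("latin-1")  # plain string/bytes -> Unicode object, once

# The codecs-opened file accepts the Unicode object and performs the
# Unicode -> latin-1 encoding itself on the way out:
with codecs.open("names.txt", "a", "latin-1") as fileH:
    fileH.write("name:%s \n" % name)
```

The point of doing the decode up front is that everything between the
form and the file deals only in Unicode, and the encoding decision is
made exactly once, at the file boundary.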