On Dec 12, 4:46 am, "weheh" <[EMAIL PROTECTED]> wrote:
> Hi John:
> Thanks for responding.
>
> >Look at your file using
> >
> >     print repr(open('c:/test/spanish.txt','rb').read())
> >
> >If you see 'a\xf1o' then use charset="windows-1252"
>
> I did this ... no change ... still see 'a\xf1o'

So it's not utf-8, it's windows-1252, so stop lying to browsers: like I said, use charset="windows-1252"

> >else if you see 'a\xc3\xb1o' then use charset="utf-8" else ????
> >Based on your responses to Martin, it appears that your file is
> >actually windows-1252 but you are telling browsers that it is utf-8.
> >Another check: if the file is utf-8, then doing
> >
> >     open('c:/test/spanish.txt','rb').read().decode('utf8')
> >
> >should be OK; if it's not valid utf8, it will complain.
>
> No. this causes decode error:
>
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-4: invalid data

No what? YES, the "decode error" is complaining that the data supplied is NOT valid utf-8 data. So it's not utf-8, it's windows-1252, so stop lying to browsers: like I said, use charset="windows-1252"

> args = ('utf8', 'a\, 1, 5, 'invalid data')
> encoding = 'utf8'
> end = 5
> object = 'a\xf1o'
> reason = 'invalid data'
> start = 1
>
> >Yet another check: open the file with Notepad. Do File/SaveAs, and
> >look at the Encoding box -- ANSI or UTF-8?
>
> Notepad says it's ANSI

That's correct (in Microsoft jargon) -- it's NOT utf-8. It's windows-1252, so stop lying to browsers: like I said, use charset="windows-1252"

> Thanks. What now?

Listen to the Bellman: "What I tell you three times is true". Your file is encoded using windows-1252, NOT utf-8. You need to use charset="windows-1252".

> Also, this is a general problem for me, whether I read
> from a file or read from an html text field, or read from an html text area.
> So I'm looking for a general solution. If it helps to debug by reading from
> textarea or text field, let me know.

If you are creating a file, you should know what its encoding is. As I said earlier, *every* file is encoded -- so-called "Unicode" files on Windows are encoded using utf16le. If you don't explicitly specify the encoding, it will typically be the default encoding for your locale (e.g. cp1252 in Western Europe etc).
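
To make the checks above concrete, here is a minimal sketch. It is written for Python 3, unlike the Python 2 one-liners quoted earlier. The helper name guess_charset is made up for illustration, and the fallback to windows-1252 is an assumption that only makes sense for Western European text; it is not real encoding detection.

```python
# Python 3 sketch. guess_charset is a hypothetical helper, not a library function.
def guess_charset(data: bytes) -> str:
    """Return 'utf-8' if data decodes cleanly as UTF-8, else 'windows-1252'."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8. Falling back to windows-1252 is an ASSUMPTION
        # suited to Western European text, not a detection result.
        return 'windows-1252'
    return 'utf-8'


word = 'a\u00f1o'  # the Spanish word from the test file

# The same text produces different bytes under different encodings.
as_cp1252 = word.encode('windows-1252')  # one byte for the n-tilde
as_utf8 = word.encode('utf-8')           # two bytes for the n-tilde

print(repr(as_cp1252), guess_charset(as_cp1252))
print(repr(as_utf8), guess_charset(as_utf8))
```

Note that pure-ASCII data is valid in both encodings, so this check reports utf-8 for it; the distinction only matters once non-ASCII characters appear.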

If you are reading a file created by others and its encoding is not known, you will have to inspect the file and/or guess (using whatever knowledge you have about the language/locale of the creator).

"whether I ... read from an html text field, or read from an html text area": isn't that what "charset" is for?

HTH,
John