On Thu, Mar 10, 2016 at 1:39 AM, BartC <b...@freeuk.com> wrote:
> On 09/03/2016 14:11, Chris Angelico wrote:
>>
>> On Thu, Mar 10, 2016 at 1:03 AM, BartC <b...@freeuk.com> wrote:
>>>
>>> I've just tried a UTF-8 file and getting some odd results. With a file
>>> containing [three euro symbols]:
>>>
>>> €€€
>>>
>>> (including a 3-byte utf-8 marker at the start), and opened in text mode,
>>> Python 3 gives me this series of bytes (ie. the ord() of each character):
>>>
>>> 239
>>> 187
>>> 191
>>> 226
>>> 8218
>>> 172
>>> 226
>>> 8218
>>> 172
>>> 226
>>> 8218
>>> 172
>>>
>>> And prints the resulting string as: €€€.
>>
>>
>> The first three bytes are the "UTF-8 BOM", which suggests you may have
>> created this in a broken editor like Notepad.
>
>
> Yes, that's what I used, but what's broken about it? If Python doesn't
> understand the BOM, it should still resynchronise after a few bytes.

It's an extra character. You thought the file contained three
characters; it actually contained four.

>> For the rest, I'm not sure how you told Python to open this as text,
>> but you certainly did NOT specify an encoding of UTF-8. The 8218
>> entries in there are completely bogus. Can you show your code, please,
>> and also what you get if you open the file as binary?
>
> This is the code:
>
> f=open("input","r")
> t=f.read(1000)
> f.close()
>
> print ("T",type(t),len(t))
>
> print (t)
>
> for i in t:
>     print (ord(i))
>
> This doesn't specify any specific code encoding; I don't know how, and
> Steven didn't mention anything other than a text file. The input data is
> represented by this dump, and this is also what binary mode gives:
>
> 0000: ef bb bf e2 82 ac e2 82 ac e2 82 ac ............

Okay. Try changing your first line to this:

f = open("input", encoding="utf-8")

By default, you get a system-specific encoding, which in your case
appears to be one of the Windows codepages. That's why you're getting
nonsense out of it - you write in one encoding and read in another.
It's commonly called mojibake.

>> Unicode handling is easy as long as you (a) understand the fundamental
>> difference between text and bytes, and (b) declare your encodings.
>> Python isn't magical. It can't know the encoding without being told.
>
>
> Hence the BOM bytes.
>
> (Isn't it better that it's automatic? Someone sends you a text file that you
> want to open within a Python program. Are you supposed to analyze it first,
> or expect the sender to tell you what it is (they probably won't know) then
> need to hack the program to read it properly?)

No, it's not better to be automatic. They are supposed to tell you
what it is. Someone somewhere saved the file using a particular
encoding. In this example, you chose when you told Notepad to save it
as UTF-8; so you carry that information with the file, and open it
using the encoding="UTF-8" parameter.

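To make the difference concrete, here is a rough sketch using the
twelve bytes from your dump. The cp1252 codepage is only my guess at
your system default (it happens to reproduce the numbers you posted),
and the variable names are purely for illustration.

```python
# The twelve bytes from the hex dump: a UTF-8 BOM, then three euro signs.
raw = bytes.fromhex("efbbbf" + "e282ac" * 3)

# Decoded with a Windows codepage (cp1252 is an assumption here), every
# byte becomes its own character. This is the mojibake case.
wrong = raw.decode("cp1252")
print(len(wrong), [ord(c) for c in wrong])

# Decoded as UTF-8, the BOM survives as one extra character (U+FEFF)
# ahead of the three euro signs, so there are four characters in total.
with_bom = raw.decode("utf-8")
print(len(with_bom), [hex(ord(c)) for c in with_bom])

# The "utf-8-sig" codec drops a leading BOM if one is present.
clean = raw.decode("utf-8-sig")
print(len(clean), [hex(ord(c)) for c in clean])
```

The same codec names work with open(), so if the BOM character gets in
your way, open("input", encoding="utf-8-sig") should give you just the
three euro signs.
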
Analyzing files to try to guess their encodings is fundamentally hard.
I have a source of occasional text files that basically just dumps
stuff on me without any metadata, and I have to figure out (a) what
the encoding is, and (b) what language the text is in. I can generally
assume that the files are ASCII-compatible (on the rare occasions when
they're not, they're usually going to be UTF-16, which is fairly easy
to spot), and then I have two levels of heuristics to try to guess a
most-likely encoding - but ultimately, the script just decodes the
text as best it can, and then hands the result up to the human. If the
result looks mostly like Spanish but has acute accents instead of
tildes over the n's, it's probably the wrong codepage. Or if the text
is all completely meaningless junk, it's probably Cyrillic or Greek
letters, and needs to be decoded using an appropriate eight-bit
encoding. It often ends up being trial-and-error to figure out what
encoding was actually used.

Trying to guess the encoding of text in a file full of bytes is like
trying to guess the modem settings (8N1? 7E1?). If the other end
doesn't tell you, you'll probably end up with something that carries
some decodable content, but not the original content. It's almost
completely useless.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list