On Thu, 05 Sep 2013 00:17:36 +0000, Dave Angel wrote:

> On 4/9/2013 10:29, Ferrous Cranus wrote:
>
>> On 4/9/2013 3:38 PM, Dave Angel wrote:
>>> 'file' isn't magic.  And again, it doesn't look at the filename, it
>>> looks at the content.
>> So, you are saying that it looks at the content of the file and not of
>> what encoding we used to save the file into?
>
> That's right.  There's no place where your text editor stores the
> encoding it used, so 'file' has to guess, based only on the content.

Correct. The thing that people often fail to understand is that there is
no *reliable* way to store the encoding used for a text file in the text
file itself. The encoding is *metadata*, not data: it is data about the
data, and consequently it has to be stored "out of band". It has to be
stored somewhere else, outside of the file.

In the case of text files, it is usually not stored anywhere at all. IBM
mainframes assume that text files are using EBCDIC; modern Linux systems
assume text files are UTF-8; old DOS applications assume text files are
ASCII. Some text editors will try to guess the encoding, using various
heuristics such as "if the file starts with \xFE\xFF it is UTF-16" but
none of them are foolproof:

http://blogs.msdn.com/b/oldnewthing/archive/2004/03/24/95235.aspx

sometimes with amusing consequences:

http://www.hoax-slayer.com/bush-hid-the-facts-notepad.html

>> But the contents have within:
>>
>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1
>> \xf3\xf\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>>
>> so it should have said greek-iso and not ascii.

But the above byte string is also valid ISO-8859-5 (Cyrillic):

'Жуэљѓєяќэяьсѓ\x0fѓєоьсєяђ\n'

ISO-8859-2 (Central European):

'śăíůóôďüíďěáó\x0fóôŢěáôďň\n'

and ISO-8859-4 (Baltic):

'ļãíųķôīüíīėáķ\x0fķôŪėáôīō\n'

Surely you don't expect the file utility to actually recognise that

'Άγνωστοόνομασ\x0fστήματος\n'

makes a valid Greek phrase while the others are not meaningful?

> No, that line is totally ASCII.  Only when it's EXECUTED by Python will
> a non ASCII byte string object be created.  Like I said, 'file' doesn't
> know the first thing about Python syntax, nor should it.

Technically, it's not ASCII, since ASCII only knows about bytes \x00
through \x7F (decimal 0 through 127). That's why it isn't correct to
describe Python bytes strings as "ASCII strings". They're byte strings
that happen to be displayed as ASCII-plus-other-stuff.


-- 
Steven
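
P.S. For anyone who wants to see the ambiguity for themselves, here is a
minimal sketch. It takes only the first seven bytes of the string quoted
above and decodes them with four of Python's standard single-byte codecs.
The names `sample`, `candidates`, `decoded` and `is_ascii` are mine, purely
for illustration.

```python
# The first seven bytes of the byte string quoted above.
sample = b'\xb6\xe3\xed\xf9\xf3\xf4\xef'

# Four single-byte encodings that all accept these bytes without error.
candidates = ['iso8859_7', 'iso8859_5', 'iso8859_2', 'iso8859_4']

# Same bytes, four different strings: nothing in the bytes says which is "right".
decoded = {name: sample.decode(name) for name in candidates}
for name, text in decoded.items():
    print(name, repr(text))

# Strict ASCII rejects the same bytes, since every one of them is above 0x7F.
try:
    sample.decode('ascii')
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False
print('ascii:', is_ascii)
```

Every decode succeeds, so a tool that sees only the bytes cannot rule any
of the four out. It would need to know which result is a real word, and
that is a much harder problem than checking byte ranges.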