On Wed, Jul 16, 2014 at 11:11 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:
>
>> With a few exceptions, /etc is filled with text files, not binary
>> files, and half the executables on the system are text (Python, Perl,
>> bash, sh, awk, etc.).
>
> Our debate seems to stem from a different idea of what text is. To me,
> text in the Python sense is a sequence of UCS-4 character code points.
> The opposite of text is not necessarily binary.
Let's shift things a moment for an analogy. What is audio? What is sound? (Music, if you like, but I'm not going to get into the debate of whether or not Band So-and-so's output should be called music.) I have a variety of files that store music; some are RIFF WAVs, some are MP3s, some are Ogg Vorbis files, and right now I have an MKV of "Do you wanna build a snowman?" playing. (As far as I'm concerned, it's primarily there for the music, and the video image is buried behind other windows. But I'll accept the argument that that's just a container for some other format of audio, probably MPEG, but I haven't checked.) Sound, fundamentally, is a waveform: a series of air pressures.

Text, similarly, is not UCS-4, but a series of characters. We are fortunate enough to have Unicode and can therefore define text as a sequence of Unicode codepoints, but the distinction isn't a feature of Unicode; if you ask a primary school child to identify the letters in a word, s/he should be able to do so, and that without any computer involvement at all. Letters, digits, and other characters exist independently of encodings or even character sets, but it's really REALLY hard for computers to manipulate what they can't identify. So let's define Unicode text as "a sequence of Unicode codepoints" or "a sequence of Unicode characters", and proceed from there.

A file on a Unix or Windows file system consists of a sequence of bytes. Ergo, a file cannot actually contain text; it must store *encoded* text. But encoded text is far and away the most common type of file on any file system. Tweaking the previous script to os.walk() my home directory, rather than scanning $PATH, the ratios are roughly 2:1 the other way - heaps more text files than binary. And this is with my Downloads/ directory being almost entirely binaries, and lots of them: various zip files, deb packages, executables of various types... about the only actual text there would be .patch files.

>> Relatively rare.
>> Like, um, email, news, html, Unix config files,
>> Windows ini files, source code in just about every language ever,
>> SMSes, XML, JSON, YAML, instant messenger apps,
>
> I would be especially wary of letting Python 3 interpret those files for
> me. Python's [text] strings could be a wonderful tool on the inside of
> my program, but I definitely would like to micromanage the I/O. Do I
> obey the locale or not? That's too big (and painful) a question for
> Python to answer on its own (and pretend like everything's under
> control).

That's a problem that will be solved progressively, by daemons shifting to UTF-8 for everything. But until then, you have to treat log files as "messy" - you can't trust them to be in any single encoding. That's unusual compared to the common case, though. If you're reading your own config files, you can simply stipulate that they are to be encoded as UTF-8, and if they're not, you throw an error. Simple! Works with the easy way of opening files in Python. If you're reading someone else's config files, you can either figure out what that program is documented as expecting (and error out if the file's misencoded), or treat it as messy and read it as binary.

>> word processors... even *graphic* applications invariably have a text
>> tool.
>
> Thing is, the serious text utilities like word processors probably need
> lots of ancillary information so Python's [text] strings might be too
> naive to represent even a single character.

Ancillary information? (La)TeX files are entirely text, and have all that info in them somewhere. Open Document files are basically zip files of XML data, and XML is... all text. Granted, it's barely-readable text, but it is UTF-8 encoded text. (I just checked an odt file that I have sitting here, and it does contain a thumbnail in PNG format. But the primary content is all XML files.)

>>> More often, len(b'λ') is what I want.
>>
>> Oh really? Are you sure? What exactly is b'λ'?
>
> That's something that ought to work in the UTF-8 paradise.
> Unfortunately, Python only allows ASCII in bytes. ASCII only! In this
> day and age! Even C is not so picky:
>
>     #include <stdio.h>
>
>     int main()
>     {
>         printf("Hyvää yötä\n");
>         return 0;
>     }

And I have a program that lets me store 1.75 in an integer variable! That's ever so much better than most programs. It's so much less picky!

Actually, Python allows all bytes in a bytestring, not just ASCII - only the *literal* notation restricts you to ASCII characters, and any byte can still be written with a \xNN escape. However, b'λ' has no meaning; in fact, even b'asdf' is dubious, and this kind of notation exists only because there are many file formats that mix ASCII text and binary data. To be truly accurate, b'asdf' ought to be written as x'61736466' or something, because it's as likely to mean 1634952294 or 1717859169 as it is to mean "asdf".

What is C actually storing in that string? Do you know? Can you be truly sure that it's UTF-8? No, you cannot. Anyone might transcode your source file, and I don't think C compilers are aware of character literals and their associated encodings. More importantly, you cannot be sure that that will print "Hyvää yötä" to the console; if the console is set to an encoding other than the one your source file was using, you'll get mojibake. With Python, at least the interpreter gets some idea of what's going on.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
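
[Editor's note: the byte-ambiguity figures quoted above are easy to verify. This is a minimal Python 3 sketch using only the stdlib struct module; it reads the same four bytes of b'asdf' as either of the two integers ChrisA names, or as text, depending entirely on the interpretation chosen, and shows why len(b'λ') and len('λ') are different questions.]

```python
import struct

data = b'asdf'  # four bytes: 0x61 0x73 0x64 0x66

# The same four bytes as a 32-bit unsigned integer, in both byte orders:
big = struct.unpack('>I', data)[0]     # big-endian
little = struct.unpack('<I', data)[0]  # little-endian
print(big, little)           # 1634952294 1717859169

# ...or as encoded text:
print(data.decode('ascii'))  # asdf

# And the character b'λ' cannot express: one codepoint, two bytes in UTF-8.
lam = 'λ'
print(len(lam), len(lam.encode('utf-8')))  # 1 2
```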