On Sat, 2005-01-22 at 10:10 +0100, Alex Martelli wrote:
> The answer for the current implementation, BTW, is "in between" -- some
> buffering, but bounded consumption of memory -- but whether that tidbit
> of pragmatics is part of the file specs, heh, that's anything but clear
> (just as for other important tidbits of Python pragmatics, such as the
> facts that list.sort is wickedly fast, 'x in alist' isn't, 'x in adict'
> IS...).
A particularly good example of unexpected buffering effects is the file
iterator. Take code that reads a header from a file using an (implicit)
iterator, then tries to read() the rest of the file. For example, reading
an RFC822-like message into a list of headers and a body blob:

.>>> inpath = '/tmp/msg.eml'
.>>> infile = open(inpath)
.>>> headers = []
.>>> for line in infile:
....     if not line.strip():
....         break
....     headers.append(tuple(line.split(':',1)))
.>>> body = infile.read()

(By the way, if you ever implement this yourself for real, you should
probably be hurt - use the 'email' or 'rfc822' modules instead. For one
thing, reinventing the wheel is rarely a good idea. For another, the
above code is horrid - in particular it doesn't handle malformed headers
at all, isn't big on readability/comments, etc.)

If you run the above code on a saved email message, you'd expect 'body'
to contain the body of the message, right? Nope. The iterator created
from the file when you use it in that for loop does internal read-ahead
for efficiency, and has already read in the entire file, or at least a
chunk more of it than you've read out of the iterator. It doesn't attempt
to hide this from the programmer, so the file position marker is further
into the file (possibly at the end, on a smaller file) than you'd expect
given the data you've actually read in your program. As a result, 'body'
ends up missing whatever the iterator has already buffered, and may well
be empty.

I'd be interested to know if there's a better solution to this than:

.>>> inpath = '/tmp/msg.eml'
.>>> infile = open(inpath)
.>>> initer = iter(infile)
.>>> headers = []
.>>> for line in initer:
....     if not line.strip():
....         break
....     headers.append(tuple(line.split(':',1)))
.>>> data = ''.join(x for x in initer)

because that seems like a pretty ugly hack (and please ignore the
variable names). Perhaps there could be a way to get the file to seek
back to the point last read from the iterator when the iterator is
destroyed?

--
Craig Ringer
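
P.S. One possible answer to the question above, offered as a sketch rather
than a definitive fix: as far as I know, the read-ahead buffer belongs to
the file's iterator (its next() method), and readline() doesn't go through
it. So driving the loop with readline() instead of the implicit iterator
should leave the file position where read() expects it. The function name
split_message and the throwaway demo message below are made up purely for
illustration, and the header parsing is just as naive as in the examples
above.

```python
import os
import tempfile


def split_message(path):
    """Return (headers, body) for an RFC822-like file (hypothetical helper)."""
    headers = []
    infile = open(path)
    try:
        # iter(callable, sentinel) keeps calling infile.readline() until it
        # returns '' (end of file), so the file's own iterator, and its
        # read-ahead, is never involved.
        for line in iter(infile.readline, ''):
            if not line.strip():
                break  # the blank line separates the headers from the body
            headers.append(tuple(line.split(':', 1)))
        # The file position now sits just past the blank line.
        body = infile.read()
    finally:
        infile.close()
    return headers, body


# Small demonstration on a throwaway file (contents invented for the demo).
fd, demo_path = tempfile.mkstemp(suffix='.eml')
with os.fdopen(fd, 'w') as f:
    f.write('From: a@example.com\nSubject: hi\n\nline one\nline two\n')
demo_headers, demo_body = split_message(demo_path)
os.remove(demo_path)
```

The trade-off is that a readline() loop is likely to be slower than the
buffered iterator on large header blocks, which probably doesn't matter
for a handful of header lines.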