On Feb 4, 12:53 pm, rdahlstrom <[EMAIL PROTECTED]> wrote:
> On Feb 4, 10:17 am, Steven D'Aprano <[EMAIL PROTECTED]
> cybersource.com.au> wrote:
> > After reading an earlier thread about opening and closing lots of files,
> > I thought I'd do a little experiment.
> >
> > Suppose you have a whole lot of files, and you need to open each one,
> > append a string, then close them. There's two obvious ways to do it:
> > group your code by file, or group your code by procedure.
> >
> > # Method one: grouped by file.
> > for each file:
> >     open the file, append the string, then close it
> >
> > # Method two: grouped by procedure.
> > for each file:
> >     open the file
> > for each open file:
> >     append the string
> > for each open file:
> >     close the file
> >
> > If you have N files, both methods make the same number of I/O calls: N
> > opens, N writes, N closes. Which is faster?
> >
> > Intuitively, the first method has *got* to be faster, right? It's got one
> > loop instead of three and it doesn't build an intermediate list of open
> > file objects. It's so *obviously* going to be faster that it is hardly
> > worth bothering to check it with timeit, right?
> >
> > Well, I wouldn't be writing unless that intuitive result was wrong. So
> > here's my test results:
> >
> > Method 1:
> >
> > >>> import timeit
> > >>> names = ['afile' + str(n) for n in range(1000)]
> > >>> T = timeit.Timer('''for name in names:
> > ...     fp = open(name, 'a'); fp.write('xyz\\n'); fp.close()
> > ... ''', 'from __main__ import names')
> > >>> min(T.repeat(6, 500))
> > 17.391216039657593
> >
> > Method 2:
> >
> > >>> for name in names:  # reset the files to an empty state.
> > ...     fp = open(name, 'w'); fp.close()
> > ...
> > >>> T = timeit.Timer('''files = [open(name, 'a') for name in names]
> > ... for fp in files:
> > ...     fp.write('xyz\\n')
> > ... for fp in files:
> > ...     fp.close()
> > ... ''', '''from __main__ import names''')
> > >>> min(T.repeat(6, 500))
> > 16.823362112045288
> >
> > Surprisingly, Method 2 is a smidgen faster, by about half a second over
> > 500,000 open-write-close cycles. It's not much faster, but it's
> > consistent, over many tests, changing many of the parameters (e.g. the
> > number of files, the number of runs per timeit test, etc.).
> >
> > I'm using Linux and Python 2.5.
> >
> > So, what's going on? Can anyone explain why the code which does more work
> > takes less time?
> >
> > --
> > Steven
>
> The code that does more work takes more time. The second one does
> quite a bit less work. Think of it like this:
>
> You have 500,000 people to fit through a door. Here are your options:
>
> 1. For each person, open the door, walk through the door, then close
>    the door.
> 2. Open the door, allow everyone to walk through, then close the
>    door.
>
> Which one would you say would be a more efficient way to fit 500,000
> people through the door?
Bad analogy. A better analogy would be if each person had their own door
to walk through: both methods still perform N opens, N writes, and N
closes, so neither is doing "less work" at the call level.

My hunch is that it has to do with OS I/O scheduling. Closing a file
triggers a cache flush, which in turn causes a write to disk to be
scheduled; the OS's I/O scheduler is perhaps more efficient (for a given
number of total writes) when it can combine many pending writes at the
same time, which is what happens when all the closes arrive in one burst.


Carl Banks
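[Editor's note: for anyone who wants to reproduce the timings on a current
interpreter, here is a rough, self-contained sketch of the two variants as
a standalone script. It mirrors the quoted timeit parameters (1000 files,
6 repeats of 500 runs), but the scratch-directory handling and the file
names are illustrative assumptions, not part of the original post, and the
absolute numbers will of course differ by machine and filesystem.]

```python
# Sketch of the two benchmark variants discussed above (not the original
# script). Assumes the open-file limit (ulimit -n) exceeds the file count.
import os
import tempfile
import timeit

# Work in a throwaway directory so we don't litter the current one.
workdir = tempfile.mkdtemp(prefix="openclose_")
names = [os.path.join(workdir, "afile%d" % n) for n in range(1000)]

# Create the files empty first, as in the original setup step.
for name in names:
    open(name, "w").close()

def method1():
    # Grouped by file: open, append, close for each file in turn.
    for name in names:
        fp = open(name, "a")
        fp.write("xyz\n")
        fp.close()

def method2():
    # Grouped by operation: open all, write to all, then close all,
    # so the closes (and their flushes) arrive in one burst.
    files = [open(name, "a") for name in names]
    for fp in files:
        fp.write("xyz\n")
    for fp in files:
        fp.close()

if __name__ == "__main__":
    # 6 repeats of 500 runs each mirrors the original timeit parameters.
    t1 = min(timeit.repeat(method1, repeat=6, number=500))
    t2 = min(timeit.repeat(method2, repeat=6, number=500))
    print("method 1 (by file):      %.3f s" % t1)
    print("method 2 (by operation): %.3f s" % t2)
```

If the scheduling hypothesis is right, the gap should shrink or vanish
when the writes never reach the disk in a burst, e.g. on a tmpfs mount or
with a much smaller number of files.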