As part of a larger script that will eventually read through all files on a given drive, I was playing around with reading files and wanted to see if there is an optimum read size on my system.
What I noticed is that the file being read is "cached" on subsequent reads. Based on some testing it looks like this is done by the underlying OS (Windows in this case), but I have a few questions. Here's a code sample:

-------------------------------------------------------------------------
import os, time

# Set the following two variables to
# different large files on your system.
# Suggest files in the range of 500MB to 1GB
testfile1 = "d:\\test1\\junk1.file"
testfile2 = "d:\\test1\\junk2.file"

def readfile(filename):
    size = os.path.getsize(filename)
    bufsize = 4096
    print filename, size, "Bytes"
    while bufsize < 132000:
        start = time.clock()
        f = open(filename, "rb")
        buf = f.read(bufsize)
        while buf:
            buf = f.read(bufsize)
        f.flush()  # note: put here as a test and
                   # it doesn't make a difference
        f.close()
        end = time.clock()
        print bufsize, round(end - start, 3)
        bufsize = bufsize * 2
    print " "

# Comment out the second and third readfile calls and run
# the program twice to see a similar result for testfile1
readfile(testfile1)
readfile(testfile1)
readfile(testfile2)
-----------------------------------------------------------------

Sample output for the first readfile(testfile1):

d:\test1\junk1.file 759167228 Bytes
4096 20.366
8192 0.923
16384 0.783
32768 0.737
65536 0.74
131072 0.82

After the first read test at 4096, subsequent read tests appear to be cached, even though the file is closed before each new read test begins.

Sample output for the second readfile(testfile1):

d:\test1\junk1.file 759167228 Bytes
4096 1.258
8192 0.944
16384 0.795
32768 0.743
65536 0.725
131072 0.826

OK, I didn't expect much difference here given the first run, but note how the 4096 pass now takes only 1.2 seconds.

Sample output for readfile(testfile2):

d:\test1\junk2.file 1142511616 Bytes
4096 31.514
8192 1.417
16384 1.202
32768 1.11
65536 1.089
131072 1.245

Same situation as the first run for testfile1: the 4096 pass is not cached, but subsequent reads are.
Now some things to note:

So the file does seem to be cached, yet only ~2MB of additional memory is used while the program runs, and that 2MB is released when the script exits.

If you comment out the second and third readfile calls (as noted in the code):
a. Run the program twice and you will see that even after the program exits, this cache is not cleared.
b. If you open another command prompt and run the code, it's still cached.
c. If you close both command prompts, open a new one, and run the code, it's still cached. It isn't "cleared" until another large file is read.

My questions are:

1. I don't quite understand how, after one full read of a file, another full read of the same file is "cached" so significantly while consuming so little memory. What exactly is being cached that improves reading the file a second time?

2. Is there any way to take advantage of this "caching" by initializing it without reading through the entire file first?

3. If the answer to #2 is no, is there a way to purge this "cache" in order to get more accurate results from my routine, without having to read another large file first?

--
http://mail.python.org/mailman/listinfo/python-list