I'm trying to process a very large tarfile from within Python, and I'm running into memory constraints. The tarfile in question is a 4-gigabyte datafile from freedb.org (http://ftp.freedb.org/pub/freedb/), and it has about 2.5 million members in it.
Here's a simple toy program that just goes through and counts the number of members in the tarfile, printing a status message every N records (N=10,000 for the smaller file; N=100,000 for the larger). I'm finding that memory usage goes through the roof simply from iterating over the tarfile: I'm using over 2G when I'm barely halfway through the file. This surprises me; I'd expect the memory associated with each iteration to be released at the end of that iteration, but something is obviously building up. On one system this ends with a MemoryError exception. On another system it just hangs and brings the machine to its knees, to the point that simple task switching takes a minute or so.

Any suggestions for processing this beast? I suppose I could just untar the file and process 2.5 million individual files, but I'd rather process it directly if that's possible.

Here's the toy code. (One explanation about the "import tarfilex as tarfile" statement: I'm running ActiveState Python 2.5.0, and the tarfile.py module of that vintage was buggy, to the point that it couldn't read these files at all. I brought down the most recent tarfile.py from http://svn.python.org/view/python/trunk/Lib/tarfile.py and saved it as tarfilex.py. It works, at least until I start processing some very large files, anyway.)

import tarfilex as tarfile
import os, time

SOURCEDIR = "F:/Installs/FreeDB/"
smallfile = "freedb-update-20080601-20080708.tar"   # 63M file
smallint = 10000
bigfile = "freedb-complete-20080708.tar"             # 4,329M file
bigint = 100000

TARFILENAME, INTERVAL = smallfile, smallint
# TARFILENAME, INTERVAL = bigfile, bigint

def filetype(filename):
    return os.path.splitext(filename)[1]

def memusage(units="M"):
    import win32process
    current_process = win32process.GetCurrentProcess()
    memory_info = win32process.GetProcessMemoryInfo(current_process)
    bytes = 1
    Kbytes = 1024*bytes
    Mbytes = 1024*Kbytes
    Gbytes = 1024*Mbytes
    unitfactors = {'B':1, 'K':Kbytes, 'M':Mbytes, 'G':Gbytes}
    return memory_info["WorkingSetSize"]//unitfactors[units]

def opentar(filename):
    modes = {".tar":"r", ".gz":"r:gz", ".bz2":"r:bz2"}
    openmode = modes[filetype(filename)]
    openedfile = tarfile.open(filename, openmode)
    return openedfile

TFPATH = SOURCEDIR + '/' + TARFILENAME
assert os.path.exists(TFPATH)
assert tarfile.is_tarfile(TFPATH)

tf = opentar(TFPATH)
count = 0
print "%s memory: %sM count: %s (starting)" % (time.asctime(), memusage(), count)
for tarinfo in tf:
    count += 1
    if count % INTERVAL == 0:
        print "%s memory: %sM count: %s" % (time.asctime(), memusage(), count)
print "%s memory: %sM count: %s (completed)" % (time.asctime(), memusage(), count)

Results with the smaller (63M) file:

Thu Jul 17 00:18:21 2008 memory: 4M count: 0 (starting)
Thu Jul 17 00:18:23 2008 memory: 18M count: 10000
Thu Jul 17 00:18:26 2008 memory: 32M count: 20000
Thu Jul 17 00:18:28 2008 memory: 46M count: 30000
Thu Jul 17 00:18:30 2008 memory: 55M count: 36128 (completed)

Results with the larger (4.3G) file:

Thu Jul 17 00:18:47 2008 memory: 4M count: 0 (starting)
Thu Jul 17 00:19:40 2008 memory: 146M count: 100000
Thu Jul 17 00:20:41 2008 memory: 289M count: 200000
Thu Jul 17 00:21:41 2008 memory: 432M count: 300000
Thu Jul 17 00:22:42 2008 memory: 574M count: 400000
Thu Jul 17 00:23:47 2008 memory: 717M count: 500000
Thu Jul 17 00:24:49 2008 memory: 860M count: 600000
Thu Jul 17 00:25:51 2008 memory: 1002M count: 700000
Thu Jul 17 00:26:54 2008 memory: 1145M count: 800000
Thu Jul 17 00:27:59 2008 memory: 1288M count: 900000
Thu Jul 17 00:29:03 2008 memory: 1430M count: 1000000
Thu Jul 17 00:30:07 2008 memory: 1573M count: 1100000
Thu Jul 17 00:31:11 2008 memory: 1716M count: 1200000
Thu Jul 17 00:32:15 2008 memory: 1859M count: 1300000
Thu Jul 17 00:33:23 2008 memory: 2001M count: 1400000
Traceback (most recent call last):
  File "C:\test\freedb\tardemo.py", line 40, in <module>
    for tarinfo in tf:
  File "C:\test\freedb\tarfilex.py", line 2406, in next
    tarinfo = self.tarfile.next()
  File "C:\test\freedb\tarfilex.py", line 2311, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "C:\test\freedb\tarfilex.py", line 1235, in fromtarfile
    obj = cls.frombuf(buf)
  File "C:\test\freedb\tarfilex.py", line 1193, in frombuf
    if chksum not in calc_chksums(buf):
  File "C:\test\freedb\tarfilex.py", line 261, in calc_chksums
    unsigned_chksum = 256 + sum(struct.unpack("148B", buf[:148]) + struct.unpack("356B", buf[156:512]))
MemoryError
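
One workaround I'm considering, on the guess that the TarFile object keeps a reference to every TarInfo it has read (its members list, which getmembers() and friends use later), is to clear that list as I go. This is only a sketch under that assumption; I haven't confirmed that this is what's eating the memory:

import tarfilex as tarfile

tf = tarfile.open(TFPATH, "r")      # same TFPATH as in the toy code above
count = 0
for tarinfo in tf:
    count += 1
    # ... per-member processing would go here ...
    tf.members = []   # drop the TarInfo objects cached so far, so they can be collected
tf.close()
print "%s members seen" % count

If that guess is right, random access by name (getmember()/getmembers()) presumably wouldn't work afterwards, but for a single sequential pass over the archive I don't think I need it.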