What started as a simple test of whether it is faster to load uncompressed data directly from the hard disk or to load compressed data and decompress it (Windows XP SP2, Pentium 4 3.0 GHz system with 3 GByte RAM) seems to show that none of the compression libraries available in Python really works for large (e.g. 500 MByte) strings.
Test the provided code and see for yourself. At least on my system:

  zlib fails to decompress, raising a memory error
  pylzma fails to decompress, running endlessly and consuming 99% of CPU time
  bz2 fails to compress, running endlessly and consuming 99% of CPU time

The same code works with a 10 MByte string without any problem.

So what? Is there no compression support for large strings in Python? Am I doing something the wrong way here? Is there a theoretical upper limit on the string size each of the compression libraries can process, and if so, what is it?

The only limit I know about is 2 GByte for the python.exe process itself, but this does not seem to be the actual problem in this case.

There are also some other strange effects when trying to create large strings using the following code:

    m = 'm'*1048576
    # str1024MB = 1024*m # fails with memory error, but:
    str512MB_01 = 512*m # works ok
    # str512MB_02 = 512*m # fails with memory error, but:
    str256MB_01 = 256*m # works ok
    str256MB_02 = 256*m # works ok

and so on, down to allocating each single MByte in a separate string, which pushes python.exe to the upper limit of memory I have seen the Windows task manager report for it: 2.065.352 KByte.

Why did the str1024MB = 1024*m instruction fail when the memory is apparently there and the target size of 1 GByte can be reached in smaller pieces? Is that question outside the scope of this discussion thread, or is it the same problem that causes the compression libraries to fail? If it is the same problem, why is no memory error raised in those cases?

Any hints towards understanding what is going on and why, and/or towards a workaround, are welcome.

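One possible direction for a workaround, offered only as a minimal sketch: instead of handing the whole 500 MByte string to a single compress() or decompress() call, the data could be fed through zlib's compressobj() and decompressobj() in pieces, so that no single call has to allocate one huge string. The function names compress_stream and decompress_stream and the 1 MByte chunk size are made up for illustration, and whether this avoids the failures described above on this particular system is untested.

```python
import zlib

CHUNK_SIZE = 1024 * 1024  # 1 MByte per piece; an arbitrary choice for this sketch


def compress_stream(src, dst, chunk_size=CHUNK_SIZE):
    """Read src piece by piece and write zlib-compressed data to dst."""
    compressor = zlib.compressobj()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(compressor.compress(chunk))
    # flush() emits whatever the compressor still holds internally
    dst.write(compressor.flush())


def decompress_stream(src, dst, chunk_size=CHUNK_SIZE):
    """Read zlib data from src and write the decompressed data to dst.

    The second argument of decompress() caps the size of each returned
    string, because highly repetitive data can expand enormously.
    """
    decompressor = zlib.decompressobj()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        data = chunk
        while data:
            dst.write(decompressor.decompress(data, chunk_size))
            # input not consumed because of the output cap
            data = decompressor.unconsumed_tail
    dst.write(decompressor.flush())
```

Both functions take file-like objects opened in binary mode, for example the .dat file as src and the .zlib file as dst. The decompressed data then never exists as a single string unless the caller joins the pieces afterwards.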
Claudio

============================================================
# HDvsArchiveUnpackingSpeed_WriteFiles.py

strSize10MB = '1234567890'*1048576 # 10 MB
strSize500MB = 50*strSize10MB
fObj = file(r'c:\strSize500MB.dat', 'wb')
fObj.write(strSize500MB)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.zlib', 'wb')
import zlib
strSize500MBCompressed = zlib.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.pylzma', 'wb')
import pylzma
strSize500MBCompressed = pylzma.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.bz2', 'wb')
import bz2
strSize500MBCompressed = bz2.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

print
print ' Created files: '
print ' %s \n %s \n %s \n %s' %(
   r'c:\strSize500MB.dat'
  ,r'c:\strSize500MBCompressed.zlib'
  ,r'c:\strSize500MBCompressed.pylzma'
  ,r'c:\strSize500MBCompressed.bz2'
)

raw_input(' EXIT with Enter /> ')

============================================================
# HDvsArchiveUnpackingSpeed_TestSpeed.py
import time

startTime = time.clock()
fObj = file(r'c:\strSize500MB.dat', 'rb')
strSize500MB = fObj.read()
fObj.close()
print
print ' loading uncompressed data from file: %7.3f seconds'%(time.clock()-startTime,)

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.zlib', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds'%(time.clock()-startTime,)
import zlib
try:
  startTime = time.clock()
  strSize500MB = zlib.decompress(strSize500MBCompressed)
  print 'decompressing zlib data: %7.3f seconds'%(time.clock()-startTime,)
except:
  print 'decompressing zlib data FAILED'

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.pylzma', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds'%(time.clock()-startTime,)
import pylzma
try:
  startTime = time.clock()
  strSize500MB = pylzma.decompress(strSize500MBCompressed)
  print 'decompressing pylzma data: %7.3f seconds'%(time.clock()-startTime,)
except:
  print 'decompressing pylzma data FAILED'

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.bz2', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds'%(time.clock()-startTime,)
import bz2
try:
  startTime = time.clock()
  strSize500MB = bz2.decompress(strSize500MBCompressed)
  print 'decompressing bz2 data: %7.3f seconds'%(time.clock()-startTime,)
except:
  print 'decompressing bz2 data FAILED'

raw_input(' EXIT with Enter /> ')

--
http://mail.python.org/mailman/listinfo/python-list