Dear Pythoneers,

I'm moderately new to Python and it has me completely lost already.
I've got a bunch of large (30MB) txt files containing one 'event' per line. I open files after each other, read them line by line and from each line build a 'data structure' of a main class (HugeClass) containing some simple information as well as several instances of some other classes.

No problem so far, but I noticed that the first file was always faster than the others, whereas I would expect it to be slower, if anything. Testing with two copies of the same file shows the same behaviour.

Below is a (rather large, I'll explain) chunk of code. I ran this in a directory with two test files called 'test_file0.txt' and 'test_file1.txt', each containing 10k lines of the same information as the 'long_line' variable in the code. This shows the following timing (consistently) for the little piece of code that reads all lines from file:

    ...processing all 2 files found
    --> 1/2: ./test_file0.txt
    Now reading ...
    DEBUG readLines A took 0.093 s
    ...took 8.85717201233 seconds
    --> 2/2: ./test_file0.txt
    Now reading ...
    DEBUG readLines A took 3.917 s
    ...took 12.8725550175 seconds

So the first time around the file gets read in in ~0.1 seconds, the second time around it needs almost four seconds! As far as I can see this is related to 'something in memory being copied around' since if I replace the 'alternative 1' by the 'alternative 2', basically making sure that my classes are not used, reading time the second time around drops back to normal (= roughly what it is the first pass).

I already want to apologise for the size of the code chunk below. I know about 'minimal reproducible examples' and such but I found out that if I commented out the filling (and thus binding) of some of the member variables in the lower-level classes, the problem (sometimes) also disappears. That also points to some magic happening in memory?

I probably mucked something up but I'm really lost as to where. Any help would be appreciated.
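A minimal sketch of one way to probe that 'something in memory' hunch, assuming (and this is only a guess, not something the timings above prove) that the cyclic garbage collector plays a role once many class instances are alive. The names `time_appends`, `Dummy` and `live_objects` and the object counts are made up for illustration; the loop only mimics what readLines does and does not touch the test files. It is written to run on a current Python.

```python
import gc
import time


def time_appends(n_lines, use_gc):
    """Time appending n_lines strings to a list, with the collector on or off.

    Returns (elapsed_seconds, number_of_lines_stored).
    """
    was_enabled = gc.isenabled()
    if use_gc:
        gc.enable()
    else:
        gc.disable()
    try:
        data = []
        start = time.time()
        for i in range(n_lines):
            data.append("line %d" % i)
        elapsed = time.time() - start
    finally:
        # Restore whatever collector state the caller had.
        if was_enabled:
            gc.enable()
        else:
            gc.disable()
    return elapsed, len(data)


# Hypothetical stand-in for the leftovers of a first pass:
# many small instances that each hold a container.
class Dummy(object):
    def __init__(self, i):
        self.payload = {"i": i}


live_objects = [Dummy(i) for i in range(20000)]

for flag in (True, False):
    elapsed, count = time_appends(10000, flag)
    print("gc enabled=%s: stored %d lines in %.3f s" % (flag, count, elapsed))
```

If the two timings differ noticeably on the affected machine, the collector would be a plausible suspect; if they do not, the cause presumably lies elsewhere (for instance in the memory allocator).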
The original problem showed up using Python 2.4.3 under linux (Fedora Core 1). Python 2.3.5 on OS X 10.4.10 (PPC) appears not to show this issue(?).

Thanks,
Jeroen

P.S. Any ideas on optimising the input to the classes would be welcome too ;-)

Jeroen Hegeman
jeroen DOT hegeman AT gmail DOT com

===================Start of code chunk=========================
#!/usr/bin/env python

import time
import sys
import os
import gzip
import pdb

long_line = "1,31905,0,174501,46152419,2117961,143,-1.0000,51,2,-19.9139,42,-19.9140,6.6002,0,0,0,46713.1484,2,0.0000,-1,1.4203220606,0.3876158297,147.1210174561,147.1284120973,-2,0.0000,-1,1.5887237787,-2.4011900425,-319.7776794434,319.7906836817,4,21,0.0000,-1,-0.5672637224,2.2052443027,-43.2842369080,43.3440905719,21,0.0000,-1,-0.8540721536,0.0770076364,-22.7033920288,22.7195827425,21,0.0000,-1,0.1623233557,0.5845987201,-28.0794525146,28.0860084170,21,0.0000,-1,0.1943928897,-0.2195242196,-22.0666370392,22.0685899391,6,0.0000,-1,-40.1810989380,-127.0743789673,-104.9231948853,239.7436794163,-6,0.0000,-1,43.2013626099,125.0640945435,-67.7339172363,227.1753587387,24,0.0000,-1,-57.9123306274,-17.3483123779,-71.8334121704,123.4397648033,-24,0.0000,-1,84.0985488892,54.4542312622,-62.4525032043,144.5299239704,5,0.0000,-1,17.7312316895,-109.7260665894,-33.0897827148,116.3039146130,-5,0.0000,-1,-40.8971862793,70.6098632812,-5.2814140320,82.6454347683,4,0.0000,-1,-6.2859884724,-17.9586020410,-58.9464384913,69.4029468585,-3,0.0000,-1,-51.6263811588,0.6104701459,-12.8869901896,54.0368221571,3,0.0000,-1,16.4690684490,48.0271777511,-51.7867884636,74.5327484701,-4,0.0000,-1,67.6295298338,6.4269350171,-10.6658525467,69.9971834876,7,7,1.0345464706e+01,-7.0800781250e+01,-2.0385742187e+01,7.5256346272e+01,1.3148,0.0072,0.0072,1.3148,0.0072,0.0072,1.0255,1.0413,0.0,0.0,0.0,0.0,-1.0,-4.2383,49.5276,13,0.1537,0.5156,0,0.9982,0.0034,1.0000,7,1,0.9566,0.0062,1,0,2,1.2736,1,7.8407,1,0,2,1.2736,1,7.8407,0,0,-1.0,-1.0,5,1,-2.4047853470e+01,4.0832519531e+01,-3.8452150822e+00,4.7851562559e+01,1.3383,0.0051,0.0051,1.3383,0.0051,0.0051,0.9340,0.9541,0.0,0.0,0.0,0.0,-1.0,-2.4609,21.3916,7,0.1166,0.5977,0,0.9999,0.0052,1.0000,9,1,0.9947,0.0063,1,0,2,0.7735,1,74.7937,1,0,2,0.7735,1,74.7937,0,0,-1.0,-1.0,5,1,-4.4067382812e+01,2.5634796619e+00,-1.1138916016e+01,4.6203614579e+01,1.3533,0.0054,0.0054,1.3533,0.0054,0.0054,1.0486,1.0903,0.0,0.0,0.0,0.0,-1.0,-3.9648,31.3733,13,0.1767,0.5508,100,0.9977,0.0040,1.0000,9,1,0.0000,0.4349,0,0,0,0.0000,0,-1000.0000,0,0,0,0.0000,0,-1000.0000,0,0,-1.0,-1.0,0,1,3.7200927734e+01,2.7465817928e+00,-5.5847163200e+00,3.7994386563e+01,1.3634,0.0062,0.0062,1.6488,0.0385,0.0385,0.7141,0.9013,5.3986899118e+00,6.6766492833e-01,-2.3780213181e-01,5.4460399892e+00,0.5504,-3.1445,0.7776,9,0.1169,0.7734,0,0.9977,0.0040,1.0000,7,1,0.0000,0.1099,0,0,0,0.0000,0,-1000.0000,0,0,0,0.0000,0,-1000.0000,1,-1,5.3893,0.5459,4,1,1.2969970703e+01,3.3203125000e+01,-3.7231445312e+01,5.2001951876e+01,1.4414,0.0129,0.0129,1.4414,0.0129,0.0129,0.9019,0.7331,0.0,0.0,0.0,0.0,-1.0,-10.0195,12.2034,17,0.1922,0.3633,0,0.9774,0.0248,1.0000,6,1,0.0000,0.3523,0,0,0,0.0000,0,-1000.0000,0,0,0,0.0000,0,-1000.0000,0,0,-1.0,-1.0,0,1,-1.6174327135e+00,-7.1411132812e+00,-1.8798828125e+01,2.0202637222e+01,1.7886,0.0352,0.0352,1.7886,0.0352,0.0352,1.8257,1.2368,0.0,0.0,0.0,0.0,-1.0,-17.3438,45.6714,10,0.1529,0.5625,0,0.9898,0.0094,1.0000,3,1,-1.0000,10000.0000,0,0,0,-1.0000,0,-1.0000,0,0,0,-1.0000,0,-1.0000,0,0,-1.0,-1.0,-6,0,-5.9204106331e+00,-3.4484868050e+00,-6.5307617187e+00,9.6740722971e+00,1.6782,0.0326,0.0326,1.6782,0.0326,0.0326,1.0000,1.0000,0.0,0.0,0.0,0.0,-1.0,-9.4727,37.3401,13,0.2711,0.2344,100,0.9861,0.0045,1.0000,3,1,-1.0000,10000.0000,0,0,0,-1.0000,0,-1.0000,0,0,0,-1.0000,0,-1.0000,0,0,-1.0,-1.0,-6,0"

###########################################################################

class SmallClass:

    def __init__(self):
        return

    def input(self, line, c):
        self.item0 = int(line[c]); c += 1
        self.item1 = float(line[c]); c += 1
        self.item2 = int(line[c]); c += 1
        self.item3 = float(line[c]); c += 1
        self.item4 = float(line[c]); c += 1
        self.item5 = float(line[c]); c += 1
        self.item6 = float(line[c]); c += 1
        return c

###########################################################################

class ModerateClass:

    def __init__(self):
        return

    def __del__(self):
        pass
        return

    def input(self, line, c):
        self.items = {}
        self.item0 = float(line[c]); c += 1
        unit1 = SmallClass()
        c = unit1.input(line, c)
        self.items[len(self.items)] = unit1
        unit2 = SmallClass()
        c = unit2.input(line, c)
        self.items[len(self.items)] = unit2
        units_chunk = []
        chunk_size = int(line[c])
        c += 1
        for i in xrange(chunk_size):
            unit = SmallClass()
            c = unit.input(line, c)
            units_chunk.append(unit)
        for i in xrange(10):
            unit = SmallClass()
            c = unit.input(line, c)
        return c

###########################################################################

class LongClass:

    def __init__(self):
        return

    def clear(self):
        return

    def input(self, foo, c):
        self.item0 = float(foo[c]); c += 1
        self.item1 = float(foo[c]); c += 1
        self.item2 = float(foo[c]); c += 1
        self.item3 = float(foo[c]); c += 1
        self.item4 = float(foo[c]); c+=1
        self.item5 = float(foo[c]); c+=1
        self.item6 = float(foo[c]); c+=1
        self.item7 = float(foo[c]); c+=1
        self.item8 = float(foo[c]); c+=1
        self.item9 = float(foo[c]); c+=1
        self.item10 = float(foo[c]); c+=1
        self.item11 = float(foo[c]); c+=1
        self.item12 = float(foo[c]); c += 1
        self.item13 = float(foo[c]); c += 1
        self.item14 = float(foo[c]); c += 1
        self.item15 = float(foo[c]); c += 1
        self.item16 = float(foo[c]); c+=1
        self.item17 = float(foo[c]); c+=1
        self.item18 = float(foo[c]); c+=1
        self.item19 = int(foo[c]); c+=1
        self.item20 = float(foo[c]); c+=1
        self.item21 = float(foo[c]); c+=1
        self.item22 = int(foo[c]); c+=1
        self.item23 = float(foo[c]); c += 1
        self.item24 = float(foo[c]); c += 1
        self.item25 = float(foo[c]); c+=1
        self.item26 = int(foo[c]); c+=1
        self.item27 = bool(int(foo[c])); c+=1
        self.item28 = float(foo[c]); c+=1
        self.item29 = float(foo[c]); c+=1
        self.item30 = (foo[c] == "1"); c += 1
        self.item31 = (foo[c] == "1"); c += 1
        self.item32 = float(foo[c]); c += 1
        self.item33 = float(foo[c]); c += 1
        self.item34 = int(foo[c]); c += 1
        self.item35 = float(foo[c]); c += 1
        self.item36 = (foo[c] == "1"); c+=1
        self.item37 = (foo[c] == "1"); c+=1
        self.item38 = float(foo[c]); c += 1
        self.item39 = float(foo[c]); c += 1
        self.item40 = int(foo[c]); c += 1
        self.item41 = float(foo[c]); c += 1
        self.item42 = (foo[c] == "1"); c+=1
        self.item43 = float(foo[c]); c+=1
        self.item44 = float(foo[c]); c+=1
        self.item45 = float(foo[c]); c += 1
        self.item46 = int(foo[c]); c+=1
        self.item47 = bool(int(foo[c])); c+=1
        return c

###########################################################################

class HugeClass:

    def __init__(self,line):
        self.clear()
        self.input(line)
        return

    def __del__(self):
        del self.B4v
        return

    def clear(self):
        self.long_classes = {}
        self.B4v={}
        return

    def input(self, line):
        try:
            foo = line.strip().split(',')
            c = 0
            self.asciiVersion = float(foo[c])
            c += 1
            self.item0 = foo[c]; c += 1
            self.item1 = (self.item0 != "0")
            self.item2 = (foo[c] == "1"); c += 1
            self.item3=int(foo[c]); c+=1
            self.item4=int(foo[c]); c+=1
            self.item5=int(foo[c]); c+=1
            self.item6=int(foo[c]); c += 1
            self.item7=float(foo[c]); c+=1
            self.item8 = foo[c]; c += 1
            bit_item = int(self.item8)
            self.item9 = bool(bit_item & 2048)
            self.item10 = bool(bit_item & 1024)
            self.item11 = bool(bit_item & 512)
            self.item12 = bool(bit_item & 256)
            self.item13 = bool(bit_item & 128)
            self.item14 = bool(bit_item & 64)
            self.item15 = bool(bit_item & 32)
            self.item16 = bool(bit_item & 16)
            self.item17 = bool(bit_item & 8)
            self.item18 = bool(bit_item & 4)
            self.item19 = bool(bit_item & 2)
            self.item20 = bool(bit_item & 1)
            self.item21 = int(foo[c]); c+=1
            self.item22 = float(foo[c]); c+=1
            self.item23 = int(foo[c]); c+=1
            self.item24 = float(foo[c]); c+=1
            self.item25 = float(foo[c]); c+=1
            self.item26 = foo[c]; c+=1
            self.item27 = int(foo[c]); c+=1
            self.item28 = int(foo[c]); c+=1
            self.item29 = ModerateClass()
            c = self.item29.input(foo, c)
            self.item30 = int(foo[c]); c+=1
            self.item31 = int(foo[c]); c+=1
            for i in xrange(self.item31):
                unit = LongClass()
                c = unit.input(foo, c)
                self.long_classes[len(self.long_classes)] = unit
            assert(c == len(foo)), "ERROR We did not read the whole line!!!"
        except (ValueError,IndexError), msg:
            print >> sys.stderr, \
                  "ERROR Trouble reading line: `%(msg)s'" % vars()
            self.clear()
            return
        return

###########################################################################

def readLines(f):
    DATA = []
    f.seek(0)
    time_a = time.time()
    for i in f:
        DATA.append(i)
    time_b = time.time()
    time_spent_reading = time_b - time_a
    print "DEBUG readLines took %.3f s" % time_spent_reading
    return DATA

###########################################################################

def ReadClasses(filename):
    print 'Now reading ...'
    built_classes = {}
    # Read lines from file
    in_file = open(filename, 'r')
    LINES = readLines(in_file)
    in_file.close()
    # and interpret them.
    for i in LINES:
        ## This is alternative 1.
        built_classes[len(built_classes)] = HugeClass(long_line)
        ## The next line is alternative 2.
        ## built_classes[len(built_classes)] = long_line
    del LINES
    return

###########################################################################

def ProcessList():
    input_files = ["./test_file0.txt", "./test_file0.txt"]
    # Loop over all files that we found.
    nfiles = len(input_files)
    file_index = 0
    for i in input_files:
        print "--> %i/%i: %s" % (file_index+1, nfiles, i)
        ReadClasses(i)
        file_index += 1
    return

###########################################################################

if __name__ == "__main__":
    ProcessList()
    sys.exit(0)

###########################################################################

--
http://mail.python.org/mailman/listinfo/python-list