Jeroen Hegeman schreef: > ...processing all 2 files found > --> 1/2: ./test_file0.txt > Now reading ... > DEBUG readLines A took 0.093 s > ...took 8.85717201233 seconds > --> 2/2: ./test_file0.txt > Now reading ... > DEBUG readLines A took 3.917 s > ...took 12.8725550175 seconds > > So the first time around the file gets read in in ~0.1 seconds, the > second time around it needs almost four seconds! As far as I can see > this is related to 'something in memory being copied around' since if > I replace the 'alternative 1' by the 'alternative 2', basically > making sure that my classes are not used, reading time the second > time around drops back to normal (= roughly what it is the first pass).
(First, I had to add timing code to ReadClasses: the code you posted doesn't include them, and only shows timings for ReadLines.) Your program uses quite a bit of memory. I guess it gets harder and harder to allocate the required amounts of memory. If I change this line in ReadClasses: built_classes[len(built_classes)] = HugeClass(long_line) to dummy = HugeClass(long_line) then both times the files are read and your data structures are built, but after each run the data structure is freed. The result is that both runs are equally fast. Also, if I run the first version (without the dummy) on a computer with a bit more memory (1 GiB), it seems there is no problem allocating memory: both runs are equally fast. I'm not sure how to speed things up here... you're doing much processing on a lot of small chunks of data. I have a number of observations and possible improvements though, and some might even speed things up a bit. You read the files, but don't use the contents; instead you use long_line over and over. I suppose you do that because this is a test, not your actual code? __init__() with nothing (or only return) in it is not useful; better to just leave it out. You have a number of return statements that don't do anything (i.e. they return nothing (None actually) at the end of the function). A function without return automatically returns None at the end, so it's better to leave them out. Similarly you don't need to call sys.exit(): the script will terminate anyway if it reaches the end. Better leave it out. LongClass.clear() doesn't do anything and isn't called anyway; leave it out. ModerateClass.__del__() doesn't do anything either. I'm not sure how it affects what happens if ModerateClass gets freed, but I suggest you don't start messing with __del__() until you have more Python knowledge and experience. I'm not sure why you think you need to implement that method. The same goes for HugeClass.__del__(). It does delete self.B4v, but the default behavior will do that too. Again, I don't get why you want to override the default behavior. In a number of cases, you use a dict like this: built_classes = {} for i in LINES: built_classes[len(built_classes)] = ... So you're using the indices 0, 1, 2, ... as the keys. That's not what dictionaries are made for; lists are much better for that: built_classes = [] for i in LINES: built_classes.append(...) HugeClass.B4v isn't used, so you can safely remove it. Your readLines() function reads a whole file into memory. If you're working with large files, that's not such a good idea. It's better to load one line at a time into memory and work on that. I would even completely remove readLines() and restructure ReadClasses() like this: def ReadClasses(filename): print 'Now reading ...' built_classes = [] # Open file in_file = open(filename, 'r') # Read lines and interpret them. time_a = time.time() for i in in_file: ## This is alternative 1. built_classes.append(HugeClass(long_line)) ## The next line is alternative 2. ## built_classes[len(built_classes)] = long_line in_file.close() time_b = time.time() print "DEBUG readClasses took %.3f s" % (time_b - time_a) Personally I only use 'i' for integer indices (as in 'for i in range(10)'); for other use I prefer more descriptive names: for line in in_file: ... But I guess that's up to personal preference. Also you used LINES to store the file contents; the convention is that names with all capitals are used for constants, not for things that change. In ProcessList(), you keep the index in a separate variable. Python has a trick so you don't have to do that yourself: nfiles = len(input_files) for file_index, i in enumerate(input_files): print "--> %i/%i: %s" % (file_index + 1, nfiles, i) ReadClasses(i) Instead of item0, item1, ... , it's generally better to use a list, so you can use item[0], item[1], ... And finally, John Machin's suggestion looks like a good way to restructure that long sequence of conversions and assignments in HugeClass. -- The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom. -- Isaac Asimov Roel Schroeven -- http://mail.python.org/mailman/listinfo/python-list