> In order to deal with 400 thousand texts consisting of 80 million
> words, and huge sets of corpora, I have to be careful about memory.
> I need to track every word's behavior, so there need to be as many
> word objects as there are words.
> I am really suffering from the memory problem; even 4 GB of memory
> is not enough. Only 10,000 texts can kill it in 2 minutes.
> By the way, my program has been optimized to ``del`` the objects after
> traversing, in order not to keep the information in memory all the time.
It may well be that your application leaks memory; however, the examples you have given so far don't demonstrate that. Most likely, you still keep references to objects at some point, causing the leak.

It's fairly difficult to determine the source of such a problem. As a starting point, I recommend doing

    print len(gc.get_objects())

several times in the program, to see how the number of (gc-managed) objects increases. This number should grow continually; if it doesn't, you don't have a memory leak (or you have one in a C module, which would be even harder to track down). Then, from time to time, call

    import gc
    from collections import defaultdict

    def classify():
        counters = defaultdict(lambda: 0)
        for o in gc.get_objects():
            counters[type(o)] += 1
        counters = [(freq, t) for t, freq in counters.items()]
        counters.sort()
        for freq, t in counters[-10:]:
            print t.__name__, freq

a number of times, and see what kinds of objects get allocated. Then, for the most frequent kind of object, investigate whether any of them "should" have been deleted. If so, try to find out a) whether the code that should have released them was executed, and b) why they are still referenced (use gc.get_referrers for that; see the sketch in the P.S. below). And so on.

Regards,
Martin
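P.S. A minimal sketch of the gc.get_referrers step in b), in the same Python 2 style as the code above; the show_referrers helper and the dict example are only placeholders, not a fixed recipe:

    import gc

    def show_referrers(suspect_type, max_objects=3, max_refs=5):
        # Pick a few live instances of the suspect type and print what
        # still refers to each of them.  This helper itself (its frame
        # and the 'suspects' list) will also show up among the referrers.
        suspects = [o for o in gc.get_objects() if type(o) is suspect_type]
        for obj in suspects[:max_objects]:
            print "---", type(obj).__name__, "at", hex(id(obj))
            for ref in gc.get_referrers(obj)[:max_refs]:
                print "    referred to by", type(ref).__name__, repr(ref)[:70]

    # For example, if classify() reported dict as the most frequent type:
    # show_referrers(dict)

The repr() output is usually enough to recognize which of your own data structures is still holding on to the objects.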