I am loading a dictionary from a text file and constructing a trie data structure in memory. However, it takes longer than I'm happy with - about 12 seconds on my computer. I profiled it, came up with some clever ideas to cut down on the work (such as by exploiting the fact that the dictionary is sorted) and was only able to shave a small fraction of the time off. However, then I tried calling gc.disable() before loading the trie and it halved the running time! I was surprised. Is that normal? I thought that the cost of garbage collection would be in some way proportional to the amount of garbage created, but I didn't think I was creating any: as far as I can tell the only objects deallocated during the load are strings, which could not be participating in cycles.
I have spent a lot of time in C#, where the standard advice is not to mess about with the garbage collector because you'll probably just make things worse. Does that advice apply in Python? Is it a bad idea to call gc.disable() before loading the trie and then enable it again afterwards? Does the fact that the garbage collector is taking so much time indicate I'm doing something particularly bad here? Is there some way to give the garbage collector a hint that the whole trie is going to be long-lived and get it promoted straight to generation 2 rather than scanning it over and over? $ python Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55) [GCC 4.4.1] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> $ time python -c "import trie;t=trie.Trie();t.load(open('sowpods.txt'))" real 0m12.523s user 0m12.380s sys 0m0.140s $ time python -c "import trie;t=trie.Trie();t.load(open('sowpods.txt'))" real 0m12.592s user 0m12.480s sys 0m0.110s $ time python -c "import gc;gc.disable();import trie;t=trie.Trie();t.load(open('sowpods.txt'))" real 0m6.176s user 0m5.980s sys 0m0.190s $ time python -c "import gc;gc.disable();import trie;t=trie.Trie();t.load(open('sowpods.txt'))" real 0m6.331s user 0m5.530s sys 0m0.170s === trie.py === class Trie(object): __slots__=("root", "active") def __init__(self): self.root=[] self.active=False def insert(self, word): if len(word) == 0: self.active=True else: head = word[0] for ch, child in reversed(self.root): if ch == head: child.insert(word[1:]) return child = Trie() self.root.append((head, child)) child.insert(word[1:]) def seek(self, word): if len(word) == 0: return self head = word[0] for ch, child in self.root: if ch == head: return child.seek(word[1:]) return EMPTY_TRIE def load(self, file): for line in file: self.insert(line.strip().lower()) def empty(self): return (not self.root) and not self.active def endings(self, prefix=""): if self.active: yield prefix for ch, child in self.root: for ending in child.endings(prefix+ch): yield ending EMPTY_TRIE = Trie() -- http://mail.python.org/mailman/listinfo/python-list