Weeble, Try to use the full arguments of insert(i, x), instead of using list slices. Every time you create a slice, Python copies the list into a new memory location with the sliced copy. That's probably a big performance impact there if done recursively.
My 2cp, Xav On Fri, Mar 19, 2010 at 10:13 AM, Weeble <clockworksa...@gmail.com> wrote: > I am loading a dictionary from a text file and constructing a trie > data structure in memory. However, it takes longer than I'm happy with > - about 12 seconds on my computer. I profiled it, came up with some > clever ideas to cut down on the work (such as by exploiting the fact > that the dictionary is sorted) and was only able to shave a small > fraction of the time off. However, then I tried calling gc.disable() > before loading the trie and it halved the running time! I was > surprised. Is that normal? I thought that the cost of garbage > collection would be in some way proportional to the amount of garbage > created, but I didn't think I was creating any: as far as I can tell > the only objects deallocated during the load are strings, which could > not be participating in cycles. > > I have spent a lot of time in C#, where the standard advice is not to > mess about with the garbage collector because you'll probably just > make things worse. Does that advice apply in Python? Is it a bad idea > to call gc.disable() before loading the trie and then enable it again > afterwards? Does the fact that the garbage collector is taking so much > time indicate I'm doing something particularly bad here? Is there some > way to give the garbage collector a hint that the whole trie is going > to be long-lived and get it promoted straight to generation 2 rather > than scanning it over and over? > > $ python > Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55) > [GCC 4.4.1] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> > $ time python -c "import > trie;t=trie.Trie();t.load(open('sowpods.txt'))" > > real 0m12.523s > user 0m12.380s > sys 0m0.140s > $ time python -c "import > trie;t=trie.Trie();t.load(open('sowpods.txt'))" > > real 0m12.592s > user 0m12.480s > sys 0m0.110s > $ time python -c "import gc;gc.disable();import > trie;t=trie.Trie();t.load(open('sowpods.txt'))" > > real 0m6.176s > user 0m5.980s > sys 0m0.190s > $ time python -c "import gc;gc.disable();import > trie;t=trie.Trie();t.load(open('sowpods.txt'))" > > real 0m6.331s > user 0m5.530s > sys 0m0.170s > > > === trie.py === > > class Trie(object): > __slots__=("root", "active") > def __init__(self): > self.root=[] > self.active=False > def insert(self, word): > if len(word) == 0: > self.active=True > else: > head = word[0] > for ch, child in reversed(self.root): > if ch == head: > child.insert(word[1:]) > return > child = Trie() > self.root.append((head, child)) > child.insert(word[1:]) > def seek(self, word): > if len(word) == 0: > return self > head = word[0] > for ch, child in self.root: > if ch == head: > return child.seek(word[1:]) > return EMPTY_TRIE > def load(self, file): > for line in file: > self.insert(line.strip().lower()) > def empty(self): > return (not self.root) and not self.active > def endings(self, prefix=""): > if self.active: > yield prefix > for ch, child in self.root: > for ending in child.endings(prefix+ch): > yield ending > > EMPTY_TRIE = Trie() > -- > http://mail.python.org/mailman/listinfo/python-list >
-- http://mail.python.org/mailman/listinfo/python-list