On Mar 18, 7:13 pm, Weeble <clockworksa...@gmail.com> wrote: > I am loading a dictionary from a text file and constructing a trie > data structure in memory. However, it takes longer than I'm happy with > - about 12 seconds on my computer. I profiled it, came up with some > clever ideas to cut down on the work (such as by exploiting the fact > that the dictionary is sorted) and was only able to shave a small > fraction of the time off. However, then I tried calling gc.disable() > before loading the trie and it halved the running time! I was > surprised. Is that normal? I thought that the cost of garbage > collection would be in some way proportional to the amount of garbage > created, but I didn't think I was creating any: as far as I can tell > the only objects deallocated during the load are strings, which could > not be participating in cycles.
Well, you are creating and destroying a lot of objects in the process, so that will provoke the garbage collector. But you are also doing reversing and searching, and that's slow. Does your application really need to be able to keep things in order like this, or do you just want to know if a word is in the dictionary? If you just want to load up a dictionary and be able to see if words are in it, I would use a dict instead of a list. Even if you want to be able to print out the data in order, you might consider using a dict instead of a list. The print operation could use one sort at the end, instead of searching all the nodes on insertion. You could also use a character that doesn't appear in your data as a sentinel, and then you don't need a separate active indicator -- every leaf node will be empty and be referred to by the sentinel character. You are also doing a lot of recursive operations that could be done without recursing. Finally, do you really need to keep an additional object around for each node in the tree? I have modified your trie code to use a dict and a sentinel, while keeping basically the same API. This may or may not be the best way to do this, depending on your other code which uses this data structure. It could also probably be made faster by removing the setdefault, and not re-building objects when you need them, and even this implementation will load faster if you disable the gc, but in any case, this might give you some ideas about how to make your code go faster. Regards, Pat from collections import defaultdict class TrieTop(object): sentinel = ' ' # Something not in the data def __init__(self, data=None): def defaultrecurse(): return defaultdict(defaultrecurse) if data is None: data = defaultrecurse() self.data = data def insert(self, word): data = self.data for ch in word: data = data[ch] data[self.sentinel] def seek(self, word): data = self.data for ch in word: data = data.get(ch) if data is None: return EMPTY_TRIE return TrieTop(data) def load(self, file): for line in file: self.insert(line.strip().lower()) def empty(self): return (not self.data) def endings(self, prefix=""): def recurse(data, prefix): if not data: yield prefix[:-1] return for ch, data in data.iteritems(): for result in recurse(data, prefix + ch): yield result return sorted(recurse(self.data, prefix)) EMPTY_TRIE = TrieTop() -- http://mail.python.org/mailman/listinfo/python-list