GC is very expensive: am I doing something wrong?

Weeble Thu, 18 Mar 2010 17:17:34 -0700

I am loading a dictionary from a text file and constructing a trie
data structure in memory. However, it takes longer than I'm happy with
- about 12 seconds on my computer. I profiled it, came up with some
clever ideas to cut down on the work (such as by exploiting the fact
that the dictionary is sorted) and was only able to shave a small
fraction of the time off. However, then I tried calling gc.disable()
before loading the trie and it halved the running time! I was
surprised. Is that normal? I thought that the cost of garbage
collection would be in some way proportional to the amount of garbage
created, but I didn't think I was creating any: as far as I can tell
the only objects deallocated during the load are strings, which could
not be participating in cycles.


I have spent a lot of time in C#, where the standard advice is not to
mess about with the garbage collector because you'll probably just
make things worse. Does that advice apply in Python? Is it a bad idea
to call gc.disable() before loading the trie and then enable it again
afterwards? Does the fact that the garbage collector is taking so much
time indicate I'm doing something particularly bad here? Is there some
way to give the garbage collector a hint that the whole trie is going
to be long-lived and get it promoted straight to generation 2 rather
than scanning it over and over?

$ python
Python 2.6.4 (r264:75706, Dec  7 2009, 18:43:55)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
$ time python -c "import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"

real    0m12.523s
user    0m12.380s
sys     0m0.140s
$ time python -c "import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"

real    0m12.592s
user    0m12.480s
sys     0m0.110s
$ time python -c "import gc;gc.disable();import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"

real    0m6.176s
user    0m5.980s
sys     0m0.190s
$ time python -c "import gc;gc.disable();import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"

real    0m6.331s
user    0m5.530s
sys     0m0.170s


=== trie.py ===

class Trie(object):
    __slots__=("root", "active")
    def __init__(self):
        self.root=[]
        self.active=False
    def insert(self, word):
        if len(word) == 0:
            self.active=True
        else:
            head = word[0]
            for ch, child in reversed(self.root):
                if ch == head:
                    child.insert(word[1:])
                    return
            child = Trie()
            self.root.append((head, child))
            child.insert(word[1:])
    def seek(self, word):
        if len(word) == 0:
            return self
        head = word[0]
        for ch, child in self.root:
            if ch == head:
                return child.seek(word[1:])
        return EMPTY_TRIE
    def load(self, file):
        for line in file:
            self.insert(line.strip().lower())
    def empty(self):
        return (not self.root) and not self.active
    def endings(self, prefix=""):
        if self.active:
            yield prefix
        for ch, child in self.root:
            for ending in child.endings(prefix+ch):
                yield ending

EMPTY_TRIE = Trie()
-- 
http://mail.python.org/mailman/listinfo/python-list

GC is very expensive: am I doing something wrong?

Reply via email to