Re: gather information from various files efficiently

Steven Bethard Tue, 14 Dec 2004 10:15:05 -0800

Klaus Neuner wrote:

The straightforward way to solve this problem is to create a
dictionary. Like so:


[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]


So I timed the three suggestions with a few different datasets:

> cat builddict.py
def askpermission(d, k, v):
    if k in d:
        d[k].append(v)
    else:
        d[k] = [v]

def askforgiveness(d, k, v):
    try:
        d[k].append(v)
    except KeyError:
        d[k] = [v]

def default(d, k, v):
    d.setdefault(k, []).append(v)

def test(items, func):
    d = {}
    for k, v in items:
        func(d, k, v)

Dataset where every value causes a collision:

> python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.askpermission)" 1000 loops, best of 3: 1.62 msec per loop

> python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.askforgiveness)" 1000 loops, best of 3: 1.58 msec per loop

> python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.default)" 100 loops, best of 3: 2.03 msec per loop


Dataset where no value causes a collision:

> python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.askpermission)" 100 loops, best of 3: 2.29 msec per loop

> python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.askforgiveness)" 100 loops, best of 3: 9.96 msec per loop

> python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.default)" 100 loops, best of 3: 2.98 msec per loop


Datset where one of every 5 values causes a collision:

> python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.askpermission)" 1000 loops, best of 3: 1.82 msec per loop

> python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.askforgiveness)" 1000 loops, best of 3: 1.79 msec per loop

> python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.default)" 100 loops, best of 3: 2.2 msec per loop

So, when there are lots of collisions, you may get a small benefit from the try/except solution. If there are very few collisions, you probably would rather the if/else solution. The setdefault solution patterns about like the if/else solution, but is somewhat slower.

I will probably continue to use setdefault, because I think it's prettier =) but if you're running into a speed bottleneck in this sort of situation, you might consider one of the other solutions.

Steve
--
http://mail.python.org/mailman/listinfo/python-list

Re: gather information from various files efficiently

Reply via email to