On Sat, 10 Nov 2007 17:18:37 -0800, Michael Bacarella wrote:
> So, you think the Python's dict implementation degrades towards O(N)
> performance when it's fed millions of 64-bit pseudo-random longs?

No.

Here's my sample file:

$ wc -l id2name.txt
8191180 id2name.txt
$ ls -lh id2name.txt
-rw-rw-r-- 1 steve steve 371M 2007-11-11 14:00 id2name.txt

And the results of reading it into a dict (code follows below):

$ time ./slurp_dict.py
Starting at Sun Nov 11 14:26:51 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 14:29:31 2007
Starting to delete dict...
Traceback (most recent call last):
  File "./slurp_dict.py", line 20, in <module>
    del id2name
KeyboardInterrupt

real    35m52.334s
user    1m17.663s
sys     0m16.758s

Notice that the dict is completely read into memory in just two and a
half minutes. The script then tries to delete the dict, and 32 minutes
later it is still struggling. That's the point at which I got sick of
waiting and interrupted the script.

Conclusion: it's a memory issue, or maybe a garbage collection issue,
not a problem with dicts.

Here's my code for creating the key:value file in the first place:

#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""

import random

bits64 = 2**64
template = '%d:This is a bunch of text...\n'

fp = open('id2name.txt', 'w')
for i in xrange(8191180):
    # randint is inclusive at both ends, so stop at bits64 - 1 to keep
    # every key within 64 bits.
    fp.write(template % random.randint(0, bits64 - 1))
fp.close()
###

And here's my code for slurping it into a dict:

#!/usr/bin/python
"""Read a big file into a dict."""
import gc
import time

print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        # Give feedback.
        print "Line %d" % n
    id, name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name
print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
    gc.enable()
print "Finishing at %s" % time.asctime()
###

So, what can you do? Given that I'm not willing to do any more unpaid
experimentation for you, here are my suggestions, in no particular
order:

(1) Presumably you don't want to run your app with the garbage
collector turned off. You might still want to play around with the gc
module to see what you can learn (a small sketch follows below).

(2) More memory will help avoid paging. If you can't get more memory,
try more virtual memory. It will still be slow, but at least the
operating system doesn't have to try moving blocks around as much.

(3) Are you sure you need all eight-million-plus items in the cache
all at once?

(4) There are lots of algorithms out there for dealing with data too
big to fit into main memory. Do some research.

(5) There is a data structure designed for dealing with tens of
millions of records at once. It is called "a database". If you can't
find a better algorithm, and refuse to use an existing RDBMS, I
suspect you're going to end up inventing a primitive, inefficient,
buggy database which is no faster than existing systems out there
(a minimal sqlite3 sketch follows below).


--
Steven.
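
For (1), here is a minimal sketch of the kind of gc experiments meant
above. The threshold numbers are purely illustrative, not tuned
recommendations:

#!/usr/bin/python
"""Sketch only: poke at the cyclic garbage collector while bulk-loading
the same id2name.txt file.  Threshold values are illustrative."""

import gc

# Current collection thresholds and per-generation allocation counts.
print "thresholds:", gc.get_threshold()
print "counts:", gc.get_count()

# Print a statistics line every time a collection actually runs.
gc.set_debug(gc.DEBUG_STATS)

# Make generation-0 collections far rarer during the bulk insert
# (the default first threshold is 700 allocations).
gc.set_threshold(100000, 10, 10)

id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    key, name = line.strip().split(':', 1)
    id2name[long(key)] = name

# gc.collect() returns the number of unreachable objects it found.
print "unreachable objects found at exit:", gc.collect()
###

Watching the DEBUG_STATS output while the load runs shows how often the
collector fires as the dict grows.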
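
And for (5), a minimal sketch of what pushing the mapping into a
stdlib database might look like. sqlite3 is just one possible choice,
and the file, table and column names are placeholders:

#!/usr/bin/python
"""Sketch only: keep the id->name mapping in an on-disk SQLite file
instead of an in-memory dict.  Names are placeholders."""

import sqlite3

conn = sqlite3.connect('id2name.db')
cur = conn.cursor()
# SQLite integers are signed 64-bit and these keys can exceed
# 2**63 - 1, so store the key as text.
cur.execute("CREATE TABLE IF NOT EXISTS id2name "
            "(id TEXT PRIMARY KEY, name TEXT)")

def pairs(filename='id2name.txt'):
    """Yield (id, name) pairs from the colon-delimited text file."""
    for line in open(filename, 'r'):
        key, name = line.strip().split(':', 1)
        yield key, name

# One big transaction; the primary key gives an index, so lookups
# never need the whole data set in RAM.
cur.executemany("INSERT OR REPLACE INTO id2name VALUES (?, ?)", pairs())
conn.commit()

# A lookup touches the index and a single row (key is illustrative).
cur.execute("SELECT name FROM id2name WHERE id = ?", ('1234567890',))
print cur.fetchone()
conn.close()
###

Whether it's sqlite3, bsddb or a full RDBMS, the point is the same: the
index lives on disk and only the rows you actually touch come into
memory.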