Firstly, thank you for all of your help so far; I really appreciate it.

> > So, you think Python's dict implementation degrades towards O(N)
> > performance when it's fed millions of 64-bit pseudo-random longs?
>
> No.
Yes. I tried your code (with one change: printing the time on the
feedback lines) and got the same terrible performance against my data
set.

$ cat stevescode.py
#!/usr/bin/python
"""Read a big file into a dict."""
import gc
import time

print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:  # Give feedback.
        print "Line %d" % n, time.asctime()
    id, name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name
print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
    gc.enable()
print "Finishing at %s" % time.asctime()

$ ./stevescode.py
Starting at Sun Nov 11 10:46:37 2007
Line 0 Sun Nov 11 10:46:37 2007
Line 1000000 Sun Nov 11 10:48:22 2007
Line 2000000 Sun Nov 11 10:51:08 2007
Line 3000000 Sun Nov 11 10:56:16 2007
Line 4000000 Sun Nov 11 10:58:41 2007
Line 5000000 Sun Nov 11 11:03:26 2007
^C

To prove that my machine is sane, I ran the same script against your
generated sample file and got _excellent_ performance: start to finish
in under a minute.

$ cat steves-makedict.py
#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""
import random

bits64 = 2**64
template = '%d:This is a bunch of text...\n'

fp = open('id2name.txt', 'w')
for i in xrange(8191180):
    fp.write(template % random.randint(0, bits64))
fp.close()

$ ./steves-makedict.py
$ ./stevescode.py
Starting at Sun Nov 11 11:15:31 2007
Line 0 Sun Nov 11 11:15:31 2007
Line 1000000 Sun Nov 11 11:15:37 2007
Line 2000000 Sun Nov 11 11:15:43 2007
Line 3000000 Sun Nov 11 11:15:49 2007
Line 4000000 Sun Nov 11 11:15:54 2007
Line 5000000 Sun Nov 11 11:16:00 2007
Line 6000000 Sun Nov 11 11:16:07 2007
Line 7000000 Sun Nov 11 11:16:12 2007
Line 8000000 Sun Nov 11 11:16:18 2007
Items in dict: 8191180
Completed import at Sun Nov 11 11:16:19 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 11:16:23 2007
Finishing at Sun Nov 11 11:16:23 2007

> Notice that the dict is completely read into memory in just two and a
> half minutes. The script then tries to delete the dict, and 32 minutes
> later is still struggling. That's the point I got sick of waiting and
> interrupted the script.
>
> Conclusion: it's a memory issue, or maybe a garbage collection issue,
> not a problem with dicts.

As you can see, that is not the case at all against my data set.

> (1) Presumably you don't want to run your app with the garbage
> collector turned off. You might still want to play around with the gc
> module to see what you can learn.

As you can see, your version did no better. :(

> (2) More memory will help avoid paging. If you can't get more memory,
> try more virtual memory. It will still be slow, but at least the
> operating system doesn't have to try moving blocks around as much.

The machine has 8GB, and is not doing anything else when I run this.

> (3) Are you sure you need all eight-million-plus items in the cache
> all at once?

Yes.

> (4) There are lots of algorithms out there for dealing with data too
> big to fit into main memory. Do some research.

It DOES fit into main memory, and a dictionary is exactly the right way
to do this.
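Here is the rough arithmetic behind that claim; the per-object sizes are
only ballpark guesses for a 64-bit CPython 2.x build, not measurements:

#!/usr/bin/python
"""Back-of-envelope memory estimate for an ~8.2-million-entry dict.
All sizes below are ballpark guesses for 64-bit CPython 2.x, not measured."""

entries = 8200000
slots = 2**24        # smallest power-of-two table kept under two-thirds full
slot_bytes = 24      # hash + key pointer + value pointer per slot
key_bytes = 48       # small long object, including allocator overhead
value_bytes = 80     # short value string, including allocator overhead

total = slots * slot_bytes + entries * (key_bytes + value_bytes)
print "Roughly %.1f GB" % (total / 1024.0 ** 3)

That comes out to roughly 1.4GB; even if those guesses are off by a
factor of two, it sits comfortably inside the 8GB on this machine.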
> (5) There is a data structure designed for dealing with tens of
> millions of records at once. It is called "a database". If you can't
> find a better algorithm, and refuse to use an existing RDBMS, I suspect
> you're going to end up inventing a primitive, inefficient, buggy
> database which is no faster than existing systems out there.

I've tried three disk-based implementations already (BerkeleyDB, cdb,
and an RDBMS). Performance is terrible because they end up doing too
many random disk seeks. Pre-caching all of the data ahead of time has
offered us the best performance so far, but it is still slower than it
ought to be.

Creating a HEAP TABLE in the RDBMS is an idea, but moving all of this
really easy code into the RDBMS just to find a hashing algorithm that
doesn't choke on my keys sounds pretty lame.

A hash table cached in main memory is the right way to do this. The Perl
version runs *very* fast, after all.
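For what it's worth, the collision theory should be easy to test
directly, without loading any of the values. Something along these lines
(a rough sketch only; the mask is just a guess at the slot count a dict
of this size would use, it only looks at each key's initial slot rather
than the full probe sequence, and it assumes the id2name.txt format
above):

#!/usr/bin/python
"""Rough check: do the real keys spread across a power-of-two table
as evenly as random 64-bit keys do?  (Sketch only; the mask is a guess
at the table size an 8-million-entry dict would use, and only the
initial slot is counted, not the full probe sequence.)"""
import random

MASK = 2**24 - 1

def bucket_spread(keys):
    """Return (number of distinct buckets hit, size of the worst bucket)."""
    buckets = {}
    for k in keys:
        b = hash(k) & MASK
        buckets[b] = buckets.get(b, 0) + 1
    return len(buckets), max(buckets.itervalues())

real = [long(line.split(':', 1)[0]) for line in open('id2name.txt')]
fake = [random.randint(0, 2**64) for i in xrange(len(real))]

print "real keys:   %d buckets hit, worst bucket holds %d" % bucket_spread(real)
print "random keys: %d buckets hit, worst bucket holds %d" % bucket_spread(fake)

If the real keys land in dramatically fewer buckets than the random
ones, that points at the hashing; if the two look alike, it's back to
memory or GC.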