...

[Jon Smirl]
> I know in advance how many items will be added to the dictionary. Most
> dictionary implementations I have previously worked with are more
> efficient if they know ahead of time how big to make their tables.

Richard Jones spent considerable time investigating whether "pre-sizing"
lists and dicts in CPython could help, at the "Need For Speed" sprint
earlier this year. He didn't find a win worth getting; e.g., read the
section "List pre-allocation" at:

    http://wiki.python.org/moin/NeedForSpeed/Failures

Trying it for dicts was also part of what he did, but I don't think
specifics about that were recorded on the Wiki. I was at the sprint, and
hoots of triumph from Richard's direction were conspicuous by absence
during his dict time ;-)

> In this case I only need to do a presence test of the key, there is no
> actual data associated with the key. The table is used for detecting
> duplicate entries. Is there a more efficient way to do this test than
> sticking an empty string into a dict? The keys are sha1.digest().

It's probably more common to set the value to None or 1, but it doesn't
really matter in reality. In theory, using 1 or an empty string relies on
the implementation accident that CPython stores those uniquely, while it's
guaranteed that None is a singleton object.

BTW, is there a reason to use SHA instead of MD5? I ask because the
latter is 4 bytes shorter, and you apparently have a /lot/ of keys.

If you're using a current version of Python, you can save some memory by
using a builtin set object instead. The CPython implementation of sets is
very much like its implementation of dicts, but doesn't consume any memory
for the (non-existent) associated values. You also get a number of
operations useful on sets (like intersection and union).
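For illustration, here's a minimal sketch of set-based duplicate detection
keyed on digests; the hashlib module, the is_duplicate name and the sample
byte strings are assumptions for the example, not anything from the
original code:

    import hashlib

    seen = set()                      # holds only the digests; no values

    def is_duplicate(record):
        """Return True if an identical record was seen earlier in this run."""
        digest = hashlib.sha1(record).digest()    # 20-byte binary key
        if digest in seen:
            return True
        seen.add(digest)
        return False

    assert not is_duplicate(b"some record")   # first time: not a duplicate
    assert is_duplicate(b"some record")       # second time: duplicate

A dict-based version would have to store some value (None, say) for every
key; the set simply has no value slot, which is where the memory saving
comes from.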
> ...
> Since I am rerunning the conversion over and over, anything simple that
> speeds it up is helpful.
>
> I already have 4GB RAM; it would take around 30GB to get everything in
> memory.
>
> Dictionaries are not a big problem for me, but there are many in use and
> they have millions of items in them.

A peculiar suggestion: if you don't need to release system resources
cleanly at the end of a run, try doing:

    import os
    os._exit(0)

at the end. If you have dicts with millions of entries swapped to disk, it
/can/ consume major time just to page them all in again to decrement the
key & value refcounts if you let "clean shutdown" code determine they've
become trash. Bailing ungracefully can skip all that work (but also skips
other work, like letting the platform C I/O library close still-open files
gracefully).
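To make that concrete, a sketch of what the tail end of such a script
might look like; run_conversion is a made-up stand-in for the real work:

    import os
    import sys

    def run_conversion():
        # Stand-in for the real work that fills the huge dicts/sets.
        pass

    if __name__ == "__main__":
        run_conversion()
        sys.stdout.flush()    # os._exit() won't flush Python-level buffers
        # Exit immediately: no refcount teardown of the (possibly
        # swapped-out) containers, no atexit handlers, no graceful close
        # of still-open files.
        os._exit(0)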