Jon Smirl wrote: > On Sun, 06 Aug 2006 15:33:30 -0700, John Machin wrote: [snip..] > > > > Do you have an application with a performance problem? If so, what makes > > you think inserting 1M items into a Python dict is contributing to the > > problem? > > I know in advance how many items will be added to the dictionary. Most > dictionary implementations I have previously worked with are more > efficient if they know ahead of time how big to make their tables. > > In this case I only need to do a presence test of the key, there is no > actual data associated with the key. The table is used for detecting > duplicate entries. Is there a more efficient to do this test that sticking > an empty string into a dict? The keys are sha1.digest().
Possibly a naive question - but would using sets be more efficient ? They are generally used for the sort of job you are describing (testing for duplicates rather than storing data associated with a key). We did some limited performance tests at Resolver Systems and found use of sets and dictionaries to be almost exactly hte same speed - but there could be memory advantages for you. Performance is also likely to be different for vast data sets, but I understand the hashing algorithm to be very similar for both... Fuzzyman http://www.voidspace.org.uk/python/index.shtml -- http://mail.python.org/mailman/listinfo/python-list