In article <877hnpjtdw....@rudin.co.uk>, Paul Rudin <paul.nos...@rudin.co.uk> wrote:
>"Karin Lagesen" <karin.lage...@bio.uio.no> writes:
>
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>
>> Now, I have tried storing these strings as a list, a set and a dictionary.
>> I know that finding things in a set and a dictionary is very much faster
>> than working with a list, so I tried those first. However, I run out of
>> memory building both the set and the dictionary, so what I seem to be left
>> with is the list, and using the in method.
>>
>> I imagine that there should be a faster/better way than this?
>>
>
>Shouldn't a set with 83 million 14-character strings be fine in memory
>on a stock PC these days? I suppose if it's low on RAM you might start
>swapping, which will kill performance. Perhaps the method you're using to
>build the data structures creates lots of garbage? How much RAM do you
>have and how much memory does the Python process use as it builds your
>data structures?
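For scale: building the set is only a few lines. An untested sketch,
assuming the strings sit one per line in a file "fourteen.txt" (the
file name and layout are made up):

    fourteen_mers = set()
    with open("fourteen.txt") as f:
        for line in f:
            fourteen_mers.add(line.rstrip("\n"))

    # Lookup is then a single hash probe:
    print("ACGTACGTACGTAA" in fourteen_mers)

Back of the envelope: 83 million 14-character strings at a few tens
of bytes of object overhead each, plus the set's hash table, comes to
several GB, so this wants a 64-bit Python and a machine with the RAM
to match.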
And if this is practical there should be no swapping problems, as the
working set will be a small fraction of the data used.

<SNIP>
>
>There are other algorithms you can use that have better theoretical
>performance than using bisect for this particular problem, but you get
>into trade-offs between executing things in Python as opposed to C if
>you start to hand-code things in Python.

There are a lot of possibilities, but they depend a great deal on
secondary conditions, like how often the 83 M dataset changes. (A
sketch of the plain bisect lookup is in the P.S. below.)

Groetjes Albert
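P.S.: For concreteness, the bisect lookup mentioned above is only a
few lines. An untested sketch, assuming the 83 million strings are
already in memory in a sorted list:

    import bisect

    def contains(sorted_strings, s):
        # Binary search for the leftmost insertion point, then check
        # whether there is an exact match at that position.
        i = bisect.bisect_left(sorted_strings, s)
        return i < len(sorted_strings) and sorted_strings[i] == s

That is O(log n) string comparisons per lookup, about 27 of them for
83 million entries, and a sorted list needs noticeably less memory
than a set or dict because there is no hash table on top of the
string objects.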