"Karin Lagesen" <karin.lage...@bio.uio.no> writes:
> Hello.
>
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?
Shouldn't a set of 83 million 14-character strings fit in memory on a stock PC these days? If you're low on RAM you might start swapping, which will kill performance. Perhaps the method you're using to build the data structures creates lots of garbage? How much RAM do you have, and how much memory does the Python process use as it builds your data structures?

A set should give good performance if the target string is also 14 characters.

If you do end up with the list, then try using bisect <http://docs.python.org/library/bisect.html>, which should be quicker than just using "in" (which I think scans linearly through the list testing for equality). There are other algorithms with better theoretical performance than bisect for this particular problem, but you get into trade-offs between executing things in Python as opposed to C once you start hand-coding things in Python.

--
http://mail.python.org/mailman/listinfo/python-list
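To make the bisect suggestion concrete, here's a minimal sketch (the `contains` helper and the sample strings are illustrative, not from the original post): keep the list sorted once, then use `bisect.bisect_left` for an O(log n) membership test instead of the O(n) linear scan that `in` performs on a list.

```python
import bisect

def contains(sorted_strings, target):
    """Membership test on a sorted list via binary search (O(log n))."""
    i = bisect.bisect_left(sorted_strings, target)
    return i < len(sorted_strings) and sorted_strings[i] == target

# Hypothetical sample data; sort once up front, then query many times.
strings = sorted(["AAAAAAAAAAAAAA", "GATTACAGATTACA", "TTTTTTTTTTTTTT"])
print(contains(strings, "GATTACAGATTACA"))  # True
print(contains(strings, "CCCCCCCCCCCCCC"))  # False
```

The one-time `sorted()` call is O(n log n), but each subsequent lookup needs only about 27 comparisons for 83 million entries, versus up to 83 million for a linear `in` scan.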