Paul Rudin <paul.nos...@rudin.co.uk> wrote:

> Shouldn't a set with 83 million 14 character strings be fine in memory
> on a stock PC these days? I suppose if it's low on ram you might start
> swapping which will kill performance. Perhaps the method you're using
> to build the data structures creates lots of garbage? How much ram do
> you have and how much memory does the python process use as it builds
> your data structures?
Some simple experiments should show you that a stock PC running a 32 bit
Python will struggle:

>>> import sys
>>> s = "12345678901234"
>>> sys.getsizeof(s)
38
>>> 83 * 38   # 83 million strings at 38 bytes each, i.e. about 3154 MB
3154

So more than 3GB just for the strings (and that's for Python 2.x; on
Python 3.x you'll need nearly 5GB). Running on a 64 bit version of
Python should be fine, but for a 32 bit system a naive approach just
isn't going to work.

Option 1: use a trie. That should reduce storage; maybe it will reduce
it enough, maybe not. It depends on the data.

Option 2: use a simple database, e.g. shelve. Simple to write and easy
to use.

Option 3: use a linear search through the file containing the 83
million strings. If the OP really does want to check *one* string then
that is comparatively slow but still much faster than posting the
question here. If they really wanted to check, say, 10,000 strings then
put those strings in a set and test each line of the 83 million line
file against the set instead of doing it the other way round. At some
number of test strings this is probably faster than using a database.

-- 
Duncan Booth
http://kupuguy.blogspot.com
-- 
http://mail.python.org/mailman/listinfo/python-list
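A minimal sketch of the shelve approach from Option 2. The file and
database names ("strings.txt", "strings.shelf") and the build-once /
query-later split are illustrative assumptions, not from the post; the
point is simply that the 83 million keys end up on disk, not in RAM:

    import shelve

    def build_index(text_path="strings.txt", db_path="strings.shelf"):
        # Build the shelf once; each 14 character string becomes a key.
        # This will take a while for 83 million lines, but memory use
        # stays flat because everything goes straight to disk.
        db = shelve.open(db_path)
        try:
            with open(text_path) as f:
                for line in f:
                    db[line.strip()] = True  # only the key matters
        finally:
            db.close()

    def contains(candidate, db_path="strings.shelf"):
        # Membership test without loading anything big into memory.
        db = shelve.open(db_path)
        try:
            return candidate in db
        finally:
            db.close()

After build_index() has run once, contains("12345678901234") answers
each lookup from disk in roughly constant time.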
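And a sketch of the reversed membership test from Option 3, again with
hypothetical names. Only the small collection of test strings is held
in a set; the 83 million line file is streamed line by line:

    def find_matches(test_strings, text_path="strings.txt"):
        # Keep the small side of the comparison in memory...
        wanted = set(s.strip() for s in test_strings)
        found = set()
        # ...and stream the big file past it, one line at a time.
        with open(text_path) as f:
            for line in f:
                line = line.strip()
                if line in wanted:
                    found.add(line)
                    if len(found) == len(wanted):
                        break  # every candidate found, stop early
        return found

For a single string, find_matches(["12345678901234"]) is exactly the
linear search described above; for 10,000 test strings it still reads
the file at most once, which is where it starts to compete with a
database.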