"Karin Lagesen" <karin.lage...@bio.uio.no> wrote in message news:416f727c6f5b0edb932b425db9579808.squir...@webmail.uio.no...
> Hello.
>
> I have approx. 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a
> dictionary. I know that finding things in a set and a dictionary is very
> much faster than working with a list, so I tried those first. However, I
> run out of memory building both the set and the dictionary, so what I
> seem to be left with is the list, and using the 'in' operator.
>
> I imagine that there should be a faster/better way than this?

You may be overthinking it. How fast does it need to be? I generated the following file:

   >>> # write 83 million identical 14-character lines (about 1.2GB on disk)
   >>> f = open('test.txt', 'wt')
   >>> for i in xrange(83000000):
   ...     f.write('0123456789ABCD\n')
   ...
   >>> f.close()

It took about 15 seconds to search that file with a command-line search tool (Windows, and a slow IDE hard drive):

    findstr xyz test.txt
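
On a Unix-like system the equivalent command would be:

    grep xyz test.txt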

It took about twice that to search via Python with 'in':

   >>> # scan line by line; 'in' tests for a substring within each line
   >>> for line in open('test.txt'):
   ...     if 'xyz' in line:
   ...         print line
   ...
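
Note that 'in' is a substring test. For the exact membership question asked above (is this 14-character string one of the 83 million?), an equality test per line is the precise check. A minimal sketch along the same lines, where target stands in for a hypothetical query:

   >>> target = 'ABCDEFGHIJKLMN'  # hypothetical 14-character query
   >>> any(line.rstrip('\n') == target for line in open('test.txt'))
   False

any() stops at the first match, so a string that occurs early in the file is found almost immediately.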

Reading a line at a time has the advantage of using very little memory. Storing 83 million 14-character strings in memory requires a minimum of about 1.2GB (83,000,000 × 14 bytes), not counting per-object overhead for lists, sets, or dictionaries.
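
To put a rough number on that overhead, a quick check (assuming Python 2.6+ for sys.getsizeof; the exact figure varies by build):

   >>> import sys
   >>> sys.getsizeof('0123456789ABCD')  # bytes for one 14-character string object
   51

At roughly 51 bytes per string object on a 64-bit CPython, plus an 8-byte pointer per list slot, the list alone would be closer to 5GB, which is consistent with running out of memory building the set and the dictionary.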

-Mark

