Dennis Lee Bieber wrote: > On Thu, 29 Apr 2010 11:38:28 +0200, "Karin Lagesen" > <karin.lage...@bio.uio.no> declaimed the following in comp.lang.python: > >> Hello. >> >> I have approx 83 million strings, all 14 characters long. I need to be >> able to take another string and find out whether this one is present >> within the 83 million strings. >> > Is this "another string" also exactly 14 characters long? > >> Now, I have tried storing these strings as a list, a set and a dictionary. >> I know that finding things in a set and a dictionary is very much faster >> than working with a list, so I tried those first. However, I run out of >> memory building both the set and the dictionary, so what I seem to be left >> with is the list, and using the in method. >> > So don't load them into memory... First use a file-based (not memory > limited) sort routine on the 80M strings (if position in the file is > important, make a copy with the line number on the end of each line > before sorting -- but since you've tried using Sets which do not retain > order... probably not a matter). > > Then, since you know the length of the file, and the length of each > line is fixed, you can do direct access I/O (well, since the C-stream > style I/O doesn't do "record" access, you'll need to calculate offsets > from the start of the file as (record# - 1) * recordlen)... > > That lets you do a binary search on the file. Much faster than a > linear search (linear search will average out to 41.5M read operations; > binary should be around 10000 reads) > > Or load the data into some relational database and let the compiled > code do the searching -- probably faster than byte-code interpreted > Python for the same algorithm...
... or put the data into a simple on-disc dictionary such as what mxBeeBase implements: http://www.egenix.com/products/python/mxBase/mxBeeBase/ Once you have that dict set up, lookups are very fast. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 06 2010) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ -- http://mail.python.org/mailman/listinfo/python-list