s = "12345678901234"
assert len(s) == 14
import sys
sys.getsizeof(s)
38
So a single 14-char string takes 38 bytes. Make that at least 40 bytes,
since you have to take memory alignment into account.
So a set with 83,000 such strings takes approximately 1 MB. So far fairly
trivial. But that's just the memory used by the container (the set), not
the contents: 38 bytes * 83,000 strings = another 3 MB. That is of course
trivial for a modern PC, but the OP is using 83 million such strings,
not 83 thousand, which gives us a grand total of at least 3 gigabytes. An
entry-level desktop PC these days generally has 2 GB, and an entry-level
notebook might have half a gig.
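To put numbers on that, here is a minimal back-of-the-envelope sketch; the
40-byte and 8-byte figures are assumptions for a 32-bit CPython build, not
measured values:

per_string = 40                       # ~38 bytes per 14-char str, aligned
per_slot = 8                          # set slot (hash + pointer) on 32 bit
n = 83 * 10**6                        # 83 million strings

strings_gb = n * per_string / 2.0**30
set_gb = n * per_slot * 2 / 2.0**30   # hash tables overallocate, assume ~2x
print("strings: %.1f GB, set: %.1f GB" % (strings_gb, set_gb))
# prints roughly "strings: 3.1 GB, set: 1.2 GB"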
You are pretty much screwed on a 32-bit system here. In my experience a
32-bit process can't store more than about 2.5 to 2.8 GB on the heap.
Eventually malloc() will fail, since large parts of the 4 GB address space
are reserved for other things like the stack, the entry point, shared
library mappings, error detection etc. Memory fragmentation isn't an issue
here.
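If you want to see where that ceiling sits on your own box, a crude probe
like the one below works; run it at your own risk (on a 32-bit build it
typically stops somewhere around 2.5 to 3 GB, on 64 bit it will simply eat
all your RAM and swap):

chunks = []
allocated = 0
try:
    while True:
        chunks.append(bytearray(10 * 2**20))   # grab 10 MiB at a time
        allocated += 10
except MemoryError:
    print("MemoryError after ~%d MiB" % allocated)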
Other ideas:
* Use a SQL database with an index on the data column. The index could
optimize the "starting with" case (a sqlite3 sketch follows after this
list).
* You need to search for a string inside a large set of texts? Sounds
like a job for a full-text search engine! Grab PyLucene and index your
data in a Lucene index. An SSD helps a lot here.
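
For the SQL idea, here is a minimal sketch with the sqlite3 module from the
standard library; the table and column names are made up for illustration,
and the prefix search is expressed as an index-friendly range query (the
"next prefix" trick works here because the keys are all digits):

import sqlite3

conn = sqlite3.connect("strings.db")
conn.execute("CREATE TABLE IF NOT EXISTS data (s TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_s ON data (s)")
conn.executemany("INSERT INTO data VALUES (?)",
                 [("12345678901234",), ("12345999999999",)])
conn.commit()

# "starts with 123456" becomes a range scan over the index on s
rows = conn.execute("SELECT s FROM data WHERE s >= ? AND s < ?",
                    ("123456", "123457"))
print([r[0] for r in rows])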
Christian