s = "12345678901234"
assert len(s) == 14
import sys
sys.getsizeof(s)
38

So a single 14-character string takes 38 bytes.

Make that at least 40 bytes once you take memory alignment into account.
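
Most of that is fixed per-object overhead, not character data. You can see
it by comparing against the empty string (the numbers below are from the
same 32-bit Python 2.x build as above; other versions and 64-bit builds
report larger values):

>>> sys.getsizeof("")                      # fixed header of a str object
24
>>> sys.getsizeof(s) - sys.getsizeof("")   # plus one byte per character
14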

So a set with 83,000 such strings takes approximately 1 MB. So far fairly
trivial. But that's just the memory used by the container (the set), not
the contents: 38 bytes * 83,000 strings is another 3 MB. Which of course
is trivial for a modern PC, but the OP is using 83 million such strings,
not 83 thousand, which gives us a grand total of at least 3 gigabytes. An
entry-level desktop PC these days generally has 2 GB of RAM, and an
entry-level notebook might have half a gig.
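
If you'd rather measure than take those numbers on faith, here is a small
sketch using the same kind of 14-character strings (exact figures depend
on your Python build, and scaling to 83 million assumes the per-element
costs stay roughly constant):

import sys

n = 83000
data = set("%014d" % i for i in range(n))       # 83,000 distinct 14-char strings
container = sys.getsizeof(data)                 # the set's own hash table
contents = sum(sys.getsizeof(x) for x in data)  # the string objects themselves
print("container: %.1f MB" % (container / 1e6))
print("contents:  %.1f MB" % (contents / 1e6))
print("83 million, scaled: %.1f GB" % ((container + contents) * 1000 / 1e9))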

You are pretty much screwed on a 32-bit system here. In my experience a
32-bit process can't store more than about 2.5 to 2.8 GB on the heap;
eventually malloc() fails, since large parts of the 4 GB address space are
reserved for other things (stack, entry point, shared library mappings,
error detection, etc.). Memory fragmentation isn't an issue here.
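
If you want to see where that ceiling is on a particular machine, a crude
probe is to allocate fixed-size chunks until the allocator gives up. This
deliberately exhausts memory, so only run it in a throwaway interpreter
(and note that on 64-bit Linux with overcommit the OOM killer may strike
before you ever see a MemoryError); the 64 MB chunk size is an arbitrary
choice:

chunks = []
try:
    while True:
        chunks.append(bytearray(64 * 1024 * 1024))   # grab 64 MB per iteration
except MemoryError:
    print("allocated roughly %d MB before MemoryError" % (len(chunks) * 64))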

Other ideas:

* Use a SQL database with an index on the data column. The index can
  optimize the "starting with" case (a minimal sqlite3 sketch follows
  below).

* Need to search for a string inside a large set of texts? Sounds like a
  job for a full-text search engine! Grab PyLucene and index your data in
  a Lucene index. An SSD helps a lot here.
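
Here is a minimal sketch of the SQL idea with the stdlib sqlite3 module
(the file, table and column names are made up for the example). With an
index on the column, both the exact-match lookup and the "starting with"
query become index lookups instead of full scans:

import sqlite3

conn = sqlite3.connect("strings.db")              # hypothetical database file
conn.execute("PRAGMA case_sensitive_like = ON")   # lets SQLite use the index for LIKE 'prefix%'
conn.execute("CREATE TABLE IF NOT EXISTS items (data TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_data ON items (data)")
conn.executemany("INSERT INTO items VALUES (?)",
                 (("%014d" % i,) for i in range(100000)))
conn.commit()

# exact membership test
hit = conn.execute("SELECT 1 FROM items WHERE data = ?",
                   ("00000000012345",)).fetchone()
print("exact match: %s" % (hit is not None))

# "starting with" query
rows = conn.execute("SELECT data FROM items WHERE data LIKE ? LIMIT 5",
                    ("0000000001234%",)).fetchall()
print("prefix matches: %s" % ([r[0] for r in rows],))

For PyLucene the setup is heavier, so no sketch here, but the principle is
the same: build the index once on disk and let the engine do the lookups
instead of holding 83 million strings in RAM.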

Christian

