s = "12345678901234"
assert len(s) == 14
import sys
sys.getsizeof(s)
38
So a single 14-char string takes 38 bytes. Make that at least 40 bytes,
since you have to take memory alignment into account.
So a set with 83,000 such strings takes approximately 1 MB. So far fairly
trivial. But that's just the memory used by the container (the set), not
the contents: 38 bytes * 83,000 strings = another 3 MB. That is of course
trivial for a modern PC, but the OP is using 83 million such strings,
not 83 thousand, which gives us a grand total of at least 3 gigabytes. An
entry-level desktop PC these days generally has 2 GB, and an entry-level
notebook might have half a gig.
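To put numbers on that, here is a minimal back-of-the-envelope sketch; the
40-byte and 8-byte figures are assumptions for a 32-bit CPython build, not
measured values:

per_string = 40                       # ~38 bytes per 14-char str, aligned
per_slot = 8                          # set slot (hash + pointer) on 32 bit
n = 83 * 10**6                        # 83 million strings

strings_gb = n * per_string / 2.0**30
set_gb = n * per_slot * 2 / 2.0**30   # hash tables overallocate, assume ~2x
print("strings: %.1f GB, set: %.1f GB" % (strings_gb, set_gb))
# prints roughly "strings: 3.1 GB, set: 1.2 GB"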
You are pretty much screwed on a 32-bit system here. In my experience a
32-bit process can't store more than about 2.5 to 2.8 GB on the heap.
Eventually malloc() will fail, since large parts of the 4 GB address space
are reserved for other things like the stack, the entry point, shared
library mappings, error detection etc. Memory fragmentation isn't an issue
here.
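If you want to see where that ceiling sits on your own box, a crude probe
like the one below works; run it at your own risk (on a 32-bit build it
typically stops somewhere around 2.5 to 3 GB, on 64 bit it will simply eat
all your RAM and swap):

chunks = []
allocated = 0
try:
    while True:
        chunks.append(bytearray(10 * 2**20))   # grab 10 MiB at a time
        allocated += 10
except MemoryError:
    print("MemoryError after ~%d MiB" % allocated)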
Other ideas:
* Use a SQL database with an index on the data column. The index could
optimize the "starting with" case (a sqlite3 sketch follows after this
list).
* You need to search for a string inside a large set of texts? Sounds
like a job for a full-text search engine! Grab PyLucene and index your
data in a Lucene index. An SSD helps a lot here.
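
For the SQL idea, here is a minimal sketch with the sqlite3 module from the
standard library; the table and column names are made up for illustration,
and the prefix search is expressed as an index-friendly range query (the
"next prefix" trick works here because the keys are all digits):

import sqlite3

conn = sqlite3.connect("strings.db")
conn.execute("CREATE TABLE IF NOT EXISTS data (s TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_s ON data (s)")
conn.executemany("INSERT INTO data VALUES (?)",
                 [("12345678901234",), ("12345999999999",)])
conn.commit()

# "starts with 123456" becomes a range scan over the index on s
rows = conn.execute("SELECT s FROM data WHERE s >= ? AND s < ?",
                    ("123456", "123457"))
print([r[0] for r in rows])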
Christian