Karin Lagesen wrote: > I have approx 83 million strings, all 14 characters long. I need to be > able to take another string and find out whether this one is present > within the 83 million strings. [...] > I run out of memory building both the set and the dictionary, so > what I seem to be left with is the list
I agree with the recommendations to try modules that maintain the set on disk (sqlite, dbm, shelve), but here's an in-memory solution: Hash to about 8 million buckets, and use a compact representation for the several strings in each bucket. That gives you most of the speed and avoids most of the memory overhead. Python makes it easy: class ConstLenStrSet: """ Keep a set of strings all of the same length. Support set.add() and Python's 'in' test. """ def __init__(self, strlen=14, nbuckets=8000000): assert strlen > 0 and nbuckets > 0 self.strlen = strlen self.nbuckets = nbuckets | 1 self.table = [''] * self.nbuckets def _hashstr(self, s): return hash(s) % self.nbuckets def add(self, s): assert len(s) == self.strlen if s not in self: self.table[self._hashstr(s)] += s def __contains__(self, s): data = self.table[self._hashstr(s)] for i in range(0, len(data), self.strlen): if data[i : i + self.strlen] == s: return True return False # Rudimentary demo-test: from random import choice from string import letters def rnd_str(slen=14): return ''.join(choice(letters) for _ in range(slen)) # Test against Python's built-in set sref = set() for i in range(830000): sref.add(rnd_str()) print('Strings generated.') sset = sset = ConstLenStrSet(14, 80000) for s in sref: sset.add(s) print 'ConstLenStrSet loaded.' for s in sref: assert s in sset for i in range(1000): s = rnd_str() if s in sref: print 'That was unlucky.' else: assert s not in sset If building the set is too slow, and you know you don't have a lot of duplicate strings, you can use a faster insert method that doesn't check whether the string is already in the set: def add_quick(self, s): assert len(s) == self.strlen self.table[self._hashstr(s)] += s -- --Bryan -- http://mail.python.org/mailman/listinfo/python-list