[EMAIL PROTECTED] wrote: >> depending on your application, a bloom filter might be a good enough: >> >> http://en.wikipedia.org/wiki/Bloom_filter >> > > Thanks (everyone) for the comments. I like the idea of the bloom > filter or using an md5 hash, since a rare collision will not be a > show-stopper in my case.
btw, since my brain is on vacation, could anyone who already has all the necessary background information in their head (Paul?) tell me if using a chopped-up MD5 or SHA-1 (or whatever) digest as the hash functions for a bloom filter is a good idea... i.e. something like: import sha try: all except NameError: def all(S): for x in S: if not x: return False return True def digest(data): d = sha.sha(data).digest() return [d[i:i+4] for i in range(0, len(d), 4)] class bloom(object): def __init__(self): self.filter = [set() for i in range(len(digest("")))] def add(self, data): for s, p in zip(self.filter, digest(data)): s.add(p) def __contains__(self, data): return all(p in s for s, p in zip(self.filter, digest(data))) b = bloom() b.add("showbiz") b.add("absolution") print "hullaballo" in b print "showbiz" in b print "origin of symmetry" in b (and if this isn't a good idea, why it isn't). </F> -- http://mail.python.org/mailman/listinfo/python-list