On May 4, 12:42 pm, Terry Reedy <tjre...@udel.edu> wrote: > On 5/4/2011 2:17 PM, Raymond Hettinger wrote: > > > Here's a 22-line beauty for a classic and amazing algorithm: > >http://bit.ly/bloom_filter > > > The wiki article on the algorithm is brief and well-written: > >http://en.wikipedia.org/wiki/Bloom_filter > > As I understand the article, the array of num_bits should have > num_probes (more or less independent) bits set for each key. But as I > understand the code > > for i in range(self.num_probes): > h, array_index = divmod(h, num_words) > h, bit_index = divmod(h, 32) > yield array_index, 1 << bit_index > > the same bit is being set or tested num_probes times. The last three > lines have no dependence on i that I can see, so they appear to do the > same thing each time. This seems like a bug.
The 512 bits in h are progressively eaten-up between iterations. So each pass yields a different (array index, bit_mask) pair. It's easy to use the interactive prompt to show that different probes are produced on each pass: >>> bf = BloomFilter(num_bits=1000, num_probes=8) >>> pprint(list(bf.get_probes('Alabama'))) [(19, 1073741824), (11, 64), (9, 134217728), (25, 1024), (24, 33554432), (6, 16), (7, 16777216), (22, 1048576)] The 512 bits are uncorrelated -- otherwise sha512 wouldn't be much of a cryptographic hash ;) The fifty state example in the recipe is a reasonable demonstration that the recipe works as advertised. It successfully finds all fifty states (the true positives) and it tries 100,000 negatives resulting in only a handful of false negatives. That should be somewhat convincing that it all works. Raymond ------- follow my other python tips and recipes on twitter: @raymondh -- http://mail.python.org/mailman/listinfo/python-list