[EMAIL PROTECTED] wrote:

>> depending on your application, a bloom filter might be a good enough:
>>
>>      http://en.wikipedia.org/wiki/Bloom_filter
>>
>
> Thanks (everyone) for the comments.  I like the idea of the bloom
> filter or using an md5 hash, since a rare collision will not be a
> show-stopper in my case.

btw, since my brain is on vacation, could anyone who already has all the 
necessary
background information in their head (Paul?) tell me if using a chopped-up MD5 
or
SHA-1 (or whatever) digest as the hash functions for a bloom filter is a good 
idea...

i.e. something like:

    import sha

    try:
        all
    except NameError:
        def all(S):
             for x in S:
                 if not x:
                    return False
             return True

    def digest(data):
        d = sha.sha(data).digest()
        return [d[i:i+4] for i in range(0, len(d), 4)]

    class bloom(object):

        def __init__(self):
            self.filter = [set() for i in range(len(digest("")))]

        def add(self, data):
            for s, p in zip(self.filter, digest(data)):
                s.add(p)
        def __contains__(self, data):

            return all(p in s for s, p in zip(self.filter, digest(data)))

    b = bloom()
    b.add("showbiz")
    b.add("absolution")
    print "hullaballo" in b
    print "showbiz" in b
    print "origin of symmetry" in b

(and if this isn't a good idea, why it isn't).

</F> 



-- 
http://mail.python.org/mailman/listinfo/python-list

Reply via email to