On Thu, 22 Mar 2007, Tzahi Fadida wrote:

> Advocating is a strong word, I was suggesting. How exactly would you address 128GB, 256GB? Unless of course your system board and CPU support such sizes...

The board does not care about sizes. Disk requests are serialized and can be of any length. Implementing a 1024-bit-wide address counter to be pushed out serially to hardware is trivial even with an 8-bit CPU from 20 years ago. The problem is speed and size. Anything that fits in one register can be manipulated in one clock cycle or less; that is fast. Thus jumping around in an index or a tree using 32-bit integers on 32-bit hardware with ~3-4 GB of RAM is not a problem. When more than 32 bits are needed, things slow down a lot: a single operation can go from 1 clock cycle to 4-5. When the 'local' data size is larger than the cache, things slow down to main-memory speed. When that is not large enough, you start paying for disk seeks and VM swapping.

Strictly speaking, a 32-bit machine could handle 100 TB or more of data, working as a Turing machine on the 100 TB 'tape' (or tapes), but you really wouldn't want that (insert memories of recompiling Linux on an i386 with 8 MB of RAM here). One of the reasons RDBMSs 'like' to run on 'bare' partitions is exactly this: they prefer to use their own seek, hash, and striping algorithms instead of relying on the OS. So by the time any dimension of the problem touches 2^32, things can slow down by a factor of 10-1000 or worse (even without script kiddies using SQL and PHP4 scripts to handle the output).
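Just to make the register-size cost concrete, here is a quick C sketch (my own illustration, not anything from the thread; the struct, names, and numbers are made up): to a 32-bit CPU a single 64-bit addition is really two 32-bit additions plus a hand-propagated carry, so one logical operation already costs several instructions before cache and disk effects even enter the picture.

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* A 64-bit value split into two 32-bit halves, the way a 32-bit CPU
 * holds it in registers. */
struct u64_pair { uint32_t lo, hi; };

static struct u64_pair add64(struct u64_pair a, struct u64_pair b)
{
    struct u64_pair r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* propagate the carry by hand */
    return r;
}

int main(void)
{
    struct u64_pair a = { 0xffffffffu, 0x00000001u };  /* 0x1ffffffff */
    struct u64_pair b = { 0x00000001u, 0x00000000u };  /* 1           */
    struct u64_pair r = add64(a, b);
    /* prints 0x0000000200000000 */
    printf("0x%08" PRIx32 "%08" PRIx32 "\n", r.hi, r.lo);
    return 0;
}

The compiler emits this for you when you use a 64-bit type on 32-bit hardware; the point is only that each "wide" step multiplies the instruction count.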

> As for 3GB, as I understand it you must have either 2GB, 4GB, ... for these Bloom filters, i.e. you need 4GB, which does not leave much room for your kernel and apps on 32-bit systems (and btw swapping is not really an option with this hash func). As for "expensive", some memories

There is no fixed size for Bloom filters; they are probabilistic. You can make a 10-bit Bloom filter, it depends on your hash algorithm. Its performance is limited by how many bits of storage you give it, how good your hash is, and how *few* items you store in it. Take a look at the pigeonhole principle (and the birthday coincidence probability) for clues about the probability limits involved. Bayesian filters (like bogofilter) are also closely related to this afaik. The point is that there are ways to build very fast speculative indexes over huge data sets without actually storing the data. This can reduce the number of actual (expensive) lookups by orders of magnitude. I am not sure what algorithms Google uses internally, but from my adventures with web publishing and so on I would say that they are using similar principles.
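For the record, here is a minimal Bloom filter sketch in C (the 1024-bit size and the two toy hash functions are arbitrary choices of mine, purely for illustration): every inserted key sets a couple of bits, and a lookup can only answer "definitely not present" or "maybe present", which is exactly what lets it front a much more expensive real lookup.

#include <stdint.h>
#include <stdio.h>

#define FILTER_BITS 1024   /* any size works; fewer bits = more false positives */

static uint8_t filter[FILTER_BITS / 8];   /* zero-initialized bit array */

/* Two cheap, roughly independent string hashes (FNV-1a and djb2);
 * a serious filter would use better ones and more of them. */
static uint32_t hash1(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h % FILTER_BITS;
}

static uint32_t hash2(const char *s)
{
    uint32_t h = 5381u;
    while (*s) h = h * 33u + (uint8_t)*s++;
    return h % FILTER_BITS;
}

static void bloom_add(const char *s)
{
    uint32_t a = hash1(s), b = hash2(s);
    filter[a / 8] |= 1u << (a % 8);
    filter[b / 8] |= 1u << (b % 8);
}

/* 0 = definitely absent, 1 = maybe present (could be a false positive). */
static int bloom_maybe_contains(const char *s)
{
    uint32_t a = hash1(s), b = hash2(s);
    return ((filter[a / 8] >> (a % 8)) & 1) &&
           ((filter[b / 8] >> (b % 8)) & 1);
}

int main(void)
{
    bloom_add("postgres");
    bloom_add("mysql");
    printf("postgres: %d\n", bloom_maybe_contains("postgres")); /* 1 */
    printf("oracle:   %d\n", bloom_maybe_contains("oracle"));   /* almost certainly 0 */
    return 0;
}

Note that the filter never stores the keys themselves, only bits, which is why it can sit in a few hundred bytes (or a few hundred MB) in front of a data set that would never fit in RAM.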

Peter
