On Thu, 22 Mar 2007, Tzahi Fadida wrote:

> Advocating is a strong word, I was suggesting. How exactly would you address 128GB, 256GB? Unless of course your system board and CPU support such sizes...

The board does not care about sizes. Disk requests are serialized and can be of any length. Implementing a 1024-bit-wide address counter to be pushed out serially to hardware is trivial even with an 8-bit CPU from 20 years ago. The problem is speed and size. Anything that fits in one register can be manipulated in one clock cycle or less; that is fast. Thus jumping around in an index or a tree using 32-bit integers on 32-bit hardware with ~3-4 GB of RAM is not a problem. When more than 32 bits are needed, things slow down a lot: a single operation can go from 1 clock cycle to 4-5. When the 'local' data size is larger than the cache, things slow down to main-memory speed. When that is not large enough, you start paying for disk seeks and VM swapping.

Strictly speaking, a 32-bit machine could handle 100 TB or more of data, working as a Turing machine on the 100 TB 'tape' (or tapes), but you really wouldn't want that (insert memories of recompiling Linux on an i386 with 8 MB of RAM here). One of the reasons RDBMSs 'like' to run on 'bare' partitions is exactly this: they prefer to use their own seek, hash, and striping algorithms instead of relying on the OS. So by the time any dimension of the problem touches 2^32, things can slow down by a factor of 10-1000 or worse (even without script kiddies using SQL and PHP4 scripts to handle the output).
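Just to make the register-size cost concrete, here is a quick C sketch (my own illustration, not anything from the thread; the struct, names, and numbers are made up): to a 32-bit CPU a single 64-bit addition is really two 32-bit additions plus a hand-propagated carry, so one logical operation already costs several instructions before cache and disk effects even enter the picture.

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* A 64-bit value split into two 32-bit halves, the way a 32-bit CPU
 * holds it in registers. */
struct u64_pair { uint32_t lo, hi; };

static struct u64_pair add64(struct u64_pair a, struct u64_pair b)
{
    struct u64_pair r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* propagate the carry by hand */
    return r;
}

int main(void)
{
    struct u64_pair a = { 0xffffffffu, 0x00000001u };  /* 0x1ffffffff */
    struct u64_pair b = { 0x00000001u, 0x00000000u };  /* 1           */
    struct u64_pair r = add64(a, b);
    /* prints 0x0000000200000000 */
    printf("0x%08" PRIx32 "%08" PRIx32 "\n", r.hi, r.lo);
    return 0;
}

The compiler emits this for you when you use a 64-bit type on 32-bit hardware; the point is only that each "wide" step multiplies the instruction count.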

> As for 3GB, as I understand it you must have either 2GB, 4GB, ... for these Bloom filters, i.e. you need 4GB, which does not leave much room for your kernel and apps on 32-bit systems (and btw swapping is not really an option with this hash func). As for "expensive", some memories

There is no fixed size for Bloom filters; they are probabilistic. You can make a 10-bit Bloom filter, it depends on your hash algorithm. Its performance is limited by how many bits of storage you give it, how good your hash is, and how *few* items you store in it. Take a look at the pigeonhole principle (and the birthday coincidence probability) for clues about the probability limits involved. Bayesian filters (like bogofilter) are also closely related to this afaik. The point is that there are ways to build very fast speculative indexes over huge data sets without actually storing the data. This can reduce the number of actual (expensive) lookups by orders of magnitude. I am not sure what algorithms Google uses internally, but from my adventures with web publishing and so on I would say that they are using similar principles.
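For the record, here is a minimal Bloom filter sketch in C (the 1024-bit size and the two toy hash functions are arbitrary choices of mine, purely for illustration): every inserted key sets a couple of bits, and a lookup can only answer "definitely not present" or "maybe present", which is exactly what lets it front a much more expensive real lookup.

#include <stdint.h>
#include <stdio.h>

#define FILTER_BITS 1024   /* any size works; fewer bits = more false positives */

static uint8_t filter[FILTER_BITS / 8];   /* zero-initialized bit array */

/* Two cheap, roughly independent string hashes (FNV-1a and djb2);
 * a serious filter would use better ones and more of them. */
static uint32_t hash1(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h % FILTER_BITS;
}

static uint32_t hash2(const char *s)
{
    uint32_t h = 5381u;
    while (*s) h = h * 33u + (uint8_t)*s++;
    return h % FILTER_BITS;
}

static void bloom_add(const char *s)
{
    uint32_t a = hash1(s), b = hash2(s);
    filter[a / 8] |= 1u << (a % 8);
    filter[b / 8] |= 1u << (b % 8);
}

/* 0 = definitely absent, 1 = maybe present (could be a false positive). */
static int bloom_maybe_contains(const char *s)
{
    uint32_t a = hash1(s), b = hash2(s);
    return ((filter[a / 8] >> (a % 8)) & 1) &&
           ((filter[b / 8] >> (b % 8)) & 1);
}

int main(void)
{
    bloom_add("postgres");
    bloom_add("mysql");
    printf("postgres: %d\n", bloom_maybe_contains("postgres")); /* 1 */
    printf("oracle:   %d\n", bloom_maybe_contains("oracle"));   /* almost certainly 0 */
    return 0;
}

Note that the filter never stores the keys themselves, only bits, which is why it can sit in a few hundred bytes (or a few hundred MB) in front of a data set that would never fit in RAM.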

Peter
