Atri, * Atri Sharma (atri.j...@gmail.com) wrote: > I just popped in here on Simon's advice to put an idea I had about > optimizing hash joins on this thread.
I'd encourage reading the thread a bit first, in the future.. :) > Essentially, I was thinking of using bloom filters in the part where > we match the actual hash values of the outer relation's tuple and the > hash table. We could do a bloom filter lookup first, and if it gives a > positive, then we can proceed like we do currently, since we could > have a false positive. However, if we have a negative from the bloom > filter, then, we can skip that tuple since bloom filters never give > false negatives. I suggested this up-thread already, but it's not really a bloom filter as there's only one hash function available- I can't see us requiring every data type to provide multiple hash functions. Still, I do think breaking the single 32-bit hash key space up into fixed-sized chunks and then having a bitfield array which we test against (very similar to how the visibility map works) to see if there's any chance that a given hash key exists might be valuable. The problem is that, because we don't have multiple hash functions, it's not clear how much "empty" space we'd actually end up with. This needs to be tested with different data sets and different sizes for the bitfield (leading to correspondingly different sizes for the amount of key space covered by a single bit), along with performance metrics which determine how expensive it is to build and use the bitfield. > Another thing we could potentially look at is that in the case when > the hash table doesnt fit into memory, and we have to do disk lookups, > then, we could do bloom filter lookups first in order to select the > tuples to be read from disk(only the positive ones). We could have a bitfield filter (as I described above) created for each bucket and then test against that before considering if we actually have to go look in that bucket, yes. I'm not sure if that's quite what you were thinking, but I can see how a bitfield per bucket might work. If you were suggesting something else, please clarify. Thanks, Stephen
signature.asc
Description: Digital signature