Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
alamb commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2984503937 > > We already have the full table in memory, so we can not really save anything by compressing it into a bloom filter. > > Agreed: if we're not concerned with larger-than-m

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
xudong963 commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2984129180 In my experience, bloom filters mostly have a negative impact, and in most cases min-max works fine. -- This is an automated message from the Apache Git Service

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2984064963 > 2. Easier to serialize across the wire Yeah, that part is of course true (especially for larger tables, which you probably want to avoid sending over the network). the `1. More

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
mbutrovich commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983990746 > 2. Easier to serialize across the wire This is actually something I've started looking at in the last day and got stuck pretty quickly trying to serialize the HashBro

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
mbutrovich commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983928323 > We already have the full table in memory, so we can not really save anything by compressing it into a bloom filter. Agreed: if we're not concerned with larger-than-me

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983932272 @Dandandan the two ways I thought a bloom filter would be advantageous: 1. More performant if applied to each row than the full hash table, although I admit I haven't poked a

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983934401 Either way I think we can decouple the two things: there seems to be some interest in adding a bloom filter expression, that can be developed in parallel with the hash join pus

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983912966 I believe it should also be possible to share the `Arc` within the created `PhysicalExpr`. This avoids building a bloom filter. We already have the full table in memory

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
mbutrovich commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983844232 So the high level for Spark is that there’s a BloomFilterAgg aggregate function that returns a byte sequence representing the bloom filter. The BloomFilterMightContain scalar
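
For readers who have not used this pattern, here is a rough Rust sketch of the two halves described above: building a filter from values and turning it into bytes, then probing it after deserialization. Everything in it (the `BloomFilter` type, its methods, the byte layout, the double-hashing scheme) is hypothetical and only for illustration; it is not the Spark, Comet, `fastbloom`, or DataFusion implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, minimal Bloom filter for illustration only.
pub struct BloomFilter {
    bits: Vec<u8>,   // bit array, packed 8 bits per byte
    num_hashes: u32, // number of probe positions per item
}

impl BloomFilter {
    /// Create an empty filter with `num_bytes * 8` bits.
    pub fn new(num_bytes: usize, num_hashes: u32) -> Self {
        assert!(num_bytes > 0 && num_hashes > 0);
        Self { bits: vec![0; num_bytes], num_hashes }
    }

    /// Two base hashes for double hashing: probe i uses h1 + i * h2.
    /// Assumption: DefaultHasher is fine for a demo; its output is not
    /// guaranteed stable across Rust releases, so a real wire format
    /// would need a fixed, specified hash function.
    fn base_hashes<T: Hash>(item: &T) -> (u64, u64) {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let h1 = hasher.finish();
        // Feed a constant into the same hasher to derive a second value.
        0x9e37_79b9_u32.hash(&mut hasher);
        let h2 = hasher.finish() | 1; // force an odd step
        (h1, h2)
    }

    fn bit_index(&self, h1: u64, h2: u64, i: u32) -> usize {
        let num_bits = (self.bits.len() * 8) as u64;
        (h1.wrapping_add((i as u64).wrapping_mul(h2)) % num_bits) as usize
    }

    /// "Aggregate" side: fold one value into the filter.
    pub fn insert<T: Hash>(&mut self, item: &T) {
        let (h1, h2) = Self::base_hashes(item);
        for i in 0..self.num_hashes {
            let bit = self.bit_index(h1, h2, i);
            self.bits[bit / 8] |= 1u8 << (bit % 8);
        }
    }

    /// "Scalar" side: false means definitely absent, true means maybe present.
    pub fn might_contain<T: Hash>(&self, item: &T) -> bool {
        let (h1, h2) = Self::base_hashes(item);
        (0..self.num_hashes).all(|i| {
            let bit = self.bit_index(h1, h2, i);
            self.bits[bit / 8] & (1u8 << (bit % 8)) != 0
        })
    }

    /// Serialize as a 4-byte little-endian hash count followed by the raw bits.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = self.num_hashes.to_le_bytes().to_vec();
        out.extend_from_slice(&self.bits);
        out
    }

    /// Rebuild a filter from `to_bytes` output; returns None on malformed input.
    pub fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() <= 4 {
            return None;
        }
        let num_hashes = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
        if num_hashes == 0 {
            return None;
        }
        Some(Self { bits: bytes[4..].to_vec(), num_hashes })
    }
}
```

In this sketch the build side would call `insert` for every join key and ship `to_bytes()`; the probe side would call `from_bytes` once and then `might_contain` per row. Inserted keys always return true after the round trip, while keys that were never inserted may occasionally return true as well.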

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
mbutrovich commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983871673 > I can point to the relevant code for interest, but we may want a different solution for core DF. Maybe this code would at least make it easy for us to have a performa

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
mbutrovich commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983856838 I had also played with building one with the `fastbloom` crate in the hash join operator, but lacked the ability to push it anywhere useful in the plan, which we now have.

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
alamb commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983585902 FYI @mbutrovich -- I believe you were working on something like this related to Comet -- maybe it is worth a look / review here to make sure the design works with Comet too if po

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-17 Thread via GitHub
dharanad commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2982806130 @adriangb I would like to work on this

[I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-17 Thread via GitHub
adriangb opened a new issue, #16435: URL: https://github.com/apache/datafusion/issues/16435 ### Is your feature request related to a problem or challenge? Related to #15512 I think this is a first step towards HashJoinExec pushdown. I think we should model that as `col >= hash