alamb commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2984503937
> > We already have the full table in memory, so we can not really save
anything by compressing it into a bloom filter.
>
> Agreed: if we're not concerned with larger-than-m
xudong963 commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2984129180
From my past experience, bloom filters mostly have a negative impact.
In most cases, min-max works fine.
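
For readers unfamiliar with the min-max approach mentioned above, here is a minimal sketch, assuming integer join keys. The names `key_bounds` and `prune_by_bounds` are hypothetical and are not DataFusion APIs; the sketch only illustrates the idea of pruning probe-side rows by the build side's key range.

```rust
/// Hypothetical helper: inclusive (min, max) bounds of the build-side join keys.
/// Returns None when the build side is empty.
pub fn key_bounds(build_keys: &[i64]) -> Option<(i64, i64)> {
    let min = *build_keys.iter().min()?;
    let max = *build_keys.iter().max()?;
    Some((min, max))
}

/// Hypothetical helper: keep only probe keys that fall inside the bounds.
/// A key inside the range may still be absent from the build side, so this
/// is a coarse pre-filter, not a replacement for the join itself.
pub fn prune_by_bounds(probe_keys: &[i64], bounds: Option<(i64, i64)>) -> Vec<i64> {
    match bounds {
        Some((min, max)) => probe_keys
            .iter()
            .copied()
            .filter(|k| (min..=max).contains(k))
            .collect(),
        // Empty build side: nothing can match in an inner join.
        None => Vec::new(),
    }
}
```

The range check costs two comparisons per row, which is one reason it tends to be cheap relative to hashing every probe key.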
--
This is an automated message from the Apache Git Service
Dandandan commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2984064963
> 2. Easier to serialize across the wire
Yeah that part is of course true (especially larger tables you probably want
to avoid sending over the network).
the `1. More
mbutrovich commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983990746
> 2. Easier to serialize across the wire
This is actually something I've started looking at in the last day and got
stuck pretty quickly trying to serialize the HashBro
mbutrovich commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983928323
> We already have the full table in memory, so we can not really save
anything by compressing it into a bloom filter.
Agreed: if we're not concerned with larger-than-me
adriangb commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983932272
@Dandandan the two ways I thought a bloom filter would be advantageous:
1. More performant if applied to each row than the full hash table, although
I admit I haven't poked a
adriangb commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983934401
Either way I think we can decouple the two things: there seems to be some
interest in adding a bloom filter expression, that can be developed in parallel
with the hash join pus
Dandandan commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983912966
I believe it should also be possible to share the `Arc` within
the created `PhysicalExpr`.
This avoids building a bloom filter. We already have the full table in
memory
mbutrovich commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983844232
So the high level for Spark is that there’s a BloomFilterAgg aggregate
function that returns a byte sequence representing the bloom filter. The
BloomFilterMightContain scalar
mbutrovich commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983871673
> I can point to the relevant code for interest, but we may want a different
solution for core DF.
Maybe this code would at least make it easy for us to have a performa
mbutrovich commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983856838
I had also played with building one with the `fastbloom` crate in the hash
join operator, but lacked the ability to push it anywhere useful in the plan,
which we now have.
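
As a rough illustration of what building a bloom filter from build-side join keys involves, here is a minimal standard-library sketch. It does not use the `fastbloom` API; `SimpleBloom`, its sizing parameters, and the double-hashing scheme are assumptions chosen for brevity, not what any of the commenters implemented.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical minimal bloom filter (illustrative only, not `fastbloom`).
pub struct SimpleBloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl SimpleBloom {
    /// `num_bits` is rounded up to a multiple of 64; at least one hash is used.
    pub fn new(num_bits: usize, num_hashes: u32) -> Self {
        let words = (num_bits.max(64) + 63) / 64;
        Self {
            bits: vec![0; words],
            num_bits: (words * 64) as u64,
            num_hashes: num_hashes.max(1),
        }
    }

    /// Derive two base hashes; the i-th bit index is a + i * b (double hashing).
    fn hash_pair<T: Hash>(item: &T) -> (u64, u64) {
        let mut h1 = DefaultHasher::new();
        item.hash(&mut h1);
        let a = h1.finish();
        let mut h2 = DefaultHasher::new();
        a.hash(&mut h2);
        0x9E37_79B9_7F4A_7C15u64.hash(&mut h2);
        // Force the stride to be odd so it is never zero.
        (a, h2.finish() | 1)
    }

    /// Set the bits for one build-side key.
    pub fn insert<T: Hash>(&mut self, item: &T) {
        let (a, b) = Self::hash_pair(item);
        for i in 0..self.num_hashes as u64 {
            let bit = a.wrapping_add(i.wrapping_mul(b)) % self.num_bits;
            self.bits[(bit / 64) as usize] |= 1u64 << (bit % 64);
        }
    }

    /// False means the key was definitely never inserted;
    /// true means it may have been (false positives are possible).
    pub fn might_contain<T: Hash>(&self, item: &T) -> bool {
        let (a, b) = Self::hash_pair(item);
        (0..self.num_hashes as u64).all(|i| {
            let bit = a.wrapping_add(i.wrapping_mul(b)) % self.num_bits;
            self.bits[(bit / 64) as usize] & (1u64 << (bit % 64)) != 0
        })
    }
}
```

A probe-side filter would call `might_contain` on each row's key and drop rows that return false. Because the filter never produces false negatives, no matching row is lost; rows that pass still need the real hash-table lookup.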
alamb commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983585902
FYI @mbutrovich -- I believe you were working on something like this
related to Comet -- maybe it is worth a look / review here to make sure the
design works with comet too if po
dharanad commented on issue #16435:
URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2982806130
@adriangb I would like to work on this
adriangb opened a new issue, #16435:
URL: https://github.com/apache/datafusion/issues/16435
### Is your feature request related to a problem or challenge?
Related to #15512
I think this is a first step towards HashJoinExec pushdown. I think we
should model that as `col >= hash