Re: [DISCUSS] FLIP-324: Introduce Runtime Filter for Flink Batch Jobs

liu ron Mon, 19 Jun 2023 20:19:06 -0700

Hi, Jing

Thanks for your feedback.


> Afaiu, the runtime filter will only be injected when the gap between the
build data size and probe data size is big enough. Let's take an extreme
example: the small table (build side) has one row and the large
table (probe side) contains tens of billions of rows. This will be the ideal
use case for the runtime filter and the improvement will be significant. Is
this correct?

Yes, you are right.

> Speaking of the "Conditions of injecting Runtime Filter" in the FLIP, will
the value of max-build-data-size and min-probe-data-size depend on the
parallelism config? I.e. with the same data-size setting, is it possible to
inject or not inject runtime filters by adjusting the parallelism?

First, let me clarify two points. The first is that whether to inject a
RuntimeFilter is decided in the optimization phase, and we currently do not
consider operator parallelism in the SQL optimization phase; parallelism is
set at the ExecNode level. The second is that batch mode now uses the
AdaptiveBatchScheduler[1] by default, which derives the parallelism of a
downstream operator from the amount of data produced by its upstream
operator, that is, the parallelism is determined adaptively at runtime.
Given both points, we cannot decide whether to inject a BloomFilter in the
optimization stage based on parallelism.
A more important point is that the purpose of the Runtime Filter is to reduce
the amount of shuffled data, and thus the amount of data processed by
the downstream join operator. Therefore, as I understand it, regardless of
the parallelism of the probe side, the amount of data in the shuffle is
reduced after inserting the Runtime Filter, which benefits the join
operator, so whether to insert the RuntimeFilter does not depend on
the parallelism.
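
To illustrate, here is a minimal sketch in Python (not Flink planner code) of
a planning-time check that looks only at estimated data sizes. The helper name
and the exact comparison operators are assumptions for illustration; the 10MB
and 10GB thresholds are the default values mentioned in this thread.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def size_condition_met(est_build_bytes, est_probe_bytes,
                       max_build_bytes=10 * MB, min_probe_bytes=10 * GB):
    """Hypothetical planning-time size check (not the Flink API).

    Only estimated sizes are inputs. Operator parallelism is deliberately
    not a parameter, because it is not known in the optimization phase.
    """
    return (est_build_bytes <= max_build_bytes
            and est_probe_bytes >= min_probe_bytes)


# Small build side, large probe side: candidate for a runtime filter.
print(size_condition_met(1 * MB, 50 * GB))
# Probe side below the minimum: no runtime filter.
print(size_condition_met(1 * MB, 1 * GB))
```

Whatever parallelism the scheduler later picks, the result of such a check
stays the same, because parallelism never enters the function.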

> Does it make sense to reconsider the formula of ratio
calculation to help users easily control the filter injection?

Row count is considered only when ndv is not available. When the size options
use their default values and ndv cannot be obtained, it is true that this
condition may always hold, but this does not seem to cause any harm, and users
are also likely to change the size values. One question: how do you think
we should make it easier for users to control the filter injection?
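
For reference, a small sketch in Python of the ratio arithmetic under
discussion. It assumes the filter ratio is computed as 1 - build/probe, which
is consistent with the numbers in the quoted question (0.999 for a 1000x gap),
and that row counts replace ndv when either ndv is missing. The function name
and the None convention are illustrative, not Flink code.

```python
def filter_ratio(build_ndv, probe_ndv, build_rows, probe_rows):
    """Estimated filter ratio, assumed here to be 1 - build / probe.

    ndv values are used when both are known; otherwise row counts are
    the fallback (None marks an ndv that could not be estimated).
    """
    if build_ndv is not None and probe_ndv is not None:
        build, probe = build_ndv, probe_ndv
    else:
        build, probe = build_rows, probe_rows
    return 1.0 - build / probe


MIN_FILTER_RATIO = 0.5  # default value mentioned in this thread

# Probe side 1000x, 100x and 10x the build side, with ndv unavailable.
for factor in (1000, 100, 10):
    ratio = filter_ratio(None, None, 1_000_000, 1_000_000 * factor)
    print(factor, round(ratio, 3), ratio >= MIN_FILTER_RATIO)
```

All three cases pass the 0.5 threshold, which matches the observation that
the ratio condition rarely rejects anything once the size conditions hold.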


[1]:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/elastic_scaling/#adaptive-batch-scheduler

Best,
Ron

Jing Ge <[email protected]> wrote on Tue, Jun 20, 2023 at 07:11:

> Hi Lijie,
>
> Thanks for your proposal. It is a really nice feature. I'd like to ask a
> few questions to understand your thoughts.
>
> Afaiu, the runtime filter will only be injected when the gap between the
> build data size and probe data size is big enough. Let's take an extreme
> example: the small table (build side) has one row and the large
> table (probe side) contains tens of billions of rows. This will be the ideal
> use case for the runtime filter and the improvement will be significant. Is
> this correct?
>
> Speaking of the "Conditions of injecting Runtime Filter" in the FLIP, will
> the value of max-build-data-size and min-probe-data-size depend on the
> parallelism config? I.e. with the same data-size setting, is it possible to
> inject or not inject runtime filters by adjusting the parallelism?
>
> In the FLIP, there are default values for the new configuration parameters
> that will be used to check the injection condition. If ndv cannot be
> estimated, row count will be used. Given the max-build-data-size is 10MB
> and the min-probe-data-size is 10GB, in the worst case, the min-filter-ratio
> will be 0.999, i.e. the probeNdv is 1000 times buildNdv. If we consider
> the duplication and the fact that the large table might have more columns
> than the small table, the probeNdv should still be 100 or 10 times
> buildNdv, which ends up with a min-filter-ratio equal to 0.99 or 0.9. Both
> are bigger than the default value 0.5 in the FLIP. If I am not mistaken,
> commonly, a min-filter-ratio less than 0.99 will always allow injecting the
> runtime filter. Does it make sense to reconsider the formula of ratio
> calculation to help users easily control the filter injection?
>
> Best regards,
> Jing
>
> On Mon, Jun 19, 2023 at 4:42 PM Lijie Wang <[email protected]>
> wrote:
>
> > Hi Stefan,
> >
> > >> bypassing the dataflow
> > I believe it's a possible solution, but it may require more coordination
> > and extra conditions (such as DFS), I do think it should be excluded from
> > the first version. I'll put it in Future+Improvements as a potential
> > improvement.
> >
> > Thanks again for your quick reply :)
> >
> > Best,
> > Lijie
> >
> > Stefan Richter <[email protected]> wrote on Mon, Jun 19, 2023 at 20:51:
> >
> > >
> > > Hi Lijie,
> > >
> > > I think you understood me correctly. But I would not consider this a true
> > > cyclic dependency in the dataflow because I would not suggest sending the
> > > filter through an edge in the job graph from join to scan. I’d rather
> > > bypass the stream graph for bringing the filter to the scan. For
> > > example, the join could report the filter after the build phase, e.g. to
> > > the JM or a predefined DFS folder. And when the probe scan is scheduled,
> > > the JM provides the filter information to the scan when it gets scheduled
> > > for execution, or the scan looks in DFS to see if it can find any filter
> > > that it can use as part of initialization. I’m not suggesting to do it
> > > exactly in those ways, but just to show what I mean by "bypassing the
> > > dataflow".
> > >
> > > Anyways, I’m fine with excluding this optimization from the current FLIP
> > > if you believe it would be hard to implement in Flink.
> > >
> > > Best,
> > > Stefan
> > >
> > >
> > > > On 19. Jun 2023, at 14:07, Lijie Wang <[email protected]> wrote:
> > > >
> > > > Hi Stefan,
> > > >
> > > > If I understand correctly (I hope so), the hash join operator needs to send
> > > > the bloom filter to the probe scan, and the probe scan also needs to send the
> > > > filtered data to the hash join operator. This means there will be a cycle
> > > > in the data flow, and it will be hard for current Flink to schedule this kind
> > > > of graph. I admit we can find a way to do this, but that's probably a
> > > > bit outside the scope of this FLIP. So let's do these complex
> > > > optimizations later, WDYT?
> > > >
> > > > Best,
> > > > Lijie
> > > >
> > > > Stefan Richter <[email protected]> wrote on Mon, Jun 19, 2023 at 18:15:
> > > >
> > > >> Hi Lijie,
> > > >>
> > > > >> Exactly, my proposal was to build the bloom filter in the hash join
> > > > >> operator. I don’t know all the details of the implementation of Flink’s
> > > > >> join operator, but I’d assume that even if the join is a two-input
> > > > >> operator, it gets scheduled for 2 different pipelines: first the build
> > > > >> phase with the scan from the dimension table, and after that’s completed,
> > > > >> the probe phase with the scan of the fact table. I’m not proposing to use
> > > > >> the bloom filter only in the join operator, but rather to send the bloom
> > > > >> filter to the probe scan before starting the probe. I assume this would
> > > > >> require some form of side channel to transport the filter and
> > > > >> coordination to tell the sources that such a filter is available. I cannot
> > > > >> answer how hard those would be to implement, but the idea doesn’t seem
> > > > >> impossible to me.
> > > >>
> > > >> Best,
> > > >> Stefan
> > > >>
> > > >>
> > > > >>> On 19. Jun 2023, at 11:56, Lijie Wang <[email protected]> wrote:
> > > >>>
> > > >>> Hi Stefan,
> > > >>>
> > > > >>> Now I know what you mean about point 1. But currently it is infeasible for
> > > > >>> Flink, because the building of the hash table is inside the hash join
> > > > >>> operator. The hash join operator has two inputs: it first processes the
> > > > >>> data of the build input to build a hash table, and then uses the hash table
> > > > >>> to process the data of the probe input. If we want to use the built hash
> > > > >>> table to deduplicate data for the bloom filter, we must put the bloom filter
> > > > >>> inside the hash join operator. However, in this way, the data reaching the
> > > > >>> join operator cannot be reduced (the shuffle/network overhead cannot be
> > > > >>> reduced), which is not what we expected.
> > > > >>>
> > > > >>> Regarding the filter type, I agree with you: more types of filters can
> > > > >>> bring further optimization, and it is in our future plan (we described it
> > > > >>> in the section Future+Improvements#More+underlying+implementations).
> > > >>>
> > > >>> Best,
> > > >>> Lijie
> > > >>>
> > > > >>> Stefan Richter <[email protected]> wrote on Mon, Jun 19, 2023 at 15:58:
> > > >>>
> > > >>>>
> > > >>>> Hi Lijie,
> > > >>>>
> > > > >>>> thanks for your response, I agree with what you said about points 2 and 3.
> > > > >>>> Let me explain a bit more about point 1. This would not apply to all types
> > > > >>>> of joins, and my suggestion is also *not* to build a hash table only for
> > > > >>>> the purpose of building the bloom filter.
> > > > >>>> I was thinking about the scenario of a hash join, where you would build
> > > > >>>> the hash table as part of the join algorithm anyways and then use the
> > > > >>>> keyset of that hash table to 1) have better insights about NDV and 2) be
> > > > >>>> able to construct the bloom filter without duplicates and therefore faster.
> > > > >>>> So the preconditions where I would use this are that you are building a
> > > > >>>> hash table as part of the join and you know you are not building for a key
> > > > >>>> column (because there would be no duplicates to eliminate). Then your bloom
> > > > >>>> filter construction could already benefit from the deduplication work that
> > > > >>>> was done for building the hash table.
> > > > >>>>
> > > > >>>> I also wanted to point out that besides bloom filter and IN filter, you
> > > > >>>> could also think of other types of filters that can become interesting for
> > > > >>>> certain distributions and metadata. For example, if you have min/max
> > > > >>>> information about columns and partitions, you could have a bit vector
> > > > >>>> represent equi-sized ranges of the key space between min and max,
> > > > >>>> have the bits represent which parts of the range are present, and push
> > > > >>>> that information down to the scan.
> > > >>>>
> > > >>>> Best,
> > > >>>> Stefan
> > > >>>>
> > > >>>>
> > > > >>>>> On 19. Jun 2023, at 08:26, Lijie Wang <[email protected]> wrote:
> > > >>>>>
> > > >>>>> Hi Stefan,
> > > >>>>>
> > > > >>>>> Thanks for your feedback. Let me briefly summarize the optimization points
> > > > >>>>> you mentioned above (please correct me if I'm wrong):
> > > > >>>>>
> > > > >>>>> 1. Build an extra hash table for deduplication before building the bloom
> > > > >>>>> filter.
> > > > >>>>> 2. Use the two-phase approach to build the bloom filter (first local, then
> > > > >>>>> OR-combine).
> > > > >>>>> 3. Use blocked bloom filters to improve the cache efficiency.
> > > > >>>>>
> > > > >>>>> For the above 3 points, I have the following questions or opinions:
> > > > >>>>>
> > > > >>>>> For point 1, it seems that building a hash table also requires traversing
> > > > >>>>> all build side data, and the overhead seems to be the same as building a
> > > > >>>>> bloom filter directly? In addition, the hash table will take up more space
> > > > >>>>> when the amount of data is large, which is why we choose to use a bloom
> > > > >>>>> filter instead of a hash table.
> > > > >>>>>
> > > > >>>>> For point 2, I think it's a good idea to use the two-phase approach to
> > > > >>>>> build the bloom filter. But rather than directly broadcasting the local
> > > > >>>>> bloom filters to the probe side, I prefer to introduce a global node for the
> > > > >>>>> OR-combine (like two-phase-agg[1]), then broadcast the combined bloom filter
> > > > >>>>> to the probe side. The latter can reduce the amount of data transferred by
> > > > >>>>> the network. I will change the FLIP like this.
> > > > >>>>>
> > > > >>>>> For point 3, I think it's a nice optimization, but I prefer to put it in
> > > > >>>>> the future improvements. There is already an implementation of bloom filter
> > > > >>>>> in Flink, and we can simply reuse it. Introducing a new bloom filter
> > > > >>>>> implementation introduces some complexity (we need to implement it, test
> > > > >>>>> it, etc.), and is not the focus of this FLIP.
> > > >>>>>
> > > >>>>> [1]
> > > >>>>>
> > > >>>>
> > > >>
> > >
> >
> https://www.google.com/url?q=https://www.google.com/url?q%3Dhttps://www.google.com/url?q%253Dhttps://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tuning/%252523local-global-aggregation%2526source%253Dgmail-imap%2526ust%253D1687760804000000%2526usg%253DAOvVaw2eoXknGWmG4TSiznxtHFWG%26source%3Dgmail-imap%26ust%3D1687773407000000%26usg%3DAOvVaw3V4Sv1o119cpU4xfP0ifkj&source=gmail-imap&ust=1687781326000000&usg=AOvVaw033xxrkenJpx27XzCVKsda
> > > >>>>>
> > > >>>>> Best,
> > > >>>>> Lijie
> > > >>>>>
> > > > >>>>> Stefan Richter <[email protected]> wrote on Fri, Jun 16, 2023 at 16:45:
> > > >>>>>
> > > >>>>>> Hi,
> > > >>>>>>
> > > > >>>>>> Thanks for the proposal of this feature! I have a question about the
> > > > >>>>>> filter build and some suggestions for potential improvements. First, I
> > > > >>>>>> wonder why you suggest running the filter builder as a separate operator
> > > > >>>>>> with parallelism 1. I’d suggest integrating the distributed filter build
> > > > >>>>>> with the hash table build phase as follows:
> > > > >>>>>>
> > > > >>>>>> 1. Build the hash table completely in each subtask.
> > > > >>>>>> 2. The keyset of the hash table gives us a precise NDV count for every
> > > > >>>>>> subtask.
> > > > >>>>>> 3. Build a filter from the subtask hash table. For low cardinality tables,
> > > > >>>>>> I’d go with the suggested optimization of IN-filter.
> > > > >>>>>> 4. Each build subtask transfers the local bloom filter to all probe
> > > > >>>>>> operators.
> > > > >>>>>> 5. On the probe operator we can either probe against the individual
> > > > >>>>>> filters, or we OR-combine all subtask filters into an aggregated bloom
> > > > >>>>>> filter.
> > > > >>>>>>
> > > > >>>>>> I’m suggesting this because inserting into a (larger) bloom
> > > > >>>>>> filter can be costly, especially once the filter exceeds cache sizes, and is
> > > > >>>>>> therefore better parallelized. First inserting into the hash table also
> > > > >>>>>> deduplicates the keys, and we avoid inserting records twice into the bloom
> > > > >>>>>> filter. If we want to improve cache efficiency for the build of larger
> > > > >>>>>> filters, we could structure them as blocked bloom filters, where the filter
> > > > >>>>>> is separated into blocks and all bits of one key go only into one block.
> > > > >>>>>> That allows us to apply software-managed buffering to first group keys that
> > > > >>>>>> go into the same partition (ideally fitting into cache) and then bulk load
> > > > >>>>>> partitions once we have collected enough keys for one round of loading.
> > > >>>>>>
> > > >>>>>> Best,
> > > >>>>>> Stefan
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> <
> > > >>>>
> > > >>
> > >
> >
> https://www.google.com/url?q=https://www.google.com/url?q%3Dhttps://www.google.com/url?q%253Dhttps://www.confluent.io/%2526source%253Dgmail-imap%2526ust%253D1687760804000000%2526usg%253DAOvVaw3p0tBjuVsWz3SLYyPQukfL%26source%3Dgmail-imap%26ust%3D1687773407000000%26usg%3DAOvVaw1THgA9fFMrOd7QpGpwiRx6&source=gmail-imap&ust=1687781326000000&usg=AOvVaw1f-3D9-2lZDGsvFBjeFlvn
> > > >>>>>
> > > >>>>>> Stefan Richter
> > > >>>>>> Principal Engineer II
> > > >>>>>>
> > > >>>>>> Follow us:  <
> > > >>>>>>
> > > >>>>
> > > >>
> > >
> >
> https://www.google.com/url?q=https://www.google.com/url?q%3Dhttps://www.google.com/url?q%253Dhttps://www.confluent.io/blog?utm_source%25253Dfooter%252526utm_medium%25253Demail%252526utm_campaign%25253Dch.email-signature_type.community_content.blog%2526source%253Dgmail-imap%2526ust%253D1687760804000000%2526usg%253DAOvVaw2VU_JTYB24Wp4bF2JshdU7%26source%3Dgmail-imap%26ust%3D1687773407000000%26usg%3DAOvVaw37ghBlQPqP0tTXCfNJCqKv&source=gmail-imap&ust=1687781326000000&usg=AOvVaw20v4QTnSyAz_HAHbMyVY7J
> > > >>>>>
> > > >>>>>> <
> > > >>>>
> > > >>
> > >
> >
> https://www.google.com/url?q=https://www.google.com/url?q%3Dhttps://www.google.com/url?q%253Dhttps://twitter.com/ConfluentInc%2526source%253Dgmail-imap%2526ust%253D1687760804000000%2526usg%253DAOvVaw2irnDxUAhXR0N8FUk2orze%26source%3Dgmail-imap%26ust%3D1687773407000000%26usg%3DAOvVaw0ItT553mEuA5KaeJWSH36D&source=gmail-imap&ust=1687781326000000&usg=AOvVaw1mNvHaIwjIKU_gqOuDYLDK
> > > >>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > > >>>>>>> On 15. Jun 2023, at 13:35, Lijie Wang <[email protected]> wrote:
> > > >>>>>>>
> > > >>>>>>> Hi,  Benchao and Aitozi,
> > > >>>>>>>
> > > >>>>>>> Thanks for your feedback about this FLIP.
> > > >>>>>>>
> > > >>>>>>> @Benchao
> > > >>>>>>>
> > > > >>>>>>>>> I think it would be reasonable to also support "pipeline shuffle" if
> > > > >>>>>>>>> possible.
> > > > >>>>>>> As I said above, the runtime filter can work well with all shuffle modes,
> > > > >>>>>>> including pipeline shuffle.
> > > > >>>>>>>
> > > > >>>>>>>>> if the RuntimeFilterBuilder could be done quicker than the RuntimeFilter
> > > > >>>>>>>>> operator, it can still filter out additional data afterwards.
> > > > >>>>>>> I think the main purpose of the runtime filter is to reduce the shuffle data
> > > > >>>>>>> and the data arriving at the join. Although eagerly running the large
> > > > >>>>>>> table side can process data in advance, most of the data may be
> > > > >>>>>>> irrelevant, causing huge shuffle overhead and slowing the join. In
> > > > >>>>>>> addition, if the join is a hash join, the probe side of the hash join also
> > > > >>>>>>> needs to wait for its build side to complete, so the large table side is
> > > > >>>>>>> likely to be back-pressured.
> > > > >>>>>>> In addition, I'd rather not add too many configuration options in the
> > > > >>>>>>> first version, which may make it more difficult to use (users need to
> > > > >>>>>>> understand a lot of internal implementation details). Maybe it could be a
> > > > >>>>>>> future improvement (if it's worthwhile)?
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> @Aitozi
> > > >>>>>>>
> > > > >>>>>>>>> IMO, in the current implementation two source table operators will be
> > > > >>>>>>>>> executed simultaneously.
> > > > >>>>>>> The example in the FLIP uses blocking shuffle (I will add this point to the
> > > > >>>>>>> FLIP). The runtime filter is generally chained with the large table side to
> > > > >>>>>>> reduce the shuffle data (as shown in Figure 2 of the FLIP). The job vertices
> > > > >>>>>>> should be scheduled in topological order, so the large table side can only
> > > > >>>>>>> be scheduled after the RuntimeFilterBuilder finishes.
> > > > >>>>>>>
> > > > >>>>>>>>> Are there some tests to show the default value of
> > > > >>>>>>>>> table.optimizer.runtime-filter.min-probe-data-size 10G is a good default
> > > > >>>>>>>>> value.
> > > > >>>>>>> It's not tested yet, but it will be done before merging the code. The
> > > > >>>>>>> current value refers to systems such as Spark and Hive. Before code merging,
> > > > >>>>>>> we will test on TPC-DS 10 T to find an optimal set of values. If you have
> > > > >>>>>>> relevant experience with it, suggestions are welcome.
> > > > >>>>>>>
> > > > >>>>>>>>> What's the representation of the runtime filter node in planner ?
> > > > >>>>>>> As shown in Figure 1 of the FLIP, we intend to add two new physical nodes,
> > > > >>>>>>> RuntimeFilterBuilder and RuntimeFilter.
> > > >>>>>>>
> > > >>>>>>> Best,
> > > >>>>>>> Lijie
> > > >>>>>>>
> > > > >>>>>>> Aitozi <[email protected]> wrote on Thu, Jun 15, 2023 at 15:52:
> > > >>>>>>>
> > > >>>>>>>> Hi Lijie,
> > > >>>>>>>>
> > > > >>>>>>>> Nice to see this valuable feature. After reading the FLIP I have some
> > > > >>>>>>>> questions below:
> > > > >>>>>>>>
> > > > >>>>>>>>> Schedule the TableSource(dim) first.
> > > > >>>>>>>>
> > > > >>>>>>>> How does it know to schedule the TableSource(dim) first? IMO, in the
> > > > >>>>>>>> current implementation two source table operators will be executed
> > > > >>>>>>>> simultaneously.
> > > > >>>>>>>>
> > > > >>>>>>>>> If the data volume on the probe side is too small, the overhead of
> > > > >>>>>>>>> building runtime filter is not worth it.
> > > > >>>>>>>>
> > > > >>>>>>>> Are there some tests to show that the default value of
> > > > >>>>>>>> table.optimizer.runtime-filter.min-probe-data-size, 10G, is a good default
> > > > >>>>>>>> value? The same goes for table.optimizer.runtime-filter.max-build-data-size.
> > > > >>>>>>>>
> > > > >>>>>>>>> the runtime filter can be pushed down along the probe side, as close to
> > > > >>>>>>>>> data sources as possible
> > > > >>>>>>>>
> > > > >>>>>>>> What's the representation of the runtime filter node in the planner? Is it
> > > > >>>>>>>> a Filter node?
> > > >>>>>>>>
> > > >>>>>>>> Best,
> > > >>>>>>>>
> > > >>>>>>>> Aitozi.
> > > >>>>>>>>
> > > > >>>>>>>> Benchao Li <[email protected]> wrote on Thu, Jun 15, 2023 at 14:30:
> > > >>>>>>>>
> > > >>>>>>>>> Hi Lijie,
> > > >>>>>>>>>
> > > > >>>>>>>>> Regarding the shuffle mode, I think it would be reasonable to also support
> > > > >>>>>>>>> "pipeline shuffle" if possible.
> > > > >>>>>>>>>
> > > > >>>>>>>>> "pipeline shuffle" is essential for OLAP/MPP computing. Although this has
> > > > >>>>>>>>> not been much exposed to users for now, I know a few companies that use
> > > > >>>>>>>>> Flink as an MPP computing engine, and there is an ongoing effort[1] to make
> > > > >>>>>>>>> this usage more powerful.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Back to your concern that "Even if the RuntimeFilter becomes running before
> > > > >>>>>>>>> the RuntimeFilterBuilder finished, it will not process any data and will
> > > > >>>>>>>>> occupy resources": whether it benefits us depends on the scale of data. If
> > > > >>>>>>>>> the RuntimeFilterBuilder could be done quicker than the RuntimeFilter
> > > > >>>>>>>>> operator, it can still filter out additional data afterwards. Hence in my
> > > > >>>>>>>>> opinion, we do not need to make the edge between RuntimeFilterBuilder and
> > > > >>>>>>>>> RuntimeFilter BLOCKING only; at least it can be configured.
> > > >>>>>>>>>
> > > >>>>>>>>> [1]
> > > >>>>>>
> > > >>>>
> > > >>
> > >
> >
> https://www.google.com/url?q=https://www.google.com/url?q%3Dhttps://www.google.com/url?q%253Dhttps://www.google.com/url?q%25253Dhttps://issues.apache.org/jira/browse/FLINK-25318%252526source%25253Dgmail-imap%252526ust%25253D1687433776000000%252526usg%25253DAOvVaw3GqdpuiCqegqRLDv1PjMiL%2526source%253Dgmail-imap%2526ust%253D1687760804000000%2526usg%253DAOvVaw1oNzOlNn0UCDtz1M9jAw1x%26source%3Dgmail-imap%26ust%3D1687773407000000%26usg%3DAOvVaw3Zt14Wvxs_b8ghD0dIgPfH&source=gmail-imap&ust=1687781326000000&usg=AOvVaw0HsmkkqPeZGZOBvFiA8NOA
> > > >>>>>>>>>
> > > > >>>>>>>>> Lijie Wang <[email protected]> wrote on Thu, Jun 15, 2023 at 14:18:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi Yuxia,
> > > >>>>>>>>>>
> > > > >>>>>>>>>> I made a mistake in the above response.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> The runtime filter can work well with all shuffle modes. However, hybrid
> > > > >>>>>>>>>> shuffle and blocking shuffle are currently recommended for batch jobs
> > > > >>>>>>>>>> (pipeline shuffle is not recommended).
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> One more thing to mention here is that we will force the edge between
> > > > >>>>>>>>>> RuntimeFilterBuilder and RuntimeFilter to be BLOCKING (regardless of which
> > > > >>>>>>>>>> BatchShuffleMode is set), because the RuntimeFilter really doesn’t need to
> > > > >>>>>>>>>> run before the RuntimeFilterBuilder finished. Even if the RuntimeFilter
> > > > >>>>>>>>>> becomes running before the RuntimeFilterBuilder finished, it will not
> > > > >>>>>>>>>> process any data and will occupy resources.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Best,
> > > >>>>>>>>>> Lijie
> > > >>>>>>>>>>
> > > > >>>>>>>>>> Lijie Wang <[email protected]> wrote on Thu, Jun 15, 2023 at 09:48:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hi Yuxia,
> > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks for your feedback. The answers to your questions are as follows:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> 1. Yes, the row count comes from the statistics of the underlying table (or
> > > > >>>>>>>>>>> is estimated based on the statistics of the underlying table, if the build
> > > > >>>>>>>>>>> side or probe side is not a TableScan). If the statistics are unavailable,
> > > > >>>>>>>>>>> we will not inject a runtime filter (as you said, we can hardly evaluate
> > > > >>>>>>>>>>> the benefits). Besides, AFAIK, the estimated data size of the build side is
> > > > >>>>>>>>>>> also based on the row count statistics, that is, if the statistics are
> > > > >>>>>>>>>>> unavailable, the requirement
> > > > >>>>>>>>>>> "table.optimizer.runtime-filter.max-build-data-size" cannot be evaluated
> > > > >>>>>>>>>>> either. I'll add this point to the FLIP.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> 2.
> > > > >>>>>>>>>>> Estimated data size does not meet the requirement (in planner optimization
> > > > >>>>>>>>>>> phase) -> No filter
> > > > >>>>>>>>>>> Estimated data size meets the requirement (in planner optimization phase),
> > > > >>>>>>>>>>> but the real data size does not meet the requirement (in execution phase)
> > > > >>>>>>>>>>> -> Fake filter
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> 3. Yes, the runtime filter is only for batch jobs/blocking shuffle.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Best,
> > > >>>>>>>>>>> Lijie
> > > >>>>>>>>>>>
> > > > >>>>>>>>>>> yuxia <[email protected]> wrote on Wed, Jun 14, 2023 at 20:37:
> > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Thanks Lijie for starting this discussion. Excited to see that runtime
> > > > >>>>>>>>>>>> filter is to be implemented in Flink.
> > > > >>>>>>>>>>>> I have a few questions about it:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> 1: As the FLIP said, `if the ndv cannot be estimated, use row count
> > > > >>>>>>>>>>>> instead`. So, does the row count come from the statistics of the
> > > > >>>>>>>>>>>> underlying table? What if the statistics are also unavailable, considering
> > > > >>>>>>>>>>>> users may not always remember to generate statistics in production?
> > > > >>>>>>>>>>>> I'm wondering whether it makes sense to just disable the runtime filter if
> > > > >>>>>>>>>>>> statistics are unavailable, since in that case we can hardly evaluate the
> > > > >>>>>>>>>>>> benefits of the runtime filter.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> 2: The FLIP said: "We will inject the runtime filters only if the
> > > > >>>>>>>>>>>> following requirements are met:xxx", but it also said, "Once this limit is
> > > > >>>>>>>>>>>> exceeded, it will output a fake filter(which always returns true)" in the
> > > > >>>>>>>>>>>> `RuntimeFilterBuilderOperator` part. These seem contradictory, so I'm
> > > > >>>>>>>>>>>> wondering what the real behavior is: will no filter be injected, or a fake
> > > > >>>>>>>>>>>> filter?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> 3: Does it also mean the runtime filter can only take effect with blocking
> > > > >>>>>>>>>>>> shuffle?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Best regards,
> > > >>>>>>>>>>>> Yuxia
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> ----- 原始邮件 -----
> > > >>>>>>>>>>>> 发件人: "ron9 liu" <[email protected] <mailto:
> > > [email protected]> <mailto:[email protected]>
> > > >> <mailto:[email protected]>
> > > >>>> <mailto:[email protected]>>
> > > >>>>>>>>>>>> 收件人: "dev" <[email protected] <mailto:
> > [email protected]>
> > > <mailto:[email protected]>
> > > >> <mailto:[email protected]>
> > > >>>> <mailto:[email protected]>>
> > > >>>>>>>>>>>> 发送时间: 星期三, 2023年 6 月 14日 下午 5:29:28
> > > >>>>>>>>>>>> 主题: Re: [DISCUSS] FLIP-324: Introduce Runtime Filter for
> > Flink
> > > >>>> Batch
> > > >>>>>>>>>> Jobs
> > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Thanks Lijie for starting this discussion. Runtime Filter is a common
> > > > >>>>>>>>>>>> optimization to improve join performance that has been adopted by many
> > > > >>>>>>>>>>>> computing engines such as Spark, Doris, etc. Flink is a unified streaming
> > > > >>>>>>>>>>>> and batch computing engine, and we are continuously optimizing its batch
> > > > >>>>>>>>>>>> performance. Runtime filter is a general performance optimization technique
> > > > >>>>>>>>>>>> that can improve the performance of Flink batch jobs, so we are introducing
> > > > >>>>>>>>>>>> it for batch as well.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Looking forward to all feedback.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Best,
> > > >>>>>>>>>>>> Ron
> > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Lijie Wang <[email protected]> wrote on Wed, Jun 14, 2023 at 17:17:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hi devs
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Ron Liu, Gen Luo and I would like to start a discussion about FLIP-324:
> > > > >>>>>>>>>>>>> Introduce Runtime Filter for Flink Batch Jobs[1]
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Runtime Filter is a common optimization to improve join performance. It is
> > > > >>>>>>>>>>>>> designed to dynamically generate filter conditions for certain Join queries
> > > > >>>>>>>>>>>>> at runtime to reduce the amount of scanned or shuffled data, avoid
> > > > >>>>>>>>>>>>> unnecessary I/O and network transmission, and speed up the query. Its
> > > > >>>>>>>>>>>>> working principle is to first build a filter (e.g. bloom filter) based on
> > > > >>>>>>>>>>>>> the data on the small table side (build side), then pass this filter to the
> > > > >>>>>>>>>>>>> large table side (probe side) to filter out the irrelevant data on it. This
> > > > >>>>>>>>>>>>> can reduce the data reaching the join and improve performance.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> You can find more details in the FLIP-324[1]. Looking forward to your
> > > > >>>>>>>>>>>>> feedback.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> [1]
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>
> > > >>>>
> > > >>
> > >
> >
> https://www.google.com/url?q=https://www.google.com/url?q%3Dhttps://www.google.com/url?q%253Dhttps://www.google.com/url?q%25253Dhttps://cwiki.apache.org/confluence/display/FLINK/FLIP-324%252525253A%2525252BIntroduce%2525252BRuntime%2525252BFilter%2525252Bfor%2525252BFlink%2525252BBatch%2525252BJobs%252526source%25253Dgmail-imap%252526ust%25253D1687433776000000%252526usg%25253DAOvVaw0ke1ZHcJ--A1QgsbB84MHA%2526source%253Dgmail-imap%2526ust%253D1687760804000000%2526usg%253DAOvVaw21E3CQyayeBTYztmOnwMcz%26source%3Dgmail-imap%26ust%3D1687773407000000%26usg%3DAOvVaw0xVu0zYYNRmh8u8aq7uSi3&source=gmail-imap&ust=1687781326000000&usg=AOvVaw1LXwtWT177350iKD3sKCEt
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>> Ron & Gen & Lijie
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> --
> > > >>>>>>>>>
> > > >>>>>>>>> Best,
> > > >>>>>>>>> Benchao Li
> > >
> > >
> >
>
