Thanks Lijie for the comments!

1. For the Hive source, dynamic parallelism inference in batch scenarios is a
superset of static parallelism inference. As a follow-up task, we can consider
changing the default value of 'table.exec.hive.infer-source-parallelism' to
false.
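As a concrete illustration of that follow-up: a user who wants this behavior today could already opt out of static inference per session. A minimal sketch in Flink SQL (assuming the SQL client's SET syntax; the option name is the one mentioned above, the value shown is the proposed new default):

```sql
-- Disable static source parallelism inference for Hive tables in this session
SET 'table.exec.hive.infer-source-parallelism' = 'false';
```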
2. I think that both dynamic parallelism inference and static parallelism
inference have their own use cases. Currently, for streaming sources and other
sources that are not sensitive to runtime information, the benefits of dynamic
parallelism inference may not be significant. In such cases, we can continue
to use static parallelism inference.

Thanks,
Xia

Lijie Wang <wangdachui9...@gmail.com> wrote on Wednesday, November 1, 2023 at 14:52:

> Hi Xia,
>
> Thanks for driving this FLIP, +1 for the proposal.
>
> I have 2 questions about the relationship between static inference and
> dynamic inference:
>
> 1. AFAIK, currently the Hive table source enables static inference by
> default. In this case, which one (static vs. dynamic) will take effect? I
> think it would be better if we could point this out in the FLIP.
>
> 2. As you mentioned above, dynamic inference is the most ideal way, so do
> we have a plan to deprecate static inference in the future?
>
> Best,
> Lijie
>
> Zhu Zhu <reed...@gmail.com> wrote on Tuesday, October 31, 2023 at 20:19:
>
> > Thanks for opening the FLIP and kicking off this discussion, Xia!
> > The proposed changes make up an important missing part of the dynamic
> > parallelism inference of the adaptive batch scheduler.
> >
> > Besides that, it is also a good step towards supporting dynamic
> > parallelism inference for streaming sources, e.g. allowing Kafka
> > sources to determine their parallelism automatically based on the
> > number of partitions.
> >
> > +1 for the proposal.
> >
> > Thanks,
> > Zhu
> >
> > Xia Sun <xingbe...@gmail.com> wrote on Tuesday, October 31, 2023 at 16:01:
> >
> > > Hi everyone,
> > > I would like to start a discussion on FLIP-379: Dynamic source
> > > parallelism inference for batch jobs [1].
> > >
> > > In general, there are three main ways to set source parallelism for
> > > batch jobs:
> > > (1) User-defined source parallelism.
> > > (2) Connector static parallelism inference.
> > > (3) Dynamic parallelism inference.
> > >
> > > Compared to manually setting parallelism, automatic parallelism
> > > inference is easier to use and can better adapt to varying data
> > > volumes each day. However, static parallelism inference cannot
> > > leverage runtime information, resulting in inaccurate parallelism
> > > inference. Therefore, for batch jobs, dynamic parallelism inference
> > > is the most ideal approach, but the adaptive batch scheduler's
> > > support for it is currently not very comprehensive.
> > >
> > > Therefore, we aim to introduce a general interface that enables the
> > > adaptive batch scheduler to dynamically infer the source parallelism
> > > at runtime. Please refer to the FLIP [1] document for more details
> > > about the proposed design and implementation.
> > >
> > > I also thank Zhu Zhu and Lijie Wang for their suggestions during the
> > > pre-discussion.
> > > Looking forward to your feedback and suggestions, thanks.
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-379%3A+Dynamic+source+parallelism+inference+for+batch+jobs
> > >
> > > Best regards,
> > > Xia
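To make the Kafka example from the quoted thread concrete: the core of such a dynamic inference boils down to requesting one subtask per partition, clamped by the upper bound the scheduler supplies at runtime. A minimal, Flink-independent sketch (class and method names are hypothetical illustrations, not the FLIP's actual interface):

```java
// Hypothetical sketch: a source infers its parallelism from runtime
// information (e.g. the Kafka partition count discovered at schedule time),
// capped by the scheduler-provided upper bound. Names are illustrative only.
public class ParallelismInferenceSketch {

    // Desired parallelism is one subtask per partition, clamped to the
    // range [1, maxParallelism] decided by the adaptive batch scheduler.
    public static int inferParallelism(int numPartitions, int maxParallelism) {
        return Math.max(1, Math.min(numPartitions, maxParallelism));
    }

    public static void main(String[] args) {
        // 12 partitions discovered, but the scheduler allows at most 8 subtasks
        System.out.println(inferParallelism(12, 8)); // prints 8
    }
}
```

The key difference from static inference is simply *when* `numPartitions` and `maxParallelism` become available: at job compile time for static inference, versus at scheduling time for dynamic inference.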