Hi Hangxiang, David,

Thank you for your replies. Your responses are very helpful.

Best regards,
Tony Wei

David Anderson <dander...@apache.org> wrote on Tue, Mar 14, 2023 at 12:12 PM:

> I believe there is some noticeable overhead if you are using the
> heap-based state backend, but with RocksDB I think the difference is
> negligible.
>
> David
>
> On Tue, Mar 7, 2023 at 11:10 PM Hangxiang Yu <master...@gmail.com> wrote:
> >
> > Hi, Tony.
> > "be detrimental to performance" refers to the extra space taken by the
> > key-group field, which may affect performance.
> > As we know, Flink writes the key group as a prefix of the key to speed
> > up rescaling.
> > So the format looks like: key group | key len | key | ......
> > The number of bytes used for the key group depends on the max
> > parallelism, as shown below:
> > ------------------------------------
> > max parallelism   bytes of key group
> > ------------------------------------
> >       128                 1
> >     32768                 2
> > ------------------------------------
> > So I think the cost will be very small if the actual key length is much
> > larger than 2 bytes.
> >
> > On Wed, Mar 8, 2023 at 1:06 PM Tony Wei <tony19920...@gmail.com> wrote:
> >>
> >> Hi experts,
> >>
> >>> Setting the maximum parallelism to a very large value can be
> detrimental to performance because some state backends have to keep
> internal data structures that scale with the number of key-groups (which
> are the internal implementation mechanism for rescalable state).
> >>>
> >>> Changing the maximum parallelism explicitly when recovery from
> original job will lead to state incompatibility.
> >>
> >>
> >> I read the section above in the official Flink documentation [1], and
> >> I would like to understand the details of this side effect.
> >>
> >> Suppose I have a Flink SQL job with large state and large parallelism,
> >> using RocksDB as the state backend.
> >> I would like to set the max parallelism to 32768, so that I don't have
> >> to worry about whether the max parallelism is divisible by the
> >> parallelism whenever I scale the job, because the number of key groups
> >> will not differ much between subtasks.
> >>
> >> I'm wondering whether this is good practice, because according to the
> >> official documentation it is not recommended.
> >> If possible, I would like to know the details of this side effect.
> >> Which state backends have this issue, and why?
> >> Any advice would be appreciated. Thanks in advance.
> >>
> >> Best regards,
> >> Tony Wei
> >>
> >> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#setting-the-maximum-parallelism
> >
> >
> >
> > --
> > Best,
> > Hangxiang.
>
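
As a rough illustration of the key-group prefix cost discussed in the quoted
thread, here is a minimal Python sketch. It assumes the table gives upper
bounds (1 byte up to a max parallelism of 128, 2 bytes up to 32768), and it
ignores the key-length field and everything after the key, so the ratio is
only an approximation. The function names are hypothetical and are not Flink
APIs.

```python
# Sketch of the key-group prefix size rule summarized in the quoted table.
# Assumption: 1 byte suffices for up to 128 key groups, 2 bytes otherwise,
# with 32768 taken as the largest allowed max parallelism.

def key_group_prefix_bytes(max_parallelism: int) -> int:
    """Bytes used for the key-group prefix of each serialized key."""
    if not 1 <= max_parallelism <= 32768:
        raise ValueError("max parallelism must be in [1, 32768]")
    return 1 if max_parallelism <= 128 else 2


def extra_prefix_overhead(max_parallelism: int, key_len: int) -> float:
    """Relative growth of (prefix + key) compared with a 1-byte prefix.

    Simplification: only the prefix and the raw key bytes are counted.
    """
    if key_len < 0:
        raise ValueError("key_len must be non-negative")
    extra = key_group_prefix_bytes(max_parallelism) - 1
    return extra / (1 + key_len)


if __name__ == "__main__":
    for key_len in (1, 7, 31, 127):
        ratio = extra_prefix_overhead(32768, key_len)
        print(f"key length {key_len:>3} bytes: +{ratio:.2%} with max parallelism 32768")
```

Under these assumptions, a 31-byte key grows by one byte on top of 32, which
is about 3%, while a 1-byte key grows by 50%. That matches the point that the
cost is small when the key is much longer than 2 bytes.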

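The question about divisibility can also be checked numerically. The sketch
below assumes key groups are split into contiguous ranges that are as even as
possible, one range per subtask. The range formula is my assumption of how
such a split is computed, and the function names are hypothetical rather than
Flink APIs. It compares how uneven the split is for a small and a large max
parallelism.

```python
# Sketch: how many key groups each subtask gets when max parallelism is not
# a multiple of the parallelism. Assumes a contiguous, near-even split.

def key_group_range(max_parallelism: int, parallelism: int, subtask: int):
    """Inclusive (start, end) key-group range assigned to one subtask."""
    if not 0 < parallelism <= max_parallelism:
        raise ValueError("need 0 < parallelism <= max_parallelism")
    if not 0 <= subtask < parallelism:
        raise ValueError("subtask index out of range")
    start = (subtask * max_parallelism + parallelism - 1) // parallelism
    end = ((subtask + 1) * max_parallelism - 1) // parallelism
    return start, end


def key_groups_per_subtask(max_parallelism: int, parallelism: int):
    """Number of key groups owned by each subtask, in subtask order."""
    counts = []
    for subtask in range(parallelism):
        start, end = key_group_range(max_parallelism, parallelism, subtask)
        counts.append(end - start + 1)
    return counts


def imbalance(max_parallelism: int, parallelism: int) -> float:
    """Ratio of the most loaded subtask to the least loaded one."""
    counts = key_groups_per_subtask(max_parallelism, parallelism)
    return max(counts) / min(counts)


if __name__ == "__main__":
    for max_p in (128, 32768):
        print(f"max parallelism {max_p}, parallelism 100: "
              f"imbalance {imbalance(max_p, 100):.4f}")
```

With a parallelism of 100, a max parallelism of 128 leaves some subtasks with
two key groups and others with one, while 32768 leaves every subtask with 327
or 328. This assumes keys are spread evenly over key groups, which real
workloads may not satisfy.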