Re: [DISCUSS] FLIP-423 ～FLIP-428: Introduce Disaggregated State Storage and Management in Flink 2.0

Zakelly Lan Tue, 19 Mar 2024 00:40:24 -0700

Hi Yunfeng,

Thanks for the suggestion!


I will reorganize FLIP-425 accordingly.


Best,
Zakelly
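
A minimal sketch of the two order modes for non-record events discussed
in this thread, not the FLIP-425 implementation. It assumes a simplified
model in which each event arrives at a time equal to its position in the
input, each record takes a fixed simulated latency for its asynchronous
state access, and a watermark fires only after all preceding records have
finished. The names Record, Watermark and finish_times are hypothetical
and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    latency: int  # simulated duration of the record's async state access


@dataclass(frozen=True)
class Watermark:
    name: str


def finish_times(events, strictly_ordered):
    """Return {event name: simulated finish time} for a list of events."""
    finish = {}
    barrier = 0  # earliest time at which a later record may start
    pending = 0  # latest finish time among the records seen so far
    for arrival, event in enumerate(events):
        if isinstance(event, Record):
            start = max(arrival, barrier)
            finish[event.name] = start + event.latency
            pending = max(pending, finish[event.name])
        else:
            # Assumption: a watermark fires only once every preceding
            # record has finished, in both modes.
            fired = max(arrival, pending, barrier)
            finish[event.name] = fired
            if strictly_ordered:
                # Strictly-ordered: later records wait for the watermark.
                barrier = fired
    return finish


events = [Record("r1", 5), Watermark("w"), Record("r2", 1)]
out_of_order = finish_times(events, strictly_ordered=False)
strict = finish_times(events, strictly_ordered=True)
```

In this model the record r2, which arrives after the watermark w,
finishes at time 3 in out-of-order mode (before w fires at time 5), while
strictly-ordered mode holds it back until w has fired, so it finishes at
time 6.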

On Tue, Mar 19, 2024 at 3:20 PM Yunfeng Zhou <[email protected]>
wrote:

> Hi Xintong and Zakelly,
>
> > 2. Regarding Strictly-ordered and Out-of-order of Watermarks
> I agree that watermarks can use only the out-of-order mode for now,
> because there is still no concrete example showing a correctness risk
> with it. However, the strictly-ordered mode should still be supported as
> the default option for non-record event types other than watermarks, at
> least for checkpoint barriers.
>
> I noticed that this information has already been documented in "For
> other non-record events, such as RecordAttributes ...", but it's at
> the bottom of the "Watermark" section, which might not be very
> obvious. Thus it might be better to reorganize the FLIP to state more
> clearly that the two order modes are designed for all non-record events,
> and which mode this FLIP chooses for each type of event.
>
> Best,
> Yunfeng
>
> On Tue, Mar 19, 2024 at 1:09 PM Xintong Song <[email protected]>
> wrote:
> >
> > Thanks for the quick response. Sounds good to me.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Tue, Mar 19, 2024 at 1:03 PM Zakelly Lan <[email protected]>
> wrote:
> >
> > > Hi Xintong,
> > >
> > > Thanks for sharing your thoughts!
> > >
> > > > 1. Regarding Record-ordered and State-ordered of processElement.
> > > >
> > > > I understand that while State-ordered likely provides better
> > > > performance, Record-ordered is sometimes required for correctness.
> > > > The question is how a user should choose between these two modes. My
> > > > concern is that such a decision may require users to have in-depth
> > > > knowledge about the Flink internals, and may lead to correctness
> > > > issues if State-ordered is chosen improperly.
> > > >
> > > > I'd suggest not exposing such a knob, at least in the first version.
> > > > That means always using Record-ordered for custom operators / UDFs,
> > > > and keeping State-ordered for internal usages (built-in operators)
> > > > only.
> > > >
> > >
> > > Indeed, users may not be able to choose the mode properly. I agree to
> > > keep such options for internal use.
> > >
> > >
> > > > 2. Regarding Strictly-ordered and Out-of-order of Watermarks.
> > > >
> > > > I'm not entirely sure about Strictly-ordered being the default, or
> > > > even being supported. From my understanding, a Watermark(T) only
> > > > suggests that all records with event time before T have arrived, and
> > > > it has nothing to do with whether records with event time after T
> > > > have arrived or not. From that perspective, preventing certain
> > > > records from arriving before a Watermark is never supported. I also
> > > > cannot come up with any use case where Strictly-ordered is necessary.
> > > > This implies the same issue as 1): how does the user choose between
> > > > the two modes?
> > > >
> > > > I'd suggest not exposing the knob to users and only supporting
> > > > Out-of-order, until we see a concrete use case where Strictly-ordered
> > > > is needed.
> > > >
> > >
> > > The semantics of watermarks do not define the sequence between a
> > > watermark and subsequent records. For the most part, this is
> > > inconsequential, except it may affect some current users who have
> > > previously relied on the implicit assumption of an ordered execution.
> > > I'd be fine with initially supporting only out-of-order processing. We
> > > may consider exposing the 'Strictly-ordered' mode once we encounter a
> > > concrete use case that necessitates it.
> > >
> > >
> > > > My philosophies behind not exposing the two config options are:
> > > > - There are already too many options in Flink that users barely know
> > > > how to use. I think Flink should try as much as possible to decide
> > > > its own behavior, rather than throwing all the decisions to the users.
> > > > - It's much harder to take back knobs than to introduce them.
> > > > Therefore, options should be introduced only if concrete use cases
> > > > are identified.
> > > >
> > >
> > > I agree to keep configurable items to a minimum, especially for the
> > > MVP. Given that we have the opportunity to refine the functionality
> > > before the framework transitions from @Experimental to @PublicEvolving,
> > > it makes sense to refrain from presenting user-facing options until we
> > > have ensured their necessity.
> > >
> > >
> > > Best,
> > > Zakelly
> > >
> > > On Tue, Mar 19, 2024 at 12:06 PM Xintong Song <[email protected]>
> > > wrote:
> > >
> > > > Sorry for joining the discussion late.
> > > >
> > > > I have two questions about FLIP-425.
> > > >
> > > > 1. Regarding Record-ordered and State-ordered of processElement.
> > > >
> > > > I understand that while State-ordered likely provides better
> > > > performance, Record-ordered is sometimes required for correctness.
> > > > The question is how a user should choose between these two modes. My
> > > > concern is that such a decision may require users to have in-depth
> > > > knowledge about the Flink internals, and may lead to correctness
> > > > issues if State-ordered is chosen improperly.
> > > >
> > > > I'd suggest not exposing such a knob, at least in the first version.
> > > > That means always using Record-ordered for custom operators / UDFs,
> > > > and keeping State-ordered for internal usages (built-in operators)
> > > > only.
> > > >
> > > > 2. Regarding Strictly-ordered and Out-of-order of Watermarks.
> > > >
> > > > I'm not entirely sure about Strictly-ordered being the default, or
> > > > even being supported. From my understanding, a Watermark(T) only
> > > > suggests that all records with event time before T have arrived, and
> > > > it has nothing to do with whether records with event time after T
> > > > have arrived or not. From that perspective, preventing certain
> > > > records from arriving before a Watermark is never supported. I also
> > > > cannot come up with any use case where Strictly-ordered is necessary.
> > > > This implies the same issue as 1): how does the user choose between
> > > > the two modes?
> > > >
> > > > I'd suggest not exposing the knob to users and only supporting
> > > > Out-of-order, until we see a concrete use case where Strictly-ordered
> > > > is needed.
> > > >
> > > >
> > > > My philosophies behind not exposing the two config options are:
> > > > - There are already too many options in Flink that users barely know
> > > > how to use. I think Flink should try as much as possible to decide
> > > > its own behavior, rather than throwing all the decisions to the users.
> > > > - It's much harder to take back knobs than to introduce them.
> > > > Therefore, options should be introduced only if concrete use cases
> > > > are identified.
> > > >
> > > > WDYT?
> > > >
> > > > Best,
> > > >
> > > > Xintong
> > > >
> > > >
> > > >
> > > > On Fri, Mar 8, 2024 at 2:45 AM Jing Ge <[email protected]>
> > > wrote:
> > > >
> > > > > +1 for Gyula's suggestion. I just finished FLIP-423 which introduced
> > > > > the intention of the big change and high level architecture. Great
> > > > > content btw! The only public interface change for this FLIP is one
> > > > > new config to use ForSt. It makes sense to have one dedicated
> > > > > discussion thread for each concrete system design.
> > > > >
> > > > > @Zakelly The links in your mail do not work except the last one,
> > > > > because the FLIP-xxx has been included into the url like
> > > > > https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864FLIP-425 .
> > > > >
> > > > > NIT fix:
> > > > >
> > > > > FLIP-424: https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864
> > > > >
> > > > > FLIP-425: https://lists.apache.org/thread/wxn1j848fnfkqsnrs947wh1wmj8n8z0h
> > > > >
> > > > > FLIP-426: https://lists.apache.org/thread/bt931focfl9971cwq194trmf3pkdsxrf
> > > > >
> > > > > FLIP-427: https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft
> > > > >
> > > > > FLIP-428: https://lists.apache.org/thread/vr8f91p715ct4lop6b3nr0fh4z5p312b
> > > > >
> > > > > Best regards,
> > > > > Jing
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Mar 7, 2024 at 10:14 AM Zakelly Lan <[email protected]
> >
> > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Thank you all for a lively discussion here, and it is a good
> time to
> > > > move
> > > > > > forward to more detailed discussions. Thus we open several
> threads
> > > for
> > > > > > sub-FLIPs:
> > > > > >
> > > > > > FLIP-424:
> > > > > https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864
> > > > > > FLIP-425
> > > > > > <
> > > > >
> > >
> https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864FLIP-425
> > > > >:
> > > > > > https://lists.apache.org/thread/wxn1j848fnfkqsnrs947wh1wmj8n8z0h
> > > > > > FLIP-426
> > > > > > <
> > > > >
> > >
> https://lists.apache.org/thread/wxn1j848fnfkqsnrs947wh1wmj8n8z0hFLIP-426
> > > > >:
> > > > > > https://lists.apache.org/thread/bt931focfl9971cwq194trmf3pkdsxrf
> > > > > > FLIP-427
> > > > > > <
> > > > >
> > >
> https://lists.apache.org/thread/bt931focfl9971cwq194trmf3pkdsxrfFLIP-427
> > > > >:
> > > > > > https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft
> > > > > > FLIP-428
> > > > > > <
> > > > >
> > >
> https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ftFLIP-428
> > > > >:
> > > > > > https://lists.apache.org/thread/vr8f91p715ct4lop6b3nr0fh4z5p312b
> > > > > >
> > > > > > If you want to talk about the overall architecture, roadmap,
> > > milestones
> > > > > or
> > > > > > something related with multiple FLIPs, please post it here.
> Otherwise
> > > > you
> > > > > > can discuss some details in separate mails. Let's try to avoid
> > > repeated
> > > > > > discussion in different threads. I will sync important messages
> here
> > > if
> > > > > > there are any in the above threads.
> > > > > >
> > > > > > And reply to @Jeyhun: We will ensure the content between those
> FLIPs
> > > is
> > > > > > consistent.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Zakelly
> > > > > >
> > > > > > On Thu, Mar 7, 2024 at 2:16 PM Yuan Mei <[email protected]>
> > > > wrote:
> > > > > >
> > > > > > > I have been a bit busy these few weeks and sorry for responding
> > > late.
> > > > > > >
> > > > > > > The original thinking of keeping the discussion within one thread
> > > > > > > was for easier tracking and to avoid repeated discussion in
> > > > > > > different threads.
> > > > > > >
> > > > > > > For details, it might be good to start different threads if needed.
> > > > > > >
> > > > > > > We will think of a way to better organize the discussion.
> > > > > > >
> > > > > > > Best
> > > > > > > Yuan
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Mar 7, 2024 at 4:38 AM Jeyhun Karimov <
> > > [email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > + 1 for the suggestion.
> > > > > > > > Maybe we can start the discussion with the FLIPs that have
> > > > > > > > minimum dependencies (from the other new/proposed FLIPs).
> > > > > > > > Based on our discussion on a particular FLIP, the subsequent
> (or
> > > > its
> > > > > > > > dependent) FLIP(s) can be updated accordingly?
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Jeyhun
> > > > > > > >
> > > > > > > > On Wed, Mar 6, 2024 at 5:34 PM Gyula Fóra <
> [email protected]>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey all!
> > > > > > > > >
> > > > > > > > > This is a massive improvement / work. I just started going
> > > > through
> > > > > > the
> > > > > > > > > Flips and have a more or less meta comment.
> > > > > > > > >
> > > > > > > > > While it's good to keep the overall architecture discussion
> > > > here, I
> > > > > > > think
> > > > > > > > > we should still have separate discussions for each FLIP
> where
> > > we
> > > > > can
> > > > > > > > > discuss interface details etc. With so much content if we
> start
> > > > > > adding
> > > > > > > > > minor comments here that will lead to nowhere but those
> > > > discussions
> > > > > > are
> > > > > > > > > still important and we should have them in separate threads
> > > (one
> > > > > for
> > > > > > > each
> > > > > > > > > FLIP)
> > > > > > > > >
> > > > > > > > > What do you think?
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > > > On Wed, Mar 6, 2024 at 8:50 AM Yanfei Lei <
> [email protected]
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi team,
> > > > > > > > > >
> > > > > > > > > > Thanks for your discussion. Regarding FLIP-425, we have
> > > > > > supplemented
> > > > > > > > > > several updates to answer high-frequency questions:
> > > > > > > > > >
> > > > > > > > > > 1. We captured a flame graph of the Hashmap state
> backend in
> > > > > > > > > > "Synchronous execution with asynchronous APIs"[1], which
> > > > reveals
> > > > > > that
> > > > > > > > > > the framework overhead (including reference counting,
> > > > > > future-related
> > > > > > > > > > code and so on) consumes about 9% of the keyed operator
> CPU
> > > > time.
> > > > > > > > > > 2. We added a set of comparative experiments for
> watermark
> > > > > > > processing,
> > > > > > > > > > the performance of Out-Of-Order mode is 70% better than
> > > > > > > > > > strictly-ordered mode under ~140MB state size.
> Instructions
> > > on
> > > > > how
> > > > > > to
> > > > > > > > > > run this test have also been added[2].
> > > > > > > > > > 3. Regarding the order of StreamRecord, whether it has
> state
> > > > > access
> > > > > > > or
> > > > > > > > > > not. We supplemented a new *Strict order of
> > > > 'processElement'*[3].
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-SynchronousexecutionwithasynchronousAPIs
> > > > > > > > > > [2]
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-Strictly-orderedmodevs.Out-of-ordermodeforwatermarkprocessing
> > > > > > > > > > [3]
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-ElementOrder
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Yanfei
> > > > > > > > > >
> > > > > > > > > > On Tue, Mar 5, 2024 at 09:25, Yunfeng Zhou <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Zakelly,
> > > > > > > > > > >
> > > > > > > > > > > > 5. I'm not very sure ... revisiting this later since
> it
> > > is
> > > > > not
> > > > > > > > > > important.
> > > > > > > > > > >
> > > > > > > > > > > It seems that we still have some details to confirm
> about
> > > > this
> > > > > > > > > > > question. Let's postpone this to after the critical
> parts
> > > of
> > > > > the
> > > > > > > > > > > design are settled.
> > > > > > > > > > >
> > > > > > > > > > > > 8. Yes, we had considered ... metrics should be like
> > > > > > afterwards.
> > > > > > > > > > >
> > > > > > > > > > > Oh sorry I missed FLIP-431. I'm fine with discussing
> this
> > > > topic
> > > > > > in
> > > > > > > > > > milestone 2.
> > > > > > > > > > >
> > > > > > > > > > > Looking forward to the detailed design about the strict
> > > mode
> > > > > > > between
> > > > > > > > > > > same-key records and the benchmark results about the
> epoch
> > > > > > > mechanism.
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Yunfeng
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Mar 4, 2024 at 7:59 PM Zakelly Lan <
> > > > > > [email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Yunfeng,
> > > > > > > > > > > >
> > > > > > > > > > > > For 1:
> > > > > > > > > > > > I had a discussion with Lincoln Lee, and I realize
> it is
> > > a
> > > > > > common
> > > > > > > > > case
> > > > > > > > > > the same-key record should be blocked before the
> > > > > `processElement`.
> > > > > > It
> > > > > > > > is
> > > > > > > > > > easier for users to understand. Thus I will introduce a
> > > strict
> > > > > mode
> > > > > > > for
> > > > > > > > > > this and make it default. My rough idea is just like
> yours,
> > > by
> > > > > > > invoking
> > > > > > > > > > some method of AEC instance before `processElement`. The
> > > > detailed
> > > > > > > > design
> > > > > > > > > > will be described in FLIP later.
> > > > > > > > > > > >
> > > > > > > > > > > > For 2:
> > > > > > > > > > > > I agree with you. We could throw exceptions for now
> and
> > > > > > optimize
> > > > > > > > this
> > > > > > > > > > later.
> > > > > > > > > > > >
> > > > > > > > > > > > For 5:
> > > > > > > > > > > >>
> > > > > > > > > > > >> It might be better to move the default values to the
> > > > > Proposed
> > > > > > > > > Changes
> > > > > > > > > > > >> section instead of making them public for now, as
> there
> > > > will
> > > > > > be
> > > > > > > > > > > >> compatibility issues once we want to dynamically
> adjust
> > > > the
> > > > > > > > > thresholds
> > > > > > > > > > > >> and timeouts in future.
> > > > > > > > > > > >
> > > > > > > > > > > > Agreed. The whole framework is under experiment
> until we
> > > > > think
> > > > > > it
> > > > > > > > is
> > > > > > > > > > complete in 2.0 or later. The default value should be
> better
> > > > > > > determined
> > > > > > > > > > with more testing results and production experience.
> > > > > > > > > > > >
> > > > > > > > > > > >> The configuration execution.async-state.enabled
> seems
> > > > > > > unnecessary,
> > > > > > > > > as
> > > > > > > > > > > >> the infrastructure may automatically get this
> > > information
> > > > > from
> > > > > > > the
> > > > > > > > > > > >> detailed state backend configurations. We may
> revisit
> > > this
> > > > > > part
> > > > > > > > > after
> > > > > > > > > > > >> the core designs have reached an agreement. WDYT?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I'm not very sure if there is any use case where
> users
> > > > write
> > > > > > > their
> > > > > > > > > > code using async APIs but run their job in a synchronous
> way.
> > > > The
> > > > > > > first
> > > > > > > > > two
> > > > > > > > > > scenarios that come to me are for benchmarking or for a
> small
> > > > > > state,
> > > > > > > > > while
> > > > > > > > > > they don't want to rewrite their code. Actually it is
> easy to
> > > > > > > support,
> > > > > > > > so
> > > > > > > > > > I'd suggest providing it. But I'm fine with revisiting
> this
> > > > later
> > > > > > > since
> > > > > > > > > it
> > > > > > > > > > is not important. WDYT?
> > > > > > > > > > > >
> > > > > > > > > > > > For 8:
> > > > > > > > > > > > Yes, we had considered the I/O metrics group
> especially
> > > the
> > > > > > > > > > back-pressure, idle and task busy per second. In the
> current
> > > > plan
> > > > > > we
> > > > > > > > can
> > > > > > > > > do
> > > > > > > > > > state access during back-pressure, meaning that those
> metrics
> > > > for
> > > > > > > input
> > > > > > > > > > would better be redefined. I suggest we discuss these
> > > existing
> > > > > > > metrics
> > > > > > > > as
> > > > > > > > > > well as some new metrics that should be introduced in
> > > FLIP-431
> > > > > > later
> > > > > > > in
> > > > > > > > > > milestone 2, since we have basically finished the
> framework
> > > > thus
> > > > > we
> > > > > > > > will
> > > > > > > > > > have a better view of what metrics should be like
> afterwards.
> > > > > WDYT?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Zakelly
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Mar 4, 2024 at 6:49 PM Yunfeng Zhou <
> > > > > > > > > > [email protected]> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> Hi Zakelly,
> > > > > > > > > > > >>
> > > > > > > > > > > >> Thanks for the responses!
> > > > > > > > > > > >>
> > > > > > > > > > > >> > 1. I will discuss this with some expert SQL
> > > developers.
> > > > > ...
> > > > > > > mode
> > > > > > > > > > for StreamRecord processing.
> > > > > > > > > > > >>
> > > > > > > > > > > >> In DataStream API there should also be use cases
> when
> > > the
> > > > > > order
> > > > > > > of
> > > > > > > > > > > >> output is strictly required. I agree with it that
> SQL
> > > > > experts
> > > > > > > may
> > > > > > > > > help
> > > > > > > > > > > >> provide more concrete use cases that can accelerate
> our
> > > > > > > > discussion,
> > > > > > > > > > > >> but please allow me to search for DataStream use
> cases
> > > > that
> > > > > > can
> > > > > > > > > prove
> > > > > > > > > > > >> the necessity of this strict order preservation
> mode, if
> > > > > > answers
> > > > > > > > > from
> > > > > > > > > > > >> SQL experts are shown to be negative.
> > > > > > > > > > > >>
> > > > > > > > > > > >> For your convenience, my current rough idea is that
> we
> > > can
> > > > > > add a
> > > > > > > > > > > >> module between the Input(s) and processElement()
> module
> > > in
> > > > > > Fig 2
> > > > > > > > of
> > > > > > > > > > > >> FLIP-425. The module will be responsible for caching
> > > > records
> > > > > > > whose
> > > > > > > > > > > >> keys collide with in-flight records, and AEC will
> only
> > > be
> > > > > > > > > responsible
> > > > > > > > > > > >> for handling async state calls, without knowing the
> > > record
> > > > > > each
> > > > > > > > call
> > > > > > > > > > > >> belongs to. We may revisit this topic once the
> necessity
> > > > of
> > > > > > the
> > > > > > > > > strict
> > > > > > > > > > > >> order mode is clarified.
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> > 2. The amount of parallel StateRequests ...
> instead of
> > > > > > > invoking
> > > > > > > > > > yield
> > > > > > > > > > > >>
> > > > > > > > > > > >> Your suggestions generally appeal to me. I think we
> may
> > > > let
> > > > > > > > > > > >> corresponding Flink jobs fail with OOM for now,
> since
> > > the
> > > > > > > majority
> > > > > > > > > of
> > > > > > > > > > > >> a StateRequest should just be references to existing
> > > Java
> > > > > > > objects,
> > > > > > > > > > > >> which only occupies very small memory space and can
> > > hardly
> > > > > > cause
> > > > > > > > OOM
> > > > > > > > > > > >> in common cases. We can monitor the pending
> > > StateRequests
> > > > > and
> > > > > > if
> > > > > > > > > there
> > > > > > > > > > > >> is really a risk of OOM in extreme cases, we can
> throw
> > > > > > > Exceptions
> > > > > > > > > with
> > > > > > > > > > > >> proper messages notifying users what to do, like
> > > > increasing
> > > > > > > memory
> > > > > > > > > > > >> through configurations.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Your suggestions to adjust threshold adaptively or
> to
> > > use
> > > > > the
> > > > > > > > > blocking
> > > > > > > > > > > >> buffer sounds good, and in my opinion we can
> postpone
> > > them
> > > > > to
> > > > > > > > future
> > > > > > > > > > > >> FLIPs since they seem to only benefit users in rare
> > > cases.
> > > > > > Given
> > > > > > > > > that
> > > > > > > > > > > >> FLIP-423~428 has already been a big enough design,
> it
> > > > might
> > > > > be
> > > > > > > > > better
> > > > > > > > > > > >> to focus on the most critical design for now and
> > > postpone
> > > > > > > > > > > >> optimizations like this. WDYT?
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> > 5. Sure, we will introduce new configs as well as
> > > their
> > > > > > > default
> > > > > > > > > > value.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Thanks for adding the default values and the values
> > > > > themselves
> > > > > > > > LGTM.
> > > > > > > > > > > >> It might be better to move the default values to the
> > > > > Proposed
> > > > > > > > > Changes
> > > > > > > > > > > >> section instead of making them public for now, as
> there
> > > > will
> > > > > > be
> > > > > > > > > > > >> compatibility issues once we want to dynamically
> adjust
> > > > the
> > > > > > > > > thresholds
> > > > > > > > > > > >> and timeouts in future.
> > > > > > > > > > > >>
> > > > > > > > > > > >> The configuration execution.async-state.enabled
> seems
> > > > > > > unnecessary,
> > > > > > > > > as
> > > > > > > > > > > >> the infrastructure may automatically get this
> > > information
> > > > > from
> > > > > > > the
> > > > > > > > > > > >> detailed state backend configurations. We may
> revisit
> > > this
> > > > > > part
> > > > > > > > > after
> > > > > > > > > > > >> the core designs have reached an agreement. WDYT?
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> Besides, inspired by Jeyhun's comments, it comes to
> me
> > > > that
> > > > > > > > > > > >>
> > > > > > > > > > > >> 8. Should this FLIP introduce metrics that measure
> the
> > > > time
> > > > > a
> > > > > > > > Flink
> > > > > > > > > > > >> job is back-pressured by State IOs? Under the
> current
> > > > > design,
> > > > > > > this
> > > > > > > > > > > >> metric could measure the time when the blocking
> buffer
> > > is
> > > > > full
> > > > > > > and
> > > > > > > > > > > >> yield() cannot get callbacks to process, which
> means the
> > > > > > > operator
> > > > > > > > is
> > > > > > > > > > > >> fully waiting for state responses.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Best regards,
> > > > > > > > > > > >> Yunfeng
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Mon, Mar 4, 2024 at 12:33 PM Zakelly Lan <
> > > > > > > > [email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Hi Yunfeng,
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Thanks for your detailed comments!
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> 1. Why do we need a close() method on
> StateIterator?
> > > > This
> > > > > > > > method
> > > > > > > > > > seems
> > > > > > > > > > > >> >> unused in the usage example codes.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > The `close()` is introduced to release internal
> > > > resources,
> > > > > > but
> > > > > > > > it
> > > > > > > > > > does not seem to require the user to call it. I removed
> this.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> 2. In FutureUtils.combineAll()'s JavaDoc, it is
> > > stated
> > > > > that
> > > > > > > "No
> > > > > > > > > > null
> > > > > > > > > > > >> >> entries are allowed". It might be better to
> further
> > > > > explain
> > > > > > > > what
> > > > > > > > > > will
> > > > > > > > > > > >> >> happen if a null value is passed, ignoring the
> value
> > > in
> > > > > the
> > > > > > > > > > returned
> > > > > > > > > > > >> >> Collection or throwing exceptions. Given that
> > > > > > > > > > > >> >> FutureUtils.emptyFuture() can be returned in the
> > > > example
> > > > > > > code,
> > > > > > > > I
> > > > > > > > > > > >> >> suppose the former one might be correct.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > The statement "No null entries are allowed"
> refers to
> > > > the
> > > > > > > > > > parameters, it means some arrayList like [null,
> StateFuture1,
> > > > > > > > > StateFuture2]
> > > > > > > > > > passed in are not allowed, and an Exception will be
> thrown.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> 1. According to Fig 2 of this FLIP, ... . This
> > > > situation
> > > > > > > should
> > > > > > > > > be
> > > > > > > > > > > >> >> avoided and the order of same-key records should
> be
> > > > > > strictly
> > > > > > > > > > > >> >> preserved.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > I will discuss this with some expert SQL
> developers.
> > > And
> > > > > if
> > > > > > it
> > > > > > > > is
> > > > > > > > > > valid and common, I suggest a strict order preservation
> mode
> > > > for
> > > > > > > > > > StreamRecord processing. WDYT?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> 2. The FLIP says that StateRequests submitted by
> > > > > Callbacks
> > > > > > > will
> > > > > > > > > not
> > > > > > > > > > > >> >> invoke further yield() methods. Given that
> yield() is
> > > > > used
> > > > > > > when
> > > > > > > > > > there
> > > > > > > > > > > >> >> is "too much" in-flight data, does it mean
> > > > StateRequests
> > > > > > > > > submitted
> > > > > > > > > > by
> > > > > > > > > > > >> >> Callbacks will never be "too much"? What if the
> total
> > > > > > number
> > > > > > > of
> > > > > > > > > > > >> >> StateRequests exceed the capacity of Flink
> operator's
> > > > > > memory
> > > > > > > > > space?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > The amount of parallel StateRequests for one
> > > > StreamRecord
> > > > > > > cannot
> > > > > > > > > be
> > > > > > > > > > determined since the code is written by users. So the
> > > in-flight
> > > > > > > > requests
> > > > > > > > > > may be "too much", and may cause OOM. Users should
> > > re-configure
> > > > > > their
> > > > > > > > > job,
> > > > > > > > > > controlling the amount of on-going StreamRecord. And I
> > > suggest
> > > > > two
> > > > > > > ways
> > > > > > > > > to
> > > > > > > > > > avoid this:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Adaptively adjust the count of on-going
> StreamRecord
> > > > > > according
> > > > > > > > to
> > > > > > > > > > historical StateRequests amount.
> > > > > > > > > > > >> > Also control the max StateRequests that can be
> > > executed
> > > > in
> > > > > > > > > parallel
> > > > > > > > > > for each StreamRecord, and if it exceeds, put the new
> > > > > StateRequest
> > > > > > in
> > > > > > > > the
> > > > > > > > > > blocking buffer waiting for execution (instead of
> invoking
> > > > > > yield()).
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > WDYT?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> 3.1 I'm concerned that the out-of-order execution
> > > mode,
> > > > > > along
> > > > > > > > > with
> > > > > > > > > > the
> > > > > > > > > > > >> >> epoch mechanism, would bring more complexity to
> the
> > > > > > execution
> > > > > > > > > model
> > > > > > > > > > > >> >> than the performance improvement it promises.
> Could
> > > we
> > > > > add
> > > > > > > some
> > > > > > > > > > > >> >> benchmark results proving the benefit of this
> mode?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Agreed, will do.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> 3.2 The FLIP might need to add a public API
> section
> > > > > > > describing
> > > > > > > > > how
> > > > > > > > > > > >> >> users or developers can switch between these two
> > > > > execution
> > > > > > > > modes.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Good point. We will add a Public API section.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> 3.3 Apart from the watermark and checkpoint
> mentioned
> > > > in
> > > > > > this
> > > > > > > > > FLIP,
> > > > > > > > > > > >> >> there are also more other events that might
> appear in
> > > > the
> > > > > > > > stream
> > > > > > > > > of
> > > > > > > > > > > >> >> data records. It might be better to generalize
> the
> > > > > > execution
> > > > > > > > mode
> > > > > > > > > > > >> >> mechanism to handle all possible events.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Yes, I missed this point. Thanks for the reminder.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> 4. It might be better to treat callback-handling
> as a
> > > > > > > > > > > >> >> MailboxDefaultAction, instead of Mails, to avoid
> the
> > > > > > overhead
> > > > > > > > of
> > > > > > > > > > > >> >> repeatedly creating Mail objects.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >  I thought the intermediated wrapper for callback
> can
> > > > not
> > > > > be
> > > > > > > > > > omitted, since there will be some context switch before
> each
> > > > > > > execution.
> > > > > > > > > The
> > > > > > > > > > MailboxDefaultAction in most cases is processInput right?
> > > While
> > > > > the
> > > > > > > > > > callback should be executed with higher priority. I'd
> suggest
> > > > not
> > > > > > > > > changing
> > > > > > > > > > the basic logic of Mailbox and the default action since
> it is
> > > > > very
> > > > > > > > > critical
> > > > > > > > > > for performance. But yes, we will try our best to avoid
> > > > creating
> > > > > > > > > > intermediated objects.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> 5. Could this FLIP provide the current default
> values
> > > > for
> > > > > > > > things
> > > > > > > > > > like
> > > > > > > > > > > >> >> active buffer size thresholds and timeouts? These
> > > could
> > > > > > help
> > > > > > > > with
> > > > > > > > > > > >> >> memory consumption and latency analysis.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Sure, we will introduce new configs as well as
> their
> > > > > default
> > > > > > > > > value.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> 6. Why do we need to record the hashcode of a
> record
> > > in
> > > > > its
> > > > > > > > > > > >> >> RecordContext? It seems not used.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > > >> > The context switch before each callback execution involves
> > > > > > > > > > > > >> > setCurrentKey, where the hashCode is re-calculated. We cache it to
> > > > > > > > > > > > >> > speed this up.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> 7. In "timers can be stored on the JVM heap or
> > > > RocksDB",
> > > > > > the
> > > > > > > > link
> > > > > > > > > > > >> >> points to a document in flink-1.15. It might be
> > > better
> > > > to
> > > > > > > > verify
> > > > > > > > > > the
> > > > > > > > > > > >> >> referenced content is still valid in the latest
> Flink
> > > > and
> > > > > > > > update
> > > > > > > > > > the
> > > > > > > > > > > >> >> link accordingly. Same for other references if
> any.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Thanks for the reminder! Will check.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Thanks a lot & Best,
> > > > > > > > > > > >> > Zakelly
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > On Sat, Mar 2, 2024 at 6:18 AM Jeyhun Karimov <
> > > > > > > > > [email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> Hi,
> > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> Thanks for the great proposals. I have a few comments:
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> - Backpressure Handling. Flink's original backpressure handling is
> > > > > > > > > > > > >> >> quite robust and its semantics are quite "simple" (simple is
> > > > > > > > > > > > >> >> beautiful).
> > > > > > > > > > > > >> >> This mechanism has proven to be more robust and to perform better
> > > > > > > > > > > > >> >> than the other open source streaming systems, which relied on some
> > > > > > > > > > > > >> >> loopback information.
> > > > > > > > > > > > >> >> Now that the proposal also relies on loopback (yield in this case),
> > > > > > > > > > > > >> >> it is not clear how robust the new backpressure handling proposed in
> > > > > > > > > > > > >> >> FLIP-425 is and how well it handles fluctuating workloads.
> > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> - Watermark/Timer Handling: Similar arguments apply to watermark and
> > > > > > > > > > > > >> >> timer handling. IMHO, we need more benchmarks showing the overhead
> > > > > > > > > > > > >> >> of epoch management with different parameters (e.g., window size,
> > > > > > > > > > > > >> >> watermark strategy, etc.).
> > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> - DFS consistency guarantees. The proposal in FLIP-427 is
> > > > > > > > > > > > >> >> DFS-agnostic. However, different cloud providers have different
> > > > > > > > > > > > >> >> storage consistency models.
> > > > > > > > > > > > >> >> How do we want to deal with them?
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >>  Regards,
> > > > > > > > > > > >> >> Jeyhun
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> On Fri, Mar 1, 2024 at 6:08 PM Zakelly Lan <
> > > > > > > > > [email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> > Thanks Piotr for sharing your thoughts!
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > I guess it depends how we would like to treat
> the
> > > > local
> > > > > > > > disks.
> > > > > > > > > > I've always
> > > > > > > > > > > >> >> > > thought about them that almost always
> eventually
> > > > all
> > > > > > > state
> > > > > > > > > > from the DFS
> > > > > > > > > > > >> >> > > should end up cached in the local disks.
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >> > OK I got it. In our proposal we treat local disk as an optional
> > > > > > > > > > > > >> >> > cache, so the basic design will handle the case with state residing
> > > > > > > > > > > > >> >> > in DFS only. It is a more 'cloud-native' approach that does not rely
> > > > > > > > > > > > >> >> > on any local storage assumptions, which allows users to dynamically
> > > > > > > > > > > > >> >> > adjust the capacity or I/O bound of remote storage to gain
> > > > > > > > > > > > >> >> > performance or save cost, even without a job restart.
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > In
> > > > > > > > > > > >> >> > > the currently proposed more fine grained
> > > solution,
> > > > > you
> > > > > > > > make a
> > > > > > > > > > single
> > > > > > > > > > > >> >> > > request to DFS per each state access.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >> > Ah that's not accurate. Actually we buffer the state requests and
> > > > > > > > > > > > >> >> > process them in batches, so multiple requests will correspond to one
> > > > > > > > > > > > >> >> > DFS access (one block access for multiple keys performed by
> > > > > > > > > > > > >> >> > RocksDB).
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > In that benchmark you mentioned, are you
> requesting
> > > > the
> > > > > > > state
> > > > > > > > > > > >> >> > > asynchronously from local disks into memory?
> If
> > > the
> > > > > > > benefit
> > > > > > > > > > comes from
> > > > > > > > > > > >> >> > > parallel I/O, then I would expect the
> benefit to
> > > > > > > > > > disappear/shrink when
> > > > > > > > > > > >> >> > > running multiple subtasks on the same
> machine, as
> > > > > they
> > > > > > > > would
> > > > > > > > > > be making
> > > > > > > > > > > >> >> > > their own parallel requests, right? Also
> enabling
> > > > > > > > > > checkpointing would
> > > > > > > > > > > >> >> > > further cut into the available I/O budget.
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >> > That's an interesting topic. Our proposal is specifically aimed at
> > > > > > > > > > > > >> >> > the scenario where the machine I/O is not fully loaded but the I/O
> > > > > > > > > > > > >> >> > latency has indeed become a bottleneck for each subtask. The
> > > > > > > > > > > > >> >> > distributed file system is a prime example of such a scenario,
> > > > > > > > > > > > >> >> > characterized by abundant and easily scalable I/O bandwidth coupled
> > > > > > > > > > > > >> >> > with higher I/O latency. You may expect to increase the parallelism
> > > > > > > > > > > > >> >> > of a job to enhance the performance as well, but that also wastes
> > > > > > > > > > > > >> >> > more CPU and memory on building up more subtasks. This is one
> > > > > > > > > > > > >> >> > drawback of the computation-storage tightly coupled nodes. In our
> > > > > > > > > > > > >> >> > proposal, by contrast, the parallel I/O is performed with all the
> > > > > > > > > > > > >> >> > callbacks still running in one task, so pre-allocated computational
> > > > > > > > > > > > >> >> > resources are better utilized. It is a much more lightweight way to
> > > > > > > > > > > > >> >> > perform parallel I/O.
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > Just with what granularity those async requests
> > > > should
> > > > > be
> > > > > > > > made.
> > > > > > > > > > > >> >> > > Making state access asynchronous is
> definitely
> > > the
> > > > > > right
> > > > > > > > way
> > > > > > > > > > to go!
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > I think the current proposal is based on such
> core
> > > > > ideas:
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> >    - A pure cloud-native disaggregated state.
> > > > > > > > > > > >> >> >    - Fully utilize the given resources and try
> not
> > > to
> > > > > > waste
> > > > > > > > > them
> > > > > > > > > > (including
> > > > > > > > > > > >> >> >    I/O).
> > > > > > > > > > > >> >> >    - The ability to scale isolated resources
> (I/O
> > > or
> > > > > CPU
> > > > > > or
> > > > > > > > > > memory)
> > > > > > > > > > > >> >> >    independently.
> > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >> > We think a fine-grained granularity is more in line with those
> > > > > > > > > > > > >> >> > ideas, especially without local disk assumptions and without any
> > > > > > > > > > > > >> >> > waste of I/O by prefetching. Please note that it is not a
> > > > > > > > > > > > >> >> > replacement of the original local state with synchronous execution.
> > > > > > > > > > > > >> >> > Instead this is a solution embracing the cloud-native era, providing
> > > > > > > > > > > > >> >> > much more scalability and resource efficiency when handling a *huge
> > > > > > > > > > > > >> >> > state*.
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > What also worries me a lot in this fine grained
> > > model
> > > > > is
> > > > > > > the
> > > > > > > > > > effect on the
> > > > > > > > > > > >> >> > > checkpointing times.
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >> > Your concerns are very reasonable. Faster checkpointing is always a
> > > > > > > > > > > > >> >> > core advantage of disaggregated state, but only for the async phase.
> > > > > > > > > > > > >> >> > There will be some complexity introduced by in-flight requests, but
> > > > > > > > > > > > >> >> > I'd suggest a checkpoint containing those in-flight state requests
> > > > > > > > > > > > >> >> > as part of the state, to accelerate the sync phase by skipping the
> > > > > > > > > > > > >> >> > buffer draining. This makes the buffer size have little impact on
> > > > > > > > > > > > >> >> > checkpoint time. And all the changes stay within the execution model
> > > > > > > > > > > > >> >> > we proposed, while the checkpoint barrier alignment or handling will
> > > > > > > > > > > > >> >> > not be touched in our proposal, so I guess the complexity is
> > > > > > > > > > > > >> >> > relatively controllable. I have faith in that :)
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > Also regarding the overheads, it would be
> great if
> > > > you
> > > > > > > could
> > > > > > > > > > provide
> > > > > > > > > > > >> >> > > profiling results of the benchmarks that you
> > > > > conducted
> > > > > > to
> > > > > > > > > > verify the
> > > > > > > > > > > >> >> > > results. Or maybe if you could describe the
> steps
> > > > to
> > > > > > > > > reproduce
> > > > > > > > > > the
> > > > > > > > > > > >> >> > results?
> > > > > > > > > > > >> >> > > Especially "Hashmap (sync)" vs "Hashmap with
> > > async
> > > > > > API".
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >> > Yes we could profile the benchmarks. And for the comparison of
> > > > > > > > > > > > >> >> > "Hashmap (sync)" vs "Hashmap with async API", we ran a Wordcount job
> > > > > > > > > > > > >> >> > written with async APIs but with the async execution disabled by
> > > > > > > > > > > > >> >> > directly completing the future using sync state access. This
> > > > > > > > > > > > >> >> > evaluates the overhead of newly introduced modules like 'AEC' in
> > > > > > > > > > > > >> >> > sync execution (even though they are not designed for it). The code
> > > > > > > > > > > > >> >> > will be provided later. For other results of our PoC[1], you can
> > > > > > > > > > > > >> >> > follow the instructions here[2] to reproduce. Since the compilation
> > > > > > > > > > > > >> >> > may take some effort, we will directly provide the jar for testing
> > > > > > > > > > > > >> >> > next week.
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > And @Yunfeng Zhou, I have noticed your mail
> but it
> > > > is a
> > > > > > bit
> > > > > > > > > late
> > > > > > > > > > in my
> > > > > > > > > > > >> >> > local time and the next few days are weekends.
> So I
> > > > > will
> > > > > > > > reply
> > > > > > > > > > to you
> > > > > > > > > > > >> >> > later. Thanks for your response!
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > [1]
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-PoCResults
> > > > > > > > > > > >> >> > [2]
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-Appendix:HowtorunthePoC
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > Best,
> > > > > > > > > > > >> >> > Zakelly
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > On Fri, Mar 1, 2024 at 6:38 PM Yunfeng Zhou <
> > > > > > > > > > [email protected]>
> > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > > Hi,
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > Thanks for proposing this design! I just read
> > > > > FLIP-424
> > > > > > > and
> > > > > > > > > > FLIP-425
> > > > > > > > > > > >> >> > > and have some questions about the proposed
> > > changes.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > For Async API (FLIP-424)
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > 1. Why do we need a close() method on
> > > > StateIterator?
> > > > > > This
> > > > > > > > > > method seems
> > > > > > > > > > > >> >> > > unused in the usage example codes.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > 2. In FutureUtils.combineAll()'s JavaDoc, it
> is
> > > > > stated
> > > > > > > that
> > > > > > > > > > "No null
> > > > > > > > > > > >> >> > > entries are allowed". It might be better to
> > > further
> > > > > > > explain
> > > > > > > > > > what will
> > > > > > > > > > > >> >> > > happen if a null value is passed, ignoring
> the
> > > > value
> > > > > in
> > > > > > > the
> > > > > > > > > > returned
> > > > > > > > > > > >> >> > > Collection or throwing exceptions. Given that
> > > > > > > > > > > >> >> > > FutureUtils.emptyFuture() can be returned in
> the
> > > > > > example
> > > > > > > > > code,
> > > > > > > > > > I
> > > > > > > > > > > >> >> > > suppose the former one might be correct.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > For Async Execution (FLIP-425)
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> > > 1. According to Fig 2 of this FLIP, if a recordB has its key collide
> > > > > > > > > > > > >> >> > > with an ongoing recordA, its processElement() method can still be
> > > > > > > > > > > > >> >> > > triggered immediately, and then it might be moved to the blocking
> > > > > > > > > > > > >> >> > > buffer in AEC if it involves state operations. This means that
> > > > > > > > > > > > >> >> > > recordB's output will precede recordA's output in downstream
> > > > > > > > > > > > >> >> > > operators, if recordA involves state operations while recordB does
> > > > > > > > > > > > >> >> > > not. This will harm the correctness of Flink jobs in some use cases.
> > > > > > > > > > > > >> >> > > For example, in dim table join cases, recordA could be a delete
> > > > > > > > > > > > >> >> > > operation that involves state access, while recordB could be an
> > > > > > > > > > > > >> >> > > insert operation that needs to visit external storage without state
> > > > > > > > > > > > >> >> > > access. If recordB's output precedes recordA's, then an entry that
> > > > > > > > > > > > >> >> > > is supposed to finally exist with recordB's value in the sink table
> > > > > > > > > > > > >> >> > > might actually be deleted according to recordA's command. This
> > > > > > > > > > > > >> >> > > situation should be avoided and the order of same-key records
> > > > > > > > > > > > >> >> > > should be strictly preserved.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> > > 2. The FLIP says that StateRequests submitted by Callbacks will not
> > > > > > > > > > > > >> >> > > invoke further yield() methods. Given that yield() is used when
> > > > > > > > > > > > >> >> > > there is "too much" in-flight data, does it mean StateRequests
> > > > > > > > > > > > >> >> > > submitted by Callbacks will never be "too much"? What if the total
> > > > > > > > > > > > >> >> > > number of StateRequests exceeds the capacity of the Flink
> > > > > > > > > > > > >> >> > > operator's memory space?
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > 3. In the "Watermark" section, this FLIP
> provided
> > > > an
> > > > > > > > > > out-of-order
> > > > > > > > > > > >> >> > > execution mode apart from the default
> > > > > strictly-ordered
> > > > > > > > mode,
> > > > > > > > > > which can
> > > > > > > > > > > >> >> > > optimize performance by allowing more
> concurrent
> > > > > > > > executions.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > 3.1 I'm concerned that the out-of-order
> execution
> > > > > mode,
> > > > > > > > along
> > > > > > > > > > with the
> > > > > > > > > > > >> >> > > epoch mechanism, would bring more complexity
> to
> > > the
> > > > > > > > execution
> > > > > > > > > > model
> > > > > > > > > > > >> >> > > than the performance improvement it promises.
> > > Could
> > > > > we
> > > > > > > add
> > > > > > > > > some
> > > > > > > > > > > >> >> > > benchmark results proving the benefit of this
> > > mode?
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > 3.2 The FLIP might need to add a public API
> > > section
> > > > > > > > > describing
> > > > > > > > > > how
> > > > > > > > > > > >> >> > > users or developers can switch between these
> two
> > > > > > > execution
> > > > > > > > > > modes.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > 3.3 Apart from the watermark and checkpoint
> > > > mentioned
> > > > > > in
> > > > > > > > this
> > > > > > > > > > FLIP,
> > > > > > > > > > > >> >> > > there are also more other events that might
> > > appear
> > > > in
> > > > > > the
> > > > > > > > > > stream of
> > > > > > > > > > > >> >> > > data records. It might be better to
> generalize
> > > the
> > > > > > > > execution
> > > > > > > > > > mode
> > > > > > > > > > > >> >> > > mechanism to handle all possible events.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > 4. It might be better to treat
> callback-handling
> > > > as a
> > > > > > > > > > > >> >> > > MailboxDefaultAction, instead of Mails, to
> avoid
> > > > the
> > > > > > > > overhead
> > > > > > > > > > of
> > > > > > > > > > > >> >> > > repeatedly creating Mail objects.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > 5. Could this FLIP provide the current
> default
> > > > values
> > > > > > for
> > > > > > > > > > things like
> > > > > > > > > > > >> >> > > active buffer size thresholds and timeouts?
> These
> > > > > could
> > > > > > > > help
> > > > > > > > > > with
> > > > > > > > > > > >> >> > > memory consumption and latency analysis.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > 6. Why do we need to record the hashcode of a
> > > > record
> > > > > in
> > > > > > > its
> > > > > > > > > > > >> >> > > RecordContext? It seems not used.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > 7. In "timers can be stored on the JVM heap
> or
> > > > > > RocksDB",
> > > > > > > > the
> > > > > > > > > > link
> > > > > > > > > > > >> >> > > points to a document in flink-1.15. It might
> be
> > > > > better
> > > > > > to
> > > > > > > > > > verify the
> > > > > > > > > > > >> >> > > referenced content is still valid in the
> latest
> > > > Flink
> > > > > > and
> > > > > > > > > > update the
> > > > > > > > > > > >> >> > > link accordingly. Same for other references
> if
> > > any.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > Best,
> > > > > > > > > > > >> >> > > Yunfeng Zhou
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > On Thu, Feb 29, 2024 at 2:17 PM Yuan Mei <
> > > > > > > > > > [email protected]> wrote:
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > Hi Devs,
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > This is a joint work of Yuan Mei, Zakelly
> Lan,
> > > > > > Jinzhong
> > > > > > > > Li,
> > > > > > > > > > Hangxiang
> > > > > > > > > > > >> >> > Yu,
> > > > > > > > > > > >> >> > > > Yanfei Lei and Feng Wang. We'd like to
> start a
> > > > > > > discussion
> > > > > > > > > > about
> > > > > > > > > > > >> >> > > introducing
> > > > > > > > > > > >> >> > > > Disaggregated State Storage and Management
> in
> > > > Flink
> > > > > > > 2.0.
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > The past decade has witnessed a dramatic
> shift
> > > in
> > > > > > > Flink's
> > > > > > > > > > deployment
> > > > > > > > > > > >> >> > > mode,
> > > > > > > > > > > >> >> > > > workload patterns, and hardware
> improvements.
> > > > We've
> > > > > > > moved
> > > > > > > > > > from the
> > > > > > > > > > > >> >> > > > map-reduce era where workers are
> > > > > computation-storage
> > > > > > > > > tightly
> > > > > > > > > > coupled
> > > > > > > > > > > >> >> > > nodes
> > > > > > > > > > > >> >> > > > to a cloud-native world where containerized
> > > > > > deployments
> > > > > > > > on
> > > > > > > > > > Kubernetes
> > > > > > > > > > > >> >> > > > become standard. To enable Flink's
> Cloud-Native
> > > > > > future,
> > > > > > > > we
> > > > > > > > > > introduce
> > > > > > > > > > > >> >> > > > Disaggregated State Storage and Management
> that
> > > > > uses
> > > > > > > DFS
> > > > > > > > as
> > > > > > > > > > primary
> > > > > > > > > > > >> >> > > storage
> > > > > > > > > > > >> >> > > > in Flink 2.0, as promised in the Flink 2.0
> > > > Roadmap.
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > Design Details can be found in FLIP-423[1].
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > This new architecture is aimed to solve the
> > > > > following
> > > > > > > > > > challenges
> > > > > > > > > > > >> >> > brought
> > > > > > > > > > > >> >> > > in
> > > > > > > > > > > >> >> > > > the cloud-native era for Flink.
> > > > > > > > > > > >> >> > > > 1. Local Disk Constraints in
> containerization
> > > > > > > > > > > >> >> > > > 2. Spiky Resource Usage caused by
> compaction in
> > > > the
> > > > > > > > current
> > > > > > > > > > state model
> > > > > > > > > > > >> >> > > > 3. Fast Rescaling for jobs with large
> states
> > > > > > (hundreds
> > > > > > > of
> > > > > > > > > > Terabytes)
> > > > > > > > > > > >> >> > > > 4. Light and Fast Checkpoint in a native
> way
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > More specifically, we want to reach a
> consensus
> > > > on
> > > > > > the
> > > > > > > > > > following issues
> > > > > > > > > > > >> >> > > in
> > > > > > > > > > > >> >> > > > this discussion:
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > 1. Overall design
> > > > > > > > > > > >> >> > > > 2. Proposed Changes
> > > > > > > > > > > >> >> > > > 3. Design details to achieve Milestone1
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > In M1, we aim to achieve an end-to-end
> baseline
> > > > > > version
> > > > > > > > > > using DFS as
> > > > > > > > > > > >> >> > > > primary storage and complete core
> > > > functionalities,
> > > > > > > > > including:
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > - Asynchronous State APIs (FLIP-424)[2]:
> > > > Introduce
> > > > > > new
> > > > > > > > APIs
> > > > > > > > > > for
> > > > > > > > > > > >> >> > > > asynchronous state access.
> > > > > > > > > > > >> >> > > > - Asynchronous Execution Model
> (FLIP-425)[3]:
> > > > > > > Implement a
> > > > > > > > > > non-blocking
> > > > > > > > > > > >> >> > > > execution model leveraging the asynchronous
> > > APIs
> > > > > > > > introduced
> > > > > > > > > > in
> > > > > > > > > > > >> >> > FLIP-424.
> > > > > > > > > > > >> >> > > > - Grouping Remote State Access
> (FLIP-426)[4]:
> > > > > Enable
> > > > > > > > > > retrieval of
> > > > > > > > > > > >> >> > remote
> > > > > > > > > > > >> >> > > > state data in batches to avoid unnecessary
> > > > > round-trip
> > > > > > > > costs
> > > > > > > > > > for remote
> > > > > > > > > > > >> >> > > > access
> > > > > > > > > > > >> >> > > > - Disaggregated State Store (FLIP-427)[5]:
> > > > > Introduce
> > > > > > > the
> > > > > > > > > > initial
> > > > > > > > > > > >> >> > version
> > > > > > > > > > > >> >> > > of
> > > > > > > > > > > >> >> > > > the ForSt disaggregated state store.
> > > > > > > > > > > >> >> > > > - Fault Tolerance/Rescale Integration
> > > > > (FLIP-428)[6]:
> > > > > > > > > > Integrate
> > > > > > > > > > > >> >> > > > checkpointing mechanisms with the
> disaggregated
> > > > > state
> > > > > > > > store
> > > > > > > > > > for fault
> > > > > > > > > > > >> >> > > > tolerance and fast rescaling.
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > We will vote on each FLIP in separate
> threads
> > > to
> > > > > make
> > > > > > > > sure
> > > > > > > > > > each FLIP
> > > > > > > > > > > >> >> > > > reaches a consensus. But we want to keep
> the
> > > > > > discussion
> > > > > > > > > > within a
> > > > > > > > > > > >> >> > focused
> > > > > > > > > > > >> >> > > > thread (this thread) for easier tracking of
> > > > > contexts
> > > > > > to
> > > > > > > > > avoid
> > > > > > > > > > > >> >> > duplicated
> > > > > > > > > > > >> >> > > > questions/discussions and also to think of
> the
> > > > > > > > > > problem/solution in a
> > > > > > > > > > > >> >> > full
> > > > > > > > > > > >> >> > > > picture.
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > Looking forward to your feedback
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > Best,
> > > > > > > > > > > >> >> > > > Yuan, Zakelly, Jinzhong, Hangxiang, Yanfei
> and
> > > > Feng
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > [1]
> > > https://cwiki.apache.org/confluence/x/R4p3EQ
> > > > > > > > > > > >> >> > > > [2]
> > > https://cwiki.apache.org/confluence/x/SYp3EQ
> > > > > > > > > > > >> >> > > > [3]
> > > https://cwiki.apache.org/confluence/x/S4p3EQ
> > > > > > > > > > > >> >> > > > [4]
> > > https://cwiki.apache.org/confluence/x/TYp3EQ
> > > > > > > > > > > >> >> > > > [5]
> > > https://cwiki.apache.org/confluence/x/T4p3EQ
> > > > > > > > > > > >> >> > > > [6]
> > > https://cwiki.apache.org/confluence/x/UYp3EQ
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>
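
For readers following the same-key ordering point raised in the quoted thread (a recordB whose key collides with an in-flight recordA), the strict alternative can be sketched roughly as below. This is only an illustration under assumptions, not the AEC implementation from FLIP-425: the class KeyOrderedBuffer and its methods submit() and complete() are hypothetical names made up for this sketch. It keeps at most one record per key in flight and parks later same-key records in a blocking buffer until the earlier one completes.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch (not Flink code): at most one in-flight record per key. */
final class KeyOrderedBuffer<K, R> {
    // One entry per key that currently has a record in flight; the queue
    // holds later records of the same key that are blocked behind it.
    private final Map<K, Queue<R>> blockedByKey = new HashMap<>();

    /**
     * Returns true if the record may be processed now, or false if it was
     * parked behind an in-flight record with the same key.
     */
    boolean submit(K key, R record) {
        Queue<R> blocked = blockedByKey.get(key);
        if (blocked == null) {
            // No record of this key is in flight: mark the key as occupied.
            blockedByKey.put(key, new ArrayDeque<>());
            return true;
        }
        blocked.add(record);
        return false;
    }

    /**
     * Called when the in-flight record of a key has finished. Returns the
     * next blocked record of that key (which becomes the in-flight one),
     * or null if none is waiting, in which case the key is released.
     */
    R complete(K key) {
        Queue<R> blocked = blockedByKey.get(key);
        if (blocked == null) {
            throw new IllegalStateException("no in-flight record for key " + key);
        }
        R next = blocked.poll();
        if (next == null) {
            blockedByKey.remove(key);
        }
        return next;
    }
}
```

With this shape, records of different keys never wait on each other, while two records of the same key are always released in arrival order, which is the property the dim table join example depends on.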
