Re: [DISCUSS] FLIP-468: Introducing StreamGraph-Based Job Submission.

Ron Liu Tue, 23 Jul 2024 22:45:52 -0700

Hi, Junrui

Thanks for your detailed reply.


After reading the updated FLIP-468 & FLIP-470, I see that the design looks
good.


Best.
Ron

Junrui Lee <jrlee....@gmail.com> 于2024年7月18日周四 14:26写道：

> Hi all,
>
> I would like to follow up on my previous email regarding your feedback.
> Below
> is a concise summary of my main points:
>
> 1. Compiled Plan:
> IIUC, the compiled plan is primarily for ensuring execution plan
> compatibility across job versions (e.g., during upgrades). Eventually, it
> needs to be converted to StreamGraph for submission. Persisting
> StreamGraph/JobGraph is for supporting high availability in jm failover
> scenarios.
>
> 2. JobGraph vs StreamGraph Submission:
> For users, whether to submit JobGraph or StreamGraph is mostly transparent.
> In the client mode, this detail is hidden from users. The REST API (/jobs)
> was originally serving the internal JobGraph class, intended for the Flink
> client, not as a strict public API.
>
> 3. Runtime SQL Optimization:
> To facilitate runtime SQL-related optimization, we will introduce a
> strategy mechanism called StreamGraphOptimizationStrategy. SQL can
> implement specific strategies, such as
> AdaptiveBroadcastJoinOptimizationStrategy. More details are available in
> FLIP-469 [1].
>
> 4. Persistence Strategy:
> Initially, we can retain the JobGraph persistence method to ensure it does
> not affect stream jobs. After the new method is validated in batch
> processing scenarios over several versions, we can unify the stream jobs to
> also use StreamGraph persistence.
>
> Your thoughts and insights on these points would be highly appreciated.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-469%3A+Supports+Adaptive+Optimization+of+StreamGraph
>
> Best regards,
>
> Junrui
>
> Junrui Lee <jrlee....@gmail.com> 于2024年7月12日周五 20:29写道：
>
> > Hi all,
> >
> > Thanks for your feedback. Below are my thoughts on the questions you've
> > raised
> >
> > @Fabian
> >
> > - What is the future plan for job submissions in Flink? With the current
> >> proposal, Flink will support JobGraph/StreamGraph/compiled plan
> >> submissions? It might be confusing for users and complicate the existing
> >> job submission logic significantly.
> >
> >
> > In my view, Flink should support submissions via StreamGraph and,
> > eventually, remove support for JobGraph submissions. As for the compiled
> > plan, I consider it to be a concept related to table/sql, which also
> needs
> > to be compiled into the corresponding StreamGraph (a more generalized
> > concept, applicable as well to DataStream API jobs) upon submission.
> > In this FLIP, considering to provide a smoother migration for users, and
> > recognizing that currently, a vast number of Flink UT/IT cases depend on
> > the submission of a specialized JobGraph to achieve test objectives,
> > transitioning these test cases and replacing with StreamGraph would be a
> > highly risky and complicated action. Therefore, I am inclined not to
> > eliminate the pathway of submitting JobGraph within this FLIP.
> >
> >  How do we plan to connect the optimizer with Flink's runtime?
> >
> > @Ron
> >
> >> For batch scenario, if we want to better support dynamic plan tuning
> >> strategies, the fundamental solution is still to put SQL Optimizer to
> >> flink-runtime.
> >
> >
> > Our current solution is to abstract an interface called
> > StreamGraphOptimizationStrategy. For table-related optimization
> strategies,
> > taking the Broadcast Join concept as an example, we can implement an
> > AdaptiveBroadcastJoinOptimizationStrategy at the table layer. When this
> > strategy is needed, during SQL compilation, we will set the configuration
> > item execution.batch.adaptive.stream-graph-optimization.strategies. Then,
> > the runtime layer will load the corresponding Strategy class based on
> this
> > configuration item to be used at runtime.
> >
> > We will describe this part in the StreamGraphOptimizer section of
> > FLIP-469, and the specific introduction of
> > AdaptiveBroadcastJoinOptimizationStrategy will be discussed in subsequent
> > FLIPs, which is expected to happen within a few days.
> >
> > @David
> >
> >> 1. Transformations in StreamGraphGenerator:*
> >>
> >
> > I'm inclined to support submitting StreamGraph because it could already
> > provide the actual logical plan of the job at runtime. To be honest, I'm
> > not sure what additional benefits submitting Transformations would bring
> > compared to StreamGraph. I would appreciate any insights that you might
> > offer on this matter.
> >
> > *2. StreamGraph for Recovery Purposes:
> >
> >
> > In fact, this conflicts with our desire to make adaptive optimization to
> > the StreamGraph at runtime, as under such scenarios, the StreamGraph is
> the
> > complete expression of a job's logic, not the JobGraph. More details can
> > refer to the specific details in FLIP-469.
> >
> > I have reviewed the code related to the JobGraphStore and discovered that
> > it can be extended to store both StreamGraph and JobGraph simultaneously.
> > As for your concerns, can we consider the following: in batch mode, we
> use
> > Job Recovery based on StreamGraph, whereas for stream mode, we continue
> to
> > use the original JobGraph recovery and the StreamGraph would be converted
> > to a JobGraph right at the beginning.
> >
> > 3. Moving Away from Java Serializables:
> >
> >
> > Are you suggesting that Java serialization has limitations, and that we
> > should explore alternative serialization approaches? I agree that this
> is a
> > valuable consideration for the future. Do you think this should be
> included
> > in this FLIP? I would prefer to address it as a separate FLIP.
> >
> > Best,
> > Junrui
> >
> > David Morávek <david.mora...@gmail.com> 于2024年7月12日周五 14:47写道：
> >
> >> >
> >> > For batch scenario, if we want to better support dynamic plan tuning
> >> > strategies, the fundamental solution is still to put SQL Optimizer to
> >> > flink-runtime.
> >>
> >>
> >> One accompanying question is: how do you envision this to work for
> >> streaming where you need to ensure state compatibility after the plan
> >> change? FLIP-496 seems to only focus on batch.
> >>
> >> Best,
> >> D.
> >>
> >> On Fri, Jul 12, 2024 at 4:29 AM Ron Liu <ron9....@gmail.com> wrote:
> >>
> >> > Hi, Junrui
> >> >
> >> > The FLIP proposal looks good to me.
> >> >
> >> > I have the same question as Fabian:
> >> >
> >> > > For join strategies, they are only
> >> > applicable when using an optimizer (that's currently not part of
> Flink's
> >> > runtime) with the Table API or Flink SQL. How do we plan to connect
> the
> >> > optimizer with Flink's runtime?
> >> >
> >> > For batch scenario, if we want to better support dynamic plan tuning
> >> > strategies, the fundamental solution is still to put SQL Optimizer to
> >> > flink-runtime.
> >> >
> >> > Best,
> >> > Ron
> >> >
> >> > David Morávek <d...@apache.org> 于2024年7月11日周四 19:17写道：
> >> >
> >> > > Hi Junrui,
> >> > >
> >> > > Thank you for drafting the FLIP. I really appreciate the direction
> >> it’s
> >> > > taking. We’ve discussed similar approaches multiple times, and it’s
> >> great
> >> > > to see this progress.
> >> > >
> >> > > I have a few questions and thoughts:
> >> > >
> >> > >
> >> > > * 1. Transformations in StreamGraphGenerator:*
> >> > > Should we consider taking this a step further by working on a list
> of
> >> > > transformations (inputs of StreamGraphGenerator)?
> >> > >
> >> > >     public StreamGraphGenerator(
> >> > >             List<Transformation<?>> transformations,
> >> > >             ExecutionConfig executionConfig,
> >> > >             CheckpointConfig checkpointConfig,
> >> > >             ReadableConfig configuration) {
> >> > >
> >> > > We could potentially merge ExecutionConfig and CheckpointConfig into
> >> > > ReadableConfig. This approach might offer us even more flexibility.
> >> > >
> >> > >
> >> > > *2. StreamGraph for Recovery Purposes:*
> >> > > Should we avoid using StreamGraph for recovery purposes? The
> existing
> >> > > JG-based recovery code paths took years to perfect, and it doesn’t
> >> seem
> >> > > necessary to replace them. We only need SG for cases where we want
> to
> >> > > regenerate the JG.
> >> > > Additionally, translating SG into JG before persisting it in HA
> could
> >> be
> >> > > beneficial, as it allows us to catch potential issues early on.
> >> > >
> >> > >
> >> > > * 3. Moving Away from Java Serializables:*
> >> > > It would be great to start moving away from Java Serializables as
> >> much as
> >> > > possible. Could we instead define proper versioned serializers,
> >> possibly
> >> > > based on a well-defined protobuf blueprint? This change could help
> us
> >> > avoid
> >> > > ongoing issues associated with Serializables.
> >> > >
> >> > > Looking forward to your thoughts.
> >> > >
> >> > > Best,
> >> > > D.
> >> > >
> >> > > On Thu, Jul 11, 2024 at 12:58 PM Fabian Paul <fp...@apache.org>
> >> wrote:
> >> > >
> >> > > > Thanks for drafting this FLIP. I really like the idea of
> >> introducing a
> >> > > > concept in Flink that is close to a logical plan submission.
> >> > > >
> >> > > > I have a few questions about the proposal and its future
> >> evolvability.
> >> > > >
> >> > > > - What is the future plan for job submissions in Flink? With the
> >> > current
> >> > > > proposal, Flink will support JobGraph/StreamGraph/compiled plan
> >> > > > submissions? It might be confusing for users and complicate the
> >> > existing
> >> > > > job submission logic significantly.
> >> > > > - The FLIP mentions multiple areas of optimization, first operator
> >> > > chaining
> >> > > > and second dynamic switches between join strategies. I think from
> a
> >> > Flink
> >> > > > perspective, these are, at the moment, separate concerns.  For
> >> operator
> >> > > > chaining, I agree with the current proposal, which is a concept
> that
> >> > > > applies generally to Flink's runtime. For join strategies, they
> are
> >> > only
> >> > > > applicable when using an optimizer (that's currently not part of
> >> > Flink's
> >> > > > runtime) with the Table API or Flink SQL. How do we plan to
> connect
> >> the
> >> > > > optimizer with Flink's runtime?
> >> > > > - With table/SQL API we already expose a compiled plan to support
> >> > stable
> >> > > > version upgrades. It would be great to explore a joined plan to
> also
> >> > > offer
> >> > > > stable version upgrades with a potentially persistent streamgraph.
> >> > > >
> >> > > > Best,
> >> > > > Fabian
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] FLIP-468: Introducing StreamGraph-Based Job Submission.

Reply via email to