Hi everyone,

Thanks Guowei for the proposal.

Could you please clarify the idea behind FLIP-XXX: Independent Checkpoints
Based on Pipeline Region (the last FLIP-XXX)?

And regarding FLIP-XXX: Unaligned Checkpoint Enhancements for Long-Running
AI Workloads: how could "lifting the restriction on forward / pointwise
edges" be achieved?

Regards,
Roman


On Tue, May 5, 2026 at 3:29 PM FeatZhang <[email protected]> wrote:

> Hi Guowei and Dian,
>
> Thanks for the detailed discussion and for raising these important
> architectural questions.
>
> I’d like to connect this thread with an earlier discussion I initiated:
> https://lists.apache.org/thread/rn8r7pqfgncd2k3tmy9qfxs6xcslr7lf
>
>
> https://docs.google.com/document/d/17PpHDjFCeL2wcp9Q7t7snyvKnW2ugBKZ-rtqo_nMKQg/edit
>
>
> and provide more context from real-world AI application scenarios, as
> well as how my current work and proposed design align with Flink’s
> evolution.
>
> ________________________________
>
> 1. Starting point: AI workloads in streaming systems
>
> From current industry usage, AI in streaming systems is not an
> isolated domain, but an extension of existing Flink workloads.
>
> Typical Flink applications already include:
>
> - real-time analytics (dashboards, monitoring)
> - event-driven systems (fraud detection, anomaly detection)
> - data pipelines (ETL, enrichment)
>
> In recent years, these workloads have been evolving into:
>
> - real-time inference (model scoring / LLM / RAG)
> - event-driven decision making
> - continuous feature + inference pipelines
>
> For example:
>
> - fraud detection → model inference in stream
> - monitoring → anomaly detection models
> - BI dashboards → AI-assisted insights
>
> So from a system perspective:
>
> AI workloads are not replacing Flink use cases, but extending existing
> BI / streaming analytics workloads into intelligent pipelines
>
> ________________________________
>
> 2. Current gap in Flink (from application perspective)
>
> Today, implementing inference in Flink typically relies on:
>
> - AsyncFunction / async I/O
> - custom batching logic
> - ad-hoc retry / timeout handling
> - external model services (e.g., Triton)
>
> Flink already provides building blocks for these:
>
> - async processing
> - state management
> - exactly-once semantics
> - integration with inference systems like Triton
>
> However, from an application perspective, we observe:
>
> The same pattern is repeatedly reimplemented:
> - batching
> - concurrency control
> - retry / timeout
> - backend integration
>
> And these are:
>
> - not standardized
> - not reusable
> - not aligned with a unified operator abstraction
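
To make the repeatedly reimplemented pattern concrete, here is a minimal, Flink-independent sketch of the batching and retry halves. All names (`MicroBatcher`, `retry`) are hypothetical illustrations for this thread, not APIs from any of the contributions or FLIPs discussed:

```python
import time

class MicroBatcher:
    """Collects records and flushes when either max_batch_size or
    max_wait_s is reached -- the batching half of the pattern."""
    def __init__(self, max_batch_size, max_wait_s, flush_fn):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.flush_fn = flush_fn          # fn: batch -> list of results
        self.buffer = []
        self.first_arrival = None
        self.results = []

    def add(self, record, now=None):
        now = time.monotonic() if now is None else now
        if not self.buffer:
            self.first_arrival = now      # start the wait clock
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def maybe_flush_on_timer(self, now=None):
        """Flush a partial batch once max_wait_s has elapsed."""
        now = time.monotonic() if now is None else now
        if self.buffer and now - self.first_arrival >= self.max_wait_s:
            self.flush()

    def flush(self):
        batch, self.buffer = self.buffer, []
        self.results.extend(self.flush_fn(batch))

def retry(call, attempts=3, base_delay_s=0.0):
    """Ad-hoc retry with exponential backoff -- the reliability half."""
    for i in range(attempts):
        try:
            return call()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay_s * (2 ** i))

# Usage: a fake "model" that scores a whole batch at once.
batcher = MicroBatcher(max_batch_size=3, max_wait_s=0.1,
                       flush_fn=lambda batch: [x * 10 for x in batch])
for x in [1, 2, 3, 4]:
    batcher.add(x)                        # 3rd record triggers a size flush
batcher.maybe_flush_on_timer(now=time.monotonic() + 1.0)  # timeout flush
print(batcher.results)  # [10, 20, 30, 40]
```

Every production job ends up carrying some variant of this boilerplate, which is exactly the duplication the thread describes.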
>
> ________________________________
>
> 3. My contributions and what problems they address
>
> In previous contributions, I have been working on multiple aspects of
> this problem:
>
> Inference execution
>
> - AsyncBatchWaitOperator
> - AsyncBatchFunction (DataStream / Table API)
> - retry / timeout strategies
> - unified metrics
>
> 👉 solving:
>
> - async inference execution
> - batch-based inference
> - reliability baseline
>
> ________________________________
>
> Backend integration
>
> - Triton inference integration
> - HTTP connection pooling
> - fallback / retry mechanisms
>
> 👉 solving:
>
> - external inference system integration
> - production-grade inference calls
>
> ________________________________
>
> Runtime control (related work)
>
> - NodeHealthManager
> - slot filtering and isolation
>
> 👉 solving:
>
> - runtime stability and resource control
>
> ________________________________
>
> These contributions already cover most of the required capabilities:
>
> - async execution
> - batching
> - retry / timeout
> - backend integration
>
> But currently:
>
> they are distributed across multiple components without a unified
> abstraction
>
> ________________________________
>
> 4. Design direction: from BI → AI (incremental evolution)
>
> One important point I want to emphasize is:
>
> This proposal is not introducing an "AI system" into Flink,
> but extending Flink along its existing evolution path.
>
> Flink has historically evolved as:
>
> ETL → Streaming Analytics → Real-time BI → (now) Intelligent Streaming
>
> So the direction here is:
>
> - start from mature BI / SQL / streaming workloads
> - incrementally introduce inference capabilities
> - unify them into runtime abstractions
>
> This keeps Flink aligned with:
>
> - existing users
> - existing APIs
> - production workloads
>
> ________________________________
>
> 5. Relation to flink-agents
>
> I fully agree that Apache Flink Agents is an important direction.
>
> From my understanding:
>
> flink-agents focuses on:
>
> - agent orchestration
> - LLM workflows
> - tool / memory / reasoning loops
>
> this proposal focuses on:
>
> execution inside Flink runtime
>
> Specifically:
>
> - batching at operator level
> - concurrency control
> - backpressure-aware scheduling
> - checkpoint alignment
>
> So:
>
> - flink-agents → orchestration layer
> - this proposal → execution layer
>
> They are complementary rather than overlapping.
>
> ________________________________
>
> 6. Proposed scope (based on discussion)
>
> Based on feedback, I agree that the scope should be minimal.
>
> The plan is to start with:
>
> - a unified InferenceOperator
> - batch + async execution model
> - concurrency control
> - basic timeout / retry
>
> And leave out:
>
> - full AI runtime abstraction
> - backend SPI
> - advanced reliability mechanisms
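>

As one possible reading of this minimal scope, here is a small asyncio sketch of batched, concurrency-capped inference calls with a per-batch timeout. `InferenceExecutor` and every other name below are hypothetical and do not come from the planned FLIP or from Flink itself:

```python
import asyncio

class InferenceExecutor:
    """Hypothetical sketch of the minimal scope: batch + async execution,
    concurrency control, and a basic per-batch timeout."""
    def __init__(self, infer_fn, max_in_flight=2, timeout_s=1.0):
        self.infer_fn = infer_fn                 # async fn: batch -> results
        self.sem = asyncio.Semaphore(max_in_flight)
        self.timeout_s = timeout_s

    async def run_batch(self, batch):
        async with self.sem:                     # concurrency control
            return await asyncio.wait_for(       # basic timeout
                self.infer_fn(batch), timeout=self.timeout_s)

    async def run_all(self, batches):
        # Batches run concurrently, bounded by the semaphore;
        # results come back in submission order.
        return await asyncio.gather(*(self.run_batch(b) for b in batches))

async def fake_model(batch):
    await asyncio.sleep(0.01)                    # stand-in for an RPC call
    return [len(x) for x in batch]

async def main():
    ex = InferenceExecutor(fake_model, max_in_flight=2, timeout_s=0.5)
    return await ex.run_all([["a", "bb"], ["ccc"]])

out = asyncio.run(main())
print(out)  # [[1, 2], [3]]
```

Retry could be layered around `run_batch` the same way; everything beyond that (backend SPI, advanced reliability) stays out of scope, as listed above.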
>
> ________________________________
>
> 7. Next step
>
> Based on the discussion, I plan to draft a minimal FLIP focusing on
> the operator abstraction.
>
> Before doing so, I’d appreciate feedback on:
>
> - whether this "BI → AI incremental path" makes sense
> - whether introducing an operator-level abstraction is appropriate
> - whether this scope is sufficiently minimal
>
> ________________________________
>
> 8. Closing
>
> From my perspective, this work is about:
>
> helping Flink close the gap between existing streaming workloads and
> emerging AI use cases, in a way that is incremental, consistent, and
> production-oriented.
>
> Thanks again for the discussion — it has been very helpful.
>
>
> For easier understanding, I summarized the evolution path and positioning
> below:
>
>
> Best,
> Feat Zhang
>
>
> On Wed, Apr 29, 2026 at 10:05 AM Dian Fu <[email protected]> wrote:
> >
> > Hi Guowei,
> >
> > Thanks for driving this FLIP. I believe it will be an important step
> > in moving Flink from the BI stack to the AI stack.
> >
> > Here are my thoughts on the concerns raised by Gustavo:
> >
> > > - SQL/Table: What is the plan here? How will the new multimodal types
> > > (Tensor, Image, Embedding) work in the type system, codegen, and
> > > plan/savepoint compatibility?
> >
> > I think the new multimodal types will also be introduced in the
> > DataStream API and Table API, just like the existing built-in data
> > types in Flink.
> > > Is there a plan for SQL-level model inference beyond the current
> > > ML_PREDICT shape, for example, vector similarity or multimodal
> > > predicates? Today this is still very vendor-specific across the
> > > industry, so it would be nice to know if Flink wants to take a clear
> > > position here or how this FLIP will fit with the SQL/Table vision
> >
> > We intend to introduce more built-in, domain-specific model inference
> > functions beyond ML_PREDICT. This is reflected in the part `FLIP-XXX:
> > Built-in Multimodal Operators and AI Functions`. We plan to introduce
> > quite a few built-in functionalities around multimodal data processing
> > and domain-specific model inference functions to make users' lives
> > easier. They will be available to both Python users and SQL users. The
> > detailed list will be discussed in detail in that sub-FLIP.
> >
> > > - DataStream (v1 and v2): Will RpcOperator and the Arrow-batch
> > > primitives be exposed as first-class building blocks for Java users,
> > > or only as internal pieces behind the Python DataFrame? Many
> > > streaming inference use cases (real-time enrichment, CDC + model
> > > scoring) fit very well with DataStream and would benefit from clear
> > > guidance.
> >
> > Good question. Regarding Arrow-batch primitives, the preliminary
> > thinking is to focus on the Python jobs first, since it's very clear
> > that Python users need it. If we see clear use cases where Java users
> > would directly benefit from it, we can absolutely expose them as
> > first-class building blocks. It would be great if you could share more
> > thoughts on how Java users could use it in the above or other
> > scenarios.
> >
> > Regards,
> > Dian
> >
> >
> > On Wed, Apr 29, 2026 at 12:04 AM David Radley <[email protected]>
> > wrote:
> > >
> > > Hi Guowei,
> > > This is an interesting proposal. I second Robert's questions. Some
> > > thoughts.
> > >
> > > I think Layer 3 does not depend on layers 1 and 2. At a high level I
> > > wonder: is the idea that Flink could become something like an R ML
> > > pipeline or SPSS? It would be good to compare existing technology
> > > solutions and what benefits Flink will bring to these scenarios.
> > > FLIP-XXX: Supporting RpcOperator — Independently Deployed and Scaled
> > > RPC Service Operators - see Robert's comment. I assume this is a
> > > specialization of the async I/O operator for RPC. When you say
> > > deploying RPC services that are fully managed by the Flink runtime,
> > > where would these be deployed? If remote, how would this work? It
> > > would be interesting to see some use cases where Flink would be
> > > deploying RPC services that it has created.
> > > FLIP-XXX: Multimodal Data Type System and Object Reference Mechanism
> > >
> > > I like the idea of adding these types - the interesting part will be
> > > the serialization/deserialization.
> > >
> > > FLIP-XXX: A More Pythonic DataFrame API for Python Users - this
> > > makes sense.
> > > FLIP-XXX: Connector API for Multimodal Data Source/Sink - I assume
> > > this will be renamed to new multimodal formats. Are there existing
> > > registries these could be looked up in, similar to a schema registry,
> > > so we can bring in artifacts via metadata?
> > > FLIP-XXX: Built-in Multimodal Operators and AI Functions - I wonder
> > > if we could bring in existing implementation libraries, and the new
> > > work would allow us to call them from Flink, i.e. not adding them
> > > call by call but library by library.
> > > FLIP-XXX: Columnar Data Transport and Processing Optimization - this
> > > seems like a big change: events as columns rather than events as rows
> > > or CDC sequences. I assume this would not be exposed in SQL?
> > >
> > >  kind regards, David.
> > >
> > > From: Robert Metzger <[email protected]>
> > > Date: Tuesday, 28 April 2026 at 07:38
> > > To: [email protected] <[email protected]>
> > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-577: AI-Native Flink — An
> > > Umbrella Proposal for Multimodal Data Processing
> > >
> > > Hey Guowei,
> > >
> > > Thanks for the proposal. I just took a brief look; here are some
> > > high-level questions:
> > >
> > > Regarding the RPC Operator: what is the difference from the async
> > > I/O operator we already have?
> > >
> > > "Connector API for Multimodal Data Source/Sink": Why do we need to
> > > touch the connector API to support multimodal data? Isn't this more
> > > of a formats concern?
> > >
> > > "Non-Disruptive Scaling for CPU Operators": How do you want to
> > > guarantee exactly-once with that kind of scaling? E.g. you need to
> > > somehow make a handover between the old and new pipeline.
> > >
> > > Overall, I find the proposal has some things which seem related to
> > > making Flink more AI-native, but other changes seem orthogonal to
> > > that. For example, the checkpoint or scaling changes are actually
> > > unrelated to AI and are just engine improvements.
> > >
> > >
> > > On Tue, Apr 28, 2026 at 5:48 AM Guowei Ma <[email protected]>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I'd like to start a discussion on an umbrella FLIP[1] that lays out a
> > > > direction for evolving Flink into a data engine that natively
> supports AI
> > > > workloads.
> > > >
> > > > The short version: user workloads are shifting from BI analytics
> > > > to multimodal data processing centered on model inference, and this
> > > > triggers cascading changes across the stack — multimodal data
> > > > flowing through pipelines, heterogeneous CPU/GPU resources,
> > > > vectorized execution, and inference tasks that run for seconds to
> > > > minutes on Spot instances. The proposal sketches an evolution along
> > > > five directions (development paradigm, data model, heterogeneous
> > > > resources, execution engine, fault tolerance), decomposed into 11
> > > > sub-FLIPs organized into three layers: core runtime primitives, AI
> > > > workload expression and execution, and production-grade operational
> > > > guarantees. Most sub-FLIPs have no hard dependencies on each other
> > > > and can be advanced in parallel.
> > > >
> > > > A note on scope, since it's an umbrella:
> > > >
> > > > - In scope here: whether the evolution directions are reasonable,
> > > > whether each sub-FLIP's motivation and proposed approach are
> > > > well-founded, and whether the boundaries and dependencies between
> > > > sub-FLIPs are clear.
> > > > - Out of scope here: detailed designs, API specifics, and
> > > > implementation plans of individual sub-FLIPs — those will go
> > > > through their own FLIPs.
> > > > - Consensus criteria: agreement on the overall direction is
> > > > sufficient for the umbrella to pass; passing it does not lock in
> > > > any sub-FLIP's design — sub-FLIPs may still be adjusted, deferred,
> > > > or withdrawn as they progress.
> > > >
> > > > All proposed changes are incremental — no existing API or behavior
> > > > is removed or altered. Compatibility details are covered at the end
> > > > of the document.
> > > >
> > > > Looking forward to your feedback on the overall direction and the
> > > > layering.
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275
> > > >
> > > > Thanks,
> > > > Guowei
> > > >
> > >
> > > Unless otherwise stated above:
> > >
> > > IBM United Kingdom Limited
> > > Registered in England and Wales with number 741598
> > > Registered office: Building C, IBM Hursley Office, Hursley Park Road,
> Winchester, Hampshire SO21 2JN
>
