Hi Martijn, Thanks. Most of these overlap with what I covered earlier — quick pointers:
(1) *Bundling*: see my reply to Robert — these mechanisms are adequate for normal workloads but break under AI workload characteristics. The umbrella doesn't bypass the FLIP process; each sub-FLIP stands on its own. (2) *Feature matrix*: fair. Multi-API coverage should be a hard requirement for engine-level primitives — Jark and Dian have confirmed the SQL/Table & Python side in this thread. Please flag any sub-FLIP where it looks underspecified. (3) *RpcOperator / Flink ML*: see Zhu Zhu on RpcOperator earlier in this thread. On Flink ML: as Gen Luo noted, Flink ML tried to own the training lifecycle (experiment tracking, versioning, etc.), which never aligned with the Python/notebook workflow ML researchers prefer. This proposal stays on the inference/data side, and today's standardized serving interfaces make external model integration much lighter than it was in the Flink ML era. If specific items still look like they cross the line, happy to dig in. (4) *Evidence*: fair. I listed concrete signals in my reply to Yaroslav. Best, Guowei On Wed, May 6, 2026 at 11:51 AM Gen Luo <[email protected]> wrote: > Hi Roman, > > 1. Independent Checkpoints Based on Pipeline Region > > This is actually a production-proven approach. A similar concept, Regional > Checkpoint, was already presented at FFA 2020 (Single Task Recovery and > Regional Checkpoint) to address stability challenges in large-scale data > integration jobs. > > We have found that modern multi-modal offline inference jobs face very > similar issues. Their common characteristics include: > - Simple topologies (typically single-operator pipelines or independent > regions without all-to-all edges) > - Stateless computation > - High parallelism/concurrency > - Unstable execution environments (prone to eviction on cheap/preemptible > offline resources) > On Ray, these jobs currently require users to manually implement checkpoint > logic. 
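[Editor's note: for readers unfamiliar with the manual recovery pattern Gen describes next, it amounts to an anti-join of the re-read input against the result table. A minimal sketch in plain Python — no Ray or Flink dependencies, all names illustrative:

```python
def replay_after_failover(input_records, result_table_keys):
    """Filter out records whose keys already appear in the result table.

    input_records: iterable of (key, payload) pairs re-read from the source.
    result_table_keys: set of keys already written before the failure.
    Returns only the records that still need processing (an anti-join).
    """
    return [(k, v) for (k, v) in input_records if k not in result_table_keys]

# After a failure the whole input is re-consumed and filtered:
pending = replay_after_failover(
    [("a", 1), ("b", 2), ("c", 3)],
    result_table_keys={"a", "b"},
)
# only ("c", 3) remains to be processed
```

The cost Gen points out is visible here: every record is re-read and re-filtered, even though only a small tail was actually lost.]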
After a failover, they must fully re-consume the input data and rely > on an anti-join with the result table to filter out already processed keys. > > We believe Flink’s bounded stream execution mode is highly suitable for > this scenario. However, currently, a failure in a single region causes the > entire checkpoint to be cancelled, which can lead to prolonged checkpoint > failures and significant rollback of task consumption progress. We believe > Regional Checkpoints can effectively solve this. In this scenario, the > snapshot of each region is independent, with only the Source Coordinator > requiring additional handling. Theoretically, merging the latest snapshots > of each region forms a valid global snapshot. While Flink has evolved > significantly since 2020 and the implementation details will differ, the > foundational concept remains intact. We are open to further discussion to > refine the design. > > 2. How "lifting the restriction on forward / pointwise edges" could be > achieved? > > Based on preliminary research, the restriction on applying unaligned > checkpoints to forward / pointwise edges stems from data ordering > requirements. The goal is to prevent out-of-order delivery when ordered > data stored in channel state gets redistributed due to parallelism changes. > While this is a valid concern, we believe completely banning unaligned > checkpoints on these edges is overly defensive. In practice, many stateless > workloads (including AI inference pipelines) do not require strict data > ordering. Many users even explicitly enable unordered async joins to > improve throughput. > > However, Flink cannot automatically infer whether a source is ordered or > whether the user's application logic requires strict ordering. Therefore, > this proposal does not change the default behavior. 
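[Editor's note: the composition rule claimed in point 1 — that merging the latest snapshot of each region yields a valid global snapshot — can be sketched as follows. Plain Python; region and snapshot identifiers are illustrative, not Flink APIs:

```python
def compose_global_snapshot(completed):
    """Pick each region's latest completed snapshot.

    Because the regions share no edges, the union of their independent
    snapshots is consistent; only source-coordinator state needs extra
    handling, as noted above.

    completed: dict mapping region_id -> list of completed snapshot ids
               (a failed region may lag behind the others).
    Returns: dict mapping region_id -> snapshot id to restore from.
    """
    return {region: max(ids) for region, ids in completed.items() if ids}

# Region "r2" failed during snapshot 7, so its latest is still 6;
# the other regions keep their progress instead of rolling back:
global_snapshot = compose_global_snapshot(
    {"r1": [5, 6, 7], "r2": [5, 6], "r3": [5, 6, 7]}
)
```

The point of the sketch is the asymmetry: a failure in "r2" costs only that region's progress, rather than cancelling the global checkpoint.]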
Instead, it plans to > introduce a configuration option (or reuse the existing FORCE_UNALIGNED > flag) to explicitly allow users to enable unaligned checkpoints on forward > / pointwise edges when they know ordering is not a concern. > > Do you consider this approach reasonable? Please let me know if I have > overlooked any valid technical concerns regarding this restriction. > > Best, > Gen > > On Tue, May 5, 2026 at 10:27 PM Roman Khachatryan <[email protected]> > wrote: > > > Hi everyone, > > > > Thanks Guowei for the proposal, > > > > Could you please clarify the idea behind FLIP-XXX: Independent > Checkpoints > > Based on Pipeline Region (the last FLIP-XXX)? > > > > And w.r.t FLIP-XXX: Unaligned Checkpoint Enhancements for Long-Running AI > > Workloads > > How "lifting the restriction on forward / pointwise edges" could be > > achieved? > > > > Regards, > > Roman > > > > > > On Tue, May 5, 2026 at 3:29 PM FeatZhang <[email protected]> wrote: > > > > > Hi Guowei and Dian, > > > > > > Thanks for the detailed discussion and for raising these important > > > architectural questions. > > > > > > I’d like to connect this thread with an earlier discussion I initiated: > > > https://lists.apache.org/thread/rn8r7pqfgncd2k3tmy9qfxs6xcslr7lf > > > > > > > > > > > > https://docs.google.com/document/d/17PpHDjFCeL2wcp9Q7t7snyvKnW2ugBKZ-rtqo_nMKQg/edit > > > > > > > > > and provide more context from real-world AI application scenarios, as > > > well as how my current work and proposed design align with Flink’s > > > evolution. > > > > > > ________________________________ > > > > > > 1. Starting point: AI workloads in streaming systems > > > > > > From current industry usage, AI in streaming systems is not an > > > isolated domain, but an extension of existing Flink workloads. 
> > > > > > Typical Flink applications already include: > > > > > > real-time analytics (dashboards, monitoring) > > > event-driven systems (fraud detection, anomaly detection) > > > data pipelines (ETL, enrichment) > > > > > > In recent years, these workloads are evolving into: > > > > > > real-time inference (model scoring / LLM / RAG) > > > event-driven decision making > > > continuous feature + inference pipelines > > > > > > For example: > > > > > > fraud detection → model inference in stream > > > monitoring → anomaly detection models > > > BI dashboards → AI-assisted insights > > > > > > So from a system perspective: > > > > > > AI workloads are not replacing Flink use cases, but extending existing > > > BI / streaming analytics workloads into intelligent pipelines > > > > > > ________________________________ > > > > > > 2. Current gap in Flink (from application perspective) > > > > > > Today, implementing inference in Flink typically relies on: > > > > > > AsyncFunction / async I/O > > > custom batching logic > > > ad-hoc retry / timeout handling > > > external model services (e.g., Triton) > > > > > > Flink already provides building blocks for these: > > > > > > async processing > > > state management > > > exactly-once semantics > > > integration with inference systems like Triton > > > > > > However, from an application perspective, we observe: > > > > > > Same pattern is repeatedly reimplemented: > > > - batching > > > - concurrency control > > > - retry / timeout > > > - backend integration > > > > > > And these are: > > > > > > not standardized > > > not reusable > > > not aligned with a unified operator abstraction > > > > > > ________________________________ > > > > > > 3. 
My contributions and what problems they address > > > > > > In previous contributions, I have been working on multiple aspects of > > > this problem: > > > > > > Inference execution > > > > > > AsyncBatchWaitOperator > > > AsyncBatchFunction (DataStream / Table API) > > > retry / timeout strategies > > > unified metrics > > > > > > 👉 solving: > > > > > > async inference execution > > > batch-based inference > > > reliability baseline > > > > > > ________________________________ > > > > > > Backend integration > > > > > > Triton inference integration > > > HTTP connection pooling > > > fallback / retry mechanisms > > > > > > 👉 solving: > > > > > > external inference system integration > > > production-grade inference calls > > > > > > ________________________________ > > > > > > Runtime control (related work) > > > > > > NodeHealthManager > > > slot filtering and isolation > > > > > > 👉 solving: > > > > > > runtime stability and resource control > > > > > > ________________________________ > > > > > > These contributions already cover most of the required capabilities: > > > > > > async execution > > > batching > > > retry / timeout > > > backend integration > > > > > > But currently: > > > > > > they are distributed across multiple components without a unified > > > abstraction > > > > > > ________________________________ > > > > > > 4. 
Design direction: from BI → AI (incremental evolution) > > > > > > One important point I want to emphasize is: > > > > > > This proposal is not introducing an “AI system” into Flink > > > but extending Flink along its existing evolution path > > > > > > Flink has historically evolved as: > > > > > > ETL → Streaming Analytics → Real-time BI → (now) Intelligent Streaming > > > > > > So the direction here is: > > > > > > start from mature BI / SQL / streaming workloads > > > incrementally introduce inference capabilities > > > unify them into runtime abstractions > > > > > > This keeps Flink aligned with: > > > > > > existing users > > > existing APIs > > > production workloads > > > > > > ________________________________ > > > > > > 5. Relation to flink-agents > > > > > > I fully agree that Apache Flink Agents is an important direction. > > > > > > From my understanding: > > > > > > flink-agents focuses on: > > > > > > agent orchestration > > > LLM workflows > > > tool / memory / reasoning loops > > > > > > this proposal focuses on: > > > > > > execution inside Flink runtime > > > > > > Specifically: > > > > > > batching at operator level > > > concurrency control > > > backpressure-aware scheduling > > > checkpoint alignment > > > > > > So: > > > > > > flink-agents → orchestration layer > > > this proposal → execution layer > > > > > > They are complementary rather than overlapping. > > > > > > ________________________________ > > > > > > 6. Proposed scope (based on discussion) > > > > > > Based on feedback, I agree that the scope should be minimal. > > > > > > The plan is to start with: > > > > > > a unified InferenceOperator > > > batch + async execution model > > > concurrency control > > > basic timeout / retry > > > > > > And leave out: > > > > > > full AI runtime abstraction > > > backend SPI > > > advanced reliability mechanisms > > > > > > ________________________________ > > > > > > 7. 
Next step > > > > > > Based on the discussion, I plan to draft a minimal FLIP focusing on > > > the operator abstraction. > > > > > > Before doing so, I’d appreciate feedback on: > > > > > > whether this “BI → AI incremental path” makes sense > > > whether introducing an operator-level abstraction is appropriate > > > whether this scope is sufficiently minimal > > > > > > ________________________________ > > > > > > 8. Closing > > > > > > From my perspective, this work is about: > > > > > > helping Flink close the gap between existing streaming workloads and > > > emerging AI use cases, in a way that is incremental, consistent, and > > > production-oriented. > > > > > > Thanks again for the discussion — it has been very helpful. > > > > > > > > > For easier understanding, I summarized the evolution path and > positioning > > > below: > > > > > > > > > Best, > > > Feat Zhang > > > > > > > > > On Wed, Apr 29, 2026 at 10:05 AM Dian Fu <[email protected]> > wrote: > > > > > > > > Hi Guowei, > > > > > > > > Thanks for driving this FLIP. I believe it will be an important step > > > > towards moving Flink from the BI stack towards the AI stack. > > > > > > > > Here are my thoughts on the concerns raised by Gustavo: > > > > > > > > > - SQL/Table: What is the plan here? How will the new multimodal > types > > > > > (Tensor, Image, Embedding) work in the type system, codegen, and > > > > > plan/savepoint compatibility? > > > > > > > > I think the new multimodal types will be introduced also in the > > > > DataStream API & Table API just as the existing built-in data types > in > > > > Flink. > > > > > > > > > Is there a plan for SQL-level model inference > > > > > beyond the current ML_PREDICT shape, for example, vector similarity > > or > > > > > multimodal predicates? 
Today this is still very vendor-specific > > across > > > the > > > > > industry, so it would be nice to know if Flink wants to take a > clear > > > > > position here or how this FLIP will fit with the SQL/Table vision > > > > > > > > We intend to introduce more built-in, domain-specific model inference > > > > functions beyond ML_PREDICT. This is reflected in the part `FLIP-XXX: > > > > Built-in Multimodal Operators and AI Functions`. We plan to introduce > > > > quite a few > > > > built-in functionalities around multimodal data processing and > > > > domain-specific model inference functions to ease the life of users. > > > > They will be available for both Python users and SQL users. The > detailed > > > > list will be well discussed in that sub-FLIP. > > > > > > > > > - DataStream (v1 and v2): Will RpcOperator and the Arrow-batch > > > primitives > > > > > be exposed as first-class building blocks for Java users, or only > as > > > > > internal pieces behind the Python DataFrame? Many streaming > inference > > > use > > > > > cases (real-time enrichment, CDC + model scoring) fit very well > with > > > > > DataStream and would benefit from clear guidance. > > > > > > > > Good question. Regarding Arrow-batch primitives, the preliminary > > > > thinking is to focus on the Python jobs first, since it's very clear > > > > that Python users need it. If we see clear use cases where Java users > > > > would directly benefit from it, we can absolutely expose them as > > > > first-class building blocks. It would be great if you could share > more > > > > thoughts on how Java users could use it in the above or other > > > > scenarios. > > > > > > > > Regards, > > > > Dian > > > > > > > > > > > > On Wed, Apr 29, 2026 at 12:04 AM David Radley < > [email protected] > > > > > > wrote: > > > > > > > > > > Hi Guowei, > > > > > This is an interesting proposal. I second Robert's questions. Some > > > thoughts. > > > > > > > > > > Layer 3 does not depend on layers 1 and 2 I think. 
At the high > level > > I > > > wonder, is the idea that Flink could become like an R ML pipeline or > > SPSS? > > > It would be good to compare existing technology solutions and what > > benefits > > > Flink will bring to these scenarios. > > > > > FLIP-XXX: Supporting RpcOperator — Independently Deployed and > Scaled > > > RPC Service Operators - see Robert's comment. I assume this is a > > > specialization of the async io operator for RPC. When you say deploying > > RPC > > > services that are fully managed by the Flink runtime, where would these > > be > > > deployed? If it is remote how would this work? It would be interesting > to > > > see some use cases where Flink would be deploying RPC services that it > > has > > > created. > > > > > FLIP-XXX: Multimodal Data Type System and Object Reference > Mechanism > > > > > > > > > > I like the idea of adding these types - the interesting part will > be > > > the deser. > > > > > > > > > > FLIP-XXX: A More Pythonic DataFrame API for Python Users - this > makes > > > sense > > > > > FLIP-XXX: Connector API for Multimodal Data Source/Sink - I assume > > > this will be renamed to new multimodal formats. Are there existing > > registries > > > that these could be looked up in - similar to a schema registry - so we can > > > bring in artifacts via metadata? > > > > > FLIP-XXX: Built-in Multimodal Operators and AI Functions - I wonder > > if > > > we could bring in existing implementation libraries and the new work > > would > > > allow us to call them from Flink. i.e. not having to do them call > by > > > call but library by library. > > > > > FLIP-XXX: Columnar Data Transport and Processing Optimization - > this > > > seems a big change, events as columns rather than events as rows or CDC > > > sequences. I assume this would not be exposed in SQL? > > > > > > > > > > kind regards, David. 
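[Editor's note: as background for the async I/O comparison that David raises here and Robert asks about below, the pattern jobs hand-roll today — grouping records into batches and issuing a bounded number of concurrent calls with retry — can be sketched in plain Python. No Flink dependency; `call_model` stands in for any external inference endpoint and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor


def infer_with_batching(records, call_model, batch_size=4,
                        max_concurrency=2, max_retries=2):
    """Group records into batches, run at most `max_concurrency` batch
    calls at once, and retry a failed batch up to `max_retries` times.
    This is the ad-hoc logic each job currently reimplements on top of
    the async I/O building blocks.
    """
    batches = [records[i:i + batch_size]
               for i in range(0, len(records), batch_size)]

    def call_with_retry(batch):
        for attempt in range(max_retries + 1):
            try:
                return call_model(batch)
            except Exception:
                if attempt == max_retries:
                    raise

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # Executor.map preserves input order, so results line up
        # with the original record order.
        results = list(pool.map(call_with_retry, batches))
    return [r for batch_result in results for r in batch_result]


# A fake "model" that scores each record by doubling it:
scores = infer_with_batching(list(range(10)), lambda b: [x * 2 for x in b])
```

An operator-level abstraction would make batch size, concurrency, and retry policy declarative configuration instead of per-job code like this.]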
> > > > > > > > > > From: Robert Metzger <[email protected]> > > > > > Date: Tuesday, 28 April 2026 at 07:38 > > > > > To: [email protected] <[email protected]> > > > > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-577: AI-Native Flink — An > > > Umbrella Proposal for Multimodal Data Processing > > > > > > > > > > Hey Guowei, > > > > > > > > > > Thanks for the proposal. I just took a brief look, here are some > high > > > level > > > > > questions: > > > > > > > > > > Regarding the RPC Operator: What is the difference to the async io > > > operator > > > > > we have already? > > > > > > > > > > "Connector API for Multimodal Data Source/Sink": Why do we need to > > > touch > > > > > the connector API for supporting multimodal data? Isn't this more > of > > a > > > > > formats concern? > > > > > > > > > > "Non-Disruptive Scaling for CPU Operators": How do you want to > > > guarantee > > > > > exactly-once on that kind of scaling? E.g. you need to somehow > make a > > > > > handover between the old and new pipeline > > > > > > > > > > Overall, I find the proposal has some things which seem related to > > > making > > > > > Flink more AI native, but other changes seem orthogonal to that. > For > > > > > example the checkpoint or scaling changes are actually unrelated to > > > AI, and > > > > > just engine improvements. > > > > > > > > > > > > > > > On Tue, Apr 28, 2026 at 5:48 AM Guowei Ma <[email protected]> > > > wrote: > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > I'd like to start a discussion on an umbrella FLIP[1] that lays > > out a > > > > > > direction for evolving Flink into a data engine that natively > > > supports AI > > > > > > workloads. 
> > > > > > > > > > > > The short version: user workloads are shifting from BI analytics > to > > > > > > multimodal data processing centered on model inference, and this > > > triggers > > > > > > cascading changes across the stack — multimodal data flowing > > through > > > > > > pipelines, heterogeneous CPU/GPU resources, vectorized execution, > > and > > > > > > inference tasks that run for seconds to minutes on Spot > instances. > > > The > > > > > > proposal sketches an evolution along five directions (development > > > paradigm, > > > > > > data model, heterogeneous resources, execution engine, fault > > > tolerance), > > > > > > decomposed into 11 sub-FLIPs organized into three layers: core > > > runtime > > > > > > primitives, AI workload expression and execution, and > > > production-grade > > > > > > operational guarantees. Most sub-FLIPs have no hard dependencies > on > > > each > > > > > > other and can be advanced in parallel. > > > > > > > > > > > > A note on scope, since it's an umbrella: > > > > > > > > > > > > - In scope here: whether the evolution directions are reasonable, > > > whether > > > > > > each sub-FLIP's motivation and proposed approach are > well-founded, > > > and > > > > > > whether the boundaries and dependencies between sub-FLIPs are > > clear. > > > > > > - Out of scope here: detailed designs, API specifics, and > > > implementation > > > > > > plans of individual sub-FLIPs — those will go through their own > > > FLIPs. > > > > > > - Consensus criteria: agreement on the overall direction is > > > sufficient for > > > > > > the umbrella to pass; passing it does not lock in any sub-FLIP's > > > design — > > > > > > sub-FLIPs may still be adjusted, deferred, or withdrawn as they > > > progress. > > > > > > > > > > > > All proposed changes are incremental — no existing API or > behavior > > is > > > > > > removed or altered. Compatibility details are covered at the end > of > > > the > > > > > > document. 
> > > > > > > > > > > > Looking forward to your feedback on the overall direction and the > > > layering. > > > > > > > > > > > > [1] > > > > > > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275 > > > > > > > > > > > > Thanks, > > > > > > Guowei > > > > > > > > > > > > > > > > Unless otherwise stated above: > > > > > > > > > > IBM United Kingdom Limited > > > > > Registered in England and Wales with number 741598 > > > > > Registered office: Building C, IBM Hursley Office, Hursley Park > Road, > > > Winchester, Hampshire SO21 2JN > > > > > >
