Hi everyone, Thanks Guowei for the proposal.
Could you please clarify the idea behind FLIP-XXX: Independent Checkpoints
Based on Pipeline Region (the last FLIP-XXX)? And w.r.t. FLIP-XXX: Unaligned
Checkpoint Enhancements for Long-Running AI Workloads, how could "lifting
the restriction on forward / pointwise edges" be achieved?

Regards,
Roman

On Tue, May 5, 2026 at 3:29 PM FeatZhang <[email protected]> wrote:

> Hi Guowei and Dian,
>
> Thanks for the detailed discussion and for raising these important
> architectural questions.
>
> I’d like to connect this thread with an earlier discussion I initiated:
> https://lists.apache.org/thread/rn8r7pqfgncd2k3tmy9qfxs6xcslr7lf
>
> https://docs.google.com/document/d/17PpHDjFCeL2wcp9Q7t7snyvKnW2ugBKZ-rtqo_nMKQg/edit
>
> and provide more context from real-world AI application scenarios, as
> well as how my current work and proposed design align with Flink’s
> evolution.
>
> ________________________________
>
> 1. Starting point: AI workloads in streaming systems
>
> From current industry usage, AI in streaming systems is not an
> isolated domain, but an extension of existing Flink workloads.
>
> Typical Flink applications already include:
>
> real-time analytics (dashboards, monitoring)
> event-driven systems (fraud detection, anomaly detection)
> data pipelines (ETL, enrichment)
>
> In recent years, these workloads have been evolving into:
>
> real-time inference (model scoring / LLM / RAG)
> event-driven decision making
> continuous feature + inference pipelines
>
> For example:
>
> fraud detection → model inference in stream
> monitoring → anomaly detection models
> BI dashboards → AI-assisted insights
>
> So from a system perspective: AI workloads are not replacing Flink use
> cases, but extending existing BI / streaming analytics workloads into
> intelligent pipelines.
>
> ________________________________
>
> 2.
Current gap in Flink (from application perspective)
>
> Today, implementing inference in Flink typically relies on:
>
> AsyncFunction / async I/O
> custom batching logic
> ad-hoc retry / timeout handling
> external model services (e.g., Triton)
>
> Flink already provides building blocks for these:
>
> async processing
> state management
> exactly-once semantics
> integration with inference systems like Triton
>
> However, from an application perspective, we observe that the same
> pattern is repeatedly reimplemented:
> - batching
> - concurrency control
> - retry / timeout
> - backend integration
>
> And these implementations are:
>
> not standardized
> not reusable
> not aligned with a unified operator abstraction
>
> ________________________________
>
> 3. My contributions and what problems they address
>
> In previous contributions, I have been working on multiple aspects of
> this problem:
>
> Inference execution
>
> AsyncBatchWaitOperator
> AsyncBatchFunction (DataStream / Table API)
> retry / timeout strategies
> unified metrics
>
> 👉 solving:
>
> async inference execution
> batch-based inference
> reliability baseline
>
> ________________________________
>
> Backend integration
>
> Triton inference integration
> HTTP connection pooling
> fallback / retry mechanisms
>
> 👉 solving:
>
> external inference system integration
> production-grade inference calls
>
> ________________________________
>
> Runtime control (related work)
>
> NodeHealthManager
> slot filtering and isolation
>
> 👉 solving:
>
> runtime stability and resource control
>
> ________________________________
>
> These contributions already cover most of the required capabilities:
>
> async execution
> batching
> retry / timeout
> backend integration
>
> But currently they are distributed across multiple components without
> a unified abstraction.
>
> ________________________________
>
> 4.
Design direction: from BI → AI (incremental evolution)
>
> One important point I want to emphasize is that this proposal is not
> introducing an “AI system” into Flink, but extending Flink along its
> existing evolution path.
>
> Flink has historically evolved as:
>
> ETL → Streaming Analytics → Real-time BI → (now) Intelligent Streaming
>
> So the direction here is to:
>
> start from mature BI / SQL / streaming workloads
> incrementally introduce inference capabilities
> unify them into runtime abstractions
>
> This keeps Flink aligned with:
>
> existing users
> existing APIs
> production workloads
>
> ________________________________
>
> 5. Relation to flink-agents
>
> I fully agree that Apache Flink Agents is an important direction.
>
> From my understanding, flink-agents focuses on:
>
> agent orchestration
> LLM workflows
> tool / memory / reasoning loops
>
> while this proposal focuses on execution inside the Flink runtime,
> specifically:
>
> batching at the operator level
> concurrency control
> backpressure-aware scheduling
> checkpoint alignment
>
> So:
>
> flink-agents → orchestration layer
> this proposal → execution layer
>
> They are complementary rather than overlapping.
>
> ________________________________
>
> 6. Proposed scope (based on discussion)
>
> Based on feedback, I agree that the scope should be minimal.
>
> The plan is to start with:
>
> a unified InferenceOperator
> a batch + async execution model
> concurrency control
> basic timeout / retry
>
> and leave out:
>
> a full AI runtime abstraction
> a backend SPI
> advanced reliability mechanisms
>
> ________________________________
>
> 7. Next step
>
> Based on the discussion, I plan to draft a minimal FLIP focusing on
> the operator abstraction.
>
> Before doing so, I’d appreciate feedback on:
>
> whether this “BI → AI incremental path” makes sense
> whether introducing an operator-level abstraction is appropriate
> whether this scope is sufficiently minimal
>
> ________________________________
>
> 8.
Closing
>
> From my perspective, this work is about helping Flink close the gap
> between existing streaming workloads and emerging AI use cases, in a
> way that is incremental, consistent, and production-oriented.
>
> Thanks again for the discussion — it has been very helpful.
>
> For easier understanding, I summarized the evolution path and
> positioning below:
>
> Best,
> Feat Zhang
>
> On Wed, Apr 29, 2026 at 10:05 AM Dian Fu <[email protected]> wrote:
> >
> > Hi Guowei,
> >
> > Thanks for driving this FLIP. I believe it will be an important step
> > towards moving Flink from the BI stack towards the AI stack.
> >
> > Here are my thoughts on the concerns raised by Gustavo:
> >
> > > - SQL/Table: What is the plan here? How will the new multimodal types
> > > (Tensor, Image, Embedding) work in the type system, codegen, and
> > > plan/savepoint compatibility?
> >
> > I think the new multimodal types will also be introduced in the
> > DataStream API & Table API, just as the existing built-in data types
> > in Flink.
> >
> > > Is there a plan for SQL-level model inference beyond the current
> > > ML_PREDICT shape, for example, vector similarity or multimodal
> > > predicates? Today this is still very vendor-specific across the
> > > industry, so it would be nice to know if Flink wants to take a clear
> > > position here, or how this FLIP will fit with the SQL/Table vision.
> >
> > We intend to introduce more built-in, domain-specific model inference
> > functions beyond ML_PREDICT. This is reflected in the part `FLIP-XXX:
> > Built-in Multimodal Operators and AI Functions`. We plan to introduce
> > quite a few built-in functionalities around multimodal data
> > processing and domain-specific model inference functions to ease the
> > life of users. It will be available for both Python users and SQL
> > users. The detailed list will be discussed in that sub-FLIP.
> >
> > > - DataStream (v1 and v2): Will RpcOperator and the Arrow-batch
> > > primitives be exposed as first-class building blocks for Java
> > > users, or only as internal pieces behind the Python DataFrame? Many
> > > streaming inference use cases (real-time enrichment, CDC + model
> > > scoring) fit very well with DataStream and would benefit from clear
> > > guidance.
> >
> > Good question. Regarding the Arrow-batch primitives, the preliminary
> > thinking is to focus on Python jobs first, since it's very clear that
> > Python users need them. If we see clear use cases where Java users
> > would directly benefit from them, we can absolutely expose them as
> > first-class building blocks. It would be great if you could share
> > more thoughts on how Java users could use them in the above or other
> > scenarios.
> >
> > Regards,
> > Dian
> >
> > On Wed, Apr 29, 2026 at 12:04 AM David Radley <[email protected]> wrote:
> > >
> > > Hi Guowei,
> > > This is an interesting proposal. I second Robert's questions. Some
> > > thoughts.
> > >
> > > Layer 3 does not depend on layers 1 and 2, I think. At a high level
> > > I wonder: is the idea that Flink could become like an R ML pipeline
> > > or SPSS? It would be good to compare existing technology solutions
> > > and what benefits Flink will bring to these scenarios.
> > > FLIP-XXX: Supporting RpcOperator — Independently Deployed and Scaled
> > > RPC Service Operators - see Robert's comment. I assume this is a
> > > specialization of the async I/O operator for RPC. When you say
> > > deploying RPC services that are fully managed by the Flink runtime,
> > > where would these be deployed? If it is remote, how would this work?
> > > It would be interesting to see some use cases where Flink would be
> > > deploying RPC services that it has created.
> > > FLIP-XXX: Multimodal Data Type System and Object Reference Mechanism
> > >
> > > I like the idea of adding these types - the interesting part will be
> > > the deser.
> > >
> > > FLIP-XXX: A More Pythonic DataFrame API for Python Users - this
> > > makes sense.
> > > FLIP-XXX: Connector API for Multimodal Data Source/Sink - I assume
> > > this will be renamed to new multimodal formats. Are there existing
> > > registries that these could be looked up in - similar to a schema
> > > registry - so we can bring in artifacts via metadata?
> > > FLIP-XXX: Built-in Multimodal Operators and AI Functions - I wonder
> > > if we could bring in existing implementation libraries and the new
> > > work would allow us to call them from Flink, i.e. not having to do
> > > them call by call but library by library.
> > > FLIP-XXX: Columnar Data Transport and Processing Optimization - this
> > > seems a big change: events as columns rather than events as rows or
> > > CDC sequences. I assume this would not be exposed in SQL?
> > >
> > > kind regards, David.
> > >
> > > From: Robert Metzger <[email protected]>
> > > Date: Tuesday, 28 April 2026 at 07:38
> > > To: [email protected] <[email protected]>
> > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-577: AI-Native Flink — An
> > > Umbrella Proposal for Multimodal Data Processing
> > >
> > > Hey Guowei,
> > >
> > > Thanks for the proposal. I just took a brief look; here are some
> > > high-level questions:
> > >
> > > Regarding the RPC Operator: What is the difference to the async I/O
> > > operator we have already?
> > >
> > > "Connector API for Multimodal Data Source/Sink": Why do we need to
> > > touch the connector API for supporting multimodal data? Isn't this
> > > more of a formats concern?
> > >
> > > "Non-Disruptive Scaling for CPU Operators": How do you want to
> > > guarantee exactly-once with that kind of scaling? E.g. you need to
> > > somehow make a handover between the old and the new pipeline.
> > >
> > > Overall, I find the proposal has some things which seem related to
> > > making Flink more AI-native, but other changes seem orthogonal to
> > > that.
For example, the checkpoint or scaling changes are actually unrelated
> > > to AI, and just engine improvements.
> > >
> > > On Tue, Apr 28, 2026 at 5:48 AM Guowei Ma <[email protected]> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I'd like to start a discussion on an umbrella FLIP [1] that lays
> > > > out a direction for evolving Flink into a data engine that
> > > > natively supports AI workloads.
> > > >
> > > > The short version: user workloads are shifting from BI analytics
> > > > to multimodal data processing centered on model inference, and
> > > > this triggers cascading changes across the stack — multimodal data
> > > > flowing through pipelines, heterogeneous CPU/GPU resources,
> > > > vectorized execution, and inference tasks that run for seconds to
> > > > minutes on Spot instances. The proposal sketches an evolution
> > > > along five directions (development paradigm, data model,
> > > > heterogeneous resources, execution engine, fault tolerance),
> > > > decomposed into 11 sub-FLIPs organized into three layers: core
> > > > runtime primitives, AI workload expression and execution, and
> > > > production-grade operational guarantees. Most sub-FLIPs have no
> > > > hard dependencies on each other and can be advanced in parallel.
> > > >
> > > > A note on scope, since it's an umbrella:
> > > >
> > > > - In scope here: whether the evolution directions are reasonable,
> > > > whether each sub-FLIP's motivation and proposed approach are
> > > > well-founded, and whether the boundaries and dependencies between
> > > > sub-FLIPs are clear.
> > > > - Out of scope here: detailed designs, API specifics, and
> > > > implementation plans of individual sub-FLIPs — those will go
> > > > through their own FLIPs.
> > > > - Consensus criteria: agreement on the overall direction is
> > > > sufficient for the umbrella to pass; passing it does not lock in
> > > > any sub-FLIP's design — sub-FLIPs may still be adjusted, deferred,
> > > > or withdrawn as they progress.
> > > >
> > > > All proposed changes are incremental — no existing API or behavior
> > > > is removed or altered. Compatibility details are covered at the
> > > > end of the document.
> > > >
> > > > Looking forward to your feedback on the overall direction and the
> > > > layering.
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275
> > > >
> > > > Thanks,
> > > > Guowei
> > >
> > > Unless otherwise stated above:
> > >
> > > IBM United Kingdom Limited
> > > Registered in England and Wales with number 741598
> > > Registered office: Building C, IBM Hursley Office, Hursley Park
> > > Road, Winchester, Hampshire SO21 2JN
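As an illustration of the pattern the thread keeps returning to — batching plus bounded retry / timeout around an external inference backend, which FeatZhang notes is repeatedly reimplemented ad hoc — here is a minimal, self-contained sketch in plain Python. This is not Flink's API: the names (`MicroBatcher`, `infer_fn`, `on_timer`) are hypothetical, and in a real job this logic would live inside an operator such as the proposed InferenceOperator rather than a standalone class.

```python
import time

class MicroBatcher:
    """Sketch of the micro-batching + retry pattern (hypothetical, not Flink API).

    Flushes a batch when it reaches max_batch_size or when the oldest
    buffered record has waited longer than max_wait, and retries a failed
    backend call a bounded number of times before surfacing the error.
    """

    def __init__(self, infer_fn, max_batch_size=4, max_wait=0.05, max_retries=3):
        self.infer_fn = infer_fn              # backend call: list of records -> list of results
        self.max_batch_size = max_batch_size  # flush when the buffer reaches this size
        self.max_wait = max_wait              # flush when the oldest record waited this long (seconds)
        self.max_retries = max_retries        # bounded retries per batch
        self._buffer = []
        self._first_arrival = None
        self.results = []

    def add(self, record):
        if not self._buffer:
            self._first_arrival = time.monotonic()
        self._buffer.append(record)
        if len(self._buffer) >= self.max_batch_size:
            self._flush()

    def on_timer(self):
        # Called periodically; flushes a partial batch that has waited too long.
        if self._buffer and time.monotonic() - self._first_arrival >= self.max_wait:
            self._flush()

    def close(self):
        # Flush any remaining records on shutdown/checkpoint.
        if self._buffer:
            self._flush()

    def _flush(self):
        batch, self._buffer = self._buffer, []
        for attempt in range(self.max_retries):
            try:
                self.results.extend(self.infer_fn(batch))
                return
            except RuntimeError:
                if attempt == self.max_retries - 1:
                    raise  # retries exhausted: surface the failure

# Usage: a flaky backend that fails on its first call, then succeeds on retry.
calls = {"n": 0}

def flaky_double(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient backend error")
    return [x * 2 for x in batch]

b = MicroBatcher(flaky_double, max_batch_size=3)
for x in [1, 2, 3, 4]:
    b.add(x)
b.close()
print(b.results)  # -> [2, 4, 6, 8]; the first batch of three was retried once
```

The point of the sketch is only to show why a unified operator abstraction is attractive: the flush trigger, the retry bound, and the failure surface are exactly the pieces that each application currently hand-rolls around AsyncFunction.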
