Hi Guowei and Dian,

Thanks for the detailed discussion and for raising these important architectural questions.
I'd like to connect this thread with an earlier discussion I initiated:
https://lists.apache.org/thread/rn8r7pqfgncd2k3tmy9qfxs6xcslr7lf
https://docs.google.com/document/d/17PpHDjFCeL2wcp9Q7t7snyvKnW2ugBKZ-rtqo_nMKQg/edit
and provide more context from real-world AI application scenarios, as well as how my current work and the proposed design align with Flink's evolution.

________________________________

1. Starting point: AI workloads in streaming systems

From current industry usage, AI in streaming systems is not an isolated domain, but an extension of existing Flink workloads. Typical Flink applications already include:
- real-time analytics (dashboards, monitoring)
- event-driven systems (fraud detection, anomaly detection)
- data pipelines (ETL, enrichment)

In recent years, these workloads have been evolving into:
- real-time inference (model scoring / LLM / RAG)
- event-driven decision making
- continuous feature + inference pipelines

For example:
- fraud detection → model inference in stream
- monitoring → anomaly detection models
- BI dashboards → AI-assisted insights

So from a system perspective: AI workloads are not replacing Flink use cases, but extending existing BI / streaming analytics workloads into intelligent pipelines.

________________________________

2. Current gap in Flink (from an application perspective)

Today, implementing inference in Flink typically relies on:
- AsyncFunction / async I/O
- custom batching logic
- ad-hoc retry / timeout handling
- external model services (e.g., Triton)

Flink already provides building blocks for these:
- async processing
- state management
- exactly-once semantics
- integration with inference systems like Triton

However, from an application perspective, we observe that the same pattern is repeatedly reimplemented:
- batching
- concurrency control
- retry / timeout
- backend integration

And these are:
- not standardized
- not reusable
- not aligned with a unified operator abstraction

________________________________

3.
My contributions and what problems they address

In previous contributions, I have been working on multiple aspects of this problem:

Inference execution
- AsyncBatchWaitOperator
- AsyncBatchFunction (DataStream / Table API)
- retry / timeout strategies
- unified metrics
👉 solving: async inference execution, batch-based inference, reliability baseline

________________________________

Backend integration
- Triton inference integration
- HTTP connection pooling
- fallback / retry mechanisms
👉 solving: external inference system integration, production-grade inference calls

________________________________

Runtime control (related work)
- NodeHealthManager
- slot filtering and isolation
👉 solving: runtime stability and resource control

________________________________

These contributions already cover most of the required capabilities:
- async execution
- batching
- retry / timeout
- backend integration

But currently they are distributed across multiple components, without a unified abstraction.

________________________________

4. Design direction: from BI → AI (incremental evolution)

One important point I want to emphasize: this proposal is not introducing an "AI system" into Flink, but extending Flink along its existing evolution path.

Flink has historically evolved as: ETL → Streaming Analytics → Real-time BI → (now) Intelligent Streaming

So the direction here is:
- start from mature BI / SQL / streaming workloads
- incrementally introduce inference capabilities
- unify them into runtime abstractions

This keeps Flink aligned with:
- existing users
- existing APIs
- production workloads

________________________________

5. Relation to flink-agents

I fully agree that Apache Flink Agents is an important direction.
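To make the execution-layer pattern concrete: below is a minimal, self-contained Python sketch of the batching + timeout/retry loop that applications currently hand-roll on top of async I/O, as described above. This is not Flink code and not a proposed API; all names (`MicroBatcher`, `fake_model`, the parameters) are illustrative, and a real operator would additionally flush on timers and at checkpoints.

```python
import asyncio

# Illustrative sketch only (hypothetical names, not a Flink API): the
# micro-batching + timeout/retry loop that streaming inference jobs
# currently hand-roll on top of async I/O.

class MicroBatcher:
    """Buffers records and sends them to an async inference call as a
    batch once `max_batch_size` is reached. A real operator would also
    flush on a timer and at checkpoints; that is omitted here."""

    def __init__(self, infer_fn, max_batch_size=4, max_retries=2, timeout_s=1.0):
        self.infer_fn = infer_fn          # async fn: list[input] -> list[output]
        self.max_batch_size = max_batch_size
        self.max_retries = max_retries    # retries after the first attempt
        self.timeout_s = timeout_s        # per-attempt timeout
        self._buffer = []

    async def add(self, record):
        """Buffer one record; returns inference results when a batch flushes."""
        self._buffer.append(record)
        if len(self._buffer) >= self.max_batch_size:
            return await self.flush()
        return []

    async def flush(self):
        """Send the current buffer, retrying on timeout / connection errors."""
        if not self._buffer:
            return []
        batch, self._buffer = self._buffer, []
        last_err = None
        for _ in range(self.max_retries + 1):
            try:
                return await asyncio.wait_for(self.infer_fn(batch),
                                              timeout=self.timeout_s)
            except (asyncio.TimeoutError, ConnectionError) as err:
                last_err = err
        raise last_err  # retries exhausted: surface the failure to the caller

async def fake_model(batch):
    # Stand-in for an external model service call (e.g. a Triton endpoint).
    return [x * 10 for x in batch]

async def main():
    batcher = MicroBatcher(fake_model, max_batch_size=3)
    results = []
    for record in [1, 2, 3, 4]:
        results += await batcher.add(record)
    results += await batcher.flush()  # drain the tail
    return results

print(asyncio.run(main()))  # [10, 20, 30, 40]
```

The point is not these ~50 lines themselves, but that every team ends up writing a slightly different version of them; a unified operator abstraction is meant to absorb exactly this logic.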
From my understanding:

flink-agents focuses on:
- agent orchestration
- LLM workflows
- tool / memory / reasoning loops

this proposal focuses on:
- execution inside the Flink runtime

Specifically:
- batching at the operator level
- concurrency control
- backpressure-aware scheduling
- checkpoint alignment

So: flink-agents → orchestration layer; this proposal → execution layer. They are complementary rather than overlapping.

________________________________

6. Proposed scope (based on discussion)

Based on feedback, I agree that the scope should be minimal. The plan is to start with:
- a unified InferenceOperator
- a batch + async execution model
- concurrency control
- basic timeout / retry

And leave out:
- a full AI runtime abstraction
- a backend SPI
- advanced reliability mechanisms

________________________________

7. Next step

Based on the discussion, I plan to draft a minimal FLIP focusing on the operator abstraction. Before doing so, I'd appreciate feedback on:
- whether this "BI → AI incremental path" makes sense
- whether introducing an operator-level abstraction is appropriate
- whether this scope is sufficiently minimal

________________________________

8. Closing

From my perspective, this work is about helping Flink close the gap between existing streaming workloads and emerging AI use cases, in a way that is incremental, consistent, and production-oriented.

Thanks again for the discussion — it has been very helpful. For easier understanding, I summarized the evolution path and positioning below:

Best,
Feat Zhang

On Wed, Apr 29, 2026 at 10:05 AM Dian Fu <[email protected]> wrote:
>
> Hi Guowei,
>
> Thanks for driving this FLIP. I believe it will be an important step towards moving Flink from the BI stack towards the AI stack.
>
> Here are my thoughts on the concerns raised by Gustavo:
>
> > - SQL/Table: What is the plan here? How will the new multimodal types (Tensor, Image, Embedding) work in the type system, codegen, and plan/savepoint compatibility?
>
> I think the new multimodal types will also be introduced in the DataStream API & Table API, just like the existing built-in data types in Flink.
>
> > Is there a plan for SQL-level model inference beyond the current ML_PREDICT shape, for example, vector similarity or multimodal predicates? Today this is still very vendor-specific across the industry, so it would be nice to know if Flink wants to take a clear position here, or how this FLIP will fit with the SQL/Table vision.
>
> We intend to introduce more built-in, domain-specific model inference functions beyond ML_PREDICT. This is reflected in the part `FLIP-XXX: Built-in Multimodal Operators and AI Functions`. We plan to introduce quite a few built-in functionalities around multimodal data processing and domain-specific model inference functions to ease the life of users. They will be available for both Python users and SQL users. The detailed list will be discussed in that sub-FLIP.
>
> > - DataStream (v1 and v2): Will RpcOperator and the Arrow-batch primitives be exposed as first-class building blocks for Java users, or only as internal pieces behind the Python DataFrame? Many streaming inference use cases (real-time enrichment, CDC + model scoring) fit very well with DataStream and would benefit from clear guidance.
>
> Good question. Regarding the Arrow-batch primitives, the preliminary thinking is to focus on Python jobs first, since it's very clear that Python users need them. If we see clear use cases where Java users would directly benefit from them, we can absolutely expose them as first-class building blocks. It would be great if you could share more thoughts on how Java users could use them in the above or other scenarios.
>
> Regards,
> Dian
>
>
> On Wed, Apr 29, 2026 at 12:04 AM David Radley <[email protected]> wrote:
> >
> > Hi Guowei,
> > This is an interesting proposal. I second Robert's questions. Some thoughts.
> >
> > Layer 3 does not depend on layers 1 and 2, I think. At a high level I wonder: is the idea that Flink could become like an R ML pipeline or SPSS? It would be good to compare existing technology solutions and what benefits Flink will bring to these scenarios.
> >
> > FLIP-XXX: Supporting RpcOperator — Independently Deployed and Scaled RPC Service Operators - see Robert's comment. I assume this is a specialization of the async I/O operator for RPC. When you say deploying RPC services that are fully managed by the Flink runtime, where would these be deployed? If it is remote, how would this work? It would be interesting to see some use cases where Flink would be deploying RPC services that it has created.
> >
> > FLIP-XXX: Multimodal Data Type System and Object Reference Mechanism
> >
> > I like the idea of adding these types - the interesting part will be the deser.
> >
> > FLIP-XXX: A More Pythonic DataFrame API for Python Users - this makes sense.
> >
> > FLIP-XXX: Connector API for Multimodal Data Source/Sink - I assume this will be renamed to new multimodal formats. Are there existing registries that these could be looked up in - similar to a schema registry - so we can bring in artifacts via metadata?
> >
> > FLIP-XXX: Built-in Multimodal Operators and AI Functions - I wonder if we could bring in existing implementation libraries, and the new work would allow us to call them from Flink, i.e. not having to do them call by call but library by library.
> >
> > FLIP-XXX: Columnar Data Transport and Processing Optimization - this seems a big change: events as columns rather than events as rows or CDC sequences. I assume this would not be exposed in SQL?
> >
> > kind regards, David.
> >
> > From: Robert Metzger <[email protected]>
> > Date: Tuesday, 28 April 2026 at 07:38
> > To: [email protected] <[email protected]>
> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-577: AI-Native Flink — An Umbrella Proposal for Multimodal Data Processing
> >
> > Hey Guowei,
> >
> > Thanks for the proposal. I just took a brief look; here are some high-level questions:
> >
> > Regarding the RPC Operator: What is the difference to the async I/O operator we have already?
> >
> > "Connector API for Multimodal Data Source/Sink": Why do we need to touch the connector API for supporting multimodal data? Isn't this more of a formats concern?
> >
> > "Non-Disruptive Scaling for CPU Operators": How do you want to guarantee exactly-once with that kind of scaling? E.g. you need to somehow make a handover between the old and new pipeline.
> >
> > Overall, I find the proposal has some things which seem related to making Flink more AI-native, but other changes seem orthogonal to that. For example, the checkpoint or scaling changes are actually unrelated to AI, and are just engine improvements.
> >
> >
> > On Tue, Apr 28, 2026 at 5:48 AM Guowei Ma <[email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > I'd like to start a discussion on an umbrella FLIP[1] that lays out a direction for evolving Flink into a data engine that natively supports AI workloads.
> > >
> > > The short version: user workloads are shifting from BI analytics to multimodal data processing centered on model inference, and this triggers cascading changes across the stack — multimodal data flowing through pipelines, heterogeneous CPU/GPU resources, vectorized execution, and inference tasks that run for seconds to minutes on Spot instances.
> > > The proposal sketches an evolution along five directions (development paradigm, data model, heterogeneous resources, execution engine, fault tolerance), decomposed into 11 sub-FLIPs organized into three layers: core runtime primitives, AI workload expression and execution, and production-grade operational guarantees. Most sub-FLIPs have no hard dependencies on each other and can be advanced in parallel.
> > >
> > > A note on scope, since it's an umbrella:
> > >
> > > - In scope here: whether the evolution directions are reasonable, whether each sub-FLIP's motivation and proposed approach are well-founded, and whether the boundaries and dependencies between sub-FLIPs are clear.
> > > - Out of scope here: detailed designs, API specifics, and implementation plans of individual sub-FLIPs — those will go through their own FLIPs.
> > > - Consensus criteria: agreement on the overall direction is sufficient for the umbrella to pass; passing it does not lock in any sub-FLIP's design — sub-FLIPs may still be adjusted, deferred, or withdrawn as they progress.
> > >
> > > All proposed changes are incremental — no existing API or behavior is removed or altered. Compatibility details are covered at the end of the document.
> > >
> > > Looking forward to your feedback on the overall direction and the layering.
> > >
> > > [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275
> > >
> > > Thanks,
> > > Guowei
> >
> > Unless otherwise stated above:
> >
> > IBM United Kingdom Limited
> > Registered in England and Wales with number 741598
> > Registered office: Building C, IBM Hursley Office, Hursley Park Road, Winchester, Hampshire SO21 2JN
