Hi all,

Thank you Guowei Ma for driving this discussion, and thanks everyone for the valuable insights. Inspired by this exchange, I'd like to share a few thoughts.

While "AI-Native" covers broad ground, I believe this FLIP does not overextend Flink's scope. It's a necessary iteration driven by evolving user scenarios and AI advancements, particularly multimodal processing. Given the growing adoption of multimodal applications and increasing interest in low-latency inference, initiating these enhancements is a timely step to better align Flink with evolving AI workloads.

From our engagements with customers and developers, we observe a clear shift in both workloads and user expectations. Model inference is increasingly central to data pipelines, with multimodal AI tasks growing rapidly. Traditional real-time scenarios (e.g., monitoring and analytics) now leverage models and agent frameworks like Flink Agent for intelligent, multi-turn decision-making, while large-scale offline compute is also shifting toward LLMs and vision models. Alongside this workload evolution, developer workflows have adapted: AI practitioners naturally prefer Python and DataFrame-style APIs. As AI-assisted coding matures, aligning system interfaces with these familiar patterns will directly improve AI-generated code quality and significantly lower adoption barriers for the AI community.
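To make the point concrete: PyFlink already supports vectorized (pandas) UDFs in the Table API, which is exactly the batch-oriented, DataFrame-friendly style AI practitioners expect. A minimal, untested sketch, where 'samples' is a placeholder table assumed to be registered elsewhere:

    from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
    from pyflink.table.udf import udf

    # Vectorized UDF: receives whole pandas.Series batches, not one row at a time.
    @udf(result_type=DataTypes.FLOAT(), func_type="pandas")
    def scaled_score(score, weight):
        return score * weight  # element-wise over the batch

    t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
    t_env.create_temporary_function("scaled_score", scaled_score)
    # 'samples' is a placeholder table registered elsewhere
    result = t_env.sql_query("SELECT id, scaled_score(score, weight) FROM samples")

Building on this style, rather than away from it, is what will lower the barrier for the AI community.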
Today, many AI evaluation tools don't yet recommend Flink for AI workloads, largely due to limited visibility of Flink's relevant capabilities rather than fundamental incompatibility. In reality, Flink has unique strengths here. For example, generating multimodal samples is often a multi-day, GPU-heavy process. Flink's streaming model, combined with checkpointing and reduced disk I/O, is well suited to such long-running tasks, a direction also pursued by engines like Daft and Ray Data. With Flink's proven production stability, we're well positioned for both batch and future real-time multimodal streaming inference. Targeted improvements can make these advantages visible, driving better user experiences and healthier ecosystem growth.

I'd also note a lesson from FlinkML. It attempted to cover model training but struggled to align with the fast-iteration, Python/notebook-centric workflows preferred by ML researchers. Flink's core strength lies in high-concurrency, production-grade inference orchestration, not training lifecycle management (e.g., experiment tracking, versioning). This mismatch limited its adoption.

This proposal, however, takes a different path. It doesn't aim to replace training frameworks. Instead, it introduces modern AI concepts (multimodal data, LLMs) as first-class citizens for inference, built atop Flink's computation strengths. Think Ray Data's scope (plus simple co-located serving), not Train/Tune. Crucially, unlike the FlinkML era, today's models use standardized interfaces and mature serving frameworks, allowing Flink to integrate external models seamlessly without heavy customization, which significantly lowers project risk.

This FLIP marks another starting point for Flink in the AI era. While details need refinement, I believe this direction aligns with both current and future user needs and Flink's evolution.

Best,
Gen

On Mon, May 4, 2026 at 12:15 PM Jark Wu <[email protected]> wrote:

> Hi Guowei,
>
> Thanks for driving this. +1 on the overall direction. Flink's streaming processing and checkpoint mechanism give it a structural advantage over systems like Daft and Ray. But today, these runtime strengths are held back by gaps in the Python API, GPU scheduling, and native multimodal data handling.
> This umbrella FLIP addresses exactly that gap, comprehensively and systematically. I believe multimodal data processing is the biggest opportunity for traditional data infra to transition into AI infra, and this is one of the most important FLIPs for Flink in the AI era.
>
> As one of the Table/SQL module maintainers, I would like to contribute the built-in multimodal processing UDFs (audio, video, image, text) and native multimodal data types (Tensor, Image, Embedding, etc.) as first-class citizens in the type system. Looking forward to the sub-FLIP discussions.
>
> Best,
> Jark
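For a sense of what such first-class types and functions could look like, here is a purely hypothetical sketch; the IMAGE type and the IMAGE_DECODE / EMBED_IMAGE functions below do not exist in Flink today and are only placeholders for the sub-FLIP discussion:

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Hypothetical: IMAGE as a native column type plus built-in multimodal UDFs.
    # None of these names are decided; they only illustrate the intended shape.
    query = """
        SELECT uri,
               EMBED_IMAGE(IMAGE_DECODE(raw_bytes)) AS embedding
        FROM raw_frames
    """
    # result = t_env.sql_query(query)  # would require the hypothetical types/UDFs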
>
> On Thu, 30 Apr 2026 at 18:42, Guowei Ma <[email protected]> wrote:
> >
> > Hi Yaroslav,
> >
> > Thanks for taking the time to write this detailed feedback. Let me clarify the intent of the proposal first.
> >
> > I am not saying that Flink should become an AI framework, an ML platform, or a model serving system. The way I use "AI-Native" in this proposal is to say that Flink should support, as first-class citizens, the core objects and execution patterns that frequently show up in AI-oriented data processing, instead of leaving them entirely to external systems or ad hoc user-defined integrations.
> >
> > These objects and execution patterns include:
> >
> > - Multimodal and unstructured data objects such as images, video, audio, tensors, embeddings, and object references.
> > - Model inference as part of the data flow, rather than an entirely external black-box service call.
> > - Operators backed by heterogeneous resources such as GPUs.
> > - Pythonic and vectorized processing styles.
> > - Long-running, long-tailed asynchronous computation.
> >
> > "AI-Native" is just a shorthand here, meaning that Flink should natively understand and support the core abstractions of this class of workloads. The FLIP needs to make the target workload class clearer. What we care about is not any specific model paradigm (LLM, CV, recommendation, or traditional ML inference) but a class of data processing workloads with shared runtime and topology characteristics:
> >
> > - A single computation may take seconds or even minutes, instead of microseconds as in traditional row-at-a-time processing.
> > - Execution often involves heterogeneous resources such as CPU + GPU, where GPUs are expensive and scarce.
> > - Data is often multimodal large objects (images, video, audio, tensors, embeddings), rather than structured small records.
> > - Computation logic often includes model inference or service-style invocations as part of the pipeline.
> > - Many target topologies are relatively shuffle-light and don't necessarily involve complex keyed-state migration, e.g. URI → preprocessing → inference → sink.
> >
> > Ten years ago, many ML workloads took the form of offline training plus online feature serving. Flink already played a strong role in feature engineering, streaming feature computation, and real-time data preparation, so there was no strong need to reshape Flink into an "ML-Native" engine.
> >
> > What is changing today is that model inference itself is increasingly becoming part of the data processing pipeline; multimodal objects are no longer just opaque blobs in external storage, but data objects that need to be referenced, passed, transformed, inferred over, and landed inside the engine. This is not simply one more ML use case; it is a change in the shape of workloads Flink needs to support.
> >
> > On whether the user demand is real, the validation signals we are currently seeing include:
> >
> > - Within Alibaba, multimodal data processing is already in production, covering image, video, audio, and text modalities.
> > - In offline conversations with several companies (including ByteDance and Tencent), we have heard substantial demand for Flink to support AI data processing / multimodal data processing.
> > - On the ecosystem side, we are working with NVIDIA on a joint demo focused on multimodal data processing, planned for Flink Forward Asia.
> > - The emergence and growth of systems such as Daft, Ray Data, Data-Juicer, and LAS also reflect rapidly growing demand for multimodal data processing.
> > - There have also been independent discussions in this direction within the community, for example the "Streaming-native AI Inference Runtime Layer" proposal on the dev list.
> >
> > On "why now, instead of waiting for standardization": I understand the concern. LLM-related frameworks, APIs, and application-level patterns are indeed changing quickly. If this FLIP were trying to bake a specific LLM API, agent framework, or prompt protocol into Flink, the risk would be high.
> >
> > But most of the capabilities in this proposal are not LLM-specific. They are more fundamental data processing and runtime capabilities: Pipeline Region-level checkpointing, Object Reference, GPU resource declaration, columnar data transfer, service-style operator invocation, long-running async execution. These are useful for today's LLM workloads, and equally useful for future AI workloads in shapes we cannot fully predict yet. The fast-changing parts should live in the ecosystem and SDK layer; the FLIP should focus on more stable engine-level capabilities.
> >
> > On tactical changes vs. the umbrella, I partly agree with you. Each sub-FLIP should be discussed, reviewed, and accepted or rejected on its own merits. The umbrella should not bypass the normal FLIP process, and accepting the umbrella does not mean accepting all sub-FLIPs. That said, I still think the umbrella is valuable. Its purpose is not to bind the 11 changes into a single inseparable package, but to help the community align on principles, clarify boundaries and dependencies, and avoid conflicting or duplicated abstractions across related capabilities.
> >
> > For example, if RpcOperator is not considered together with non-disruptive scaling, it is hard to give GPU operator elasticity coherent semantics. Deploying inference services independently is only the first step; the harder question is how Flink uniformly handles service discovery, in-flight request draining, backpressure, and failover during scaling. Without an umbrella, these capabilities can certainly be advanced as tactical changes, but we may end up with a set of abstractions that are locally usable but globally inconsistent.
> >
> > On RpcOperator, I agree that we need to be very careful in defining the boundary between the Flink runtime and external orchestration systems. Kubernetes or the Kubernetes Operator may well be the right choice at the physical deployment level.
> > But I still believe Flink needs a first-class RpcOperator abstraction, because deployment is only part of the problem; the harder part is its semantic integration with the Flink job.
> >
> > If model inference is part of the logical data flow, Flink needs at minimum to be aware of its service discovery, backpressure behavior, failover behavior, in-flight request draining, and scaling coordination. If it is hidden entirely behind an external black-box service, it is hard for Flink to provide consistent job-level semantics and operational experience.
> >
> > So the point of RpcOperator is not necessarily that "every physical process must be directly launched and managed by Flink core," but that Flink needs to define a service-style operator contract that allows such operators to be invoked correctly by the data flow, coordinated correctly by the runtime, and understood and operated by users as part of a Flink job.
> >
> > On vectorized batch processing, I agree the long-term direction should not stop at Python. Native columnar / vectorized execution is an end-to-end problem that touches connectors, formats, the type system, runtime, Java, SQL, and Python. The current proposal starts from the Java/Python boundary because that is where the row/column conversion overhead is most visible. End-to-end columnar execution on the Java and SQL side deserves to be discussed further as a separate, larger FLIP.
> >
> > On multimodal types and SerDes complexity, I agree this needs to be handled carefully. Making AI-related objects first-class does not imply that every connector must immediately and fully support image, video, audio, tensor, and so on. The concrete incremental path, fallback strategy, and the boundary between formats, connector API, and the type system will be discussed further in the corresponding sub-FLIPs.
> >
> > Coming back to the core of the proposal: it is not about turning Flink into an AI framework. It is about making the core objects and execution patterns of AI-oriented data processing first-class citizens in Flink.
> >
> > Best,
> > Guowei
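To make the "service-style operator contract" idea concrete, here is a rough, purely hypothetical Python sketch; no such interface exists in Flink, and the names only illustrate the responsibilities listed above (discovery, backpressure, draining, failover):

    from abc import ABC, abstractmethod

    class ServiceStyleOperator(ABC):
        """Hypothetical contract a runtime could hold an RpcOperator to."""

        @abstractmethod
        def discover(self) -> list[str]:
            """Return the currently available service endpoints."""

        @abstractmethod
        def invoke(self, batch):
            """Send a batch of records; implementations must respect backpressure."""

        @abstractmethod
        def drain(self, timeout_ms: int) -> bool:
            """Finish in-flight requests before rescaling or failover."""

        @abstractmethod
        def on_endpoint_failure(self, endpoint: str) -> None:
            """Hook so the runtime can coordinate failover with the job."""

Whether such a contract lives in the runtime or in a separate layer is exactly the boundary question raised in this thread.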
> >
> > On Thu, Apr 30, 2026 at 5:37 AM Yaroslav Tkachenko <[email protected]> wrote:
> > >
> > > Hi Guowei,
> > >
> > > Thank you for writing this proposal.
> > >
> > > I may be in the minority here, but I hope my voice will be heard. I disagree with turning Flink into an "AI-Native" engine.
> > >
> > > Regarding your "Data processing is entering the AI era, and Flink needs to evolve from a traditional BI compute engine into a data engine that natively supports AI workloads" claim:
> > >
> > > - How exactly do you define "AI"? I don't believe there is a standard definition. For example, Machine Learning has been around for more than a decade, but there were no proposals (or need, in my opinion) to turn Flink into an "ML-Native" engine. Flink, in its current state, has been successfully used in many systems alongside dedicated ML technologies, like feature stores. Based on the context of your proposal, it looks like you mostly mean LLMs, so could you be specific about the language?
> > > - I wouldn't call Flink "a traditional BI compute engine". Flink is a general data processing technology which can be used for a variety of use cases without any BI involvement.
> > > - Do you have any proof that "Users' core workloads are rapidly evolving" and that they require your proposed changes? Case studies, user surveys, or submitted issues about the lack of support? Big changes like that require extensive validation.
> > > - And even if there is a real need to adopt some LLM-driven changes, why now? The LLM-related tooling has been changing so rapidly, and it's hard to predict what will be needed tomorrow. Why does it make sense to introduce changes now, and not wait for more standardization and consolidation?
> > >
> > > To summarize, I think there are a lot of great ideas in the proposal, but in my mind, they need to be addressed as tactical, focused changes, not under the "AI-Native" umbrella.
> > >
> > > I also wanted to address a few more specific points:
> > >
> > > - RpcOperator: why does it need to be managed by Flink? I see absolutely no need to introduce the additional complexity of orchestrating standalone components into the core Flink engine. I can imagine a separate sub-project for an RpcOperator, which could potentially be managed by the Kubernetes Operator.
> > > - You make the case for vectorized batch processing, but only on the Python side. Why stop there? Native columnar vectorized execution will require end-to-end changes, including connectors, data format support, type system support, runtime changes, etc. It seems logical to me to support this execution mode for Java and SQL as well.
> > > - Supporting many more data types natively (images, video, audio, tensors) will make connector serializers and deserializers (SerDes) much more challenging to implement. Even today, many SerDes in officially supported connectors don't fully implement types like arrays and structs.
> > >
> > > Thank you.
> > >
> > > On Wed, Apr 29, 2026 at 1:18 AM Guowei Ma <[email protected]> wrote:
> > > >
> > > > Hi Z,
> > > >
> > > > Thanks for the kind words and the thoughtful questions. Let me take them one by one.
> > > >
> > > > 1. Throughput and latency targets
> > > >
> > > > To be honest, I don't have concrete numbers to share yet. What I can say is that our internal testing has already surfaced several directions where Flink can be improved, and at the same time we want to fully leverage Flink's existing streaming shuffle capabilities. As the multimodal operator library matures, we'll progressively publish benchmark results.
> > > >
> > > > 2. Built-in operators
> > > >
> > > > You're absolutely right. From what I've seen, our internal users already rely on a fairly large set of multimodal operators, potentially 100+. The exact set the community should provide is best discussed in FLIP-XXX: Built-in Multimodal Operators and AI Functions, and contributions from the community are very welcome there.
> > > >
> > > > 3. Plan for the 11 sub-FLIPs
> > > >
> > > > The sequencing follows the layering in the umbrella:
> > > >
> > > > - Layer 1 (Core Primitives) should be discussed and aligned first, since the second and third layers build on it.
> > > > - Layer 2 (API + compilation + single-node execution) starts with getting the API discussion right (the Python API, how UDFs declare resources, etc.), after which the single-node execution work can build on top.
> > > > - Layer 3 (distributed scheduling and checkpointing) can largely proceed independently in parallel.
> > > >
> > > > So while each sub-FLIP is indeed a substantial piece of work, most of them can be advanced in parallel by different contributors once the Layer 1 primitives are settled.
> > > >
> > > > 4. GPU scheduling roadmap
> > > >
> > > > Could you expand a bit on which aspect of GPU scheduling you have in mind as the complex one? "GPU scheduling" covers a fairly wide surface area (resource declaration, operator-level deployment, elastic scaling, heterogeneous GPU types, fine-grained partitioning, etc.), and the answer differs quite a bit depending on which dimension we're discussing. Once I understand your specific concern I can give a more useful response.
> > > >
> > > > Thanks again for the support; looking forward to the continued discussion.
> > > >
> > > > Best,
> > > > Guowei
> > > >
> > > > On Tue, Apr 28, 2026 at 4:34 PM zl z <[email protected]> wrote:
> > > > >
> > > > > Hey Guowei,
> > > > >
> > > > > Thanks for the proposal, and I think this is very valuable. I have some questions about it:
> > > > >
> > > > > 1. What are our expected throughput and latency targets? Do we have any forward-looking tests for this?
> > > > >
> > > > > 2. AI involves a very large number of operators. Besides allowing users to use them through UDFs, will we also provide commonly used built-in operators?
> > > > >
> > > > > 3. Each of the 11 sub-FLIPs is a major project involving a significant amount of changes. What is our plan for this?
> > > > >
> > > > > 4. GPU scheduling is extremely complex. What is our current roadmap for this?
> > > > >
> > > > > This is a very high-quality and exciting proposal. Making Flink an AI-native data processing engine will make it far more valuable in the AI era. Looking forward to seeing it land and come to fruition soon.
> > > > >
> > > > > Robert Metzger <[email protected]> wrote on Tue, Apr 28, 2026 at 14:38:
> > > > >
> > > > > > Hey Guowei,
> > > > > >
> > > > > > Thanks for the proposal. I just took a brief look; here are some high-level questions:
> > > > > >
> > > > > > Regarding the RPC Operator: what is the difference to the async I/O operator we already have?
> > > > > >
> > > > > > "Connector API for Multimodal Data Source/Sink": Why do we need to touch the connector API for supporting multimodal data? Isn't this more of a formats concern?
> > > > > >
> > > > > > "Non-Disruptive Scaling for CPU Operators": How do you want to guarantee exactly-once with that kind of scaling? E.g., you need to somehow make a handover between the old and the new pipeline.
> > > > > >
> > > > > > Overall, I find the proposal has some things which seem related to making Flink more AI-native, but other changes seem orthogonal to that. For example, the checkpoint or scaling changes are actually unrelated to AI and are just engine improvements.
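On the async I/O question above: Flink's existing async I/O (the Java AsyncFunction / AsyncDataStream API) addresses per-record request latency, roughly the pattern sketched below in plain Python, whereas the proposed RpcOperator is about the lifecycle of the service being called (discovery, draining, scaling). The sketch uses only the standard library; call_model is a stand-in for a remote inference request:

    from concurrent.futures import ThreadPoolExecutor

    def call_model(record):
        # stand-in for a remote inference call that may take seconds
        return {"id": record["id"], "label": "cat"}

    records = [{"id": i} for i in range(100)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        # keep many requests in flight at once instead of blocking per record
        results = list(pool.map(call_model, records))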
> > > > > >
> > > > > > On Tue, Apr 28, 2026 at 5:48 AM Guowei Ma <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I'd like to start a discussion on an umbrella FLIP [1] that lays out a direction for evolving Flink into a data engine that natively supports AI workloads.
> > > > > > >
> > > > > > > The short version: user workloads are shifting from BI analytics to multimodal data processing centered on model inference, and this triggers cascading changes across the stack: multimodal data flowing through pipelines, heterogeneous CPU/GPU resources, vectorized execution, and inference tasks that run for seconds to minutes on Spot instances. The proposal sketches an evolution along five directions (development paradigm, data model, heterogeneous resources, execution engine, fault tolerance), decomposed into 11 sub-FLIPs organized into three layers: core runtime primitives, AI workload expression and execution, and production-grade operational guarantees. Most sub-FLIPs have no hard dependencies on each other and can be advanced in parallel.
> > > > > > >
> > > > > > > A note on scope, since it's an umbrella:
> > > > > > >
> > > > > > > - In scope here: whether the evolution directions are reasonable, whether each sub-FLIP's motivation and proposed approach are well-founded, and whether the boundaries and dependencies between sub-FLIPs are clear.
> > > > > > > - Out of scope here: detailed designs, API specifics, and implementation plans of individual sub-FLIPs; those will go through their own FLIPs.
> > > > > > > - Consensus criteria: agreement on the overall direction is sufficient for the umbrella to pass; passing it does not lock in any sub-FLIP's design, and sub-FLIPs may still be adjusted, deferred, or withdrawn as they progress.
> > > > > > >
> > > > > > > All proposed changes are incremental: no existing API or behavior is removed or altered. Compatibility details are covered at the end of the document.
> > > > > > >
> > > > > > > Looking forward to your feedback on the overall direction and the layering.
> > > > > > >
> > > > > > > [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Guowei
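To ground the workload shape the umbrella describes, here is a minimal, illustrative PyFlink sketch of a URI → preprocessing → inference → sink pipeline; fake_infer is a placeholder for a real model call, and nothing here uses any API proposed by the FLIP:

    from pyflink.common.typeinfo import Types
    from pyflink.datastream import StreamExecutionEnvironment

    def preprocess(uri):
        # downloading/decoding would happen here; we just tag the record
        return uri + "#decoded"

    def fake_infer(frame):
        # placeholder for a seconds-long (possibly GPU-backed) inference call
        return frame + "#label=cat"

    env = StreamExecutionEnvironment.get_execution_environment()
    uris = env.from_collection(
        ["s3://bucket/a.jpg", "s3://bucket/b.jpg"], type_info=Types.STRING()
    )
    uris.map(preprocess, output_type=Types.STRING()) \
        .map(fake_infer, output_type=Types.STRING()) \
        .print()
    env.execute("multimodal-pipeline-shape")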
