Hi devs,

Thanks Guowei for driving this proposal. +1 on the overall direction. I want
to push back gently on the "is there real demand?" concern by walking through
one very specific scenario where we (and every team I've talked to) repeatedly
hit the same wall with today's Flink, and where the primitives in this FLIP
line up remarkably well with what's actually needed.

1. The scenario: real-time AI inference pipelines

A typical shape we run in production:

    Kafka / CDC / object storage
      -> preprocess (decode, resize, tokenize)
      -> model inference
      -> sink (Paimon / Iceberg / Kafka / KV store)

The inference step has two common deployments, and both matter:

  (a) Remote serving — call out to Triton / vLLM / an internal LLM
      gateway over gRPC/HTTP.
  (b) Local GPU — load the model inside a UDF, run on a GPU attached
      to the TM (vision models, small rerankers, embedding models).

Volume is typically 1k–100k QPS per pipeline, each request takes
10ms–2s (orders of magnitude longer than processing a normal Flink
record), and payloads are 10KB–10MB (images, embeddings, token
tensors).

2. The core problem today: "high utilization without blowing up the backend"

This is a single, concrete problem that every team I know solves
badly today:

  - If concurrency is set too low → GPUs / serving instances sit
    idle, throughput misses SLA, cost per event is high.
  - If concurrency is set too high → the backend tips over: P99
    latency collapses, GPU OOMs, and (for shared gateways) you take
    down other teams' jobs with you.

The gap between "underutilized" and "blown up" is often only 2–3x
in concurrency, and it shifts with model version, batch size, prompt
length, and input payload size. Static tuning cannot sit in that
window.

With today's Flink (AsyncFunction + external serving) we run into
specific structural limits:

  - Concurrency is a static config. It does not adapt to backend
     pressure. Back-pressure inside Flink does not cross the RPC
     boundary — when the gateway degrades, Flink keeps pushing.

  - There is no first-class micro-batching. Every team re-implements
     "gather up to N requests or wait up to T ms" inside their UDF,
     often incorrectly (timers, watermarks, shutdown races). Inference
     backends like vLLM need this to get any real GPU utilization.

  - No unified timeout / retry / circuit-breaker semantics. Each UDF
     reinvents them, usually without proper budgeting or jitter, so
     retries amplify load exactly when the backend is already hot
     (see the sketch after this list).

  - Resource declaration for the local-GPU path is missing. Today we
     rely on the external-resource mechanism + slot-sharing hacks,
     which in practice can't express "this operator needs 0.5 GPU
     and 8GiB of VRAM", so we either over-provision (one operator per
     GPU) or let collisions cause OOM crashes.

  - Job-level elasticity does not exist. When QPS doubles at peak
     hours, you either permanently over-provision GPUs (expensive) or
     restart the job with a new parallelism (downtime). Neither fits
     an inference SLA.

  - Failure blast radius is too large. A single GPU OOM in one
     subtask fails the region → cancels the global checkpoint → the
     whole long-running inference job rolls back. For a job where
     every record costs seconds of GPU time, this is catastrophic.
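
To make the bespoke-code point concrete, here is a minimal sketch of
the retry/timeout logic every team ends up hand-rolling on top of
today's async I/O API. InferenceClient is a hypothetical stand-in for
a team's gRPC/HTTP gateway client; AsyncFunction, ResultFuture, and
AsyncDataStream are the existing Flink APIs. Micro-batching ("up to N
requests or T ms") and circuit breaking would still have to be layered
on top of this by hand.

    import java.util.Collections;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeoutException;

    import org.apache.flink.streaming.api.functions.async.AsyncFunction;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;

    public class BespokeInferenceFunction implements AsyncFunction<String, String> {

        /** Hypothetical async client for the serving gateway. */
        public interface InferenceClient extends java.io.Serializable {
            CompletableFuture<String> infer(String payload);
        }

        private static final int MAX_RETRIES = 3;
        private final InferenceClient client;

        public BespokeInferenceFunction(InferenceClient client) {
            this.client = client;
        }

        @Override
        public void asyncInvoke(String payload, ResultFuture<String> out) {
            callWithRetry(payload, 0, out);
        }

        private void callWithRetry(String payload, int attempt, ResultFuture<String> out) {
            client.infer(payload).whenComplete((result, error) -> {
                if (error == null) {
                    out.complete(Collections.singleton(result));
                } else if (attempt < MAX_RETRIES) {
                    // Hand-rolled retry: no backoff or jitter here (real code
                    // would schedule one on a timer), and no retry budget shared
                    // across subtasks, so retries pile on when the backend is
                    // already hot.
                    callWithRetry(payload, attempt + 1, out);
                } else {
                    out.completeExceptionally(error);
                }
            });
        }

        @Override
        public void timeout(String payload, ResultFuture<String> out) {
            // Per-request timeout handling is also reinvented per team.
            out.completeExceptionally(new TimeoutException("inference timed out"));
        }
    }

The function is then wired in with AsyncDataStream.unorderedWait(...),
whose capacity argument is exactly the static concurrency knob from the
first bullet above: it caps in-flight requests but never adapts to
backend pressure.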

3. Why this FLIP's primitives, taken together, actually solve it

I want to emphasize *taken together*, because several reviewers have
asked why individual items are AI-specific. Individually some aren't.
Together, they compose into exactly the control loop this scenario
needs:

  - RpcOperator + AsyncOperator as a client/server pair (thanks to
    Zhu Zhu's clarification): brings the serving tier *inside* the
    job's failure domain. Back-pressure, health signals, and scaling
    can now be coordinated end-to-end — which a standalone K8s
    operator fundamentally cannot do, because it has no visibility
    into Flink's in-flight requests or checkpoint state.

  - UDF-level batch_size / concurrency / resources / retry / timeout:
    turns "utilize without blowing up" from an application concern
    into an engine contract. One operator declaration replaces
    dozens of lines of bespoke, buggy code per team (an illustrative
    sketch follows this list).

  - GPU independent deployment + non-disruptive scaling: lets us
    scale the GPU tier with actual QPS instead of provisioning for
    peak. For a 3x peak-to-trough ratio (common for consumer-facing
    inference), this is a direct 2–3x cost win — and it's only
    possible if scaling is non-disruptive, because GPU warmup takes
    30–120s.

  - Pipeline Region independent checkpoints: caps the blast radius of
    a GPU-side failure to one region, so a transient OOM no longer
    rolls back an entire multi-hour inference job. This is not just
    a generic checkpoint polish — the *reason* it becomes load-bearing
    is that per-event cost in AI workloads is 10³–10⁴x higher than in
    BI, so the cost of a global rollback is 10³–10⁴x higher too.

  - Built-in multimodal / AI functions: today everyone re-implements
    AsyncFunction + batching + retry + connection pooling. Lifting
    these into the engine unlocks optimizations only the engine can
    do: shared model handles, request coalescing across subtasks,
    connection pool reuse, adaptive batching based on observed
    latency.
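
(Illustration only; the names below are invented, not taken from the
FLIP, and the real API shape belongs to the sub-FLIP discussion.) The
declaration in the second bullet above boils down to a small contract
the engine can enforce, roughly along these lines:

    // Hypothetical sketch: the knobs a UDF-level declaration would carry,
    // so the engine, not each team's UDF, owns batching, concurrency,
    // retry budget, timeout, and GPU resources. All names are invented.
    public record InferenceContract(
            int maxBatchSize,       // e.g. 32 requests per micro-batch
            long maxBatchWaitMs,    // flush a partial batch after e.g. 20 ms
            int maxInFlight,        // adaptive upper bound on concurrent requests
            long timeoutMs,         // per-request budget, e.g. 2000 ms
            int maxRetries,         // drawn from a shared budget, with jitter
            double gpuFraction,     // e.g. 0.5 GPU for the local-GPU path
            long gpuMemoryBytes) {  // e.g. 8 GiB of VRAM
    }

The exact fields don't matter; what matters is that once they are
declared to the engine, adaptive concurrency, micro-batching, and GPU
placement become runtime responsibilities instead of per-team UDF code.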

None of these, by themselves, is "AI-only". But the composition is
what AI inference workloads *need* and generic BI workloads do not
need at the same intensity.

4. On "why must this be in Flink?"

For remote serving, one fair answer is: because back-pressure,
checkpointing, and failover have to cross the RPC boundary for the
system to stay in the utilize-vs-overload window. A K8s operator
outside Flink cannot see in-flight request queues, cannot coordinate
barrier alignment with scaling events, and cannot bound the failure
domain of a GPU fault. Putting RpcOperator inside the runtime is
exactly what lets those signals flow both ways.

For local GPU, the answer is even clearer: resource declaration and
scheduling have to happen where the task scheduler lives. No external
system can fix this.

Thanks again for driving this — the scenarios above are exactly the
ones our users run into and are currently solving with copy-pasted
AsyncFunction code. Having them be first-class in Flink would be a
big step.

Best,
Raorao Xiong

> On May 7, 2026, at 11:09, Yong Fang <[email protected]> wrote:
> 
> Hi devs,
> 
> Thanks Guowei for initiating this proposal. I think this is an important
> step for Flink towards the era of AI data processing, very big +1.
> 
> I’d like to share some scenarios and requirements of leveraging PyFlink for
> AI data processing at ByteDance. Currently, we run tens of thousands of
> pyflink/flink jobs, using millions of CPU cores.
> 
> 1) Multimodal Data Processing
> We want to use PyFlink to generate multimodal feature data. The typical
> workflow starts by reading ID-based raw data and performing large table
> joins and ETL computations. We then fetch multimodal assets such as images,
> videos, texts and audios from object storage by ID. These multimodal data
> are either sent to an RPC service (backed by local models or remote large
> models), or processed via local GPU computing for frame extraction,
> embedding generation and other tasks. After multimodal computation, output
> results including embeddings, processed images and multimodal metadata are
> generated and persisted into the downstream multimodal data lake.
> 
> 2) Stream-Batch Unified Data Training
> We use PyFlink to consume processed sample data from MQ or data lakes.
> Within the data pipeline, data may be shuffled by key, then fed into
> parameter servers or local services for CPU-based or GPU-based model
> training. Such workloads strongly demand optimized CPU & GPU hybrid
> scheduling, worker node restart capability, fast scaling, as well as native
> support for unified streaming and batch training.
> 
> While supporting the above two categories of workloads, we have several
> common requirements for PyFlink and Flink Core:
> 
> 1) Native Python Computing Capability
> We need more user-friendly DataFrame APIs, comprehensive built-in data
> types for image and audio, as well as richer multimodal computing
> operators. This enables users to develop multimodal data processing jobs
> more efficiently, and allows better optimization and scheduling at the
> execution plan layer.
> 
> 2) CPU & GPU Scheduling and Resource Management
> It is expected to tag resource requirements at fine-grained levels such as
> python user-defined functions. The scheduler should provide enhanced
> resource orchestration and task scheduling, enabling more flexible
> heterogeneous resource management.
> 
> 3) Embedded Server Node Capability in Pipeline
> We hope to launch dedicated server nodes inside a single pipeline, to load
> local models or access remote models, which may be shared between different
> operators in the pipeline. This unifies data ETL and multimodal computing
> within one end-to-end pipeline, greatly simplifying operation and
> maintenance for business teams.
> 
> 4) Performance and Stability Optimization
> Key enhancements include zero-copy data transfer between Flink TM processes
> and Python processes, fast scaling of compute nodes, fast checkpointing
> mechanism, and shuffle optimization for large-scale datasets. These
> improvements will significantly boost the performance and stability of
> PyFlink multimodal workloads.
> 
> I’m really excited to see that FLIP-577 has covered all the real production
> scenarios and requirements mentioned above. It's a good chance to iterate
> and enrich the core capabilities of PyFlink and Flink Core targeting these
> AI data processing scenarios, and build Flink into a first-class AI data
> processing engine.
> 
> I’m very much looking forward to the progress.
> 
> Best,
> FangYong
> 
> On Wed, May 6, 2026 at 8:36 PM FeatZhang <[email protected]> wrote:
> 
>> Hi all,
>> 
>> This is a great topic, and honestly long overdue.
>> 
>> With the rapid growth of AI applications, we have been seeing a
>> significant increase in real-world demands from users who are already
>> building on Flink and other traditional data processing or BI engines.
>> From a platform perspective, more and more teams are trying to
>> integrate AI capabilities directly into their existing streaming
>> pipelines, rather than treating them as separate systems.
>> 
>> This is not an isolated trend — it is becoming a common requirement
>> across industries.
>> 
>> ________________________________
>> 
>> 1. What is changing in real systems
>> 
>> We are observing a consistent shift:
>> 
>> AI is moving from offline analysis or request-time scoring to
>> continuous, event-driven decision making.
>> 
>> In other words:
>> 
>> AI is becoming part of the data stream itself.
>> 
>> ________________________________
>> 
>> 2. Representative production scenarios
>> 
>> 2.1 Real-time fraud detection (per-event decision under strict latency)
>> 
>> Typical setup
>> 
>> Continuous transaction stream (payments, logins, transfers)
>> Each event must be evaluated within milliseconds
>> Decision depends on:
>> 
>> recent user behavior
>> device / IP patterns
>> short-term aggregates
>> 
>> What is happening in practice
>> 
>> Models are already deeply integrated into decision flow
>> Feature freshness directly impacts detection accuracy
>> 
>> Current pain
>> 
>> Features computed in streaming, but inference is remote
>> Network overhead adds to critical path latency
>> Hard to ensure training/serving consistency
>> 
>> ________________________________
>> 
>> 2.2 Real-time recommendation and ads (continuous re-ranking)
>> 
>> Typical setup
>> 
>> User interaction stream (click, view, dwell time)
>> Continuous feature updates (session + short-term behavior)
>> Inference triggered per interaction
>> 
>> What is happening in practice
>> 
>> Increasing reliance on real-time context
>> Model-based ranking becomes core logic
>> 
>> Current pain
>> 
>> Offline and online feature pipelines diverge
>> Training-serving skew is common
>> Inference orchestration is ad-hoc
>> 
>> ________________________________
>> 
>> 2.3 Streaming RAG / knowledge systems (continuous indexing)
>> 
>> Typical setup
>> 
>> Continuous ingestion of documents, logs, or knowledge
>> Pipeline:
>> 
>> chunking → embedding → indexing → retrieval
>> 
>> Typical use cases
>> 
>> AI copilots
>> enterprise knowledge assistants
>> observability systems
>> 
>> Current pain
>> 
>> Built via loosely coupled services or scripts
>> No strong consistency guarantees
>> Difficult to scale and recover
>> 
>> ________________________________
>> 
>> 2.4 Real-time feedback loop (continuous evaluation)
>> 
>> Typical setup
>> 
>> Prediction at time T
>> Label arrives at T + Δ
>> 
>> Required processing
>> 
>> prediction stream JOIN label stream → metrics → optimization
>> 
>> Current pain
>> 
>> Alignment logic duplicated across systems
>> Late data handling is complex
>> No reusable evaluation abstraction
>> 
>> ________________________________
>> 
>> 2.5 AI replacing rule-based decision logic
>> 
>> Evolution
>> 
>> From:
>> 
>> rule engine / CEP
>> 
>> To:
>> 
>> model / LLM → decision
>> 
>> Implication
>> 
>> AI is becoming the core decision layer inside streaming systems.
>> 
>> ________________________________
>> 
>> 3. Architectural shift
>> 
>> Across all scenarios:
>> 
>> From:
>> 
>> data processing → feature system → model serving → evaluation
>> 
>> To:
>> 
>> stream = feature + inference + decision + feedback
>> 
>> This reflects a fundamental change in system boundaries.
>> 
>> ________________________________
>> 
>> 4. Why externalized architectures break down
>> 
>> Most current implementations rely on multiple systems:
>> 
>> stream processing
>> feature store
>> model serving
>> vector database
>> 
>> This introduces several fundamental issues in real-time scenarios.
>> 
>> 4.1 Latency dominated by system boundaries
>> 
>> stream → network → model service → response
>> 
>> network overhead is unavoidable
>> batching is not controlled by the stream runtime
>> no end-to-end backpressure
>> 
>> Latency becomes a system-level artifact rather than a compute property.
>> 
>> ________________________________
>> 
>> 4.2 Inconsistent data between training and serving
>> 
>> offline vs online features
>> different definitions or time windows
>> 
>> Models operate on inconsistent data distributions.
>> 
>> ________________________________
>> 
>> 4.3 State fragmentation
>> 
>> user/session context must be rebuilt or fetched
>> loss of data locality
>> processing becomes call-driven
>> 
>> ________________________________
>> 
>> 4.4 Feedback loop is not composable
>> 
>> difficult alignment of prediction and label streams
>> no unified handling of late data
>> duplicated evaluation logic
>> 
>> ________________________________
>> 
>> 4.5 Operational complexity
>> 
>> multiple systems to scale
>> multiple failure domains
>> complex debugging paths
>> 
>> ________________________________
>> 
>> 5. Why this aligns with Flink
>> 
>> These workloads require:
>> 
>> event-driven execution
>> strong state management
>> precise time semantics
>> continuous feedback
>> 
>> These are exactly Flink’s core strengths.
>> 
>> The key insight is:
>> 
>> Inference should be modeled as a dataflow operator, not an external
>> service.
>> 
>> ________________________________
>> 
>> 6. Implication
>> 
>> If we model:
>> 
>> data → feature → inference → decision → feedback
>> 
>> within Flink, we can achieve:
>> 
>> unified scheduling
>> shared state
>> consistent time semantics
>> end-to-end fault tolerance
>> 
>> ________________________________
>> 
>> 7. Conclusion
>> 
>> This is not simply about adding AI support to Flink.
>> 
>> It is about recognizing that:
>> 
>> Real-time AI systems are fundamentally streaming systems.
>> 
>> The question is whether Flink evolves to support this natively, or
>> remains a preprocessing layer in front of external AI stacks.
>> 
>> ________________________________
>> 
>> Happy to follow up with a more concrete proposal (e.g., inference
>> operator abstraction) if there is interest.
>> 
>> Thanks.
>> 
>> On Mon, May 4, 2026 at 10:31 PM Gen Luo <[email protected]> wrote:
>>> 
>>> Hi all,
>>> 
>>> Thank you Guowei Ma for driving this discussion, and thanks everyone for
>>> the valuable insights. Inspired by this exchange, I’d like to share a few
>>> thoughts.
>>> 
>>> While “AI-Native” covers broad ground, I believe this FLIP does not
>>> overextend Flink’s scope. It’s a necessary iteration driven by evolving
>>> user scenarios and AI advancements, particularly multimodal processing.
>>> Given the growing adoption of multimodal applications and increasing
>>> interest in low-latency inference, initiating these enhancements is a
>>> timely step to better align Flink with evolving AI workloads.
>>> 
>>> From our engagements with customers and developers, we observe a clear
>>> shift in both workloads and user expectations. Model inference is
>>> increasingly central to data pipelines, with multimodal AI tasks growing
>>> rapidly. Traditional real-time scenarios (e.g., monitoring and analytics)
>>> now leverage models and agent frameworks like Flink Agent for
>> intelligent,
>>> multi-turn decision-making, while large-scale offline compute is also
>>> shifting toward LLMs and vision models. Alongside this workload
>> evolution,
>>> developer workflows have adapted: AI practitioners naturally prefer
>> Python
>>> and DataFrame-style APIs. As AI-assisted coding matures, aligning system
>>> interfaces with these familiar patterns will directly improve
>> AI-generated
>>> code quality and significantly lower adoption barriers for the AI
>> community.
>>> 
>>> Today, many AI evaluation tools don’t yet recommend Flink for AI
>>> workloads—largely due to limited visibility of Flink’s relevant
>>> capabilities rather than fundamental incompatibility. In reality, Flink
>> has
>>> unique strengths here. For example, generating multimodal samples is
>> often
>>> a multi-day, GPU-heavy process. Flink’s streaming model, combined with
>>> checkpointing and reduced disk I/O, is well-suited for such long-running
>>> tasks—a direction also pursued by engines like Daft and Ray Data. With
>>> Flink’s proven production stability, we’re well-positioned for both batch
>>> and future real-time multimodal streaming inference. Targeted
>> improvements
>>> can make these advantages visible, driving better user experiences and
>>> healthier ecosystem growth.
>>> 
>>> I’d also note a lesson from FlinkML. It attempted to cover model training
>>> but struggled to align with the fast-iteration, Python/notebook-centric
>>> workflows preferred by ML researchers. Flink’s core strength lies in
>>> high-concurrency, production-grade inference orchestration—not training
>>> lifecycle management (e.g., experiment tracking, versioning). This
>> mismatch
>>> limited its adoption.
>>> 
>>> This proposal, however, takes a different path. It doesn’t aim to replace
>>> training frameworks. Instead, it introduces modern AI concepts
>> (multimodal
>>> data, LLMs) as first-class citizens for inference, built atop Flink’s
>>> computation strengths. Think Ray Data’s scope (plus simple co-located
>>> serving), not Train/Tune. Crucially, unlike the FlinkML era, today’s
>> models
>>> use standardized interfaces and mature serving frameworks, allowing Flink
>>> to integrate external models seamlessly without heavy
>>> customization—significantly lowering project risk.
>>> 
>>> This FLIP marks Flink’s another starting point for the AI era. While
>>> details need refinement, I believe this direction aligns with both
>> current
>>> and future user needs and Flink’s evolution.
>>> 
>>> Best,
>>> Gen
>>> 
>>> On Mon, May 4, 2026 at 12:15 PM Jark Wu <[email protected]> wrote:
>>> 
>>>> Hi Guowei,
>>>> 
>>>> Thanks for driving this. +1 on the overall direction. Flink's
>>>> streaming processing and checkpoint mechanism give it a structural
>>>> advantage over systems like Daft and Ray. But today, these runtime
>>>> strengths are held back by gaps in Python API, GPU scheduling, and
>>>> native multimodal data handling. This umbrella FLIP addresses exactly
>>>> that gap, comprehensively and systematically. I believe multimodal
>>>> data processing is the biggest opportunity for traditional data infra
>>>> to transition into AI infra, and this is one of the most important
>>>> FLIPs for Flink in the AI era.
>>>> 
>>>> As one of the Table/SQL module maintainers, we would like to
>>>> contribute the built-in multimodal processing UDFs (audio, video,
>>>> image, text) and native multimodal data types (Tensor, Image,
>>>> Embedding, etc.) as first-class citizens in the type system. Looking
>>>> forward to the sub-FLIP discussions.
>>>> 
>>>> Best,
>>>> Jark
>>>> 
>>>> On Thu, 30 Apr 2026 at 18:42, Guowei Ma <[email protected]> wrote:
>>>>> 
>>>>> Hi,Yaroslav
>>>>> 
>>>>> Thanks for taking the time to write this detailed feedback. Let me
>>>> clarify
>>>>> the intent of the proposal first.
>>>>> 
>>>>> I am not saying that Flink should become an AI framework, an ML
>> platform,
>>>>> or a model serving system. The way I use "AI-Native" in this
>> proposal is
>>>> to
>>>>> say that Flink should support, as first-class citizens, the core
>> objects
>>>>> and execution patterns that frequently show up in AI-oriented data
>>>>> processing — instead of leaving them entirely to external systems or
>> ad
>>>> hoc
>>>>> user-defined integrations.
>>>>> 
>>>>> These objects and execution patterns include:
>>>>> 
>>>>>   - Multimodal and unstructured data objects such as images, video,
>>>> audio,
>>>>>   tensors, embeddings, and object references.
>>>>>   - Model inference as part of the data flow, rather than an
>> entirely
>>>>>   external black-box service call.
>>>>>   - Operators backed by heterogeneous resources such as GPUs.
>>>>>   - Pythonic and vectorized processing styles.
>>>>>   - Long-running, long-tailed asynchronous computation.
>>>>> 
>>>>> "AI-Native" is just a shorthand here, meaning that Flink should
>> natively
>>>>> understand and support the core abstractions of this class of
>> workloads.
>>>>> The FLIP needs to make the target workload class clearer. What we
>> care
>>>>> about is not any specific model paradigm — LLM, CV, recommendation,
>> or
>>>>> traditional ML inference — but a class of data processing workloads
>> with
>>>>> shared runtime and topology characteristics:
>>>>> 
>>>>>   - A single computation may take seconds or even minutes, instead
>> of
>>>>>   microseconds as in traditional row-at-a-time processing.
>>>>>   - Execution often involves heterogeneous resources such as CPU +
>> GPU,
>>>>>   where GPUs are expensive and scarce.
>>>>>   - Data is often multimodal large objects (images, video, audio,
>>>> tensors,
>>>>>   embeddings), rather than structured small records.
>>>>>   - Computation logic often includes model inference or
>> service-style
>>>>>   invocations as part of the pipeline.
>>>>>   - Many target topologies are relatively shuffle-light and don't
>>>>>   necessarily involve complex keyed-state migration, e.g. URI →
>>>> preprocessing
>>>>>   → inference → sink.
>>>>> 
>>>>> Ten years ago, many ML workloads took the form of offline training
>> plus
>>>>> online feature serving. Flink already played a strong role in feature
>>>>> engineering, streaming feature computation, and real-time data
>>>> preparation,
>>>>> so there was no strong need to reshape Flink into an "ML-Native"
>> engine.
>>>>> 
>>>>> What is changing today is that model inference itself is increasingly
>>>>> becoming part of the data processing pipeline; multimodal objects
>> are no
>>>>> longer just opaque blobs in external storage, but data objects that
>> need
>>>> to
>>>>> be referenced, passed, transformed, inferred over, and landed inside
>> the
>>>>> engine. This is not simply one more ML use case — it is a change in
>> the
>>>>> shape of workloads Flink needs to support.
>>>>> 
>>>>> On whether the user demand is real, the validation signals we are
>>>> currently
>>>>> seeing include:
>>>>> 
>>>>>   - Within Alibaba, multimodal data processing is already in
>> production,
>>>>>   covering image, video, audio, and text modalities.
>>>>>   - In offline conversations with several companies (including
>> ByteDance
>>>>>   and Tencent), we have heard substantial demand for Flink to
>> support
>>>> AI data
>>>>>   processing / multimodal data processing.
>>>>>   - On the ecosystem side, we are working with NVIDIA on a joint
>> demo
>>>>>   focused on multimodal data processing, planned for Flink Forward
>> Asia.
>>>>>   - The emergence and growth of systems such as Daft, Ray Data,
>>>>>   Data-Juicer, and LAS also reflect rapidly growing demand for
>>>> multimodal
>>>>>   data processing.
>>>>>   - There have also been independent discussions in this direction
>>>> within
>>>>>   the community — for example, the "Streaming-native AI Inference
>>>> Runtime
>>>>>   Layer" proposal on the dev list.
>>>>> 
>>>>> On "why now, instead of waiting for standardization" — I understand
>> the
>>>>> concern. LLM-related frameworks, APIs, and application-level
>> patterns are
>>>>> indeed changing quickly. If this FLIP were trying to bake a specific
>> LLM
>>>>> API, agent framework, or prompt protocol into Flink, the risk would
>> be
>>>> high.
>>>>> 
>>>>> But most of the capabilities in this proposal are not LLM-specific.
>> They
>>>>> are more fundamental data processing and runtime capabilities:
>> Pipeline
>>>>> Region-level checkpointing, Object Reference, GPU resource
>> declaration,
>>>>> columnar data transfer, service-style operator invocation,
>> long-running
>>>>> async execution. These are useful for today's LLM workloads, and
>> equally
>>>>> useful for future AI workloads in shapes we cannot fully predict
>> yet. The
>>>>> fast-changing parts should live in the ecosystem and SDK layer; the
>> FLIP
>>>>> should focus on more stable engine-level capabilities.
>>>>> 
>>>>> On tactical changes vs. umbrella, I partly agree with you. Each
>> sub-FLIP
>>>>> should be discussed, reviewed, and accepted or rejected on its own
>>>> merits.
>>>>> The umbrella should not bypass the normal FLIP process, and
>> accepting the
>>>>> umbrella does not mean accepting all sub-FLIPs. That said, I still
>> think
>>>>> the umbrella is valuable. Its purpose is not to bind the 11 changes
>> into
>>>> a
>>>>> single inseparable package, but to help the community align on
>>>> principles,
>>>>> clarify boundaries and dependencies, and avoid conflicting or
>> duplicated
>>>>> abstractions across related capabilities.
>>>>> 
>>>>> For example, if RpcOperator is not considered together with
>>>> non-disruptive
>>>>> scaling, it is hard to give GPU operator elasticity coherent
>> semantics.
>>>>> Deploying inference services independently is only the first step;
>> the
>>>>> harder question is how Flink uniformly handles service discovery,
>>>> in-flight
>>>>> request draining, backpressure, and failover during scaling. Without
>> an
>>>>> umbrella, these capabilities can certainly be advanced as tactical
>>>> changes,
>>>>> but we may end up with a set of abstractions that are locally usable
>> but
>>>>> globally inconsistent.
>>>>> 
>>>>> On RpcOperator, I agree that we need to be very careful in defining
>> the
>>>>> boundary between the Flink runtime and external orchestration
>> systems.
>>>>> Kubernetes or the Kubernetes Operator may well be the right choice
>> at the
>>>>> physical deployment level. But I still believe Flink needs a
>> first-class
>>>>> RpcOperator abstraction, because deployment is only part of the
>> problem —
>>>>> the harder part is its semantic integration with the Flink job.
>>>>> 
>>>>> If model inference is part of the logical data flow, Flink needs at
>>>> minimum
>>>>> to be aware of its service discovery, backpressure behavior, failover
>>>>> behavior, in-flight request draining, and scaling coordination. If
>> it is
>>>>> hidden entirely behind an external black-box service, it is hard for
>>>> Flink
>>>>> to provide consistent job-level semantics and operational experience.
>>>>> 
>>>>> So the point of RpcOperator is not necessarily that "every physical
>>>> process
>>>>> must be directly launched and managed by Flink core," but that Flink
>>>> needs
>>>>> to define a service-style operator contract that allows such
>> operators to
>>>>> be invoked correctly by the data flow, coordinated correctly by the
>>>>> runtime, and understood and operated by users as part of a Flink job.
>>>>> 
>>>>> On vectorized batch processing, I agree the long-term direction
>> should
>>>> not
>>>>> stop at Python. Native columnar / vectorized execution is an
>> end-to-end
>>>>> problem that touches connectors, formats, the type system, runtime,
>> Java,
>>>>> SQL, and Python. The current proposal starts from the Java/Python
>>>> boundary
>>>>> because that is where the row/column conversion overhead is most
>> visible.
>>>>> End-to-end columnar execution on the Java and SQL side deserves to be
>>>>> discussed further as a separate, larger FLIP.
>>>>> 
>>>>> On multimodal types and SerDes complexity, I agree this needs to be
>>>> handled
>>>>> carefully. Making AI-related objects first-class does not imply that
>>>> every
>>>>> connector must immediately and fully support image, video, audio,
>> tensor,
>>>>> and so on. The concrete incremental path, fallback strategy, and the
>>>>> boundary between formats, connector API, and the type system will be
>>>>> discussed further in the corresponding sub-FLIPs.
>>>>> 
>>>>> Coming back to the core of the proposal: it is not about turning
>> Flink
>>>> into
>>>>> an AI framework. It is about making the core objects and execution
>>>> patterns
>>>>> of AI-oriented data processing first-class citizens in Flink.
>>>>> 
>>>>> Best,
>>>>> Guowei
>>>>> 
>>>>> 
>>>>> On Thu, Apr 30, 2026 at 5:37 AM Yaroslav Tkachenko <
>> [email protected]
>>>>> 
>>>>> wrote:
>>>>> 
>>>>>> Hi Guowei,
>>>>>> 
>>>>>> Thank you for writing this proposal.
>>>>>> 
>>>>>> I may be in the minority here, but I hope my voice will be heard. I
>>>>>> disagree with turning Flink into an "AI-Native" engine.
>>>>>> 
>>>>>> Regarding your "Data processing is entering the AI era, and Flink
>>>> needs to
>>>>>> evolve from a traditional BI compute engine into a data engine that
>>>>>> natively supports AI workloads" claim:
>>>>>> 
>>>>>> - How exactly do you define "AI"? I don't believe there is a
>> standard
>>>>>> definition. For example, Machine Learning have been around for more
>>>> than a
>>>>>> decade, but there were no proposals (or need, in my opinion) to
>> turn
>>>> Flink
>>>>>> into an "ML-Native" engine. Flink, in its current state, has
>>>>>> been successfully used in many systems alongside dedicated ML
>>>> technologies,
>>>>>> like feature stores. Based on the context of your proposal, it
>> looks
>>>> like
>>>>>> you mostly mean LLMs, so could you be specific about the language?
>>>>>> - I wouldn't call Flink "a traditional BI compute engine". Flink
>> is a
>>>>>> general data processing technology which can be used for a variety
>> of
>>>> use
>>>>>> cases without any BI involvement.
>>>>>> - Do you have any proof that "Users' core workloads are rapidly
>>>> evolving"
>>>>>> and that they require your proposed changes? Case studies, user
>>>> surveys, or
>>>>>> submitted issues about the lack of support? Big changes like that
>>>> require
>>>>>> extensive validation.
>>>>>> - And even if there is a real need to adopt some LLM-driven
>> changes,
>>>> why
>>>>>> now? The LLM-related tooling has been changing so rapidly, and it's
>>>> hard to
>>>>>> predict what will be needed tomorrow. Why does it make sense to
>>>> introduce
>>>>>> changes now, and not wait for more standardization and
>> consolidation?
>>>>>> 
>>>>>> To summarize, I think there are a lot of great ideas in the
>> proposal,
>>>> but
>>>>>> in my mind, they need to be addressed as tactical, focused
>> changes, not
>>>>>> under the "AI-Native" umbrella.
>>>>>> 
>>>>>> I also wanted to address a few more specific points:
>>>>>> 
>>>>>> - RpcOperator, why does it need to be managed by Flink? I see
>>>> absolutely no
>>>>>> need to introduce the additional complexity of orchestrating
>> standalone
>>>>>> components into the core Flink engine. I can imagine a separate
>>>> sub-project
>>>>>> for an RpcOperator, which could potentially be managed by the
>>>> Kubernetes
>>>>>> Operator.
>>>>>> - You make the case for the vectorized batch processing, but only
>> on
>>>> the
>>>>>> Python side. Why stop there? Native columnar vectorized execution
>> will
>>>>>> require end-to-end changes, including connectors, data format
>> support,
>>>> Type
>>>>>> system support, runtime changes, etc. It seems logical to me to
>> support
>>>>>> this execution mode for Java and SQL as well.
>>>>>> - Supporting many more data types natively (images, video, audio,
>>>> tensors)
>>>>>> will make connector serializers and deserializers (SerDes) much
>> more
>>>>>> challenging to implement. Even today, many SerDes in officially
>>>> supported
>>>>>> connectors don't fully implement types like arrays and structs.
>>>>>> 
>>>>>> Thank you.
>>>>>> 
>>>>>> On Wed, Apr 29, 2026 at 1:18 AM Guowei Ma <[email protected]>
>>>> wrote:
>>>>>> 
>>>>>>> Hi Z
>>>>>>> 
>>>>>>> Thanks for the kind words and the thoughtful questions. Let me
>> take
>>>> them
>>>>>>> one by one.
>>>>>>> 
>>>>>>>   1. Throughput and latency targets
>>>>>>> 
>>>>>>> To be honest, I don't have concrete numbers to share yet. What I
>> can
>>>> say
>>>>>> is
>>>>>>> that our internal testing has already surfaced several directions
>>>> where
>>>>>>> Flink can be improved, and at the same time we want to fully
>> leverage
>>>>>>> Flink's existing streaming shuffle capabilities. As the
>> multimodal
>>>>>> operator
>>>>>>> library matures, we'll progressively publish benchmark results.
>>>>>>> 
>>>>>>>   2. Built-in operators
>>>>>>> 
>>>>>>> You're absolutely right. From what I've seen, our internal users
>>>> already
>>>>>>> rely on a fairly large set of multimodal operators — potentially
>>>> 100+.
>>>>>> The
>>>>>>> exact set the community should provide is best discussed in
>> FLIP-XXX:
>>>>>>> Built-in Multimodal Operators and AI Functions, and contributions
>>>> from
>>>>>> the
>>>>>>> community are very welcome there.
>>>>>>> 
>>>>>>>   3. Plan for the 11 sub-FLIPs
>>>>>>> 
>>>>>>> The sequencing follows the layering in the umbrella:
>>>>>>> 
>>>>>>>   - Layer 1 (Core Primitives) should be discussed and aligned
>> first,
>>>>>> since
>>>>>>>   the second and third layers build on it.
>>>>>>>   - Layer 2 (API + compilation + single-node execution) starts
>> with
>>>>>>>   getting the API discussion right — the Python API, how UDFs
>>>> declare
>>>>>>>   resources, etc. — after which the single-node execution work
>> can
>>>> build
>>>>>>> on
>>>>>>>   top.
>>>>>>>   - Layer 3 (distributed scheduling and checkpointing) can
>> largely
>>>>>> proceed
>>>>>>>   independently in parallel.
>>>>>>> 
>>>>>>> So while each sub-FLIP is indeed a substantial piece of work,
>> most of
>>>>>> them
>>>>>>> can be advanced in parallel by different contributors once the
>> Layer
>>>> 1
>>>>>>> primitives are settled.
>>>>>>> 
>>>>>>>   4. GPU scheduling roadmap
>>>>>>> 
>>>>>>> Could you expand a bit on which aspect of GPU scheduling you
>> have in
>>>> mind
>>>>>>> as the complex one? "GPU scheduling" covers a fairly wide surface
>>>> area
>>>>>>> (resource declaration, operator-level deployment, elastic
>> scaling,
>>>>>>> heterogeneous GPU types, fine-grained partitioning, etc.), and
>> the
>>>> answer
>>>>>>> differs quite a bit depending on which dimension we're
>> discussing.
>>>> Once I
>>>>>>> understand your specific concern I can give a more useful
>> response.
>>>>>>> 
>>>>>>> Thanks again for the support — looking forward to the continued
>>>>>> discussion.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Guowei
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Apr 28, 2026 at 4:34 PM zl z <[email protected]>
>> wrote:
>>>>>>> 
>>>>>>>> Hey Guowei,
>>>>>>>> 
>>>>>>>> Thanks for the proposal, and I think this is very valuable. I
>> have
>>>> some
>>>>>>>> question about it:
>>>>>>>> 
>>>>>>>> 1. What are our expected throughput and latency targets? Do we
>>>> have any
>>>>>>>> forward-looking tests for this?
>>>>>>>> 
>>>>>>>> 2. AI involves a very large number of operators. Besides
>> allowing
>>>> users
>>>>>>> to
>>>>>>>> use them through UDFs, will we also provide commonly used
>> built-in
>>>>>>>> operators?
>>>>>>>> 
>>>>>>>> 3. Each of the 11 sub-FLIPs is a major project involving a
>>>> significant
>>>>>>>> amount of changes. What is our plan for this?
>>>>>>>> 
>>>>>>>> 4. GPU scheduling is extremely complex. What is our current
>>>> roadmap for
>>>>>>>> this?
>>>>>>>> 
>>>>>>>> This is a very high-quality and exciting proposal. Making
>> Flink an
>>>>>>>> AI-native data processing engine will make it far more
>> valuable in
>>>> the
>>>>>> AI
>>>>>>>> era. Look forward to seeing it land and come to fruition soon.
>>>>>>>> 
>>>>>>>> Robert Metzger <[email protected]> wrote on Tue, Apr 28, 2026 at 14:38:
>>>>>>>> 
>>>>>>>>> Hey Guowei,
>>>>>>>>> 
>>>>>>>>> Thanks for the proposal. I just took a brief look, here are
>> some
>>>> high
>>>>>>>> level
>>>>>>>>> questions:
>>>>>>>>> 
>>>>>>>>> Regarding the RPC Operator: What is the difference to the
>> async
>>>> io
>>>>>>>> operator
>>>>>>>>> we have already?
>>>>>>>>> 
>>>>>>>>> "Connector API for Multimodal Data Source/Sink": Why do we
>> need
>>>> to
>>>>>>> touch
>>>>>>>>> the connector API for supporting multimodal data? Isn't this
>>>> more of
>>>>>> a
>>>>>>>>> formats concern?
>>>>>>>>> 
>>>>>>>>> "Non-Disruptive Scaling for CPU Operators": How do you want
>> to
>>>>>>> guarantee
>>>>>>>>> exactly-once on that kind of scaling? E.g. you need to
>> somehow
>>>> make a
>>>>>>>>> handover between the old and new new pipeline
>>>>>>>>> 
>>>>>>>>> Overall, I find the proposal has some things which seem
>> related
>>>> to
>>>>>>> making
>>>>>>>>> Flink more AI native, but other changes seem orthogonal to
>> that.
>>>> For
>>>>>>>>> example the checkpoint or scaling changes are actually
>> unrelated
>>>> to
>>>>>> AI,
>>>>>>>> and
>>>>>>>>> just engine improvements.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Apr 28, 2026 at 5:48 AM Guowei Ma <
>> [email protected]>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi everyone,
>>>>>>>>>> 
>>>>>>>>>> I'd like to start a discussion on an umbrella FLIP[1] that
>> lays
>>>>>> out a
>>>>>>>>>> direction for evolving Flink into a data engine that
>> natively
>>>>>>> supports
>>>>>>>> AI
>>>>>>>>>> workloads.
>>>>>>>>>> 
>>>>>>>>>> The short version: user workloads are shifting from BI
>>>> analytics to
>>>>>>>>>> multimodal data processing centered on model inference, and
>>>> this
>>>>>>>> triggers
>>>>>>>>>> cascading changes across the stack — multimodal data
>> flowing
>>>>>> through
>>>>>>>>>> pipelines, heterogeneous CPU/GPU resources, vectorized
>>>> execution,
>>>>>> and
>>>>>>>>>> inference tasks that run for seconds to minutes on Spot
>>>> instances.
>>>>>>> The
>>>>>>>>>> proposal sketches an evolution along five directions
>>>> (development
>>>>>>>>> paradigm,
>>>>>>>>>> data model, heterogeneous resources, execution engine,
>> fault
>>>>>>>> tolerance),
>>>>>>>>>> decomposed into 11 sub-FLIPs organized into three layers:
>> core
>>>>>>> runtime
>>>>>>>>>> primitives, AI workload expression and execution, and
>>>>>>> production-grade
>>>>>>>>>> operational guarantees. Most sub-FLIPs have no hard
>>>> dependencies on
>>>>>>>> each
>>>>>>>>>> other and can be advanced in parallel.
>>>>>>>>>> 
>>>>>>>>>> A note on scope, since it's an umbrella:
>>>>>>>>>> 
>>>>>>>>>> - In scope here: whether the evolution directions are
>>>> reasonable,
>>>>>>>> whether
>>>>>>>>>> each sub-FLIP's motivation and proposed approach are
>>>> well-founded,
>>>>>>> and
>>>>>>>>>> whether the boundaries and dependencies between sub-FLIPs
>> are
>>>>>> clear.
>>>>>>>>>> - Out of scope here: detailed designs, API specifics, and
>>>>>>>> implementation
>>>>>>>>>> plans of individual sub-FLIPs — those will go through
>> their own
>>>>>>> FLIPs.
>>>>>>>>>> - Consensus criteria: agreement on the overall direction is
>>>>>>> sufficient
>>>>>>>>> for
>>>>>>>>>> the umbrella to pass; passing it does not lock in any
>>>> sub-FLIP's
>>>>>>>> design —
>>>>>>>>>> sub-FLIPs may still be adjusted, deferred, or withdrawn as
>> they
>>>>>>>> progress.
>>>>>>>>>> 
>>>>>>>>>> All proposed changes are incremental — no existing API or
>>>> behavior
>>>>>> is
>>>>>>>>>> removed or altered. Compatibility details are covered at
>> the
>>>> end of
>>>>>>> the
>>>>>>>>>> document.
>>>>>>>>>> 
>>>>>>>>>> Looking forward to your feedback on the overall direction
>> and
>>>> the
>>>>>>>>> layering.
>>>>>>>>>> 
>>>>>>>>>> [1]
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Guowei
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>> 

