Hi all, Apologies — I forgot to include the document link in my previous mail.
FLIP draft: FLIP-XXX: Streaming-native AI Inference Runtime <https://drive.google.com/open?id=17PpHDjFCeL2wcp9Q7t7snyvKnW2ugBKZ-rtqo_nMKQg> For convenience, the structure of the document is: §1 Motivation — why a unified inference runtime, and what's missing today. §2 Public Interfaces — InferenceFunction / InferenceBackend / LoadProbe SPI, DataStream and SQL surfaces, configuration keys, module split. §3 Proposed Changes — operator internals, batching, retry/timeout, two-tier circuit breaker, fault-tolerance semantics, load-aware scheduling loop, multi-model demo. §4 Compatibility / §5 Migration — incremental, opt-in; no breaking changes to existing Async I/O or ML_PREDICT users. §6 Test Plan. §7 Rejected Alternatives — including why this is not done as an external library, why per-call SQL hints are not used for per-model config, and why a per-subtask-only circuit breaker is insufficient. §8 Future Work. Appendix A — concrete PRs (merged and in-review) this FLIP builds on, so the net-new code surface is bounded. Looking forward to your feedback, especially on the four points raised in the previous mail (scope vs ML_PREDICT/flink-agents, LoadProbe SPI surface and scoping, two-tier circuit breaker, and module split). Thanks, featzhang On Sun, Apr 19, 2026 at 12:42 AM FeatZhang <[email protected]> wrote: > Hi Shekhar, > > Thanks for the question — this is a very important point to clarify. > > I think the key distinction here is between *flink-agents as an > orchestration layer* and the *streaming inference runtime as an execution > layer inside Flink core*. > 1. What flink-agents typically covers > > From my understanding, flink-agents focuses on: > > - high-level AI agent orchestration > - tool/function calling patterns > - workflow composition of LLM / AI tasks > - user-defined inference logic and chaining > > In other words, it is primarily an *application / orchestration framework > built on top of Flink*. > ------------------------------ > 2. What this FLIP is targeting (runtime layer) > > This proposal is focused on *execution-level capabilities inside Flink > runtime*, which are typically not handled by flink-agents: > 2.1 Streaming-aware execution semantics > > - adaptive batching under backpressure > - latency-aware scheduling of inference requests > - concurrency control per operator instance > > ------------------------------ > 2.2 Fault tolerance at runtime level > > - standardized retry / fallback policies > - circuit breaker integration > - timeout isolation at operator level > > These are currently implemented inconsistently in user code or async > functions. > ------------------------------ > 2.3 Backend abstraction and lifecycle management > > - unified inference backend abstraction (HTTP / Triton / custom) > - connection pooling and lifecycle management > - resource-aware execution coordination > > ------------------------------ > 2.4 Integration with Flink execution model > > - checkpoint-aware inference execution > - operator lifecycle integration > - coordination with task scheduling and backpressure > > ------------------------------ > 3. Why this cannot be fully handled in flink-agents > > The main limitation is that flink-agents operates at a *logical > orchestration level*, while these requirements are deeply tied to: > > - task scheduling > - operator execution semantics > - backpressure model > - runtime fault tolerance > > These are *core streaming engine concerns*, not orchestration logic. > ------------------------------ > 4. Relationship between the two > > To be clear, I see these as complementary: > > - flink-agents → AI workflow orchestration layer > - inference runtime (this FLIP) → execution layer for AI workloads > inside Flink > > A natural integration point would be: > flink-agents generating workflows that execute on top of this runtime > abstraction. > ------------------------------ > 5. Open question > > It would be great to align on: > > - whether we agree on this layering separation > - whether existing flink-agents scope already includes any > runtime-level execution concerns that I might have missed > > Happy to adjust the proposal if there is overlap. > > Thanks, > featzhang > > Shekhar Rajak <[email protected]> 于2026年4月19日周日 00:17写道: > >> Hi, >> I’m trying to understand - what are the features we are talking about >> which can not be implemented in flink-agents sub project ? >> >> Shekhar >> >> On Saturday, April 18, 2026, 21:39, FeatZhang <[email protected]> >> wrote: >> >> Hi Flink devs, >> >> I would like to start a discussion on a missing piece in Flink’s current >> AI/ML inference capabilities and propose a FLIP for a *streaming-native AI >> inference runtime layer*. >> Motivation >> >> Apache Flink currently provides basic AI inference capabilities through >> SQL-level constructs such as ML_PREDICT and related functions. These are >> useful for integrating external models into batch and streaming pipelines. >> >> However, in production AI workloads (especially real-time inference and >> LLM >> serving), we observe several gaps: >> >> - No unified runtime abstraction for inference execution >> - No streaming-native batching or latency-aware scheduling >> - Limited support for backpressure-aware inference control >> - No built-in retry, fallback, or circuit breaker mechanisms >> - Fragmented integration with external inference systems (e.g., HTTP >> services, Triton, LLM endpoints) >> >> As a result, users often re-implement these capabilities in user-defined >> functions, leading to inconsistent behavior and duplicated complexity. >> ------------------------------ >> Proposal (High-level) >> >> This FLIP proposes introducing a *Streaming-native AI Inference Runtime >> Layer* in Flink, providing: >> >> - A unified inference operator abstraction >> - Adaptive batching and concurrency control >> - Backpressure-aware request scheduling >> - Pluggable inference backends (HTTP / Triton / custom services) >> - Built-in reliability mechanisms (retry, timeout, circuit breaker) >> - Standard metrics and observability hooks >> >> ------------------------------ >> Design Overview >> >> The high-level architecture would look like: >> >> DataStream / Table API >> ↓ >> Inference Operator Layer >> ↓ >> Inference Execution Engine >> ↓ >> Pluggable Inference Backend >> >> This layer would integrate with Flink’s existing streaming runtime and >> remain fully compatible with current SQL/Table APIs. >> ------------------------------ >> Non-goals >> >> - This does NOT replace ML_PREDICT or existing SQL semantics >> - This does NOT introduce a new ML training framework >> - This is not tied to any specific inference engine >> >> ------------------------------ >> Why now >> >> We see increasing adoption of Flink for real-time AI workloads, including: >> >> - streaming inference >> - LLM-based pipelines >> - hybrid AI + data processing workflows >> >> However, the lack of a standardized runtime abstraction makes production >> deployments complex and inconsistent. >> ------------------------------ >> Request for feedback >> >> I would like feedback on: >> >> 1. Whether a dedicated inference runtime layer fits within Flink’s >> architectural direction >> 2. Preferred integration approach (Table API, DataStream, or both) >> 3. Scope of built-in features vs user-defined extensibility >> 4. Any existing efforts or ongoing work in this direction >> >> If there is agreement on direction, I will follow up with a more detailed >> FLIP design document. >> ------------------------------ >> >> Thanks, >> featzhang >> >> >> >>
