Hi all,

Apologies — I forgot to include the document link in my previous mail.

FLIP draft:
 FLIP-XXX: Streaming-native AI Inference Runtime
<https://drive.google.com/open?id=17PpHDjFCeL2wcp9Q7t7snyvKnW2ugBKZ-rtqo_nMKQg>


For convenience, the structure of the document is:

§1 Motivation — why a unified inference runtime, and what's missing today.
§2 Public Interfaces — InferenceFunction / InferenceBackend / LoadProbe
SPI, DataStream and SQL surfaces, configuration keys, module split.
§3 Proposed Changes — operator internals, batching, retry/timeout, two-tier
circuit breaker, fault-tolerance semantics, load-aware scheduling loop,
multi-model demo.
§4 Compatibility / §5 Migration — incremental, opt-in; no breaking changes
to existing Async I/O or ML_PREDICT users.
§6 Test Plan.
§7 Rejected Alternatives — including why this is not done as an external
library, why per-call SQL hints are not used for per-model config, and why
a per-subtask-only circuit breaker is insufficient.
§8 Future Work.
Appendix A — concrete PRs (merged and in-review) this FLIP builds on, so
the net-new code surface is bounded.
Looking forward to your feedback, especially on the four points raised in
the previous mail (scope vs ML_PREDICT/flink-agents, LoadProbe SPI surface
and scoping, two-tier circuit breaker, and module split).

Thanks, featzhang

On Sun, Apr 19, 2026 at 12:42 AM FeatZhang <[email protected]> wrote:

> Hi Shekhar,
>
> Thanks for the question — this is a very important point to clarify.
>
> I think the key distinction here is between *flink-agents as an
> orchestration layer* and the *streaming inference runtime as an execution
> layer inside Flink core*.
> 1. What flink-agents typically covers
>
> From my understanding, flink-agents focuses on:
>
>    - high-level AI agent orchestration
>    - tool/function calling patterns
>    - workflow composition of LLM / AI tasks
>    - user-defined inference logic and chaining
>
> In other words, it is primarily an *application / orchestration framework
> built on top of Flink*.
> ------------------------------
> 2. What this FLIP is targeting (runtime layer)
>
> This proposal is focused on *execution-level capabilities inside Flink
> runtime*, which are typically not handled by flink-agents:
> 2.1 Streaming-aware execution semantics
>
>    - adaptive batching under backpressure
>    - latency-aware scheduling of inference requests
>    - concurrency control per operator instance
>
> ------------------------------
> 2.2 Fault tolerance at runtime level
>
>    - standardized retry / fallback policies
>    - circuit breaker integration
>    - timeout isolation at operator level
>
> These are currently implemented inconsistently in user code or async
> functions.
> ------------------------------
> 2.3 Backend abstraction and lifecycle management
>
>    - unified inference backend abstraction (HTTP / Triton / custom)
>    - connection pooling and lifecycle management
>    - resource-aware execution coordination
>
> ------------------------------
> 2.4 Integration with Flink execution model
>
>    - checkpoint-aware inference execution
>    - operator lifecycle integration
>    - coordination with task scheduling and backpressure
>
> ------------------------------
> 3. Why this cannot be fully handled in flink-agents
>
> The main limitation is that flink-agents operates at a *logical
> orchestration level*, while these requirements are deeply tied to:
>
>    - task scheduling
>    - operator execution semantics
>    - backpressure model
>    - runtime fault tolerance
>
> These are *core streaming engine concerns*, not orchestration logic.
> ------------------------------
> 4. Relationship between the two
>
> To be clear, I see these as complementary:
>
>    - flink-agents → AI workflow orchestration layer
>    - inference runtime (this FLIP) → execution layer for AI workloads
>    inside Flink
>
> A natural integration point would be:
> flink-agents generating workflows that execute on top of this runtime
> abstraction.
> ------------------------------
> 5. Open question
>
> It would be great to align on:
>
>    - whether we agree on this layering separation
>    - whether existing flink-agents scope already includes any
>    runtime-level execution concerns that I might have missed
>
> Happy to adjust the proposal if there is overlap.
>
> Thanks,
> featzhang
>
> Shekhar Rajak <[email protected]> 于2026年4月19日周日 00:17写道:
>
>> Hi,
>> I’m trying to understand - what are the features we are talking about
>> which can not be implemented in flink-agents sub project ?
>>
>> Shekhar
>>
>> On Saturday, April 18, 2026, 21:39, FeatZhang <[email protected]>
>> wrote:
>>
>> Hi Flink devs,
>>
>> I would like to start a discussion on a missing piece in Flink’s current
>> AI/ML inference capabilities and propose a FLIP for a *streaming-native AI
>> inference runtime layer*.
>> Motivation
>>
>> Apache Flink currently provides basic AI inference capabilities through
>> SQL-level constructs such as ML_PREDICT and related functions. These are
>> useful for integrating external models into batch and streaming pipelines.
>>
>> However, in production AI workloads (especially real-time inference and
>> LLM
>> serving), we observe several gaps:
>>
>>   - No unified runtime abstraction for inference execution
>>   - No streaming-native batching or latency-aware scheduling
>>   - Limited support for backpressure-aware inference control
>>   - No built-in retry, fallback, or circuit breaker mechanisms
>>   - Fragmented integration with external inference systems (e.g., HTTP
>>   services, Triton, LLM endpoints)
>>
>> As a result, users often re-implement these capabilities in user-defined
>> functions, leading to inconsistent behavior and duplicated complexity.
>> ------------------------------
>> Proposal (High-level)
>>
>> This FLIP proposes introducing a *Streaming-native AI Inference Runtime
>> Layer* in Flink, providing:
>>
>>   - A unified inference operator abstraction
>>   - Adaptive batching and concurrency control
>>   - Backpressure-aware request scheduling
>>   - Pluggable inference backends (HTTP / Triton / custom services)
>>   - Built-in reliability mechanisms (retry, timeout, circuit breaker)
>>   - Standard metrics and observability hooks
>>
>> ------------------------------
>> Design Overview
>>
>> The high-level architecture would look like:
>>
>> DataStream / Table API
>>         ↓
>> Inference Operator Layer
>>         ↓
>> Inference Execution Engine
>>         ↓
>> Pluggable Inference Backend
>>
>> This layer would integrate with Flink’s existing streaming runtime and
>> remain fully compatible with current SQL/Table APIs.
>> ------------------------------
>> Non-goals
>>
>>   - This does NOT replace ML_PREDICT or existing SQL semantics
>>   - This does NOT introduce a new ML training framework
>>   - This is not tied to any specific inference engine
>>
>> ------------------------------
>> Why now
>>
>> We see increasing adoption of Flink for real-time AI workloads, including:
>>
>>   - streaming inference
>>   - LLM-based pipelines
>>   - hybrid AI + data processing workflows
>>
>> However, the lack of a standardized runtime abstraction makes production
>> deployments complex and inconsistent.
>> ------------------------------
>> Request for feedback
>>
>> I would like feedback on:
>>
>>   1. Whether a dedicated inference runtime layer fits within Flink’s
>>   architectural direction
>>   2. Preferred integration approach (Table API, DataStream, or both)
>>   3. Scope of built-in features vs user-defined extensibility
>>   4. Any existing efforts or ongoing work in this direction
>>
>> If there is agreement on direction, I will follow up with a more detailed
>> FLIP design document.
>> ------------------------------
>>
>> Thanks,
>> featzhang
>>
>>
>>
>>

Reply via email to