Hi Yuan,

Thank you for bringing up the remote compaction and shared state perspectives — 
these 
are excellent examples of how RPC Operator can serve beyond AI workloads. That 
said, 
making it fully usable in these scenarios still requires significant follow-up 
work. This FLIP 
is the first step — establishing the core primitive and interfaces.


1. Remote Compaction & SQL integration
Remote compaction is a great use case. RPC Operator can naturally satisfy its 
core 
requirements, but SQL integration is indeed a key question. We see two 
approaches for 
using RPC Operator services from SQL:


a. Explicit usage
Users define an RPC service through extended SQL DDL and access it explicitly 
from 
UDFs via `FunctionContext.getRosClient()`. This gives users full control over 
service 
definition and invocation.
b. Implicit rewriting
Users annotate a SQL UDF to indicate that it should be decoupled from the data 
plane. 
The engine automatically rewrites the UDF into an independent RPC Operator 
service 
and replaces the original topology node with a lightweight proxy operator that 
forwards 
requests to the service. This is precisely the "Automatic Rewriting" capability 
described 
in the Follow-up Tasks section.


Both approaches build on top of the core interfaces proposed in this FLIP. 
However, each 
involves substantial design work of its own — SQL DDL extensions, planner 
integration, 
and rewriting rules — and deserves dedicated discussion in follow-up FLIPs.


2. Stateful service & checkpoint support
The current FLIP intentionally scopes RPC Operator as stateless, which is a 
pragmatic 
trade-off between implementation complexity and scenario coverage. Looking 
ahead, 
stateful service support is an important direction. Even without requiring 
global consistency 
with the data plane, enabling an independent consistency model on the RPC 
Operator 
side (e.g., independent snapshots with at-least-once or best-effort guarantees) 
would 
significantly broaden the range of use cases. I think this is well worth a 
dedicated FLIP to 
explore in depth.


Thank you again for the feedback! Please feel free to follow up if you have any 
further 
questions or suggestions.




Best,
Yi



At 2026-06-03 17:53:32, "Yuan Mei" <[email protected]> wrote:
>Hi Yi and everyone,
>
>Thanks for the proposal! RPC Operator is a great extension to Flink and it
>can unlock a lot of Flink's capabilities, potentially, such as doing remote
>compaction for disaggregation and providing Flink with global/shareable
>state capabilities (the same state data can be accessed across parallelisms
>and across operators).
>
>However, I have a few questions and thoughts regarding the design:
>
>   -
>
>   Remote Compaction can be reused under the current RPC service setting.
>   The current Remote Compaction service (PoC), due to HA and other
>   considerations, *is set up as another Flink job.* The basic requirements
>   for this service are:
>   -
>
>      Multiple operators and instances can access the same RPC service.
>      -
>
>      Configurable with different types of CUs and needs a basic load
>      balancing policy.
>
>                But a major issue here is that disaggregation currently
>mainly integrate with SQL operators. How & what is the plan to integrate
>RPC Service with SQL?                  Are there any potential issues (I
>didn't see any discussion regarding SQL integration in this FLIP)
>
>   -
>
>   From the proposal, the RpcOperator appears to be stateless and does not
>   support checkpoints; it specifically mentioned the difference with* data
>   plane*. If we want to support sharable state, checkpoint mechanisms
>   might also be needed under different semantic consistency guarantees. Of
>   course, this part can be discussed separately and does not need to block
>   this FLIP.
>
>
>Best
>Yuan
>
>On Wed, Jun 3, 2026 at 2:21 PM Lei Yang <[email protected]> wrote:
>
>> Hi Yi,
>>
>> Thank you for the great work on FLIP-582. As GPU resources become
>> increasingly valuable, RpcOperator can significantly improve resource
>> utilization and strengthening Flink’s competitiveness for AI workloads.I am
>> particularly interested in the failover and flexible scaling aspects of
>> this
>> FLIP, and I have two questions that I hope you can help clarify.
>>
>> 1. Relationship between RpcOperator and JobGraph recovery semantics
>>
>> The current FLIP models RpcOperator as an independent JobVertex in
>> the JobGraph. This means that although it can be isolated from the data
>> plane in terms of resources and regions, it still belongs to the same job
>> graph in terms of scheduling, recovery, and rescaling semantics.
>>
>> For example, in AdaptiveScheduler, rescaling or failure recovery may enter
>> the Restarting -> CreatingExecutionGraph path and rebuild the
>> ExecutionGraph.
>> An external autoscaler can also adjust the parallelism requirements of job
>> vertices through the JobResourceRequirements REST API; with
>> AdaptiveScheduler,
>> this may further trigger rescale / restart. In such cases, RpcOperator may
>> also
>> be restarted together with the data-plane tasks, since it is part of the
>> same
>> JobGraph / ExecutionGraph.
>>
>> This seems to leave some gap with the goal of RpcOperator being an
>> independent
>> service that is not affected by the data plane. Therefore, I would like to
>> confirm
>> whether the FLIP plans to introduce a more fine-grained recovery and
>> restart
>> mechanism, so that RpcOperator can restart, fail over, or rescale
>> independently
>> from data processing vertices.
>>
>> 2. Client support for future flexible scaling
>>
>> The FLIP mentions that RpcOperator instances are independent service
>> instances, and that an instance going online or offline should not affect
>> other
>> instances or the data processing flow. I understand that the current FLIP
>> may
>> not need to fully support dynamic scaling in the first phase. However, if
>> flexible
>> scaling of RpcOperator is expected in the future, the client may need to be
>> aware of changes in RpcOperator parallelism and the instance list.
>>
>> For example, during a future scale-out, the system may start a new
>> RpcOperator
>>  instance without restarting existing ones. After the new instance becomes
>> ready,
>> the client needs to discover it in time and include it in request routing.
>> During scale-in,
>> the client also needs to detect instance removal in time and avoid sending
>> new
>> requests to instances that are about to exit.
>>
>> Therefore, I would like to confirm whether the current ROSClient design can
>> support push-based discovery of RpcOperator instance additions and removals
>>  for future no-restart dynamic scaling.
>>
>> Best,
>> Lei
>>
>> Yi Zhang <[email protected]> 于2026年5月27日周三 14:12写道:
>>
>> > Hi everyone,
>> >
>> >
>> >
>> > I would like to start a discussion on FLIP-582: Support RpcOperator
>> > Service [1].
>> >
>> >
>> > AI-oriented workloads like multimodal data processing and model inference
>> > are
>> > growing rapidly in recent years. These workloads are characterized by
>> > expensive
>> > resources (GPUs) and high initialization costs (seconds to minutes for
>> > model
>> > loading). In today's Flink, embedding them in the data plane couples
>> their
>> > parallelism and failover with surrounding operators; deploying them as
>> > external
>> > services disconnects their lifecycle from the job and doubles operational
>> > overhead.
>> >
>> >
>> > This FLIP introduces RpcOperator Service — a framework-level primitive
>> > that runs
>> > user-defined compute as RPC services in an independent Pipelined Region
>> > within
>> > the Flink job. Because the service is isolated at the scheduling level,
>> it
>> > can achieve
>> > fault isolation, independent scaling, and dedicated resource allocation.
>> > As a native
>> > Flink primitive, it also lays the foundation for automatic flow control,
>> > flexible load
>> > balancing, and coordinated auto-scaling — all without introducing
>> external
>> > infrastructure or additional operational burden.
>> >
>> >
>> >
>> >
>> > Looking forward to your feedback and suggestions!
>> >
>> >
>> >
>> >
>> > [1]
>> >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-582%3A+Support+RpcOperator+Service
>> >
>> >
>> >
>> >
>> >
>> > Best Regards,
>> > Yi Zhang
>>

Reply via email to