Re: [DISCUSS] FLIP-531: Initiate Flink Agents as a new Sub-Project

Xintong Song Sun, 25 May 2025 18:49:30 -0700

@Robert,

1. I wasn't aware of the ASF subproject concept. Yes, the intention here is
to create a separate repository, just like flink-cdc and
flink-kubernetes-operator. I'll add a clarification in the FLIP.


2. Sorry for the confusion. I think we are just using Kafka as an example
here. I'll correct it in the FLIP.

I think there are two different ways to build a multi-agent system, and we
plan to support both.

   - Running multiple agents in the same Flink job. This means managing the
   lifecycle of the agents as a whole, and end-to-end checkpointing
   consistency across them. For this case, we do plan to leverage StateFun
   for the communication. As a first step, we will probably simply depend on
   StateFun to get it working quickly. In the long term, I think it makes
   sense to move the code that we want to reuse from StateFun into the new
   project, rather than depending on a no-longer-maintained project.
   - Running agents as separate Flink jobs. In this case, I agree we should
   leverage Flink's connector framework.
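
To make the second option a bit more concrete, here is a minimal Python
sketch of agents that only talk through an abstract channel, so that the
concrete transport (Kafka, Pulsar, or an in-memory queue for tests) is left
to the user. All names in it (`Channel`, `InMemoryChannel`, `Agent`) are
hypothetical and only for illustration; they are not part of Flink, StateFun,
or any Flink Agents API, and no design has been settled yet.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Callable, Deque, Optional


class Channel(ABC):
    """Hypothetical transport abstraction (an assumption, not a Flink API)."""

    @abstractmethod
    def send(self, message: str) -> None:
        """Publish one message to the channel."""

    @abstractmethod
    def receive(self) -> Optional[str]:
        """Return the next message, or None if the channel is empty."""


class InMemoryChannel(Channel):
    """Stand-in transport for local runs; a real setup would wrap a connector."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()

    def send(self, message: str) -> None:
        self._queue.append(message)

    def receive(self) -> Optional[str]:
        return self._queue.popleft() if self._queue else None


class Agent:
    """Reads from an inbox, applies a handler, writes the result to an outbox."""

    def __init__(self, handler: Callable[[str], str],
                 inbox: Channel, outbox: Channel) -> None:
        self.handler = handler
        self.inbox = inbox
        self.outbox = outbox

    def step(self) -> bool:
        """Process at most one message; return True if one was handled."""
        message = self.inbox.receive()
        if message is None:
            return False
        self.outbox.send(self.handler(message))
        return True


# Two agents chained through channels; neither knows the transport behind them.
source, middle, sink = InMemoryChannel(), InMemoryChannel(), InMemoryChannel()
first = Agent(lambda m: m.upper(), source, middle)
second = Agent(lambda m: f"reviewed:{m}", middle, sink)

source.send("event-1")
first.step()
second.step()
print(sink.receive())  # prints: reviewed:EVENT-1
```

Because the agents depend only on `Channel`, replacing `InMemoryChannel` with
a connector-backed implementation would not require changing the agent code.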

Thanks for pointing out the issues.

Best,

Xintong



On Sun, May 25, 2025 at 10:38 AM Yuan Mei <yuanmei.w...@gmail.com> wrote:

> Thanks Xintong, Sean and Chris.
>
> This is a great step forward for the future of Flink. I'm really looking
> forward to it!
>
> Best,
> Yuan
>
> On Sat, May 24, 2025 at 10:00 PM Robert Metzger <rmetz...@apache.org>
> wrote:
>
> > Thanks for the nice proposal.
> >
> > One question: The proposal talks a lot about establishing a "sub project".
> > If I understand correctly, the ASF has a concept of subprojects, with
> > sub-project committers, mailing lists, Jira projects, etc. [1][2].
> >
> > Is the intention of this proposal to establish such a sub project?
> > Or is the intention to basically create a "flink-agents" git repository,
> > to which all existing Flink committers have access, and where the Flink
> > PMC votes on releases? (I assume this is the intention). If so, I would
> > update the proposal to talk about a new repository, or at least clarify
> > the immediate implications for the project.
> >
> > My second question is about this key feature:
> > > *Inter-Agent Communication:* Built-in support for asynchronous
> > > agent-to-agent communication using Kafka.
> >
> > Does this mean the code from the flink-agents repo will have a dependency
> > on AK? One of the big benefits of Flink is that it is independent of the
> > underlying message streaming system. Wouldn't it be more elegant and
> > actually easier to rely on the Flink connector framework here, and leave
> > the concrete implementation to the user?
> > Also, I wonder why we need to rely on an external message streaming
> > system at all? Is it because we want to be able to send messages in
> > arbitrary directions? If so, maybe we can reuse code from Flink StateFun?
> > I personally would think that relying on Flink's internal data transfer
> > model by default brings a lot of cost, performance, operations and
> > implementation benefits ... and users can still manually set up a
> > connector using a Kafka, Pulsar or PubSub connection. WDYT?
> >
> > Best,
> > Robert
> >
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Sub+Projects
> > [2]
> > https://cwiki.apache.org/confluence/display/HADOOP/Apache+Hadoop+Ozone+-+sub-project+to+Apache+TLP+proposal
> >
> >
> > On Fri, May 23, 2025 at 6:14 AM Xintong Song <tonysong...@gmail.com>
> > wrote:
> >
> > > @Jing,
> > >
> > > I think the FLIP already includes the high-level design goals, by
> > > listing the key features that we plan to support in the Proposed
> > > Solution section, and demonstrating what using the framework may look
> > > like with the code examples. Of course the high-level goals need to be
> > > further detailed, which will be the next step. The purpose of this FLIP
> > > is to get community consensus on initiating this new project. On the
> > > other hand, technical design takes time to discuss, and likely requires
> > > continuous iteration as the project is being developed. So I think it
> > > makes sense to separate the design discussions from the initiation
> > > proposal.
> > >
> > > Of course any contributor's thoughts and inputs are valuable to the
> > > project. Efficiency also matters: the agentic AI industry is growing
> > > fast, and we really need to keep up with the pace. I believe it would be
> > > more efficient to come up with an initial draft design / implementation
> > > that everyone can comment on, compared to randomly collecting ideas
> > > when we have nothing. Fortunately, the project is at an early stage with
> > > no historical burdens, which means we don't need to make sure
> > > everything is perfect in advance, and can always correct / change /
> > > rework things if needed. We can at least do that before we commit to
> > > product compatibility with the first formal release. This is why we
> > > suggested applying a light, execution-first process, as mentioned in the
> > > Operating Model section. I would not be too concerned about not
> > > collecting enough inputs at the beginning, because we can always adjust
> > > things afterwards based on new suggestions and opinions.
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Fri, May 23, 2025 at 12:13 AM Jing Ge <j...@ververica.com.invalid>
> > > wrote:
> > >
> > > > It is great to see that everyone in this thread agreed with the
> > > > high-level proposal. Just so excited and could not stop asking
> > > > questions :-) Thanks Xintong for the update!
> > > >
> > > > I'd like to share a few more thoughts on my questions and your
> > > > additional input, which then lead to a small suggestion.
> > > >
> > > > 1. It is great to support freestyle tools beyond the MCP protocol,
> > > > from a user's perspective. However, if we consider the agent framework
> > > > design, there might be some choices to make. For example, either we
> > > > stick to MCP internally and turn such external freestyle tools into
> > > > MCP internally, or we design a new abstraction to handle the diverse
> > > > function calls offered by different LLMs, kind of repeating what MCP
> > > > did. Another thought is that the sample API in the FLIP suggests that,
> > > > as a user, after registering an MCP server, I could use the follow-up
> > > > prompt() method to modify/extend the standard out-of-the-box context
> > > > provided by the MCP server. But this is too detailed and should not be
> > > > discussed in this high-level thread. Happy to join any (offline)
> > > > discussion and contribute.
> > > >
> > > > 3. Similar to microservices, there are a few use cases that are
> > > > sensitive to response latency, e.g. stock trading. But it is totally
> > > > fine to focus on asynchronous communication.
> > > >
> > > > 4. Because each of them has an individual focus and needs effort to
> > > > build, it was a question of priorities. Good to know Flink Agents
> > > > wants to cover both.
> > > >
> > > > 5. Great to know. I had a similar thought and was a little bit
> > > > confused, because state is more or less a low-level concept for
> > > > operators. Looking forward to understanding how to use it as agent
> > > > memory.
> > > >
> > > > What I actually tried to suggest with all these questions is: Does it
> > > > make sense to define some high-level design goals/criteria/guidelines?
> > > > For example:
> > > >
> > > > 1. support MCP natively
> > > > 2. single Agent development (for the first stage)
> > > > 3. only support event-driven asynchronous communication
> > > > 4. agent framework for both embedding and workflow development (same
> > > > priority)
> > > > 5. Flink state as memory
> > > > 6. support ReAct, don't support ReWOO (just as an example to show my
> > > > thought. In reality, ReWOO might be useful for some enterprise agents
> > > > considering the deterministic process. An example topic to be
> > > > discussed.)
> > > >
> > > > Any contributors in the community can also share their thoughts about
> > > > any high-level design guidelines to be collected at an early stage.
> > > >
> > > > The final chosen high-level guidelines could help get everyone on the
> > > > same page to understand and design the upcoming architecture, and
> > > > might also influence the future API design. WDYT?
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > >
> > > > On Thu, May 22, 2025 at 4:55 AM Xintong Song <tonysong...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks everyone for the positive feedback.
> > > > >
> > > > > As I said, this FLIP is intended for discussing high-level plans
> > > > > for the new project. The project itself is still at an early stage,
> > > > > and some of the technical designs and solutions are not completely
> > > > > ready yet. So atm I can only share some personal thoughts on the
> > > > > raised questions, and we are open to suggestions and opinions.
> > > > >
> > > > > @Jing
> > > > >
> > > > > 1. Regarding MCP, I think it's just one way (and likely a major way)
> > > > > of providing LLMs with context, but not the only way. E.g., a user
> > > > > may write a dedicated Python function and provide it to the LLM as a
> > > > > tool, which doesn't necessarily need to go through the MCP protocol.
> > > > > At the same time, the LLM may discover more available tools from an
> > > > > MCP server. These are just two different sources that the tools come
> > > > > from, and they can co-exist.
> > > > >
> > > > > 2. In the long term, yes, I think. As a first step, we probably will
> > > > > be more focused on how to build individual agents, less on
> > > > > interactions across multiple agents. Not saying we won't support MAS
> > > > > in the first step, but maybe not as complex as the A2A protocol.
> > > > >
> > > > > 3. Interactions between agents will be event-driven, so they are
> > > > > naturally asynchronous. I'm not entirely sure about use cases that
> > > > > prefer synchronous agent calls. Could you share some examples?
> > > > >
> > > > > 4. I think I didn't fully get the taxonomy here. I mean why
> > > > > embedding vs. workflow? From my understanding, I think Flink Agents
> > > > > should cover both use cases.
> > > > >
> > > > > 5. Yes, memory is considered. Actually, Flink's state management
> > > > > makes a good foundation for supporting agent memory.
> > > > >
> > > > > @Nishita
> > > > >
> > > > > 1. I think calling an external LLM is similar to an async operator
> > > > > in Flink, in terms of potential latency and backpressure issues.
> > > > > Flink's async operator already supports concurrent async calls, rate
> > > > > control, timeout handling, etc. But eventually, the bottleneck is on
> > > > > the external service side, and we expect model technology to keep
> > > > > improving, with higher throughput, lower latency, and better
> > > > > stability.
> > > > >
> > > > > 2. Good question. I think real-time event-driven processing is
> > > > > somewhat in conflict with asynchronous human-in-the-loop feedback.
> > > > > One idea, which I've seen people use, is to build another agent for
> > > > > validating results and generating feedback. Another idea is to
> > > > > collect samples of results for asynchronous human-in-the-loop
> > > > > validation. But these are just rough ideas. I don't have
> > > > > sophisticated answers at the moment.
> > > > >
> > > > > Best,
> > > > >
> > > > > Xintong
> > > > >
> > > > >
> > > > >
> > > > > On Thu, May 22, 2025 at 3:26 AM Yash Anand <yan...@confluent.io.invalid>
> > > > > wrote:
> > > > >
> > > > > > Thank you for the proposal—this initiative will make it much
> easier
> > > to
> > > > > > build event-driven AI agents seamlessly.
> > > > > >
> > > > > > +1 for the proposed Flink Agents sub-project!
> > > > > >
> > > > > > On Wed, May 21, 2025 at 9:43 AM Mayank Juneja <
> > > > mayankjunej...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > +1 on the FLIP. This is a solid step toward building an agentic
> > > > > offering
> > > > > > > that really leans into Flink’s strengths, and builds on the
> > > momentum
> > > > > from
> > > > > > > recent API improvements like FLIP-437 and the proposed
> FLIP-529.
> > > > > > >
> > > > > > > Also wanted to echo the point around agent memory. More
> advanced
> > > > > agentic
> > > > > > > systems really benefit from both short-term and long-term
> memory.
> > > > While
> > > > > > > long-term memory can live in databases (including vector
> stores),
> > > > > having
> > > > > > a
> > > > > > > built-in abstraction for managing short-term memory would be
> > super
> > > > > > useful.
> > > > > > > Doesn’t need to be in the MVP, but definitely worth considering
> > for
> > > > the
> > > > > > > roadmap.
> > > > > > > Best,
> > > > > > > Mayank
> > > > > > >
> > > > > > >
> > > > > > > On Wed, May 21, 2025 at 4:54 PM Lincoln Lee <
> > > lincoln.8...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1 for the proposed flink agents sub-project!
> > > > > > > >
> > > > > > > > This aligns perfectly with flink's core strengths in
> real-time
> > > > event
> > > > > > > > processing and stateful computations.
> > > > > > > >
> > > > > > > > Thanks for driving this initiative and looking forward to the
> > > > > > > > detailed technical designs.
> > > > > > > >
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Lincoln Lee
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Wed, May 21, 2025 at 11:28 PM Hao Li <lihao3...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Xintong, Sean and Chris,
> > > > > > > > >
> > > > > > > > > Thanks for driving the initiative. Very exciting to bring
> AI
> > > > Agent
> > > > > to
> > > > > > > > Flink
> > > > > > > > > to empower the streaming use cases.
> > > > > > > > >
> > > > > > > > > +1 to the FLIP.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Hao
> > > > > > > > >
> > > > > > > > > On Wed, May 21, 2025 at 7:35 AM Nishita Pattanayak <
> > > > > > > > > nishita.pattana...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Sean, Chris and Xintong. This seems to be a very
> > exciting
> > > > > > > > sub-project.
> > > > > > > > > > +1 for "flink-agents" sub-project.
> > > > > > > > > >
> > > > > > > > > > I was going through the FLIP, and had some questions regarding it:
> > > > > > > > > > 1. How would the external model calls (e.g., OpenAI or internal
> > > > > > > > > > LLMs) be integrated into Flink tasks without introducing
> > > > > > > > > > backpressure or latency issues?
> > > > > > > > > > In my experience, calling an external LLM has the following
> > > > > > > > > > risks: it is latency-sensitive (LLM inference can take hundreds
> > > > > > > > > > of milliseconds to seconds), flaky (network issues, rate limits),
> > > > > > > > > > and non-deterministic (with timeouts, retries, etc.). It would be
> > > > > > > > > > great to work/brainstorm on how we solve these issues.
> > > > > > > > > > 2. In traditional agent workflows, user feedback often plays a
> > > > > > > > > > key role in validating and improving agent outputs. In a
> > > > > > > > > > continuous, long-running Flink-based agent system, where
> > > > > > > > > > interactions might not be user-facing or synchronous, how do we
> > > > > > > > > > incorporate human-in-the-loop feedback or correctness signals to
> > > > > > > > > > validate and iteratively improve agent behavior?
> > > > > > > > > >
> > > > > > > > > > This is a really exciting direction for the Flink
> > ecosystem.
> > > > The
> > > > > > idea
> > > > > > > > of
> > > > > > > > > > building long-running, context-aware agents natively on
> > Flink
> > > > > feels
> > > > > > > > like
> > > > > > > > > a
> > > > > > > > > > natural evolution of stream processing. I'd love to see
> > this
> > > > > mature
> > > > > > > and
> > > > > > > > > > would be excited to contribute in any way I can to help
> > > > > > productionize
> > > > > > > > and
> > > > > > > > > > validate this in real-world use cases.
> > > > > > > > > >
> > > > > > > > > > On Wed, May 21, 2025 at 8:52 AM Xintong Song <
> > > > > > tonysong...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi devs,
> > > > > > > > > > >
> > > > > > > > > > > Sean, Chris and I would like to start a discussion on
> > > > FLIP-531
> > > > > > [1],
> > > > > > > > > about
> > > > > > > > > > > introducing a new sub-project, Flink Agents.
> > > > > > > > > > >
> > > > > > > > > > > With the rise of agentic AI, we have identified great new
> > > > > > > > > > > opportunities for Flink, particularly in system-triggered agent
> > > > > > > > > > > scenarios. We believe the future of AI agent applications is
> > > > > > > > > > > industrialized, where agents will not only be triggered by
> > > > > > > > > > > users, but increasingly by systems as well. Flink's capabilities
> > > > > > > > > > > in real-time distributed event processing, state management and
> > > > > > > > > > > exactly-once consistency and fault tolerance make it well-suited
> > > > > > > > > > > as a framework for building such system-triggered agents.
> > > > > > > > > > > Furthermore, system-triggered agents are often tightly coupled
> > > > > > > > > > > with data processing. Flink's outstanding data processing
> > > > > > > > > > > capabilities allow seamless integration between data and agentic
> > > > > > > > > > > processing. These capabilities differentiate Flink from other
> > > > > > > > > > > agent frameworks, giving it unique advantages in the context of
> > > > > > > > > > > system-triggered agents.
> > > > > > > > > > >
> > > > > > > > > > > We propose this effort as a sub-project of Apache Flink, with a
> > > > > > > > > > > separate code repository and a lightweight development process,
> > > > > > > > > > > for rapid iteration during the early stage.
> > > > > > > > > > >
> > > > > > > > > > > Please note that this FLIP is focused on the high-level plans,
> > > > > > > > > > > including motivation, positioning, goals, roadmap, and operating
> > > > > > > > > > > model of the project. Detailed technical design is out of scope
> > > > > > > > > > > and will be discussed during the rapid prototyping and
> > > > > > > > > > > iterations.
> > > > > > > > > > >
> > > > > > > > > > > For more details, please check the FLIP [1]. Looking
> > > forward
> > > > to
> > > > > > > your
> > > > > > > > > > > feedback.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > >
> > > > > > > > > > > Xintong
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > [1]
> > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-531%3A+Initiate+Flink+Agents+as+a+new+Sub-Peoject
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > *Mayank Juneja*
> > > > > > > Product Manager | Data Streaming and AI
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
