Thanks Xintong, Sean and Chris. This is a great step forward for the future of Flink. I'm really looking forward to it!
Best,
Yuan

On Sat, May 24, 2025 at 10:00 PM Robert Metzger <rmetz...@apache.org> wrote:

> Thanks for the nice proposal.
>
> One question: The proposal talks a lot about establishing a "sub project". If I understand correctly, the ASF has a concept of subprojects, with sub-project committers, mailing lists, Jira projects, etc. [1][2].
>
> Is the intention of this proposal to establish such a sub-project? Or is the intention to basically create a "flink-agents" git repository that all existing Flink committers have access to, and where the Flink PMC votes on releases? (I assume this is the intention.) If so, I would update the proposal to talk about a new repository, or at least clarify the immediate implications for the project.
>
> My second question is about this key feature:
>
> *Inter-Agent Communication:* Built-in support for asynchronous agent-to-agent communication using Kafka.
>
> Does this mean the code from the flink-agents repo will have a dependency on AK? One of the big benefits of Flink is that it is independent of the underlying message streaming system. Wouldn't it be more elegant and actually easier to rely on the Flink connector framework here, and leave the concrete implementation to the user?
> Also, I wonder why we need to rely on an external message streaming system at all. Is it because we want to be able to send messages in arbitrary directions? If so, maybe we can reuse code from Flink Statefun? I personally would think that relying on Flink's internal data transfer model by default brings a lot of cost, performance, operations and implementation benefits ... and users can still manually set up a connector using a Kafka, Pulsar or PubSub connection. WDYT?
>
> Best,
> Robert
>
>
> [1] https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Sub+Projects
> [2] https://cwiki.apache.org/confluence/display/HADOOP/Apache+Hadoop+Ozone+-+sub-project+to+Apache+TLP+proposal
>
>
> On Fri, May 23, 2025 at 6:14 AM Xintong Song <tonysong...@gmail.com> wrote:
>
> > @Jing,
> >
> > I think the FLIP already includes the high-level design goals, by listing the key features that we plan to support in the Proposed Solution section, and demonstrating what using the framework may look like with the code examples. Of course the high-level goals need to be further detailed, which will be the next step. The purpose of this FLIP is to get community consensus on initiating this new project. On the other hand, technical design takes time to discuss, and likely requires continuous iteration as the project is being developed. So I think it makes sense to separate the design discussions from the initiation proposal.
> >
> > Of course any contributor's thoughts and inputs are valuable to the project. And efficiency also matters: as the agentic AI industry grows fast, we really need to keep up with the pace. I believe it would be more efficient to come up with some initial draft design / implementation that everyone can comment on, compared to just randomly collecting ideas when we have nothing. Fortunately, the project is at an early stage with no historical burdens, which means we don't need to carefully make sure everything is perfect in advance, and can always correct / change / rework things if needed. We can at least do that before we commit to the product compatibility with the first formal release. This is why we suggested applying a light, execution-first process, as mentioned in the Operating Model section.
> > I would not be too concerned about not collecting enough inputs at the beginning, because we can always adjust things afterwards based on new suggestions and opinions.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Fri, May 23, 2025 at 12:13 AM Jing Ge <j...@ververica.com.invalid> wrote:
> >
> > > It is great to see that everyone in this thread agreed with the high-level proposal. Just so excited and could not stop asking questions :-) Thanks Xintong for the update!
> > >
> > > I'd like to share a few more thoughts on my questions and your additional input, which lead to a small suggestion.
> > >
> > > 1. It is great to support freestyle tools beyond the MCP protocol, from the user's perspective. However, if we consider agent framework design, there might be some choices to make. For example, either we stick to MCP internally and turn such external freestyle tools into MCP internally, or we design a new abstraction to handle diverse function calls offered by different LLMs, kind of repeating what MCP did. Another thought is that the sample API in the FLIP somehow suggests that, as a user, after an MCP server registration, I could use the closely following prompt() method to modify/extend the standard out-of-box context provided by the MCP server. But it is too detailed and should not be discussed in this high-level thread. Happy to join any (offline) discussion and contribute.
> > >
> > > 3. Similar to microservices, there are a few use cases that are sensitive to response latency, e.g. stock trading. But it is totally fine to focus on asynchronous communication.
> > >
> > > 4. Because each of them has an individual focus and needs effort to build. It was a question of priorities. Good to know Flink Agents wants to cover both.
> > >
> > > 5. Great to know.
> > > I had a similar thought and was a little bit confused, because state is more or less a low-level concept for operators. Looking forward to understanding how to use it as agent memory.
> > >
> > > What I actually tried to suggest with all these questions is: Does it make sense to define some high-level design goals/criteria/guidelines? Like (as an example):
> > >
> > > 1. support MCP natively
> > > 2. single agent development (for the first stage)
> > > 3. only support event-driven asynchronous communication
> > > 4. agent framework for both embedding and workflow development (same priority)
> > > 5. Flink state as memory
> > > 6. support ReAct, don't support ReWOO (just as an example to show my thought. In reality, ReWOO might be useful for some enterprise agents considering the deterministic process. An example topic to be discussed.)
> > >
> > > Any contributors in the community can also share their thoughts about any high-level design guidelines to be collected at an early stage.
> > >
> > > The final chosen high-level guidelines could help get everyone on the same page to understand and design the upcoming architecture, and might also influence the future API design. WDYT?
> > >
> > > Best regards,
> > > Jing
> > >
> > >
> > > On Thu, May 22, 2025 at 4:55 AM Xintong Song <tonysong...@gmail.com> wrote:
> > >
> > > > Thanks everyone for the positive feedback.
> > > >
> > > > As I said, this FLIP is intended for discussing high-level plans for the new project. The project itself is still at an early stage, and some of the technical designs and solutions are not completely ready yet. So atm I can only share some personal thoughts on the raised questions, and we are open to suggestions and opinions.
> > > >
> > > > @Jing
> > > >
> > > > 1. Regarding MCP, I think it's just one way (and likely a major way) of providing LLMs with context, but not the only way. E.g., a user may write a dedicated Python function and provide it to the LLM as a tool, which doesn't necessarily need to go through the MCP protocol. At the same time, the LLM may discover more available tools from an MCP server. These are just 2 different sources that the tools come from, and they can co-exist.
> > > >
> > > > 2. In the long term, yes, I think. As a first step, we probably will be more focused on how to build individual agents, less on interactions across multiple agents. Not saying we won't support MAS in the first step, but maybe not as complex as the A2A protocol.
> > > >
> > > > 3. Interactions between agents will be event-driven, so they are naturally asynchronous. I'm not entirely sure about use cases that prefer synchronous agent calls. Could you share some examples?
> > > >
> > > > 4. I think I didn't fully get the taxonomy here. I mean why embedding vs. workflow? From my understanding, I think Flink Agents should cover both use cases.
> > > >
> > > > 5. Yes, memory is considered. Actually, Flink's state management makes a good foundation for supporting agent memory.
> > > >
> > > > @Nishita
> > > >
> > > > 1. I think calling an external LLM is similar to an async operator in Flink, in terms of potential latency and backpressure issues. Flink's async operator already supports concurrent async calls, rate control, timeout handling, etc. But eventually, the bottleneck is at the external service side, and we expect the model techniques will keep improving, with larger throughput, less latency, and better stability.
> > > >
> > > > 2. Good question.
> > > > I think real-time event-driven processing is somewhat in conflict with asynchronous human-in-the-loop feedback. One idea, which I've seen people do, is to build another agent for validating results and generating feedback. Another idea is to collect samples of results for asynchronous human-in-the-loop validations. But these are just rough ideas. I don't have sophisticated answers at the moment.
> > > >
> > > > Best,
> > > >
> > > > Xintong
> > > >
> > > >
> > > >
> > > > On Thu, May 22, 2025 at 3:26 AM Yash Anand <yan...@confluent.io.invalid> wrote:
> > > >
> > > > > Thank you for the proposal. This initiative will make it much easier to build event-driven AI agents seamlessly.
> > > > >
> > > > > +1 for the proposed Flink Agents sub-project!
> > > > >
> > > > > On Wed, May 21, 2025 at 9:43 AM Mayank Juneja <mayankjunej...@gmail.com> wrote:
> > > > >
> > > > > > +1 on the FLIP. This is a solid step toward building an agentic offering that really leans into Flink’s strengths, and builds on the momentum from recent API improvements like FLIP-437 and the proposed FLIP-529.
> > > > > >
> > > > > > Also wanted to echo the point around agent memory. More advanced agentic systems really benefit from both short-term and long-term memory. While long-term memory can live in databases (including vector stores), having a built-in abstraction for managing short-term memory would be super useful. Doesn’t need to be in the MVP, but definitely worth considering for the roadmap.
> > > > > > Best,
> > > > > > Mayank
> > > > > >
> > > > > >
> > > > > > On Wed, May 21, 2025 at 4:54 PM Lincoln Lee <lincoln.8...@gmail.com> wrote:
> > > > > >
> > > > > > > +1 for the proposed flink agents sub-project!
> > > > > > >
> > > > > > > This aligns perfectly with Flink's core strengths in real-time event processing and stateful computations.
> > > > > > >
> > > > > > > Thanks for driving this initiative and looking forward to the detailed technical designs.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Lincoln Lee
> > > > > > >
> > > > > > >
> > > > > > > On Wed, May 21, 2025 at 23:28, Hao Li <lihao3...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Xintong, Sean and Chris,
> > > > > > > >
> > > > > > > > Thanks for driving the initiative. Very exciting to bring AI Agent to Flink to empower the streaming use cases.
> > > > > > > >
> > > > > > > > +1 to the FLIP.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Hao
> > > > > > > >
> > > > > > > > On Wed, May 21, 2025 at 7:35 AM Nishita Pattanayak <nishita.pattana...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Sean, Chris and Xintong. This seems to be a very exciting sub-project. +1 for "flink-agents" sub-project.
> > > > > > > > >
> > > > > > > > > I was going through the FLIP, and had some questions regarding the same:
> > > > > > > > > 1. How would the external model calls (e.g., OpenAI or internal LLMs) be integrated into Flink tasks without introducing backpressure or latency issues? In my experience, calling an external LLM has the following risks: it is latency-sensitive (LLM inference can take hundreds of milliseconds to seconds), flaky (network issues, rate limits), and non-deterministic (with timeouts, retries, etc.). It would be great to work/brainstorm on how we solve these issues.
> > > > > > > > > 2. In traditional agent workflows, user feedback often plays a key role in validating and improving agent outputs. In a continuous, long-running Flink-based agent system, where interactions might not be user-facing or synchronous, how do we incorporate human-in-the-loop feedback or correctness signals to validate and iteratively improve agent behavior?
> > > > > > > > >
> > > > > > > > > This is a really exciting direction for the Flink ecosystem. The idea of building long-running, context-aware agents natively on Flink feels like a natural evolution of stream processing. I'd love to see this mature and would be excited to contribute in any way I can to help productionize and validate this in real-world use cases.
> > > > > > > > >
> > > > > > > > > On Wed, May 21, 2025 at 8:52 AM Xintong Song <tonysong...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi devs,
> > > > > > > > > >
> > > > > > > > > > Sean, Chris and I would like to start a discussion on FLIP-531 [1], about introducing a new sub-project, Flink Agents.
> > > > > > > > > >
> > > > > > > > > > With the rise of agentic AI, we have identified great new opportunities for Flink, particularly in the system-triggered agent scenarios. We believe the future of AI agent applications is industrialized, where agents will not only be triggered by users, but increasingly by systems as well.
> > > > > > > > > > Flink's capabilities in real-time distributed event processing, state management and exactly-once consistency fault tolerance make it well-suited as a framework for building such system-triggered agents. Furthermore, system-triggered agents are often tightly coupled with data processing. Flink's outstanding data processing capabilities allow seamless integration between data and agentic processing. These capabilities differentiate Flink from other agent frameworks with unique advantages in the context of system-triggered agents.
> > > > > > > > > >
> > > > > > > > > > We propose this effort as a sub-project of Apache Flink, with a separate code repository and lightweight development process, for rapid iteration during the early stage.
> > > > > > > > > >
> > > > > > > > > > Please note that this FLIP is focused on the high-level plans, including motivation, positioning, goals, roadmap, and operating model of the project. Detailed technical design is out of scope and will be discussed during the rapid prototyping and iterations.
> > > > > > > > > >
> > > > > > > > > > For more details, please check the FLIP [1]. Looking forward to your feedback.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > >
> > > > > > > > > > Xintong
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-531%3A+Initiate+Flink+Agents+as+a+new+Sub-Peoject
> > > > > >
> > > > > >
> > > > > > --
> > > > > > *Mayank Juneja*
> > > > > > Product Manager | Data Streaming and AI