Hi Kartikey,

Thank you for looking into this.

I might not be very familiar with the naming conventions in Flink, so please bear with me if my suggestion doesn't make complete sense. I suggest introducing a feature flag, something like:

> events.reporter.<name>.dispatcher.type

which would default to *sync* to make this change backwards compatible.

Also, are there any reasons why we would not want to introduce an interface with two implementations?
1. sync: for the existing behaviour.
2. memory-queue: for the proposed implementation with the queue.

This way:
- we don't break anything by default
- we can change the default in future releases once it has been proven to be stable
- we keep the door open for other implementations (e.g. file-based queue or spillover to logs).

I look forward to hearing your thoughts on it.

Kind regards,
Aleksandr Iushmanov

On Fri, 22 Aug 2025 at 09:54, Kartikey Pant <kartikeypant....@gmail.com> wrote:

> Hi Aleksandr,
>
> Thanks for the great feedback. Your points on guaranteed delivery and the
> *FileEventsReporter* are spot on, and I agree with your reasoning. I'll
> update the FLIP to incorporate them, as it will make the proposal much
> stronger.
>
> Regarding the delivery guarantee, I'll add a new configuration key,
> *events.reporter.<name>.delivery.guarantee*, to allow a choice between two
> modes. The default will be best-effort for the asynchronous, non-blocking
> dispatch. I'll also add a guaranteed mode for a synchronous, blocking
> dispatch that bypasses the queue, perfect for the critical autoscaling use
> case you mentioned.
>
> On your question about the *FileEventsReporter*, you're right that a local
> file append is cheap. The async core isn't really designed for the
> *FileEventsReporter* specifically, but for the general case where reporters
> write to network sinks (e.g., *OpenTelemetry*) where latency and
> backpressure are real concerns. The file reporter is just meant to be a
> simple, built-in option for users.
>
> I'll get these changes into the design doc shortly and will follow up on
> this thread once it's updated. Thanks again for helping improve the FLIP.
>
> Best,
> Kartikey
>
> On Thu, Aug 21, 2025 at 11:19 PM Aleksandr Iushmanov <izeren...@gmail.com>
> wrote:
>
> > Hi Kartikey,
> >
> > I like the idea and I agree with the general direction, thank you for
> > putting it together!
> >
> > I have one concern about making this modification "forced". Imho, there
> > should be room for "guaranteed important events delivery" from the
> > operations point of view. If a Flink job is struggling/backpressured, it
> > may make sense to emit some events at a priority that would be used for
> > external triggers like "autoscaling" or external dynamic configuration
> > tuning.
> >
> > Imho, interfaces should allow choosing "sync" vs "non guaranteed async"
> > delivery for different events (or event reporters). With the proposal
> > "as is", it won't be possible to "ensure" that important messages have
> > been delivered and can be actioned by an external monitoring system.
> > Could we make "queue / async" behaviour opt-in?
> >
> > The second question I had was around the FileEventReporter
> > implementation. At a glance, "append to file" is a fairly cheap
> > operation. Do you have a concern that the volume of events is large
> > enough to create a significant disk I/O bottleneck and require a memory
> > queue?
> >
> > Kind regards,
> >
> > Aleksandr Iushmanov
> >
> >
> > On 2025/08/19 06:56:36 Kartikey Pant wrote:
> > > Hi everyone,
> > >
> > > I'd like to propose a new FLIP that builds directly on the excellent
> > > foundation laid by FLIP-481 (Introduce Event Reporting).
> > > For anyone needing context, the original proposal is available here:
> > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-481%3A+Introduce+Event+Reporting
> > >
> > > Now that the community has this powerful API, the logical next step is
> > > to ensure it's fully robust for large-scale production environments
> > > where users will be writing their own diverse, custom reporters.
> > >
> > > This proposal focuses on one key enhancement: introducing a resilient,
> > > asynchronous dispatch core. The goal is to decouple event generation
> > > from the reporter's execution, ensuring that a slow or experimental
> > > sink can never impact Flink's core stability.
> > >
> > > I've drafted a detailed design document that I hope can form the basis
> > > of this new FLIP:
> > >
> > > https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?usp=sharing
> > >
> > > I'm keen to get the community's initial feedback on this direction
> > > before moving forward with the formal process.
> > >
> > > Thanks,
> > > Kartikey Pant
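
For illustration only, the "interface with two implementations" idea from the top of this message could look roughly like the sketch below. Every name in it (EventDispatcher, SyncDispatcher, MemoryQueueDispatcher, Dispatchers, the queue capacity of 1024) is hypothetical and not part of Flink or FLIP-481. Events are modelled as plain strings and a reporter as a Consumer<String> to keep the example self-contained. The factory treats a missing value for the suggested events.reporter.<name>.dispatcher.type key as "sync".

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical types for discussion; none of these exist in Flink today.
interface EventDispatcher extends AutoCloseable {
    /** Returns true if the event was delivered or accepted for delivery. */
    boolean dispatch(String event);

    @Override
    void close();
}

/** "sync": calls the reporter on the caller's thread (the existing behaviour). */
final class SyncDispatcher implements EventDispatcher {
    private final Consumer<String> reporter;

    SyncDispatcher(Consumer<String> reporter) {
        this.reporter = reporter;
    }

    @Override
    public boolean dispatch(String event) {
        reporter.accept(event);
        return true;
    }

    @Override
    public void close() {}
}

/** "memory-queue": bounded in-memory queue drained by one worker thread. */
final class MemoryQueueDispatcher implements EventDispatcher {
    private final BlockingQueue<String> queue;
    private final Consumer<String> reporter;
    private final Thread worker;
    private volatile boolean running = true;

    MemoryQueueDispatcher(Consumer<String> reporter, int capacity) {
        this.reporter = reporter;
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.worker = new Thread(this::drain, "event-dispatcher");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    @Override
    public boolean dispatch(String event) {
        // Never blocks the caller; a full queue means the event is dropped.
        return queue.offer(event);
    }

    private void drain() {
        try {
            // Keep draining after close() is requested until the queue is empty.
            while (running || !queue.isEmpty()) {
                String event = queue.poll(50, TimeUnit.MILLISECONDS);
                if (event != null) {
                    reporter.accept(event);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    @Override
    public void close() {
        running = false;
        try {
            worker.join(); // waits for queued events; a stuck reporter would block here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

/** Maps the (suggested) dispatcher.type value to an implementation. */
final class Dispatchers {
    static EventDispatcher create(String type, Consumer<String> reporter) {
        String effective = (type == null) ? "sync" : type; // backwards-compatible default
        switch (effective) {
            case "sync":
                return new SyncDispatcher(reporter);
            case "memory-queue":
                return new MemoryQueueDispatcher(reporter, 1024);
            default:
                throw new IllegalArgumentException("Unknown dispatcher type: " + effective);
        }
    }
}

final class DispatcherSketch {
    public static void main(String[] args) {
        EventDispatcher d =
                Dispatchers.create("memory-queue", e -> System.out.println("reported: " + e));
        d.dispatch("job-restarted");
        d.close();
    }
}
```

With this shape, a file-based queue or a spillover-to-logs variant would be one more case in the factory, and the default could be flipped later without touching reporters. A real implementation would likely want a bounded wait in close() and a metric for dropped events.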