Re: [DISCUSS] FLIP-XXX: Hardening the Event Reporter with an Asynchronous Core

Aleksandr Iushmanov Tue, 26 Aug 2025 10:43:57 -0700

Hi Kartikey,

Thank you for looking into this.


I might not be very familiar with the naming conventions in Flink,
so please bear with me if my suggestion doesn't make complete sense.
I suggest introducing a feature flag, something like:

> events.reporter.<name>.dispatcher.type

which would default to *sync* to make this change backwards compatible.

Also, are there any reasons why we would not want to introduce an
interface with two implementations?
1. sync: for the existing behaviour.
2. memory-queue: for the proposed implementation with the queue.

This way:

   - we don't break anything by default
   - we can change the default in future releases once it has been proven
   to be stable
   - we keep the door open for other implementations (e.g. file-based queue
   or spillover to logs).


I look forward to hearing your thoughts on it.

Kind regards,
Aleksandr Iushmanov


On Fri, 22 Aug 2025 at 09:54, Kartikey Pant <[email protected]>
wrote:

> Hi Aleksandr,
>
> Thanks for the great feedback. Your points on guaranteed delivery and the
> *FileEventsReporter* are spot on, and I agree with your reasoning. I'll
> update the FLIP to incorporate them, as it will make the proposal much
> stronger.
>
> Regarding the delivery guarantee, I'll add a new configuration key,
> *events.reporter.<name>.delivery.guarantee*, to allow a choice between two
> modes. The default will be best-effort for the asynchronous, non-blocking
> dispatch. I'll also add a guaranteed mode for a synchronous, blocking
> dispatch that bypasses the queue, perfect for the critical autoscaling use
> case you mentioned.
>
> On your question about the *FileEventsReporter*, you're right that a local
> file append is cheap. The async core isn't really designed for the
> *FileEventsReporter* specifically, but for the general case where reporters
> write to network sinks (e.g., *OpenTelemetry*) where latency and
> backpressure are real concerns. The file reporter is just meant to be a
> simple, built-in option for users.
>
> I'll get these changes into the design doc shortly and will follow up on
> this thread once it's updated. Thanks again for helping improve the FLIP.
>
> Best,
> Kartikey
>
> On Thu, Aug 21, 2025 at 11:19 PM Aleksandr Iushmanov <[email protected]>
> wrote:
>
> > Hi Kartikey,
> >
> > I like the idea and I agree with general direction, thank you for
> > putting it together!
> >
> > I have one concern about making this modification "forced", imho there
> > should be a room for "guaranteed important events delivery" from the
> > operations point of view. If Flink job is struggling/backpressured it
> > may make sense to emit some events at priority that would be used for
> > external triggers like "autoscaling" or external dynamic configuration
> > tuning.
> >
> > Imho, interfaces should either allow to choose "sync" vs "non guaranteed
> > async" delivery for different events (or event reporters). With proposal
> > "as is" it won't be possible to "ensure" that important messages have
> > been delivered and can be actioned by external monitoring system. Could
> > we make "queue / async" behaviour opt-in?
> > Second question I had was around FileEventReporter implementation, at a
> > glance, "append to file" is a fairly cheap operation, do you have a
> > concern that amount of events is large enough to have significant
> > bottleneck on disk IO and requires memory queue?
> >
> > Kind regards,
> >
> > Aleksandr Iushmanov
> >
> >
> > On 2025/08/19 06:56:36 Kartikey Pant wrote:
> >  > Hi everyone,
> >  >
> >  > I'd like to propose a new FLIP that builds directly on the excellent
> >  > foundation laid by FLIP-481 (Introduce Event Reporting). For anyone
> >  > needing context, the original proposal is available here:
> >  >
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-481%3A+Introduce+Event+Reporting
> >  >
> >  > Now that the community has this powerful API, the logical next step is
> >  > to ensure it's fully robust for large-scale production environments
> >  > where users will be writing their own diverse, custom reporters.
> >  >
> >  > This proposal focuses on one key enhancement: introducing a resilient,
> >  > asynchronous dispatch core. The goal is to decouple event generation
> >  > from the reporter's execution, ensuring that a slow or experimental
> >  > sink can never impact Flink's core stability.
> >  >
> >  > I've drafted a detailed design document that I hope can form the basis
> >  > of this new FLIP:
> >  >
> >
> >
> https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?usp=sharing
> >  >
> >  > I'm keen to get the community's initial feedback on this direction
> >  > before moving forward with the formal process.
> >  >
> >  > Thanks,
> >  > Kartikey Pant
> >  >
> >
>

Re: [DISCUSS] FLIP-XXX: Hardening the Event Reporter with an Asynchronous Core

Reply via email to