Hello Folks, We have some pipelines that contain multiple hops of Flink jobs with Kafka as the transport layer in between. Data is finally written to analytical stores. We want to measure end-to-end (e2e) latency from the first source all the way to the last sink (the one that writes to the analytical stores), and possibly also at other hops in the middle.
Here are some strategies I'm thinking about; would love your thoughts on the various approaches:

1. *<Processing_time - Event_time> at various hops.* Each record will carry an event_time set when it is written to the first Kafka source. When there are windowed aggregations and such, it's tricky to assign the correct event time to the derived event. So this is tightly coupled with user logic and hence not favorable.

2. *Latency markers introduced in the Kafka stream* that will be consumed by the Flink jobs. We can potentially introduce latency markers along with regular data; they would share the same data envelope schema so they can travel with the regular records. However, operators would need to identify them, forward them appropriately, and also exclude them from aggregations and such, which makes this approach complex. Unless there is an elegant way to piggyback on the internal Flink latency-marker movement for e2e latency tracking? *Would love to hear your thoughts about this.*

3. *Sum of Kafka consumer lag* across all the Kafka topics in the pipeline. This will give us tail latencies only; we would ideally love to get a histogram of latencies across all the events.

4. *Global minimum watermark.* In this approach, I'm thinking about periodically checking the global minimum watermark and using that to determine tail latency. This would probably not give a good histogram of latencies across all partitions and data, but so far it seems like the easiest approach to generalize across all types of operators and aggregations. Would love to hear your thoughts on its feasibility.

Let me know what you think. And if there are better ways to measure end-to-end latency, that would be lovely!

Thanks,
Sherin