Hey there! We are now seeing a second Flink pipeline hit similar issues when both `withWatermarkAlignment` and `withIdleness` are configured. The unexpected behaviour is triggered after a Kafka cluster failover. Any thoughts on whether the two are incompatible?
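For reference, here is a minimal sketch of the workaround from the first pipeline: the same strategy with `withIdleness` removed. The `Event` case class and the inline timestamp assigner are hypothetical stand-ins for our real classes, and the explicit third argument is just the 1 second `updateInterval` mentioned in the original message, so treat this as an illustration rather than our exact code.

```scala
import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}

// Hypothetical event type; stands in for the real Event class.
case class Event(timestampMillis: Long)

object AlignedWatermarks {
  // Same strategy as in the quoted message, but without withIdleness.
  val watermarkStrategy: WatermarkStrategy[Event] =
    WatermarkStrategy
      .forBoundedOutOfOrderness[Event](Duration.ofMinutes(1))
      // Hypothetical assigner; the real one is TimestampedEventTimestampAssigner.
      .withTimestampAssigner(new SerializableTimestampAssigner[Event] {
        override def extractTimestamp(event: Event, recordTimestamp: Long): Long =
          event.timestampMillis
      })
      .withWatermarkAlignment(
        "alignment-group-1",
        Duration.ofMinutes(1), // max allowed watermark drift
        Duration.ofSeconds(1)  // updateInterval
      )
}
```

This only avoids the symptom by not combining the two options; it does not explain why `maxAllowedWatermark` drops to ~ Long.MIN_VALUE when both are set.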
Thanks!

On Wed, Nov 9, 2022 at 6:42 PM Reem Razak <reem.ra...@shopify.com> wrote:

> Hi there,
>
> We are integrating the watermark alignment feature into a pipeline with a
> Kafka source during a "backfill", i.e. playing from an earlier Kafka
> offset. While testing, we noticed some unexpected behaviour in the
> watermark advancement which was resolved by removing `withIdleness` from
> our watermark strategy.
>
> val watermarkStrategy = WatermarkStrategy
>   .forBoundedOutOfOrderness(Duration.ofMinutes(1))
>   .withTimestampAssigner(new TimestampedEventTimestampAssigner[Event])
>   .withWatermarkAlignment("alignment-group-1", Duration.ofMinutes(1))
>   .withIdleness(Duration.ofMinutes(5))
>
> I have attached a couple of screenshots of the watermarkAlignmentDrift
> metric. As you can see, the behaviour seems normal until a sudden drop in
> the value to ~ Long.MIN_VALUE, which causes the pipeline to stop emitting
> records completely from the source. Furthermore, the logs originating from
> https://github.com/dawidwys/flink/blob/888162083fbb5f62f809e5270ad968db58cc9c5b/flink-runtime/src/main/java/org/apache/flink/runtime/source/coordinator/SourceCoordinator.java#L174
> also indicate that the `maxAllowedWatermark` switches to ~ Long.MIN_VALUE.
>
> We found that modifying the `updateInterval` passed into the alignment
> parameters seemed to correlate with how long the pipeline would operate
> before stopping - a larger interval of 20 minutes would encounter the issue
> later than an interval of 1 second.
>
> We are wondering if a bug exists when using both `withIdleness` and
> `withWatermarkAlignment`. Might it be related to
> https://issues.apache.org/jira/browse/FLINK-28975, or is there possibly a
> race condition in the watermark emission? We do not necessarily need to
> have both configured at the same time, but we were also surprised by the
> behaviour of the application. Has anyone run into a similar issue or have
> further insight?
>
> Much Appreciated,
> - Reem