Hi, we are facing a serious production issue with Flink. Any help would be appreciated.
We consume packets from a Kafka cluster. Every day from 22:00 UTC to 00:30 UTC, traffic on one topic [say "topic A"] drops sharply. Our job reads from a different topic [say "topic B"], yet during the same period it drops a lot of packets [as shown by the "laterecordsDropped" metric]. At the same time, the job that reads from "topic A" shows a high fetch rate. We also observed an abnormal CPU rise on one of the brokers of this cluster [which I attributed to the high fetch rates].

Our job uses a tumbling window of 1 min [with 10 seconds of watermarksPeriodicBounded], based on the packets' event time.

Is there any reason why my job reading from "topic B" would see more late records dropped during this period?

In the screenshot below, "Laterecords dropped" corresponds to the job reading from "topic B", while the fetch and consume rates relate to the job reading from "topic A" [which has the downward trend in traffic at the times mentioned].

[image: image.png]

All these graphs are correlated, and we are unable to isolate the problem. Other modules also consume from this topic, and no slow records are logged there, which is why we are not sure whether this issue lies with Flink alone.

Thanks.
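
For reference, here is a minimal sketch of how a record ends up counted as late under the setup described above (1-minute event-time tumbling window, 10-second out-of-orderness bound). It is plain Java and does not use the Flink API. The class name LatenessSketch and all timestamps are hypothetical. It assumes the watermark is the highest event time seen so far minus the bound minus 1 ms, that allowed lateness is zero, and that a record is dropped once the last millisecond of its window is at or below the watermark. As a simplification, the watermark advances on every record rather than on a periodic timer.

```java
final class LatenessSketch {
    // Values taken from the post: 1-minute tumbling window, 10-second bound.
    static final long WINDOW_SIZE_MS = 60_000L;
    static final long MAX_OUT_OF_ORDERNESS_MS = 10_000L;

    // Highest event timestamp observed so far (drives the watermark).
    private long maxSeenTimestamp = Long.MIN_VALUE;

    // Start of the tumbling window that contains the given event time.
    static long windowStart(long eventTs) {
        return eventTs - Math.floorMod(eventTs, WINDOW_SIZE_MS);
    }

    // Assumed rule: watermark = max event time - bound - 1 ms.
    long currentWatermark() {
        if (maxSeenTimestamp == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // no records seen yet
        }
        return maxSeenTimestamp - MAX_OUT_OF_ORDERNESS_MS - 1;
    }

    // Returns true if the record would be dropped as late
    // (its window already closed, zero allowed lateness assumed).
    boolean process(long eventTs) {
        long windowMaxTs = windowStart(eventTs) + WINDOW_SIZE_MS - 1;
        boolean late = windowMaxTs <= currentWatermark();
        maxSeenTimestamp = Math.max(maxSeenTimestamp, eventTs);
        return late;
    }

    public static void main(String[] args) {
        LatenessSketch sketch = new LatenessSketch();
        // Hypothetical arrival order: one record shows up well after
        // newer records have already moved the watermark forward.
        long[] arrivals = {5_000L, 65_000L, 71_000L, 50_000L, 62_000L};
        int dropped = 0;
        for (long ts : arrivals) {
            boolean late = sketch.process(ts);
            if (late) {
                dropped++;
            }
            System.out.println("eventTs=" + ts + " late=" + late
                    + " watermark=" + sketch.currentWatermark());
        }
        System.out.println("late records dropped: " + dropped);
    }
}
```

In this sketch a record is only dropped when other records with event times more than 10 seconds past the end of its window have already been processed, so a delay in delivering some records relative to others (for example from a slow broker or partition) would be one way to produce such drops. Whether that is what happens here is not established.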