Hi guys,

We are in the final stages of moving our Flink pipeline from staging to production, but I just found something kinda weird:
We are graphing some Flink metrics, like flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max. If I got this right, that's "Kafka head offset - Flink consumer offset", i.e., the number of records Flink still needs to consume to reach the most recent one in the partition. Is that right?

If that's the case, here's the weird thing I saw: at some points, this lag falls back to 0 and then slowly climbs back up (remember, this is a staging environment, not production, so we are using smaller machines with few cores [2] and low memory [8 GB]). I've attached a Grafana graph for reference.

I don't see any checkpoint errors or taskmanager failures, so I don't think it simply dropped everything and started over.

Any ideas what's going on here?

--
*Julio Biason*, Software Engineer
*AZION* | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
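
P.S. To make sure we are talking about the same number, here is a minimal sketch of how I understand the lag calculation. It assumes records_lag_max is the largest per-partition lag, where each partition's lag is the log end offset minus the consumer's current position. The function names and the offsets are made up for illustration; this is not Flink or Kafka code.

```python
def records_lag(log_end_offsets, consumer_positions):
    """Per-partition lag: log end offset minus the consumer's current position."""
    return {
        partition: max(0, log_end_offsets[partition] - position)
        for partition, position in consumer_positions.items()
    }


def records_lag_max(log_end_offsets, consumer_positions):
    """Largest lag across all partitions (0 if there are no partitions)."""
    lags = records_lag(log_end_offsets, consumer_positions)
    return max(lags.values(), default=0)


# Made-up offsets for three partitions, only for illustration.
log_end_offsets = {0: 1500, 1: 2300, 2: 900}
consumer_positions = {0: 1400, 1: 2300, 2: 650}

print(records_lag(log_end_offsets, consumer_positions))      # {0: 100, 1: 0, 2: 250}
print(records_lag_max(log_end_offsets, consumer_positions))  # 250
```

If that reading is correct, a value of 0 would mean the consumer position equals the log end offset on every partition it is reading.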