All,

It looks like Flink's default behavior is to restart all operators when a single operator fails; in my case, the failure is a Kafka producer timing out. When this happens, the logs show all operators being restarted, which essentially leads to data loss. Our data volume is so high that checkpointing is becoming very expensive.

Does Flink have a lifecycle hook that would let us force a checkpoint before the operators are restarted? That would solve a dire production issue for us.

Thanks,
-- Ashish