Team,

Hopefully this is a quick one. We have set up the restart strategy as follows in pretty much all of our apps:

env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, Time.of(30, TimeUnit.SECONDS)));
This seems pretty straightforward: the app should retry up to 10 times, waiting 30 seconds between attempts, so it should ride out about 5 minutes of trouble. Either we are not understanding this or the behavior is inconsistent. Some of the applications restart and come back fine on issues like Kafka timeouts (which I will come back to below), but in other cases the same issues pretty much shut the app down.

My first guess was that the total count of 10 is not reset after the app has recovered normally. Is there a need to manually reset the counter in an app? I doubt Flink treats it as a counter that spans the life of the app instead of resetting on a successful start-up, but I am not sure how else to explain the behavior.

Along the same lines, what actually counts as a "restart"? Our Kafka cluster has known performance bottlenecks at certain times of day that we are working to resolve, and I notice quite a few Kafka producer timeouts during those times. When the app hits these timeouts it recovers fine, but I don't necessarily see the entire application restarting, since I don't see my app's bootstrap logs. Does something like this count as a restart from the restart strategy's perspective, in the same way as an app crash or YARN killing the application, where the app is actually restarted from scratch?

We are really liking Flink; we just need to hash out these operational issues to make it prime time for all the streaming apps we have in our cluster.

Thanks,
Ashish
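
P.S. To make my first guess concrete, here is a toy model in plain Java of the two behaviors I am trying to tell apart. It uses no Flink APIs and is not meant to describe Flink's actual implementation; the class RestartBudgetModel, its methods, and the resetOnRecovery flag are all hypothetical names made up for this illustration. The only values taken from our setup are 10 attempts and a 30 second delay.

```java
import java.util.concurrent.TimeUnit;

/**
 * Toy model of a fixed-delay restart budget. NOT Flink code: every name here
 * is hypothetical and exists only to illustrate the question.
 */
final class RestartBudgetModel {
    private final int maxAttempts;
    private final long delayMillis;
    private final boolean resetOnRecovery; // the behavior we are unsure about
    private int attemptsUsed = 0;

    RestartBudgetModel(int maxAttempts, long delayMillis, boolean resetOnRecovery) {
        this.maxAttempts = maxAttempts;
        this.delayMillis = delayMillis;
        this.resetOnRecovery = resetOnRecovery;
    }

    /** Records one failure; returns true if a restart is still allowed. */
    boolean onFailure() {
        if (attemptsUsed >= maxAttempts) {
            return false; // budget exhausted, the job would fail for good
        }
        attemptsUsed++;
        return true;
    }

    /** Called once the job is running normally again. */
    void onRecovered() {
        if (resetOnRecovery) {
            attemptsUsed = 0;
        }
    }

    /** Longest continuous outage the budget covers: attempts * delay. */
    long worstCaseWindowMillis() {
        return maxAttempts * delayMillis;
    }

    /** Feeds isolated failures, each followed by a recovery; returns how many were survived. */
    static int survivedFailures(RestartBudgetModel model, int failures) {
        int survived = 0;
        for (int i = 0; i < failures; i++) {
            if (!model.onFailure()) {
                break;
            }
            model.onRecovered();
            survived++;
        }
        return survived;
    }

    public static void main(String[] args) {
        long delay = TimeUnit.SECONDS.toMillis(30);
        RestartBudgetModel lifetimeCounter = new RestartBudgetModel(10, delay, false);
        RestartBudgetModel resettingCounter = new RestartBudgetModel(10, delay, true);

        System.out.println("window (s): " + lifetimeCounter.worstCaseWindowMillis() / 1000);
        System.out.println("lifetime counter survives: " + survivedFailures(lifetimeCounter, 25));
        System.out.println("resetting counter survives: " + survivedFailures(resettingCounter, 25));
    }
}
```

With a counter that spans the life of the app, the model gives up on the 11th isolated failure even though every earlier one recovered cleanly. With a counter that resets on recovery, it survives all 25. If the first variant is closer to what Flink does, that might explain why some apps eventually shut down after many small Kafka hiccups, but that is exactly the part I would like confirmed.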