Team,
Hopefully, this is a quick one. 
We have set up the restart strategy as follows in pretty much all of our apps:
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, Time.of(30, TimeUnit.SECONDS)));

This seems pretty straightforward. The app should retry starting up to 10 
times, 30 seconds apart - so about 5 minutes in total. Either we are not 
understanding this or it behaves inconsistently. Some of the applications 
restart and come back fine on issues like Kafka timeouts (which I will come 
back to later), but in some cases the same issues pretty much shut the app down.
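
To spell out the arithmetic behind the "about 5 minutes" estimate, here is a 
small plain-Java sketch. It has no Flink dependency, the class and method names 
are made up for illustration, and it assumes the restart attempts themselves 
take no time, so it only counts the delays:

```java
import java.time.Duration;

public class RetryBudget {
    // Hypothetical helper: total time covered by the delays alone,
    // i.e. maxAttempts * delay. Time spent in each restart is ignored.
    static Duration retryWindow(int maxAttempts, Duration delay) {
        return delay.multipliedBy(maxAttempts);
    }

    public static void main(String[] args) {
        // Same numbers as in our restart strategy: 10 attempts, 30 seconds apart.
        Duration window = retryWindow(10, Duration.ofSeconds(30));
        System.out.println(window.toMinutes() + " minutes"); // prints "5 minutes"
    }
}
```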

My first guess here was that the total count of 10 is not reset after the app 
has recovered normally. Is there a need to manually reset the counter in an 
app? I doubt Flink would treat it as a counter that spans the life of an app 
instead of resetting it on successful start-up - but I am not sure how else to 
explain the behavior.
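
To make the two interpretations concrete, here is a toy model in plain Java. 
It is only a sketch of the question, not Flink's implementation: the `Budget` 
class, the `resetOnRecovery` flag and the method names are all hypothetical. It 
counts how many isolated failures (each followed by a clean recovery) a job 
would survive under each reading of the attempt limit:

```java
public class RestartCounterModel {
    // Hypothetical model of a fixed attempt budget; NOT Flink code.
    static final class Budget {
        private final int maxAttempts;
        private final boolean resetOnRecovery;
        private int used = 0;

        Budget(int maxAttempts, boolean resetOnRecovery) {
            this.maxAttempts = maxAttempts;
            this.resetOnRecovery = resetOnRecovery;
        }

        // Records one failure; returns true if a restart is still allowed.
        boolean onFailure() {
            if (used >= maxAttempts) {
                return false;
            }
            used++;
            return true;
        }

        // Called once the job is running normally again.
        void onRecovered() {
            if (resetOnRecovery) {
                used = 0;
            }
        }
    }

    // Number of isolated failures survived before the budget runs out.
    static int survivedFailures(Budget budget, int failures) {
        int survived = 0;
        for (int i = 0; i < failures; i++) {
            if (!budget.onFailure()) {
                break;
            }
            budget.onRecovered();
            survived++;
        }
        return survived;
    }

    public static void main(String[] args) {
        // Counter spans the life of the job: gives up on the 11th failure.
        System.out.println(survivedFailures(new Budget(10, false), 25)); // prints 10
        // Counter resets after each recovery: survives all 25 failures.
        System.out.println(survivedFailures(new Budget(10, true), 25));  // prints 25
    }
}
```

If the first reading is the right one, apps that see more transient failures 
would exhaust the 10 attempts sooner, which would match the inconsistency we 
are seeing.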
Along the same lines, what actually constitutes a "restart"? Our Kafka 
cluster has known performance bottlenecks during certain times of day that we 
are working to resolve. I do notice Kafka producer timeouts quite a few times 
during these periods. When the app hits these timeouts, it does recover fine, 
but I don't necessarily see the entire application restarting, as I don't see 
the bootstrap logs of my app. Does something like this count as a restart from 
the restart strategy's perspective as well, as opposed to things like app 
crashes or YARN killing the application, where the app is actually restarted 
from scratch?
We really like Flink; we just need to hash out these operational issues to 
make it ready for prime time for all the streaming apps we have in our cluster.
Thanks,
Ashish
