I recently updated my cluster with the following config:

    restart-strategy: failure-rate
    restart-strategy.failure-rate.max-failures-per-interval: 3
    restart-strategy.failure-rate.failure-rate-interval: 5 min
    restart-strategy.failure-rate.delay: 10 s

I see the settings inside the JobManager web UI, as expected. I am not setting the restart strategy programmatically, but the job does have checkpointing enabled.

However, if I launch a job that (intentionally) fails every 10 seconds by throwing a RuntimeException, it continues to restart beyond the limit of 3 failures.

Does anyone know why this might be happening? Any ideas of things I could check?

Thanks!
Shannon
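
P.S. To make the expectation concrete, here is a minimal standalone sketch of how I understand a failure-rate limit should behave with the settings above. It is not Flink's actual implementation; the class and method names are made up for illustration, and the 20-second spacing between failures is an assumption (roughly 10 s of run time plus the 10 s restart delay).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only: a sliding-window failure counter,
// not the restart strategy code that Flink itself runs.
public class FailureRateSketch {

    // Returns how many restarts are granted before the failure rate is exceeded.
    public static int restartsBeforeGivingUp(
            int maxFailuresPerInterval, long intervalMs, long[] failureTimesMs) {
        Deque<Long> recent = new ArrayDeque<>();
        int restarts = 0;
        for (long t : failureTimesMs) {
            recent.addLast(t);
            // Forget failures that fell out of the sliding interval.
            while (!recent.isEmpty() && t - recent.peekFirst() > intervalMs) {
                recent.removeFirst();
            }
            // More failures in the interval than allowed: stop restarting.
            if (recent.size() > maxFailuresPerInterval) {
                return restarts;
            }
            restarts++;
        }
        return restarts;
    }

    public static void main(String[] args) {
        // Assumed timeline: one failure every 20 s (10 s run + 10 s delay).
        long[] failures = new long[10];
        for (int i = 0; i < failures.length; i++) {
            failures[i] = i * 20_000L;
        }
        // max-failures-per-interval = 3, failure-rate-interval = 5 min
        System.out.println(restartsBeforeGivingUp(3, 5 * 60_000L, failures));
    }
}
```

Under those assumptions the fourth failure arrives well inside the 5-minute interval, so I would expect the job to stop restarting after three restarts rather than continuing indefinitely.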