Re: Why did JM fail on K8s (see original thread below)

2019-06-29 Thread Vishal Santoshi
even though the configured restart strategy reads "Max. number of execution retries: Restart with fixed delay (24 ms). #20 restart attempts." On Sat, Jun 29, 2019 at 10:44 AM Vishal Santoshi wrote: > This is strange, the retry strategy was 20 times with a 4 minute delay. > This job tried once ( we had a Hadoop NameNode hiccup ) but

Re: Why did JM fail on K8s (see original thread below)

2019-06-29 Thread Vishal Santoshi
This is strange, the retry strategy was 20 times with a 4 minute delay. This job tried once ( we had a Hadoop NameNode hiccup ) but I think it could not even get to the NN and gave up ( as in it did not retry the remaining 19 times ) 2019-06-29 00:33:13,680 INFO org.apache.flink.runtime.executiongraph.E
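
[Editor's note: for reference, a fixed-delay restart strategy like the one described here (20 attempts, 4 minute delay) is typically set cluster-wide in flink-conf.yaml. A minimal sketch, assuming cluster-level configuration rather than a per-job setting via RestartStrategies:

    # flink-conf.yaml
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 20
    restart-strategy.fixed-delay.delay: 4 min

The same values can also be set programmatically on the execution environment, which takes precedence over the file-based configuration.]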

Re: Why did JM fail on K8s (see original thread below)

2019-06-29 Thread Vishal Santoshi
We are investigating that. But is the above theory plausible ( Flink gurus ), even though this, as in forcing restartPolicy: Never, pretty much nullifies HA on the JM if it is a Job cluster ( at least on k8s )? As for the reason, we are investigating that. One thing we are looking at is the QoS ( https://kub
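
[Editor's note: the two knobs mentioned here both live in the JM pod spec: restartPolicy controls whether Kubernetes restarts the container itself, and the QoS class is derived from the resource requests/limits (Guaranteed requires requests to equal limits for every container). A minimal sketch with hypothetical names and values, e.g. as the pod template of a Kubernetes Job running the job-cluster JM:

    apiVersion: v1
    kind: Pod
    metadata:
      name: flink-jobmanager          # hypothetical name
    spec:
      restartPolicy: Never            # the setting discussed above; OnFailure would let k8s restart the JM container
      containers:
        - name: jobmanager
          image: flink:1.8            # assumed image/version
          resources:
            requests:                 # requests == limits for cpu and memory => Guaranteed QoS class,
              cpu: "1"                # which makes the pod less likely to be evicted under node pressure
              memory: 2Gi
            limits:
              cpu: "1"
              memory: 2Gi
]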

Why did JM fail on K8s (see original thread below)

2019-06-29 Thread Timothy Victor
This is slightly off topic, so I'm changing the subject so as not to conflate it with the original issue you brought up. But do we know why the JM crashed in the first place? We are also thinking of moving to K8s, but to be honest we had tons of stability issues in our first rodeo. That could just be our lack of