Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-29 Thread Vishal Santoshi
OK, this happened again and it is bizarre (and is definitely not what I think should happen). The job failed and I see these logs (in essence it is keeping the last 5 externalized checkpoints) but deleting the ZK checkpoints directory: *06.28.2019 20:33:13.738 2019-06-29 00:33:13,73
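For reference, the "last 5 externalized checkpoints" retention described above is controlled by state.checkpoints.num-retained in flink-conf.yaml, while retaining checkpoints across cancellation or failure is enabled per job. A minimal sketch of that per-job setting, assuming a Flink 1.8-era DataStream job (the class name and the trivial pipeline are illustrative only):

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointRetentionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every minute; how many completed checkpoints are kept ("the last 5"
        // above) is configured via state.checkpoints.num-retained in flink-conf.yaml.
        env.enableCheckpointing(60_000);

        // Retain externalized checkpoints even on cancellation, so a relaunched
        // job-cluster JM has something to resume from.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Trivial pipeline, stand-in for the real Kafka-to-HDFS job.
        env.fromElements(1, 2, 3).print();
        env.execute("Kafka-to-HDFS");
    }
}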

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-29 Thread Timothy Victor
Hi Vishal, can this be reproduced on a bare metal instance as well? Also is this a job or a session cluster? Thanks Tim On Sat, Jun 29, 2019, 7:50 AM Vishal Santoshi wrote: > OK this happened again and it is bizarre ( and is definitely not what I > think should happen ) > > > > > The job fai

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-29 Thread Vishal Santoshi
I have not tried on bare metal. We have no option but k8s. And this is a job cluster. On Sat, Jun 29, 2019 at 9:01 AM Timothy Victor wrote: > Hi Vishal, can this be reproduced on a bare metal instance as well? Also > is this a job or a session cluster? > > Thanks > > Tim > > On Sat, Jun 29, 2

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-29 Thread Vishal Santoshi
Another point: the JM had terminated. The policy on K8s for the Job Cluster is spec: restartPolicy: OnFailure *2019-06-29 00:33:14,308 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1443.* On Sat,

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-29 Thread Vishal Santoshi
So there are 2 scenarios: 1. If the JM goes down (exits) and k8s relaunches the Job Cluster (the JM), it is using the exact command the JM was brought up with in the first place. 2. If the pipe is restarted for any other reason by the JM (the JM has not exited but *Job Kafka-to-HDFS (

Why did JM fail on K8s (see original thread below)

2019-06-29 Thread Timothy Victor
This is slightly off topic, so I'm changing the subject to not conflate the original issue you brought up. But do we know why JM crashed in the first place? We are also thinking of moving to K8s, but to be honest we had tons of stability issues in our first rodeo. That could just be our lack of

Re: Why did JM fail on K8s (see original thread below)

2019-06-29 Thread Vishal Santoshi
We are investigating that. But is the above theory plausible (Flink gurus), even though this, as in forcing restartPolicy: Never, pretty much nullifies HA on the JM if it is a Job cluster (at least on k8s)? As for the reason, we are investigating that. One thing we are looking at is the QoS ( https://kub

Re: Why did JM fail on K8s (see original thread below)

2019-06-29 Thread Vishal Santoshi
This is strange, the retry strategy was 20 times with a 4-minute delay. This job tried once (we had a Hadoop NameNode hiccup) but I think it could not even get to the NN and gave up (as in did not retry the next 19 times) *2019-06-29 00:33:13,680 INFO org.apache.flink.runtime.executiongraph.E

Re: Why did JM fail on K8s (see original thread below)

2019-06-29 Thread Vishal Santoshi
even though Max. number of execution retries reads: Restart with fixed delay (24 ms). #20 restart attempts. On Sat, Jun 29, 2019 at 10:44 AM Vishal Santoshi wrote: > This is strange, the retry strategy was 20 times with a 4-minute delay. > This job tried once (we had a Hadoop NameNode hiccup) but
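For reference, the "20 attempts / 4-minute delay" strategy discussed above corresponds to Flink's fixed-delay restart strategy, which can be set either in flink-conf.yaml or in code. A minimal sketch of the programmatic form, assuming a Flink 1.8-era DataStream job (the class name and trivial pipeline are illustrative only):

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 20 restart attempts, 4 minutes apart, matching the strategy described above.
        // The same can be set in flink-conf.yaml via restart-strategy: fixed-delay,
        // restart-strategy.fixed-delay.attempts and restart-strategy.fixed-delay.delay.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(20, Time.of(4, TimeUnit.MINUTES)));

        env.fromElements(1, 2, 3).print();
        env.execute("restart-strategy-sketch");
    }
}

Once the configured attempts are exhausted the job transitions to FAILED and, in a job cluster, the entrypoint process exits, which is consistent with the "Terminating cluster entrypoint" log line earlier in the thread.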

Flink Kafka ordered offset commit & unordered processing

2019-06-29 Thread wang xuchen
Hi Flink experts, I am prototyping a real-time system that reads from a Kafka source with Flink and calls out to an external system as part of the event processing. One of the most important requirements is that reads from Kafka should NEVER stall, even in the face of slowness in some async external calls, while
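The message is truncated in this digest, but the pattern it describes (non-blocking calls to an external system while consuming from Kafka) is commonly handled with Flink's Async I/O API (AsyncDataStream). A minimal sketch, assuming a Flink 1.8+ DataStream job; the ExternalLookup function, the 30-second timeout and the capacity of 100 are illustrative, and the real source would be a Kafka consumer rather than fromElements:

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncEnrichmentSketch {

    /** Calls a hypothetical external service without blocking the processing thread. */
    static class ExternalLookup extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String event, ResultFuture<String> resultFuture) {
            // Placeholder for a real non-blocking client call (HTTP, gRPC, ...).
            CompletableFuture
                .supplyAsync(() -> event + ":enriched")
                .thenAccept(result -> resultFuture.complete(Collections.singleton(result)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a Kafka source in the real job.
        DataStream<String> events = env.fromElements("a", "b", "c");

        // unorderedWait emits results as they complete (no ordering guarantee); the
        // capacity (100) bounds in-flight requests, so a slow external system surfaces
        // as backpressure instead of unbounded buffering.
        DataStream<String> enriched = AsyncDataStream.unorderedWait(
                events, new ExternalLookup(), 30, TimeUnit.SECONDS, 100);

        enriched.print();
        env.execute("async-io-sketch");
    }
}

On the "ordered offset commit" part of the subject: with the default behaviour of committing offsets on completed checkpoints, the async operator snapshots its in-flight records into the same checkpoints, so out-of-order completion does not risk losing records whose offsets have already been committed.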