Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-07-03 Thread Vishal Santoshi
I guess using a session cluster rather than a job cluster will decouple the job from the container and may be the only option as of today? On Sat, Jun 29, 2019, 9:34 AM Vishal Santoshi wrote: > So there are 2 scenarios > > 1. If the JM goes down ( exits ) and k8s relaunches the Job Cluster ( the J

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-29 Thread Vishal Santoshi
So there are 2 scenarios: 1. If the JM goes down ( exits ) and k8s relaunches the Job Cluster ( the JM ), it uses the exact command the JM was brought up with in the first place. 2. If the pipe is restarted for any other reason by the JM ( the JM has not exited but *Job Kafka-to-HDFS (

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-29 Thread Vishal Santoshi
Another point: the JM had terminated. The policy on K8s for the Job Cluster is spec: restartPolicy: OnFailure. *2019-06-29 00:33:14,308 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1443.* On Sat,
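
For reference, a minimal sketch of a Kubernetes Job manifest using that restart policy for a Flink 1.8 job-cluster JobManager; the manifest name, image name, job class, and entrypoint args below are assumptions for illustration, not taken from the thread:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: flink-job-cluster              # hypothetical name
    spec:
      template:
        spec:
          restartPolicy: OnFailure         # the policy discussed above
          containers:
            - name: jobmanager
              image: my-flink-job:1.8      # assumed custom image built via flink-container
              args:
                - job-cluster
                - --job-classname
                - com.example.KafkaToHdfsJob   # hypothetical job class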

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-29 Thread Vishal Santoshi
I have not tried on bare metal. We have no option but k8s. And this is a job cluster. On Sat, Jun 29, 2019 at 9:01 AM Timothy Victor wrote: > Hi Vishal, can this be reproduced on a bare metal instance as well? Also > is this a job or a session cluster? > > Thanks > > Tim > > On Sat, Jun 29, 2

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-29 Thread Timothy Victor
Hi Vishal, can this be reproduced on a bare metal instance as well? Also is this a job or a session cluster? Thanks Tim On Sat, Jun 29, 2019, 7:50 AM Vishal Santoshi wrote: > OK this happened again and it is bizarre ( and is definitely not what I > think should happen ) > > > > > The job fai

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-29 Thread Vishal Santoshi
OK this happened again and it is bizarre ( and is definitely not what I think should happen ). The job failed and I see these logs ( in essence it is keeping the last 5 externalized checkpoints ) but deleting the ZK checkpoints directory: *06.28.2019 20:33:13.738 2019-06-29 00:33:13,73
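
The ZK entries being deleted here live under the ZooKeeper HA paths configured in flink-conf.yaml. A minimal sketch of the relevant settings, with placeholder values not taken from the thread:

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181   # placeholder quorum
    high-availability.storageDir: hdfs:///flink/ha/                     # placeholder; holds JM metadata and checkpoint handles
    high-availability.cluster-id: /kafka-to-hdfs                        # placeholder per-job id for a job cluster
    high-availability.zookeeper.path.root: /flink                       # ZK root; the checkpoints node sits below it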

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-05 Thread Vishal Santoshi
Ok, I will do that. On Wed, Jun 5, 2019, 8:25 AM Chesnay Schepler wrote: > Can you provide us the jobmanager logs? > > After the first restart the JM should have started deleting older > checkpoints as new ones were created. > After the second restart the JM should have recovered all 10 checkpoi

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-05 Thread Chesnay Schepler
Can you provide us the jobmanager logs? After the first restart the JM should have started deleting older checkpoints as new ones were created. After the second restart the JM should have recovered all 10 checkpoints, started from the latest, and started pruning old ones as new ones were created.
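
For context, externalized checkpoints are retained on the state backend as one chk-<n> directory per completed checkpoint under the configured checkpoint path; the job id, path, and checkpoint numbers below are illustrative only:

    hdfs:///flink/checkpoints/<job-id>/chk-40/_metadata
    hdfs:///flink/checkpoints/<job-id>/chk-41/_metadata
    ...
    hdfs:///flink/checkpoints/<job-id>/chk-44/_metadata
    # with num-retained: 5 the JM should prune chk-40 once chk-45 completes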

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-05 Thread Vishal Santoshi
Anyone? On Tue, Jun 4, 2019, 2:41 PM Vishal Santoshi wrote: > The above is Flink 1.8 > > On Tue, Jun 4, 2019 at 12:32 PM Vishal Santoshi > wrote: > >> I had a sequence of events that created this issue. >> >> * I started a job and I had state.checkpoints.num-retained: 5 >> >> * As expected

Re: Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-04 Thread Vishal Santoshi
The above is Flink 1.8. On Tue, Jun 4, 2019 at 12:32 PM Vishal Santoshi wrote: > I had a sequence of events that created this issue. > > * I started a job and I had state.checkpoints.num-retained: 5 > > * As expected I have the 5 latest checkpoints retained in my HDFS backend. > > > * JM dies ( K

Job recovers from an old dangling CheckPoint in case of Job Cluster based Flink pipeline

2019-06-04 Thread Vishal Santoshi
I had a sequence of events that created this issue. * I started a job and I had state.checkpoints.num-retained: 5 * As expected, I have the 5 latest checkpoints retained in my HDFS backend. * The JM dies ( K8s limit etc. ) without cleaning the HDFS directory. The k8s job restores from the latest che
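
A minimal sketch of the flink-conf.yaml settings being described; only num-retained: 5 and the use of an HDFS backend come from the thread, the directory path and backend type are placeholders:

    state.checkpoints.num-retained: 5                  # keep the 5 latest completed checkpoints
    state.checkpoints.dir: hdfs:///flink/checkpoints   # placeholder externalized checkpoint location
    state.backend: filesystem                          # assumed; the thread only says checkpoints live on HDFS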