Re: Flink Checkpoint on yarn

2016-03-19 Thread Ufuk Celebi
Can you please have a look at the JobManager log file and report which checkpoints are restored? You should see messages from ZooKeeperCompletedCheckpointStore like "Found X checkpoints in ZooKeeper" and "Initialized with X. Removing all older checkpoints". You can share the complete JobManager log…
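A quick way to pull those messages out of the log (the log file name below is a placeholder; on YARN the JobManager log sits in the ApplicationMaster's container log directory):

    # Filter the JobManager log for checkpoint recovery messages:
    grep "ZooKeeperCompletedCheckpointStore" jobmanager.log

    # The lines of interest look like:
    #   Found X checkpoints in ZooKeeper
    #   Initialized with X. Removing all older checkpoints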

Re: Flink Checkpoint on yarn

2016-03-19 Thread Ufuk Celebi
Yes, the jobs have their own UUID. Although you expect there to be two independent clusters (which makes sense, since you started via yarn-cluster), both clusters act as a single one because of the shared ZooKeeper root. What happens in your case is the following (this is also the reason why we see…
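You can see the collision directly in ZooKeeper. A sketch with the ZooKeeper CLI, where the host and the sub-paths for job graphs and checkpoints are assumptions based on the defaults (adjust to your setup):

    # Both clusters write under the same recovery root, so their job
    # graphs and checkpoints end up mixed in the same znodes:
    bin/zkCli.sh -server zk-host:2181 ls /flink
    bin/zkCli.sh -server zk-host:2181 ls /flink/jobgraphs
    bin/zkCli.sh -server zk-host:2181 ls /flink/checkpoints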

Re: Flink Checkpoint on yarn

2016-03-19 Thread Ufuk Celebi
OK, so you are submitting multiple jobs, but you submit them with -m yarn-cluster and therefore expect them to start separate YARN clusters. Makes sense, and I would expect the same. I think you can check in the client logs printed to stdout which cluster the job is submitted to. PS: The logs…
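To cross-check, you can also list the YARN applications and match their IDs against what the client printed on submission:

    # Each -m yarn-cluster submission should show up as its own application:
    yarn application -list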

Re: Flink Checkpoint on yarn

2016-03-19 Thread Ufuk Celebi
Hey Simone! Did you set different recovery.zookeeper.path.root keys? The default is /flink, and if you don't change it for the second cluster, it will try to recover the jobs of the first one. Can you gather the JobManager logs as well, please? – Ufuk
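A minimal sketch of that change, assuming each cluster is started from its own Flink configuration directory (the path /flink-cluster-2 is just an example):

    # Point the second cluster at its own recovery root so it does not
    # try to recover the first cluster's jobs:
    echo "recovery.zookeeper.path.root: /flink-cluster-2" >> conf/flink-conf.yaml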

Re: Flink Checkpoint on yarn

2016-03-19 Thread Simone Robutti
This is the log, filtered for messages from ZooKeeperCompletedCheckpointStore: https://gist.github.com/chobeat/0222b31b87df3fa46a23 It looks like it finds only one checkpoint, but I'm not sure whether the different hashes and IDs of the checkpoints are meaningful or not.

Re: Flink Checkpoint on yarn

2016-03-19 Thread Ufuk Celebi
Hey Simone, from the logs it looks like multiple jobs have been submitted to the cluster, not just one. The different files correspond to different jobs recovering. The filtered logs show three jobs running/recovering (with IDs 10d8ccae6e87ac56bf763caf4bc4742f, 124f29322f9026ac1b35435d5de9f625, 7f…

Re: Flink Checkpoint on yarn

2016-03-19 Thread Stefano Baghino
Hi Ufuk, does the recovery.zookeeper.path.root property need to be set independently for each job that is run? Doesn't Flink take care of assigning some sort of identification to each job and storing their checkpoints independently?

Re: Flink Checkpoint on yarn

2016-03-19 Thread Simone Robutti
OK, I ran another test. I launched two identical jobs, one after the other, on YARN (without the long-running session). I then killed a JobManager, and both jobs ran into problems and then resumed their work after a few seconds. The problem is that the first job restored the state of the second job and…

Re: Flink Checkpoint on yarn

2016-03-19 Thread Stefano Baghino
Hi Ufuk, I've read the documentation and it's exactly as you say, thanks for the clarification. Assuming one wants to run several jobs in parallel with different users on a secure cluster in HA mode…

Re: Flink Checkpoint on yarn

2016-03-19 Thread Simone Robutti
Actually, the test was intended for a single job. The fact that there are more jobs is unexpected, and it will be the first thing to verify. Given these problems, we will go for deeper tests with multiple jobs. The logs are collected with "yarn logs", but log aggregation is not properly configured…
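For reference, fetching the aggregated logs looks like this (the application ID is a placeholder), and it only returns output once log aggregation is enabled in yarn-site.xml:

    # Fetch the container logs of a finished/killed application:
    yarn logs -applicationId application_1458123456789_0001

    # Requires log aggregation to be switched on in yarn-site.xml:
    #   <property>
    #     <name>yarn.log-aggregation-enable</name>
    #     <value>true</value>
    #   </property>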

Re: Flink Checkpoint on yarn

2016-03-19 Thread Stefano Baghino
Yes, but each job runs its own cluster, right? We have to run them on a secure cluster and on a per-user basis, so we can't run a YARN session but have to run each job independently.
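In that per-job setup, one way to keep the recovery state apart is to give every submission its own root. A sketch, where the -yD dynamic-property flag, the path scheme /flink/<user>/<job>, and my-job.jar are assumptions to adapt (check bin/flink run --help for your version):

    # One detached YARN cluster per job, each with its own recovery root:
    bin/flink run -m yarn-cluster -yn 2 -yd \
      -yD recovery.zookeeper.path.root=/flink/$USER/my-job \
      ./my-job.jar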

Re: Flink Checkpoint on yarn

2016-03-19 Thread Stefano Baghino
> Do you have time to repeat your experiment with different ZooKeeper root paths? We reached the same conclusion, and we're running this test right now, thanks.

Re: Flink Checkpoint on yarn

2016-03-19 Thread Simone Robutti
I didn't resubmit the job. Also, the jobs are submitted one by one with -m yarn-cluster, not with a long-running YARN session, so I don't really know how they could mix up. I will repeat the test with a cleaned state, because we saw that killing the job with yarn application -kill left the "flink ru…
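Cleaning that state means removing the recovery root in ZooKeeper. A sketch using the ZooKeeper 3.4 CLI (rmr is its recursive-delete command; this is destructive, so only run it while no cluster is using the path, and adjust host and root to your setup):

    # Wipe the stale recovery metadata left behind by yarn application -kill:
    bin/zkCli.sh -server zk-host:2181 rmr /flink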

Re: Flink Checkpoint on yarn

2016-03-18 Thread Ufuk Celebi
On Thu, Mar 17, 2016 at 11:51 AM, Stefano Baghino wrote: > does the recovery.zookeeper.path.root property need to be set independently for each job that is run? No, just per cluster.