It may take me a bit to get the logs, as we're not always in a situation where we've got enough hands free to run through the scenarios for a day.
Is that DEBUG on the JobManager, DEBUG on ZooKeeper, or both that you'd be interested in?

Thanks,
Dyana

On Tue, 23 Apr 2019 at 13:23, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Dyana,
>
> Your analysis is almost correct. The only part that is missing is that the lock nodes are created as ephemeral nodes. This should ensure that if a JM process dies, its lock nodes will be removed by ZooKeeper. How long it takes until ZooKeeper detects the client connection as lost and removes the ephemeral nodes depends a bit on ZooKeeper's configuration. If the job terminates within this time interval, it could happen that you cannot remove the checkpoint/JobGraph. However, the ZooKeeper session timeout is usually configured to be a couple of seconds.
>
> I would actually be interested in better understanding your problem to see whether this is still a bug in Flink. Could you maybe share the respective logs on DEBUG log level with me? Maybe it would also be possible to run the latest version of Flink (1.7.2) to include all possible bug fixes.
>
> FYI: The community is currently discussing reimplementing the ZooKeeper-based high availability services [1]. One idea is to get rid of the lock nodes by replacing them with transactions on the leader node. This could prevent this kind of bug in the future.
>
> [1] https://issues.apache.org/jira/browse/FLINK-10333
>
> Cheers,
> Till
>
> On Thu, Apr 18, 2019 at 3:12 PM dyana.rose <dyana.r...@salecycle.com> wrote:
>
> > Flink v1.7.1
> >
> > After a Flink reboot we've been seeing some unexpected issues with excess retained checkpoints that cannot be removed from ZooKeeper after a new checkpoint is created.
> >
> > I believe I've got my head around the role of ZK and lockNodes in checkpointing after going through the code. Could you check my logic on this and add any insight, especially if I've got it wrong?
> >
> > The situation:
> >
> > 1) Say we run JM1 and JM2, retain 10 checkpoints, and are running in HA with S3 as the backing store.
> >
> > 2) JM1 and JM2 start up, and each instance of ZooKeeperStateHandleStore has its own lockNode UUID. JM1 is elected leader.
> >
> > 3) We submit a job; its JobGraph is added to ZK and locked using JM1's JobGraph lockNode.
> >
> > 4) Checkpoints start rolling in; the latest 10 are retained in ZK using JM1's checkpoint lockNode. We continue running, and checkpoints are successfully being created and excess checkpoints removed.
> >
> > 5) Both JM1 and JM2 are now rebooted.
> >
> > 6) The JobGraph is recovered by the leader, and the job restarts from the latest checkpoint.
> >
> > Now after every new checkpoint we see in the ZooKeeper logs:
> >
> > INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@653] - Got user-level KeeperException when processing sessionid:0x10000047715000d type:delete cxid:0x210 zxid:0x700001091 txntype:-1 reqpath:n/a Error Path:/flink/job-name/checkpoints/2fa0d694e245f5ec1f709630c7c7bf69/0000000000000057813 Error:KeeperErrorCode = Directory not empty for /flink/job-name/checkpoints/2fa0d694e245f5ec1f709630c7c7bf69/000000000000005781
> >
> > with an increasing checkpoint id on each subsequent call.
> >
> > When JM1 and JM2 were rebooted, the lockNode UUIDs would have rolled, right? As the old checkpoints were created under the old UUID, the new JMs will never be able to remove the old retained checkpoints from ZooKeeper.
> >
> > Is that correct?
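To make sure we're describing the same mechanism, here's a rough sketch of how I understand the lock-node scheme (illustrative only: it uses Curator directly rather than Flink's actual ZooKeeperStateHandleStore, and the class name and paths are made up). The same pattern would apply to the JobGraph scenario further down.

import java.util.UUID;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;

// Illustrative sketch of the per-instance lock-node scheme; not Flink code.
public class LockNodeSketch {

    // Each JM incarnation gets its own lock-node UUID, so the UUID "rolls" on reboot.
    private final String lockNodeUuid = UUID.randomUUID().toString();

    private final CuratorFramework client;

    public LockNodeSketch(String zkConnectString) {
        client = CuratorFrameworkFactory.newClient(
                zkConnectString, new ExponentialBackoffRetry(1000, 3));
        client.start();
    }

    // "Lock" a checkpoint by adding an ephemeral child znode named after our UUID.
    // Parent znodes (the checkpoint path itself) are created as persistent nodes.
    void lock(String checkpointPath) throws Exception {
        client.create()
                .creatingParentsIfNeeded()
                .withMode(CreateMode.EPHEMERAL)
                .forPath(checkpointPath + "/" + lockNodeUuid);
    }

    // Release our own lock, then try to remove the checkpoint znode itself.
    void releaseAndTryRemove(String checkpointPath) throws Exception {
        try {
            client.delete().forPath(checkpointPath + "/" + lockNodeUuid);
        } catch (KeeperException.NoNodeException ignored) {
            // This incarnation never locked the checkpoint, e.g. it was
            // retained by the pre-reboot JM under a different UUID.
        }
        try {
            // ZooKeeper refuses to delete a znode that still has children, so
            // this fails with "Directory not empty" while any other instance's
            // lock child is still present under the checkpoint znode.
            client.delete().forPath(checkpointPath);
        } catch (KeeperException.NotEmptyException e) {
            System.out.println("Cannot remove " + checkpointPath + ": " + e);
        }
    }
}

If a retained checkpoint's znode still has a lock child created under the old UUID, the final delete above fails with exactly the "Directory not empty" error we see in the ZooKeeper log.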
> > If so, would this also happen with JobGraphs in the following situation? (We saw this just recently, where we had a JobGraph for a cancelled job still in ZK.)
> >
> > Steps 1 through 3 above, then:
> >
> > 4) JM1 fails over to JM2, and the job keeps running uninterrupted. JM1 restarts.
> >
> > 5) Some time later, while JM2 is still leader, we hard-cancel the job and restart the JMs.
> >
> > In this case JM2 would successfully remove the job's files from S3, but because its lockNode is different from JM1's, it cannot delete the lock file in the jobgraph folder and so can't remove the JobGraph. Then Flink restarts and tries to process the JobGraph it has found, but the S3 files have been deleted.
> >
> > Possible related closed issues (fixes went in v1.7.0): https://issues.apache.org/jira/browse/FLINK-10184 and https://issues.apache.org/jira/browse/FLINK-10255
> >
> > Thanks for any insight,
> > Dyana

--
Dyana Rose
Software Engineer
W: www.salecycle.com