Re: HA lock nodes, Checkpoints, and JobGraphs after failure

2019-06-11 Thread Till Rohrmann
Great to hear, Dyana. Thanks for the update.
Cheers,
Till
On Fri, Jun 7, 2019 at 2:48 PM dyana.rose wrote:
> Just wanted to give an update on this.
>
> Our ops team and myself independently came to the same conclusion that our
> ZooKeeper quorum was having syncing issues.
>
> After a bit more re

Re: HA lock nodes, Checkpoints, and JobGraphs after failure

2019-06-07 Thread dyana . rose
Just wanted to give an update on this. Our ops team and myself independently came to the same conclusion that our ZooKeeper quorum was having syncing issues. After a bit more research, they have updated the initLimit and syncLimit in the quorum configs to:
initLimit=10
syncLimit=5
After this c
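For anyone tuning the same settings: ZooKeeper counts initLimit and syncLimit in ticks, so the wall-clock budget they give followers depends on tickTime, which the message above does not state. A minimal sketch, assuming tickTime=2000 ms (the value in ZooKeeper's sample config, and only an assumption for this cluster); the helper name is made up for illustration:

```python
def quorum_timeouts_ms(tick_time_ms, init_limit, sync_limit):
    """Convert ZooKeeper's tick-based quorum limits into milliseconds."""
    return {
        # Time a follower gets to connect to the leader and do its initial sync.
        "init_ms": tick_time_ms * init_limit,
        # How far a follower may lag behind the leader before it is dropped.
        "sync_ms": tick_time_ms * sync_limit,
    }


# ASSUMPTION: tickTime is not given in the thread; 2000 ms is a placeholder.
ASSUMED_TICK_TIME_MS = 2000

timeouts = quorum_timeouts_ms(ASSUMED_TICK_TIME_MS, init_limit=10, sync_limit=5)
print(timeouts)
```

Under that assumed tickTime, the values from the message would allow 20 seconds for the initial sync and 10 seconds of follower lag. A different tickTime scales both proportionally.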

Re: HA lock nodes, Checkpoints, and JobGraphs after failure

2019-05-02 Thread Till Rohrmann
Thanks for the update, Dyana. I'm also not an expert in running one's own ZooKeeper cluster. It might be related to setting up the ZooKeeper cluster properly. Maybe someone else from the community has experience with this. Therefore, I'm cross-posting this thread to the user ML again to have a wider

Re: HA lock nodes, Checkpoints, and JobGraphs after failure

2019-05-01 Thread dyana . rose
Like all the best problems, I can't get this to reproduce locally. Everything has worked as expected. I started up a test job with 5 retained checkpoints, let it run, and watched the nodes in ZooKeeper. Then I shut down and restarted the Flink cluster. The ephemeral lock nodes in the retained chec

Re: HA lock nodes, Checkpoints, and JobGraphs after failure

2019-04-23 Thread Till Rohrmann
It would be awesome to get the DEBUG logs for JobMaster, ZooKeeper, ZooKeeperCompletedCheckpointStore, ZooKeeperStateHandleStore, CheckpointCoordinator.
Cheers,
Till
On Tue, Apr 23, 2019 at 2:37 PM Dyana Rose wrote:
> may take me a bit to get the logs as we're not always in a situation where
>

Re: HA lock nodes, Checkpoints, and JobGraphs after failure

2019-04-23 Thread Dyana Rose
may take me a bit to get the logs as we're not always in a situation where we've got enough hands free to run through the scenarios for a day. Is that DEBUG JobManager, DEBUG ZooKeeper, or both you'd be interested in?
Thanks,
Dyana
On Tue, 23 Apr 2019 at 13:23, Till Rohrmann wrote:
> Hi Dyana,

Re: HA lock nodes, Checkpoints, and JobGraphs after failure

2019-04-23 Thread Till Rohrmann
Hi Dyana, your analysis is almost correct. The only part that is missing is that the lock nodes are created as ephemeral nodes. This should ensure that if a JM process dies, the lock nodes get removed by ZooKeeper. It depends a bit on ZooKeeper's configuration how long it takes until Zk
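To make the timing point concrete: an ephemeral node disappears only once its owning session is closed or expires, and the server clamps the session timeout a client asks for between minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times tickTime. The sketch below only models that clamping and is not ZooKeeper code; the function name is hypothetical, and the tickTime and requested timeout are assumed values that do not come from this thread.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms,
                                  min_ms=None, max_ms=None):
    """Model how a ZooKeeper server bounds a client's requested session timeout."""
    # ZooKeeper defaults: minSessionTimeout = 2 * tickTime,
    #                     maxSessionTimeout = 20 * tickTime.
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))


# ASSUMPTIONS: neither value is stated in the thread.
ASSUMED_TICK_TIME_MS = 2000      # placeholder server tickTime
ASSUMED_REQUESTED_MS = 60000     # placeholder client session timeout

expiry_ms = negotiated_session_timeout_ms(ASSUMED_REQUESTED_MS, ASSUMED_TICK_TIME_MS)
print(expiry_ms)
```

With these assumed numbers the server would cap the session at 40 seconds, so lock nodes left behind by a crashed JM could linger for roughly that long before ZooKeeper removes them.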

HA lock nodes, Checkpoints, and JobGraphs after failure

2019-04-18 Thread dyana . rose
Flink v1.7.1
After a Flink reboot, we've been seeing some unexpected issues with excess retained checkpoints that cannot be removed from ZooKeeper after a new checkpoint is created. I believe I've got my head around the role of ZK and lockNodes in Checkpointing after going through the cod