Just wanted to give an update on this. Our ops team and I independently came to the same conclusion: our ZooKeeper quorum was having syncing issues.
After a bit more research, they have updated the initLimit and syncLimit in the quorum configs to:

initLimit=10
syncLimit=5

After this change we no longer saw any of the issues we were having.

Thanks,
Dyana

On 2019/05/02 08:43:14, Till Rohrmann <trohrm...@apache.org> wrote:
> Thanks for the update Dyana. I'm also not an expert in running one's own
> ZooKeeper cluster. It might be related to setting the ZooKeeper cluster
> properly up. Maybe someone else from the community has experience with
> this. Therefore, I'm cross posting this thread to the user ML again to have
> a wider reach.
>
> Cheers,
> Till
>
> On Wed, May 1, 2019 at 10:17 AM dyana.rose <dyana.r...@salecycle.com> wrote:
>
> > Like all the best problems, I can't get this to reproduce locally.
> >
> > Everything has worked as expected. I started up a test job with 5 retained
> > checkpoints, let it run and watched the nodes in zookeeper.
> >
> > Then shut down and restarted the Flink cluster.
> >
> > The ephemeral lock nodes in the retained checkpoints transitioned from one
> > lock id to another without a hitch.
> >
> > So that's good.
> >
> > As I understand it, if the Zookeeper cluster is having a sync issue,
> > ephemeral nodes may not get deleted when the session becomes inactive.
> > We're new to running our own zookeeper so it may be down to that.
> > >
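
For anyone reading the initLimit and syncLimit values in the reply at the top of this thread: ZooKeeper expresses both in ticks, so the actual time allowed depends on tickTime. Below is a minimal sketch of that arithmetic. It assumes tickTime=2000 (the value in ZooKeeper's sample zoo.cfg; the thread does not say which tickTime is in use), and the helper name quorum_timeouts_ms is hypothetical.

```python
def quorum_timeouts_ms(tick_time_ms: int, init_limit: int, sync_limit: int) -> dict:
    """Convert ZooKeeper's tick-based quorum limits into milliseconds.

    initLimit: ticks a follower may take to connect and sync to the leader.
    syncLimit: ticks a follower may lag behind the leader before it is dropped.
    """
    return {
        "init_timeout_ms": tick_time_ms * init_limit,
        "sync_timeout_ms": tick_time_ms * sync_limit,
    }


# tick_time_ms=2000 is an assumption; initLimit/syncLimit are the values from the thread.
timeouts = quorum_timeouts_ms(tick_time_ms=2000, init_limit=10, sync_limit=5)
print(timeouts)
```

Under that assumption, followers would get 20 seconds to connect and sync to the leader, and could lag by up to 10 seconds before being dropped. A different tickTime scales both windows proportionally.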