Great to hear Dyana. Thanks for the update.

Cheers,
Till

On Fri, Jun 7, 2019 at 2:48 PM dyana.rose <dyana.r...@salecycle.com> wrote:

> Just wanted to give an update on this.
>
> Our ops team and myself independently came to the same conclusion that our
> ZooKeeper quorum was having syncing issues.
>
> After a bit more research, they have updated the initLimit and syncLimit
> in the quorum configs to:
> initLimit=10
> syncLimit=5
>
> After this change we no longer saw any of the issues we were having.
>
> Thanks,
> Dyana
>
> On 2019/05/02 08:43:14, Till Rohrmann <trohrm...@apache.org> wrote:
> > Thanks for the update Dyana. I'm also not an expert in running one's own
> > ZooKeeper cluster. It might be related to setting the ZooKeeper cluster
> > properly up. Maybe someone else from the community has experience with
> > this. Therefore, I'm cross posting this thread to the user ML again to
> > have a wider reach.
> >
> > Cheers,
> > Till
> >
> > On Wed, May 1, 2019 at 10:17 AM dyana.rose <dyana.r...@salecycle.com>
> > wrote:
> >
> > > Like all the best problems, I can't get this to reproduce locally.
> > >
> > > Everything has worked as expected. I started up a test job with 5
> > > retained checkpoints, let it run and watched the nodes in zookeeper.
> > >
> > > Then shut down and restarted the Flink cluster.
> > >
> > > The ephemeral lock nodes in the retained checkpoints transitioned from
> > > one lock id to another without a hitch.
> > >
> > > So that's good.
> > >
> > > As I understand it, if the Zookeeper cluster is having a sync issue,
> > > ephemeral nodes may not get deleted when the session becomes inactive.
> > > We're new to running our own zookeeper so it may be down to that.
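
A note on the initLimit and syncLimit values quoted above: ZooKeeper counts both in ticks, so the effective timeouts depend on tickTime, which the thread does not state. The sketch below is only an illustration of that arithmetic. It assumes tickTime=2000 (milliseconds), and the helper name quorum_timeouts_ms is hypothetical, not part of ZooKeeper or Flink.

```python
def quorum_timeouts_ms(cfg_text: str) -> dict:
    """Parse zoo.cfg-style key=value lines and convert tick-based limits to ms."""
    settings = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()

    tick_ms = int(settings["tickTime"])
    return {
        # Time a follower has to connect and sync to the leader initially.
        "init_timeout_ms": int(settings["initLimit"]) * tick_ms,
        # How far out of date a follower may fall behind the leader.
        "sync_timeout_ms": int(settings["syncLimit"]) * tick_ms,
    }


# initLimit and syncLimit are the values from the thread.
# tickTime is NOT given in the thread; 2000 ms is an assumption.
example_cfg = """
tickTime=2000
initLimit=10
syncLimit=5
"""

timeouts = quorum_timeouts_ms(example_cfg)
print(timeouts)
```

Under that assumed tickTime, followers would get 20 seconds for the initial sync and 10 seconds for ongoing syncs. With a different tickTime the numbers scale accordingly.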