Re: HA lock nodes, Checkpoints, and JobGraphs after failure

dyana . rose Fri, 07 Jun 2019 05:48:46 -0700

Just wanted to give an update on this.

Our ops team and I independently came to the same conclusion: our 
ZooKeeper quorum was having syncing issues.


After a bit more research, they have updated the initLimit and syncLimit in the 
quorum configs to:
initLimit=10
syncLimit=5

After this change we no longer saw any of the issues we were having.
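
For anyone reading these values later: both limits are counted in ticks, not seconds, so the real timeouts depend on tickTime. Here is a rough sketch of the arithmetic in Python. The tickTime of 2000 ms is an assumption taken from ZooKeeper's sample zoo.cfg, not a value from our quorum config, and the function name is only illustrative.

```python
def quorum_timeouts_ms(tick_time_ms, init_limit, sync_limit):
    """Convert ZooKeeper's tick-based quorum limits into milliseconds."""
    return {
        # initLimit: ticks a follower may take to connect and sync to the leader
        "init_timeout_ms": init_limit * tick_time_ms,
        # syncLimit: ticks a follower may lag behind before it is dropped
        "sync_timeout_ms": sync_limit * tick_time_ms,
    }


# ASSUMPTION: tickTime=2000 comes from the sample zoo.cfg, not from this thread.
timeouts = quorum_timeouts_ms(tick_time_ms=2000, init_limit=10, sync_limit=5)
print(timeouts)  # {'init_timeout_ms': 20000, 'sync_timeout_ms': 10000}
```

With that assumed tickTime, followers would get 20 seconds for the initial sync and 10 seconds of allowed lag. A different tickTime scales both numbers proportionally.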

Thanks,
Dyana

On 2019/05/02 08:43:14, Till Rohrmann <trohrm...@apache.org> wrote: 
> Thanks for the update, Dyana. I'm also not an expert in running one's own
> ZooKeeper cluster. It might be related to setting up the ZooKeeper cluster
> properly. Maybe someone else from the community has experience with
> this. Therefore, I'm cross-posting this thread to the user ML again to have
> a wider reach.
> 
> Cheers,
> Till
> 
> On Wed, May 1, 2019 at 10:17 AM dyana.rose <dyana.r...@salecycle.com> wrote:
> 
> > Like all the best problems, I can't get this to reproduce locally.
> >
> > Everything has worked as expected. I started up a test job with 5 retained
> > checkpoints, let it run and watched the nodes in ZooKeeper.
> >
> > Then shut down and restarted the Flink cluster.
> >
> > The ephemeral lock nodes in the retained checkpoints transitioned from one
> > lock id to another without a hitch.
> >
> > So that's good.
> >
> > As I understand it, if the ZooKeeper cluster is having a sync issue,
> > ephemeral nodes may not get deleted when the session becomes inactive.
> > We're new to running our own ZooKeeper, so it may be down to that.
> >
>
