Re: HA lock nodes, Checkpoints, and JobGraphs after failure

Till Rohrmann Tue, 11 Jun 2019 02:32:50 -0700

Great to hear, Dyana. Thanks for the update.

Cheers,
Till


On Fri, Jun 7, 2019 at 2:48 PM dyana.rose <[email protected]> wrote:

> Just wanted to give an update on this.
>
> Our ops team and I independently came to the same conclusion: our
> ZooKeeper quorum was having syncing issues.
>
> After a bit more research, they have updated the initLimit and syncLimit
> in the quorum configs to:
> initLimit=10
> syncLimit=5
>
> After this change we no longer saw any of the issues we were having.
>
> Thanks,
> Dyana
>
> On 2019/05/02 08:43:14, Till Rohrmann <[email protected]> wrote:
> > Thanks for the update, Dyana. I'm also not an expert in running one's own
> > ZooKeeper cluster. It might be related to setting up the ZooKeeper cluster
> > properly. Maybe someone else from the community has experience with
> > this. Therefore, I'm cross-posting this thread to the user ML again to
> > have a wider reach.
> >
> > Cheers,
> > Till
> >
> > On Wed, May 1, 2019 at 10:17 AM dyana.rose <[email protected]> wrote:
> >
> > > Like all the best problems, I can't get this to reproduce locally.
> > >
> > > Everything has worked as expected. I started up a test job with 5
> > > retained checkpoints, let it run, and watched the nodes in ZooKeeper.
> > >
> > > Then I shut down and restarted the Flink cluster.
> > >
> > > The ephemeral lock nodes in the retained checkpoints transitioned from
> > > one lock id to another without a hitch.
> > >
> > > So that's good.
> > >
> > > As I understand it, if the ZooKeeper cluster is having a sync issue,
> > > ephemeral nodes may not get deleted when the session becomes inactive.
> > > We're new to running our own ZooKeeper, so it may be down to that.
> > >
> >
>
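
For readers tuning the same settings: in ZooKeeper, initLimit and syncLimit are counted in ticks, so what they mean in wall-clock time depends on tickTime. The thread does not say what tickTime the quorum used, so the sketch below assumes 2000 ms (the value in ZooKeeper's sample zoo.cfg). The helper names parse_quorum_limits and limits_in_ms are hypothetical, made up for this illustration.

```python
# Assumption: tickTime is not stated in the thread. 2000 ms is the value
# used in ZooKeeper's sample zoo.cfg; substitute the quorum's real setting.
ASSUMED_TICK_TIME_MS = 2000


def parse_quorum_limits(cfg_text):
    """Read tickTime/initLimit/syncLimit from zoo.cfg-style key=value text."""
    wanted = ("tickTime", "initLimit", "syncLimit")
    limits = {}
    for raw_line in cfg_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in wanted:
            limits[key] = int(value)
    return limits


def limits_in_ms(limits, default_tick_ms=ASSUMED_TICK_TIME_MS):
    """Convert tick-based limits to milliseconds (limit * tickTime)."""
    tick_ms = limits.get("tickTime", default_tick_ms)
    return {
        name: limits[name] * tick_ms
        for name in ("initLimit", "syncLimit")
        if name in limits
    }


# The two values reported in the thread.
cfg = """
initLimit=10
syncLimit=5
"""

timeouts = limits_in_ms(parse_quorum_limits(cfg))
print(timeouts)  # {'initLimit': 20000, 'syncLimit': 10000}
```

Under that assumed tickTime, followers would get 20 seconds to connect and sync with the leader at startup, and could lag up to 10 seconds behind before being dropped. A different tickTime scales both numbers proportionally.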
