Great to hear, Dyana. Thanks for the update.
Cheers,
Till
On Fri, Jun 7, 2019 at 2:48 PM dyana.rose wrote:
Just wanted to give an update on this.
Our ops team and I independently came to the same conclusion that our
ZooKeeper quorum was having syncing issues.
After a bit more research, they have updated the initLimit and syncLimit in the
quorum configs to:
initLimit=10
syncLimit=5
After this c
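For reference, initLimit and syncLimit are expressed in ticks, so the actual timeouts depend on tickTime. A minimal sketch of the arithmetic follows, assuming tickTime=2000 ms (the value in ZooKeeper's sample config; the thread does not say what this quorum uses). The function name is made up for illustration.

```python
def quorum_timeouts_ms(tick_time_ms, init_limit, sync_limit):
    """Convert ZooKeeper's tick-based limits into milliseconds."""
    return {
        # Time a follower gets to connect to the leader and do its initial sync.
        "init_timeout_ms": tick_time_ms * init_limit,
        # How far a follower may lag behind the leader before it is dropped.
        "sync_timeout_ms": tick_time_ms * sync_limit,
    }


# tick_time_ms=2000 is an assumption, not a value taken from the thread.
timeouts = quorum_timeouts_ms(tick_time_ms=2000, init_limit=10, sync_limit=5)
print(timeouts)
```

Under that assumption, followers get 20 seconds for the initial sync and 10 seconds for ongoing syncs.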
Thanks for the update, Dyana. I'm also not an expert in running one's own
ZooKeeper cluster. It might be related to setting up the ZooKeeper cluster
properly. Maybe someone else from the community has experience with
this. Therefore, I'm cross-posting this thread to the user ML again to have
a wider
Like all the best problems, I can't get this to reproduce locally.
Everything has worked as expected. I started up a test job with 5 retained
checkpoints, let it run and watched the nodes in zookeeper.
Then shut down and restarted the Flink cluster.
The ephemeral lock nodes in the retained chec
It would be awesome to get the DEBUG logs for JobMaster,
ZooKeeper, ZooKeeperCompletedCheckpointStore,
ZooKeeperStateHandleStore, CheckpointCoordinator.
Cheers,
Till
On Tue, Apr 23, 2019 at 2:37 PM Dyana Rose wrote:
It may take me a bit to get the logs as we're not always in a situation where
we've got enough hands free to run through the scenarios for a day.
Is that DEBUG JobManager, DEBUG ZooKeeper, or both you'd be interested in?
Thanks,
Dyana
On Tue, 23 Apr 2019 at 13:23, Till Rohrmann wrote:
Hi Dyana,
your analysis is almost correct. The only part that is missing is that the
lock nodes are created as ephemeral nodes. This should ensure that if a JM
process dies, the lock nodes will get removed by ZooKeeper. It depends
a bit on ZooKeeper's configuration how long it takes until Zk
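To illustrate the point about ephemeral lock nodes, here is a toy in-memory model of the two ZooKeeper rules involved: a node with children cannot be deleted, and ephemeral nodes disappear when their owning session ends. It is only a sketch, not Flink's or ZooKeeper's actual code; the class, paths, and session name are hypothetical.

```python
class ToyZk:
    """In-memory stand-in for the ZooKeeper rules discussed above (not a real client)."""

    def __init__(self):
        # path -> owning session id; None marks a persistent node
        self.nodes = {}

    def create(self, path, session=None):
        self.nodes[path] = session

    def children(self, path):
        prefix = path.rstrip("/") + "/"
        return [p for p in self.nodes if p.startswith(prefix)]

    def delete(self, path):
        # ZooKeeper refuses to delete a node that still has children.
        if self.children(path):
            return False
        del self.nodes[path]
        return True

    def expire_session(self, session):
        # Ephemeral nodes are removed once their owning session ends.
        for path in [p for p, s in self.nodes.items() if s == session]:
            del self.nodes[path]


zk = ToyZk()
zk.create("/checkpoints/job/0001")                            # hypothetical checkpoint node
zk.create("/checkpoints/job/0001/lock-old-jm", session="old-jm")  # ephemeral lock of the old JM

blocked = zk.delete("/checkpoints/job/0001")   # lock child still present
zk.expire_session("old-jm")                    # old JM's session times out
removed = zk.delete("/checkpoints/job/0001")   # now the node can go
print(blocked, removed)
```

In this model, the checkpoint node stays undeletable for as long as the old session is considered alive, which is why the session timeout matters.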
Flink v1.7.1
After a Flink reboot we've been seeing some unexpected issues with excess
retained checkpoints that cannot be removed from ZooKeeper after a new
checkpoint is created.
I believe I've got my head around the role of ZK and lockNodes in checkpointing
after going through the cod