Hi Ilan, thank you so much for the pointers! That's exactly the problem. We
updated our system to wrap the ZK server ID at 127, instead of 256, and
that fixed the problem.
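
For anyone who finds this thread later, here is a minimal sketch of why a server ID above 127 produces a negative session ID. It is modeled on the shift logic Ilan pointed to in SessionTrackerImpl.initializeNextSessionId(), written from memory and not copied from the ZooKeeper source. The class and method names are made up for illustration, and the assumption is that the server ID is shifted into the top byte of a signed 64-bit long.

```java
// Illustrative sketch only: SessionIdSketch and initialSessionId are
// hypothetical names, not ZooKeeper APIs.
final class SessionIdSketch {

    // Assumption: the server id is placed in bits 56..63 of the session id,
    // and the timestamp fills the bits below it.
    static long initialSessionId(long serverId, long nowMillis) {
        // Unsigned shift right, so the top 8 bits of the time part are zero.
        long timePart = (nowMillis << 24) >>> 8;
        // Any server id with bit 7 set (128 and above) sets bit 63,
        // which is the sign bit of a Java long.
        return timePart | (serverId << 56);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        for (long id : new long[] {1, 127, 128, 255}) {
            System.out.printf("server id %3d -> session id %d%n",
                    id, initialSessionId(id, now));
        }
    }
}
```

Under that assumption, IDs up to 127 keep the sign bit clear, which matches what we saw after changing the wrap point.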

Again, thank you so much!

Regards,
Patrick


On Thu, Dec 5, 2024 at 4:47 PM Ilan Ginzburg <ilans...@gmail.com> wrote:

> That value seems to be the ZooKeeper session id.
> I've found https://issues.apache.org/jira/browse/ZOOKEEPER-1622 that
> might be related (but I guess you would have seen the error a while
> ago so it's likely not that).
>
> Also, looking at the session ID generation code (see code in the jira
> above, SessionTrackerImpl.initializeNextSessionId()), if the server id
> is bigger than 127 the resulting session id will be negative (if my
> bit shift analysis skills are still ok).
> Anything that might have changed there?
>
> Doesn't seem to be something that can be reset, it is decided by this
> method. Solr code should be fixed to do a better job of parsing that
> string.
>
> Ilan
>
>
> On Tue, Dec 3, 2024 at 6:56 PM Patrick Lok
> <patrick....@salesforce.com.invalid> wrote:
> >
> > That's what I think is happening too. The problem is the code is not
> > expecting it to happen and not handling it correctly. I'm wondering if
> > there's a way to reset it.
> >
> > On Tue, Dec 3, 2024 at 3:28 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
> >
> > > Didn’t look at the code but from the number of digits wouldn’t it be a
> > > long wrapping around into negative territory?
> > >
> > > On Tue 3 Dec 2024 at 02:55, Patrick Lok
> > > <patrick....@salesforce.com.invalid> wrote:
> > >
> > > > Hi,
> > > >
> > > > We are seeing some weird issues with the Overseer ID, which are causing
> > > > overseer election problems in our cluster.
> > > >
> > > > Recently we noticed that one of our Solr 8 clusters is having trouble
> > > > electing dedicated overseer hosts as leader. After some investigation, we
> > > > found that we have "negative" Overseer IDs (Overseer IDs with a leading
> > > > dash):
> > > >
> > > > [zk: localhost:2181(CONNECTED) 0] ls /overseer_elect/election
> > > > [-5188057493699159958-1.1.1.15:8983_solr-n_0000192189,
> > > > -5260098076001480373-1.1.1.19:8983_solr-n_0000192192,
> > > > -5548288611309897871-1.1.1.28:8983_solr-n_0000192191,
> > > > -6124715353171356222-1.1.1.18:8983_solr-n_0000192188,
> > > > -6412935227404643144-1.1.1.22:8983_solr-n_0000192186,
> > > > -6412935227404648050-1.1.1.89:8983_solr-n_0000192181,
> > > > -6557083032988176767-1.1.1.105:8983_solr-n_0000192190,
> > > > -6701159159471144532-1.1.1.219:8983_solr-n_0000192183]
> > > >
> > > >
> > > > (the actual IP addresses are different from what is pasted above)
> > > >
> > > > The leading dash in the Overseer ID causes
> > > > LeaderElector.getNodeName() to return
> > > > "5188057493699159958-1.1.1.15:8983_solr" instead of "1.1.1.15:8983_solr",
> > > > which causes quite a few issues.
> > > >
> > > > Does anyone know why we started seeing a leading dash before the initial
> > > > set of digits in the Overseer ID? Who generates that set of digits, Solr
> > > > or ZooKeeper? Is there a way to fix it?
> > > >
> > > > A simple change to LeaderElector.NODE_NAME seems to be an easy fix. But
> > > > since there's no unit test around it, I'm a bit worried that it might
> > > > break somewhere else in the code.
> > > >
> > > > Thanks,
> > > > Patrick
> > > >
> > >
>
