Patrick (or Ilan), could you please file a JIRA issue describing the problem?
Ideally, also mention the workaround and possible solution ideas.
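
For the issue description, here is a rough sketch of the arithmetic Ilan describes below. It is paraphrased from memory of ZooKeeper's SessionTrackerImpl.initializeNextSessionId(), not copied from the source, and the class and method names (SessionIdSketch, initialSessionId) are made up for illustration. The point is only that the server id is shifted into the top 8 bits of a long, so any id above 127 sets the sign bit.

```java
// Sketch only: paraphrases the session-id arithmetic attributed to
// SessionTrackerImpl.initializeNextSessionId(); names here are hypothetical.
final class SessionIdSketch {

    // The time-based part is cleared out of the top 8 bits by the unsigned
    // shift (>>>); the server id is then OR-ed into those top 8 bits.
    static long initialSessionId(long serverId, long elapsedMillis) {
        long sid = (elapsedMillis << 24) >>> 8;
        return sid | (serverId << 56);
    }

    public static void main(String[] args) {
        // Stand-in clock value; ZooKeeper uses its own time source.
        long now = System.currentTimeMillis();
        for (long serverId : new long[] {1, 127, 128, 255}) {
            long sid = initialSessionId(serverId, now);
            System.out.println("server id " + serverId + " -> session id " + sid
                    + (sid < 0 ? " (negative)" : ""));
        }
    }
}
```

Under these assumptions, server ids 1 and 127 give positive session ids while 128 and 255 give negative ones, which would match the work-around of wrapping the server id at 127.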

On Wed, Dec 18, 2024 at 12:27 PM Patrick Lok
<patrick....@salesforce.com.invalid> wrote:

> Hi Ilan, thank you so much for the pointers! That's exactly the problem. We
> updated our system to wrap the ZK server ID at 127, instead of 256, and
> that fixed the problem.
>
> Again, thank you so much!
>
> Regards,
> Patrick
>
>
> On Thu, Dec 5, 2024 at 4:47 PM Ilan Ginzburg <ilans...@gmail.com> wrote:
>
> > That value seems to be the ZooKeeper session id.
> > I've found https://issues.apache.org/jira/browse/ZOOKEEPER-1622 that
> > might be related (but I guess you would have seen the error a while
> > ago so it's likely not that).
> >
> > Also, looking at the session ID generation code (see code in the jira
> > above, SessionTrackerImpl.initializeNextSessionId()), if the server id
> > is bigger than 127 the resulting session id will be negative (if my
> > bit shift analysis skills are still ok).
> > Anything that might have changed there?
> >
> > It doesn't seem to be something that can be reset; it is decided by this
> > method. Solr code should be fixed to do a better job of parsing that
> > string.
> >
> > Ilan
> >
> >
> > On Tue, Dec 3, 2024 at 6:56 PM Patrick Lok
> > <patrick....@salesforce.com.invalid> wrote:
> > >
> > > That's what I think is happening too. The problem is the code is not
> > > expecting it to happen and not handling it correctly. I'm wondering if
> > > there's a way to reset it.
> > >
> > > > On Tue, Dec 3, 2024 at 3:28 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
> > >
> > > > > Didn’t look at the code, but from the number of digits, wouldn’t it be a
> > > > > long wrapping around into negative territory?
> > > >
> > > > > On Tue 3 Dec 2024 at 02:55, Patrick Lok
> > > > > <patrick....@salesforce.com.invalid> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > We are seeing some weird issues with the Overseer ID, which cause some
> > > > > > overseer election problems in our cluster.
> > > > >
> > > > > > Recently we noticed that one of our Solr 8 clusters is having trouble
> > > > > > electing dedicated overseer hosts as leader. After some investigation, we
> > > > > > found that we have "negative" Overseer IDs (Overseer IDs with a leading
> > > > > > dash):
> > > > >
> > > > > > [zk: localhost:2181(CONNECTED) 0] ls /overseer_elect/election
> > > > > > [-5188057493699159958-1.1.1.15:8983_solr-n_0000192189,
> > > > > > -5260098076001480373-1.1.1.19:8983_solr-n_0000192192,
> > > > > > -5548288611309897871-1.1.1.28:8983_solr-n_0000192191,
> > > > > > -6124715353171356222-1.1.1.18:8983_solr-n_0000192188,
> > > > > > -6412935227404643144-1.1.1.22:8983_solr-n_0000192186,
> > > > > > -6412935227404648050-1.1.1.89:8983_solr-n_0000192181,
> > > > > > -6557083032988176767-1.1.1.105:8983_solr-n_0000192190,
> > > > > > -6701159159471144532-1.1.1.219:8983_solr-n_0000192183]
> > > > >
> > > > >
> > > > > > (the actual IP addresses are different from what is pasted above)
> > > > >
> > > > > > Because of the leading dash in the Overseer ID,
> > > > > > LeaderElector.getNodeName() returns
> > > > > > "5188057493699159958-1.1.1.15:8983_solr" instead of
> > > > > > "1.1.1.15:8983_solr", causing quite a few issues.
> > > > >
> > > > > > Does anyone know why we started seeing a leading dash with the initial
> > > > > > set of digits in the Overseer ID? Who's generating that set of digits?
> > > > > > Solr or ZooKeeper? Is there a way to fix it?
> > > > >
> > > > > > A simple change to LeaderElector.NODE_NAME seems to be an easy fix. But
> > > > > > since there's no unit test around it, I'm a bit worried that it might
> > > > > > break somewhere else in the code.
> > > > >
> > > > > Thanks,
> > > > > Patrick
> > > > >
> > > >
> >
>
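
As one possible solution idea for the parsing problem quoted above (LeaderElector.getNodeName() keeping the session-id digits in the node name), here is a minimal sketch. The CURRENT pattern is reproduced from memory of LeaderElector.NODE_NAME and should be treated as an assumption; it does reproduce the behaviour Patrick reported. TOLERANT is a hypothetical variant that accepts an optional leading dash, not a tested patch, and the class and helper names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NodeNameSketch {
    // Assumed shape of the existing LeaderElector.NODE_NAME pattern (from memory).
    static final Pattern CURRENT = Pattern.compile(".*?/?(.*?-)(.*?)-n_\\d+");

    // Hypothetical variant: the session id is an optionally negative decimal
    // number followed by a dash, so a leading '-' stays with the id.
    static final Pattern TOLERANT = Pattern.compile(".*?/?(-?\\d+-)(.*?)-n_\\d+");

    // Returns the node-name part (group 2) of an election node name.
    static String nodeName(Pattern pattern, String electionNode) {
        Matcher m = pattern.matcher(electionNode);
        if (!m.matches()) {
            throw new IllegalStateException("Could not find regex match in: " + electionNode);
        }
        return m.group(2);
    }

    public static void main(String[] args) {
        String node = "-5188057493699159958-1.1.1.15:8983_solr-n_0000192189";
        System.out.println("current:  " + nodeName(CURRENT, node));
        System.out.println("tolerant: " + nodeName(TOLERANT, node));
    }
}
```

A unit test along these lines, covering both positive and negative session ids, would also address the concern about changing the pattern without test coverage.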
