Patrick (or Ilan), can you please file a JIRA issue describing the problem? Ideally, also mention the work-around and possible solution ideas.
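For the issue description, the sign flip that Ilan describes in the thread below can be reproduced with a small sketch. It is only modeled on the session ID layout discussed in ZOOKEEPER-1622 (server id in the top byte, time-derived bits below it); the class and method names are made up for illustration, and this is not the actual SessionTrackerImpl code.

```java
/**
 * Sketch only: approximates how a ZooKeeper session id is seeded from the
 * server id. Class and method names are hypothetical, not ZooKeeper API.
 */
final class SessionIdSketch {

    /** Puts the server id in the top 8 bits and time-derived bits below it. */
    static long initialSessionId(long serverId, long timeMillis) {
        // Shift the clock value up, then clear the top byte with an unsigned
        // shift so that only the server id can influence bits 56..63.
        long sid = (timeMillis << 24) >>> 8;
        return sid | (serverId << 56);
    }

    /** Recovers the top byte (the server id) from a session id. */
    static int serverIdOf(long sessionId) {
        return (int) (sessionId >>> 56);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        for (long serverId : new long[] {1, 127, 128, 255}) {
            long sid = initialSessionId(serverId, now);
            System.out.println("server id " + serverId + " -> " + sid
                    + (sid < 0 ? " (negative)" : " (non-negative)"));
        }
    }
}
```

Under this layout, server ids up to 127 leave the sign bit clear, while 128 and above set it, which would explain why wrapping the server id at 127 works as a work-around.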
On Wed, Dec 18, 2024 at 12:27 PM Patrick Lok <patrick....@salesforce.com.invalid> wrote:

> Hi Ilan, thank you so much for the pointers! That's exactly the problem.
> We updated our system to wrap the ZK server ID at 127, instead of 256,
> and that fixed the problem.
>
> Again, thank you so much!
>
> Regards,
> Patrick
>
> On Thu, Dec 5, 2024 at 4:47 PM Ilan Ginzburg <ilans...@gmail.com> wrote:
>
> > That value seems to be the ZooKeeper session id.
> > I've found https://issues.apache.org/jira/browse/ZOOKEEPER-1622 that
> > might be related (but I guess you would have seen the error a while
> > ago so it's likely not that).
> >
> > Also, looking at the session ID generation code (see code in the jira
> > above, SessionTrackerImpl.initializeNextSessionId()), if the server id
> > is bigger than 127 the resulting session id will be negative (if my
> > bit shift analysis skills are still ok).
> > Anything that might have changed there?
> >
> > Doesn't seem to be something that can be reset, it is decided by this
> > method. Solr code should be fixed to do a better job of parsing that
> > string.
> >
> > Ilan
> >
> > On Tue, Dec 3, 2024 at 6:56 PM Patrick Lok
> > <patrick....@salesforce.com.invalid> wrote:
> >
> > > That's what I think is happening too. The problem is the code is not
> > > expecting it to happen and not handling it correctly. I'm wondering
> > > if there's a way to reset it.
> > >
> > > On Tue, Dec 3, 2024 at 3:28 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
> > >
> > > > Didn’t look at the code but from the number of digits wouldn’t it
> > > > be a long wrapping around into negative territory?
> > > >
> > > > On Tue 3 Dec 2024 at 02:55, Patrick Lok
> > > > <patrick....@salesforce.com.invalid> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We are seeing some weird issues with the Overseer ID which cause
> > > > > some overseer election problems in our cluster.
> > > > >
> > > > > Recently we have noticed that one of our Solr 8 clusters is having
> > > > > trouble electing dedicated overseer hosts as leader. After some
> > > > > investigation, we noticed that we have "negative" Overseer IDs
> > > > > (Overseer IDs with a leading dash):
> > > > >
> > > > > [zk: localhost:2181(CONNECTED) 0] ls /overseer_elect/election
> > > > > [-5188057493699159958-1.1.1.15:8983_solr-n_0000192189,
> > > > > -5260098076001480373-1.1.1.19:8983_solr-n_0000192192,
> > > > > -5548288611309897871-1.1.1.28:8983_solr-n_0000192191,
> > > > > -6124715353171356222-1.1.1.18:8983_solr-n_0000192188,
> > > > > -6412935227404643144-1.1.1.22:8983_solr-n_0000192186,
> > > > > -6412935227404648050-1.1.1.89:8983_solr-n_0000192181,
> > > > > -6557083032988176767-1.1.1.105:8983_solr-n_0000192190,
> > > > > -6701159159471144532-1.1.1.219:8983_solr-n_0000192183]
> > > > >
> > > > > (the actual IP addresses are different from what is pasted above)
> > > > >
> > > > > Because of the leading dash in the Overseer ID,
> > > > > LeaderElector.getNodeName() returns
> > > > > "5188057493699159958-1.1.1.15:8983_solr" instead of
> > > > > "1.1.1.15:8983_solr", causing quite a few issues.
> > > > >
> > > > > Does anyone know why we started seeing a leading dash with the
> > > > > initial set of digits in the Overseer ID? Who's generating that
> > > > > set of digits? Solr or ZooKeeper? Is there a way to fix it?
> > > > >
> > > > > A simple change to LeaderElector.NODE_NAME seems to be an easy
> > > > > fix. But since there's no unit test around it, I'm a bit worried
> > > > > that it might break somewhere else in the code.
> > > > >
> > > > > Thanks,
> > > > > Patrick
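
One possible direction for the NODE_NAME change mentioned above, as a sketch only. LENIENT is a hypothetical stand-in for the current pattern, written so that it reproduces the behaviour reported in this thread; STRICT is one hypothetical replacement that expects an optionally signed decimal session id. Neither is copied from Solr, the class and method names are invented for this example, and the real pattern in LeaderElector should be checked before relying on any of it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: hypothetical patterns, not taken from Solr's LeaderElector. */
final class NodeNameSketch {

    // Hypothetical stand-in for the existing pattern: the first group lazily
    // takes everything up to the first dash, so a leading '-' ends it early
    // and the digits of the session id spill into the node name group.
    static final Pattern LENIENT = Pattern.compile(".*?/?(.*?-)(.*?)-n_\\d+");

    // Hypothetical replacement: the session id is an optionally negative
    // decimal number, followed by a dash and then the node name.
    static final Pattern STRICT = Pattern.compile(".*?/?(-?\\d+-)(.*?)-n_\\d+");

    /** Returns the node name (second group) of an election node name. */
    static String nodeName(Pattern pattern, String electionNode) {
        Matcher m = pattern.matcher(electionNode);
        if (!m.matches()) {
            throw new IllegalStateException("Could not find regex match in: " + electionNode);
        }
        return m.group(2);
    }

    public static void main(String[] args) {
        String node = "-5188057493699159958-1.1.1.15:8983_solr-n_0000192189";
        System.out.println("lenient: " + nodeName(LENIENT, node));
        System.out.println("strict:  " + nodeName(STRICT, node));
    }
}
```

A unit test covering both signed and unsigned session ids, with and without the /overseer_elect/election/ path prefix, would address the concern about breaking other callers.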