Hi David and Ilan,

I don't have a JIRA account. Could one of you please create the tickets? I've drafted them below; please make any changes you see fit.

=================================
ZooKeeper

Summary: SessionTrackerImpl generates negative session IDs when server ID is larger than 127

Description:
This issue was discovered during a [discussion|https://lists.apache.org/thread/0pwxw1rzdffmbxctdzv2rmplzgwt6lpl] about negative Solr Overseer IDs.

Setting the server ID to a value greater than 127 in the myid file causes the SessionTrackerImpl.initializeNextSessionId function to generate a negative session ID.

{code:java}
public static long initializeNextSessionId(long id) {
    long nextSid;
    nextSid = (Time.currentElapsedTime() << 24) >>> 8;
    nextSid = nextSid | (id << 56); // <-- sets the sign bit when id > 127
    if (nextSid == EphemeralType.CONTAINER_EPHEMERAL_OWNER) {
        ++nextSid; // this is an unlikely edge case, but check it just in case
    }
    return nextSid;
}
{code}

=================================
Solr

Summary: LeaderElector not able to parse node ID correctly when it has a leading dash

Description:
This issue was [reported|https://lists.apache.org/thread/0pwxw1rzdffmbxctdzv2rmplzgwt6lpl] on users@solr.apache.org.

There can be times when the node ID contains a leading dash

{noformat}
-5188057493699159958-1.1.1.15:8983_solr-n_0000192189
{noformat}

instead of just

{noformat}
5188057493699159958-1.1.1.15:8983_solr-n_0000192189
{noformat}

In such a case, LeaderElector.getNodeName returns *5188057493699159958-1.1.1.15:8983_solr* instead of just *1.1.1.15:8983_solr*.

The problem is that the regex LeaderElector.NODE_NAME was not designed to handle the leading dash. LeaderElector.LEADER_SEQ and LeaderElector.SESSION_ID seem to have the same problem.

Thanks,
Patrick

On Sat, Dec 21, 2024 at 12:02 PM David Smiley <dsmi...@apache.org> wrote:

> Patrick (or Ilan), can you please file a JIRA issue to describe the problem.
> Ideally also mention the work-around and possible solution ideas.
>
> On Wed, Dec 18, 2024 at 12:27 PM Patrick Lok <patrick....@salesforce.com.invalid> wrote:
>
> > Hi Ilan, thank you so much for the pointers!
> > That's exactly the problem. We updated our system to wrap the ZK server ID
> > at 127, instead of 256, and that fixed the problem.
> >
> > Again, thank you so much!
> >
> > Regards,
> > Patrick
> >
> > On Thu, Dec 5, 2024 at 4:47 PM Ilan Ginzburg <ilans...@gmail.com> wrote:
> >
> > > That value seems to be the ZooKeeper session id.
> > > I've found
> > > https://urldefense.com/v3/__https://issues.apache.org/jira/browse/ZOOKEEPER-1622__;!!DCbAVzZNrAf4!H2u9K4JvYG1dubi-ktOCh1jsLacpumxkWUOmh_lUjp5yIBlLf_NtARU_H6WSEgGiRk7ddcpi_uMkt6UYbJhr5w$
> > > that might be related (but I guess you would have seen the error a while
> > > ago so it's likely not that).
> > >
> > > Also, looking at the session ID generation code (see code in the jira
> > > above, SessionTrackerImpl.initializeNextSessionId()), if the server id
> > > is bigger than 127 the resulting session id will be negative (if my
> > > bit shift analysis skills are still ok).
> > > Anything that might have changed there?
> > >
> > > Doesn't seem to be something that can be reset, it is decided by this
> > > method. Solr code should be fixed to do a better job of parsing that
> > > string.
> > >
> > > Ilan
> > >
> > > On Tue, Dec 3, 2024 at 6:56 PM Patrick Lok <patrick....@salesforce.com.invalid> wrote:
> > >
> > > > That's what I think is happening too. The problem is the code is not
> > > > expecting it to happen and not handling it correctly. I'm wondering if
> > > > there's a way to reset it.
> > > >
> > > > On Tue, Dec 3, 2024 at 3:28 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
> > > >
> > > > > Didn’t look at the code but from the number of digits wouldn’t it be a long
> > > > > wrapping around into negative territory?
> > > > >
> > > > > On Tue 3 Dec 2024 at 02:55, Patrick Lok <patrick....@salesforce.com.invalid> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > We are seeing some weird issues with the Overseer ID which causes some
> > > > > > overseer election problems in our cluster.
> > > > > >
> > > > > > Recently we have noticed that one of our Solr 8 clusters is having trouble
> > > > > > electing dedicated overseer hosts as leader. After some investigation, we
> > > > > > noticed that we are having "negative" Overseer ID (Overseer ID with leading
> > > > > > dash"
> > > > > >
> > > > > > [zk: localhost:2181(CONNECTED) 0] ls /overseer_elect/election
> > > > > > [-5188057493699159958-1.1.1.15:8983_solr-n_0000192189,
> > > > > > -5260098076001480373-1.1.1.19:8983_solr-n_0000192192,
> > > > > > -5548288611309897871-1.1.1.28:8983_solr-n_0000192191,
> > > > > > -6124715353171356222-1.1.1.18:8983_solr-n_0000192188,
> > > > > > -6412935227404643144-1.1.1.22:8983_solr-n_0000192186,
> > > > > > -6412935227404648050-1.1.1.89:8983_solr-n_0000192181,
> > > > > > -6557083032988176767-1.1.1.105:8983_solr-n_0000192190,
> > > > > > -6701159159471144532-1.1.1.219:8983_solr-n_0000192183]
> > > > > >
> > > > > > (the actual IP addresses are different from what pasted above)
> > > > > >
> > > > > > Because of the leading dash in the Overseer ID, it causes the
> > > > > > LeaderElector.getNodeName() to return "5188057493699159958-1.1.1.15:8983_solr"
> > > > > > instead "1.1.1.15:8983_solr" causing quite a bit of issues.
> > > > > >
> > > > > > Does anyone know why we started seeing a leading dash with the initial set
> > > > > > of digits in the Overseer ID? Who's generating that set of digits? Solr or
> > > > > > ZooKeeper? Is there a way to fix it?
> > > > > >
> > > > > > A simple change to LeaderElector.NODE_NAME seems to be an easy fix. But
> > > > > > since there's no unit test around it, I'm a bit worried that it might break
> > > > > > somewhere else in the code.
> > > > > >
> > > > > > Thanks,
> > > > > > Patrick
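
For reference, here is a minimal Java sketch of the two behaviours discussed in this thread. It is not the ZooKeeper or Solr code. `sessionIdFor` repeats the bit arithmetic of the quoted initializeNextSessionId, but takes the elapsed time as a parameter and leaves out the CONTAINER_EPHEMERAL_OWNER check. `NODE_NAME_LIKE` is only an assumption about the shape of LeaderElector.NODE_NAME and should be checked against the Solr source. `NODE_NAME_SIGNED`, `serverIdOf`, and `nodeName` are hypothetical names made up for this illustration, and the signed pattern is one possible idea rather than a tested fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NegativeSessionIdDemo {

    // Same bit arithmetic as the quoted initializeNextSessionId, with the
    // elapsed time passed in so the result is deterministic. The
    // CONTAINER_EPHEMERAL_OWNER check is left out to keep the sketch short.
    static long sessionIdFor(long serverId, long elapsedMillis) {
        long nextSid = (elapsedMillis << 24) >>> 8; // top 8 bits are zero after this
        return nextSid | (serverId << 56);          // server ID lands in the top 8 bits
    }

    // Hypothetical helper: reads the top 8 bits back out of a session ID.
    static long serverIdOf(long sessionId) {
        return sessionId >>> 56;
    }

    // ASSUMPTION: a pattern of the same shape as LeaderElector.NODE_NAME.
    // The first group stops at the first dash, which is the sign itself
    // when the session ID is negative.
    static final Pattern NODE_NAME_LIKE =
            Pattern.compile(".*?/?(.*?-)(.*?)-n_\\d+");

    // Hypothetical variant: the first group is an optionally signed number
    // followed by a dash, so a leading "-" is consumed with the session ID.
    static final Pattern NODE_NAME_SIGNED =
            Pattern.compile(".*?/?(-?\\d+-)(.*?)-n_\\d+");

    // Returns the second capture group, i.e. the part meant to be the node name.
    static String nodeName(Pattern pattern, String electionNode) {
        Matcher m = pattern.matcher(electionNode);
        if (!m.matches()) {
            throw new IllegalStateException("Could not find regex match in: " + electionNode);
        }
        return m.group(2);
    }

    public static void main(String[] args) {
        long elapsed = System.nanoTime() / 1_000_000L;
        System.out.println("server ID 127 -> " + sessionIdFor(127, elapsed));
        System.out.println("server ID 128 -> " + sessionIdFor(128, elapsed));

        String node = "-5188057493699159958-1.1.1.15:8983_solr-n_0000192189";
        System.out.println("unsigned pattern: " + nodeName(NODE_NAME_LIKE, node));
        System.out.println("signed pattern:   " + nodeName(NODE_NAME_SIGNED, node));
    }
}
```

A server ID of 128 or more sets bit 63 of the long, which is the sign bit, so the session ID prints with a leading dash. That dash is then the first dash the unsigned pattern finds.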