Never mind, I created a JIRA account and filed the tickets:

https://issues.apache.org/jira/browse/ZOOKEEPER-4904
https://issues.apache.org/jira/browse/SOLR-17694
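
For anyone landing on this thread later, here is a minimal sketch of the sign flip behind ZOOKEEPER-4904. It mirrors the shift arithmetic of the initializeNextSessionId snippet quoted below, with two simplifications that are my own assumptions: the clock value is passed in as a parameter (elapsedMillis) instead of calling Time.currentElapsedTime(), and the CONTAINER_EPHEMERAL_OWNER check is left out. The class name is hypothetical.

```java
class SessionIdSketch {

    // Same shifts as the quoted snippet; elapsedMillis is a stand-in
    // (assumption) for Time.currentElapsedTime() so the result is repeatable.
    static long initializeNextSessionId(long id, long elapsedMillis) {
        long nextSid = (elapsedMillis << 24) >>> 8; // top 8 bits are now zero
        nextSid = nextSid | (id << 56);             // server id fills bits 56..63
        return nextSid;
    }

    public static void main(String[] args) {
        long elapsedMillis = 1_000_000L; // arbitrary sample value
        for (long id : new long[] {1, 127, 128, 255}) {
            long sid = initializeNextSessionId(id, elapsedMillis);
            // Bit 63 is the sign bit of a long, so any id with bit 7 set
            // (128..255) produces a negative session id.
            System.out.println("server id " + id + " -> negative? " + (sid < 0));
        }
    }
}
```

Server IDs up to 127 keep bit 63 clear, which matches the workaround mentioned further down the thread.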
On Fri, Mar 7, 2025 at 2:02 PM Patrick Lok <patrick....@salesforce.com> wrote:

> Hi David and Ilan,
>
> I don't have a JIRA account. Could one of you please create the tickets?
> I've written the tickets below; please make any changes as you see fit.
>
> =================================
> ZooKeeper
>
> Summary:
> SessionTrackerImpl generates negative session IDs when server ID is larger
> than 127
>
> Description:
> This issue was discovered during a
> [discussion|https://lists.apache.org/thread/0pwxw1rzdffmbxctdzv2rmplzgwt6lpl]
> about negative Solr Overseer IDs.
>
> Setting the server ID to a value greater than 127 in the myid file causes
> the SessionTrackerImpl.initializeNextSessionId function to generate a
> negative session ID.
>
> {code:java}
> public static long initializeNextSessionId(long id) {
>     long nextSid;
>     nextSid = (Time.currentElapsedTime() << 24) >>> 8;
>     nextSid = nextSid | (id << 56); // <------------------------
>     if (nextSid == EphemeralType.CONTAINER_EPHEMERAL_OWNER) {
>         ++nextSid; // this is an unlikely edge case, but check it just in case
>     }
>     return nextSid;
> }
> {code}
>
> =================================
> Solr
>
> Summary:
> LeaderElector not able to parse node ID correctly when it has a leading
> dash
>
> Description:
> This issue was
> [reported|https://lists.apache.org/thread/0pwxw1rzdffmbxctdzv2rmplzgwt6lpl]
> on users@solr.apache.org.
>
> There can be times when the node ID contains a leading dash
> {noformat}
> -5188057493699159958-1.1.1.15:8983_solr-n_0000192189
> {noformat}
> instead of just
> {noformat}
> 5188057493699159958-1.1.1.15:8983_solr-n_0000192189
> {noformat}
> In such a case, LeaderElector.getNodeName returns
> *5188057493699159958-1.1.1.15:8983_solr* instead of just
> {*}1.1.1.15:8983_solr{*}.
>
> The problem is that the regex LeaderElector.NODE_NAME was not designed to
> handle the leading dash. LeaderElector.LEADER_SEQ and
> LeaderElector.SESSION_ID seem to have the same problem.
>
> Thanks,
> Patrick
>
> On Sat, Dec 21, 2024 at 12:02 PM David Smiley <dsmi...@apache.org> wrote:
>
>> Patrick (or Ilan), can you please file a JIRA issue to describe the
>> problem? Ideally also mention the work-around and possible solution ideas.
>>
>> On Wed, Dec 18, 2024 at 12:27 PM Patrick Lok
>> <patrick....@salesforce.com.invalid> wrote:
>>
>> > Hi Ilan, thank you so much for the pointers! That's exactly the problem.
>> > We updated our system to wrap the ZK server ID at 127, instead of 256,
>> > and that fixed the problem.
>> >
>> > Again, thank you so much!
>> >
>> > Regards,
>> > Patrick
>> >
>> > On Thu, Dec 5, 2024 at 4:47 PM Ilan Ginzburg <ilans...@gmail.com> wrote:
>> >
>> > > That value seems to be the ZooKeeper session id.
>> > > I've found
>> > > https://urldefense.com/v3/__https://issues.apache.org/jira/browse/ZOOKEEPER-1622__;!!DCbAVzZNrAf4!H2u9K4JvYG1dubi-ktOCh1jsLacpumxkWUOmh_lUjp5yIBlLf_NtARU_H6WSEgGiRk7ddcpi_uMkt6UYbJhr5w$
>> > > that might be related (but I guess you would have seen the error a
>> > > while ago so it's likely not that).
>> > >
>> > > Also, looking at the session ID generation code (see code in the jira
>> > > above, SessionTrackerImpl.initializeNextSessionId()), if the server id
>> > > is bigger than 127 the resulting session id will be negative (if my
>> > > bit shift analysis skills are still ok).
>> > > Anything that might have changed there?
>> > >
>> > > Doesn't seem to be something that can be reset, it is decided by this
>> > > method. Solr code should be fixed to do a better job of parsing that
>> > > string.
>> > >
>> > > Ilan
>> > >
>> > > On Tue, Dec 3, 2024 at 6:56 PM Patrick Lok
>> > > <patrick....@salesforce.com.invalid> wrote:
>> > >
>> > > > That's what I think is happening too. The problem is the code is not
>> > > > expecting it to happen and not handling it correctly. I'm wondering
>> > > > if there's a way to reset it.
>> > > >
>> > > > On Tue, Dec 3, 2024 at 3:28 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
>> > > >
>> > > > > Didn’t look at the code but from the number of digits wouldn’t it
>> > > > > be a long wrapping around into negative territory?
>> > > > >
>> > > > > On Tue 3 Dec 2024 at 02:55, Patrick Lok
>> > > > > <patrick....@salesforce.com.invalid> wrote:
>> > > > >
>> > > > > > Hi,
>> > > > > >
>> > > > > > We are seeing some weird issues with the Overseer ID which cause
>> > > > > > some overseer election problems in our cluster.
>> > > > > >
>> > > > > > Recently we have noticed that one of our Solr 8 clusters is having
>> > > > > > trouble electing dedicated overseer hosts as leader. After some
>> > > > > > investigation, we noticed that we are getting "negative" Overseer
>> > > > > > IDs (Overseer IDs with a leading dash):
>> > > > > >
>> > > > > > [zk: localhost:2181(CONNECTED) 0] ls /overseer_elect/election
>> > > > > > [-5188057493699159958-1.1.1.15:8983_solr-n_0000192189,
>> > > > > > -5260098076001480373-1.1.1.19:8983_solr-n_0000192192,
>> > > > > > -5548288611309897871-1.1.1.28:8983_solr-n_0000192191,
>> > > > > > -6124715353171356222-1.1.1.18:8983_solr-n_0000192188,
>> > > > > > -6412935227404643144-1.1.1.22:8983_solr-n_0000192186,
>> > > > > > -6412935227404648050-1.1.1.89:8983_solr-n_0000192181,
>> > > > > > -6557083032988176767-1.1.1.105:8983_solr-n_0000192190,
>> > > > > > -6701159159471144532-1.1.1.219:8983_solr-n_0000192183]
>> > > > > >
>> > > > > > (the actual IP addresses are different from what is pasted above)
>> > > > > >
>> > > > > > The leading dash in the Overseer ID causes
>> > > > > > LeaderElector.getNodeName() to return
>> > > > > > "5188057493699159958-1.1.1.15:8983_solr" instead of
>> > > > > > "1.1.1.15:8983_solr", causing quite a few issues.
>> > > > > >
>> > > > > > Does anyone know why we started seeing a leading dash with the
>> > > > > > initial set of digits in the Overseer ID? Who's generating that set
>> > > > > > of digits? Solr or ZooKeeper? Is there a way to fix it?
>> > > > > >
>> > > > > > A simple change to LeaderElector.NODE_NAME seems to be an easy fix.
>> > > > > > But since there's no unit test around it, I'm a bit worried that it
>> > > > > > might break somewhere else in the code.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Patrick
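
To illustrate the node-name parsing problem described in this thread, here is a small sketch. Both patterns are hypothetical and written for this example only; they are not copied from the Solr source. NAIVE lazily treats everything up to the first dash as the session ID, and TOLERANT is one possible way to allow an optional leading minus sign. The class and method names are placeholders as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NodeNameSketch {

    // Hypothetical: "session id" is whatever precedes the first dash.
    static final Pattern NAIVE = Pattern.compile("(.*?-)(.*?)-n_\\d+");

    // Hypothetical alternative: the session id is an optionally signed number.
    static final Pattern TOLERANT = Pattern.compile("(-?\\d+-)(.*?)-n_\\d+");

    // Returns the node-name part (group 2) of an election node string.
    static String nodeName(Pattern pattern, String electionNode) {
        Matcher m = pattern.matcher(electionNode);
        if (!m.matches()) {
            throw new IllegalStateException("No match for: " + electionNode);
        }
        return m.group(2);
    }

    public static void main(String[] args) {
        String negative = "-5188057493699159958-1.1.1.15:8983_solr-n_0000192189";
        String positive = "5188057493699159958-1.1.1.15:8983_solr-n_0000192189";

        // With the leading dash, the naive pattern matches an empty session id
        // and the digits end up in the node name.
        System.out.println(nodeName(NAIVE, negative));
        System.out.println(nodeName(TOLERANT, negative));
        System.out.println(nodeName(TOLERANT, positive));
    }
}
```

A unit test along these lines, covering both signed and unsigned session IDs, would also address the worry above about changing NODE_NAME without test coverage.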