I finally understand your question. Back in 2017-2018, there was an effort to stabilize the Discovery SPI. The networking component could fail in various hard-to-reproduce scenarios, affecting cluster availability and consistency. That ticket reminds me of those notorious issues that would fire once a week or once a month under specific configuration settings. So, I would not touch the code that fixes the issue unless @Alexey Goncharuk <alexey.goncha...@gmail.com> or @Sergey Chugunov <schugu...@gridgain.com> confirms that it's safe to do so. Also, there should be a test for this scenario.
-
Denis

On Fri, Jun 5, 2020 at 12:28 AM Vladimir Steshin <vlads...@gmail.com> wrote:

> Denis,
>
> I have no nodes that I'm unable to interconnect. This case is simulated
> in IgniteDiscoveryMassiveNodeFailTest.testMassiveFailSelfKill(),
> introduced in [1].
>
> I’m asking whether it is a real or a supposed problem. Where was it met?
> Which network configurations/issues could cause it?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-7163
>
> 05.06.2020 1:01, Denis Magda wrote:
> > Vladimir,
> >
> > I'm suggesting to share the log files from the nodes that are unable to
> > interconnect so that the community can check them for potential issues.
> > Instead of sharing the logs from all the 5 nodes, try to start a two-node
> > cluster with the nodes that fail to discover each other and attach the
> > logs from those.
> >
> > -
> > Denis
> >
> > On Thu, Jun 4, 2020 at 1:57 PM Vladimir Steshin <vlads...@gmail.com>
> > wrote:
> >
> >> Denis, hi.
> >>
> >> Sorry, I didn’t catch your idea. Are you saying this can happen and
> >> suggesting an experiment? I’m not describing a probable case. It is
> >> already done in [1]. I’m asking whether it is real and where it was met.
> >>
> >> 04.06.2020 23:33, Denis Magda wrote:
> >>> Vladimir,
> >>>
> >>> Please do the following experiment. Start a 2-node cluster booting
> >>> node 3 and, for instance, node 5. Those won't be able to interconnect
> >>> according to your description. Attach the log files from both nodes
> >>> for analysis. This should be a networking issue.
> >>>
> >>> -
> >>> Denis
> >>>
> >>> On Thu, Jun 4, 2020 at 1:24 PM Vladimir Steshin <vlads...@gmail.com>
> >>> wrote:
> >>>> Hi, Igniters.
> >>>>
> >>>> I wanted to ask how one node may not be able to connect to another
> >>>> whereas the rest of the cluster can. This got covered in [1]. In
> >>>> short: node 3 can't connect to nodes 4 and 5 but can to 1. At the
> >>>> same time, node 2 can connect to 4.
> >>>>
> >>>> Questions:
> >>>>
> >>>> 1) Is it a real case? Where did this problem come from?
> >>>>
> >>>> 2) If node 3 can’t connect to 4 and 5, does it mean node 2 can’t
> >>>> connect to 4 (and 5) too?
> >>>>
> >>>> Sergey, Dmitry, maybe you can shed some light (I see you in [1])? I'm
> >>>> participating in [2] and found this backward connection checking.
> >>>> Answering would help us a lot.
> >>>>
> >>>> Thanks!
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/IGNITE-7163
> >>>>
> >>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up