I finally understand your question.

Back in 2017-2018, there was an effort to stabilize the Discovery SPI. The
networking component could fail in various hard-to-reproduce scenarios,
affecting cluster availability and consistency. That ticket reminds me of
those notorious issues that would fire once a week or once a month under
specific configuration settings. So I would not touch the code that fixes
the issue unless @Alexey Goncharuk <alexey.goncha...@gmail.com> or
@Sergey Chugunov <schugu...@gridgain.com> confirms that it is safe to do
so. Also, there should be a test for this scenario.

-
Denis


On Fri, Jun 5, 2020 at 12:28 AM Vladimir Steshin <vlads...@gmail.com> wrote:

> Denis,
>
> I don't have any nodes that fail to interconnect. This case is simulated
> in IgniteDiscoveryMassiveNodeFailTest.testMassiveFailSelfKill(), which
> was introduced in [1].
>
> I’m asking whether this is a real problem or a hypothetical one. Where
> was it encountered? Which network configurations or issues could cause it?
>
>
> [1] https://issues.apache.org/jira/browse/IGNITE-7163
>
> On 05.06.2020 1:01, Denis Magda wrote:
> > Vladimir,
> >
> > I suggest sharing the log files from the nodes that are unable to
> > interconnect so that the community can check them for potential issues.
> > Instead of sharing the logs from all 5 nodes, try to start a two-node
> > cluster with the nodes that fail to discover each other and attach the
> > logs from those.
> >
> > -
> > Denis
> >
> >
> > On Thu, Jun 4, 2020 at 1:57 PM Vladimir Steshin <vlads...@gmail.com> wrote:
> >
> >> Denis, hi.
> >>
> >>       Sorry, I didn’t catch your idea. Are you saying that this can
> >> happen, and suggesting an experiment? I’m not describing a hypothetical
> >> case. It is already covered in [1]. I’m asking whether it is real and
> >> where it was encountered.
> >>
> >>
> >> On 04.06.2020 23:33, Denis Magda wrote:
> >>> Vladimir,
> >>>
> >>> Please do the following experiment. Start a 2-node cluster by booting
> >>> node 3 and, for instance, node 5. Those won't be able to interconnect
> >>> according to your description. Attach the log files from both nodes for
> >>> analysis. This should be a networking issue.
> >>>
> >>> -
> >>> Denis
> >>>
> >>>
> >>> On Thu, Jun 4, 2020 at 1:24 PM Vladimir Steshin <vlads...@gmail.com> wrote:
> >>>>        Hi, Igniters.
> >>>>
> >>>>
> >>>>        I wanted to ask how one node might be unable to connect to
> >>>> another while the rest of the cluster can. This case is covered in
> >>>> [1]. In short: node 3 can't connect to nodes 4 and 5 but can connect
> >>>> to node 1. At the same time, node 2 can connect to node 4. Questions:
> >>>>
> >>>> 1) Is this a real case? Where did this problem come from?
> >>>>
> >>>> 2) If node 3 can’t connect to nodes 4 and 5, does that mean node 2
> >>>> can’t connect to 4 (and 5) either?
> >>>>
> >>>> Sergey, Dmitry, maybe you could shed some light (I see you in [1])?
> >>>> I'm participating in [2] and came across this backward connection
> >>>> check. An answer would help us a lot.
> >>>>
> >>>> Thanks!
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/IGNITE-7163
> >>>>
> >>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up
>
