[
https://issues.apache.org/jira/browse/IGNITE-15733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430373#comment-17430373
]
Ivan Bessonov commented on IGNITE-15733:
----------------------------------------
It feels like the problem is fundamental. I think it can also be reproduced if
you have a fresh, clean node that starts first and then "accepts" nodes from a
cluster that has previously been activated.
The core reason for the problem is an assumption in the code that the first
node, the one that joins itself into the cluster, will decide the cluster's tag
and ID. As we can see, this is not necessarily correct.
The old solution from the development branch (back in 2019) was to assign the
cluster ID only upon its first activation. But that was changed when we allowed
writes into the distributed metastorage (DMS) before activation. Maybe if we
restore that old behavior for tag and ID specifically, and also wait for the
distributed metastorage data to persist on activation, this will solve the
issue, but I'm not sure.
What I am sure about is that "local join" is not a proper place to initialize
defaults that can differ between nodes. So one way to fix it, as proposed by
[~zstan], is to persist the cluster tag (at least) as soon as you receive it,
right in the discovery thread. This lets you see, after a restart, whether
there was an attempt to set this value before the restart. In that case the
node shouldn't try to set it on local join and should just wait instead. I'm
sure there are corner cases to this solution as well. For example, what if none
of the cluster nodes actually saved the tag and ID in the metastorage? It also
doesn't solve the issue of the first node in the cluster being clean.
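To make the proposal a bit more concrete, here is a rough sketch of the idea of
persisting the tag on receipt and checking for it on local join. It is not
Ignite code: LocalStore, Node, TAG_KEY, onTagReceived and onLocalJoin are all
hypothetical names made up for illustration, and the in-memory map only stands
in for node-local persistent storage. It also does not handle the corner cases
mentioned above.

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class ClusterTagSketch {
    /** Hypothetical stand-in for node-local storage that survives a restart. */
    public static final class LocalStore {
        private final Map<String, String> data = new ConcurrentHashMap<>();

        public void put(String key, String val) { data.put(key, val); }

        public Optional<String> get(String key) { return Optional.ofNullable(data.get(key)); }
    }

    /** Hypothetical node: only the tag handling is modelled. */
    public static final class Node {
        static final String TAG_KEY = "cluster.tag.received"; // assumed key name

        private final LocalStore store;
        private String tag; // null until generated locally or received

        public Node(LocalStore store) { this.store = store; }

        /** Assumed to run in the discovery thread: persist the tag as soon as it arrives. */
        public void onTagReceived(String receivedTag) {
            store.put(TAG_KEY, receivedTag);
            tag = receivedTag;
        }

        /**
         * Assumed to run on local join.
         * @return true if the node generated its own tag, false if it decided to wait.
         */
        public boolean onLocalJoin() {
            if (store.get(TAG_KEY).isPresent())
                return false; // a tag was seen before the restart: do not invent a new one

            tag = "tag-" + UUID.randomUUID(); // clean node: fall back to a generated default
            return true;
        }

        public String tag() { return tag; }
    }

    public static void main(String[] args) {
        LocalStore disk = new LocalStore();

        // First life of the node: it receives a tag from the cluster and persists it.
        new Node(disk).onTagReceived("some_tag");

        // After a restart the same storage is reused, so local join must not generate a tag.
        Node restarted = new Node(disk);
        System.out.println("generated on local join: " + restarted.onLocalJoin());
    }
}
```

A restarted node that finds the persisted marker leaves its tag unset until the
cluster delivers one, while a node with empty storage still generates a default,
which is exactly the "first node is clean" case that stays unsolved.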
These are my thoughts. I have no working solution for now.
> Eventually failure of baseline registration.
> --------------------------------------------
>
> Key: IGNITE-15733
> URL: https://issues.apache.org/jira/browse/IGNITE-15733
> Project: Ignite
> Issue Type: Bug
> Components: general
> Affects Versions: 2.11
> Reporter: Evgeny Stanilovsky
> Priority: Major
> Attachments: _Community_Edition_Cache_9_18998_cut.log
>
>
> All info can be found in the attached log.
> Briefly: a cluster of 2 nodes with persistence. Start the nodes sequentially,
> activate, stop the nodes using org.apache.ignite.Ignite#close, start the
> nodes again, activate.
> Expected:
> 1 node : Cluster ID and tag has been read from metastorage: null
> 2 node : Cluster ID and tag has been read from metastorage: null
> stop
> start
> 1 node: Cluster ID and tag has been read from metastorage: ClusterIdAndTag
> [id=some_id, tag=some_tag]
> 2 node: Cluster ID and tag has been read from metastorage: ClusterIdAndTag
> [id=some_id, tag=some_tag]
> Actual (see the attached log):
>
> 1 node : Cluster ID and tag has been read from metastorage: null
> 2 node : Cluster ID and tag has been read from metastorage: null
> stop
> start
> 1 node: Cluster ID and tag has been read from metastorage: null
> 2 node: Cluster ID and tag has been read from metastorage: ClusterIdAndTag
> [id=some_id, tag=some_tag]
> and as a result:
> _Joining node has conflicting distributed metastorage data_
> [^_Community_Edition_Cache_9_18998_cut.log]
> This test, MetricsConfigurationTest.testNodeRestart [1], is flaky.
> [1][https://ci.ignite.apache.org/buildConfiguration/IgniteTests24Java8_Cache9/6220901?buildTab=tests&name=MetricsConfigurationTes&view=tests&status=passed&suite=org.apache.ignite.testsuites.IgniteCacheTestSuite9%3A+&package=org.apache.ignite.internal.metric&expandedTest=build%3A%28id%3A6220901%29%2Cid%3A576406]
>