[jira] [Commented] (IGNITE-15733) Eventually failure of baseline registration.

Ivan Bessonov (Jira) Tue, 19 Oct 2021 00:44:09 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-15733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430373#comment-17430373
 ]


Ivan Bessonov commented on IGNITE-15733:
----------------------------------------

The problem seems fundamental. I think it can also be reproduced with a fresh, 
clean node that starts first and then "accepts" nodes from a cluster that has 
previously been activated.

The core reason for the problem is an assumption in the code that the first 
node, which joins itself into the cluster, will decide the cluster's tag and 
id. As we see, this is not necessarily correct.

The old solution from the development branch (back in 2019) was to assign the 
cluster id only upon its first activation. But that was changed when we allowed 
writes to the DMS (distributed metastorage) before activation. Maybe restoring 
that old behavior for tag & id specifically, and also waiting for distributed 
metastorage data to persist on activation, would solve the issue, but I'm not 
sure.
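
As a rough illustration of that old behavior (a sketch with hypothetical names, 
not the actual Ignite code): the id stays unset through joins and is generated 
only on the first activation.

```java
import java.util.UUID;

/** Hypothetical model for illustration only; none of these names exist in Ignite. */
final class ActivationTimeIdAssigner {
    /** Cluster id; null until the cluster is activated for the first time. */
    private UUID clusterId;

    /** Local join never generates an id; it only reports what is already known. */
    UUID onLocalJoin() {
        return clusterId;
    }

    /** First activation assigns the id; later activations keep it unchanged. */
    UUID onActivate() {
        if (clusterId == null)
            clusterId = UUID.randomUUID();

        return clusterId;
    }
}
```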

What I am sure about is that "local join" is not a proper place to initialize 
defaults that can differ between nodes. So one way to fix it, as proposed by 
[~zstan], is to persist the cluster tag (at least) as soon as you receive it, 
right in the discovery thread. This lets you see whether there was an attempt 
to set this value before the restart. In that case the node shouldn't try to 
set it on local join and should instead just wait. I'm sure there are corner 
cases to this solution as well. For example, what if none of the cluster nodes 
actually saved the tag & id in metastorage? This also doesn't solve the issue 
of the first node in the cluster being clean.
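
The decision on local join could be sketched roughly as below. This is only a 
model under my assumptions: the class, the enum and the method names are 
hypothetical and are not Ignite APIs, and "persisting" is reduced to a field.

```java
/** Hypothetical model for illustration only; none of these names exist in Ignite. */
final class ClusterTagBootstrap {
    /** What a node does about the cluster tag when it joins. */
    enum Action { GENERATE_NEW, WAIT_FOR_CLUSTER }

    /** Tag saved from the discovery thread; null on a clean node. */
    private String persistedTag;

    /** Called when a tag arrives over discovery: save it immediately. */
    void onTagReceived(String tag) {
        persistedTag = tag;
    }

    /** Decides on local join whether this node may set the tag itself. */
    Action onLocalJoin(boolean firstInTopology) {
        // A tag was seen before the restart, so do not overwrite it: just wait.
        if (persistedTag != null)
            return Action.WAIT_FOR_CLUSTER;

        // A clean node that starts first still generates a new tag,
        // which is the corner case this approach does not solve.
        return firstInTopology ? Action.GENERATE_NEW : Action.WAIT_FOR_CLUSTER;
    }
}
```

A restarted node that had received a tag waits even if it comes up first, 
while a clean first node still generates its own tag.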

These are my thoughts. I have no working solution for now.

 

> Eventually failure of baseline registration.
> --------------------------------------------
>
>                 Key: IGNITE-15733
>                 URL: https://issues.apache.org/jira/browse/IGNITE-15733
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 2.11
>            Reporter: Evgeny Stanilovsky
>            Priority: Major
>         Attachments: _Community_Edition_Cache_9_18998_cut.log
>
>
> All the info can be found in the attached log.
> Briefly: a cluster of 2 nodes with persistence; sequentially start the nodes, 
> activate, stop the nodes using org.apache.ignite.Ignite#close, start the 
> nodes, activate.
> Expected:
> 1 node : Cluster ID and tag has been read from metastorage: null
> 2 node : Cluster ID and tag has been read from metastorage: null
> stop
> start
> 1 node: Cluster ID and tag has been read from metastorage: ClusterIdAndTag 
> [id=some_id, tag=some_tag]
> 2 node: Cluster ID and tag has been read from metastorage: ClusterIdAndTag 
> [id=some_id, tag=some_tag]
> But actually obtained (see the attachment):
>  
> 1 node : Cluster ID and tag has been read from metastorage: null
> 2 node : Cluster ID and tag has been read from metastorage: null
> stop
> start
> 1 node: Cluster ID and tag has been read from metastorage: null
> 2 node: Cluster ID and tag has been read from metastorage: ClusterIdAndTag 
> [id=some_id, tag=some_tag]
> and as a result:
> _Joining node has conflicting distributed metastorage data_
> [^_Community_Edition_Cache_9_18998_cut.log]
> The test MetricsConfigurationTest.testNodeRestart [1] is flaky.
> [1][https://ci.ignite.apache.org/buildConfiguration/IgniteTests24Java8_Cache9/6220901?buildTab=tests&name=MetricsConfigurationTes&view=tests&status=passed&suite=org.apache.ignite.testsuites.IgniteCacheTestSuite9%3A+&package=org.apache.ignite.internal.metric&expandedTest=build%3A%28id%3A6220901%29%2Cid%3A576406]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
