[
https://issues.apache.org/jira/browse/CASSANDRA-18319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17700384#comment-17700384
]
Raymond Huffman commented on CASSANDRA-18319:
---------------------------------------------
I've implemented a dtest that reproduces the issue:
https://github.com/apache/cassandra-dtest/pull/215
I've confirmed that this test fails on v3.0.28 and v3.11.14. Logs from these
test runs are attached: [^test_decommission_after_ip_change_logs.zip]
In the test, the node at 127.0.0.6 changes its IP to 127.0.0.9.
The test performs the following:
* creates a 6 node cluster
* changes the IP of Node6 from {{127.0.0.6}} to {{127.0.0.9}}
* performs a rolling restart on the cluster
* decommissions Node6
* asserts that the log {{"Node /127.0.0.6 is now part of the cluster"}} does
not appear after the rolling restart.
Running nodetool status a few seconds after the decommission looks like this:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  188.92 KiB  1       16.7%             82d3d6c1-c4c8-4bdc-afc5-5827cd17544a  rack1
UN  127.0.0.2  162.84 KiB  1       16.7%             1399d77b-06d0-4d3b-9248-dbe24486a310  rack1
UN  127.0.0.3  162.07 KiB  1       16.7%             1289aa44-e4a6-422f-ab72-b5daf53a55d2  rack1
UN  127.0.0.4  162.48 KiB  1       16.7%             b38c9f92-b651-4660-941c-ca2072d24501  rack1
UN  127.0.0.5  188.36 KiB  1       16.7%             7125bfc7-a519-419b-ab1a-e9995aed40d9  rack1
?N  127.0.0.6  110.25 KiB  1       16.7%             29cbf560-7686-49f6-a06a-7184ebd42aa2  rack1
Gossipinfo looks like this: [^3.11_gossipinfo.zip]
> Cassandra in Kubernetes: IP switch decommission issue
> -----------------------------------------------------
>
> Key: CASSANDRA-18319
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18319
> Project: Cassandra
> Issue Type: Bug
> Reporter: Ines Potier
> Priority: Normal
> Attachments: 3.11_gossipinfo.zip, node1_gossipinfo.txt,
> test_decommission_after_ip_change_logs.zip
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We have recently encountered a recurring issue, while testing decommissions
> on some of our Kubernetes Cassandra staging clusters, in which a
> decommissioned node's old IP reappears in the cluster.
> *Issue Description*
> In Kubernetes, a Cassandra node can change IP at each pod bounce. We have
> noticed that this behavior, combined with a decommission operation, can leave
> the cluster in an erroneous state.
> Consider the following situation: a Cassandra node {{node1}}, with
> {{hostId1}}, owning 20.5% of the token ring, bounces and switches IP
> ({{old_IP}} → {{new_IP}}). After a couple of gossip iterations, every other
> node's nodetool status output includes a {{new_IP}} UN entry owning 20.5% of
> the token ring and no {{old_IP}} entry.
> Shortly after the bounce, {{node1}} gets decommissioned. Our cluster does not
> have a lot of data, so the decommission operation completes quickly. Logs on
> other nodes start showing acknowledgment that {{node1}} has left, and soon
> the {{new_IP}} UL entry disappears from nodetool status. {{node1}}'s pod is
> deleted.
> After about a minute's delay, the cluster enters the erroneous state: an
> {{old_IP}} DN entry reappears in nodetool status, owning 20.5% of the token
> ring. No node owns this IP anymore, and according to the logs {{old_IP}} is
> still associated with {{hostId1}}.
> *Issue Root Cause*
> By digging through Cassandra logs and re-testing this scenario repeatedly, we
> have reached the following conclusions:
> * Other nodes continue exchanging gossip about {{old_IP}}, even after it
> becomes a fatClient.
> * The fatClient timeout and subsequent quarantine do not stop {{old_IP}} from
> reappearing in a node's Gossip state once its quarantine is over. We believe
> this is due to a misalignment of {{old_IP}}'s expiration time across nodes.
> * Once {{new_IP}} has left the cluster and {{old_IP}}'s next gossip state
> message is received by a node, StorageService will no longer face a collision
> (or will, but with an even older IP) for {{hostId1}} and its corresponding
> tokens. As a result, {{old_IP}} regains ownership of 20.5% of the token ring.
> A simplified sketch of this collision handling follows after this list.
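> The sketch below illustrates the host ID collision resolution referenced
> above, roughly as it happens in {{StorageService.handleStateNormal}} on the
> 3.x line. It is a paraphrase for illustration only, not verbatim Cassandra
> source; the class and method names are ours:
> {code:java}
> // Simplified illustration (not verbatim Cassandra source) of the host ID
> // collision resolution described above. While new_IP is alive, old_IP keeps
> // losing this comparison; once new_IP has left the ring, there is no newer
> // endpoint left to collide with, so old_IP's gossiped state is applied again.
> import java.net.InetAddress;
> import java.util.UUID;
>
> import org.apache.cassandra.gms.Gossiper;
> import org.apache.cassandra.locator.TokenMetadata;
>
> public final class HostIdCollisionSketch
> {
>     public static void onStatusNormal(TokenMetadata tokenMetadata,
>                                       InetAddress endpoint,
>                                       UUID hostId)
>     {
>         InetAddress existing = tokenMetadata.getEndpointForHostId(hostId);
>         if (existing != null && !existing.equals(endpoint))
>         {
>             // Two endpoints claim the same host ID: the more recently started
>             // one (higher gossip generation) wins the tokens.
>             if (Gossiper.instance.compareEndpointStartup(endpoint, existing) > 0)
>             {
>                 tokenMetadata.removeEndpoint(existing);
>                 tokenMetadata.updateHostId(hostId, endpoint);
>             }
>             // The real code also removes the losing endpoint from gossip (with
>             // quarantine), but per this ticket its state can be re-learned from
>             // other nodes once the quarantine expires.
>         }
>         else
>         {
>             tokenMetadata.updateHostId(hostId, endpoint);
>         }
>     }
> }
> {code}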
> *Proposed fix*
> Following the above investigation, we were thinking about implementing the
> following fix:
> When a node receives a gossip status change with {{STATE_LEFT}} for a leaving
> endpoint {{new_IP}}, before evicting {{new_IP}} from the token ring, purge
> from Gossip (i.e. {{evictFromMembership}}) all endpoints that meet the
> following criteria (a sketch follows after this list):
> * {{endpointStateMap}} contains this endpoint
> * The endpoint is not currently a token owner
> ({{!tokenMetadata.isMember(endpoint)}})
> * The endpoint's {{hostId}} matches the {{hostId}} of {{new_IP}}
> * The endpoint is older than the leaving endpoint {{new_IP}}
> ({{Gossiper.instance.compareEndpointStartup}})
> * The endpoint's token range (from {{endpointStateMap}}) intersects with
> {{new_IP}}'s
> The intention of this modification is to force nodes to realign on
> {{old_IP}}'s expiration and to expunge it from Gossip, so that it does not
> reappear after {{new_IP}} leaves the ring.
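> A minimal sketch of that purge, assuming it is invoked from the
> {{STATE_LEFT}} handling before {{new_IP}} is evicted from the token ring. The
> class/method names and the token-extraction helper are illustrative only, and
> {{evictFromMembership}} may need to be exposed through an appropriate
> {{Gossiper}} entry point:
> {code:java}
> // Illustrative sketch only, not a tested patch.
> import java.net.InetAddress;
> import java.util.Collection;
> import java.util.Map;
> import java.util.UUID;
>
> import org.apache.cassandra.dht.Token;
> import org.apache.cassandra.gms.EndpointState;
> import org.apache.cassandra.gms.Gossiper;
> import org.apache.cassandra.locator.TokenMetadata;
>
> public final class LeftEndpointPurgeSketch
> {
>     public static void purgeStaleEndpointsFor(InetAddress leavingEndpoint,     // new_IP
>                                               UUID leavingHostId,              // hostId1
>                                               Collection<Token> leavingTokens, // new_IP's tokens
>                                               Map<InetAddress, EndpointState> endpointStateMap,
>                                               TokenMetadata tokenMetadata)
>     {
>         for (Map.Entry<InetAddress, EndpointState> entry : endpointStateMap.entrySet())
>         {
>             InetAddress candidate = entry.getKey();
>             if (candidate.equals(leavingEndpoint))
>                 continue;
>
>             // 1. Known to Gossip but not currently a token owner (e.g. a fatClient such as old_IP).
>             if (tokenMetadata.isMember(candidate))
>                 continue;
>
>             // 2. Same hostId as the leaving endpoint (hostId as gossiped for this endpoint).
>             UUID candidateHostId = Gossiper.instance.getHostId(candidate);
>             if (candidateHostId == null || !candidateHostId.equals(leavingHostId))
>                 continue;
>
>             // 3. Older than the leaving endpoint (assuming a positive result
>             //    means the first argument started more recently).
>             if (Gossiper.instance.compareEndpointStartup(candidate, leavingEndpoint) >= 0)
>                 continue;
>
>             // 4. Its gossiped tokens intersect the leaving endpoint's tokens.
>             Collection<Token> candidateTokens = tokensFromGossipState(entry.getValue());
>             if (candidateTokens.stream().noneMatch(leavingTokens::contains))
>                 continue;
>
>             // All criteria met: expunge the stale endpoint so it cannot regain
>             // ownership of the tokens once new_IP has left the ring.
>             Gossiper.instance.evictFromMembership(candidate);
>         }
>     }
>
>     // Hypothetical helper: reading ApplicationState.TOKENS out of the gossip
>     // state is elided in this sketch.
>     private static Collection<Token> tokensFromGossipState(EndpointState state)
>     {
>         throw new UnsupportedOperationException("sketch only");
>     }
> }
> {code}
> The intent matches the list above: only endpoints that are provably the
> stale, pre-bounce identity of the leaving node are expunged.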
> Another approach we have been considering is expunging {{old_IP}} at the
> moment of the StorageService collision resolution.