Hi Cassandra community,

We have recently encountered a recurring old-IP reappearance issue while
testing decommissions on some of our Kubernetes Cassandra staging clusters.
We have not yet found other references to this issue online, so we could
really use some additional input/opinions, both on the problem itself and
on the fix we are currently considering.

*Issue Description*

In Kubernetes, a Cassandra node can change IP at each pod bounce. We have
noticed that this behavior, combined with a decommission operation, can get
the cluster into an erroneous state.

Consider the following situation: a Cassandra node node1, with hostId1,
owning 20.5% of the token ring, bounces and switches IP (old_IP → new_IP).
After a couple of gossip iterations, every other node's nodetool status
output includes a new_IP UN entry owning 20.5% of the token ring and no
old_IP entry.

Shortly after the bounce, node1 gets decommissioned. Our cluster does not
hold much data, so the decommission completes quickly. Logs on other nodes
start acknowledging that node1 has left, and soon the new_IP UL entry
disappears from nodetool status. node1's pod is then deleted.

About a minute later, the cluster enters the erroneous state: an old_IP DN
entry reappears in nodetool status, owning 20.5% of the token ring. No node
holds this IP anymore, and according to the logs, old_IP is still
associated with hostId1.

*Issue Root Cause*

By digging through Cassandra logs and re-testing this scenario over and
over again, we have reached the following conclusions:

   - Other nodes continue exchanging gossip about old_IP, even after it
   becomes a fatClient.
   - The fatClient timeout and subsequent quarantine do not stop old_IP
   from reappearing in a node's Gossip state once its quarantine is over.
   We believe this is because the nodes are not aligned on old_IP's
   expiration time.
   - Once new_IP has left the cluster and old_IP's next gossip state
   message is received by a node, StorageService no longer faces a
   collision (or faces one, but with an even older IP) for hostId1 and its
   corresponding tokens. As a result, old_IP regains ownership of 20.5% of
   the token ring (see the sketch just below this list).
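
To make that last point concrete, here is a simplified sketch of the
token-ownership collision handling as we understand it (a rough
approximation of what StorageService does when applying a NORMAL state for
an endpoint; tokensClaimedBy is a placeholder for reading the endpoint's
tokens out of its gossip state, not an actual method):

    // Simplified illustration of our reading of the collision check,
    // not the actual StorageService code. Here endpoint == old_IP.
    for (Token token : tokensClaimedBy(endpoint))
    {
        InetAddressAndPort currentOwner = tokenMetadata.getEndpoint(token);
        if (currentOwner == null || currentOwner.equals(endpoint))
        {
            // No collision: the token is simply (re)assigned to old_IP.
            tokenMetadata.updateNormalToken(token, endpoint);
        }
        else if (Gossiper.instance.compareEndpointStartup(endpoint, currentOwner) > 0)
        {
            // Collision: the endpoint with the more recent startup wins.
            tokenMetadata.updateNormalToken(token, endpoint);
        }
        // Otherwise old_IP's claim is ignored.
    }

While new_IP is still a token owner, old_IP always loses this comparison;
but once new_IP has been removed after STATE_LEFT, currentOwner is null for
those tokens, and old_IP's late gossip state is enough to re-register it as
their owner.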


*Proposed fix*

Following the above investigation, we were thinking about implementing the
following fix:

When a node receives a gossip status change with STATE_LEFT for a leaving
endpoint new_IP, before evicting new_IP from the token ring, purge from
Gossip (i.e. evictFromMembership) every endpoint that meets all of the
following criteria:

   - endpointStateMap contains this endpoint
   - The endpoint is not currently a token owner
   (!tokenMetadata.isMember(endpoint))
   - The endpoint's hostId matches the hostId of new_IP
   - The endpoint is older than the leaving endpoint new_IP
   (Gossiper.instance.compareEndpointStartup)
   - The endpoint's tokens (from endpointStateMap) intersect with new_IP's

The intention of this modification is to force all nodes to realign on
old_IP's expiration and to expunge it from Gossip, so that it cannot
reappear after new_IP leaves the ring.
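
For concreteness, here is a rough sketch of what we have in mind, assuming
the purge is hooked into the STATE_LEFT handling in StorageService before
the token-ring eviction. purgeStaleEndpointsFor and gossipedEndpoints are
placeholder names, getTokensFor stands for reading an endpoint's tokens
from its gossip state, and evictFromMembership is currently private to
Gossiper, so the real change might have to live inside Gossiper instead:

    // Illustrative sketch only, not a tested patch against a specific
    // Cassandra version; accessor names are assumptions.
    private void purgeStaleEndpointsFor(InetAddressAndPort leavingEndpoint)
    {
        UUID leavingHostId = Gossiper.instance.getHostId(leavingEndpoint);
        Collection<Token> leavingTokens = getTokensFor(leavingEndpoint);

        // Criterion 1: only endpoints currently present in Gossiper's
        // endpointStateMap are considered (gossipedEndpoints is a
        // placeholder for iterating that map's keys).
        for (InetAddressAndPort candidate : gossipedEndpoints())
        {
            if (candidate.equals(leavingEndpoint))
                continue;

            // Criterion 2: skip endpoints that still own tokens.
            if (tokenMetadata.isMember(candidate))
                continue;

            // Criterion 3: same hostId as the leaving endpoint.
            UUID candidateHostId = Gossiper.instance.getHostId(candidate);
            if (candidateHostId == null || !candidateHostId.equals(leavingHostId))
                continue;

            // Criterion 4: strictly older than the leaving endpoint.
            if (Gossiper.instance.compareEndpointStartup(candidate, leavingEndpoint) >= 0)
                continue;

            // Criterion 5: the candidate's tokens intersect the leaving
            // endpoint's tokens.
            if (Collections.disjoint(getTokensFor(candidate), leavingTokens))
                continue;

            logger.info("Purging stale endpoint {} (hostId {}) from gossip before {} leaves",
                        candidate, candidateHostId, leavingEndpoint);
            Gossiper.instance.evictFromMembership(candidate);
        }
    }

In the scenario above, old_IP is exactly such a candidate: it is still in
endpointStateMap, it no longer owns tokens, it shares hostId1 with new_IP,
it is older, and its tokens are new_IP's. Purging it at this point should
keep it from regaining ownership once new_IP is gone.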


Additional opinions/ideas regarding the fix’s viability and the issue
itself would be really helpful.
Thanks in advance,
Ines
