[
https://issues.apache.org/jira/browse/CASSANDRA-21185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061130#comment-18061130
]
Sam Tunnicliffe commented on CASSANDRA-21185:
---------------------------------------------
> Does it make sense to start initMessaging before initializeAsFirstCMSNode to
> prevent split-brain when nodes start simultaneously? Vote nodes cannot
> discover the seed because FIRST_CMS delays messaging until after CMS
> initialization, causing them to elect their own CMS
Looking at the attached logs, I see that the first node does try to init itself
as the first CMS node. This is correct behaviour and is driven by this
condition in `Startup.StartupMode`
{code}
// For CCM and local dev clusters
boolean isOnlySeed = DatabaseDescriptor.getSeeds().size() == 1
                     && DatabaseDescriptor.getSeeds().contains(FBUtilities.getBroadcastAddressAndPort())
                     && DatabaseDescriptor.getSeeds().iterator().next().getAddress().isLoopbackAddress();
{code}
CCM configures seeds for a cluster so that each node has itself and any
"preceding" nodes as seeds: node1 has only itself as a seed, node2 has node1
and itself, node3 has node1, node2, and itself, and so on.
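For illustration only (this is not CCM's actual code, and the addresses are placeholders), that seed assignment can be sketched as:

```java
import java.util.ArrayList;
import java.util.List;

public class CcmSeedsSketch
{
    // Illustrative only: node i's seed list is nodes 1..i,
    // i.e. itself plus all "preceding" nodes.
    static List<String> seedsFor(int nodeIndex)
    {
        List<String> seeds = new ArrayList<>();
        for (int i = 1; i <= nodeIndex; i++)
            seeds.add("127.0.0." + i + ":7000");
        return seeds;
    }

    public static void main(String[] args)
    {
        System.out.println(seedsFor(1)); // node1: only itself
        System.out.println(seedsFor(3)); // node3: node1, node2, node3
    }
}
```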
For whatever reason, node1 takes some time initializing the CMS locally. The
log shows this and its consequence: when node2 & node3 begin their discovery
process, node1 has not yet started its messaging service, so the initial
discovery requests time out:
{code:title=node1}
--- Startup mode determination ---
INFO [main] 2026-02-24T10:26:25,215 Startup.java:595 - hasAnyEpoch = false,
hasBootedBefore = false
INFO [main] 2026-02-24T10:26:25,219 Startup.java:109 - Initializing as first
CMS node in a new cluster
--- CMS initialization (completes BEFORE messaging starts) ---
INFO [main] 2026-02-24T10:26:25,350 LocalLog.java:667 - Marking
LocalLog ready at epoch Epoch{epoch=0}
INFO [main] 2026-02-24T10:26:25,351 LocalLog.java:675 - Adding
default listeners to LocalLog
INFO [main] 2026-02-24T10:26:25,356 ColumnFamilyStore.java:516 -
Initializing system_cluster_metadata.distributed_metadata_log
INFO [GlobalLogFollower] 2026-02-24T10:26:25,799 LocalLog.java:526 - Enacted
PreInitialize. New tail is Epoch{epoch=1}
INFO [main] 2026-02-24T10:26:25,922
AbstractLocalProcessor.java:104 - Committed Initialize{baseState =
Epoch{epoch=1}}. New epoch is Epoch{epoch=2}
INFO [GlobalLogFollower] 2026-02-24T10:26:26,272 LocalLog.java:526 - Enacted
EXECUTED {Initialize{baseState = Epoch{epoch=1}}}. New tail is Epoch{epoch=2}
{code}
{code:title=node2}
--- Discovery begins: sends to seed node1, which is NOT YET LISTENING ---
DEBUG [main] 2026-02-24T10:26:25,745 Startup.java:217 - Discovering other nodes
in the system
INFO [main] 2026-02-24T10:26:25,764 MessageDelivery.java:79 - Election for
metadata migration sending TCM_DISCOVER_REQ
(org.apache.cassandra.net.NoPayload@258b7043) to /127.0.0.1:7000
...
--- TCM_DISCOVER_REQ to node1 DROPPED: expired before reaching network (node1
was deaf) ---
WARN [Messaging-EventLoop-3-3] 2026-02-24T10:26:40,777 NoSpamLogger.java:110 -
/127.0.0.2:7000->/127.0.0.1:7000-URGENT_MESSAGES-[no-channel] dropping message
of type TCM_DISCOVER_REQ whose timeout expired before reaching the network
INFO [InternalResponseStage:1] 2026-02-24T10:26:40,779 MessageDelivery.java:73
- Received failure in response to TCM_DISCOVER_REQ from /127.0.0.1:7000:
RequestFailure{reason=TIMEOUT, failure='null'}
{code}
But node1 _does_ start its messaging service well within the discovery timeout
of nodes 2 & 3:
{code:title=node1}
--- Messaging starts 1.056s AFTER Initialize committed (this is the bug) ---
INFO [main] 2026-02-24T10:26:26,275 InboundConnectionInitiator.java:170 -
Listening on address: (/127.0.0.1:7000), nic: lo0, encryption: unencrypted
--- Normal startup continues, node1 unaware of split-brain ---
INFO [main] 2026-02-24T10:26:26,583 Gossiper.java:2015 - Waiting for gossip to
settle...
INFO [main] 2026-02-24T10:26:34,593 Gossiper.java:2046 - No gossip backlog;
proceeding
INFO [main] 2026-02-24T10:26:34,604 AbstractLocalProcessor.java:104 -
Committed Register{addresses=NodeAddresses{broadcastAddress=/127.0.0.1:7000,
localAddress=/127.0.0.1:7000, nativeAddress=/127.0.0.1:9042},
location=datacenter1/rack1,
version=NodeVersion{cassandraVersion=5.1.0-SNAPSHOT, serializationVersion=8}}.
New epoch is Epoch{epoch=3}
INFO [GlobalLogFollower] 2026-02-24T10:26:34,649 LocalLog.java:526 - Enacted
EXECUTED {Register{addresses=NodeAddresses{broadcastAddress=/127.0.0.1:7000,
localAddress=/127.0.0.1:7000, nativeAddress=/127.0.0.1:9042},
location=datacenter1/rack1,
version=NodeVersion{cassandraVersion=5.1.0-SNAPSHOT, serializationVersion=8}}}.
New tail is Epoch{epoch=3}
INFO [main] 2026-02-24T10:26:34,650 Register.java:137 - Registered with
endpoint /127.0.0.1:7000, node id: NodeId{id=1}
INFO [main] 2026-02-24T10:26:34,695 Register.java:151 - New node ID obtained
6d194555-f6eb-41d0-c000-000000000001, (Note: This should happen exactly once
per node)
INFO [main] 2026-02-24T10:26:34,704 BootStrapper.java:258 - tokens manually
specified as [-9223372036854775808]
INFO [main] 2026-02-24T10:26:34,717 AbstractLocalProcessor.java:104 -
Committed UnsafeJoin{nodeId=NodeId{id=1}, tokens=[-9223372036854775808],
joinTokenRing=true, streamData=false}. New epoch is Epoch{epoch=4}
INFO [GlobalLogFollower] 2026-02-24T10:26:34,779 LocalLog.java:526 - Enacted
EXECUTED {UnsafeJoin{nodeId=NodeId{id=1}, tokens=[-9223372036854775808],
joinTokenRing=true, streamData=false}}. New tail is Epoch{epoch=4}
INFO [main] 2026-02-24T10:26:34,826 Startup.java:483 - NORMAL
INFO [main] 2026-02-24T10:26:34,911 PipelineConfigurator.java:165 - Starting
listening for CQL clients on /127.0.0.1:9042 (unencrypted)...
INFO [main] 2026-02-24T10:26:34,913 CassandraDaemon.java:853 - Startup complete
{code}
So node1 has started listening by {{2026-02-24T10:26:26,275}}, at which point
nodes 2 & 3 are still performing rounds of discovery, only concluding that
process after 30s & 5 rounds of unchanging results:
{code:title=node2}
INFO [main] 2026-02-24T10:26:41,797
MessageDelivery.java:79 - Election for metadata migration sending
TCM_DISCOVER_REQ (org.apache.cassandra.net.NoPayload@258b7043) to
/127.0.0.3:7000
INFO [InternalMetadataStage:8] 2026-02-24T10:26:41,800
MessageDelivery.java:65 - Received a TCM_DISCOVER_RSP response from
/127.0.0.3:7000: DiscoveredNodes{nodes=[/127.0.0.2:7000, /127.0.0.3:7000],
kind=KNOWN_PEERS}
DEBUG [main] 2026-02-24T10:26:57,813 Startup.java:221 - Got candidates:
DiscoveredNodes{nodes=[/127.0.0.2:7000, /127.0.0.3:7000], kind=KNOWN_PEERS}
{code}
So nodes 2 & 3 should have had plenty of opportunity to discover node1, which
would have informed them that it considers itself the CMS; they would then have
accepted that rather than holding their own election.
The bug is in {{Discovery::discoverOnce}}, which represents a single discovery
round. Discovery is cumulative: the input to each round includes the output of
the previous rounds, and this accumulated set of endpoints is what the protocol
interrogates. It _should_ also include the node's configured seeds, but these
are only added when the candidate list is empty. So the flow on node2 (node3 is
similar) looks something like:
* node2's first round goes to its seeds, node1 & node3
* node3 responds, so it is added to node2's {{discovered}} set
* node1 does not respond and the discovery request times out
* node2 begins its second round with node3 as its only candidate endpoint
* node2 learns nothing new; this repeats until the discovery thresholds are reached
Neither node2 nor node3 has any cause to contact node1 again until their
discovery & election are complete. In the log we see them initiate this
contact, and node1 correctly rejects it with the {{Cluster metadata identifier
mismatch}} error.
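The candidate-selection step described above can be sketched as follows (a simplified stand-alone model, not the real {{Discovery}} class; the node names are placeholders):

```java
import java.util.HashSet;
import java.util.Set;

public class DiscoverOnceBugSketch
{
    // Simplified model of the buggy candidate selection: seeds are
    // consulted only while nothing has been discovered yet.
    static Set<String> candidates(Set<String> discovered, Set<String> seeds, String self)
    {
        Set<String> candidates = new HashSet<>(discovered);
        if (candidates.isEmpty())   // the bug: seeds added only when nothing is known
            candidates.addAll(seeds);
        candidates.remove(self);
        return candidates;
    }

    public static void main(String[] args)
    {
        String self = "node2";
        Set<String> seeds = Set.of("node1", "node3"); // node2's seeds in this scenario
        Set<String> discovered = new HashSet<>();

        // Round 1: nothing discovered yet, so both seeds are contacted
        System.out.println(candidates(discovered, seeds, self)); // node1 and node3

        // node3 responds; node1's request times out (its messaging is not up yet)
        discovered.add("node3");

        // Round 2 onward: node1 is never a candidate again
        System.out.println(candidates(discovered, seeds, self)); // node3 only
    }
}
```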
IMO the correct fix for this would be to unconditionally add {{seeds}} to the
{{candidates}} on every discovery round.
{code}
diff --git a/src/java/org/apache/cassandra/tcm/Discovery.java b/src/java/org/apache/cassandra/tcm/Discovery.java
index 99ca8907cc6..52eac4cccfb 100644
--- a/src/java/org/apache/cassandra/tcm/Discovery.java
+++ b/src/java/org/apache/cassandra/tcm/Discovery.java
@@ -148,8 +148,7 @@ public class Discovery
         else
             candidates.addAll(discovered);
-        if (candidates.isEmpty())
-            candidates.addAll(seeds.get());
+        candidates.addAll(seeds.get());
         candidates.remove(self);
{code}
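With that change applied to the same simplified model as above (again a sketch, with placeholder node names), a seed that was deaf earlier keeps being retried on every round:

```java
import java.util.HashSet;
import java.util.Set;

public class DiscoverOnceFixSketch
{
    // Same simplified model, with the proposed fix: seeds are merged
    // into the candidates unconditionally on every round.
    static Set<String> candidates(Set<String> discovered, Set<String> seeds, String self)
    {
        Set<String> candidates = new HashSet<>(discovered);
        candidates.addAll(seeds);   // a deaf seed keeps being retried each round
        candidates.remove(self);
        return candidates;
    }

    public static void main(String[] args)
    {
        // Even after node3 has been discovered, node1 remains a candidate, so it
        // will be reached as soon as its messaging service starts listening.
        Set<String> round2 = candidates(Set.of("node3"), Set.of("node1", "node3"), "node2");
        System.out.println(round2.contains("node1")); // true
    }
}
```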
> Fix flaky DTest: bootstrap_test_*
> ---------------------------------
>
> Key: CASSANDRA-21185
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21185
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Test/dtest/python
> Reporter: Sam Lightfoot
> Assignee: Sam Lightfoot
> Priority: Normal
> Fix For: 5.1
>
> Attachments: split_brain_logs.txt
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Tests often fail because no seed is up while non-seeds are trying to join
> the ring. A likely fix is to start the seed node, ensure CMS initialization
> is complete, then allow the other nodes in CCM to start in parallel.
> Affects 5.1+ due to <=5.0 using [sequential
> startup|https://github.com/apache/cassandra-dtest/blob/trunk/bootstrap_test.py#L254C30-L254C32].
> On further analysis it appears two separate clusters form due to the seed
> node not accepting messages during CMS initialization. Attached logs show the
> independent clusters resulting from
> _bootstrap_test.py::TestBootstrap::test_read_from_bootstrapped_node._
--
This message was sent by Atlassian Jira
(v8.20.10#820010)