[
https://issues.apache.org/jira/browse/CASSANDRA-21185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061130#comment-18061130
]
Sam Tunnicliffe commented on CASSANDRA-21185:
---------------------------------------------
> Does it make sense to start initMessaging before initializeAsFirstCMSNode to
> prevent split-brain when nodes start simultaneously? Vote nodes cannot
> discover the seed because FIRST_CMS delays messaging until after CMS
> initialization, causing them to elect their own CMS
Looking at the attached logs, I see that the first node does try to init itself
as the first CMS node. This is correct behaviour and is driven by this
condition in `Startup.StartupMode`
{code}
// For CCM and local dev clusters
boolean isOnlySeed = DatabaseDescriptor.getSeeds().size() == 1
                     && DatabaseDescriptor.getSeeds().contains(FBUtilities.getBroadcastAddressAndPort())
                     && DatabaseDescriptor.getSeeds().iterator().next().getAddress().isLoopbackAddress();
{code}
CCM configures seeds for a cluster so that each node has itself and any
"preceding" nodes as seeds: node1 has only itself as a seed, node2 has node1
and itself, node3 has node1, node2, and itself, and so on.
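For illustration only (this is not CCM's actual code, and the addresses are placeholders), that seed assignment can be sketched as:

```java
import java.util.ArrayList;
import java.util.List;

public class CcmSeedsSketch
{
    // Illustrative only: node i's seed list is nodes 1..i,
    // i.e. itself plus all "preceding" nodes.
    static List<String> seedsFor(int nodeIndex)
    {
        List<String> seeds = new ArrayList<>();
        for (int i = 1; i <= nodeIndex; i++)
            seeds.add("127.0.0." + i + ":7000");
        return seeds;
    }

    public static void main(String[] args)
    {
        System.out.println(seedsFor(1)); // node1: only itself
        System.out.println(seedsFor(3)); // node3: node1, node2, node3
    }
}
```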
For whatever reason, node1 takes some time initializing the CMS locally. The
log shows this and its consequence: when node2 & node3 begin their discovery
process, node1 has not yet started its messaging service, so the initial
discovery requests time out:
{code:title=node1}
--- Startup mode determination ---
INFO [main] 2026-02-24T10:26:25,215 Startup.java:595 - hasAnyEpoch = false,
hasBootedBefore = false
INFO [main] 2026-02-24T10:26:25,219 Startup.java:109 - Initializing as first
CMS node in a new cluster
--- CMS initialization (completes BEFORE messaging starts) ---
INFO [main] 2026-02-24T10:26:25,350 LocalLog.java:667 - Marking
LocalLog ready at epoch Epoch{epoch=0}
INFO [main] 2026-02-24T10:26:25,351 LocalLog.java:675 - Adding
default listeners to LocalLog
INFO [main] 2026-02-24T10:26:25,356 ColumnFamilyStore.java:516 -
Initializing system_cluster_metadata.distributed_metadata_log
INFO [GlobalLogFollower] 2026-02-24T10:26:25,799 LocalLog.java:526 - Enacted
PreInitialize. New tail is Epoch{epoch=1}
INFO [main] 2026-02-24T10:26:25,922
AbstractLocalProcessor.java:104 - Committed Initialize{baseState =
Epoch{epoch=1}}. New epoch is Epoch{epoch=2}
INFO [GlobalLogFollower] 2026-02-24T10:26:26,272 LocalLog.java:526 - Enacted
EXECUTED {Initialize{baseState = Epoch{epoch=1}}}. New tail is Epoch{epoch=2}
{code}
{code:title=node2}
--- Discovery begins: sends to seed node1, which is NOT YET LISTENING ---
DEBUG [main] 2026-02-24T10:26:25,745 Startup.java:217 - Discovering other nodes
in the system
INFO [main] 2026-02-24T10:26:25,764 MessageDelivery.java:79 - Election for
metadata migration sending TCM_DISCOVER_REQ
(org.apache.cassandra.net.NoPayload@258b7043) to /127.0.0.1:7000
...
--- TCM_DISCOVER_REQ to node1 DROPPED: expired before reaching network (node1
was deaf) ---
WARN [Messaging-EventLoop-3-3] 2026-02-24T10:26:40,777 NoSpamLogger.java:110 -
/127.0.0.2:7000->/127.0.0.1:7000-URGENT_MESSAGES-[no-channel] dropping message
of type TCM_DISCOVER_REQ whose timeout expired before reaching the network
INFO [InternalResponseStage:1] 2026-02-24T10:26:40,779 MessageDelivery.java:73
- Received failure in response to TCM_DISCOVER_REQ from /127.0.0.1:7000:
RequestFailure{reason=TIMEOUT, failure='null'}
{code}
But node1 _does_ start its messaging service well within the discovery timeout
of nodes 2 & 3:
{code:title=node1}
--- Messaging starts 1.056s AFTER Initialize committed (this is the bug) ---
INFO [main] 2026-02-24T10:26:26,275 InboundConnectionInitiator.java:170 -
Listening on address: (/127.0.0.1:7000), nic: lo0, encryption: unencrypted
--- Normal startup continues, node1 unaware of split-brain ---
INFO [main] 2026-02-24T10:26:26,583 Gossiper.java:2015 - Waiting for gossip to
settle...
INFO [main] 2026-02-24T10:26:34,593 Gossiper.java:2046 - No gossip backlog;
proceeding
INFO [main] 2026-02-24T10:26:34,604 AbstractLocalProcessor.java:104 -
Committed Register{addresses=NodeAddresses{broadcastAddress=/127.0.0.1:7000,
localAddress=/127.0.0.1:7000, nativeAddress=/127.0.0.1:9042},
location=datacenter1/rack1,
version=NodeVersion{cassandraVersion=5.1.0-SNAPSHOT, serializationVersion=8}}.
New epoch is Epoch{epoch=3}
INFO [GlobalLogFollower] 2026-02-24T10:26:34,649 LocalLog.java:526 - Enacted
EXECUTED {Register{addresses=NodeAddresses{broadcastAddress=/127.0.0.1:7000,
localAddress=/127.0.0.1:7000, nativeAddress=/127.0.0.1:9042},
location=datacenter1/rack1,
version=NodeVersion{cassandraVersion=5.1.0-SNAPSHOT, serializationVersion=8}}}.
New tail is Epoch{epoch=3}
INFO [main] 2026-02-24T10:26:34,650 Register.java:137 - Registered with
endpoint /127.0.0.1:7000, node id: NodeId{id=1}
INFO [main] 2026-02-24T10:26:34,695 Register.java:151 - New node ID obtained
6d194555-f6eb-41d0-c000-000000000001, (Note: This should happen exactly once
per node)
INFO [main] 2026-02-24T10:26:34,704 BootStrapper.java:258 - tokens manually
specified as [-9223372036854775808]
INFO [main] 2026-02-24T10:26:34,717 AbstractLocalProcessor.java:104 -
Committed UnsafeJoin{nodeId=NodeId{id=1}, tokens=[-9223372036854775808],
joinTokenRing=true, streamData=false}. New epoch is Epoch{epoch=4}
INFO [GlobalLogFollower] 2026-02-24T10:26:34,779 LocalLog.java:526 - Enacted
EXECUTED {UnsafeJoin{nodeId=NodeId{id=1}, tokens=[-9223372036854775808],
joinTokenRing=true, streamData=false}}. New tail is Epoch{epoch=4}
INFO [main] 2026-02-24T10:26:34,826 Startup.java:483 - NORMAL
INFO [main] 2026-02-24T10:26:34,911 PipelineConfigurator.java:165 - Starting
listening for CQL clients on /127.0.0.1:9042 (unencrypted)...
INFO [main] 2026-02-24T10:26:34,913 CassandraDaemon.java:853 - Startup complete
{code}
So node1 has started listening by {{2026-02-24T10:26:26,275}}, at which point
nodes 2 & 3 are still performing rounds of discovery, only concluding that
process after 30s & 5 rounds of unchanging results:
{code:title=node2}
INFO [main] 2026-02-24T10:26:41,797
MessageDelivery.java:79 - Election for metadata migration sending
TCM_DISCOVER_REQ (org.apache.cassandra.net.NoPayload@258b7043) to
/127.0.0.3:7000
INFO [InternalMetadataStage:8] 2026-02-24T10:26:41,800
MessageDelivery.java:65 - Received a TCM_DISCOVER_RSP response from
/127.0.0.3:7000: DiscoveredNodes{nodes=[/127.0.0.2:7000, /127.0.0.3:7000],
kind=KNOWN_PEERS}
DEBUG [main] 2026-02-24T10:26:57,813 Startup.java:221 - Got candidates:
DiscoveredNodes{nodes=[/127.0.0.2:7000, /127.0.0.3:7000], kind=KNOWN_PEERS}
{code}
So nodes 2 & 3 should have had plenty of opportunity to discover node1, which
would have informed them that it considers itself the CMS; they would then have
accepted that rather than holding their own election.
The bug is in {{Discovery::discoverOnce}}, which represents a single discovery
round. Discovery is cumulative: the input to each round includes the output of
the previous rounds, and this accumulated set of endpoints is what the protocol
interrogates. It _should_ also include the node's configured seeds, but these
are only added when the candidate list is empty. So the flow on node2 (node3 is
similar) looks something like:
* node2's first round goes to its seeds, node1 & node3
* node3 responds, so it is added to node2's {{discovered}} set
* node1 does not respond and the discovery request times out
* node2 begins its second round with node3 as its only candidate endpoint
* node2 learns nothing new; this repeats until the discovery thresholds are reached
Neither node2 nor node3 has any cause to contact node1 again until their
discovery & election are complete. In the log we see them initiate this
contact, and node1 correctly rejects it with the {{Cluster metadata identifier
mismatch}} error.
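The candidate-selection step described above can be sketched as follows (a simplified stand-alone model, not the real {{Discovery}} class; the node names are placeholders):

```java
import java.util.HashSet;
import java.util.Set;

public class DiscoverOnceBugSketch
{
    // Simplified model of the buggy candidate selection: seeds are
    // consulted only while nothing has been discovered yet.
    static Set<String> candidates(Set<String> discovered, Set<String> seeds, String self)
    {
        Set<String> candidates = new HashSet<>(discovered);
        if (candidates.isEmpty())   // the bug: seeds added only when nothing is known
            candidates.addAll(seeds);
        candidates.remove(self);
        return candidates;
    }

    public static void main(String[] args)
    {
        String self = "node2";
        Set<String> seeds = Set.of("node1", "node3"); // node2's seeds in this scenario
        Set<String> discovered = new HashSet<>();

        // Round 1: nothing discovered yet, so both seeds are contacted
        System.out.println(candidates(discovered, seeds, self)); // node1 and node3

        // node3 responds; node1's request times out (its messaging is not up yet)
        discovered.add("node3");

        // Round 2 onward: node1 is never a candidate again
        System.out.println(candidates(discovered, seeds, self)); // node3 only
    }
}
```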
IMO the correct fix for this would be to unconditionally add {{seeds}} to the
{{candidates}} on every discovery round.
{code}
diff --git a/src/java/org/apache/cassandra/tcm/Discovery.java b/src/java/org/apache/cassandra/tcm/Discovery.java
index 99ca8907cc6..52eac4cccfb 100644
--- a/src/java/org/apache/cassandra/tcm/Discovery.java
+++ b/src/java/org/apache/cassandra/tcm/Discovery.java
@@ -148,8 +148,7 @@ public class Discovery
         else
             candidates.addAll(discovered);
-        if (candidates.isEmpty())
-            candidates.addAll(seeds.get());
+        candidates.addAll(seeds.get());
         candidates.remove(self);
{code}
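With that change applied to the same simplified model as above (again a sketch, with placeholder node names), a seed that was deaf earlier keeps being retried on every round:

```java
import java.util.HashSet;
import java.util.Set;

public class DiscoverOnceFixSketch
{
    // Same simplified model, with the proposed fix: seeds are merged
    // into the candidates unconditionally on every round.
    static Set<String> candidates(Set<String> discovered, Set<String> seeds, String self)
    {
        Set<String> candidates = new HashSet<>(discovered);
        candidates.addAll(seeds);   // a deaf seed keeps being retried each round
        candidates.remove(self);
        return candidates;
    }

    public static void main(String[] args)
    {
        // Even after node3 has been discovered, node1 remains a candidate, so it
        // will be reached as soon as its messaging service starts listening.
        Set<String> round2 = candidates(Set.of("node3"), Set.of("node1", "node3"), "node2");
        System.out.println(round2.contains("node1")); // true
    }
}
```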
> Fix flaky DTest: bootstrap_test_*
> ---------------------------------
>
> Key: CASSANDRA-21185
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21185
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Test/dtest/python
> Reporter: Sam Lightfoot
> Assignee: Sam Lightfoot
> Priority: Normal
> Fix For: 5.1
>
> Attachments: split_brain_logs.txt
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Tests often fail because no seed is up while non-seeds are trying to join
> the ring. A likely fix is to start the seed node, ensure CMS initialization
> is complete, then allow the other nodes in CCM to start in parallel.
> Affects 5.1+ due to <=5.0 using [sequential
> startup|https://github.com/apache/cassandra-dtest/blob/trunk/bootstrap_test.py#L254C30-L254C32].
> On further analysis it appears two separate clusters form due to the seed
> node not accepting messages during CMS initialization. Attached logs show the
> independent clusters resulting from
> _bootstrap_test.py::TestBootstrap::test_read_from_bootstrapped_node._
--
This message was sent by Atlassian Jira
(v8.20.10#820010)