Hi folks,
I'm occasionally seeing SchemaDisagreementError on the boot of a *new*
cluster. I'm hoping someone can explain what I'm doing wrong, or help
me track down the bug if it is one.
The problem occurs in about 1 in 4 launches when I start a 2-node
cluster, where the two machines are configured identically (apart from
listen_address) and both nodes are listed as seeds. On the problematic
launches, describing schema versions immediately after start shows that
the two nodes have different schemas (reported at both nodes), and any
attempt to work with the nodes returns SchemaDisagreementError. This is
before I attempt to do anything to the cluster. After ~60s the nodes
reconcile their differences, report a single schema used at both nodes,
and I can use the cluster without problems.
Key points:
* The problem usually fixes itself 60s after startup (almost exactly, I
poll every second)
* The problem is intermittent, occurring on between 10% and 50% of
launches (failure rates seem higher at peak cloud times -- so possibly
linked to background CPU/network/storage contention)
* For the problem period (the initial 60s), peer size is reported as 2,
and both nodes report the same schema versions map containing two
schemas each with one of the nodes against them (after 60s the map
contains one schema with both nodes)
* In some of the problematic launches, it takes ~120s to reconcile. For
the first 60s the nodes do not seem to see each other at all: each
reports peer size 1 and a single schema used by only one node (itself).
For the next 60s the problem is as described above (disagreeing
schemas). Again, the 60s/120s timing seems meaningfully precise
* The problem occurs whether the two nodes are launched simultaneously
or are launched with a delay between the two
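
The polling check described in the points above could be sketched along
these lines. This is a minimal illustration, not the actual test code:
`fetch_versions` is a hypothetical callable standing in for whatever
client call returns the schema-versions map (schema version id mapped to
the list of node addresses on that version), and the function names,
timeout and interval are assumptions.

```python
import time
from typing import Callable, Dict, List, Optional

# Schema version id -> addresses of the nodes reporting that version.
SchemaVersions = Dict[str, List[str]]


def schemas_agree(versions: SchemaVersions, expected_nodes: int) -> bool:
    """True when there is exactly one schema version and it lists every node."""
    if len(versions) != 1:
        return False  # zero versions, or two or more disagreeing versions
    (hosts,) = versions.values()
    return len(set(hosts)) == expected_nodes


def wait_for_agreement(
    fetch_versions: Callable[[], SchemaVersions],  # hypothetical client call
    expected_nodes: int,
    timeout_s: float = 180.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until the schemas agree; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        elapsed = clock() - start
        if schemas_agree(fetch_versions(), expected_nodes):
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(interval_s)
```

Requiring the single version to list all expected nodes matters here: in
the ~120s case each node initially reports one schema with only itself
against it, which would otherwise look like agreement.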
I have a workaround, which is to use just one node to seed this initial
set. When the seed list contains only one node, the problem does not
occur. However, the advice is to use 2 seeds and have them be the same
across the cluster -- so I'd like to get to the bottom of this!
I'd also like to be sure that any subsequent nodes added to the cluster
aren't going to cause the same problem when we start using it!
I am running Cassandra 1.2.2 in Amazon, using Brooklyn (brooklyn.io) to
start and manage it. I can share test cases, cassandra.yaml, logs, etc.
-- but I am starting with the above summary in case anyone can point me
in the right direction from that.
Thanks,
Alex