Re: incomplete schema sync for new node

Jeremy Stribling Fri, 01 Jul 2011 17:59:20 -0700

Oops, forgot to mention that we're using Cassandra 0.7.2.


On 07/01/2011 05:46 PM, Jeremy Stribling wrote:

Hi all,
I'm running into a problem with Cassandra, where a new node coming up seems to only get an incomplete set of schema mutations when bootstrapping, and as a result hits an "IllegalStateException: replication factor (3) exceeds number of endpoints (2)" error.
I will describe the sequence of events below as I see them, but first I need to warn you that I run Cassandra in a very non-standard way. I embed it in a JVM, along with Zookeeper, and other classes for a product we are working on. We need to bring nodes up and down dynamically in our product, including going from one node to three nodes, and back down to one, at any time. If we ever drop below three nodes, we have code that sets the replication factor of our keyspaces to 1; similarly, whenever we have three or more nodes, we change the replication factor to 3. I know this is frowned upon by the community, but we're stuck with doing it this way for now.
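The replication-factor rule described above can be sketched roughly as follows. This is only an illustration in Python, not the actual product code (which runs inside the JVM); the function name and the threshold constant are hypothetical.

```python
# Hypothetical sketch of the policy described above; not the real product code.
FULL_REPLICATION_THRESHOLD = 3  # assumed name for the "three nodes" cutoff


def desired_replication_factor(live_node_count: int) -> int:
    """Return RF 3 when three or more nodes are up, otherwise RF 1."""
    if live_node_count < 0:
        raise ValueError("live_node_count must be non-negative")
    return 3 if live_node_count >= FULL_REPLICATION_THRESHOLD else 1
```

Under this rule, going from two nodes to three raises the replication factor to 3, and losing one of the three drops it back to 1.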
Ok, here's the scenario:
1) Node 50.0.0.4 bootstraps into a cluster consisting of nodes 50.0.0.2 and 50.0.0.3.
2) Once 50.0.0.4 is fully bootstrapped, we change the replication factor for our two keyspaces to 3.
3) Then node 50.0.0.2 is taken down permanently, and we change the replication factor back down to 1.
4) We then remove node 50.0.0.2's tokens using the removeToken call on node 50.0.0.3.
5) Then we start node 50.0.0.5, and have it join the cluster using 50.0.0.3 and 50.0.0.4 as seeds.
6) 50.0.0.5 starts receiving schema mutations to get it up to speed; the last one it receives (7d51e757-a40b-11e0-a98d-65ed1eced995) has the replication factor at 3. However, there should be more schema updates after this that never arrive (you can see them arrive at 50.0.0.4 while it is bootstrapping).
7) Minutes after receiving this last mutation, node 50.0.0.5 hits the IllegalStateException I've listed above, and I think for that reason never successfully joins the cluster.
My question is why doesn't node 50.0.0.5 receive the schema updates that follow 7d51e757-a40b-11e0-a98d-65ed1eced995? (For example, 8fc8820d-a40c-11e0-9eaf-6720e49624c2 is present in 50.0.0.4's log and sets the replication factor back down to 1.)
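To make the suspected failure mode concrete, here is a toy model in Python. It is not Cassandra's code: the class and function names are hypothetical, RuntimeError stands in for Java's IllegalStateException, the starting replication factor of 1 is an assumption, and only the two schema versions named in this message are modeled. The endpoint count of 2 is taken from the error message.

```python
# Toy model only: this is NOT Cassandra's implementation, just an illustration
# of how a truncated schema history could lead to the reported error.
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaMutation:
    version: str             # schema version UUID as it appears in the logs
    replication_factor: int  # RF of the keyspaces after this mutation


def check_replication(replication_factor: int, endpoint_count: int) -> None:
    """Raise if the RF cannot be satisfied by the known endpoints."""
    if replication_factor > endpoint_count:
        # RuntimeError stands in for Java's IllegalStateException here.
        raise RuntimeError(
            f"replication factor ({replication_factor}) exceeds "
            f"number of endpoints ({endpoint_count})"
        )


def replay(mutations, endpoint_count: int) -> int:
    """Apply the received mutations in order, then validate the final RF."""
    rf = 1  # assumed starting RF for this sketch
    for mutation in mutations:
        rf = mutation.replication_factor
    check_replication(rf, endpoint_count)
    return rf


RAISE_TO_3 = SchemaMutation("7d51e757-a40b-11e0-a98d-65ed1eced995", 3)
BACK_TO_1 = SchemaMutation("8fc8820d-a40c-11e0-9eaf-6720e49624c2", 1)
```

In this model, replay([RAISE_TO_3, BACK_TO_1], 2) succeeds with a final replication factor of 1, while replay([RAISE_TO_3], 2), which stops at the last mutation 50.0.0.5 received, raises the "exceeds number of endpoints" error.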
I've put logs for nodes 50.0.0.3/4/5 at http://pdos.csail.mit.edu/~strib/cassandra_logs.tgz . The logs are pretty messy because they include log messages from both Zookeeper and our product code -- sorry about that. Also, I think the clock on node 50.0.0.4 is a few minutes ahead of the other nodes' clocks.
I also noticed in 50.0.0.4's log the following exceptions:
2011-07-01 18:00:49,832 76315 [HintedHandoff:1] ERROR org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor - Error in ThreadPoolExecutor
java.lang.RuntimeException: java.lang.RuntimeException: Could not reach schema agreement with /50.0.0.3 in 60000ms
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

I don't know if that's related or not.

Thanks in advance,

Jeremy
