Node stuck during nodetool rebuild

Vasileios Vlachos Tue, 05 Aug 2014 01:29:26 -0700

Hello All,

We are on 1.2.18 (running on Ubuntu 12.04) and we recently tried to add a
second DC on our demo environment, just before trying it on live. The
existing DC1 has two nodes which hold approximately 10G of data (RF=2). In
order to add the second DC, DC2, we followed this procedure:


On DC1 nodes:
1. Changed the Snitch in the cassandra.yaml from default to
GossipingPropertyFileSnitch.
2. Configured the cassandra-rackdc.properties (DC1, RAC1).
3. Performed a rolling restart.
4. Updated the replication strategy for each keyspace, for example: ALTER
KEYSPACE <keyspace> WITH REPLICATION =
{'class':'NetworkTopologyStrategy','DC1':2};

On DC2 nodes:
5. Edited the cassandra.yaml with: auto_bootstrap: false, seeds (one IP from
DC1), cluster name to match whatever we have on DC1 nodes, correct IP
settings, num_tokens, initial_token left unset and finally the snitch
(GossipingPropertyFileSnitch, as in DC1).
6. Changed the cassandra-rackdc.properties (DC2, RAC1)
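
A minimal sketch for double-checking the step 5 settings on each DC2 node. The function name show_dc_settings and the default path /etc/cassandra/cassandra.yaml are assumptions for illustration; adjust the path to wherever cassandra.yaml lives on the node. Commented-out settings are deliberately not printed, so a missing line means the setting is unset or commented out.

```shell
# Print the active (uncommented) cassandra.yaml settings mentioned in step 5,
# so they can be compared across nodes.
# show_dc_settings is a hypothetical helper; the default path is an assumption.
show_dc_settings() {
    conf="${1:-/etc/cassandra/cassandra.yaml}"
    # Allow leading spaces and a list dash, because seeds is nested as "- seeds: ..."
    grep -E '^[[:space:]-]*(cluster_name|num_tokens|initial_token|auto_bootstrap|endpoint_snitch|listen_address|rpc_address|seeds):' "$conf"
}
```

Calling show_dc_settings with the path to cassandra.yaml on each node and comparing the output side by side should make a mismatched cluster name, snitch, or seed list easy to spot.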

On the Application:
7. Changed the C# DataStax driver load balancing policy to be
DCAwareRoundRobinPolicy
8. Changed the application consistency level from QUORUM to LOCAL_QUORUM

On DC2 nodes:
9. After deleting the data, commitlog and saved_caches directories, we started
cassandra on both nodes in the new DC, DC2. According to the logs, at this
point all nodes were able to see all other nodes, and nodetool status showed
the correct/expected output.

On DC1 nodes:
10. After cassandra was running on DC2, we changed the Keyspace RF to
include the new DC as follows:  ALTER KEYSPACE <keyspace> WITH REPLICATION
= {'class':'NetworkTopologyStrategy','DC1':2, 'DC2':2};
11. As a last step, in order to stream the data across to the second DC,
we ran this on node1 of DC2: nodetool rebuild DC1. After the successful
completion of this, we were planning to run the same on node2 of DC2.
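
Step 11 could be scripted roughly as follows. This is only a sketch: rebuild_sequentially is a hypothetical helper name, the host names passed to it are placeholders, and it assumes that JMX is reachable from the machine running the script and that nodetool exits non-zero when a rebuild fails.

```shell
# Run "nodetool rebuild <source-dc>" on each host in turn,
# stopping at the first failure so the next node is not started.
# rebuild_sequentially is a hypothetical helper, not part of Cassandra.
rebuild_sequentially() {
    src_dc="$1"
    shift
    for host in "$@"; do
        echo "rebuilding $host from $src_dc"
        nodetool -h "$host" rebuild "$src_dc" || return 1
    done
}
```

For the setup described here, the call would look like rebuild_sequentially DC1 followed by the addresses of node1 and node2 of DC2. Because nodetool rebuild blocks until streaming finishes, the second node is only started after the first one returns.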

The problem is that nodetool rebuild seems to be stuck: nodetool netstats
on node1 of DC2 appears to be stuck at 10%, streaming a 5G file from node2
at DC1. This doesn't tally with nodetool netstats when running it against
either of the DC1 nodes. The DC1 nodes don't think they are streaming
anything to DC2.
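
The comparison described above can be sketched as a small loop that prints nodetool netstats for each node one after another. The function name compare_netstats is hypothetical and the host names are placeholders; it assumes nodetool can reach each node's JMX port from where it is run.

```shell
# Print "nodetool netstats" for every host given, with a header per host,
# so the receiving and sending sides of a stream can be compared.
# compare_netstats is a hypothetical helper, not part of Cassandra.
compare_netstats() {
    for host in "$@"; do
        echo "== $host =="
        nodetool -h "$host" netstats
    done
}
```

Running it with the rebuilding DC2 node first and the two DC1 nodes after it puts the receiving and sending views in one output, which makes a one-sided stream like the one described here easier to see.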

It is worth pointing out that initially we tried to run 'nodetool rebuild DC1'
on both nodes at DC2, given the small amount of data to be streamed in
total (approximately 10G as I explained above). We experienced the same
problem, with the only difference being that 'nodetool rebuild DC1' got stuck
on both nodes at DC2 very soon after running it, whereas now it happened
only after running it for an hour or so. We thought the problem was that we
tried to run nodetool against both nodes at the same time. So, we tried
running it only against node1 after we deleted all the data, commitlog and
caches on both nodes and started from step (9) again. Now nodetool rebuild
has been running against node1 at DC2 for more than 12 hours with no luck... The
weird thing is that the cassandra logs appear to be clean and the VPN
between the two DCs has no problems at all.

Any thoughts? Have we missed something in the steps I described? Is
anything wrong in the procedure? Any help would be much appreciated.

Thanks,

Vasilis
