TL;DR: you need to run repair in between those two steps. Full explanation:
https://issues.apache.org/jira/browse/CASSANDRA-2434
https://issues.apache.org/jira/browse/CASSANDRA-5901
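A minimal sketch of the safer order of operations, assuming `nodetool` is on the PATH on each node; `mykeyspace` is a placeholder keyspace name, not from your setup:

```shell
# 1. Bring up the new nodes one at a time and let each finish joining
#    the ring before starting the next.

# 2. BEFORE decommissioning anything, repair each node so every replica
#    actually holds the data it is now responsible for (run per node):
nodetool repair mykeyspace

# 3. Only after repair completes everywhere, decommission the old nodes,
#    one at a time (run on the node being removed):
nodetool decommission
```

Without the repair step, a new node that bootstrapped from only one replica can become the sole owner of ranges it never fully received, and decommissioning the old replicas then drops that data.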
Thanks,
-Jeremiah Jordan

On Nov 25, 2013, at 11:00 AM, Christopher J. Bottaro <cjbott...@academicworks.com> wrote:

> Hello,
>
> We recently experienced (pretty severe) data loss after moving our 4-node
> Cassandra cluster from one EC2 availability zone to another. Our strategy
> for doing so was as follows:
>
> 1. One at a time, bring up new nodes in the new availability zone and
>    have them join the cluster.
> 2. One at a time, decommission the old nodes in the old availability zone
>    and turn them off (stop the Cassandra process).
>
> Everything seemed to work as expected. As we decommissioned each node, we
> checked the logs for messages indicating "yes, this node is done
> decommissioning" before turning the node off.
>
> Pretty quickly after the old nodes left the cluster, we started getting
> client calls about missing data.
>
> We immediately turned the old nodes back on, and when they rejoined the
> cluster *most* of the reported missing data returned. For the rest of the
> missing data, we had to spin up a new cluster from EBS snapshots and copy
> it over.
>
> What did we do wrong?
>
> In hindsight, we noticed a few things which may be clues:
>
> - The new nodes had much lower load after joining the cluster than the
>   old ones (3-4 GB as opposed to 10 GB).
> - We have EC2Snitch turned on, although we're using SimpleStrategy for
>   replication.
> - The new nodes showed even ownership (via nodetool status) after joining
>   the cluster.
>
> Here's more info about our cluster:
>
> - Cassandra 1.2.10
> - Replication factor of 3
> - Vnodes with 256 tokens
> - All tables made via CQL
> - Data dirs on EBS (yes, we are aware of the performance implications)
>
> Thanks for the help.