Hello,

(I read as much as I could before posting, but I couldn't find anything that answers my questions, so here's my post.)
I have a few questions. First, some background.

We have a 6-node Cassandra cluster running in one data center with the following configuration:

Cassandra version: 0.6.6
Replication factor: 3
Replica placement: RackUnaware (originally)
Disk access: standard I/O for data files, mmap for index files
RAM: 16 GB per node
Load: roughly 100 GB (+/- 20 GB) per node

We then added six more nodes in a second data center, with the goal of migrating the data to the new DC and then shutting down the original nodes. We switched all nodes to RackAwareStrategy and restarted them. We set up seeds on one of the new nodes, pointing to three of the old nodes (nodes 2, 4 and 6). We did not add any of the new nodes as seeds on the old nodes.

Inserting the new nodes into the original token ring, each one halfway between two of the original nodes, went according to plan. It just worked, as advertised. :)

We ran nodetool repair on the new nodes one at a time, waiting after each run until all activity had finished (indicated by zero pending compactions and zero pending AE (anti-entropy) stage tasks). We then moved on to running repair on the original nodes, and this is where my questions came up.

After starting repair on one node, we see a lot of GC activity (although we are not swapping and disk I/O looks fine). The AE stage's pending queue on that node also grows, to around 40-80 pending tasks, which seems normal. What doesn't seem normal is that the AE pending queue also grows substantially on every other node, including those not running repair. I would expect this on the neighbors, but not on all nodes. These queues also take a very long time to drain (over 24 hours).

Here are my questions (I can provide any additional information required):

1. If the node we ran repair on has finished (pending compactions and AE tasks both at 0), but the next node we want to repair still has non-zero compaction and AE queues, can we start the repair on that node anyway?
2. What is the effect of running repair on more than one node at a time under 0.6.6? I realize it's not recommended, but I did this accidentally and am curious about the consequences.
3. Is heavy GC activity normal during a repair, outside the documented "GC storm" cases?

By the way, really great work on Cassandra from an operations point of view. I've enjoyed working with it.

Regards, and thanks for any help.
Jake

--
Jake Maizel
Network Operations
Soundcloud

Mail & GTalk: j...@soundcloud.com
Skype: jakecloud

Rosenthaler strasse 13, 101 19, Berlin, DE