Hi Dave,

Thank you for your response.

I can clarify a couple of things here:

> 2. You grew from 2 nodes to 4, but the original 2 nodes have 200GB and the 2
> new ones have 40 GB. What's the recommended practice for rebalancing (i.e.,
> when should you do it), what's the actual procedure, and what's the expected
> impact of it?

Plus: is it likely to cause a problem in the short term if I don't rebalance
(i.e. if I just wait for 'normal activity' to somehow even out the
distribution of data)?

> 3. Cassandra nodes "disappear". (I'm not quite clear what this means.)

Nodetool reports the node as down. I'm seeing lots of "machine-x is DOWN" in
the logs. Flapping, actually. I don't have any swap configured (which I've
read somewhere might induce flapping).

The machine also feels like it goes on a hiatus - separately, but typically
observed at the same time. tail -f on the Cassandra logs stalls for several
minutes, and pending ssh sessions to the box also stall until 'something'
happens that releases the machine from its slumber. Typically that something
is a message in the logs that a compaction of a hinted handoff has completed.

As I say, nmon/top show minimal network & disk activity, and just one of the
four cores flatlining during this time. The machine *should* be more
responsive.

Actually:
http://pastebin.com/AeM2VgL3

All the machines referenced in there are ones that are in the cluster now.

> 4. You took a machine offline without decommissioning it from the cluster.
> Now the machine is gone, but the other nodes (in Gossip logs) report that
> they are still looking for it. How do you stop nodes from looking for a
> removed node?

I was attempting to drain the thing first, but that was stalling, so I stopped
Cassandra and then stopped the box. The storage and config were on EBS
(persistent disk) so they came back - it's just that the IP address of the
machine changed.
I typically use my own assigned hostnames (cass-01, cass-02, etc., say) but for
proper resolution I use the EC2 'internal hostnames'. These were updated on
all four Cassandra boxes; the other three instances of Cassandra were stopped,
and then all four were brought back up.

You say you have similar EC2-related thoughts .. have you done much on the EC2
hardware so far? Are you seeing the same kind of thing?

cheers,
Jedd.