On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe <et...@the-rowes.com> wrote:
> Hi.
>
> We've been running a 7-node cluster with RF 3 and QUORUM reads/writes in
> our production environment for a few months. It's been consistently stable
> during this period, particularly once we got our maintenance strategy
> fully worked out (per node: one repair a week and one major compaction a
> week, the latter due to the nature of our data model and usage). While
> this cluster started on the 0.7 series back in June or so, it's been
> running 0.8.3 for a while now with no issues. We upgraded to 0.8.5 two
> days ago, having previously tested the upgrade in our staging cluster
> (with an otherwise identical configuration) and verified that our
> application's various use cases appeared successful.
>
> One of our nodes suffered a disk failure yesterday. We attempted to
> replace the dead node by placing a new node at OldNode.initial_token - 1
> with auto_bootstrap on. A few things went awry from there:
>
> 1. We never saw the new node in bootstrap mode; it became available pretty
> much immediately upon joining the ring, and never reported a "joining"
> state. I did verify that auto_bootstrap was on.
>
> 2. I mistakenly ran repair on the new node rather than removetoken on the
> old node, due to a delightful mental error. The repair got nowhere fast,
> as it attempts to repair against the down node, which throws an exception.
> So I interrupted the repair, restarted the node to clear any pending
> validation compactions, and...
>
> 3. Ran removetoken for the old node.
>
> 4. We let this run for some time and eventually saw that all the nodes
> appeared to be done with their various compactions and were stuck at
> streaming. Many streams were listed as open, none making any progress.
>
> 5. I observed an RPC-related exception on the new node (where the
> removetoken was launched) and concluded that the streams were broken, so
> the process would never finish.
>
> 6. Ran a "removetoken force" to get the dead node out of the mix. No
> problems.
>
> 7. Ran a repair on the new node.
>
> 8. Validations ran, streams opened up, and again things got stuck in
> streaming, hanging for over an hour with no progress.
>
> 9. Musing that lingering tasks from the removetoken could be a factor, I
> performed a rolling restart and attempted a repair again.
>
> 10. Same problem. Did another rolling restart and attempted a fresh repair
> on the most important column family alone.
>
> 11. Same problem. The streams included CFs not specified, so I guess they
> must be for hinted handoff.
>
> In concluding that streaming is stuck, I've observed:
> - Streams will be open to the new node from other nodes, but the new node
> doesn't list them.
> - Streams will be open to the other nodes from the new node, but the other
> nodes don't list them.
> - The streams reported may make some initial progress, but then they hang
> at a particular point and do not move on for an hour or more.
> - The logs report repair-related activity until NPEs on incoming TCP
> connections show up, which appear likely to be the culprit.
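That asymmetry in the stream listings is worth pinning down from both
ends: running nodetool netstats against each side of a suspect session
should show whether the two nodes agree about it. A minimal sketch
(hostnames are hypothetical, not taken from your setup):

    # compare what each end of a suspect session thinks is streaming
    nodetool -h new-node.example.com netstats
    nodetool -h other-node.example.com netstats
    # a session listed on one side but not the other is orphaned and will
    # not complete on its own; restarting the affected node clears it
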
Can you send the stack trace from those NPEs?

> I can provide more exact details when I'm done commuting.
>
> With streaming broken on this node, I'm unable to run repairs, which is
> obviously problematic. The application didn't suffer any operational
> issues as a consequence of this, but I need to review the overnight
> results to verify we're not suffering data loss (I doubt we are).
>
> At this point, I'm considering a couple of options:
> 1. Remove the new node and let the adjacent node take over its range.
> 2. Bring the new node down, add a new one in front of it, and properly
> removetoken the problematic one.
> 3. Bring the new node down, remove all its data except for the system
> keyspace, then bring it back up and repair it.
> 4. Revert to 0.8.3 and see if that helps.
>
> Recommendations?
>
> Thanks.
> - Ethan
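
If you do end up going with option 3, a sketch of the sequence, assuming
the default /var/lib/cassandra layout (the paths and host below are
assumptions, not taken from your configuration):

    # flush memtables and stop the node accepting writes
    nodetool -h localhost drain
    # with the cassandra process stopped, remove every keyspace directory
    # except system, which keeps the node's token and identity
    cd /var/lib/cassandra/data
    find . -mindepth 1 -maxdepth 1 -type d ! -name system -exec rm -r {} +
    # clear the commitlog so nothing replays over the emptied data dirs
    rm -f /var/lib/cassandra/commitlog/*
    # restart the node, then re-stream its range
    nodetool -h localhost repair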