On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe <et...@the-rowes.com> wrote:
> Hi.
>
> We've been running a 7-node cluster with RF 3, QUORUM reads/writes in our
> production environment for a few months.  It's been consistently stable
> during this period, particularly once we got our maintenance strategy fully
> worked out (per node, one repair a week, one major compaction a week, the
> latter due to the nature of our data model and usage).  While this cluster
> started, back in June or so, on the 0.7 series, it's been running 0.8.3 for
> a while now with no issues.  We upgraded to 0.8.5 two days ago, having
> tested the upgrade in our staging cluster (with an otherwise identical
> configuration) previously and verified that our application's various use
> cases appeared successful.
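>
> For the curious, the weekly maintenance amounts to a couple of staggered
> cron entries per node along these lines (the times and the keyspace name
> "OurKeyspace" are placeholders, not our real schedule):
>
>     # one repair and one major compaction per node per week
>     0 2 * * 0   nodetool -h localhost repair OurKeyspace
>     0 2 * * 3   nodetool -h localhost compact OurKeyspace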
>
> One of our nodes suffered a disk failure yesterday.  We attempted to replace
> the dead node by placing a new node at OldNode.initial_token - 1 with
> auto_bootstrap on.  A few things went awry from there:
>
> 1. We never saw the new node in bootstrap mode; it became available pretty
> much immediately upon joining the ring, and never reported a "joining"
> state.  I did verify that auto_bootstrap was on.
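>
> (For reference, the relevant cassandra.yaml settings on the new node,
> with the token value being a placeholder:
>
>     auto_bootstrap: true
>     initial_token: 85070591730234615865843651857942052863   # OldNode's token - 1
>
> I'd have expected "nodetool ring" to show the node in a Joining state
> while it bootstrapped; instead it went straight to Up/Normal.)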
>
> 2. I mistakenly ran repair on the new node rather than removetoken on the
> old node, due to a delightful mental error.  The repair got nowhere fast, as
> it attempts to repair against the down node, which throws an exception.  So I
> interrupted the repair, restarted the node to clear any pending validation
> compactions, and...
>
> 3. Ran removetoken for the old node.
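>
> (Concretely, from the new node, with the dead node's token as a
> placeholder:
>
>     nodetool -h localhost removetoken <dead_node_token>
>
> and "nodetool removetoken status" to check on progress.)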
>
> 4. We let this run for some time and eventually saw that all the nodes
> appeared to be done with various compactions and were stuck at streaming.  Many
> streams listed as open, none making any progress.
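>
> ("Stuck" here is as reported by netstats on each node, i.e.:
>
>     nodetool -h <node> netstats
>
> which lists each open stream and its byte progress; the progress
> counters simply stopped advancing.)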
>
> 5. I observed an RPC-related exception on the new node (where the
> removetoken was launched) and concluded that the streams were broken so the
> process wouldn't ever finish.
>
> 6. Ran a "removetoken force" to get the dead node out of the mix.  No
> problems.
>
> 7. Ran a repair on the new node.
>
> 8. Validations ran, streams opened up, and again things got stuck in
> streaming, hanging for over an hour with no progress.
>
> 9. Musing that lingering tasks from the removetoken could be a factor, I
> performed a rolling restart and attempted a repair again.
>
> 10. Same problem.  Did another rolling restart and attempted a fresh repair
> on the most important column family alone.
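>
> That is, scoping the repair down to a single CF, with placeholder names:
>
>     nodetool -h localhost repair OurKeyspace MostImportantCF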
>
> 11. Same problem.  The streams included CFs I hadn't specified, so I guess
> they must be for hinted handoff.
>
> In concluding that streaming is stuck, I've observed:
> - streams will be open to the new node from other nodes, but the new node
> doesn't list them
> - streams will be open to the other nodes from the new node, but the other
> nodes don't list them
> - the streams reported may make some initial progress, but then they hang at
> a particular point and do not move on for an hour or more.
> - the logs report repair-related activity until NPEs on incoming TCP
> connections show up, which appear likely to be the culprit.

Can you send the stack traces from those NPEs?
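
Something like this should pull the full traces out of the log (the path
assumes a default package layout; adjust to your install):

    grep -n -A 20 'NullPointerException' /var/log/cassandra/system.log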

>
> I can provide more exact details when I'm done commuting.
>
> With streaming broken on this node, I'm unable to run repairs, which is
> obviously problematic.  The application didn't suffer any operational issues
> as a consequence of this, but I need to review the overnight results to
> verify we're not suffering data loss (I doubt we are).
>
> At this point, I'm considering a couple options:
> 1. Remove the new node and let the adjacent node take over its range
> 2. Bring the new node down, add a new one in front of it, and properly
> removetoken the problematic one.
> 3. Bring the new node down, remove all its data except for the system
> keyspace, then bring it back up and repair it (sketched below).
> 4. Revert to 0.8.3 and see if that helps.
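>
> For concreteness, option 3 would be something like the following on the
> new node (paths assume a stock package install; adjust as needed):
>
>     # with cassandra stopped:
>     cd /var/lib/cassandra/data
>     ls | grep -v '^system$' | xargs rm -rf
>     # start cassandra back up, then:
>     nodetool -h localhost repair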
>
> Recommendations?
>
> Thanks.
> - Ethan
>
