Here's a typical log slice (not terribly informative, I fear):

    INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106 AntiEntropyService.java (line 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for (29990798416657667504332586989223299634,54296681768153272037430773234349600451]
    INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line 181) Stream context metadata [/mnt/cassandra/data/events_production/FitsByShip-g-10-Data.db sections=88 progress=0/11707163 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-6-Data.db sections=1 progress=0/6918814 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db sections=260 progress=0/9091780 - 0%], 4 sstables.
    INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java (line 174) Streaming to /10.34.90.8
    ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-56,5,main]
    java.lang.NullPointerException
            at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:174)
            at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:114)
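In case it helps with correlating entries, here is a rough sketch (purely illustrative; the log path and patterns are assumptions based on the 0.8-style lines above) of a throwaway scan that pulls just the stream-related lines and the IncomingTcpConnection stack traces out of system.log, in their original order, so inbound and outbound sessions can be eyeballed side by side:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.regex.Pattern;

    // Throwaway scanner: prints only the log lines relevant to streaming plus any
    // stack-trace frames that follow an exception line, preserving original order
    // so outbound sessions and inbound failures can be compared by timestamp.
    public class StreamLogScan {
        // Patterns taken from the 0.8-era log lines quoted above; tweak as needed.
        private static final Pattern INTERESTING = Pattern.compile(
            "Performing streaming repair|Stream context metadata|Streaming to /"
            + "|IncomingTcpConnection|NullPointerException|Fatal exception");

        public static void main(String[] args) throws Exception {
            // Default path is an assumption; pass your own log location as arg 0.
            String path = args.length > 0 ? args[0] : "/var/log/cassandra/system.log";
            BufferedReader in = new BufferedReader(new FileReader(path));
            try {
                String line;
                boolean inTrace = false;  // true while inside a stack trace
                while ((line = in.readLine()) != null) {
                    if (INTERESTING.matcher(line).find()) {
                        System.out.println(line);
                        inTrace = line.contains("Exception");
                    } else if (inTrace && line.trim().startsWith("at ")) {
                        System.out.println(line);  // keep the frames under an exception
                    } else {
                        inTrace = false;
                    }
                }
            } finally {
                in.close();
            }
        }
    }

The idea is to run it against the logs on both ends of a suspect session and line the two outputs up by timestamp.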
Not sure if the exception is related to the outbound streaming above; other nodes are actively trying to stream to this node, so perhaps it comes from those, and the temporal adjacency to the outbound stream is just coincidental. I have other snippets that look basically identical to the above, except that if I look at the logs on the node to which this node is trying to stream, I see it has concurrently opened a stream in the other direction, which could be the one the exception pertains to.

On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
> On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe <et...@the-rowes.com> wrote:
> > Hi.
> >
> > We've been running a 7-node cluster with RF 3, QUORUM reads/writes in our production environment for a few months. It's been consistently stable during this period, particularly once we got our maintenance strategy fully worked out (per node, one repair a week and one major compaction a week, the latter due to the nature of our data model and usage). While this cluster started, back in June or so, on the 0.7 series, it's been running 0.8.3 for a while now with no issues. We upgraded to 0.8.5 two days ago, having previously tested the upgrade in our staging cluster (with an otherwise identical configuration) and verified that our application's various use cases appeared successful.
> >
> > One of our nodes suffered a disk failure yesterday. We attempted to replace the dead node by placing a new node at OldNode.initial_token - 1 with auto_bootstrap on. A few things went awry from there:
> >
> > 1. We never saw the new node in bootstrap mode; it became available pretty much immediately upon joining the ring, and never reported a "joining" state. I did verify that auto_bootstrap was on.
> >
> > 2. I mistakenly ran repair on the new node rather than removetoken on the old node, due to a delightful mental error. The repair got nowhere fast, as it attempts to repair against the down node, which throws an exception. So I interrupted the repair, restarted the node to clear any pending validation compactions, and...
> >
> > 3. Ran removetoken for the old node.
> >
> > 4. We let this run for some time and eventually saw that all the nodes appeared to be done with their various compactions and were stuck at streaming: many streams listed as open, none making any progress.
> >
> > 5. I observed an RPC-related exception on the new node (where the removetoken was launched) and concluded that the streams were broken, so the process wouldn't ever finish.
> >
> > 6. Ran a "removetoken force" to get the dead node out of the mix. No problems.
> >
> > 7. Ran a repair on the new node.
> >
> > 8. Validations ran, streams opened up, and again things got stuck in streaming, hanging for over an hour with no progress.
> >
> > 9. Musing that lingering tasks from the removetoken could be a factor, I performed a rolling restart and attempted a repair again.
> >
> > 10. Same problem. Did another rolling restart and attempted a fresh repair on the most important column family alone.
> >
> > 11. Same problem. The streams included CFs not specified in the repair, so I guess they must be for hinted handoff.
> >
> > In concluding that streaming is stuck, I've observed:
> > - streams will be open to the new node from other nodes, but the new node doesn't list them;
> > - streams will be open to the other nodes from the new node, but the other nodes don't list them;
> > - the streams reported may make some initial progress, but then they hang at a particular point and don't move on for an hour or more;
> > - the logs report repair-related activity until NPEs on incoming TCP connections show up, which appear likely to be the culprit.
>
> Can you send the stack trace from those NPEs?
>
> > I can provide more exact details when I'm done commuting.
> >
> > With streaming broken on this node, I'm unable to run repairs, which is obviously problematic. The application didn't suffer any operational issues as a consequence of this, but I need to review the overnight results to verify we're not suffering data loss (I doubt we are).
> >
> > At this point, I'm considering a few options:
> > 1. Remove the new node and let the adjacent node take over its range.
> > 2. Bring the new node down, add a new one in front of it, and properly removetoken the problematic one.
> > 3. Bring the new node down, remove all its data except for the system keyspace, then bring it back up and repair it.
> > 4. Revert to 0.8.3 and see if that helps.
> >
> > Recommendations?
> >
> > Thanks.
> > - Ethan
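P.S. On the "each side disagrees about which streams are open" symptom: beyond what nodetool reports, it can be worth dumping what the streaming MBeans expose over JMX on both ends of a suspect session and comparing the two views. The sketch below is only illustrative; it assumes the default JMX port (7199) and that the relevant MBeans sit under the org.apache.cassandra.streaming domain, and it simply prints every attribute it finds rather than relying on version-specific attribute names.

    import java.util.Set;
    import javax.management.MBeanAttributeInfo;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Dumps every attribute of every MBean in the org.apache.cassandra.streaming
    // domain on one node, so the view from each end of a stuck stream can be
    // compared. If nothing turns up under that domain on your version, widen the
    // query pattern (e.g. to "org.apache.cassandra*:*").
    public class DumpStreamingMBeans {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "10.34.90.8";  // placeholder node
            String port = args.length > 1 ? args[1] : "7199";        // default Cassandra JMX port (assumption)
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                Set<ObjectName> names =
                    mbs.queryNames(new ObjectName("org.apache.cassandra.streaming:*"), null);
                for (ObjectName name : names) {
                    System.out.println("== " + name);
                    for (MBeanAttributeInfo attr : mbs.getMBeanInfo(name).getAttributes()) {
                        try {
                            System.out.println("  " + attr.getName() + " = "
                                               + mbs.getAttribute(name, attr.getName()));
                        } catch (Exception e) {
                            System.out.println("  " + attr.getName() + " = <unreadable: " + e + ">");
                        }
                    }
                }
            } finally {
                jmxc.close();
            }
        }
    }

Run against both the sending and receiving node, the two dumps make it easy to spot a session that only one side still believes is open.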