If you added the new node as a seed, it would ignore bootstrap mode. And bootstrap / repair *do* use streaming so you'll want to re-run repair post-scrub. (No need to re-bootstrap since you're repairing.)
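If you want to rule the seed possibility out, something like this is a quick sanity check (the hostname and config path are placeholders for your install, and the exact seed_provider layout in cassandra.yaml varies a bit between versions):

    # check that the replacement node's own IP is not in the seeds list shown here
    grep -A 4 seed_provider /etc/cassandra/cassandra.yaml
    # and that auto_bootstrap is still enabled on that node
    grep auto_bootstrap /etc/cassandra/cassandra.yaml
    # after the scrubs, re-run repair and watch the streams
    nodetool -h new-node.example.com repair events_production
    nodetool -h new-node.example.com netstats

(events_production is just the keyspace from your log snippet; leave the argument off to repair everything.)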
Scrub is a little less heavyweight than major compaction but same ballpark. It runs sstable-at-a-time so (as long as you haven't been in the habit of forcing majors) space should not be a concern.

On Thu, Sep 15, 2011 at 8:40 AM, Ethan Rowe <et...@the-rowes.com> wrote:
> On Thu, Sep 15, 2011 at 9:21 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> Where did the data loss come in?
>
> The outcome of the analytical jobs run overnight while some of these repairs were (not) running is consistent with what I would expect if perhaps 20-30% of the source data was missing. Given the strong consistency model we're using, this is surprising to me, since the jobs did not report any read or write failures. I wonder if this is a consequence of the dead node missing and the new node being operational but having received basically none of its hinted handoff streams. Perhaps with streaming fixed the data will reappear, which would be a happy outcome, but if not, I can reimport the critical stuff from files.
>
>> Scrub is safe to run in parallel.
>
> Is it somewhat analogous to a major compaction in terms of I/O impact, with perhaps less greedy use of disk space?
>
>> On Thu, Sep 15, 2011 at 8:08 AM, Ethan Rowe <et...@the-rowes.com> wrote:
>> > After further review, I'm definitely going to scrub all the original nodes in the cluster.
>> >
>> > We've lost some data as a result of this situation. It can be restored, but the question is what to do with the problematic new node first. I don't particularly care about the data that's on it, since I'm going to re-import the critical data from files anyway, and then I can recreate derivative data afterwards. So it's purely a matter of getting the cluster healthy again as quickly as possible so I can begin that import process.
>> >
>> > Any issue with running scrubs on multiple nodes at a time, provided they aren't replication neighbors?
>> >
>> > On Thu, Sep 15, 2011 at 8:18 AM, Ethan Rowe <et...@the-rowes.com> wrote:
>> >>
>> >> I just noticed the following from one of Jonathan Ellis' messages yesterday:
>> >>>
>> >>> Added to NEWS:
>> >>>
>> >>>   - After upgrading, run nodetool scrub against each node before running repair, moving nodes, or adding new ones.
>> >>
>> >> We did not do this, as it was not indicated as necessary in the news when we were dealing with the upgrade.
>> >>
>> >> So perhaps I need to scrub everything before going any further, though the question is what to do with the problematic node. Additionally, it would be helpful to know if scrub will affect the hinted handoffs that have accumulated, as these seem likely to be part of the set of failing streams.
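To put that NEWS note in concrete terms, the post-upgrade sequence is roughly the following (hostnames are placeholders, and the keyspace argument is optional; it's shown here only to limit the work to the column families you care about):

    # scrub each node, one at a time or a few in parallel as long as they aren't replication neighbors
    nodetool -h node1.example.com scrub events_production
    nodetool -h node2.example.com scrub events_production
    # ...and so on around the ring; nodetool compactionstats should show whether a scrub is still running
    # only then kick off the repair on the node that needs it
    nodetool -h new-node.example.com repair events_production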
>> >> On Thu, Sep 15, 2011 at 8:13 AM, Ethan Rowe <et...@the-rowes.com> wrote:
>> >>>
>> >>> Here's a typical log slice (not terribly informative, I fear):
>> >>>>
>> >>>> INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106 AntiEntropyService.java (line 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for (29990798416657667504332586989223299634,54296681768153272037430773234349600451]
>> >>>> INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line 181) Stream context metadata [/mnt/cassandra/data/events_production/FitsByShip-g-10-Data.db sections=88 progress=0/11707163 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-6-Data.db sections=1 progress=0/6918814 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db sections=260 progress=0/9091780 - 0%], 4 sstables.
>> >>>> INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java (line 174) Streaming to /10.34.90.8
>> >>>> ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-56,5,main]
>> >>>> java.lang.NullPointerException
>> >>>>     at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:174)
>> >>>>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:114)
>> >>>
>> >>> Not sure if the exception is related to the outbound streaming above; other nodes are actively trying to stream to this node, so perhaps it comes from those, and the temporal adjacency to the outbound stream is just coincidental. I have other snippets that look basically identical to the above, except that if I look at the logs of the node to which this node is trying to stream, I see that it has concurrently opened a stream in the other direction, which could be the one the exception pertains to.
>> >>>
>> >>> On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>> >>>>
>> >>>> On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe <et...@the-rowes.com> wrote:
>> >>>> > Hi.
>> >>>> >
>> >>>> > We've been running a 7-node cluster with RF 3, QUORUM reads/writes in our production environment for a few months. It's been consistently stable during this period, particularly once we got our maintenance strategy fully worked out (per node, one repair a week and one major compaction a week, the latter due to the nature of our data model and usage). While this cluster started, back in June or so, on the 0.7 series, it's been running 0.8.3 for a while now with no issues. We upgraded to 0.8.5 two days ago, having previously tested the upgrade in our staging cluster (with an otherwise identical configuration) and verified that our application's various use cases appeared successful.
>> >>>> >
>> >>>> > One of our nodes suffered a disk failure yesterday. We attempted to replace the dead node by placing a new node at OldNode.initial_token - 1 with auto_bootstrap on. A few things went awry from there:
>> >>>> >
>> >>>> > 1. We never saw the new node in bootstrap mode; it became available pretty much immediately upon joining the ring, and never reported a "joining" state. I did verify that auto_bootstrap was on.
>> >>>> >
>> >>>> > 2. I mistakenly ran repair on the new node rather than removetoken on the old node, due to a delightful mental error. The repair got nowhere fast, as it attempts to repair against the down node, which throws an exception. So I interrupted the repair, restarted the node to clear any pending validation compactions, and...
>> >>>> >
>> >>>> > 3. Ran removetoken for the old node.
>> >>>> >
>> >>>> > 4. We let this run for some time and eventually saw that all the nodes appeared to be done with various compactions and were stuck at streaming. Many streams were listed as open, none making any progress.
>> >>>> >
>> >>>> > 5. I observed an RPC-related exception on the new node (where the removetoken was launched) and concluded that the streams were broken, so the process wouldn't ever finish.
>> >>>> >
>> >>>> > 6. Ran a "removetoken force" to get the dead node out of the mix. No problems.
>> >>>> >
>> >>>> > 7. Ran a repair on the new node.
>> >>>> >
>> >>>> > 8. Validations ran, streams opened up, and again things got stuck in streaming, hanging for over an hour with no progress.
>> >>>> >
>> >>>> > 9. Musing that lingering tasks from the removetoken could be a factor, I performed a rolling restart and attempted a repair again.
>> >>>> >
>> >>>> > 10. Same problem. Did another rolling restart and attempted a fresh repair on the most important column family alone.
>> >>>> >
>> >>>> > 11. Same problem. Streams included CFs not specified, so I guess they must be for hinted handoff.
>> >>>> >
>> >>>> > In concluding that streaming is stuck, I've observed:
>> >>>> > - Streams will be open to the new node from other nodes, but the new node doesn't list them.
>> >>>> > - Streams will be open to the other nodes from the new node, but the other nodes don't list them.
>> >>>> > - The streams reported may make some initial progress, but then they hang at a particular point and do not move on for an hour or more.
>> >>>> > - The logs report repair-related activity, until NPEs on incoming TCP connections show up, which appear likely to be the culprit.
>> >>>>
>> >>>> Can you send the stack trace from those NPEs?
>> >>>>
>> >>>> > I can provide more exact details when I'm done commuting.
>> >>>> >
>> >>>> > With streaming broken on this node, I'm unable to run repairs, which is obviously problematic. The application didn't suffer any operational issues as a consequence of this, but I need to review the overnight results to verify we're not suffering data loss (I doubt we are).
>> >>>> >
>> >>>> > At this point, I'm considering a couple options:
>> >>>> > 1. Remove the new node and let the adjacent node take over its range.
>> >>>> > 2. Bring the new node down, add a new one in front of it, and properly removetoken the problematic one.
>> >>>> > 3. Bring the new node down, remove all its data except for the system keyspace, then bring it back up and repair it.
>> >>>> > 4. Revert to 0.8.3 and see if that helps.
>> >>>> >
>> >>>> > Recommendations?
>> >>>> >
>> >>>> > Thanks.
>> >>>> > - Ethan

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com