Running repairs when you have corrupt sstables can spread the corruption. In 2.1.15, corruption is almost certainly from something like a bad disk or bad RAM.
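A quick way to sanity-check the bad-hardware theory is to look for kernel-level I/O errors and to see how widespread the sstable corruption already is. This is only a sketch; the log path and grep patterns assume a typical Linux package install:

    # disk errors reported by the kernel
    dmesg | grep -i 'i/o error'
    # how many corruption reports Cassandra itself has logged
    grep -ci corrupt /var/log/cassandra/system.log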
One way to deal with corruption is to stop the node and replace it (with -Dcassandra.replace_address) so you restream data from neighbors; a sketch of that flag is included after this message. The challenge here is making sure you have a healthy replica to stream from. Please make sure you have backups and snapshots if you have corruption popping up.

If you're using vnodes, once you get rid of the corruption you may consider adding another C node with fewer vnodes to try to get it joined faster with less data.

-- 
Jeff Jirsa


> On Aug 13, 2017, at 7:11 AM, Brian Spindler <brian.spind...@gmail.com> wrote:
> 
> Hi Jeff, I ran the scrub online and that didn't help. I went ahead and
> stopped the node, deleted all the corrupted <cf>-<num>-*.db data files,
> and planned on running a repair when it came back online.
> 
> Unrelated, I believe, but now another CF is corrupted!
> 
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
> /ephemeral/cassandra/data/OpsCenter/rollups300-45c85324387b35238d056678f8fa8b0f/OpsCenter-rollups300-ka-100672-Data.db
> Caused by: org.apache.cassandra.io.compress.CorruptBlockException:
> (/ephemeral/cassandra/data/OpsCenter/rollups300-45c85324387b35238d056678f8fa8b0f/OpsCenter-rollups300-ka-100672-Data.db):
> corruption detected, chunk at 101500 of length 26523398.
> 
> A few days ago when troubleshooting this I changed the OpsCenter keyspace from RF == 3
> to RF == 2, since I thought that would help reduce load. Did that cause this
> corruption?
> 
> I'm running 'nodetool scrub OpsCenter rollups300' on that node now.
> 
> And now I also see this when running nodetool status:
> 
> "Note: Non-system keyspaces don't have the same replication settings,
> effective ownership information is meaningless"
> 
> What to do?
> 
> I still can't stream to this new node because of this corruption, and disk space
> is getting low on these nodes ...
> 
>> On Sat, Aug 12, 2017 at 9:51 PM Brian Spindler <brian.spind...@gmail.com>
>> wrote:
>> Nothing in the logs on the node that it was streaming from.
>> 
>> However, I think I found the issue on the other node in the C rack:
>> 
>> ERROR [STREAM-IN-/10.40.17.114] 2017-08-12 16:48:53,354
>> StreamSession.java:512 - [Stream #08957970-7f7e-11e7-b2a2-a31e21b877e5]
>> Streaming error occurred
>> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
>> /ephemeral/cassandra/data/...
>> 
>> I did a 'cat /var/log/cassandra/system.log | grep Corrupt' and it seems it's
>> a single Index.db file, and nothing on the other node.
>> 
>> I think nodetool scrub or offline sstablescrub might be in order, but with
>> the current load I'm not sure I can take it offline for very long.
>> 
>> Thanks again for the help.
>> 
>> 
>>> On Sat, Aug 12, 2017 at 9:38 PM Jeffrey Jirsa <jji...@gmail.com> wrote:
>>> Compaction is backed up – that may be normal write load (because of the
>>> rack imbalance), or it may be a secondary index build. Hard to say for
>>> sure. ‘nodetool compactionstats’ if you’re able to provide it. The jstack is
>>> probably not necessary; streaming is being marked as failed and it’s
>>> turning itself off. Not sure why streaming is marked as failing, though;
>>> anything on the sending sides?
>>> 
>>> 
>>> From: Brian Spindler <brian.spind...@gmail.com>
>>> Reply-To: <user@cassandra.apache.org>
>>> Date: Saturday, August 12, 2017 at 6:34 PM
>>> To: <user@cassandra.apache.org>
>>> Subject: Re: Dropping down replication factor
>>> 
>>> Thanks for replying Jeff.
>>> 
>>> Responses below.
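To make the replace-node path Jeff describes above concrete, here is a minimal sketch. It assumes a package-style install where JVM options live in /etc/cassandra/cassandra-env.sh, and that the replacement node starts with empty data directories; the IP is a placeholder for the node being replaced:

    # on the replacement node, before its first start
    echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<ip_of_node_being_replaced>"' | sudo tee -a /etc/cassandra/cassandra-env.sh
    sudo service cassandra start
    # remove the flag again once the node has finished streaming and joined the ring

If you instead add a brand-new node to the C rack, the "fewer vnodes" suggestion maps to lowering num_tokens in that node's cassandra.yaml (the 2.1 default is 256) before its first start.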
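For the offline sstablescrub option mentioned in the quoted message above, a rough sketch of both variants (standard 2.1 tools; the service commands assume a package install, and offline downtime depends on the size of the table):

    # online scrub, per table (what is already running above)
    nodetool scrub OpsCenter rollups300
    # offline variant: stop the node, scrub, then restart
    sudo service cassandra stop
    sstablescrub OpsCenter rollups300
    sudo service cassandra start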
>>> 
>>>> On Sat, Aug 12, 2017 at 8:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>> Answers inline
>>>> 
>>>> -- 
>>>> Jeff Jirsa
>>>> 
>>>> 
>>>> > On Aug 12, 2017, at 2:58 PM, brian.spind...@gmail.com wrote:
>>>> > 
>>>> > Hi folks, hopefully a quick one:
>>>> > 
>>>> > We are running a 12 node cluster (2.1.15) in AWS with Ec2Snitch. It's
>>>> > all in one region but spread across 3 availability zones. It was nicely
>>>> > balanced with 4 nodes in each.
>>>> > 
>>>> > But with a couple of failures and subsequent provisions to the wrong AZ,
>>>> > we now have a cluster with:
>>>> > 
>>>> > 5 nodes in AZ A
>>>> > 5 nodes in AZ B
>>>> > 2 nodes in AZ C
>>>> > 
>>>> > Not sure why, but when adding a third node in AZ C it fails to stream
>>>> > after getting all the way to completion, with no apparent error in the logs.
>>>> > I've looked at a couple of bugs referring to scrubbing and possible OOM
>>>> > bugs due to metadata writing at the end of streaming (sorry, I don't have the
>>>> > ticket handy). I'm worried I might not be able to do much with these
>>>> > since disk space usage is high and they are under a lot of load,
>>>> > given the small number of them for this rack.
>>>> 
>>>> You'll definitely have higher load on the AZ C instances with rf=3 in this
>>>> ratio.
>>>> 
>>>> Streaming should still work - are you sure it's not busy doing something,
>>>> like building a secondary index or similar? A jstack thread dump would be
>>>> useful, or at least nodetool tpstats.
>>>> 
>>> 
>>> The only other thing might be a backup. We do incrementals every hour and
>>> snapshots every 24 hours; they are shipped to S3 and then the links are cleaned up.
>>> The error I get on the node I'm trying to add to rack C is:
>>> 
>>> ERROR [main] 2017-08-12 23:54:51,546 CassandraDaemon.java:583 - Exception encountered during startup
>>> java.lang.RuntimeException: Error during boostrap: Stream failed
>>>         at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:87) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>         at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1166) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>         at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:944) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:740) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:617) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:391) [apache-cassandra-2.1.15.jar:2.1.15]
>>>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:566) [apache-cassandra-2.1.15.jar:2.1.15]
>>>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:655) [apache-cassandra-2.1.15.jar:2.1.15]
>>> Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
>>>         at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>         at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
>>>         at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na]
>>>         at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.jar:na]
>>>         at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.jar:na]
>>>         at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-16.0.jar:na]
>>>         at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:209) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>         at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:185) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>         at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:413) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>         at org.apache.cassandra.streaming.StreamSession.maybeCompleted(StreamSession.java:700) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>         at org.apache.cassandra.streaming.StreamSession.taskCompleted(StreamSession.java:661) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>         at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:179) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_112]
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_112]
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_112]
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_112]
>>>         at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_112]
>>> WARN  [StorageServiceShutdownHook] 2017-08-12 23:54:51,582 Gossiper.java:1462 - No local state or state is in silent shutdown, not announcing shutdown
>>> INFO  [StorageServiceShutdownHook] 2017-08-12 23:54:51,582 MessagingService.java:734 - Waiting for messaging service to quiesce
>>> INFO  [ACCEPT-/10.40.17.114] 2017-08-12 23:54:51,583 MessagingService.java:1020 - MessagingService has terminated the accept() thread
>>> 
>>> And I got this on the same node when it was bootstrapping; I ran 'nodetool
>>> netstats' just before it shut down:
>>> 
>>>     Receiving 377 files, 161928296443 bytes total. Already received 377 files, 161928296443 bytes total
>>> 
>>> TPStats on the host that was streaming the data to this node:
>>> 
>>> Pool Name                    Active   Pending    Completed   Blocked  All time blocked
>>> MutationStage                     1         1   4488289014         0                 0
>>> ReadStage                         0         0     24486526         0                 0
>>> RequestResponseStage              0         0   3038847374         0                 0
>>> ReadRepairStage                   0         0      1601576         0                 0
>>> CounterMutationStage              0         0        68403         0                 0
>>> MiscStage                         0         0            0         0                 0
>>> AntiEntropySessions               0         0            0         0                 0
>>> HintedHandoff                     0         0           18         0                 0
>>> GossipStage                       0         0      2786892         0                 0
>>> CacheCleanupExecutor              0         0            0         0                 0
>>> InternalResponseStage             0         0        61115         0                 0
>>> CommitLogArchiver                 0         0            0         0                 0
>>> CompactionExecutor                4        83       304167         0                 0
>>> ValidationExecutor                0         0        78249         0                 0
>>> MigrationStage                    0         0        94201         0                 0
>>> AntiEntropyStage                  0         0       160505         0                 0
>>> PendingRangeCalculator            0         0           30         0                 0
>>> Sampler                           0         0            0         0                 0
>>> MemtableFlushWriter               0         0        71270         0                 0
>>> MemtablePostFlush                 0         0       175209         0                 0
>>> MemtableReclaimMemory             0         0        81222         0                 0
>>> Native-Transport-Requests         2         0   1983565628         0           9405444
>>> 
>>> Message type        Dropped
>>> READ                    218
>>> RANGE_SLICE              15
>>> _TRACE                    0
>>> MUTATION            2949001
>>> COUNTER_MUTATION          0
>>> BINARY                    0
>>> REQUEST_RESPONSE          0
>>> PAGED_RANGE               0
>>> READ_REPAIR            8571
>>> 
>>> I can get a jstack if needed.
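As a side note on the diagnostics discussed above, the information Jeff asked for is typically captured with the commands below; a rough sketch, run on the node(s) streaming data to the bootstrapping node, assuming the Cassandra process can be located by its main class (adjust the pgrep pattern for your install):

    nodetool compactionstats          # active/pending compactions (what Jeff asked for)
    nodetool netstats                 # progress of the streams themselves
    nodetool tpstats                  # the thread-pool snapshot shown above
    # thread dump, if one is still wanted
    jstack $(pgrep -f CassandraDaemon) > /tmp/cassandra-threads.txt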
>>>> 
>>>> > 
>>>> > Rather than troubleshoot this further, what I was thinking about doing
>>>> > was:
>>>> > - drop the replication factor on our keyspace to two
>>>> 
>>>> Repair before you do this, or you'll lose your consistency guarantees.
>>> 
>>> Given the load on the 2 nodes in rack C, I'm hoping a repair will succeed.
>>> 
>>>> 
>>>> > - hopefully this would reduce load on these two remaining nodes
>>>> 
>>>> It should. Rack awareness guarantees one replica per rack if rf == num
>>>> racks, so right now those 2 C machines have 2.5x as much data as the
>>>> others. This will drop that requirement and drop the load significantly.
>>>> 
>>>> > - run repairs/cleanup across the cluster
>>>> > - then shoot these two nodes in the 'c' rack
>>>> 
>>>> Why shoot the C instances? Why not drop RF and then add 2 more C
>>>> instances, then increase RF back to 3, run repair, then decom the extra
>>>> instances in A and B?
>>>> 
>>>> 
>>> 
>>> Fair point. I was considering staying at RF two, but I think with your
>>> points below, I should reconsider.
>>> 
>>>> > - run repairs/cleanup across the cluster
>>>> > 
>>>> > Would this work with minimal/no disruption?
>>>> 
>>>> The big risk of running rf=2 is that quorum == all - any GC pause or node
>>>> restart will make you lose HA or strong consistency guarantees.
>>>> 
>>>> > Should I update their "rack" beforehand or after?
>>>> 
>>>> You can't change a node's rack once it's in the cluster; it SHOULD refuse
>>>> to start if you do that.
>>>> 
>>> 
>>> Got it.
>>> 
>>>> > What else am I not thinking about?
>>>> > 
>>>> > My main goal atm is to get back to where the cluster is in a clean,
>>>> > consistent state that allows nodes to properly bootstrap.
>>>> > 
>>>> > Thanks for your help in advance.
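For reference, the sequence Jeff suggests (repair, drop RF, grow the C rack, raise RF back, repair/cleanup, then decommission the extras) maps to commands roughly like the sketch below. The keyspace and datacenter names are placeholders: with Ec2Snitch the datacenter is the region name (e.g. 'us-east'), and you should keep whatever replication class the keyspace already uses (NetworkTopologyStrategy is assumed here). The arithmetic behind Jeff's rf=2 warning: QUORUM is floor(RF/2) + 1, so at RF=2 quorum is 2, i.e. every replica.

    # 1) repair first, then drop RF on the application keyspace (placeholder names)
    nodetool repair your_keyspace
    cqlsh -e "ALTER KEYSPACE your_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 2};"

    # 2) bootstrap the new C-rack nodes, raise RF back to 3, then repair and clean up on every node
    cqlsh -e "ALTER KEYSPACE your_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3};"
    nodetool repair your_keyspace
    nodetool cleanup your_keyspace

    # 3) finally, decommission the extra A/B instances one at a time (run on each node being removed)
    nodetool decommission

Passing a keyspace name to nodetool status (e.g. 'nodetool status your_keyspace') also sidesteps the "effective ownership information is meaningless" note, since ownership is then computed against that keyspace's replication settings.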