Thank you for the investigation.

On Dec 2, 2015 5:21 AM, "PenguinWhispererThe ." <th3penguinwhispe...@gmail.com> wrote:
So it seems I found the problem.

The node opening a stream is waiting for the other node to respond, but that node never responds due to a broken pipe, which makes Cassandra wait forever.

It's basically this issue:
https://issues.apache.org/jira/browse/CASSANDRA-8472
And this is the workaround/fix:
https://issues.apache.org/jira/browse/CASSANDRA-8611

So:
- update Cassandra to >= 2.0.11
- add the option streaming_socket_timeout_in_ms = 10000
- do a rolling restart of Cassandra

What's weird is that the "IOException: Broken pipe" is never shown in my logs (not on any node), and my logging is set to INFO in the log4j config. I have this config in log4j-server.properties:

    # output messages into a rolling log file as well as stdout
    log4j.rootLogger=INFO,stdout,R

    # stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%5p %d{HH:mm:ss,SSS} %m%n

    # rolling log file
    log4j.appender.R=org.apache.log4j.RollingFileAppender
    log4j.appender.R.maxFileSize=20MB
    log4j.appender.R.maxBackupIndex=50
    log4j.appender.R.layout=org.apache.log4j.PatternLayout
    log4j.appender.R.layout.ConversionPattern=%5p [%t] %d{ISO8601} %F (line %L) %m%n
    # Edit the next line to point to your logs directory
    log4j.appender.R.File=/var/log/cassandra/system.log

    # Application logging options
    #log4j.logger.org.apache.cassandra=DEBUG
    #log4j.logger.org.apache.cassandra.db=DEBUG
    #log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG

    # Adding this to avoid thrift logging disconnect errors.
    log4j.logger.org.apache.thrift.server.TNonblockingServer=ERROR

Too bad nobody could point me to those. Hope it saves someone else from wasting a lot of time.

2015-11-11 15:42 GMT+01:00 Sebastian Estevez <sebastian.este...@datastax.com>:

Use 'nodetool compactionhistory'

all the best,

Sebastián

On Nov 11, 2015 3:23 AM, "PenguinWhispererThe ." <th3penguinwhispe...@gmail.com> wrote:

Does compactionstats show only stats for completed compactions (100%)? It might be that the compaction is running constantly, over and over again. In that case I need to know what I can do to stop this constant compaction so I can start a nodetool repair.

Note that there is a lot of traffic on this columnfamily, so I'm not sure if temporarily disabling compaction is an option. The repair will probably take long as well.

Sebastian and Rob: do you have any more ideas about the things I put in this thread? Any help is appreciated!

2015-11-10 20:03 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:

Hi Sebastian,

Thanks for your response.

No swap is used. No offense, I just don't see a reason why having swap would be the issue here. I put swappiness on 1. I also have jna installed. That should prevent Java being swapped out as well, AFAIK.
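For reference, the workaround listed at the very top of this thread comes down to one cassandra.yaml setting plus a rolling restart. A minimal sketch of applying it on each node, assuming a package-based install (in the yaml file the value is given with a colon, and the exact restart command depends on the distribution):

    # cassandra.yaml (Cassandra >= 2.0.11): fail streaming sockets that stall
    # instead of waiting forever (a value of 0 means no timeout)
    streaming_socket_timeout_in_ms: 10000

    # rolling restart, one node at a time; wait until the node shows Up/Normal
    # in "nodetool status" before moving on to the next one
    nodetool drain
    sudo service cassandra restart
    nodetool status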
2015-11-10 19:50 GMT+01:00 Sebastian Estevez <sebastian.este...@datastax.com>:

Turn off Swap.

http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installRecommendSettings.html?scroll=reference_ds_sxl_gf3_2k__disable-swap

All the best,

Sebastián Estévez
Solutions Architect | DataStax

On Tue, Nov 10, 2015 at 1:48 PM, PenguinWhispererThe . <th3penguinwhispe...@gmail.com> wrote:

I also have the following memory usage:

    [root@US-BILLINGDSX4 cassandra]# free -m
                 total       used       free     shared    buffers     cached
    Mem:         12024        9455       2569          0        110       2163
    -/+ buffers/cache:        7180       4844
    Swap:         2047           0       2047

Still a lot free and a lot of free buffers/cache.

2015-11-10 19:45 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:

Still stuck with this. However, I enabled GC logging, which shows the following:

    [root@myhost cassandra]# tail -f gc-1447180680.log
    2015-11-10T18:41:45.516+0000: 225.428: [GC 2721842K->2066508K(6209536K), 0.0199040 secs]
    2015-11-10T18:41:45.977+0000: 225.889: [GC 2721868K->2066511K(6209536K), 0.0221910 secs]
    2015-11-10T18:41:46.437+0000: 226.349: [GC 2721871K->2066524K(6209536K), 0.0222140 secs]
    2015-11-10T18:41:46.897+0000: 226.809: [GC 2721884K->2066539K(6209536K), 0.0224140 secs]
    2015-11-10T18:41:47.359+0000: 227.271: [GC 2721899K->2066538K(6209536K), 0.0302520 secs]
    2015-11-10T18:41:47.821+0000: 227.733: [GC 2721898K->2066557K(6209536K), 0.0280530 secs]
    2015-11-10T18:41:48.293+0000: 228.205: [GC 2721917K->2066571K(6209536K), 0.0218000 secs]
    2015-11-10T18:41:48.790+0000: 228.702: [GC 2721931K->2066780K(6209536K), 0.0292470 secs]
    2015-11-10T18:41:49.290+0000: 229.202: [GC 2722140K->2066843K(6209536K), 0.0288740 secs]
    2015-11-10T18:41:49.756+0000: 229.668: [GC 2722203K->2066818K(6209536K), 0.0283380 secs]
    2015-11-10T18:41:50.249+0000: 230.161: [GC 2722178K->2067158K(6209536K), 0.0218690 secs]
    2015-11-10T18:41:50.713+0000: 230.625: [GC 2722518K->2067236K(6209536K), 0.0278810 secs]

This is a VM with 12GB of RAM. I raised HEAP_SIZE to 6GB and HEAP_NEWSIZE to 800MB.

Still the same result.

This looks very similar to the following issue:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAJ=3xgRLsvpnZe0uXEYjG94rKhfXeU+jBR=q3a-_c3rsdd5...@mail.gmail.com%3E

Is the only possibility to upgrade memory? I mean, I can't believe it's just loading all its data into memory. That would mean I'd have to keep scaling up the node to keep it working?
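As an aside, each GC line above takes about 0.02-0.03 s and the heap drops back to roughly 2 GB afterwards, so these particular lines look like ordinary young-generation activity rather than obvious GC pressure. For readers wondering where those heap values live: they are normally set in conf/cassandra-env.sh (the stock variable is MAX_HEAP_SIZE rather than HEAP_SIZE), and a gc-<unix timestamp>.log file name like the one being tailed matches the GC-logging lines that ship commented out in the same file. A sketch, assuming the stock 2.0 cassandra-env.sh layout:

    # conf/cassandra-env.sh -- override the automatic heap sizing
    MAX_HEAP_SIZE="6G"
    HEAP_NEWSIZE="800M"

    # GC logging flags (commented out by default in the stock file); these
    # produce the gc-<unix timestamp>.log file shown above
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"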
2015-11-10 9:36 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:

Correction... I was grepping for Segmentation in the strace output and it happens a lot.

Do I need to run a scrub?

2015-11-10 9:30 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:

Hi Rob,

Thanks for your reply.

2015-11-09 23:17 GMT+01:00 Robert Coli <rc...@eventbrite.com>:

> On Mon, Nov 9, 2015 at 1:29 PM, PenguinWhispererThe . <th3penguinwhispe...@gmail.com> wrote:
>
>> In Opscenter I see one of the nodes is orange. It seems like it's working on compaction. I used nodetool compactionstats, and whenever I did this the Completed and percentage values stayed the same (even with hours in between).
>
> Are you the same person from IRC, or a second report today of compaction hanging in this way?

Same person ;) I just didn't have much to work with from the chat there. I want to understand the issue more and see what I can tune or fix. I want to do a nodetool repair before upgrading to 2.1.11, but the compaction is blocking it.

> What version of Cassandra?

2.0.9

>> I currently don't see cpu load from cassandra on that node. So it seems stuck (somewhere mid 60%). Also some other nodes have compaction on the same columnfamily. I don't see any progress.
>>
>> WARN [RMI TCP Connection(554)-192.168.0.68] 2015-11-09 17:18:13,677 ColumnFamilyStore.java (line 2101) Unable to cancel in-progress compactions for usage_record_ptd. Probably there is an unusually large row in progress somewhere. It is also possible that buggy code left some sstables compacting after it was done with them
>>
>> - How can I assure that nothing is happening?
>
> Find the thread that is doing compaction and strace it. Generally it is one of the threads with a lower thread priority.

I have 141 threads. Not sure if that's normal.
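A sketch of one way to do what Rob suggests: list per-thread CPU usage for the Cassandra JVM, then attach strace to the busy thread id. The pgrep pattern assumes the usual CassandraDaemon main class appears on the command line; the thread id 61404 is the one identified just below.

    # per-thread view of the Cassandra process; compaction threads typically
    # show a non-zero nice (NI) value, i.e. a lowered priority
    top -H -p "$(pgrep -f CassandraDaemon)"

    # attach strace (with timestamps) to that native thread id
    strace -tt -p 61404

    # the same id can be matched to a Java thread name in a jstack dump,
    # where thread ids appear in hex as nid=0x...; 61404 decimal is 0xefdc
    printf '%x\n' 61404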
This seems to be the one:

    61404 cassandr  24   4 8948m 4.3g 820m R 90.2 36.8 292:54.47 java

In the strace I basically see this part repeating (with, once in a while, the "resource temporarily unavailable"):

    futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
    getpriority(PRIO_PROCESS, 61404) = 16
    futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 0
    futex(0x1233854, FUTEX_WAIT_PRIVATE, 494045, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
    futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x1233854, FUTEX_WAIT_PRIVATE, 494047, NULL) = 0
    futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
    futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
    getpriority(PRIO_PROCESS, 61404) = 16
    futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x1233854, FUTEX_WAIT_PRIVATE, 494049, NULL) = 0
    futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
    getpriority(PRIO_PROCESS, 61404) = 16

But wait! I also see this:

    futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x1233854, FUTEX_WAIT_PRIVATE, 494055, NULL) = 0
    futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
    --- SIGSEGV (Segmentation fault) @ 0 (0) ---

This doesn't seem to happen that often though.

> Compaction often appears hung when decompressing a very large row, but usually not for "hours".
>
>> - Is it recommended to disable compaction from a certain data size? (I believe 25GB on each node.)
>
> It is almost never recommended to disable compaction.
>
>> - Can I stop this compaction? nodetool stop compaction doesn't seem to work.
>
> Killing the JVM ("the dungeon collapses!") would certainly stop it, but it'd likely just start again when you restart the node.
>
>> - Is stopping the compaction dangerous?
>
> Not if you're in a version that properly cleans up partial compactions, which is most of them.
>
>> - Is killing the cassandra process dangerous while compacting (I did nodetool drain on one node)?
>
> No. But probably nodetool drain couldn't actually stop the in-progress compaction either, FWIW.
>
>> This is output of nodetool compactionstats grepped for the keyspace that seems stuck.
>
> Do you have gigantic rows in that keyspace?
> What does cfstats say about the largest row compaction has seen / do you have log messages about compacting large rows?

I don't know about the gigantic rows. How can I check?

I've checked the logs and found this:

    INFO [CompactionExecutor:67] 2015-11-10 02:34:19,077 CompactionController.java (line 192) Compacting large row billing/usage_record_ptd:177727:2015-10-14 00\:00Z (243992466 bytes) incrementally

So this is from 6 hours ago.

I also see a lot of messages like this:

    INFO [OptionalTasks:1] 2015-11-10 06:36:06,395 MeteredFlusher.java (line 58) flushing high-traffic column family CFS(Keyspace='mykeyspace', ColumnFamily='mycolumnfamily') (estimated 100317609 bytes)

And (although it's unrelated, might this impact compaction performance?):

    WARN [Native-Transport-Requests:10514] 2015-11-10 06:33:34,172 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_ptd] is of size 13834, exceeding specified threshold of 5120 by 8714.

It's like the compaction is only doing one sstable at a time and is doing nothing for a long time in between.

cfstats for this keyspace and columnfamily gives the following:

    Table: mycolumnfamily
    SSTable count: 26
    Space used (live), bytes: 319858991
    Space used (total), bytes: 319860267
    SSTable Compression Ratio: 0.24265700071674673
    Number of keys (estimate): 6656
    Memtable cell count: 22710
    Memtable data size, bytes: 3310654
    Memtable switch count: 31
    Local read count: 0
    Local read latency: 0.000 ms
    Local write count: 997667
    Local write latency: 0.000 ms
    Pending tasks: 0
    Bloom filter false positives: 0
    Bloom filter false ratio: 0.00000
    Bloom filter space used, bytes: 12760
    Compacted partition minimum bytes: 1332
    Compacted partition maximum bytes: 43388628
    Compacted partition mean bytes: 234682
    Average live cells per slice (last five minutes): 0.0
    Average tombstones per slice (last five minutes): 0.0

>> I also see frequently lines like this in system.log:
>>
>> WARN [Native-Transport-Requests:11935] 2015-11-09 20:10:41,886 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_by_billing_period, billing.metric] is of size 53086, exceeding specified threshold of 5120 by 47966.
>
> Unrelated.
>
> =Rob

Can I upgrade to 2.1.11 without doing a nodetool repair, with this compaction being stuck?

Another thing to mention is that nodetool repair hasn't been run yet. It got installed but nobody bothered to schedule the repair.

Thanks for looking into this!
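On the "how can I check for gigantic rows" question above: the cfstats output already gives part of the answer (Compacted partition maximum bytes: 43388628, roughly 41 MB), and the "Compacting large row ... (243992466 bytes) incrementally" line means compaction hit a roughly 233 MB partition in billing/usage_record_ptd. A sketch of the usual checks on a 2.0 node; the table names are the ones from this thread, and if the keyspace.table filter is not accepted on a given 2.0 build, run cfstats without arguments and search the output:

    # per-table stats; look at "Compacted partition maximum bytes"
    nodetool cfstats billing.usage_record_ptd

    # distribution of partition sizes and cell counts per partition
    nodetool cfhistograms billing usage_record_ptd

    # rows larger than in_memory_compaction_limit_in_mb (cassandra.yaml) are
    # compacted incrementally and logged, which is where the line above comes from
    grep "Compacting large row" /var/log/cassandra/system.log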