So it seems I've found the problem: the node opening a stream waits for the other node to respond, but that node never responds because of a broken pipe, which makes Cassandra wait forever.
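If you want to check whether you're hitting the same hang, this is roughly what to look at (a sketch; it assumes the default log location, and the exact output format differs between versions):

nodetool netstats           # a streaming session whose byte counts never advance points to the hang
nodetool compactionstats    # a compaction whose Completed value stays the same for hours is the other symptom
grep -i "broken pipe" /var/log/cassandra/system.log    # may well be empty; it never showed up in my logs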
It's basically this issue: https://issues.apache.org/jira/browse/CASSANDRA-8472
And this is the workaround/fix: https://issues.apache.org/jira/browse/CASSANDRA-8611

So:
- update Cassandra to >= 2.0.11
- add the option streaming_socket_timeout_in_ms = 10000
- do a rolling restart of Cassandra

What's weird is that the IOException: Broken pipe is never shown in my logs (not on any node), even though logging is set to INFO in the log4j config. I have this config in log4j-server.properties:

# output messages into a rolling log file as well as stdout
log4j.rootLogger=INFO,stdout,R

# stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%5p %d{HH:mm:ss,SSS} %m%n

# rolling log file
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.maxFileSize=20MB
log4j.appender.R.maxBackupIndex=50
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%5p [%t] %d{ISO8601} %F (line %L) %m%n
# Edit the next line to point to your logs directory
log4j.appender.R.File=/var/log/cassandra/system.log

# Application logging options
#log4j.logger.org.apache.cassandra=DEBUG
#log4j.logger.org.apache.cassandra.db=DEBUG
#log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG

# Adding this to avoid thrift logging disconnect errors.
log4j.logger.org.apache.thrift.server.TNonblockingServer=ERROR

Too bad nobody could point me to those earlier. Hope this saves someone else a lot of time.

2015-11-11 15:42 GMT+01:00 Sebastian Estevez <sebastian.este...@datastax.com>:

> Use 'nodetool compactionhistory'
>
> all the best,
>
> Sebastián
>
> On Nov 11, 2015 3:23 AM, "PenguinWhispererThe ." <th3penguinwhispe...@gmail.com> wrote:
>
>> Does compactionstats show only stats for completed compactions (100%)? It might be that the compaction is running constantly, over and over again. In that case I need to know what I can do to stop this constant compaction so I can start a nodetool repair.
>>
>> Note that there is a lot of traffic on this columnfamily, so I'm not sure if temporarily disabling compaction is an option. The repair will probably take long as well.
>>
>> Sebastian and Rob: do you have any more ideas about the things I put in this thread? Any help is appreciated!
>>
>> 2015-11-10 20:03 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:
>>
>>> Hi Sebastian,
>>>
>>> Thanks for your response.
>>>
>>> No swap is used. No offense, but I just don't see why swap would be the issue here. I set swappiness to 1, and I also have JNA installed; that should keep Java from being swapped out as well, AFAIK.
>>>
>>> 2015-11-10 19:50 GMT+01:00 Sebastian Estevez <sebastian.este...@datastax.com>:
>>>
>>>> Turn off Swap.
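>>>> For example (roughly what the recommended-settings page below describes; a sketch, and distro specifics vary):
>>>>
>>>> sudo swapoff --all
>>>> # also remove or comment out the swap entries in /etc/fstab so swap stays off after a reboot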
>>>>
>>>> http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installRecommendSettings.html?scroll=reference_ds_sxl_gf3_2k__disable-swap
>>>>
>>>> All the best,
>>>>
>>>> Sebastián Estévez
>>>> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
>>>>
>>>> On Tue, Nov 10, 2015 at 1:48 PM, PenguinWhispererThe . <th3penguinwhispe...@gmail.com> wrote:
>>>>
>>>>> I also have the following memory usage:
>>>>> [root@US-BILLINGDSX4 cassandra]# free -m
>>>>>              total       used       free     shared    buffers     cached
>>>>> Mem:         12024       9455       2569          0        110       2163
>>>>> -/+ buffers/cache:        7180       4844
>>>>> Swap:         2047          0       2047
>>>>>
>>>>> Still a lot free and a lot of free buffers/cache.
>>>>>
>>>>> 2015-11-10 19:45 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:
>>>>>
>>>>>> Still stuck with this. However, I enabled GC logging, which shows the following:
>>>>>>
>>>>>> [root@myhost cassandra]# tail -f gc-1447180680.log
>>>>>> 2015-11-10T18:41:45.516+0000: 225.428: [GC 2721842K->2066508K(6209536K), 0.0199040 secs]
>>>>>> 2015-11-10T18:41:45.977+0000: 225.889: [GC 2721868K->2066511K(6209536K), 0.0221910 secs]
>>>>>> 2015-11-10T18:41:46.437+0000: 226.349: [GC 2721871K->2066524K(6209536K), 0.0222140 secs]
>>>>>> 2015-11-10T18:41:46.897+0000: 226.809: [GC 2721884K->2066539K(6209536K), 0.0224140 secs]
>>>>>> 2015-11-10T18:41:47.359+0000: 227.271: [GC 2721899K->2066538K(6209536K), 0.0302520 secs]
>>>>>> 2015-11-10T18:41:47.821+0000: 227.733: [GC 2721898K->2066557K(6209536K), 0.0280530 secs]
>>>>>> 2015-11-10T18:41:48.293+0000: 228.205: [GC 2721917K->2066571K(6209536K), 0.0218000 secs]
>>>>>> 2015-11-10T18:41:48.790+0000: 228.702: [GC 2721931K->2066780K(6209536K), 0.0292470 secs]
>>>>>> 2015-11-10T18:41:49.290+0000: 229.202: [GC 2722140K->2066843K(6209536K), 0.0288740 secs]
>>>>>> 2015-11-10T18:41:49.756+0000: 229.668: [GC 2722203K->2066818K(6209536K), 0.0283380 secs]
>>>>>> 2015-11-10T18:41:50.249+0000: 230.161: [GC 2722178K->2067158K(6209536K), 0.0218690 secs]
>>>>>> 2015-11-10T18:41:50.713+0000: 230.625: [GC 2722518K->2067236K(6209536K), 0.0278810 secs]
>>>>>>
>>>>>> This is a VM with 12GB of RAM. I raised HEAP_SIZE to 6GB and HEAP_NEWSIZE to 800MB.
>>>>>>
>>>>>> Still the same result.
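>>>>>> (For reference, the heap is configured in conf/cassandra-env.sh; a minimal sketch of the lines involved, assuming the stock 2.0 package, where the max-heap variable is called MAX_HEAP_SIZE:)
>>>>>>
>>>>>> MAX_HEAP_SIZE="6G"
>>>>>> HEAP_NEWSIZE="800M"
>>>>>> # cassandra-env.sh expects these to be set (or left unset) as a pair; restart the node for them to take effect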
>>>>>>
>>>>>> This looks very similar to the following issue:
>>>>>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAJ=3xgRLsvpnZe0uXEYjG94rKhfXeU+jBR=q3a-_c3rsdd5...@mail.gmail.com%3E
>>>>>>
>>>>>> Is upgrading memory really the only option? I can't believe it's simply loading all its data into memory; that would mean having to keep scaling up the node just to keep it working.
>>>>>>
>>>>>> 2015-11-10 9:36 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:
>>>>>>
>>>>>>> Correction...
>>>>>>> I was grepping for Segmentation in the strace output and it happens a lot.
>>>>>>>
>>>>>>> Do I need to run a scrub?
>>>>>>>
>>>>>>> 2015-11-10 9:30 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:
>>>>>>>
>>>>>>>> Hi Rob,
>>>>>>>>
>>>>>>>> Thanks for your reply.
>>>>>>>>
>>>>>>>> 2015-11-09 23:17 GMT+01:00 Robert Coli <rc...@eventbrite.com>:
>>>>>>>>
>>>>>>>>> On Mon, Nov 9, 2015 at 1:29 PM, PenguinWhispererThe . <th3penguinwhispe...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> In Opscenter I see one of the nodes is orange. It seems like it's working on compaction. I used nodetool compactionstats, and whenever I did this the Completed count and percentage stayed the same (even with hours in between).
>>>>>>>>>
>>>>>>>>> Are you the same person from IRC, or a second report today of compaction hanging in this way?
>>>>>>>>
>>>>>>>> Same person ;) I just didn't have much to work with from the chat there. I want to understand the issue better and see what I can tune or fix. I want to do a nodetool repair before upgrading to 2.1.11, but the compaction is blocking it.
>>>>>>>>
>>>>>>>>> What version of Cassandra?
>>>>>>>>
>>>>>>>> 2.0.9
>>>>>>>>
>>>>>>>>>> I currently don't see CPU load from Cassandra on that node, so it seems stuck (somewhere mid 60%). Some other nodes also have compaction running on the same columnfamily; I don't see any progress there either.
>>>>>>>>>>
>>>>>>>>>> WARN [RMI TCP Connection(554)-192.168.0.68] 2015-11-09 17:18:13,677 ColumnFamilyStore.java (line 2101) Unable to cancel in-progress compactions for usage_record_ptd. Probably there is an unusually large row in progress somewhere. It is also possible that buggy code left some sstables compacting after it was done with them
>>>>>>>>>>
>>>>>>>>>> - How can I confirm whether anything is actually happening?
>>>>>>>>>
>>>>>>>>> Find the thread that is doing compaction and strace it. Generally it is one of the threads with a lower thread priority.
>>>>>>>>
>>>>>>>> I have 141 threads. Not sure if that's normal.
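>>>>>>>> (Something like this finds the busiest Cassandra thread and traces it; a sketch for a Linux host, using the thread id 61404 from below as the example:)
>>>>>>>>
>>>>>>>> PID=$(pgrep -f CassandraDaemon)
>>>>>>>> top -H -p "$PID"     # per-thread view; note the PID of the thread burning CPU
>>>>>>>> strace -p 61404      # attach strace to just that thread
>>>>>>>> jstack "$PID" | grep "nid=$(printf '0x%x' 61404)"   # map the native thread id to its Java thread name (e.g. a CompactionExecutor)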
>>>>>>>>
>>>>>>>> This seems to be the one:
>>>>>>>> 61404 cassandr  24   4 8948m 4.3g 820m R 90.2 36.8 292:54.47 java
>>>>>>>>
>>>>>>>> In the strace I basically see this part repeating (with an occasional "Resource temporarily unavailable"):
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>>> getpriority(PRIO_PROCESS, 61404) = 16
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494045, NULL) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494047, NULL) = 0
>>>>>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>>> getpriority(PRIO_PROCESS, 61404) = 16
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494049, NULL) = 0
>>>>>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> getpriority(PRIO_PROCESS, 61404) = 16
>>>>>>>>
>>>>>>>> But wait! I also see this:
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494055, NULL) = 0
>>>>>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>>>>>>>
>>>>>>>> This doesn't seem to happen that often, though.
>>>>>>>>
>>>>>>>>> Compaction often appears hung when decompressing a very large row, but usually not for "hours".
>>>>>>>>>
>>>>>>>>>> - Is it recommended to disable compaction above a certain data size? (I believe we have about 25GB on each node.)
>>>>>>>>>
>>>>>>>>> It is almost never recommended to disable compaction.
>>>>>>>>>
>>>>>>>>>> - Can I stop this compaction? nodetool stop compaction doesn't seem to work.
>>>>>>>>>
>>>>>>>>> Killing the JVM ("the dungeon collapses!") would certainly stop it, but it'd likely just start again when you restart the node.
>>>>>>>>>
>>>>>>>>>> - Is stopping the compaction dangerous?
>>>>>>>>>
>>>>>>>>> Not if you're on a version that properly cleans up partial compactions, which is most of them.
>>>>>>>>>
>>>>>>>>>> - Is killing the Cassandra process dangerous while compacting (I did nodetool drain on one node)?
>>>>>>>>>
>>>>>>>>> No. But nodetool drain probably couldn't actually stop the in-progress compaction either, FWIW.
>>>>>>>>>
>>>>>>>>>> This is the output of nodetool compactionstats grepped for the keyspace that seems stuck.
>>>>>>>>>
>>>>>>>>> Do you have gigantic rows in that keyspace?
>>>>>>>>> What does cfstats say about the largest row compaction has seen? Do you have log messages about compacting large rows?
>>>>>>>>
>>>>>>>> I don't know about the gigantic rows. How can I check?
>>>>>>>>
>>>>>>>> I've checked the logs and found this:
>>>>>>>> INFO [CompactionExecutor:67] 2015-11-10 02:34:19,077 CompactionController.java (line 192) Compacting large row billing/usage_record_ptd:177727:2015-10-14 00\:00Z (243992466 bytes) incrementally
>>>>>>>> So this is from 6 hours ago.
>>>>>>>>
>>>>>>>> I also see a lot of messages like this:
>>>>>>>> INFO [OptionalTasks:1] 2015-11-10 06:36:06,395 MeteredFlusher.java (line 58) flushing high-traffic column family CFS(Keyspace='mykeyspace', ColumnFamily='mycolumnfamily') (estimated 100317609 bytes)
>>>>>>>>
>>>>>>>> And (although it's unrelated, it might impact compaction performance?):
>>>>>>>> WARN [Native-Transport-Requests:10514] 2015-11-10 06:33:34,172 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_ptd] is of size 13834, exceeding specified threshold of 5120 by 8714.
>>>>>>>>
>>>>>>>> It looks like the compaction handles only one sstable at a time and then does nothing for a long time in between.
>>>>>>>>
>>>>>>>> cfstats for this keyspace and columnfamily gives the following:
>>>>>>>> Table: mycolumnfamily
>>>>>>>> SSTable count: 26
>>>>>>>> Space used (live), bytes: 319858991
>>>>>>>> Space used (total), bytes: 319860267
>>>>>>>> SSTable Compression Ratio: 0.24265700071674673
>>>>>>>> Number of keys (estimate): 6656
>>>>>>>> Memtable cell count: 22710
>>>>>>>> Memtable data size, bytes: 3310654
>>>>>>>> Memtable switch count: 31
>>>>>>>> Local read count: 0
>>>>>>>> Local read latency: 0.000 ms
>>>>>>>> Local write count: 997667
>>>>>>>> Local write latency: 0.000 ms
>>>>>>>> Pending tasks: 0
>>>>>>>> Bloom filter false positives: 0
>>>>>>>> Bloom filter false ratio: 0.00000
>>>>>>>> Bloom filter space used, bytes: 12760
>>>>>>>> Compacted partition minimum bytes: 1332
>>>>>>>> Compacted partition maximum bytes: 43388628
>>>>>>>> Compacted partition mean bytes: 234682
>>>>>>>> Average live cells per slice (last five minutes): 0.0
>>>>>>>> Average tombstones per slice (last five minutes): 0.0
>>>>>>>>
>>>>>>>>>> I also frequently see lines like this in system.log:
>>>>>>>>>>
>>>>>>>>>> WARN [Native-Transport-Requests:11935] 2015-11-09 20:10:41,886 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_by_billing_period, billing.metric] is of size 53086, exceeding specified threshold of 5120 by 47966.
>>>>>>>>>
>>>>>>>>> Unrelated.
>>>>>>>>>
>>>>>>>>> =Rob
>>>>>>>>
>>>>>>>> Can I upgrade to 2.1.11 without first running a nodetool repair, given that the compaction is stuck?
>>>>>>>> Another thing to mention is that nodetool repair has never been run here; the cluster got installed, but nobody ever scheduled repairs.
>>>>>>>>
>>>>>>>> Thanks for looking into this!
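Regarding the "how can I check for gigantic rows" question above, a quick way to check without waiting for compaction warnings (a sketch, assuming Cassandra 2.0's nodetool and the table named in the logs above):

nodetool cfstats billing.usage_record_ptd | grep "Compacted partition"   # "maximum bytes" is the largest partition compaction has seen
nodetool cfhistograms billing usage_record_ptd                           # rough row/partition size distribution
# rows larger than in_memory_compaction_limit_in_mb (64MB by default in 2.0) also show up in system.log as "Compacting large row ... incrementally"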