Turn off Swap. http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installRecommendSettings.html?scroll=reference_ds_sxl_gf3_2k__disable-swap
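On a typical Linux install that boils down to something like the following
(a sketch; the sysctl file name is arbitrary and paths may differ on your
distribution):

    # Turn swap off immediately, no reboot needed
    sudo swapoff --all

    # Keep it off across reboots by commenting out swap entries in fstab
    sudo sed -i.bak '/ swap / s/^/#/' /etc/fstab

    # If you can't remove swap entirely, at least tell the kernel to
    # avoid it as much as possible
    echo 'vm.swappiness = 1' | sudo tee /etc/sysctl.d/99-cassandra.conf
    sudo sysctl -p /etc/sysctl.d/99-cassandra.conf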
All the best,

Sebastián Estévez
Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
DataStax <http://www.datastax.com/>

On Tue, Nov 10, 2015 at 1:48 PM, PenguinWhispererThe . <
th3penguinwhispe...@gmail.com> wrote:

> I also have the following memory usage:
>
> [root@US-BILLINGDSX4 cassandra]# free -m
>              total       used       free     shared    buffers     cached
> Mem:         12024       9455       2569          0        110       2163
> -/+ buffers/cache:       7180       4844
> Swap:         2047          0       2047
>
> Still a lot free, and a lot in buffers/cache.
>
> 2015-11-10 19:45 GMT+01:00 PenguinWhispererThe . <
> th3penguinwhispe...@gmail.com>:
>
>> Still stuck with this. However, I enabled GC logging, which shows the
>> following:
>>
>> [root@myhost cassandra]# tail -f gc-1447180680.log
>> 2015-11-10T18:41:45.516+0000: 225.428: [GC 2721842K->2066508K(6209536K), 0.0199040 secs]
>> 2015-11-10T18:41:45.977+0000: 225.889: [GC 2721868K->2066511K(6209536K), 0.0221910 secs]
>> 2015-11-10T18:41:46.437+0000: 226.349: [GC 2721871K->2066524K(6209536K), 0.0222140 secs]
>> 2015-11-10T18:41:46.897+0000: 226.809: [GC 2721884K->2066539K(6209536K), 0.0224140 secs]
>> 2015-11-10T18:41:47.359+0000: 227.271: [GC 2721899K->2066538K(6209536K), 0.0302520 secs]
>> 2015-11-10T18:41:47.821+0000: 227.733: [GC 2721898K->2066557K(6209536K), 0.0280530 secs]
>> 2015-11-10T18:41:48.293+0000: 228.205: [GC 2721917K->2066571K(6209536K), 0.0218000 secs]
>> 2015-11-10T18:41:48.790+0000: 228.702: [GC 2721931K->2066780K(6209536K), 0.0292470 secs]
>> 2015-11-10T18:41:49.290+0000: 229.202: [GC 2722140K->2066843K(6209536K), 0.0288740 secs]
>> 2015-11-10T18:41:49.756+0000: 229.668: [GC 2722203K->2066818K(6209536K), 0.0283380 secs]
>> 2015-11-10T18:41:50.249+0000: 230.161: [GC 2722178K->2067158K(6209536K), 0.0218690 secs]
>> 2015-11-10T18:41:50.713+0000: 230.625: [GC 2722518K->2067236K(6209536K), 0.0278810 secs]
>>
>> This is a VM with 12GB of RAM. I raised HEAP_SIZE to 6GB and
>> HEAP_NEWSIZE to 800MB.
>>
>> Still the same result.
>>
>> This looks very similar to the following issue:
>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAJ=3xgRLsvpnZe0uXEYjG94rKhfXeU+jBR=q3a-_c3rsdd5...@mail.gmail.com%3E
>>
>> Is the only option to add more memory? I can't believe it simply loads
>> all of its data into memory; that would mean having to keep scaling the
>> node up just to keep it working.
>>
>> 2015-11-10 9:36 GMT+01:00 PenguinWhispererThe . <
>> th3penguinwhispe...@gmail.com>:
>>
>>> Correction...
>>> I was grepping for "Segmentation" in the strace output, and it happens a lot.
>>>
>>> Do I need to run a scrub?
>>>
>>> 2015-11-10 9:30 GMT+01:00 PenguinWhispererThe . <
>>> th3penguinwhispe...@gmail.com>:
>>>
>>>> Hi Rob,
>>>>
>>>> Thanks for your reply.
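>>>>
>>>> (For context, this is how I've been watching for progress -- nothing
>>>> exotic, just re-running the stats; commands illustrative:)
>>>>
>>>> watch -n 30 'nodetool compactionstats'   # totals never move
>>>> nodetool tpstats | grep -i compaction    # check active/pending in CompactionExecutor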
>>>>
>>>> 2015-11-09 23:17 GMT+01:00 Robert Coli <rc...@eventbrite.com>:
>>>>
>>>>> On Mon, Nov 9, 2015 at 1:29 PM, PenguinWhispererThe . <
>>>>> th3penguinwhispe...@gmail.com> wrote:
>>>>>>
>>>>>> In Opscenter I see that one of the nodes is orange. It seems to be
>>>>>> working on a compaction. I used nodetool compactionstats, and the
>>>>>> Completed count and percentage stay the same whenever I run it (even
>>>>>> with hours in between).
>>>>>>
>>>>> Are you the same person from IRC, or a second report today of
>>>>> compaction hanging in this way?
>>>>>
>>>> Same person ;) I just didn't have much to work with from the chat
>>>> there. I want to understand the issue better and see what I can tune or
>>>> fix. I want to run nodetool repair before upgrading to 2.1.11, but the
>>>> compaction is blocking it.
>>>>
>>>>> What version of Cassandra?
>>>>>
>>>> 2.0.9
>>>>
>>>>>> I currently don't see CPU load from Cassandra on that node, so it
>>>>>> seems stuck (somewhere around 60%). Some other nodes are also
>>>>>> compacting the same columnfamily, and I don't see any progress there
>>>>>> either.
>>>>>>
>>>>>> WARN [RMI TCP Connection(554)-192.168.0.68] 2015-11-09 17:18:13,677
>>>>>> ColumnFamilyStore.java (line 2101) Unable to cancel in-progress
>>>>>> compactions for usage_record_ptd. Probably there is an unusually large
>>>>>> row in progress somewhere. It is also possible that buggy code left
>>>>>> some sstables compacting after it was done with them
>>>>>>
>>>>>> - How can I verify that nothing is happening?
>>>>>>
>>>>> Find the thread that is doing compaction and strace it. Generally it
>>>>> is one of the threads with a lower thread priority.
>>>>>
>>>> I have 141 threads. Not sure if that's normal.
>>>>
>>>> This seems to be the one:
>>>> 61404 cassandr  24   4 8948m 4.3g 820m R 90.2 36.8 292:54.47 java
>>>>
>>>> In the strace I basically see this part repeating (with the occasional
>>>> "resource temporarily unavailable"):
>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>> getpriority(PRIO_PROCESS, 61404) = 16
>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 0
>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494045, NULL) = -1 EAGAIN (Resource temporarily unavailable)
>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494047, NULL) = 0
>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>> getpriority(PRIO_PROCESS, 61404) = 16
>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494049, NULL) = 0
>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>> getpriority(PRIO_PROCESS, 61404) = 16
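>>>>
>>>> (In case anyone wants to retrace this: I found and attached to that
>>>> thread roughly like so; the jstack step is optional, and the hex nid is
>>>> just 61404 in base 16:)
>>>>
>>>> top -H -p "$(pgrep -f CassandraDaemon)"   # per-thread view; 61404 is the busy one
>>>> strace -p 61404                           # attach to just that thread
>>>> printf '%x\n' 61404                       # -> efdc
>>>> jstack "$(pgrep -f CassandraDaemon)" | grep 'nid=0xefdc'   # name the Java thread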
>>>>
>>>> But wait! I also see this:
>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494055, NULL) = 0
>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>>>
>>>> This doesn't seem to happen that often, though.
>>>>
>>>>> Compaction often appears hung when decompressing a very large row,
>>>>> but usually not for "hours".
>>>>>
>>>>>> - Is it recommended to disable compaction beyond a certain data
>>>>>> size? (I believe we have 25GB on each node.)
>>>>>>
>>>>> It is almost never recommended to disable compaction.
>>>>>
>>>>>> - Can I stop this compaction? nodetool stop compaction doesn't seem
>>>>>> to work.
>>>>>>
>>>>> Killing the JVM ("the dungeon collapses!") would certainly stop it,
>>>>> but it'd likely just start again when you restart the node.
>>>>>
>>>>>> - Is stopping the compaction dangerous?
>>>>>>
>>>>> Not if you're on a version that properly cleans up partial
>>>>> compactions, which is most of them.
>>>>>
>>>>>> - Is killing the Cassandra process dangerous while compacting? (I
>>>>>> did nodetool drain on one node.)
>>>>>>
>>>>> No. But nodetool drain probably couldn't actually stop the
>>>>> in-progress compaction either, FWIW.
>>>>>
>>>>>> This is the output of nodetool compactionstats grepped for the
>>>>>> keyspace that seems stuck.
>>>>>>
>>>>> Do you have gigantic rows in that keyspace? What does cfstats say
>>>>> about the largest row compaction has seen? Do you have log messages
>>>>> about compacting large rows?
>>>>>
>>>> I don't know about the gigantic rows. How can I check?
>>>>
>>>> I've checked the logs and found this:
>>>> INFO [CompactionExecutor:67] 2015-11-10 02:34:19,077 CompactionController.java (line 192) Compacting large row billing/usage_record_ptd:177727:2015-10-14 00\:00Z (243992466 bytes) incrementally
>>>> So this is from 6 hours ago.
>>>>
>>>> I also see a lot of messages like this:
>>>> INFO [OptionalTasks:1] 2015-11-10 06:36:06,395 MeteredFlusher.java (line 58) flushing high-traffic column family CFS(Keyspace='mykeyspace', ColumnFamily='mycolumnfamily') (estimated 100317609 bytes)
>>>>
>>>> And (although it's unrelated, might this impact compaction performance?):
>>>> WARN [Native-Transport-Requests:10514] 2015-11-10 06:33:34,172 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_ptd] is of size 13834, exceeding specified threshold of 5120 by 8714.
>>>>
>>>> It's like the compaction only does one sstable at a time and does
>>>> nothing for long stretches in between.
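>>>>
>>>> (That "Compacting large row" line is how I ended up checking; roughly
>>>> this -- commands illustrative, log path per our install:)
>>>>
>>>> grep 'Compacting large row' /var/log/cassandra/system.log
>>>> nodetool cfstats billing.usage_record_ptd | grep 'Compacted partition'
>>>>
>>>> (cfstats takes a keyspace.columnfamily argument to limit the output;
>>>> "Compacted partition maximum bytes" is the largest row it has seen.)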
>>>>
>>>> cfstats for this keyspace and columnfamily gives the following:
>>>> Table: mycolumnfamily
>>>> SSTable count: 26
>>>> Space used (live), bytes: 319858991
>>>> Space used (total), bytes: 319860267
>>>> SSTable Compression Ratio: 0.24265700071674673
>>>> Number of keys (estimate): 6656
>>>> Memtable cell count: 22710
>>>> Memtable data size, bytes: 3310654
>>>> Memtable switch count: 31
>>>> Local read count: 0
>>>> Local read latency: 0.000 ms
>>>> Local write count: 997667
>>>> Local write latency: 0.000 ms
>>>> Pending tasks: 0
>>>> Bloom filter false positives: 0
>>>> Bloom filter false ratio: 0.00000
>>>> Bloom filter space used, bytes: 12760
>>>> Compacted partition minimum bytes: 1332
>>>> Compacted partition maximum bytes: 43388628
>>>> Compacted partition mean bytes: 234682
>>>> Average live cells per slice (last five minutes): 0.0
>>>> Average tombstones per slice (last five minutes): 0.0
>>>>
>>>>>> I also frequently see lines like this in system.log:
>>>>>>
>>>>>> WARN [Native-Transport-Requests:11935] 2015-11-09 20:10:41,886
>>>>>> BatchStatement.java (line 223) Batch of prepared statements for
>>>>>> [billing.usage_record_by_billing_period, billing.metric] is of size
>>>>>> 53086, exceeding specified threshold of 5120 by 47966.
>>>>>>
>>>>> Unrelated.
>>>>>
>>>>> =Rob
>>>>>
>>>> Can I upgrade to 2.1.11 without running nodetool repair first, given
>>>> that compaction is stuck?
>>>> Another thing worth mentioning: nodetool repair has never run on this
>>>> cluster. It got installed, but nobody ever scheduled repairs.
>>>>
>>>> Thanks for looking into this!
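>>>>
>>>> P.S. In case it matters, the per-node upgrade path I had in mind (a
>>>> sketch assuming a package install; cassandra.yaml merging omitted):
>>>>
>>>> nodetool drain              # flush memtables, stop accepting writes
>>>> sudo service cassandra stop
>>>> # ...upgrade the package to 2.1.11, merge config changes...
>>>> sudo service cassandra start
>>>> nodetool upgradesstables    # rewrite sstables in the new on-disk format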