Still stuck on this. However, I've now enabled GC logging.
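(For reference, I turned it on by uncommenting the GC logging options in conf/cassandra-env.sh and restarting the node. Roughly what I enabled, as a sketch; the log path is what my install uses:

    # GC logging options in conf/cassandra-env.sh (normally commented out)
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"

The `date +%s` in the filename is why the log below is called gc-1447180680.log.)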
It shows the following:

[root@myhost cassandra]# tail -f gc-1447180680.log
2015-11-10T18:41:45.516+0000: 225.428: [GC 2721842K->2066508K(6209536K), 0.0199040 secs]
2015-11-10T18:41:45.977+0000: 225.889: [GC 2721868K->2066511K(6209536K), 0.0221910 secs]
2015-11-10T18:41:46.437+0000: 226.349: [GC 2721871K->2066524K(6209536K), 0.0222140 secs]
2015-11-10T18:41:46.897+0000: 226.809: [GC 2721884K->2066539K(6209536K), 0.0224140 secs]
2015-11-10T18:41:47.359+0000: 227.271: [GC 2721899K->2066538K(6209536K), 0.0302520 secs]
2015-11-10T18:41:47.821+0000: 227.733: [GC 2721898K->2066557K(6209536K), 0.0280530 secs]
2015-11-10T18:41:48.293+0000: 228.205: [GC 2721917K->2066571K(6209536K), 0.0218000 secs]
2015-11-10T18:41:48.790+0000: 228.702: [GC 2721931K->2066780K(6209536K), 0.0292470 secs]
2015-11-10T18:41:49.290+0000: 229.202: [GC 2722140K->2066843K(6209536K), 0.0288740 secs]
2015-11-10T18:41:49.756+0000: 229.668: [GC 2722203K->2066818K(6209536K), 0.0283380 secs]
2015-11-10T18:41:50.249+0000: 230.161: [GC 2722178K->2067158K(6209536K), 0.0218690 secs]
2015-11-10T18:41:50.713+0000: 230.625: [GC 2722518K->2067236K(6209536K), 0.0278810 secs]

This is a VM with 12GB of RAM. I raised HEAP_SIZE to 6GB and HEAP_NEWSIZE to 800MB. Still the same result. This looks very similar to the following issue:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAJ=3xgRLsvpnZe0uXEYjG94rKhfXeU+jBR=q3a-_c3rsdd5...@mail.gmail.com%3E

Is the only option to add more memory? I mean, I can't believe it's simply loading all of its data into memory. That would mean having to keep scaling up the node just to keep it working?

2015-11-10 9:36 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:

> Correction... I was grepping for "Segmentation" in the strace output, and it happens a lot.
>
> Do I need to run a scrub?
>
> 2015-11-10 9:30 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:
>
>> Hi Rob,
>>
>> Thanks for your reply.
>>
>> 2015-11-09 23:17 GMT+01:00 Robert Coli <rc...@eventbrite.com>:
>>
>>> On Mon, Nov 9, 2015 at 1:29 PM, PenguinWhispererThe . <th3penguinwhispe...@gmail.com> wrote:
>>>
>>>> In OpsCenter I see that one of the nodes is orange. It seems to be working on a compaction. I used nodetool compactionstats, and whenever I ran it the Completed count and percentage stayed the same (even with hours in between).
>>>
>>> Are you the same person from IRC, or a second report today of compaction hanging in this way?
>>
>> Same person ;) I just didn't have much to work with from the chat there. I want to understand the issue better and see what I can tune or fix. I want to run nodetool repair before upgrading to 2.1.11, but the compaction is blocking it.
>>
>>> What version of Cassandra?
>>
>> 2.0.9
>>
>>>> I currently don't see CPU load from Cassandra on that node, so it seems stuck (somewhere around 60%). Some other nodes also have a compaction running on the same column family, with no visible progress either.
>>>>
>>>> WARN [RMI TCP Connection(554)-192.168.0.68] 2015-11-09 17:18:13,677 ColumnFamilyStore.java (line 2101) Unable to cancel in-progress compactions for usage_record_ptd. Probably there is an unusually large row in progress somewhere. It is also possible that buggy code left some sstables compacting after it was done with them
>>>>
>>>> - How can I verify that nothing is actually happening?
>>>
>>> Find the thread that is doing compaction and strace it. Generally it is one of the threads with a lower thread priority.
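>> That's what I did. For anyone following along, this is roughly how I located and traced the hot thread (a sketch; the pgrep pattern and the example TID are assumptions from my setup):
>>
>> # Locate the Cassandra JVM and its busiest thread, then attach strace.
>> pid=$(pgrep -f CassandraDaemon)         # assumes one Cassandra process
>> top -b -H -n 1 -p "$pid" | head -n 20   # per-thread view; note the hot TID
>> tid=61404                               # example: the hot thread from top
>> # jstack reports native thread ids in hex, so convert the TID to match:
>> jstack "$pid" | grep -A 3 "$(printf 'nid=0x%x' "$tid")"
>> strace -f -tt -p "$tid"                 # syscall trace of that one thread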
>> I have 141 threads. Not sure if that's normal.
>>
>> This seems to be the one:
>> 61404 cassandr 24 4 8948m 4.3g 820m R 90.2 36.8 292:54.47 java
>>
>> In the strace I basically see this part repeating (with the occasional "resource temporarily unavailable"):
>>
>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>> getpriority(PRIO_PROCESS, 61404) = 16
>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 0
>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494045, NULL) = -1 EAGAIN (Resource temporarily unavailable)
>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494047, NULL) = 0
>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>> getpriority(PRIO_PROCESS, 61404) = 16
>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494049, NULL) = 0
>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>> getpriority(PRIO_PROCESS, 61404) = 16
>>
>> But wait! I also see this:
>>
>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494055, NULL) = 0
>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>
>> It doesn't seem to happen that often, though.
>>
>>> Compaction often appears hung when decompressing a very large row, but usually not for "hours".
>>>
>>>> - Is it recommended to disable compaction beyond a certain data size? (I believe we have 25GB on each node.)
>>>
>>> It is almost never recommended to disable compaction.
>>>
>>>> - Can I stop this compaction? nodetool stop compaction doesn't seem to work.
>>>
>>> Killing the JVM ("the dungeon collapses!") would certainly stop it, but it'd likely just start again when you restart the node.
>>>
>>>> - Is stopping the compaction dangerous?
>>>
>>> Not if you're on a version that properly cleans up partial compactions, which is most of them.
>>>
>>>> - Is killing the Cassandra process dangerous while compacting (I did nodetool drain on one node)?
>>>
>>> No. But nodetool drain probably couldn't actually stop the in-progress compaction either, FWIW.
>>>
>>>> This is the output of nodetool compactionstats grepped for the keyspace that seems stuck.
>>>
>>> Do you have gigantic rows in that keyspace? What does cfstats say about the largest row compaction has seen? Do you have log messages about compacting large rows?
>>
>> I don't know about the gigantic rows. How can I check?
>>
>> I've checked the logs and found this:
>>
>> INFO [CompactionExecutor:67] 2015-11-10 02:34:19,077 CompactionController.java (line 192) Compacting large row billing/usage_record_ptd:177727:2015-10-14 00\:00Z (243992466 bytes) incrementally
>>
>> So this is from 6 hours ago.
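>> To answer my own question about checking for big rows, this is roughly what I ran (a sketch; the log path is an assumption for my install, and older nodetool versions may only accept a bare keyspace name for cfstats):
>>
>> # Largest partition the table has seen on this node:
>> nodetool cfstats billing.usage_record_ptd | grep 'Compacted partition'
>> # Rows big enough to be compacted incrementally get logged:
>> grep 'Compacting large row' /var/log/cassandra/system.log | tail -n 5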
>> I also see a lot of messages like this:
>>
>> INFO [OptionalTasks:1] 2015-11-10 06:36:06,395 MeteredFlusher.java (line 58) flushing high-traffic column family CFS(Keyspace='mykeyspace', ColumnFamily='mycolumnfamily') (estimated 100317609 bytes)
>>
>> And (although it's unrelated, might this impact compaction performance?):
>>
>> WARN [Native-Transport-Requests:10514] 2015-11-10 06:33:34,172 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_ptd] is of size 13834, exceeding specified threshold of 5120 by 8714.
>>
>> It's as if the compaction only processes one sstable at a time and then does nothing for a long time in between.
>>
>> cfstats for this keyspace and column family gives the following:
>>
>> Table: mycolumnfamily
>> SSTable count: 26
>> Space used (live), bytes: 319858991
>> Space used (total), bytes: 319860267
>> SSTable Compression Ratio: 0.24265700071674673
>> Number of keys (estimate): 6656
>> Memtable cell count: 22710
>> Memtable data size, bytes: 3310654
>> Memtable switch count: 31
>> Local read count: 0
>> Local read latency: 0.000 ms
>> Local write count: 997667
>> Local write latency: 0.000 ms
>> Pending tasks: 0
>> Bloom filter false positives: 0
>> Bloom filter false ratio: 0.00000
>> Bloom filter space used, bytes: 12760
>> Compacted partition minimum bytes: 1332
>> Compacted partition maximum bytes: 43388628
>> Compacted partition mean bytes: 234682
>> Average live cells per slice (last five minutes): 0.0
>> Average tombstones per slice (last five minutes): 0.0
>>
>>>> I also frequently see lines like this in system.log:
>>>>
>>>> WARN [Native-Transport-Requests:11935] 2015-11-09 20:10:41,886 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_by_billing_period, billing.metric] is of size 53086, exceeding specified threshold of 5120 by 47966.
>>>
>>> Unrelated.
>>>
>>> =Rob
>>
>> Can I upgrade to 2.1.11 without running nodetool repair first, given that compaction is stuck?
>>
>> Another thing worth mentioning: nodetool repair has never been run here. The cluster got installed, but nobody bothered to schedule repairs.
>>
>> Thanks for looking into this!
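P.S. On the repair point above: once this compaction is sorted out, I plan to schedule repairs instead of running them by hand. A minimal sketch of what I have in mind (the cron file path and schedule are assumptions; -pr limits each node to its primary token ranges):

    # /etc/cron.d/cassandra-repair (assumed path)
    # Weekly repair at 02:00 on Sunday; stagger the weekday per node so
    # at most one node is repairing at a time.
    0 2 * * 0  root  /usr/bin/nodetool repair -pr >> /var/log/cassandra/repair.log 2>&1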