Correction... I was grepping for "Segmentation" in the strace output and it actually happens a lot.
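Something like this is what I mean (a rough sketch; 61404 is the busy thread from the top output quoted further down, and the output file name is arbitrary):

    # attach strace to the suspect thread and write the trace to a file
    strace -tt -p 61404 -o /tmp/compaction-thread.strace
    # let it run for a minute or two, stop it with Ctrl-C, then count the faults
    grep -c 'SIGSEGV' /tmp/compaction-thread.strace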
Do I need to run a scrub?

2015-11-10 9:30 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:

> Hi Rob,
>
> Thanks for your reply.
>
> 2015-11-09 23:17 GMT+01:00 Robert Coli <rc...@eventbrite.com>:
>
>> On Mon, Nov 9, 2015 at 1:29 PM, PenguinWhispererThe . <th3penguinwhispe...@gmail.com> wrote:
>>>
>>> In OpsCenter I see that one of the nodes is orange. It seems to be working on a compaction. I used nodetool compactionstats, and whenever I did, the Completed count and percentage stay the same (even with hours in between).
>>>
>> Are you the same person from IRC, or a second report today of compaction hanging in this way?
>>
> Same person ;) I just didn't have much to work with from the chat there. I want to understand the issue better and see what I can tune or fix. I want to do a nodetool repair before upgrading to 2.1.11, but the compaction is blocking it.
>
>> What version of Cassandra?
>>
> 2.0.9
>
>>> I currently don't see CPU load from Cassandra on that node, so it seems stuck (somewhere in the mid 60%). Some other nodes also have compactions on the same column family. I don't see any progress.
>>>
>>> WARN [RMI TCP Connection(554)-192.168.0.68] 2015-11-09 17:18:13,677 ColumnFamilyStore.java (line 2101) Unable to cancel in-progress compactions for usage_record_ptd. Probably there is an unusually large row in progress somewhere. It is also possible that buggy code left some sstables compacting after it was done with them
>>>
>>>    - How can I confirm that nothing is happening?
>>>
>> Find the thread that is doing compaction and strace it. Generally it is one of the threads with a lower thread priority.
>>
> I have 141 threads. Not sure if that's normal.
>
> This seems to be the one:
> 61404 cassandr  24   4 8948m 4.3g 820m R 90.2 36.8 292:54.47 java
>
> In the strace I see basically this part repeating (with the occasional "resource temporarily unavailable"):
> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
> getpriority(PRIO_PROCESS, 61404) = 16
> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494045, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494047, NULL) = 0
> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
> getpriority(PRIO_PROCESS, 61404) = 16
> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494049, NULL) = 0
> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
> getpriority(PRIO_PROCESS, 61404) = 16
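(Side note on the thread id above: to double-check that 61404 really is a compaction thread and not something else, something like the following should let me match it against a jstack thread dump — the pgrep pattern is just an assumption about how the daemon shows up in the process list:)

    # jstack prints native thread ids as nid=0x<hex>
    printf 'nid=0x%x\n' 61404        # -> nid=0xefdc
    jstack $(pgrep -f CassandraDaemon) | grep -i -A 2 'nid=0xefdc'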
> But wait! I also see this:
> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494055, NULL) = 0
> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>
> This doesn't seem to happen that often, though.
>
>> Compaction often appears hung when decompressing a very large row, but usually not for "hours".
>>
>>>    - Is it recommended to disable compaction above a certain data size? (I believe 25GB on each node.)
>>>
>> It is almost never recommended to disable compaction.
>>
>>>    - Can I stop this compaction? nodetool stop compaction doesn't seem to work.
>>>
>> Killing the JVM ("the dungeon collapses!") would certainly stop it, but it'd likely just start again when you restart the node.
>>
>>>    - Is stopping the compaction dangerous?
>>>
>> Not if you're in a version that properly cleans up partial compactions, which is most of them.
>>
>>>    - Is killing the Cassandra process dangerous while compacting (I did nodetool drain on one node)?
>>>
>> No. But probably nodetool drain couldn't actually stop the in-progress compaction either, FWIW.
>>
>>> This is the output of nodetool compactionstats grepped for the keyspace that seems stuck.
>>>
>> Do you have gigantic rows in that keyspace? What does cfstats say about the largest row compaction has seen / do you have log messages about compacting large rows?
>>
> I don't know about the gigantic rows. How can I check?
>
> I've checked the logs and found this:
> INFO [CompactionExecutor:67] 2015-11-10 02:34:19,077 CompactionController.java (line 192) Compacting large row billing/usage_record_ptd:177727:2015-10-14 00\:00Z (243992466 bytes) incrementally
> So this is from 6 hours ago.
>
> I also see a lot of messages like this:
> INFO [OptionalTasks:1] 2015-11-10 06:36:06,395 MeteredFlusher.java (line 58) flushing high-traffic column family CFS(Keyspace='mykeyspace', ColumnFamily='mycolumnfamily') (estimated 100317609 bytes)
> And (although it's unrelated, might this impact compaction performance?):
> WARN [Native-Transport-Requests:10514] 2015-11-10 06:33:34,172 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_ptd] is of size 13834, exceeding specified threshold of 5120 by 8714.
>
> It's as if the compaction only works on one sstable at a time and does nothing for a long time in between.
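In case it helps: this is roughly how I'm checking for the large rows and whether the compaction is moving at all (a sketch; the log path assumes a default package install, and I'm assuming cfstats on this version accepts a keyspace.table argument):

    # find the large-row messages Rob asked about
    grep 'Compacting large row' /var/log/cassandra/system.log

    # partition size estimates for the affected table
    nodetool cfstats billing.usage_record_ptd | grep -E 'SSTable count|Compacted partition'

    # poll compaction progress once a minute
    watch -n 60 nodetool compactionstats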
> cfstats for this keyspace and column family gives the following:
>
> Table: mycolumnfamily
> SSTable count: 26
> Space used (live), bytes: 319858991
> Space used (total), bytes: 319860267
> SSTable Compression Ratio: 0.24265700071674673
> Number of keys (estimate): 6656
> Memtable cell count: 22710
> Memtable data size, bytes: 3310654
> Memtable switch count: 31
> Local read count: 0
> Local read latency: 0.000 ms
> Local write count: 997667
> Local write latency: 0.000 ms
> Pending tasks: 0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.00000
> Bloom filter space used, bytes: 12760
> Compacted partition minimum bytes: 1332
> Compacted partition maximum bytes: 43388628
> Compacted partition mean bytes: 234682
> Average live cells per slice (last five minutes): 0.0
> Average tombstones per slice (last five minutes): 0.0
>
>>> I also frequently see lines like this in system.log:
>>>
>>> WARN [Native-Transport-Requests:11935] 2015-11-09 20:10:41,886 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_by_billing_period, billing.metric] is of size 53086, exceeding specified threshold of 5120 by 47966.
>>>
>> Unrelated.
>>
>> =Rob
>>
> Can I upgrade to 2.1.11 without doing a nodetool repair first, given that this compaction is stuck?
>
> Another thing to mention: nodetool repair hasn't run yet. The cluster got installed, but nobody got around to scheduling the repairs.
>
> Thanks for looking into this!
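P.S. On the repair that never got scheduled: once this compaction is sorted out, the idea would be something along the lines of a weekly primary-range repair per node. A purely hypothetical cron entry (path, user, keyspace and timing are just placeholders):

    # /etc/cron.d/cassandra-repair (hypothetical): weekly primary-range repair, staggered per node
    0 2 * * 0  cassandra  nodetool repair -pr billing >> /var/log/cassandra/repair.log 2>&1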