Hi Rob,

Thanks for your reply.
2015-11-09 23:17 GMT+01:00 Robert Coli <rc...@eventbrite.com>:

> On Mon, Nov 9, 2015 at 1:29 PM, PenguinWhispererThe . <
> th3penguinwhispe...@gmail.com> wrote:
>>
>> In OpsCenter I see that one of the nodes is orange. It seems to be
>> working on a compaction. I used nodetool compactionstats, and the
>> completed count and percentage stay the same every time I run it (even
>> hours apart).
>
> Are you the same person from IRC, or a second report today of compaction
> hanging in this way?

Same person ;) I just didn't have much to work with from the chat there. I
want to understand the issue better and see what I can tune or fix. I want
to run nodetool repair before upgrading to 2.1.11, but the compaction is
blocking it.

> What version of Cassandra?

2.0.9

>> I currently don't see CPU load from Cassandra on that node, so it seems
>> stuck (somewhere mid 60%). Some other nodes are also compacting the same
>> column family, and I don't see any progress there either.
>>
>> WARN [RMI TCP Connection(554)-192.168.0.68] 2015-11-09 17:18:13,677
>> ColumnFamilyStore.java (line 2101) Unable to cancel in-progress compactions
>> for usage_record_ptd. Probably there is an unusually large row in progress
>> somewhere. It is also possible that buggy code left some sstables
>> compacting after it was done with them
>>
>> - How can I assure that nothing is happening?
>
> Find the thread that is doing compaction and strace it. Generally it is
> one of the threads with a lower thread priority.

I have 141 threads. Not sure if that's normal.
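For anyone following along, this is roughly how I located the busy thread
(a sketch, not an exact transcript: I'm using the shell's own PID so the
commands run anywhere, and 61404 is the TID that top showed on my node):

```shell
# Sketch: count a process's native threads, then map the hot TID from
# "top -H" to a Java thread name via jstack. $$ (this shell's PID) is a
# stand-in so the example runs; on the node you'd use Cassandra's PID,
# e.g. from "pgrep -f CassandraDaemon".
pid=$$
ls /proc/"$pid"/task | wc -l      # thread count (141 for my Cassandra JVM)

# jstack reports native thread ids in hex as nid=0x..., so convert the
# decimal TID that top shows (61404 on my node) before grepping:
printf 'nid=0x%x\n' 61404         # -> nid=0xefdc
# jstack <cassandra-pid> | grep -B2 nid=0xefdc
```

Grepping jstack output for that nid is what tells you whether the hot
thread is actually a CompactionExecutor thread.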
This seems to be the one:

61404 cassandr  24   4 8948m 4.3g 820m R 90.2 36.8 292:54.47 java

In the strace I basically see this part repeating (with a "resource
temporarily unavailable" once in a while):

futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
getpriority(PRIO_PROCESS, 61404) = 16
futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1233854, FUTEX_WAIT_PRIVATE, 494045, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1233854, FUTEX_WAIT_PRIVATE, 494047, NULL) = 0
futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
getpriority(PRIO_PROCESS, 61404) = 16
futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1233854, FUTEX_WAIT_PRIVATE, 494049, NULL) = 0
futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
getpriority(PRIO_PROCESS, 61404) = 16

But wait! I also see this:

futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x1233854, FUTEX_WAIT_PRIVATE, 494055, NULL) = 0
futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---

This doesn't seem to happen that often, though.

> Compaction often appears hung when decompressing a very large row, but
> usually not for "hours".
>
>> - Is it recommended to disable compaction from a certain data size?
>>   (I believe 25GB on each node.)
> It is almost never recommended to disable compaction.
>
>> - Can I stop this compaction? nodetool stop compaction doesn't seem
>>   to work.
>
> Killing the JVM ("the dungeon collapses!") would certainly stop it, but
> it'd likely just start again when you restart the node.
>
>> - Is stopping the compaction dangerous?
>
> Not if you're in a version that properly cleans up partial compactions,
> which is most of them.
>
>> - Is killing the Cassandra process dangerous while compacting (I did
>>   nodetool drain on one node)?
>
> No. But probably nodetool drain couldn't actually stop the in-progress
> compaction either, FWIW.
>
>> This is output of nodetool compactionstats grepped for the keyspace that
>> seems stuck.
>
> Do you have gigantic rows in that keyspace? What does cfstats say about
> the largest row compaction has seen/do you have log messages about
> compacting large rows?

I don't know about the gigantic rows. How can I check?

I've checked the logs and found this:

INFO [CompactionExecutor:67] 2015-11-10 02:34:19,077 CompactionController.java (line 192) Compacting large row billing/usage_record_ptd:177727:2015-10-14 00\:00Z (243992466 bytes) incrementally

So this is from 6 hours ago. I also see a lot of messages like this:

INFO [OptionalTasks:1] 2015-11-10 06:36:06,395 MeteredFlusher.java (line 58) flushing high-traffic column family CFS(Keyspace='mykeyspace', ColumnFamily='mycolumnfamily') (estimated 100317609 bytes)

And (although it's unrelated, might this impact compaction performance?):

WARN [Native-Transport-Requests:10514] 2015-11-10 06:33:34,172 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_ptd] is of size 13834, exceeding specified threshold of 5120 by 8714.

It's as if the compaction only handles one sstable at a time and does
nothing for a long time in between.
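To get a feel for how big that "large row" actually is, I pulled the byte
count out of the log line and converted it to MB (a quick sketch run over
the exact line quoted above):

```shell
# Extract the row size from the "Compacting large row" log line and
# convert it to MB. The line is the one from my system.log above.
line='INFO [CompactionExecutor:67] 2015-11-10 02:34:19,077 CompactionController.java (line 192) Compacting large row billing/usage_record_ptd:177727:2015-10-14 00\:00Z (243992466 bytes) incrementally'
bytes=$(printf '%s\n' "$line" | sed -n 's/.*(\([0-9][0-9]*\) bytes).*/\1/p')
echo "$bytes bytes = $((bytes / 1024 / 1024)) MB"
# -> 243992466 bytes = 232 MB
```

So that single row is roughly 232 MB, which at least explains why the
"Compacting large row ... incrementally" path is being taken.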
cfstats for this keyspace and column family gives the following:

Table: mycolumnfamily
SSTable count: 26
Space used (live), bytes: 319858991
Space used (total), bytes: 319860267
SSTable Compression Ratio: 0.24265700071674673
Number of keys (estimate): 6656
Memtable cell count: 22710
Memtable data size, bytes: 3310654
Memtable switch count: 31
Local read count: 0
Local read latency: 0.000 ms
Local write count: 997667
Local write latency: 0.000 ms
Pending tasks: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used, bytes: 12760
Compacted partition minimum bytes: 1332
Compacted partition maximum bytes: 43388628
Compacted partition mean bytes: 234682
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0

>> I also see frequently lines like this in system.log:
>>
>> WARN [Native-Transport-Requests:11935] 2015-11-09 20:10:41,886
>> BatchStatement.java (line 223) Batch of prepared statements for
>> [billing.usage_record_by_billing_period, billing.metric] is of size 53086,
>> exceeding specified threshold of 5120 by 47966.
>
> Unrelated.
>
> =Rob

Can I upgrade to 2.1.11 without running nodetool repair first, given that
the compaction is stuck? Another thing worth mentioning: nodetool repair
has never run on this cluster. Cassandra got installed, but nobody ever
scheduled repairs.

Thanks for looking into this!