Thank you for the investigation.

On Dec 2, 2015 5:21 AM, "PenguinWhispererThe ." <th3penguinwhispe...@gmail.com> wrote:
So it seems I found the problem.

The node opening a stream is waiting for the other node to respond, but that node never responds due to a broken pipe, which makes Cassandra wait forever.

It's basically this issue:
https://issues.apache.org/jira/browse/CASSANDRA-8472
And this is the workaround/fix:
https://issues.apache.org/jira/browse/CASSANDRA-8611

So:
- update Cassandra to >= 2.0.11
- add the option streaming_socket_timeout_in_ms = 10000
- do a rolling restart of Cassandra

What's weird is that the "IOException: Broken pipe" is never shown in my logs (not on any node), and my logging is set to INFO in the log4j config. I have this config in log4j-server.properties:

    # output messages into a rolling log file as well as stdout
    log4j.rootLogger=INFO,stdout,R

    # stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%5p %d{HH:mm:ss,SSS} %m%n

    # rolling log file
    log4j.appender.R=org.apache.log4j.RollingFileAppender
    log4j.appender.R.maxFileSize=20MB
    log4j.appender.R.maxBackupIndex=50
    log4j.appender.R.layout=org.apache.log4j.PatternLayout
    log4j.appender.R.layout.ConversionPattern=%5p [%t] %d{ISO8601} %F (line %L) %m%n
    # Edit the next line to point to your logs directory
    log4j.appender.R.File=/var/log/cassandra/system.log

    # Application logging options
    #log4j.logger.org.apache.cassandra=DEBUG
    #log4j.logger.org.apache.cassandra.db=DEBUG
    #log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG

    # Adding this to avoid thrift logging disconnect errors.
    log4j.logger.org.apache.thrift.server.TNonblockingServer=ERROR

Too bad nobody could point me to those. Hope it saves someone else from wasting a lot of time.

2015-11-11 15:42 GMT+01:00 Sebastian Estevez <sebastian.este...@datastax.com>:

Use 'nodetool compactionhistory'

all the best,

Sebastián

On Nov 11, 2015 3:23 AM, "PenguinWhispererThe ." <th3penguinwhispe...@gmail.com> wrote:

Does compactionstats show only stats for completed compactions (100%)? It might be that the compaction is running constantly, over and over again. In that case I need to know what I can do to stop this constant compaction so I can start a nodetool repair.

Note that there is a lot of traffic on this columnfamily, so I'm not sure if temporarily disabling compaction is an option. The repair will probably take long as well.

Sebastian and Rob: do you have any more ideas about the things I put in this thread? Any help is appreciated!

2015-11-10 20:03 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:

Hi Sebastian,

Thanks for your response.

No swap is used. No offense, I just don't see a reason why having swap would be the issue here. I put swappiness on 1. I also have jna installed. That should prevent Java being swapped out as well, AFAIK.
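For reference, the workaround listed at the very top of this thread comes down to one cassandra.yaml setting plus a rolling restart. A minimal sketch of applying it on each node, assuming a package-based install (in the yaml file the value is given with a colon, and the exact restart command depends on the distribution):

    # cassandra.yaml (Cassandra >= 2.0.11): fail streaming sockets that stall
    # instead of waiting forever (a value of 0 means no timeout)
    streaming_socket_timeout_in_ms: 10000

    # rolling restart, one node at a time; wait until the node shows Up/Normal
    # in "nodetool status" before moving on to the next one
    nodetool drain
    sudo service cassandra restart
    nodetool status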
2015-11-10 19:50 GMT+01:00 Sebastian Estevez <sebastian.este...@datastax.com>:

Turn off Swap.

http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installRecommendSettings.html?scroll=reference_ds_sxl_gf3_2k__disable-swap

All the best,

Sebastián Estévez
Solutions Architect | DataStax

On Tue, Nov 10, 2015 at 1:48 PM, PenguinWhispererThe . <th3penguinwhispe...@gmail.com> wrote:

I also have the following memory usage:

    [root@US-BILLINGDSX4 cassandra]# free -m
                 total       used       free     shared    buffers     cached
    Mem:         12024        9455       2569          0        110       2163
    -/+ buffers/cache:        7180       4844
    Swap:         2047           0       2047

Still a lot free and a lot of free buffers/cache.

2015-11-10 19:45 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:

Still stuck with this. However, I enabled GC logging, which shows the following:

    [root@myhost cassandra]# tail -f gc-1447180680.log
    2015-11-10T18:41:45.516+0000: 225.428: [GC 2721842K->2066508K(6209536K), 0.0199040 secs]
    2015-11-10T18:41:45.977+0000: 225.889: [GC 2721868K->2066511K(6209536K), 0.0221910 secs]
    2015-11-10T18:41:46.437+0000: 226.349: [GC 2721871K->2066524K(6209536K), 0.0222140 secs]
    2015-11-10T18:41:46.897+0000: 226.809: [GC 2721884K->2066539K(6209536K), 0.0224140 secs]
    2015-11-10T18:41:47.359+0000: 227.271: [GC 2721899K->2066538K(6209536K), 0.0302520 secs]
    2015-11-10T18:41:47.821+0000: 227.733: [GC 2721898K->2066557K(6209536K), 0.0280530 secs]
    2015-11-10T18:41:48.293+0000: 228.205: [GC 2721917K->2066571K(6209536K), 0.0218000 secs]
    2015-11-10T18:41:48.790+0000: 228.702: [GC 2721931K->2066780K(6209536K), 0.0292470 secs]
    2015-11-10T18:41:49.290+0000: 229.202: [GC 2722140K->2066843K(6209536K), 0.0288740 secs]
    2015-11-10T18:41:49.756+0000: 229.668: [GC 2722203K->2066818K(6209536K), 0.0283380 secs]
    2015-11-10T18:41:50.249+0000: 230.161: [GC 2722178K->2067158K(6209536K), 0.0218690 secs]
    2015-11-10T18:41:50.713+0000: 230.625: [GC 2722518K->2067236K(6209536K), 0.0278810 secs]

This is a VM with 12GB of RAM. I raised HEAP_SIZE to 6GB and HEAP_NEWSIZE to 800MB.

Still the same result.

This looks very similar to the following issue:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAJ=3xgRLsvpnZe0uXEYjG94rKhfXeU+jBR=q3a-_c3rsdd5...@mail.gmail.com%3E

Is the only possibility to upgrade memory? I mean, I can't believe it's just loading all its data into memory. That would mean I'd have to keep scaling up the node to keep it working?
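As an aside, each GC line above takes about 0.02-0.03 s and the heap drops back to roughly 2 GB afterwards, so these particular lines look like ordinary young-generation activity rather than obvious GC pressure. For readers wondering where those heap values live: they are normally set in conf/cassandra-env.sh (the stock variable is MAX_HEAP_SIZE rather than HEAP_SIZE), and a gc-<unix timestamp>.log file name like the one being tailed matches the GC-logging lines that ship commented out in the same file. A sketch, assuming the stock 2.0 cassandra-env.sh layout:

    # conf/cassandra-env.sh -- override the automatic heap sizing
    MAX_HEAP_SIZE="6G"
    HEAP_NEWSIZE="800M"

    # GC logging flags (commented out by default in the stock file); these
    # produce the gc-<unix timestamp>.log file shown above
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"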
2015-11-10 9:36 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:

Correction... I was grepping for Segmentation in the strace output and it happens a lot.

Do I need to run a scrub?

2015-11-10 9:30 GMT+01:00 PenguinWhispererThe . <th3penguinwhispe...@gmail.com>:

Hi Rob,

Thanks for your reply.

2015-11-09 23:17 GMT+01:00 Robert Coli <rc...@eventbrite.com>:

> On Mon, Nov 9, 2015 at 1:29 PM, PenguinWhispererThe . <th3penguinwhispe...@gmail.com> wrote:
>
>> In Opscenter I see one of the nodes is orange. It seems like it's working on compaction. I used nodetool compactionstats, and whenever I did this the Completed and percentage values stayed the same (even with hours in between).
>
> Are you the same person from IRC, or a second report today of compaction hanging in this way?

Same person ;) I just didn't have much to work with from the chat there. I want to understand the issue more and see what I can tune or fix. I want to do a nodetool repair before upgrading to 2.1.11, but the compaction is blocking it.

> What version of Cassandra?

2.0.9

>> I currently don't see cpu load from cassandra on that node. So it seems stuck (somewhere mid 60%). Also some other nodes have compaction on the same columnfamily. I don't see any progress.
>>
>> WARN [RMI TCP Connection(554)-192.168.0.68] 2015-11-09 17:18:13,677 ColumnFamilyStore.java (line 2101) Unable to cancel in-progress compactions for usage_record_ptd. Probably there is an unusually large row in progress somewhere. It is also possible that buggy code left some sstables compacting after it was done with them
>>
>> - How can I assure that nothing is happening?
>
> Find the thread that is doing compaction and strace it. Generally it is one of the threads with a lower thread priority.

I have 141 threads. Not sure if that's normal.
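A sketch of one way to do what Rob suggests: list per-thread CPU usage for the Cassandra JVM, then attach strace to the busy thread id. The pgrep pattern assumes the usual CassandraDaemon main class appears on the command line; the thread id 61404 is the one identified just below.

    # per-thread view of the Cassandra process; compaction threads typically
    # show a non-zero nice (NI) value, i.e. a lowered priority
    top -H -p "$(pgrep -f CassandraDaemon)"

    # attach strace (with timestamps) to that native thread id
    strace -tt -p 61404

    # the same id can be matched to a Java thread name in a jstack dump,
    # where thread ids appear in hex as nid=0x...; 61404 decimal is 0xefdc
    printf '%x\n' 61404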
This seems to be the one:

    61404 cassandr  24   4 8948m 4.3g 820m R 90.2 36.8 292:54.47 java

In the strace I basically see this part repeating (with, once in a while, the "resource temporarily unavailable"):

    futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
    getpriority(PRIO_PROCESS, 61404) = 16
    futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 0
    futex(0x1233854, FUTEX_WAIT_PRIVATE, 494045, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
    futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x1233854, FUTEX_WAIT_PRIVATE, 494047, NULL) = 0
    futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
    futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
    getpriority(PRIO_PROCESS, 61404) = 16
    futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x1233854, FUTEX_WAIT_PRIVATE, 494049, NULL) = 0
    futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
    getpriority(PRIO_PROCESS, 61404) = 16

But wait! I also see this:

    futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x1233854, FUTEX_WAIT_PRIVATE, 494055, NULL) = 0
    futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
    --- SIGSEGV (Segmentation fault) @ 0 (0) ---

This doesn't seem to happen that often though.

> Compaction often appears hung when decompressing a very large row, but usually not for "hours".
>
>> - Is it recommended to disable compaction from a certain data size? (I believe 25GB on each node.)
>
> It is almost never recommended to disable compaction.
>
>> - Can I stop this compaction? nodetool stop compaction doesn't seem to work.
>
> Killing the JVM ("the dungeon collapses!") would certainly stop it, but it'd likely just start again when you restart the node.
>
>> - Is stopping the compaction dangerous?
>
> Not if you're in a version that properly cleans up partial compactions, which is most of them.
>
>> - Is killing the cassandra process dangerous while compacting (I did nodetool drain on one node)?
>
> No. But probably nodetool drain couldn't actually stop the in-progress compaction either, FWIW.
>
>> This is output of nodetool compactionstats grepped for the keyspace that seems stuck.
>
> Do you have gigantic rows in that keyspace?
> What does cfstats say about the largest row compaction has seen / do you have log messages about compacting large rows?

I don't know about the gigantic rows. How can I check?

I've checked the logs and found this:

    INFO [CompactionExecutor:67] 2015-11-10 02:34:19,077 CompactionController.java (line 192) Compacting large row billing/usage_record_ptd:177727:2015-10-14 00\:00Z (243992466 bytes) incrementally

So this is from 6 hours ago.

I also see a lot of messages like this:

    INFO [OptionalTasks:1] 2015-11-10 06:36:06,395 MeteredFlusher.java (line 58) flushing high-traffic column family CFS(Keyspace='mykeyspace', ColumnFamily='mycolumnfamily') (estimated 100317609 bytes)

And (although it's unrelated, might this impact compaction performance?):

    WARN [Native-Transport-Requests:10514] 2015-11-10 06:33:34,172 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_ptd] is of size 13834, exceeding specified threshold of 5120 by 8714.

It's like the compaction is only doing one sstable at a time and is doing nothing for a long time in between.

cfstats for this keyspace and columnfamily gives the following:

    Table: mycolumnfamily
    SSTable count: 26
    Space used (live), bytes: 319858991
    Space used (total), bytes: 319860267
    SSTable Compression Ratio: 0.24265700071674673
    Number of keys (estimate): 6656
    Memtable cell count: 22710
    Memtable data size, bytes: 3310654
    Memtable switch count: 31
    Local read count: 0
    Local read latency: 0.000 ms
    Local write count: 997667
    Local write latency: 0.000 ms
    Pending tasks: 0
    Bloom filter false positives: 0
    Bloom filter false ratio: 0.00000
    Bloom filter space used, bytes: 12760
    Compacted partition minimum bytes: 1332
    Compacted partition maximum bytes: 43388628
    Compacted partition mean bytes: 234682
    Average live cells per slice (last five minutes): 0.0
    Average tombstones per slice (last five minutes): 0.0

>> I also see frequently lines like this in system.log:
>>
>> WARN [Native-Transport-Requests:11935] 2015-11-09 20:10:41,886 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_by_billing_period, billing.metric] is of size 53086, exceeding specified threshold of 5120 by 47966.
>
> Unrelated.
>
> =Rob

Can I upgrade to 2.1.11 without doing a nodetool repair, with this compaction being stuck?

Another thing to mention is that nodetool repair hasn't been run yet. It got installed but nobody bothered to schedule the repair.

Thanks for looking into this!
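On the "how can I check for gigantic rows" question above: the cfstats output already gives part of the answer (Compacted partition maximum bytes: 43388628, roughly 41 MB), and the "Compacting large row ... (243992466 bytes) incrementally" line means compaction hit a roughly 233 MB partition in billing/usage_record_ptd. A sketch of the usual checks on a 2.0 node; the table names are the ones from this thread, and if the keyspace.table filter is not accepted on a given 2.0 build, run cfstats without arguments and search the output:

    # per-table stats; look at "Compacted partition maximum bytes"
    nodetool cfstats billing.usage_record_ptd

    # distribution of partition sizes and cell counts per partition
    nodetool cfhistograms billing usage_record_ptd

    # rows larger than in_memory_compaction_limit_in_mb (cassandra.yaml) are
    # compacted incrementally and logged, which is where the line above comes from
    grep "Compacting large row" /var/log/cassandra/system.log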