Benedict,

Thanks for the advice. I've tried turning on PrintSafepointStatistics. However, that info is only sent to STDOUT, and the cassandra startup script closes STDOUT once it finishes starting up, so no safepoint statistics are shown after that point. Do you know how to start cassandra so that all stdout goes to a log file and cassandra does not close stdout?
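One thought (just a sketch on my end; the paths are illustrative, not from our actual install): either keep cassandra in the foreground and redirect its output, or ask HotSpot itself to copy its tty output, which includes the safepoint statistics, to a file.

```shell
# Option 1: run cassandra in the foreground (-f) and redirect
# stdout/stderr to a file, so the startup script never closes them.
nohup cassandra -f >> /var/log/cassandra/stdout.log 2>&1 &

# Option 2: have the JVM write its own tty output (safepoint statistics
# included) to a file, independent of what the startup script does:
#   -XX:+UnlockDiagnosticVMOptions -XX:+LogVMOutput \
#   -XX:LogFile=/var/log/cassandra/vm.log
```

Would either of those capture what we need?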
Also, we have swap turned off as recommended.

thanks

On Wed, Jan 29, 2014 at 3:39 PM, Benedict Elliott Smith <belliottsm...@datastax.com> wrote:

> Frank,
>
> The same advice for investigating holds: add the VM flags
> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 (you
> could put something above 1 there to reduce the amount of logging, since a
> pause of 52s will be pretty obvious even if aggregated with lots of other
> safepoints; the count is the number of safepoints to aggregate into one
> log message).
>
> 52s is a very extreme pause, and I would be surprised if revoke bias
> could cause this. I wonder if the VM is swapping out.
>
>
> On 29 January 2014 19:02, Frank Ng <fnt...@gmail.com> wrote:
>
>> Thanks for the update. Our logs indicated that there were 0 pending for
>> CompactionManager at that time. Also, there were no nodetool repairs
>> running at that time. The log statements above state that the application
>> had to stop to reach a safepoint, yet they don't say what requested the
>> safepoint.
>>
>>
>> On Wed, Jan 29, 2014 at 1:20 PM, Shao-Chuan Wang <shaochuan.w...@bloomreach.com> wrote:
>>
>>> We had similar latency spikes when pending compactions couldn't keep up
>>> or repair/streaming was taking too many cycles.
>>>
>>>
>>> On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng <fnt...@gmail.com> wrote:
>>>
>>>> All,
>>>>
>>>> We've been having intermittent long application pauses (version 1.2.8)
>>>> and are not sure if it's a cassandra bug. During these pauses, there
>>>> are dropped messages in the cassandra log file, along with the node
>>>> seeing other nodes as down. We've turned on GC logging, and the
>>>> following is an example of a long "stopped" or pause event in the
>>>> gc.log file.
>>>>
>>>> 2014-01-28T23:11:12.183-0500: 1337654.424: Total time for which
>>>> application threads were stopped: 0.091450 seconds
>>>> 2014-01-28T23:14:11.161-0500: 1337833.401: Total time for which
>>>> application threads were stopped: 51.8190260 seconds
>>>> 2014-01-28T23:14:19.870-0500: 1337842.111: Total time for which
>>>> application threads were stopped: 0.005470 seconds
>>>>
>>>> As seen above, there was a 0.091450 sec pause, then a 51.8190260 sec
>>>> pause, with no GC log events between those two log statements. Since
>>>> there are no GC logs in between, something else must be causing the
>>>> long stop time to reach a safepoint.
>>>>
>>>> Could there be a Cassandra thread that is taking a long time to reach
>>>> a safepoint, and if so, what is it trying to do? Along with the node
>>>> seeing other nodes as down in the cassandra log file, the StatusLogger
>>>> shows 1599 Pending in ReadStage and 9 Pending in MutationStage.
>>>>
>>>> There is mention of cassandra's batch revocation of biased locks as a
>>>> possible cause (not GC) via:
>>>> http://www.mail-archive.com/user@cassandra.apache.org/msg34401.html
>>>>
>>>> We have JNA, no swap, and the cluster runs fine aside from this
>>>> intermittent long pause, which can cause a node to appear down to
>>>> other nodes. Any ideas on the cause of the long pause above? It seems
>>>> not related to GC.
>>>>
>>>> thanks.
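Aside: to pull only the long "stopped" events out of a gc.log like the excerpt above, a one-liner along these lines works (a sketch; the log path and the 1-second threshold are assumptions, adjust to taste):

```shell
# Print only the safepoint-stop lines whose duration exceeds 1 second.
# On these lines the duration is the next-to-last field, e.g.
#   "... application threads were stopped: 51.8190260 seconds"
# The gc.log path is an assumption -- use whatever -Xloggc points at.
awk '/Total time for which application threads were stopped/ && $(NF-1) > 1.0' /var/log/cassandra/gc.log
```

That makes the rare multi-second pauses easy to spot without scrolling past the thousands of sub-millisecond ones.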