Thanks everyone for the feedback. Some additional details:

1. Definitely using the Oracle JDK (1.7.0_71-b14).
2. Yes, the segfaulting does go away after a restart.
3. No OOM log messages when this occurs.
4. We are seeing many long GC pauses, as in over 2 seconds. We are aware that our GC performance is bad, and we believe this is because of IO, which we are addressing. However, we are seeing this runaway CPU usage during low-load times, and even when we took the cluster completely out of use.
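In case it helps anyone looking at a longer capture, here is a rough Python sketch for summarizing strace text output like the trace quoted further down: it tallies how often each syscall and each delivered signal appears, so you can see what the loop is dominated by. The function name `tally_strace` and the `SAMPLE` variable are just illustrative names; the sample lines are copied from the trace in this thread, and the regular expressions assume strace's default text format.

```python
import re
from collections import Counter

# Start of a syscall line, e.g. "futex(0x7fd6..., FUTEX_WAKE_PRIVATE, 1) = 0"
SYSCALL_RE = re.compile(r"^(\w+)\(")
# Signal delivery line, e.g. "--- SIGSEGV (Segmentation fault) @ 0 (0) ---"
SIGNAL_RE = re.compile(r"^--- (SIG\w+)")


def tally_strace(lines):
    """Count syscalls and delivered signals in strace text output."""
    syscalls = Counter()
    signals = Counter()
    for raw in lines:
        line = raw.strip()
        sig = SIGNAL_RE.match(line)
        if sig:
            signals[sig.group(1)] += 1
            continue
        call = SYSCALL_RE.match(line)
        if call:
            syscalls[call.group(1)] += 1
        # Anything else (wrapped continuation lines, blanks) is ignored.
    return syscalls, signals


# A few lines taken from the trace quoted in this thread.
SAMPLE = """\
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
rt_sigreturn(0x7fd61110f862) = 30618997712
futex(0x7fd614844054, FUTEX_WAIT_PRIVATE, 27333, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fd614844028, FUTEX_WAKE_PRIVATE, 1) = 0
""".splitlines()

if __name__ == "__main__":
    syscalls, signals = tally_strace(SAMPLE)
    print("syscalls:", dict(syscalls))
    print("signals:", dict(signals))
```

To use it on a real capture, pass the lines of a saved strace output file instead of `SAMPLE`. On a live process, strace's own `-c` option prints a comparable per-syscall summary.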
Thanks again,
Stan

On Wed, Nov 26, 2014 at 12:03 PM, Tyler Hobbs <ty...@datastax.com> wrote:

> When I see a segfault, my first reaction is to always suspect OpenJDK.
> Are you using OpenJDK or the Oracle JDK? If you're using the former, I
> recommend the latter.
>
> On Tue, Nov 25, 2014 at 10:40 PM, Otis Gospodnetic <
> otis.gospodne...@gmail.com> wrote:
>
>> Hi Stan,
>>
>> Put some monitoring on this. The first thing I think of when I hear
>> "chewing up CPU" for Java apps is GC. In SPM <http://sematext.com/spm/>
>> you can easily see individual JVM memory pools and see if any of them are
>> at (close to) 100%. You can typically correlate that to increased GC times
>> and counts. I'd look at that before looking at strace and such.
>>
>> Otis
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>> On Tue, Nov 25, 2014 at 11:07 PM, Stan Lemon <sle...@salesforce.com>
>> wrote:
>>
>>> We are using v2.0.11 and have seen several instances in our 24-node
>>> cluster where the node becomes unresponsive. When we look into it, we find
>>> that there is a cassandra process chewing up a lot of CPU. There are no
>>> other indications in logs or anything as to what might be happening,
>>> however if we strace the process that is chewing up CPU we see a
>>> segmentation fault:
>>>
>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>> rt_sigreturn(0x7fd61110f862) = 30618997712
>>> futex(0x7fd614844054, FUTEX_WAIT_PRIVATE, 27333, NULL) = -1 EAGAIN
>>> (Resource temporarily unavailable)
>>> futex(0x7fd614844028, FUTEX_WAKE_PRIVATE, 1) = 0
>>> futex(0x7fd6148e2e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd6148e2e50,
>>> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>> futex(0x7fd6148e2e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>> futex(0x7fd614844054, FUTEX_WAIT_PRIVATE, 27335, NULL) = 0
>>> futex(0x7fd614844028, FUTEX_WAKE_PRIVATE, 1) = 0
>>> futex(0x7fd6148e2e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd6148e2e50,
>>> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>> futex(0x7fd6148e2e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>
>>> And this happens over and over again while running strace.
>>>
>>> Has anyone seen this? Does anyone have any ideas what might be
>>> happening, or how we could debug it further?
>>>
>>> Thanks for your help,
>>>
>>> Stan
>>>
>>>
>>
>
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>