Hi Clint,

I think I found the cause of my problem. I doubt you have exactly the same one, but here it is:
We're using Zabbix as our monitoring system, and it uses /usr/bin/at to schedule its monitoring runs. Every time the "at" command adds another scheduled task, it sends a kill signal to the pid of atd, probably just to check whether it's alive, not to kill it. Now, looking at the syscall audit log, it seems that sometimes, although the kill syscall is given one pid (the atd one), the signal is actually delivered to our C* Java process. I'm really starting to think it's some kind of Linux kernel bug.

BTW, atd was always stopped, so I'm not sure yet whether that was part of the problem or not.

HTH,
Or.

On Wed, Aug 13, 2014 at 9:22 AM, Or Sher <or.sh...@gmail.com> wrote:

> Will do the same!
> Thanks,
> Or.
>
>
> On Tue, Aug 12, 2014 at 6:47 PM, Clint Kelly <clint.ke...@gmail.com>
> wrote:
>
>> Hi Or,
>>
>> For now I removed the test that was failing like this from our suite
>> and made a note to revisit it in a couple of weeks. Unfortunately I
>> still don't know what the issue is. I'll post here if I figure out it
>> (please do the same!). My working hypothesis now is that we had some
>> kind of OOM problem.
>>
>> Best regards,
>> Clint
>>
>> On Tue, Aug 12, 2014 at 12:23 AM, Or Sher <or.sh...@gmail.com> wrote:
>> > Clint, did you find anything?
>> > I just noticed it happens to us too on only one node in our CI cluster.
>> > I don't think there is a special usage before it happens... The last line
>> > in the log before the shutdown lines in at least an hour before..
>> > We're using C* 2.0.9.
>> >
>> >
>> > On Thu, Aug 7, 2014 at 12:49 AM, Clint Kelly <clint.ke...@gmail.com> wrote:
>> >>
>> >> Hi Rob,
>> >>
>> >> Thanks for the clarification; this is really useful. I'll run some
>> >> experiments to see if the problem is a JVM OOM on our build machine.
>> >>
>> >> Best regards,
>> >> Clint
>> >>
>> >> On Wed, Aug 6, 2014 at 1:14 PM, Robert Coli <rc...@eventbrite.com> wrote:
>> >> > On Wed, Aug 6, 2014 at 1:12 PM, Robert Coli <rc...@eventbrite.com>
>> >> > wrote:
>> >> >>
>> >> >> On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands <duncan.sa...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> this doesn't look like an OOM to me. If the kernel OOM kills
>> >> >>> Cassandra then Cassandra instantly vaporizes, and there will be
>> >> >>> nothing in the Cassandra logs (you will find information about the
>> >> >>> OOM in the system logs though, eg in dmesg). In the log snippet
>> >> >>> above you see an orderly shutdown, this is completely different to
>> >> >>> the instant OOM kill.
>> >> >>
>> >> >>
>> >> >> Not really.
>> >> >>
>> >> >> https://issues.apache.org/jira/browse/CASSANDRA-7507
>> >> >
>> >> >
>> >> > To be clear, there's two different OOMs here, I am talking about the JVM
>> >> > OOM, not system level. As CASSANDRA-7507 indicates, JVM OOM does not
>> >> > necessarily result in the cassandra process dying, and can in fact
>> >> > trigger clean shutdown.
>> >> >
>> >> > System level OOM will in fact send the equivalent of KILL, which will
>> >> > not trigger the clean shutdown hook in Cassandra.
>> >> >
>> >> > =Rob
>> >
>> >
>> >
>> >
>> > --
>> > Or Sher
>> >
>
>
> --
> Or Sher
>

--
Or Sher