What filesystem are you using? You might try ext3 or ext4 vs. XFS as another axis of diversity. It sounds as if the page cache or filesystem is messed up. Are there any clues in /var/log/messages? How much swap space do you have configured?
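For example, something along these lines should show the filesystem type, the swap situation, and any recent kernel complaints (the data directory path is just a guess at the default, and on Ubuntu the kernel messages may land in /var/log/syslog rather than /var/log/messages):

    # Filesystem type backing the Cassandra data directory (path is a guess)
    df -T /var/lib/cassandra

    # Swap configured and in use
    swapon -s
    free -m

    # Recent kernel-level complaints (OOM killer, I/O errors, hung tasks)
    dmesg | tail -n 50
    grep -iE 'oom|error|hung' /var/log/messages | tail -n 50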
The kernel level debug stuff I know is all for Solaris, unfortunately…

Adrian

From: Dan Hendry <dan.hendry.j...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Mon, 20 Dec 2010 12:13:56 -0800
To: user@cassandra.apache.org
Subject: RE: Severe Reliability Problems - 0.7 RC2

Yes, I have tried that (although only twice). Same impact as a regular kill: nothing happens and I get no stack trace output. It is, however, on my list of things to try again the next time a node dies. I am also not able to attach jstack to the process.

I have also tried disabling JNA (did not help) and I have now changed disk_access_mode from auto to mmap_index_only on two of the nodes.

Dan

From: Kani [mailto:javier.canil...@gmail.com]
Sent: December-20-10 14:14
To: user@cassandra.apache.org
Subject: Re: Severe Reliability Problems - 0.7 RC2

Have you tried sending a KILL -3 to the Cassandra process before you send KILL -9? That way you will see what the threads are doing (and maybe blocking on). The majority of the threads may point you to the right spot to look for the problem.

I'm not much of a Linux administrator, but when something goes weird in one of my own applications (Java running on a Linux box) I try that command to see what the application is doing, or trying to do.

Kani

On Mon, Dec 20, 2010 at 3:48 PM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:

I have been having severe and strange reliability problems within my Cassandra cluster. This weekend, all four of my nodes were down at once. Even now I am losing one every few hours. I have attached output from all the system monitoring commands I can think of.

What seems to happen is that the java process locks up and sits at 100% system CPU usage (but no user CPU); there are 8 cores, so 100% = 1/8 of total capacity. JMX freezes and the node effectively dies, but there is typically nothing unusual in the Cassandra logs. About the only thing which seems to be correlated is the flushing of memtables. One of the strangest stats I am getting in this state is the memory paging: 3727168.00 pages scanned/second (see the sar -B output). Occasionally, if I leave the process alone (~1 h) it recovers (maybe 1 in 5 times); otherwise the only way to terminate the Cassandra process is with a kill -9. When this happens, Cassandra memory usage (as reported by JMX before it dies) is also reasonable (e.g. 6 GB out of a 12 GB heap and 24 GB of system RAM).

This feels more like a system-level problem than a Cassandra problem, so I have tried diversifying my cluster: one node runs Ubuntu 10.10, the other three run 10.04. One runs OpenJDK (1.6.0_20), the rest run the Sun JDK (1.6.0_22). Neither change seems to be correlated with the problem. These are pretty much stock Ubuntu installs, so nothing special on that front.

Now, this has been a relatively sudden development and I can potentially attribute it to a few things:

1. Upgrading to RC2.
2. Ever-increasing amounts of data (there is less than 100 GB per node, so this should not be the problem).
3. Migrating from a set of machines where the data + commit log directories were on four small RAID 5 disks to machines with two 500 GB drives: one for data and one for the commitlog + OS. I have seen more IO wait on these new machines, but they have the same memory and system settings.
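A rough way to quantify the extra I/O wait per device is something like the following (iostat comes from the same sysstat package as sar; which device is the data drive and which is the commitlog/OS drive depends on the machine):

    # Extended per-device stats in 5-second samples; compare await and %util
    # for the data drive against the commitlog/OS drive
    iostat -x 5 3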
I am about at my wits' end on this one; any help would be appreciated.
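For reference, the next time a node wedges, a rough capture pass might look like the following (the pgrep pattern is just an assumption about how the process shows up in the process list; as noted above, kill -3 and jstack can both come up empty when the JVM is stuck at 100% system CPU):

    # Find the Cassandra JVM (pattern is an assumption; adjust to your setup)
    CASS_PID=$(pgrep -f CassandraDaemon)

    # Ask the JVM for a thread dump; it goes to the process's stdout,
    # wherever the startup script redirects that
    kill -3 "$CASS_PID"

    # Best-effort forced dump if the normal one produces nothing
    jstack -F "$CASS_PID" > jstack-$CASS_PID.out 2>&1

    # Paging, run queue, and per-thread CPU while it is wedged
    sar -B 5 6
    vmstat 5 6
    top -b -H -n 1 -p "$CASS_PID" | head -n 60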