RE: Severe Reliability Problems - 0.7 RC2

2010-12-20 Thread Dan Hendry
the ordinary > How much swap space do you have configured? 2 GB of swap and 24 GB of system memory. Dan From: Chris Goffinet [mailto:c...@chrisgoffinet.com] Sent: December-20-10 17:32 To: user@cassandra.apache.org Subject: Re: Severe Reliability Problems - 0.7 RC2 What kernel versio

Re: Severe Reliability Problems - 0.7 RC2

2010-12-20 Thread Chris Goffinet
What kernel version are you running? On I/O-intensive nodes running 2.6.18 to 2.6.24, I have seen a kernel bug that locks the JVM and spins to 100%. On Mon, Dec 20, 2010 at 1:14 PM, Brandon Williams wrote: > On Mon, Dec 20, 2010 at 2:13 PM, Dan Hendry wrote: > >> Yes, I have tried that (
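A quick way to check whether a node falls in the kernel range mentioned above is sketched below. This is only an illustration: the helper names `parse_kernel_release` and `in_suspect_range` are hypothetical, and treating the quoted range as inclusive at both ends is an assumption, since the message does not say.

```python
import platform
import re

# Range quoted in the message above; treating both ends as inclusive
# is an assumption made for this sketch.
SUSPECT_MIN = (2, 6, 18)
SUSPECT_MAX = (2, 6, 24)


def parse_kernel_release(release):
    """Return the leading (major, minor, patch) numbers of a release
    string such as '2.6.18-194.el5'. A missing patch level counts as 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def in_suspect_range(release):
    """True if the release falls between SUSPECT_MIN and SUSPECT_MAX."""
    return SUSPECT_MIN <= parse_kernel_release(release) <= SUSPECT_MAX


if __name__ == "__main__":
    # platform.release() reports the same string as `uname -r` on Linux.
    release = platform.release()
    verdict = "inside" if in_suspect_range(release) else "outside"
    print(f"{release}: {verdict} the quoted 2.6.18 to 2.6.24 range")
```

A match only says the kernel is in the range Chris describes; it does not confirm that the bug is the cause of the hang.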

Re: Severe Reliability Problems - 0.7 RC2

2010-12-20 Thread Brandon Williams
On Mon, Dec 20, 2010 at 2:13 PM, Dan Hendry wrote: > Yes, I have tried that (although only twice). Same impact as a regular > kill: nothing happens and I get no stacktrace output. It is however on my > list of things to try again the next time a node dies. I am also not able to > attach jstack to

Re: Severe Reliability Problems - 0.7 RC2

2010-12-20 Thread Peter Schuller
> There were a couple of threads on lkml recently that may be relevant, > but I have to run so I can't find the URLs at the moment (will do later tonight). Ok, I cannot figure out how to find the "first" message in a thread in any of the lkml archives, but these two threads may be of interest, especially if y

Re: Severe Reliability Problems - 0.7 RC2

2010-12-20 Thread Adrian Cockcroft
Yes, I have tried that (although only twice). Same impact as a regular kill: nothing happens and I get no stacktrace output. It is however on my list of things to try agai

RE: Severe Reliability Problems - 0.7 RC2

2010-12-20 Thread Dan Hendry
not help) and I have now changed disk_access_mode from auto to mmap_index_only on two of the nodes. Dan From: Kani [mailto:javier.canil...@gmail.com] Sent: December-20-10 14:14 To: user@cassandra.apache.org Subject: Re: Severe Reliability Problems - 0.7 RC2 Have you tried to send a KILL

Re: Severe Reliability Problems - 0.7 RC2

2010-12-20 Thread Peter Schuller
> be correlated is the flushing of memtables. One of the strangest > stats I am getting when in this state is memory paging: 3727168.00 pages > scanned/second (see sar -B output). Occasionally, if I leave the process > alone (~1 h) it recovers (maybe 1 in 5 times), otherwise the only way to

Re: Severe Reliability Problems - 0.7 RC2

2010-12-20 Thread Kani
Have you tried sending a kill -3 to the Cassandra process before you send kill -9? That way you will see what the threads are doing (and maybe where they are blocking). What the majority of the threads are doing may point you to the right spot to look for the problem. I'm not much of a Linux administrator, but when some
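The sequence suggested above (signal 3, which is SIGQUIT, followed later by signal 9, SIGKILL) can be sketched as a small helper. This is a minimal sketch, not anything posted in the thread: the function name `dump_then_kill`, the grace period, and the injectable `send` and `sleep` parameters are all hypothetical. A JVM normally answers SIGQUIT by printing a thread dump to its standard output, so the dump appears in the process's console log rather than in the caller's terminal.

```python
import os
import signal
import time


def dump_then_kill(pid, grace_seconds=10.0, send=os.kill, sleep=time.sleep):
    """Ask the process for a thread dump, wait, then force-kill it.

    `send` and `sleep` are injectable only so the sequence can be
    exercised without a real process (an assumption of this sketch).
    Returns True if SIGKILL was delivered, False if the process had
    already exited by then.
    """
    # Equivalent of `kill -3 <pid>`: a JVM responds with a thread dump.
    send(pid, signal.SIGQUIT)
    # Give the process time to write the dump before it is killed.
    sleep(grace_seconds)
    try:
        # Equivalent of `kill -9 <pid>`.
        send(pid, signal.SIGKILL)
    except ProcessLookupError:
        return False
    return True
```

If the process is wedged in the kernel, as other replies in this thread suspect, it may never act on SIGQUIT, which would match the report of no stack trace output.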