Is the fix as simple as calling close() then? Can you submit a patch for that?
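If it is, the patch would essentially amount to closing the connector in a try/finally once NodeProbe is done with it. A minimal standalone sketch of that pattern (using an in-process JMXConnectorServer purely for illustration; the class and variable names here are not from NodeProbe):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

public class JmxCloseDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the Cassandra side: export the platform MBeanServer
        // over RMI so we have something to connect to in-process.
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        JMXConnectorServer server = JMXConnectorServerFactory.newJMXConnectorServer(
                new JMXServiceURL("service:jmx:rmi://"), null, mbs);
        server.start();

        // Client side, analogous to what NodeProbe does: open a connector,
        // use it, and (the proposed fix) always close it when done.
        JMXConnector jmxc = JMXConnectorFactory.connect(server.getAddress());
        try {
            System.out.println("mbeans=" + jmxc.getMBeanServerConnection().getMBeanCount());
        } finally {
            // Without this close(), the client's ClientCommunicatorAdmin$Checker
            // thread lingers, keeping the server's Timeout thread alive too.
            jmxc.close();
        }

        server.stop();
        System.out.println("closed");
    }
}
```

The finally block is the important part: it guarantees the connector is released even if an MBean call throws, so neither side accumulates per-connection threads.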
On Fri, Oct 22, 2010 at 2:49 PM, Bill Au <bill.w...@gmail.com> wrote:
> Not with the nodeprobe or nodetool command, because the JVM these two
> commands spawn has a very short life span.
>
> I am using a webapp to monitor my Cassandra cluster. It pretty much uses
> the same code as the NodeCmd class. For each incoming request, it creates a
> NodeProbe object and uses it to get various status of the cluster. I can
> reproduce the Cassandra JVM crash by issuing requests to this webapp in a
> bash while loop. I took a deeper look and here is what I discovered:
>
> In the webapp, when NodeProbe creates a JMXConnector to connect to the
> Cassandra JMX port, a thread
> (com.sun.jmx.remote.internal.ClientCommunicatorAdmin$Checker) is created
> and run in the webapp's JVM. Meanwhile, in the Cassandra JVM there is a
> com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout thread to
> time out the remote JMX connection. However, since NodeProbe does not call
> JMXConnector.close(), the JMX client checker thread remains in the
> webapp's JVM even after the NodeProbe object has been garbage collected.
> So this JMX connection is still considered open, and that keeps the JMX
> timeout thread running inside the Cassandra JVM. The number of JMX client
> checker threads in my webapp's JVM matches up with the number of JMX
> server timeout threads in my Cassandra JVM. If I stop my webapp's JVM, all
> the JMX server timeout threads in my Cassandra JVM disappear after 2
> minutes, the default timeout for a JMX connection. This is why the problem
> cannot be reproduced with nodeprobe or nodetool: even though
> JMXConnector.close() is not called, the JVM exits shortly, so the JMX
> client checker threads do not stay around, and their corresponding JMX
> server timeout threads go away after two minutes. This is not the case
> with my webapp, since its JVM keeps running, so all the JMX client checker
> threads keep running as well. The threads keep piling up until it crashes
> Cassandra's JVM.
>
> In my case I think I can change my webapp to use a static NodeProbe
> instead of creating a new one for every request. That should get around
> the leak.
>
> However, I have seen the leak occur in another situation. On more than one
> occasion, when I restarted one node in a live multi-node cluster, I saw
> the JMX server timeout threads quickly pile up (numbering in the
> thousands) in Cassandra's JVM. It only happened on a live cluster that was
> servicing read and write requests. I am guessing the hinted handoff might
> have something to do with it. I am still trying to understand what is
> happening there.
>
> Bill
>
>
> On Wed, Oct 20, 2010 at 5:16 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> can you reproduce this by, say, running nodeprobe ring in a bash while
>> loop?
>>
>> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bill.w...@gmail.com> wrote:
>> > One of my Cassandra servers crashed with the following:
>> >
>> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
>> > CassandraDaemon.java (line 82) Uncaught exception in thread
>> > Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
>> > java.lang.OutOfMemoryError: unable to create new native thread
>> >         at java.lang.Thread.start0(Native Method)
>> >         at java.lang.Thread.start(Thread.java:597)
>> >         at org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
>> >
>> > I took thread dumps in the JVMs on all the other Cassandra servers in
>> > my cluster. They all have thousands of threads looking like this:
>> >
>> > "JMX server connection timeout 183373" daemon prio=10
>> > tid=0x00002aad230db800 nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
>> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>> >         at java.lang.Object.wait(Native Method)
>> >         at com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
>> >         - locked <0x00002aab056ccee0> (a [I)
>> >         at java.lang.Thread.run(Thread.java:619)
>> >
>> > It seems to me that there is a JMX thread leak in Cassandra. NodeProbe
>> > creates a JMXConnector but never calls its close() method. I tried
>> > setting jmx.remote.x.server.connection.timeout to 0, hoping that would
>> > disable the JMX server connection timeout threads. But that did not
>> > make any difference.
>> >
>> > Has anyone else seen this?
>> >
>> > Bill
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com