Not with the nodeprobe or nodetool command because the JVM these two
commands spawn has a very short life span.

I am using a webapp to monitor my Cassandra cluster.  It pretty much uses
the same code as the NodeCmd class.  For each incoming request, it creates a
NodeProbe object and uses it to get various status of the cluster.  I can
reproduce the Cassandra JVM crash by issuing requests to this webapp in a
bash while loop.  I took a deeper look and here is what I discovered:

In the webapp, when NodeProbe creates a JMXConnector to connect to the
Cassandra JMX port, a thread
(com.sun.jmx.remote.internal.ClientCommunicatorAdmin$Checker) is created and
run in the webapp's JVM.  Meanwhile, in the Cassandra JVM there is a
com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout thread to time
out the remote JMX connection.  However, since NodeProbe never calls
JMXConnector.close(), the JMX client checker thread remains in the webapp's
JVM even after the NodeProbe object has been garbage collected.  So the JMX
connection is still considered open, which keeps the JMX timeout thread
running inside the Cassandra JVM.  The number of JMX client checker threads
in my webapp's JVM matches the number of JMX server timeout threads in my
Cassandra JVM.  If I stop my webapp's JVM, all the JMX server timeout
threads in the Cassandra JVM disappear after 2 minutes, the default timeout
for a JMX connection.  This is why the problem cannot be reproduced with
nodeprobe or nodetool: even though JMXConnector.close() is not called, the
JVM exits shortly afterwards, so the JMX client checker threads do not stay
around and their corresponding JMX server timeout threads go away after two
minutes.  This is not the case with my webapp since its JVM keeps running,
so all the JMX client checker threads keep running as well.  The threads
keep piling up until they crash Cassandra's JVM.
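
For reference, closing the connector explicitly is what lets both threads go
away.  Here is a minimal sketch of the client side (the class name, host and
port are made up, and this is just the same javax.management.remote calls
that NodeProbe makes, not its actual code):

import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxCloseSketch
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical host/port; use whatever JMX port your Cassandra
        // nodes are configured with.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://cassandra-host:8080/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url, null);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // ... query the Cassandra MBeans the way NodeProbe does ...
        }
        finally
        {
            // Without this close(), the client-side Checker thread survives
            // even after the connector is garbage collected, and the matching
            // server-side Timeout thread keeps running in the Cassandra JVM.
            connector.close();
        }
    }
}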

In my case I think I can change my webapp to use a static NodeProbe instead
of creating a new one for every request.  That should get around the leak.
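
Roughly what I have in mind (a sketch only; SharedProbe is a made-up name
and the NodeProbe constructor arguments are from memory, so check them
against your Cassandra version):

import java.io.IOException;

import org.apache.cassandra.tools.NodeProbe;

public class SharedProbe
{
    private static NodeProbe probe;

    // Create NodeProbe once and reuse it for every request, so the webapp
    // holds a single JMX connection (one checker/timeout thread pair)
    // instead of leaking one pair per request.
    public static synchronized NodeProbe get(String host, int port)
            throws IOException, InterruptedException
    {
        if (probe == null)
            probe = new NodeProbe(host, port);
        return probe;
    }
}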

However, I have seen the leak occur in another situation.  On more than one
occasion when I restarted a node in a live multi-node cluster, I saw the JMX
server timeout threads quickly pile up (numbering in the thousands) in
Cassandra's JVM.  It only happened on a live cluster that was servicing read
and write requests.  I am guessing hinted handoff might have something to do
with it.  I am still trying to understand what is happening there.

Bill


On Wed, Oct 20, 2010 at 5:16 PM, Jonathan Ellis <jbel...@gmail.com> wrote:

> can you reproduce this by, say, running nodeprobe ring in a bash while
> loop?
>
> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bill.w...@gmail.com> wrote:
> > One of my Cassandra servers crashed with the following:
> >
> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
> > CassandraDaemon.java (line 82) Uncaught exception in thread
> > Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
> > java.lang.OutOfMemoryError: unable to create new native thread
> >         at java.lang.Thread.start0(Native Method)
> >         at java.lang.Thread.start(Thread.java:597)
> >         at
> >
> org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
> >
> >
> > I took thread dumps of the JVM on all the other Cassandra servers in my
> > cluster.  They all have thousands of threads looking like this:
> >
> > "JMX server connection timeout 183373" daemon prio=10
> tid=0x00002aad230db800
> > nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> >         at java.lang.Object.wait(Native Method)
> >         at
> >
> com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
> >         - locked <0x00002aab056ccee0> (a [I)
> >         at java.lang.Thread.run(Thread.java:619)
> >
> > It seems to me that there is a JMX thread leak in Cassandra.  NodeProbe
> > creates a JMXConnector but never calls its close() method.  I tried
> setting
> > jmx.remote.x.server.connection.timeout to 0 hoping that would disable the
> > JMX server connection timeout threads.  But that did not make any
> > difference.
> >
> > Has anyone else seen this?
> >
> > Bill
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>
