Re: Cassandra crashed - possible JMX threads leak

Jonathan Ellis Fri, 22 Oct 2010 13:33:48 -0700

Is the fix as simple as calling close() then?  Can you submit a patch for that?
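
For reference, a minimal sketch of what "calling close()" could look like on the client side. This is not the actual NodeProbe code: the class name JmxCloseSketch, the helpers countMBeans() and demo(), and the in-process connector server are all assumptions made for illustration. It uses try-with-resources (Java 7+); on an older JVM the same effect would need a try/finally block that calls connector.close().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical illustration only; not part of Cassandra.
public class JmxCloseSketch {

    // Opens a JMX connection, reads the MBean count, and always closes the connector.
    static int countMBeans(JMXServiceURL url) throws IOException {
        // JMXConnector is Closeable, so close() runs even if the query throws.
        try (JMXConnector connector = JMXConnectorFactory.connect(url, null)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            return mbeans.getMBeanCount();
        }
    }

    // Starts a throwaway in-process JMX server (standing in for the Cassandra
    // JMX port) and connects to it several times, closing each connection.
    static int demo() {
        JMXConnectorServer server = null;
        try {
            server = JMXConnectorServerFactory.newJMXConnectorServer(
                    new JMXServiceURL("service:jmx:rmi://"), null,
                    ManagementFactory.getPlatformMBeanServer());
            server.start();
            int count = 0;
            for (int i = 0; i < 5; i++) {
                count = countMBeans(server.getAddress());
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (server != null) {
                try {
                    server.stop();
                } catch (IOException ignored) {
                    // Best-effort shutdown of the demo server.
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("MBeans visible over JMX: " + demo());
    }
}
```

Closing the connector on every request is the alternative to sharing one long-lived connection; either approach should keep the number of open JMX connections bounded.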


On Fri, Oct 22, 2010 at 2:49 PM, Bill Au <bill.w...@gmail.com> wrote:
> Not with the nodeprobe or nodetool command because the JVM these two
> commands spawn has a very short life span.
>
> I am using a webapp to monitor my Cassandra cluster.  It pretty much uses
> the same code as the NodeCmd class.  For each incoming request, it creates a
> NodeProbe object and uses it to get various status information about the
> cluster.  I can reproduce the Cassandra JVM crash by issuing requests to this
> webapp in a bash while loop.  I took a deeper look and here is what I discovered:
>
> In the webapp when NodeProbe creates a JMXConnector to connect to the
> Cassandra JMX port, a thread
> (com.sun.jmx.remote.internal.ClientCommunicatorAdmin$Checker) is created and
> run in the webapp's JVM.  Meanwhile in the Cassandra JVM there is a
> com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout thread to
> time out the remote JMX connection.  However, since NodeProbe does not call
> JMXConnector.close(), the JMX client checker thread remains in the webapp's
> JVM even after the NodeProbe object has been garbage collected.  So this JMX
> connection is still considered open and that keeps the JMX timeout thread
> running inside the Cassandra JVM.  The number of JMX client checker threads
> in my webapp's JVM matches up with the number of JMX server timeout threads
> in my Cassandra JVM.  If I stop my webapp's JVM,
> all the JMX server timeout threads in my Cassandra JVM disappear after
> 2 minutes, the default timeout for a JMX connection.  This is why the
> problem cannot be reproduced by nodeprobe or nodetool.  Even though
> JMXConnector.close() is not called, the JVM exits shortly afterwards, so the
> JMX client checker thread does not stay around, and its corresponding JMX
> server timeout thread goes away after two minutes.  This is not the case with
> my webapp since its JVM keeps running, so all the JMX client checker threads
> keep running as well.  The threads keep piling up until Cassandra's JVM
> crashes.
>
> In my case I think I can change my webapp to use a static NodeProbe instead
> of creating a new one for every request.  That should get around the leak.
>
> However, I have seen the leak occur in another situation.  On more than one
> occasion when I restarted one node in a live multi-node cluster, I saw
> that the JMX server timeout threads quickly piled up (numbering in the
> thousands) in Cassandra's JVM.  It only happened on a live cluster that was
> servicing read and write requests.  I am guessing that hinted handoff might
> have something to do with it.  I am still trying to understand what is
> happening there.
>
> Bill
>
>
> On Wed, Oct 20, 2010 at 5:16 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> can you reproduce this by, say, running nodeprobe ring in a bash while
>> loop?
>>
>> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bill.w...@gmail.com> wrote:
>> > One of my Cassandra servers crashed with the following:
>> >
>> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419 CassandraDaemon.java (line 82) Uncaught exception in thread Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
>> > java.lang.OutOfMemoryError: unable to create new native thread
>> >         at java.lang.Thread.start0(Native Method)
>> >         at java.lang.Thread.start(Thread.java:597)
>> >         at org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
>> >
>> >
>> > I took thread dumps of the JVM on all the other Cassandra servers in my
>> > cluster.  They all have thousands of threads looking like this:
>> >
>> > "JMX server connection timeout 183373" daemon prio=10 tid=0x00002aad230db800 nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
>> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>> >         at java.lang.Object.wait(Native Method)
>> >         at com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
>> >         - locked <0x00002aab056ccee0> (a [I)
>> >         at java.lang.Thread.run(Thread.java:619)
>> >
>> > It seems to me that there is a JMX thread leak in Cassandra.  NodeProbe
>> > creates a JMXConnector but never calls its close() method.  I tried setting
>> > jmx.remote.x.server.connection.timeout to 0 hoping that would disable the
>> > JMX server connection timeout threads.  But that did not make any
>> > difference.
>> >
>> > Has anyone else seen this?
>> >
>> > Bill
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
