The issue below could result in abandoned threads under high contention, so we'll get that fixed.
But we are not sure how or why it would be called so many times.

If you could provide a full list of threads and the output from nodetool gossipinfo, that would help.
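If it helps, a full thread dump from jstack <pid> is the quickest way to get that list. Alternatively, here is a rough sketch (not Cassandra code; the host IP is a placeholder and 7199 is the usual Cassandra JMX port) that pulls the live thread names over JMX, so duplicated "WRITE-" threads for the same IP stand out:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch only: connect to a node's JMX port and print every live thread
// name, so many "WRITE-/10.x.y.z" threads for one IP become obvious.
public class ThreadNameDump
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://10.0.0.1:7199/jmxrmi"); // placeholder host
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds()))
            {
                if (info != null) // a thread may have died between the two calls
                    System.out.println(info.getThreadName());
            }
            System.out.println("live threads: " + threads.getThreadCount());
        }
        finally
        {
            jmxc.close();
        }
    }
}

Counting duplicates per IP in that output (e.g. piping it through sort | uniq -c) would tell us how many thread pairs exist per endpoint.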
Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 1/05/2013, at 8:34 AM, aaron morton <aa...@thelastpickle.com> wrote:

>> Many many many of the threads are trying to talk to IPs that aren't in the
>> cluster (I assume they are the IPs of dead hosts).
> Are these IPs from before the upgrade? Are they IPs you expect to see?
>
> Cross-reference them with the output from nodetool gossipinfo to see why the
> node thinks they should be used.
> Could you provide a list of the thread names?
>
> One way to remove those IPs may be a rolling restart with
> -Dcassandra.load_ring_state=false in the JVM opts at the bottom of
> cassandra-env.sh, e.g. JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"
>
> The OutboundTcpConnection threads are created in pairs by the
> OutboundTcpConnectionPool, which is created here:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502
> The threads are created in the OutboundTcpConnectionPool constructor; I'm
> checking to see if this could be the source of the leak.
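>
> Roughly, that constructor looks like the following. This is a paraphrased
> sketch of the 1.1-era source (field names approximate, imports and the
> OutboundTcpConnection class itself omitted), not the exact code: each new
> pool eagerly starts two threads, so a pool created for a dead or phantom
> endpoint and never torn down leaves a pair of threads behind.
>
> class OutboundTcpConnectionPool // paraphrased sketch
> {
>     final InetAddress endPoint;
>     final OutboundTcpConnection cmdCon; // commands/mutations
>     final OutboundTcpConnection ackCon; // acks/responses
>
>     OutboundTcpConnectionPool(InetAddress remoteEp)
>     {
>         endPoint = remoteEp;
>         // both threads take their name from the remote IP, which is
>         // why duplicated "WRITE-/10.x.y.z" threads show up per endpoint
>         cmdCon = new OutboundTcpConnection(this);
>         cmdCon.start();
>         ackCon = new OutboundTcpConnection(this);
>         ackCon.start();
>     }
> }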
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 1/05/2013, at 2:18 AM, William Oberman <ober...@civicscience.com> wrote:
>
>> I use phpcassa.
>>
>> I did a thread dump. 99% of the threads look very similar (I'm using 1.1.9
>> in terms of matching source lines). The thread names are all like this:
>> "WRITE-/10.x.y.z". There are a LOT of duplicates (in terms of the same IP).
>> Many many many of the threads are trying to talk to IPs that aren't in the
>> cluster (I assume they are the IPs of dead hosts). The stack trace is
>> basically the same for them all, attached at the bottom.
>>
>> There are a lot of things I could talk about in terms of my situation, but
>> here is what I think might be pertinent to this thread: I hit a "tipping
>> point" recently and upgraded a 9-node cluster from AWS m1.large to
>> m1.xlarge (rolling, one at a time). 7 of the 9 upgraded fine and work
>> great. 2 of the 9 keep struggling. I've replaced them many times now, each
>> time using this process:
>> http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
>> And even this morning the only two nodes with a high number of threads are
>> those two (yet again). And at some point they'll OOM.
>>
>> Seems like there is something about my cluster (caused by the recent
>> upgrade?) that causes a thread leak on OutboundTcpConnection. But I don't
>> know how to escape from the trap. Any ideas?
>>
>> --------
>> stackTrace = [ {
>>   className = sun.misc.Unsafe;
>>   fileName = Unsafe.java;
>>   lineNumber = -2;
>>   methodName = park;
>>   nativeMethod = true;
>> }, {
>>   className = java.util.concurrent.locks.LockSupport;
>>   fileName = LockSupport.java;
>>   lineNumber = 158;
>>   methodName = park;
>>   nativeMethod = false;
>> }, {
>>   className = java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
>>   fileName = AbstractQueuedSynchronizer.java;
>>   lineNumber = 1987;
>>   methodName = await;
>>   nativeMethod = false;
>> }, {
>>   className = java.util.concurrent.LinkedBlockingQueue;
>>   fileName = LinkedBlockingQueue.java;
>>   lineNumber = 399;
>>   methodName = take;
>>   nativeMethod = false;
>> }, {
>>   className = org.apache.cassandra.net.OutboundTcpConnection;
>>   fileName = OutboundTcpConnection.java;
>>   lineNumber = 104;
>>   methodName = run;
>>   nativeMethod = false;
>> } ];
>> ----------
>>
>> On Mon, Apr 29, 2013 at 4:31 PM, aaron morton <aa...@thelastpickle.com> wrote:
>>> I used JMX to check current number of threads in a production cassandra
>>> machine, and it was ~27,000.
>> That does not sound too good.
>>
>> My first guess would be lots of client connections. What client are you
>> using, and does it do connection pooling?
>> See the comments in cassandra.yaml around rpc_server_type: the default,
>> sync, uses one thread per connection; you may be better off with HSHA. But
>> if your app is leaking connections you should probably deal with that
>> first.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 30/04/2013, at 3:07 AM, William Oberman <ober...@civicscience.com> wrote:
>>
>>> Hi,
>>>
>>> I'm having some issues. I keep getting:
>>> ------------
>>> ERROR [GossipStage:1] 2013-04-28 07:48:48,876 AbstractCassandraDaemon.java
>>> (line 135) Exception in thread Thread[GossipStage:1,5,main]
>>> java.lang.OutOfMemoryError: unable to create new native thread
>>> --------------
>>> after a day or two of runtime. I've checked, and my system settings seem
>>> acceptable:
>>> memlock=unlimited
>>> nofiles=100000
>>> nproc=122944
>>>
>>> I've messed with heap sizes from 6-12GB (15GB physical; m1.xlarge in AWS),
>>> and I keep OOM'ing with the above error.
>>>
>>> I've found some (what seem to me to be) obscure references to the stack
>>> size interacting with the number of threads. If I'm understanding it
>>> correctly, to reason about Java memory usage I have to think of OS + heap
>>> as being locked down; the stacks get the "leftovers" of physical memory,
>>> and each thread gets its own stack.
>>>
>>> For me, the system ulimit setting on stack is 10240k (no idea if Java sees
>>> or respects this setting). My -Xss for cassandra is the default (I hope; I
>>> don't remember messing with it) of 180k. I used JMX to check the current
>>> number of threads on a production cassandra machine, and it was ~27,000.
>>> Is that a normal thread count? Could my OOM be related to stack size + the
>>> number of threads, or am I overlooking something more simple?
>>>
>>> will
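>>
>> (A quick back-of-the-envelope on the numbers above, as a sanity check: this
>> sketch just multiplies the reported thread count by the -Xss stack size.
>> Both figures are taken from the mail above, and per-thread stacks live
>> outside the Java heap.)
>>
>> public class StackMath
>> {
>>     public static void main(String[] args)
>>     {
>>         long threadCount = 27000;      // ~27,000 threads reported via JMX
>>         long stackBytes = 180L * 1024; // -Xss180k per-thread stack
>>         double gb = threadCount * (double) stackBytes / (1024 * 1024 * 1024);
>>         // prints ~4.6 GB: reserved on top of a 6-12GB heap on a 15GB box,
>>         // which fits the "unable to create new native thread" failure
>>         System.out.printf("~%.1f GB for thread stacks alone%n", gb);
>>     }
>> }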