> Many many many of the threads are trying to talk to IPs that aren't in the
> cluster (I assume they are the IP's of dead hosts).

Are these IPs from before the upgrade? Are they IPs you expect to see?
Cross reference them with the output from nodetool gossipinfo to see why the
node thinks they should be used. Could you provide a list of the thread names?

One way to remove those IPs may be a rolling restart with
-Dcassandra.load_ring_state=false in the JVM opts at the bottom of
cassandra-env.sh.

The OutboundTcpConnection threads are created in pairs by the
OutboundTcpConnectionPool, which is created here:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502
The threads are started in the OutboundTcpConnectionPool constructor, so that
constructor is the place to check for the source of the leak.
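From memory, the code involved is shaped roughly like this. This is a
paraphrase, not a verbatim copy, so check it against the link above:

    // Paraphrased sketch of the 1.1-era code path, from memory.
    public OutboundTcpConnectionPool getConnectionPool(InetAddress to)
    {
        OutboundTcpConnectionPool cp = connectionManagers.get(to);
        if (cp == null)
        {
            // If two callers race here the losing pool is discarded,
            // but its two threads have already been started.
            connectionManagers.putIfAbsent(to, new OutboundTcpConnectionPool(to));
            cp = connectionManagers.get(to);
        }
        return cp;
    }

    // The pool constructor starts both connection threads immediately:
    OutboundTcpConnectionPool(InetAddress remoteEp)
    {
        id = remoteEp;
        cmdCon = new OutboundTcpConnection(this);   // named "WRITE-/a.b.c.d"
        cmdCon.start();
        ackCon = new OutboundTcpConnection(this);
        ackCon.start();
    }

If that matches the real source, note that nothing in this path tears a pool
down when an endpoint leaves the ring, and each thread blocks forever on
queue.take() - which is exactly the park/take stack you attached. Pools piling
up for dead IPs would fit what you are seeing.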
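For the load_ring_state option, the line goes with the other JVM_OPTS at the
bottom of cassandra-env.sh, something like:

    # Rebuild the ring view from gossip at startup instead of loading the
    # saved ring state; take this out again after the rolling restart.
    JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"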
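And for the thread names, rather than eyeballing the whole dump you could
count the WRITE- threads per endpoint over JMX. A rough sketch - it assumes
the default JMX port 7199 and the "WRITE-/a.b.c.d" thread naming from your
dump:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import java.util.Map;
    import java.util.TreeMap;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Connects to a Cassandra node over JMX and counts live threads whose
    // names start with "WRITE-", grouped by thread name (i.e. by IP).
    public class WriteThreadCount
    {
        public static void main(String[] args) throws Exception
        {
            String host = args.length > 0 ? args[0] : "localhost";
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try
            {
                MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
                ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                        mbsc, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

                Map<String, Integer> counts = new TreeMap<String, Integer>();
                for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds()))
                {
                    if (info == null || !info.getThreadName().startsWith("WRITE-"))
                        continue;
                    Integer c = counts.get(info.getThreadName());
                    counts.put(info.getThreadName(), c == null ? 1 : c + 1);
                }
                for (Map.Entry<String, Integer> e : counts.entrySet())
                    System.out.println(e.getValue() + "\t" + e.getKey());
            }
            finally
            {
                jmxc.close();
            }
        }
    }

Whatever IPs come back with large counts are the ones to cross reference
against nodetool gossipinfo and nodetool ring.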
Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 1/05/2013, at 2:18 AM, William Oberman <ober...@civicscience.com> wrote:

> I use phpcassa.
>
> I did a thread dump. 99% of the threads look very similar (I'm using 1.1.9
> in terms of matching source lines). The thread names are all like this:
> "WRITE-/10.x.y.z". There are a LOT of duplicates (in terms of the same IP).
> Many many many of the threads are trying to talk to IPs that aren't in the
> cluster (I assume they are the IP's of dead hosts). The stack trace is
> basically the same for them all, attached at the bottom.
>
> There are a lot of things I could talk about in terms of my situation, but
> what I think might be pertinent to this thread: I hit a "tipping point"
> recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge
> (rolling, one at a time). 7 of the 9 upgraded fine and work great. 2 of the
> 9 keep struggling. I've replaced them many times now, each time using this
> process:
> http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
> And even this morning the only two nodes with a high number of threads are
> those two (yet again). And at some point they'll OOM.
>
> Seems like there is something about my cluster (caused by the recent
> upgrade?) that causes a thread leak on OutboundTcpConnection. But I don't
> know how to escape from the trap. Any ideas?
>
> --------
> stackTrace = [ {
>     className = sun.misc.Unsafe;
>     fileName = Unsafe.java;
>     lineNumber = -2;
>     methodName = park;
>     nativeMethod = true;
> }, {
>     className = java.util.concurrent.locks.LockSupport;
>     fileName = LockSupport.java;
>     lineNumber = 158;
>     methodName = park;
>     nativeMethod = false;
> }, {
>     className = java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
>     fileName = AbstractQueuedSynchronizer.java;
>     lineNumber = 1987;
>     methodName = await;
>     nativeMethod = false;
> }, {
>     className = java.util.concurrent.LinkedBlockingQueue;
>     fileName = LinkedBlockingQueue.java;
>     lineNumber = 399;
>     methodName = take;
>     nativeMethod = false;
> }, {
>     className = org.apache.cassandra.net.OutboundTcpConnection;
>     fileName = OutboundTcpConnection.java;
>     lineNumber = 104;
>     methodName = run;
>     nativeMethod = false;
> } ];
> ----------
>
> On Mon, Apr 29, 2013 at 4:31 PM, aaron morton <aa...@thelastpickle.com> wrote:
>>> I used JMX to check current number of threads in a production cassandra
>>> machine, and it was ~27,000.
>> That does not sound too good.
>>
>> My first guess would be lots of client connections. What client are you
>> using, does it do connection pooling?
>> See the comments in cassandra.yaml around rpc_server_type; the default
>> (sync) uses one thread per connection, you may be better with HSHA. But if
>> your app is leaking connections you should probably deal with that first.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 30/04/2013, at 3:07 AM, William Oberman <ober...@civicscience.com> wrote:
>>
>>> Hi,
>>>
>>> I'm having some issues. I keep getting:
>>> ------------
>>> ERROR [GossipStage:1] 2013-04-28 07:48:48,876 AbstractCassandraDaemon.java
>>> (line 135) Exception in thread Thread[GossipStage:1,5,main]
>>> java.lang.OutOfMemoryError: unable to create new native thread
>>> --------------
>>> after a day or two of runtime. I've checked and my system settings seem
>>> acceptable:
>>> memlock=unlimited
>>> nofiles=100000
>>> nproc=122944
>>>
>>> I've messed with heap sizes from 6-12GB (15 physical, m1.xlarge in AWS),
>>> and I keep OOM'ing with the above error.
>>>
>>> I've found some (what seem to me to be) obscure references to the stack
>>> size interacting with the # of threads. If I'm understanding it correctly,
>>> to reason about Java mem usage I have to think of OS + heap as being
>>> locked down, and the stack gets the "leftovers" of physical memory and
>>> each thread gets a stack.
>>>
>>> For me, the system ulimit setting on stack is 10240k (no idea if Java sees
>>> or respects this setting). My -Xss for cassandra is the default (I hope,
>>> don't remember messing with it) of 180k. I used JMX to check the current
>>> number of threads in a production cassandra machine, and it was ~27,000.
>>> Is that a normal thread count? Could my OOM be related to stack + number
>>> of threads, or am I overlooking something more simple?
>>>
>>> will
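P.S. On the stack question at the bottom of the thread: as a rough sanity
check, and assuming each thread really does reserve the full -Xss, the numbers
you quoted work out to

    27,000 threads * 180k per stack = ~4.9GB of stack space

outside the heap, which with a 6-12GB heap on a 15GB box leaves very little
headroom. Also, "unable to create new native thread" usually means the OS
refused the thread (address space, or the nproc / threads-max limits) rather
than the Java heap filling up, so getting the thread count down matters more
than tuning the heap.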