That has GOT to be it. 1.1.10 upgrade it is...
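
(For anyone else hitting this: an easy way to watch for the leak is to tally
the "WRITE-/10.x.y.z" threads by remote IP. Below is a rough sketch of that
tally -- the class is hypothetical, not anything from Cassandra, and it has
to run inside (or be attached to) the affected JVM, since ThreadMXBean only
sees its own process.)

--------
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Cassandra: count OutboundTcpConnection
// writer threads per remote IP, using the "WRITE-/" naming seen in the dump.
public class WriteThreadTally {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, Integer> perEndpoint = new HashMap<String, Integer>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            String name = info.getThreadName();
            if (name != null && name.startsWith("WRITE-/")) {
                String ip = name.substring("WRITE-/".length());
                Integer n = perEndpoint.get(ip);
                perEndpoint.put(ip, n == null ? 1 : n + 1);
            }
        }
        // Duplicates per IP, or IPs that are no longer in the ring,
        // point at leaked connection threads rather than normal load.
        for (Map.Entry<String, Integer> e : perEndpoint.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue() + " thread(s)");
        }
    }
}
--------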
On Wed, May 1, 2013 at 5:09 PM, Janne Jalkanen <janne.jalka...@ecyrd.com> wrote:

> This sounds very much like
> https://issues.apache.org/jira/browse/CASSANDRA-5175, which was fixed in 1.1.10.
>
> /Janne
>
> On Apr 30, 2013, at 23:34, aaron morton <aa...@thelastpickle.com> wrote:
>
> Many many many of the threads are trying to talk to IPs that aren't in
> the cluster (I assume they are the IPs of dead hosts).
>
> Are these IPs from before the upgrade? Are they IPs you expect to see?
>
> Cross-reference them with the output of nodetool gossipinfo to see why
> the node thinks they should be used.
> Could you provide a list of the thread names?
>
> One way to remove those IPs may be a rolling restart with
> -Dcassandra.load_ring_state=false in the JVM opts at the bottom of
> cassandra-env.sh.
>
> The OutboundTcpConnection threads are created in pairs by the
> OutboundTcpConnectionPool, which is created here:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502
> The threads are started in the OutboundTcpConnectionPool constructor, so
> it is worth checking whether that could be the source of the leak.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 1/05/2013, at 2:18 AM, William Oberman <ober...@civicscience.com> wrote:
>
> I use phpcassa.
>
> I did a thread dump. 99% of the threads look very similar (I'm using
> 1.1.9 in terms of matching source lines). The thread names are all like
> this: "WRITE-/10.x.y.z". There are a LOT of duplicates (in terms of the
> same IP). Many many many of the threads are trying to talk to IPs that
> aren't in the cluster (I assume they are the IPs of dead hosts). The
> stack trace is basically the same for them all, attached at the bottom.
>
> There are a lot of things I could talk about in terms of my situation,
> but here is what I think might be pertinent to this thread: I hit a
> "tipping point" recently and upgraded a 9-node cluster from AWS m1.large
> to m1.xlarge (rolling, one at a time). 7 of the 9 upgraded fine and work
> great. 2 of the 9 keep struggling. I've replaced them many times now,
> each time using this process:
> http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
> Even this morning, the only two nodes with a high number of threads are
> those two (yet again). And at some point they'll OOM.
>
> It seems like there is something about my cluster (caused by the recent
> upgrade?) that causes a thread leak in OutboundTcpConnection, but I don't
> know how to escape from the trap. Any ideas?
>
> --------
> stackTrace = [ {
>     className = sun.misc.Unsafe;
>     fileName = Unsafe.java;
>     lineNumber = -2;
>     methodName = park;
>     nativeMethod = true;
>   }, {
>     className = java.util.concurrent.locks.LockSupport;
>     fileName = LockSupport.java;
>     lineNumber = 158;
>     methodName = park;
>     nativeMethod = false;
>   }, {
>     className = java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
>     fileName = AbstractQueuedSynchronizer.java;
>     lineNumber = 1987;
>     methodName = await;
>     nativeMethod = false;
>   }, {
>     className = java.util.concurrent.LinkedBlockingQueue;
>     fileName = LinkedBlockingQueue.java;
>     lineNumber = 399;
>     methodName = take;
>     nativeMethod = false;
>   }, {
>     className = org.apache.cassandra.net.OutboundTcpConnection;
>     fileName = OutboundTcpConnection.java;
>     lineNumber = 104;
>     methodName = run;
>     nativeMethod = false;
>   } ];
> ----------
>
> On Mon, Apr 29, 2013 at 4:31 PM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> I used JMX to check the current number of threads on a production Cassandra
>> machine, and it was ~27,000.
>>
>> That does not sound too good.
>>
>> My first guess would be lots of client connections. What client are you
>> using, and does it do connection pooling?
>> See the comments in cassandra.yaml around rpc_server_type: the default,
>> sync, uses one thread per connection, so you may be better off with HSHA.
>> But if your app is leaking connections you should probably deal with that
>> first.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 30/04/2013, at 3:07 AM, William Oberman <ober...@civicscience.com> wrote:
>>
>> Hi,
>>
>> I'm having some issues. I keep getting:
>> ------------
>> ERROR [GossipStage:1] 2013-04-28 07:48:48,876
>> AbstractCassandraDaemon.java (line 135) Exception in thread
>> Thread[GossipStage:1,5,main]
>> java.lang.OutOfMemoryError: unable to create new native thread
>> --------------
>> after a day or two of runtime. I've checked, and my system settings seem
>> acceptable:
>> memlock=unlimited
>> nofiles=100000
>> nproc=122944
>>
>> I've messed with heap sizes from 6-12GB (15GB physical, m1.xlarge in AWS),
>> and I keep OOM'ing with the above error.
>>
>> I've found some (what seem to me to be) obscure references to the stack
>> size interacting with the number of threads. If I'm understanding it
>> correctly, to reason about Java memory usage I have to think of OS + heap
>> as being locked down; the stacks get the "leftovers" of physical memory,
>> and each thread gets a stack.
>>
>> For me, the system ulimit setting on stack size is 10240k (no idea if Java
>> sees or respects this setting). My -Xss for Cassandra is the default (I
>> hope; I don't remember messing with it) of 180k. I used JMX to check the
>> current number of threads on a production Cassandra machine, and it was
>> ~27,000. Is that a normal thread count? Could my OOM be related to stack
>> size + number of threads, or am I overlooking something simpler?
>>
>> will
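
PS for the archives: it's worth spelling out the arithmetic hinted at above.
Thread stacks live outside the Java heap, so at the default -Xss of 180k,
~27,000 threads is roughly 27,000 x 180KB = ~4.6GB of stack memory on top of
a 6-12GB heap. "Unable to create new native thread" usually means the OS
couldn't give the JVM another thread stack (memory or a process limit), not
that the heap is full.

And for anyone curious what the leak itself looks like: the sketch below is
an illustration of the pattern only, not Cassandra's actual code (the class
and the second thread name are made up). A per-endpoint pool starts its two
threads in the constructor, and each thread parks in
LinkedBlockingQueue.take() exactly like the stack trace above. If a pool is
ever created for an endpoint without the old one being torn down -- say, for
an IP that has left the ring -- the parked threads accumulate until thread
creation fails, which is the behaviour CASSANDRA-5175 fixed in 1.1.10.

--------
import java.util.concurrent.LinkedBlockingQueue;

// Illustration only -- NOT Cassandra's real OutboundTcpConnectionPool.
public class LeakyEndpointPool {
    private final LinkedBlockingQueue<Object> queue =
            new LinkedBlockingQueue<Object>();

    public LeakyEndpointPool(String endpoint) {
        // Two threads per endpoint, started eagerly in the constructor.
        startWriter("WRITE-/" + endpoint);
        startWriter("ACK-/" + endpoint);   // second name is made up
    }

    private void startWriter(String name) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        queue.take(); // parks here forever if nothing arrives
                    }
                } catch (InterruptedException e) {
                    // exits only if someone remembers to interrupt it; drop
                    // the pool without doing that and the threads leak
                }
            }
        }, name);
        t.setDaemon(true);
        t.start();
    }
}
--------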