>  Many many many of the threads are trying to talk to IPs that aren't in the
> cluster (I assume they are the IPs of dead hosts).
Are these IPs from before the upgrade? Are they IPs you expect to see?

Cross-reference them with the output of nodetool gossipinfo to see why the
node thinks they should be used.
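
For example (illustrative output only, your tokens and application states
will differ):

--------
$ nodetool -h localhost gossipinfo
/10.x.y.z
  STATUS:NORMAL,113427455640312821154458202477256070485
  LOAD:1.0E8
  RELEASE_VERSION:1.1.9
--------

A host that has left the ring but is still remembered by gossip will
typically show a STATUS other than NORMAL (e.g. LEFT or removed).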
Could you provide a list of the thread names?

One way to remove those IPs may be to do a rolling restart with
-Dcassandra.load_ring_state=false in the JVM opts at the bottom of
cassandra-env.sh.
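
i.e. something like this at the bottom of cassandra-env.sh (and remove it
again once the restart is done, so the node loads saved ring state normally
on later restarts):

--------
JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"
--------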

The OutboundTcpConnection threads are created in pairs by the 
OutboundTcpConnectionPool, which is created here 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502
The threads are created in the OutboundTcpConnectionPool constructor, so it's
worth checking whether this could be the source of the leak.
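
To make that concrete, here is a rough sketch (my simplification, not the
actual Cassandra source) of how a per-endpoint pool whose constructor starts
threads could leak two parked threads per stale IP:

--------
// A rough sketch of the suspected pattern -- NOT the real Cassandra code.
// One pool per endpoint; the pool's constructor starts two threads
// (command + ack). If pools are created on demand for any endpoint gossip
// mentions, and never torn down when a host dies, every stale IP holds two
// parked "WRITE-/x.x.x.x" threads forever.
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.LinkedBlockingQueue;

class MessagingSketch
{
    private final ConcurrentMap<InetAddress, PoolSketch> pools =
            new ConcurrentHashMap<InetAddress, PoolSketch>();

    PoolSketch getConnectionPool(InetAddress to)
    {
        PoolSketch pool = pools.get(to);
        if (pool == null)
        {
            // Note the race: a pool that loses putIfAbsent has already
            // started its two threads, and nothing here ever stops them.
            pools.putIfAbsent(to, new PoolSketch(to));
            pool = pools.get(to);
        }
        return pool;
    }

    static class PoolSketch
    {
        final LinkedBlockingQueue<Object> queue = new LinkedBlockingQueue<Object>();
        final Thread cmdThread;
        final Thread ackThread;

        PoolSketch(InetAddress to)
        {
            Runnable drain = new Runnable()
            {
                public void run()
                {
                    try
                    {
                        while (true)
                            queue.take(); // parks here, matching the trace below
                    }
                    catch (InterruptedException e)
                    {
                        // nothing interrupts these threads in the leak scenario
                    }
                }
            };
            cmdThread = new Thread(drain, "WRITE-" + to);
            ackThread = new Thread(drain, "WRITE-" + to);
            cmdThread.start();
            ackThread.start();
        }
    }
}
--------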

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 1/05/2013, at 2:18 AM, William Oberman <ober...@civicscience.com> wrote:

> I use phpcassa.
> 
> I did a thread dump.  99% of the threads look very similar (I'm on 1.1.9, in
> terms of matching source lines).  The thread names are all like this:
> "WRITE-/10.x.y.z".  There are a LOT of duplicates (in terms of the same IP).
> Many many many of the threads are trying to talk to IPs that aren't in the
> cluster (I assume they are the IPs of dead hosts).  The stack trace is
> basically the same for them all, attached at the bottom.
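> 
> (In case it's useful, a quick way to tally the duplicates from a jstack
> thread dump: jstack <pid> | grep 'WRITE-/' | sort | uniq -c | sort -rn.)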
> 
> There are a lot of things I could talk about in terms of my situation, but
> what I think might be pertinent to this thread: I hit a "tipping point" 
> recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge 
> (rolling, one at a time).  7 of the 9 upgraded fine and work great.  2 of the 
> 9 keep struggling.  I've replaced them many times now, each time using this 
> process:
> http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
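> (That's the approach of starting the replacement node with
> -Dcassandra.replace_token=<token>, if I'm remembering the doc right.)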
> And even this morning the only two nodes with a high number of threads are 
> those two (yet again).  And at some point they'll OOM.
> 
> Seems like there is something about my cluster (caused by the recent
> upgrade?) that causes a thread leak on OutboundTcpConnection.  But I don't
> know how to escape from the trap.  Any ideas?
> 
> 
> --------
>   stackTrace = [ { 
>     className = sun.misc.Unsafe;
>     fileName = Unsafe.java;
>     lineNumber = -2;
>     methodName = park;
>     nativeMethod = true;
>    }, { 
>     className = java.util.concurrent.locks.LockSupport;
>     fileName = LockSupport.java;
>     lineNumber = 158;
>     methodName = park;
>     nativeMethod = false;
>    }, { 
>     className = 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
>     fileName = AbstractQueuedSynchronizer.java;
>     lineNumber = 1987;
>     methodName = await;
>     nativeMethod = false;
>    }, { 
>     className = java.util.concurrent.LinkedBlockingQueue;
>     fileName = LinkedBlockingQueue.java;
>     lineNumber = 399;
>     methodName = take;
>     nativeMethod = false;
>    }, { 
>     className = org.apache.cassandra.net.OutboundTcpConnection;
>     fileName = OutboundTcpConnection.java;
>     lineNumber = 104;
>     methodName = run;
>     nativeMethod = false;
>    } ];
> ----------
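> 
> (So each of these threads is parked in LinkedBlockingQueue.take(), idly
> waiting for something to send -- they aren't busy, they just never exit.)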
> 
> 
> 
> 
> On Mon, Apr 29, 2013 at 4:31 PM, aaron morton <aa...@thelastpickle.com> wrote:
>>  I used JMX to check current number of threads in a production cassandra 
>> machine, and it was ~27,000.
> That does not sound too good. 
> 
> My first guess would be lots of client connections. What client are you
> using, and does it do connection pooling?
> See the comments in cassandra.yaml around rpc_server_type: the default,
> sync, uses one thread per connection; you may be better off with HSHA. But
> if your app is leaking connections you should probably deal with that first.
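> 
> (If you do want to experiment, the relevant cassandra.yaml setting would be
> rpc_server_type: hsha -- but deal with the connection leak first.)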
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 30/04/2013, at 3:07 AM, William Oberman <ober...@civicscience.com> wrote:
> 
>> Hi,
>> 
>> I'm having some issues.  I keep getting:
>> ------------
>> ERROR [GossipStage:1] 2013-04-28 07:48:48,876 AbstractCassandraDaemon.java 
>> (line 135) Exception in thread Thread[GossipStage:1,5,main]
>> java.lang.OutOfMemoryError: unable to create new native thread
>> --------------
>> after a day or two of runtime.  I've checked and my system settings seem 
>> acceptable:
>> memlock=unlimited
>> nofiles=100000
>> nproc=122944
>> 
>> I've messed with heap sizes from 6-12GB (15 physical, m1.xlarge in AWS), and 
>> I keep OOM'ing with the above error.
>> 
>> I've found some (what seem to me to be) obscure references to the stack
>> size interacting with the # of threads.  If I'm understanding it correctly,
>> to reason about Java mem usage I have to think of the OS + heap as being
>> locked down; the thread stacks get the "leftovers" of physical memory, and
>> each thread gets its own stack.
>> 
>> For me, the system ulimit setting on stack is 10240k (no idea if Java sees 
>> or respects this setting).  My -Xss for cassandra is the default (I hope, 
>> don't remember messing with it) of 180k.  I used JMX to check current number 
>> of threads in a production cassandra machine, and it was ~27,000.  Is that a 
>> normal thread count?  Could my OOM be related to stack + number of threads, 
>> or am I overlooking something more simple?
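>> 
>> (Back of the envelope: 27,000 threads x 180k of stack each is roughly 4.6GB
>> of stack space alone, before the heap and the OS -- which on a 15GB box
>> running a 6-12GB heap would at least make "unable to create new native
>> thread" plausible.)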
>> 
>> will
>> 