The issue below could result in abandoned threads under high contention, so 
we'll get that fixed. 

But we are not sure how or why it would be called so many times. If you could 
provide a full list of threads and the output from nodetool gossipinfo, that 
would help. 
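
If it helps, below is a rough, untested sketch of pulling the thread names over 
JMX with the standard java.lang.management API. It assumes the default JMX port 
of 7199 and no JMX authentication, and the class name is made up for 
illustration; running jstack <pid> against the Cassandra process gives much the 
same list.

--------
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Dumps every live thread name in a remote JVM, with a count per name so
// duplicate WRITE-/a.b.c.d threads stand out.
public class ThreadNameDump {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url, null);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            Map<String, Integer> counts = new TreeMap<String, Integer>();
            for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
                if (info == null) continue;               // thread exited between calls
                Integer c = counts.get(info.getThreadName());
                counts.put(info.getThreadName(), c == null ? 1 : c + 1);
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                System.out.println(e.getValue() + "  " + e.getKey());
            }
            System.out.println("total threads: " + threads.getThreadCount());
        } finally {
            connector.close();
        }
    }
}
--------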

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 1/05/2013, at 8:34 AM, aaron morton <aa...@thelastpickle.com> wrote:

>>  Many many many of the threads are trying to talk to IPs that aren't in the 
>> cluster (I assume they are the IPs of dead hosts). 
> Are these IPs from before the upgrade? Are they IPs you expect to see? 
> 
> Cross reference them with the output from nodetool gossipinfo to see why the 
> node thinks they should be used. 
> Could you provide a list of the thread names ? 
> 
> One way to remove those IPs may be a rolling restart with 
> -Dcassandra.load_ring_state=false in the JVM opts at the bottom of 
> cassandra-env.sh.
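> 
> For example, the usual pattern for adding a JVM option there looks like the 
> line below (double check against the comments in your own cassandra-env.sh; 
> this is just the common form, not a verbatim copy of your file):
> 
> JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"
> 
> The flag only changes what the node loads at startup, so it can be removed 
> again after the rolling restart.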
> 
> The OutboundTcpConnection threads are created in pairs by the 
> OutboundTcpConnectionPool, which is created here 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502
> The threads are created in the OutboundTcpConnectionPool constructor; I'm 
> checking to see if this could be the source of the leak.
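> 
> Very roughly, the pattern looks like the sketch below. It is simplified and 
> hypothetical (the names and structure do not match the real source); it just 
> shows how a per-endpoint pool whose constructor starts threads can leak them 
> under a race:
> 
> --------
> import java.net.InetAddress;
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ConcurrentMap;
> import java.util.concurrent.LinkedBlockingQueue;
> 
> class MessagingSketch {
>     private final ConcurrentMap<InetAddress, PoolSketch> pools =
>             new ConcurrentHashMap<InetAddress, PoolSketch>();
> 
>     PoolSketch getConnectionPool(InetAddress endpoint) {
>         PoolSketch pool = pools.get(endpoint);
>         if (pool == null) {
>             pool = new PoolSketch(endpoint);             // starts a thread pair
>             PoolSketch existing = pools.putIfAbsent(endpoint, pool);
>             if (existing != null) {
>                 // Lost the race: the pool we just built is dropped, but the
>                 // two threads its constructor started keep running forever.
>                 pool = existing;
>             }
>         }
>         return pool;
>     }
> 
>     static class PoolSketch {
>         final BlockingQueue<Object> queue = new LinkedBlockingQueue<Object>();
> 
>         PoolSketch(final InetAddress endpoint) {
>             // Two writer threads per endpoint, each parked in queue.take(),
>             // the same kind of frame as in the stack trace further down. A
>             // pool built for a dead endpoint therefore never goes away.
>             for (int i = 0; i < 2; i++) {
>                 new Thread(new Runnable() {
>                     public void run() {
>                         try {
>                             while (true) queue.take();   // block until a message arrives
>                         } catch (InterruptedException e) {
>                             Thread.currentThread().interrupt();
>                         }
>                     }
>                 }, "WRITE-/" + endpoint.getHostAddress()).start();
>             }
>         }
>     }
> }
> --------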
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 1/05/2013, at 2:18 AM, William Oberman <ober...@civicscience.com> wrote:
> 
>> I use phpcassa.
>> 
>> I did a thread dump.  99% of the threads look very similar (I'm using 1.1.9 
>> in terms of matching source lines).  The thread names are all like this: 
>> "WRITE-/10.x.y.z".  There are a LOT of duplicates (in terms of the same IP). 
>>  Many many many of the threads are trying to talk to IPs that aren't in the 
>> cluster (I assume they are the IPs of dead hosts).  The stack trace is 
>> basically the same for them all, attached at the bottom.   
>> 
>> There is a lot of things I could talk about in terms of my situation, but 
>> what I think might be pertinent to this thread: I hit a "tipping point" 
>> recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge 
>> (rolling, one at a time).  7 of the 9 upgraded fine and work great.  2 of 
>> the 9 keep struggling.  I've replaced them many times now, each time using 
>> this process:
>> http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
>> And even this morning the only two nodes with a high number of threads are 
>> those two (yet again).  And at some point they'll OOM.
>> 
>> Seems like there is something about my cluster (caused by the recent 
>> upgrade?) that causes a thread leak on OutboundTcpConnection.  But I don't 
>> know how to escape from the trap.  Any ideas?
>> 
>> 
>> --------
>>   stackTrace = [ { 
>>     className = sun.misc.Unsafe;
>>     fileName = Unsafe.java;
>>     lineNumber = -2;
>>     methodName = park;
>>     nativeMethod = true;
>>    }, { 
>>     className = java.util.concurrent.locks.LockSupport;
>>     fileName = LockSupport.java;
>>     lineNumber = 158;
>>     methodName = park;
>>     nativeMethod = false;
>>    }, { 
>>     className = 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
>>     fileName = AbstractQueuedSynchronizer.java;
>>     lineNumber = 1987;
>>     methodName = await;
>>     nativeMethod = false;
>>    }, { 
>>     className = java.util.concurrent.LinkedBlockingQueue;
>>     fileName = LinkedBlockingQueue.java;
>>     lineNumber = 399;
>>     methodName = take;
>>     nativeMethod = false;
>>    }, { 
>>     className = org.apache.cassandra.net.OutboundTcpConnection;
>>     fileName = OutboundTcpConnection.java;
>>     lineNumber = 104;
>>     methodName = run;
>>     nativeMethod = false;
>>    } ];
>> ----------
>> 
>> 
>> 
>> 
>> On Mon, Apr 29, 2013 at 4:31 PM, aaron morton <aa...@thelastpickle.com> 
>> wrote:
>>>  I used JMX to check current number of threads in a production cassandra 
>>> machine, and it was ~27,000.
>> That does not sound too good. 
>> 
>> My first guess would be lots of client connections. What client are you 
>> using, and does it do connection pooling?
>> See the comments in cassandra.yaml around rpc_server_type: the default, sync, 
>> uses one thread per connection, so you may be better off with HSHA. But if 
>> your app is leaking connections you should probably deal with that first. 
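>> 
>> The switch itself is a one-line change in cassandra.yaml (illustrative; 
>> check the comments in your own file):
>> 
>> # cassandra.yaml
>> rpc_server_type: hsha    # default "sync" uses one thread per client connection
>> 
>> hsha handles all client connections with a small, fixed number of threads 
>> instead of one thread per connection.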
>> 
>> Cheers
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>> 
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 30/04/2013, at 3:07 AM, William Oberman <ober...@civicscience.com> wrote:
>> 
>>> Hi,
>>> 
>>> I'm having some issues.  I keep getting:
>>> ------------
>>> ERROR [GossipStage:1] 2013-04-28 07:48:48,876 AbstractCassandraDaemon.java 
>>> (line 135) Exception in thread Thread[GossipStage:1,5,main]
>>> java.lang.OutOfMemoryError: unable to create new native thread
>>> --------------
>>> after a day or two of runtime.  I've checked and my system settings seem 
>>> acceptable:
>>> memlock=unlimited
>>> nofiles=100000
>>> nproc=122944
>>> 
>>> I've messed with heap sizes from 6-12GB (15 physical, m1.xlarge in AWS), 
>>> and I keep OOM'ing with the above error.
>>> 
>>> I've found some (what seem to me to be) obscure references to the stack 
>>> size interacting with the number of threads.  If I'm understanding it 
>>> correctly, to reason about Java memory usage I have to think of the OS + 
>>> heap as locked down; the stacks get the "leftovers" of physical memory, and 
>>> each thread gets its own stack.
>>> 
>>> For me, the system ulimit setting on stack is 10240k (no idea if java sees 
>>> or respects this setting).  My -Xss for cassandra is the default (I hope, 
>>> don't remember messing with it) of 180k.  I used JMX to check current 
>>> number of threads in a production cassandra machine, and it was ~27,000.  
>>> Is that a normal thread count?  Could my OOM be related to stack + number 
>>> of threads, or am I overlooking something more simple?
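>>> 
>>> (Back-of-the-envelope with those numbers: 27,000 threads x 180 KB of stack 
>>> each is close to 5 GB of native memory outside the heap. With a 6-12 GB 
>>> heap on a 15 GB box that leaves little room for new stacks, and running out 
>>> of native memory for a stack is exactly when the JVM reports "unable to 
>>> create new native thread".)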
>>> 
>>> will
>>> 
>> 
>> 
>> 
>> 
>> 
> 
