You're seeing dropped mutations reported by nodetool tpstats?
Take a look at the logs. Look for messages from the MessagingService with the
pattern "{} {} messages dropped in last {}ms". They will be followed by
information about the thread pool stats.
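For example, a filled-in line looks something like this (the count is invented,
and the layout prefix depends on your log config):

    INFO [ScheduledTasks:1] ... MessagingService.java ... 1274 MUTATION messages dropped in last 5000ms

The 5000ms is the logging interval, not a timeout.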
The first thing to check is the workload. Are you sending very big batch_mutate
or multiget requests? Each row in a request turns into a command in the
appropriate thread pool, so one large request can leave other requests waiting
a long time for their commands to be processed.
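If that is the culprit, splitting big batches on the client side usually helps.
A rough sketch using the Thrift API (BatchSplitter, mutateInChunks and
MAX_ROWS_PER_BATCH are made-up names; the chunk size and consistency level are
placeholders you would tune for your workload):

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.Mutation;

    public class BatchSplitter
    {
        private static final int MAX_ROWS_PER_BATCH = 50; // placeholder, tune this

        // Send a large mutation map as several smaller batch_mutate calls,
        // so no single request floods the mutation stage with commands.
        public static void mutateInChunks(Cassandra.Client client,
                Map<ByteBuffer, Map<String, List<Mutation>>> bigBatch) throws Exception
        {
            Map<ByteBuffer, Map<String, List<Mutation>>> chunk =
                    new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
            for (Map.Entry<ByteBuffer, Map<String, List<Mutation>>> row : bigBatch.entrySet())
            {
                chunk.put(row.getKey(), row.getValue());
                if (chunk.size() == MAX_ROWS_PER_BATCH)
                {
                    client.batch_mutate(chunk, ConsistencyLevel.QUORUM);
                    chunk.clear();  // safe: batch_mutate is synchronous
                }
            }
            if (!chunk.isEmpty())
                client.batch_mutate(chunk, ConsistencyLevel.QUORUM);
        }
    }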
Next, look for GC pressure and check that memtable_flush_queue_size is set
high enough (see the comments in cassandra.yaml for guidance).
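For reference, the stock yaml entry looks roughly like this (default from
memory; check your version):

    # the number of full memtables to allow pending flush, that is,
    # waiting for a writer thread; at a minimum, set this to the maximum
    # number of secondary indexes created on a single CF
    memtable_flush_queue_size: 4

When that queue fills up, writes stall waiting on flushes, which is one way
mutations end up dropped.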
After that I would look at winding concurrent_writes (and, I assume,
concurrent_reads) back. Any time I see weirdness I look for config changes and
see what happens when they are returned to the default (32) or near it. The
yaml suggests 8 * number_of_cores as a rule of thumb for concurrent_writes, so
100 only makes sense if you really have 16 _physical_ cores. Do you?
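On the thread churn from your first mail: with corePoolSize == maxPoolSize,
allowCoreThreadTimeOut(true) and a 60 second keepalive, a bursty workload will
create all the threads, let them die while idle, then create a fresh set on
the next burst. A minimal standalone sketch of that behaviour (plain
java.util.concurrent, nothing Cassandra-specific):

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class StagePoolSketch
    {
        public static void main(String[] args) throws InterruptedException
        {
            // corePoolSize == maxPoolSize == 100, keepalive 60s, as in StageManager
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    100, 100, 60, TimeUnit.SECONDS,
                    new LinkedBlockingQueue<Runnable>());
            pool.allowCoreThreadTimeOut(true); // even "core" threads die when idle

            Runnable noop = new Runnable() { public void run() { } };

            for (int i = 0; i < 1000; i++)  // burst: up to 100 threads created
                pool.execute(noop);
            Thread.sleep(70000);            // all of them time out while idle
            System.out.println("live threads: " + pool.getPoolSize()); // prints 0

            for (int i = 0; i < 1000; i++)  // next burst: 100 brand new threads
                pool.execute(noop);
            pool.shutdown();
        }
    }

The counter in the MutationStage:<n> thread names is bumped for every thread
ever created, so each burst adds another batch to it. That churn is mostly
cosmetic unless thread creation itself shows up as a cost.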
Hope that helps.
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
On 18/08/2012, at 10:01 AM, Guillermo Winkler <[email protected]> wrote:
> Aaron, thanks for your answer.
>
> I'm actually tracking a problem where mutations get dropped and cfstats shows
> no activity whatsoever. I have 100 threads for the mutation pool and no running
> or pending tasks, but some mutations get dropped nonetheless.
>
> I'm thinking about some scheduling problems but not really sure yet.
>
> Have you ever seen a case of dropped mutations with the system under light
> load?
>
> Thanks,
> Guille
>
>
> On Thu, Aug 16, 2012 at 8:22 PM, aaron morton <[email protected]> wrote:
> That's some pretty old code. I would guess it was done that way to conserve
> resources. And _I think_ thread creation is pretty lightweight.
>
> Jonathan / Brandon / others - opinions?
>
> Cheers
>
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2012, at 8:09 AM, Guillermo Winkler <[email protected]> wrote:
>
>> Hi, I have a Cassandra cluster where I'm seeing a lot of thread thrashing
>> in the mutation pool.
>>
>> MutationStage:72031
>>
>> Threads get created and disposed of in batches of 100 every few minutes.
>> Since it's a 16-core server, concurrent_writes is set to 100 in
>> cassandra.yaml.
>>
>> concurrent_writes: 100
>>
>> I've seen in the StageManager class that these pools get created with a
>> 60-second keepalive time.
>>
>> DebuggableThreadPoolExecutor -> allowCoreThreadTimeOut(true);
>>
>> StageManager-> public static final long KEEPALIVE = 60; // seconds to keep
>> "extra" threads alive for when idle
>>
>> Is there a reason for it to be this way?
>>
>> Why not have a fixed-size pool with Integer.MAX_VALUE as the keepalive,
>> since corePoolSize and maxPoolSize are set to the same size?
>>
>> Thanks,
>> Guille
>>
>
>