You're seeing dropped mutations reported by nodetool tpstats?
Take a look at the logs. Look for messages from the MessagingService with the
pattern "{} {} messages dropped in last {}ms". They will be followed by
information about the thread pool stats.
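For example, a filled-in line looks something like this (the count is invented,
and the layout prefix depends on your log config):

    INFO [ScheduledTasks:1] ... MessagingService.java ... 1274 MUTATION messages dropped in last 5000ms

The 5000ms is the logging interval, not a timeout.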
The first thing to check is the workload. Are you sending very big batch_mutate
or multiget requests? Each row in a request turns into a command in the
appropriate thread pool, so one large request can leave other requests waiting
a long time for their commands to be processed.
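If that is the culprit, splitting big batches on the client side usually helps.
A rough sketch using the Thrift API (BatchSplitter, mutateInChunks and
MAX_ROWS_PER_BATCH are made-up names; the chunk size and consistency level are
placeholders you would tune for your workload):

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.Mutation;

    public class BatchSplitter
    {
        private static final int MAX_ROWS_PER_BATCH = 50; // placeholder, tune this

        // Send a large mutation map as several smaller batch_mutate calls,
        // so no single request floods the mutation stage with commands.
        public static void mutateInChunks(Cassandra.Client client,
                Map<ByteBuffer, Map<String, List<Mutation>>> bigBatch) throws Exception
        {
            Map<ByteBuffer, Map<String, List<Mutation>>> chunk =
                    new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
            for (Map.Entry<ByteBuffer, Map<String, List<Mutation>>> row : bigBatch.entrySet())
            {
                chunk.put(row.getKey(), row.getValue());
                if (chunk.size() == MAX_ROWS_PER_BATCH)
                {
                    client.batch_mutate(chunk, ConsistencyLevel.QUORUM);
                    chunk.clear();  // safe: batch_mutate is synchronous
                }
            }
            if (!chunk.isEmpty())
                client.batch_mutate(chunk, ConsistencyLevel.QUORUM);
        }
    }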
Next, look for GC pressure and check that memtable_flush_queue_size is set
high enough (see the comments in cassandra.yaml for guidance).
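For reference, the stock yaml entry looks roughly like this (default from
memory; check your version):

    # the number of full memtables to allow pending flush, that is,
    # waiting for a writer thread; at a minimum, set this to the maximum
    # number of secondary indexes created on a single CF
    memtable_flush_queue_size: 4

When that queue fills up, writes stall waiting on flushes, which is one way
mutations end up dropped.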
After that I would look at winding concurrent_writes (and, I assume,
concurrent_reads) back. Any time I see weirdness I look for config changes and
see what happens when they are returned to the default (32) or near it. The
yaml suggests 8 * number_of_cores as a rule of thumb for concurrent_writes, so
100 only makes sense if you really have 16 _physical_ cores. Do you?
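On the thread churn from your first mail: with corePoolSize == maxPoolSize,
allowCoreThreadTimeOut(true) and a 60 second keepalive, a bursty workload will
create all the threads, let them die while idle, then create a fresh set on
the next burst. A minimal standalone sketch of that behaviour (plain
java.util.concurrent, nothing Cassandra-specific):

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class StagePoolSketch
    {
        public static void main(String[] args) throws InterruptedException
        {
            // corePoolSize == maxPoolSize == 100, keepalive 60s, as in StageManager
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    100, 100, 60, TimeUnit.SECONDS,
                    new LinkedBlockingQueue<Runnable>());
            pool.allowCoreThreadTimeOut(true); // even "core" threads die when idle

            Runnable noop = new Runnable() { public void run() { } };

            for (int i = 0; i < 1000; i++)  // burst: up to 100 threads created
                pool.execute(noop);
            Thread.sleep(70000);            // all of them time out while idle
            System.out.println("live threads: " + pool.getPoolSize()); // prints 0

            for (int i = 0; i < 1000; i++)  // next burst: 100 brand new threads
                pool.execute(noop);
            pool.shutdown();
        }
    }

The counter in the MutationStage:<n> thread names is bumped for every thread
ever created, so each burst adds another batch to it. That churn is mostly
cosmetic unless thread creation itself shows up as a cost.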
Hope that helps.
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
On 18/08/2012, at 10:01 AM, Guillermo Winkler <[email protected]> wrote:
> Aaron, thanks for your answer.
>
> I'm actually tracking a problem where mutations get dropped and cfstats shows
> no activity whatsoever. I have 100 threads for the mutation pool and no running
> or pending tasks, but some mutations get dropped nonetheless.
>
> I'm thinking about some scheduling problems but not really sure yet.
>
> Have you ever seen a case of dropped mutations with the system under light
> load?
>
> Thanks,
> Guille
>
>
> On Thu, Aug 16, 2012 at 8:22 PM, aaron morton <[email protected]> wrote:
> That's some pretty old code. I would guess it was done that way to conserve
> resources. And _I think_ thread creation is pretty lightweight.
>
> Jonathan / Brandon / others - opinions?
>
> Cheers
>
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2012, at 8:09 AM, Guillermo Winkler <[email protected]> wrote:
>
>> Hi, I have a Cassandra cluster where I'm seeing a lot of thread thrashing
>> in the mutation pool.
>>
>> MutationStage:72031
>>
>> Threads get created and disposed of in batches of 100 every few minutes.
>> Since it's a 16-core server, concurrent_writes is set to 100 in
>> cassandra.yaml.
>>
>> concurrent_writes: 100
>>
>> I've seen in the StageManager class that these pools get created with a
>> 60-second keepalive time.
>>
>> DebuggableThreadPoolExecutor -> allowCoreThreadTimeOut(true);
>>
>> StageManager-> public static final long KEEPALIVE = 60; // seconds to keep
>> "extra" threads alive for when idle
>>
>> Is there a reason for it to be this way?
>>
>> Why not have a fixed-size pool with Integer.MAX_VALUE as the keepalive,
>> since corePoolSize and maxPoolSize are set to the same size?
>>
>> Thanks,
>> Guille
>>
>
>