You're seeing dropped mutations reported by nodetool tpstats? Take a look at the logs. Look for messages from the MessagingService with the pattern "{} {} messages dropped in last {}ms"; they will be followed by info about the thread pool stats.
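For example, a filled-in line would look something like the following (the count, verb and interval here are only illustrative, not taken from a real log):

    1024 MUTATION messages dropped in last 5000ms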
First would be the workload. Are you sending very big batch_mutate or multiget requests? Each row in the request turns into a command in the appropriate thread pool. This can result in other requests waiting a long time for their commands to get processed.

Next would be looking for GC pauses and checking that memtable_flush_queue_size is set high enough (see the docs in cassandra.yaml).

After that I would look at winding concurrent_writes (and, I assume, concurrent_reads) back. Any time I see weirdness I look for config changes and see what happens when they are returned to the defaults, or close to them.

Do you have 16 _physical_ cores?

Hope that helps.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 18/08/2012, at 10:01 AM, Guillermo Winkler <gwink...@inconcertcc.com> wrote:

> Aaron, thanks for your answer.
>
> I'm actually tracking a problem where mutations get dropped and cfstats show no activity whatsoever. I have 100 threads for the mutation pool and no running or pending tasks, but some mutations get dropped nonetheless.
>
> I'm thinking about some scheduling problem, but I'm not really sure yet.
>
> Have you ever seen a case of dropped mutations with the system under light load?
>
> Thanks,
> Guille
>
>
> On Thu, Aug 16, 2012 at 8:22 PM, aaron morton <aa...@thelastpickle.com> wrote:
> That's some pretty old code. I would guess it was done that way to conserve resources, and _I think_ thread creation is pretty lightweight.
>
> Jonathan / Brandon / others - opinions?
>
> Cheers
>
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2012, at 8:09 AM, Guillermo Winkler <gwink...@inconcertcc.com> wrote:
>
>> Hi, I have a Cassandra cluster where I'm seeing a lot of thread thrashing in the mutation pool:
>>
>> MutationStage:72031
>>
>> Threads get created and disposed of in batches of 100 every few minutes. Since it's a 16-core server, concurrent_writes is set to 100 in cassandra.yaml:
>>
>> concurrent_writes: 100
>>
>> I've seen in the StageManager class that these pools get created with a 60-second keepalive time:
>>
>> DebuggableThreadPoolExecutor -> allowCoreThreadTimeOut(true);
>>
>> StageManager -> public static final long KEEPALIVE = 60; // seconds to keep "extra" threads alive for when idle
>>
>> Is there a reason for it to be this way?
>>
>> Why not have a fixed-size pool with Integer.MAX_VALUE as the keepalive, since corePoolSize and maxPoolSize are set to the same value?
>>
>> Thanks,
>> Guille
>>
>
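For anyone following along, here is a minimal standalone sketch of the behaviour being asked about (plain java.util.concurrent, not the actual Cassandra code; the pool size and keepalive below are made-up demo values). With corePoolSize == maxPoolSize but allowCoreThreadTimeOut(true), idle threads are torn down once the keepalive expires, and brand-new threads with new names are created for the next burst of work, which is the churn that shows up as ever-increasing MutationStage:NNNNN names:

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class KeepAliveDemo {
        public static void main(String[] args) throws InterruptedException {
            // Fixed-size pool: corePoolSize == maxPoolSize, like the stage pools.
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    4, 4,                        // core == max (demo value)
                    2, TimeUnit.SECONDS,         // short keepalive so the effect shows quickly
                    new LinkedBlockingQueue<Runnable>());

            // With this flag set, idle *core* threads are also reclaimed once the
            // keepalive expires, instead of living for the life of the pool.
            pool.allowCoreThreadTimeOut(true);

            Runnable task = new Runnable() {
                public void run() {
                    System.out.println("ran on " + Thread.currentThread().getName());
                }
            };

            for (int burst = 0; burst < 3; burst++) {
                for (int i = 0; i < 4; i++) {
                    pool.execute(task);
                }
                // Stay idle longer than the keepalive: every thread times out and is
                // destroyed, so the next burst creates brand-new threads
                // (pool-1-thread-5, -6, ... rather than reusing -1 through -4).
                Thread.sleep(5000);
            }
            pool.shutdown();
        }
    }

With allowCoreThreadTimeOut left false (or an effectively infinite keepalive such as Integer.MAX_VALUE), the same threads would be reused across bursts, at the cost of keeping idle threads alive between them.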