Aaron, thanks for your answer. We do have big batch updates, not always with the columns belonging to the same row (i.e. many threads are needed to handle the updates), but it did not represent a problem when the CFs had less data in them.
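As a rough sketch of what capping the fan-out of a single batch could look like on the client side, something like the following splits the mutation map into per-row chunks before each batch_mutate call (the chunk size and error handling are only illustrative, this is not our actual code):

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.Mutation;

public class ChunkedBatchWriter
{
    // Illustrative cap on how many rows go into a single batch_mutate call.
    private static final int ROWS_PER_BATCH = 50;

    public static void writeInChunks(Cassandra.Client client,
            Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap) throws Exception
    {
        Map<ByteBuffer, Map<String, List<Mutation>>> chunk =
                new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
        for (Map.Entry<ByteBuffer, Map<String, List<Mutation>>> row : mutationMap.entrySet())
        {
            chunk.put(row.getKey(), row.getValue());
            if (chunk.size() >= ROWS_PER_BATCH)
            {
                // Each row in a batch turns into its own command on the mutation
                // stage, so smaller batches avoid flooding the pool from a single
                // client request.
                client.batch_mutate(chunk, ConsistencyLevel.QUORUM);
                chunk = new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
            }
        }
        if (!chunk.isEmpty())
            client.batch_mutate(chunk, ConsistencyLevel.QUORUM);
    }
}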
One thing we did change in the past weeks was the memtable_flush_queue_size, in order to occupy less heap space with memtables. This was due to having received this warning message and some OOM exceptions:

logger.warn(String.format("Reducing %s capacity from %d to %s to reduce memory pressure", cacheType, getCapacity(), newCapacity));

Do you know any strategy to diagnose whether memtables flushing to disk and locking on the switchLock is the main cause of the dropped messages? I've gone through the source code but haven't seen any metrics reporting on maybeSwitchMemtable blocking times.

Thanks again,
Guille

On Sun, Aug 19, 2012 at 5:21 AM, aaron morton <aa...@thelastpickle.com> wrote:

> You're seeing dropped mutations reported from nodetool tpstats?
>
> Take a look at the logs. Look for messages from the MessagingService with
> the pattern "{} {} messages dropped in last {}ms". They will be followed by
> info about the TP stats.
>
> First would be the workload. Are you sending very big batch_mutate or
> multiget requests? Each row in the requests turns into a command in the
> appropriate thread pool. This can result in other requests waiting a long
> time for their commands to get processed.
>
> Next would be looking for GC and checking that memtable_flush_queue_size is
> set high enough (check the yaml for docs).
>
> After that I would look at winding concurrent_writes (and I assume
> concurrent_reads) back. Anytime I see weirdness I look for config changes
> and see what happens when they are returned to the default or near default.
> Do you have 16 _physical_ cores?
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 18/08/2012, at 10:01 AM, Guillermo Winkler <gwink...@inconcertcc.com>
> wrote:
>
> Aaron, thanks for your answer.
>
> I'm actually tracking a problem where mutations get dropped and cfstats
> show no activity whatsoever. I have 100 threads for the mutation pool, no
> running or pending tasks, but some mutations get dropped nonetheless.
>
> I'm thinking about some scheduling problems but not really sure yet.
>
> Have you ever seen a case of dropped mutations with the system under light
> load?
>
> Thanks,
> Guille
>
>
> On Thu, Aug 16, 2012 at 8:22 PM, aaron morton <aa...@thelastpickle.com>
> wrote:
>
>> That's some pretty old code. I would guess it was done that way to
>> conserve resources. And _I think_ thread creation is pretty lightweight.
>>
>> Jonathan / Brandon / others - opinions?
>>
>> Cheers
>>
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 17/08/2012, at 8:09 AM, Guillermo Winkler <gwink...@inconcertcc.com>
>> wrote:
>>
>> Hi, I have a Cassandra cluster where I'm seeing a lot of thread thrashing
>> in the mutation pool.
>>
>> MutationStage:72031
>>
>> Threads get created and disposed of in batches of 100 every few minutes.
>> Since it's a 16-core server, concurrent_writes is set to 100 in
>> cassandra.yaml:
>>
>> concurrent_writes: 100
>>
>> I've seen in the StageManager class that these pools get created with a
>> 60-second keepalive time:
>>
>> DebuggableThreadPoolExecutor -> allowCoreThreadTimeOut(true);
>>
>> StageManager -> public static final long KEEPALIVE = 60; // seconds to
>> keep "extra" threads alive for when idle
>>
>> Is there a reason for it to be this way?
>>
>> Why not have a fixed-size pool with Integer.MAX_VALUE as keepalive, since
>> corePoolSize and maxPoolSize are set to the same size?
>> Thanks,
>> Guille
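To illustrate the keepalive question at the bottom of the thread, here is a minimal standalone sketch using plain java.util.concurrent rather than Cassandra's DebuggableThreadPoolExecutor (pool size and timings are illustrative): with allowCoreThreadTimeOut(true) and a 60-second keepalive, even core threads are reclaimed when idle and recreated on the next burst of writes, which is what shows up as ever-growing thread names like MutationStage:72031.

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class KeepAliveSketch
{
    public static void main(String[] args)
    {
        int poolSize = 100; // mirrors concurrent_writes: 100

        // Roughly what StageManager sets up: core == max, 60s keepalive,
        // and core threads allowed to time out. Idle threads die after 60
        // seconds and fresh ones are created when the next burst arrives,
        // so the thread name counter keeps climbing.
        ThreadPoolExecutor cassandraStyle = new ThreadPoolExecutor(
                poolSize, poolSize, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>());
        cassandraStyle.allowCoreThreadTimeOut(true);

        // The alternative raised in the question: same size, but core threads
        // never time out, so they are created once and then reused.
        // (Executors.newFixedThreadPool(poolSize) behaves the same way.)
        ThreadPoolExecutor fixedStyle = new ThreadPoolExecutor(
                poolSize, poolSize, Integer.MAX_VALUE, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>());

        cassandraStyle.shutdown();
        fixedStyle.shutdown();
    }
}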