What's the output of 'nodetool tpstats' while this is happening? Specifically, is the Flush Writer "All time blocked" count increasing? If so, try turning up memtable_flush_writers and memtable_flush_queue_size and see if that helps.
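To make the "is it increasing?" check concrete, here is a minimal Python sketch that pulls the "All time blocked" value for the flush pool out of two tpstats captures and compares them. It assumes the usual column layout (pool name, then Active, Pending, Completed, Blocked, All time blocked) and that the pool is printed as FlushWriter; the sample text and the function names are made up for illustration and are not output from the cluster in question.

```python
def all_time_blocked(tpstats_text, pool="FlushWriter"):
    """Return the 'All time blocked' count for `pool`, or None if not found."""
    for line in tpstats_text.splitlines():
        fields = line.split()
        # Assumed row layout: name, Active, Pending, Completed, Blocked,
        # All time blocked (six whitespace-separated fields).
        if len(fields) == 6 and fields[0] == pool:
            return int(fields[5])
    return None


def blocked_is_increasing(earlier_text, later_text, pool="FlushWriter"):
    """True if the all-time-blocked counter grew between two captures."""
    earlier = all_time_blocked(earlier_text, pool)
    later = all_time_blocked(later_text, pool)
    if earlier is None or later is None:
        return False
    return later > earlier


# Illustrative captures only (hypothetical numbers).
HEADER = "Pool Name      Active   Pending   Completed   Blocked  All time blocked\n"
EARLIER = HEADER + "MutationStage       0         0      104500         0                 0\n" \
                 + "FlushWriter         1         4          56         1                12\n"
LATER = HEADER + "MutationStage       0         0      209000         0                 0\n" \
               + "FlushWriter         1         5          61         1                19\n"

if __name__ == "__main__":
    # In practice the text could come from something like:
    #   subprocess.run(["nodetool", "tpstats"], capture_output=True, text=True).stdout
    print(all_time_blocked(EARLIER), all_time_blocked(LATER))
    print(blocked_is_increasing(EARLIER, LATER))
```

Taking the two captures a minute or so apart while the deletes are running should be enough to see whether flushes are being blocked.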

On Sat, Feb 1, 2014 at 9:03 AM, Robert Wille <rwi...@fold3.com> wrote:

> A few days ago I posted about an issue I'm having where GC takes a long
> time (20-30 seconds), and it happens repeatedly and basically no work gets
> done. I've done further investigation, and I now believe that I know the
> cause. If I do a lot of deletes, it creates memory pressure until the
> memtables are flushed, but Cassandra doesn't flush them. If I manually
> flush, then life is good again (although that takes a very long time
> because of the GC issue). If I just leave the flushing to Cassandra, then I
> end up with death by GC. I believe that when the memtables are full of
> tombstones, Cassandra doesn't realize how much memory the memtables are
> actually taking up, and so it doesn't proactively flush them in order to
> free up heap.
>
> As I was deleting records out of one of my tables, I was watching it via
> nodetool cfstats, and I found a very curious thing:
>
>     Memtable cell count: 1285
>     Memtable data size, bytes: 0
>     Memtable switch count: 56
>
> As the deletion process was chugging away, the memtable cell count
> increased, as expected, but the data size stayed at 0. No flushing
> occurred.
>
> Here's the schema for this table:
>
>     CREATE TABLE bdn_index_pub (
>         tshard VARCHAR,
>         pord INT,
>         ord INT,
>         hpath VARCHAR,
>         page BIGINT,
>         PRIMARY KEY (tshard, pord)
>     ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
>     'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
>
> I have a few tables that I run this cleaning process on, and not all of
> them exhibit this behavior. One of them reported an increasing number of
> bytes, as expected, and it also flushed as expected. Here's the schema for
> that table:
>
>     CREATE TABLE bdn_index_child (
>         ptshard VARCHAR,
>         ord INT,
>         hpath VARCHAR,
>         PRIMARY KEY (ptshard, ord)
>     ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
>     'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
>
> In both cases, I'm deleting the entire record (i.e. specifying just the
> first component of the primary key in the delete statement). Most records
> in bdn_index_pub have 10,000 rows per record. bdn_index_child usually has
> just a handful of rows, but a few records can have up to 10,000.
>
> A further mystery: 1285 tombstones in the bdn_index_pub memtable
> don't seem like nearly enough to create a memory problem. Perhaps there
> are other flaws in the memory metering. Or perhaps there is some other
> issue that causes Cassandra to mismanage the heap when there are a lot of
> deletes. One other thought I had is that I page through these tables and
> clean them out as I go. Perhaps there is some interaction between the
> paging and the deleting that causes the GC problems, and I should create a
> list of keys to delete and then delete them after I've finished reading the
> entire table.
>
> I reduced memtable_total_space_in_mb from the default (probably 2.7 GB) to
> 1 GB, in hopes that it would force Cassandra to flush tables before I ran
> into death by GC, but it didn't seem to help.
>
> I'm using Cassandra 2.0.4.
>
> Any insights would be greatly appreciated. I can't be the only one that
> has periodic delete-heavy workloads. Hopefully someone else has run into
> this and can give advice.
>
> Thanks
>
> Robert

--
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com