A few days ago I posted about an issue I'm having where GC takes a long time
(20-30 seconds), and it happens repeatedly and basically no work gets done.
I've done further investigation, and I now believe that I know the cause. If
I do a lot of deletes, it creates memory pressure until the memtables are
flushed, but Cassandra doesn't flush them. If I manually flush, then life is
good again (although that takes a very long time because of the GC issue).
If I just leave the flushing to Cassandra, then I end up with death by GC. I
believe that when the memtables are full of tombstones, Cassandra doesn't
realize how much memory the memtables are actually taking up, and so it
doesn't proactively flush them in order to free up heap.

As I was deleting records out of one of my tables, I was watching it via
nodetool cfstats, and I found a very curious thing:

                Memtable cell count: 1285
                Memtable data size, bytes: 0
                Memtable switch count: 56

As the deletion process was chugging away, the memtable cell count
increased, as expected, but the data size stayed at 0. No flushing occurred.

Here's the schema for this table:

CREATE TABLE bdn_index_pub (
    tshard VARCHAR,
    pord INT,
    ord INT,
    hpath VARCHAR,
    page BIGINT,
    PRIMARY KEY (tshard, pord)
) WITH gc_grace_seconds = 0
  AND compaction = { 'class' : 'LeveledCompactionStrategy',
                     'sstable_size_in_mb' : 160 };


I have a few tables that I run this cleaning process on, and not all of them
exhibit this behavior. One of them reported an increasing number of bytes,
as expected, and it also flushed as expected. Here's the schema for that
table:


CREATE TABLE bdn_index_child (
    ptshard VARCHAR,
    ord INT,
    hpath VARCHAR,
    PRIMARY KEY (ptshard, ord)
) WITH gc_grace_seconds = 0
  AND compaction = { 'class' : 'LeveledCompactionStrategy',
                     'sstable_size_in_mb' : 160 };


In both cases, I'm deleting the entire record (i.e. specifying just the
first component of the primary key in the delete statement). Most records in
bdn_index_pub have 10,000 rows each. bdn_index_child usually has just a
handful of rows per record, but a few records can have up to 10,000.
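
For concreteness, the deletes have roughly the following shape. This is only
a sketch: the key value 'example-shard' is a placeholder, not a real key from
my data.

```cql
-- Delete an entire record by giving only the first primary key component.
-- 'example-shard' is a hypothetical key value used for illustration.
DELETE FROM bdn_index_pub WHERE tshard = 'example-shard';
DELETE FROM bdn_index_child WHERE ptshard = 'example-shard';
```

Each statement removes every row stored under that key in a single delete,
rather than deleting the rows one at a time.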

A further mystery: 1285 tombstones in the bdn_index_pub memtable don't seem
like nearly enough to create a memory problem. Perhaps there are other flaws
in the memory metering. Or perhaps there is some other issue that causes
Cassandra to mismanage the heap when there are a lot of deletes. One other
thought I had is that I page through these tables and clean them out as I
go. Perhaps there is some interaction between the paging and the deleting
that causes the GC problems, and I should instead build a list of keys to
delete and then delete them after I've finished reading the entire table.

I reduced memtable_total_space_in_mb from the default (probably 2.7 GB) to 1
GB, in hopes that it would force Cassandra to flush tables before I ran into
death by GC, but it didn't seem to help.

I'm using Cassandra 2.0.4.

Any insights would be greatly appreciated. I can't be the only one who has
periodic delete-heavy workloads. Hopefully someone else has run into this
and can give advice.

Thanks

Robert

