Hi all

Running Cassandra 1.0.7, I recently changed a few read-heavy column
families from SizeTieredCompactionStrategy to LeveledCompactionStrategy and
added SnappyCompressor, all with defaults, so 5MB SSTables and, if memory
serves me correctly, a 64KB chunk size for compression.
The results were amazingly good: my data size halved, and my heap usage and
performance stabilised nicely, at least until it came time to run a repair.
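
For anyone wanting the exact incantation, the change via cassandra-cli
looked roughly like this (the column family name is just a placeholder and
I'm quoting the syntax from memory, so treat it as approximate):

  update column family ReadHeavyCF
    with compaction_strategy = 'LeveledCompactionStrategy'
    and compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};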

When a repair isn't running I'm seeing a saw-toothed pattern on my heap
graphs, with CMS clearing out about 1.5GB each GC run. The CMS GC appears as
a sudden vertical drop on the Old Gen usage graph. In addition to what I
consider a healthy-looking heap usage, my ParNew and CMS collections are
running far quicker than before I made the changes.

However, when I run a repair my CMS usage graph no longer shows sudden
drops but rather gradual slopes, and each GC only manages to clear around
300MB. This seems to occur on 2 other nodes in my cluster around the same
time; I assume this is because they're the replicas (we use 3 replicas).
ParNew collections look about the same on my graphs with or without repair
running, so no trouble there as far as I can tell.
The symptom of the memory pressure during repair is that either the node
running the repair or one of the two replicas tends to perform badly, with
the read stage backing up into the thousands at times.
If I run a repair on more than one or two nodes at the same time (it's a
7-node cluster), the memory pressure is so bad that half the cluster ends up
OOMing. That happened during off-peak hours, when we're doing about half the
reads we handle during peak, so the cluster wasn't particularly loaded.

The question I'm asking is: has anyone run into this behaviour before, and
if so, how was it dealt with?

Once I have nursed the cluster through the repair it's currently running, I
will be turning off compression on one of my larger CFs to see if it makes a
difference; I'll send the results of that test tomorrow.
