Hi,

I'm running a repair on a node in my Cassandra 3.7 cluster and today got
alerted on disk space usage. We keep the data and commit log directories on
separate EBS volumes; the data volume is 2TB. The node went down due to an
EBS failure on the commit log drive. I stopped the instance and was later
told by AWS support that the drive had recovered. When I started the node
back up it couldn't replay the commit logs due to corrupted data, so I
cleared the commit logs, after which it started up fine. I'm not worried
about any data that hadn't been flushed; I can replay that. Unfortunately,
I was just outside the hinted handoff window, so I decided to run a repair.
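
For reference, here is the sequence I followed, written out as a rough
Python sketch; the paths, the service command, and the keyspace name below
are placeholders rather than our actual setup:

#!/usr/bin/env python3
"""Rough sketch of the recovery steps; paths and names are placeholders."""
import pathlib
import shutil
import subprocess

COMMITLOG_DIR = pathlib.Path("/var/lib/cassandra/commitlog")   # assumed default location
PARKED_DIR = pathlib.Path("/var/lib/cassandra/commitlog.bad")  # segments moved aside, not deleted

def clear_commitlog():
    """Move the unreplayable commit log segments aside so startup can proceed."""
    PARKED_DIR.mkdir(exist_ok=True)
    for segment in COMMITLOG_DIR.glob("CommitLog-*.log"):
        shutil.move(str(segment), str(PARKED_DIR / segment.name))

def start_and_repair(keyspace="my_keyspace"):
    """Bring the node back up and kick off the repair."""
    subprocess.run(["sudo", "service", "cassandra", "start"], check=True)
    subprocess.run(["nodetool", "repair", keyspace], check=True)

if __name__ == "__main__":
    clear_commitlog()
    start_and_repair()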

Roughly 24 hours after I started the repair is when I got the alert on disk
space. I checked and saw that right before I started the repair the node was
using almost 1TB of space, which is right where all the other nodes sit, and
that over the course of 24 hours free space had dropped to about 200GB.

My gut reaction was that the repair must have caused this increase, but I'm
not convinced, since the disk usage has nearly doubled and continues to
grow. I had figured we would see at most an increase of about 2x the size of
an SSTable undergoing compaction, unless there's more to the disk usage
profile of a node during repair. We use SizeTieredCompactionStrategy on all
the tables in this keyspace.
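
To show the math behind that expectation (the largest-SSTable figure is a
guess based on the 292.82GB compaction mentioned below, not something I
measured):

#!/usr/bin/env python3
"""Back-of-the-envelope numbers from above."""

VOLUME_GB = 2048          # 2TB data volume
USED_BEFORE_GB = 1000     # roughly where all the nodes sit
FREE_NOW_GB = 200         # what the alert reported

used_now_gb = VOLUME_GB - FREE_NOW_GB
growth_gb = used_now_gb - USED_BEFORE_GB
print(f"grew by roughly {growth_gb}GB in 24 hours")           # ~850GB

# Worst case I expected: one STCS compaction holding its inputs and its
# output on disk at the same time, i.e. about 2x the SSTable being compacted.
largest_sstable_gb = 300    # guessed from the 292.82GB compaction
print(f"expected overhead of at most ~{2 * largest_sstable_gb}GB")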

Running nodetool compactionstats shows a higher-than-usual number of pending
compactions (currently 20), and one large compaction of 292.82GB that has
been moving slowly.
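
As a quick sketch of the kind of monitoring I can set up while the repair
runs (it assumes nodetool is on the PATH; the data directory path is a
placeholder for wherever the 2TB volume is mounted):

#!/usr/bin/env python3
"""Poll nodetool compactionstats and free space every few minutes."""
import shutil
import subprocess
import time

DATA_DIR = "/var/lib/cassandra/data"   # assumed mount point of the data volume

while True:
    out = subprocess.run(["nodetool", "compactionstats"],
                         capture_output=True, text=True).stdout
    pending = out.splitlines()[0] if out else "n/a"   # the "pending tasks: N" line
    free_gb = shutil.disk_usage(DATA_DIR).free / 1024**3
    print(f"{time.strftime('%H:%M:%S')}  {pending}  free={free_gb:.0f}GB")
    time.sleep(300)   # every 5 minutes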

Is it plausible that the repair is the cause of this sudden increase in
disk space usage? Are there any other things I can check that might provide
insight into what happened?

Thanks,
Paul
