On 6/6/2011 11:25 PM, Benjamin Coverston wrote:
>> Currently, my data dir has about 16 sets of .db files. I thought that
>> compaction (with nodetool) would clean up these files, but it doesn't.
>> Neither does cleanup or repair.
> You're not even talking about snapshots taken with nodetool snapshot yet.
> Also, nodetool compact does compact all of the live files; however, the
> compacted SSTables will not be cleaned up until a garbage collection is
> triggered or a capacity threshold is met.
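To see that on disk, here is a rough sketch (mine, not anything shipped
with Cassandra) that scans a keyspace's data directory and reports which
SSTable generations are already obsolete but still waiting to be deleted.
It assumes the component naming used by releases of roughly this vintage,
where a zero-length <ColumnFamily>-<generation>-Compacted marker flags an
SSTable that compaction has superseded; adjust the pattern if your version
names its files differently.

#!/usr/bin/env python
# Rough sketch: report which SSTable generations in a Cassandra data
# directory are obsolete (already compacted away) but not yet deleted.
# Assumption: a zero-length "<CF>-<generation>-Compacted" marker file
# flags an obsolete SSTable, as in releases of roughly this vintage.
import os
import re
import sys
from collections import defaultdict

# Component files look like <CF>-<generation>-<Component>.db, plus an
# optional "<CF>-<generation>-Compacted" marker with no extension.
PATTERN = re.compile(r'^(?P<cf>.+)-(?P<gen>\d+)-(?P<comp>[A-Za-z]+)(\.db)?$')

def scan(data_dir):
    sizes = defaultdict(int)   # (cf, generation) -> bytes still on disk
    obsolete = set()           # generations flagged by a Compacted marker
    for name in os.listdir(data_dir):
        m = PATTERN.match(name)
        if not m:
            continue
        key = (m.group('cf'), int(m.group('gen')))
        sizes[key] += os.path.getsize(os.path.join(data_dir, name))
        if m.group('comp') == 'Compacted':
            obsolete.add(key)
    return sizes, obsolete

if __name__ == '__main__':
    sizes, obsolete = scan(sys.argv[1])
    for (cf, gen), nbytes in sorted(sizes.items()):
        state = 'obsolete, awaiting delete' if (cf, gen) in obsolete else 'live'
        print('%s-%d: %.1f MB (%s)' % (cf, gen, nbytes / 1e6, state))

Anything it reports as obsolete should disappear on its own once a GC
runs (a node restart also clears it out); there is no need to remove
anything by hand.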
OK, so after a compaction, Cass is still not done with the older sets of
.db files and I should let it delete them? But I thought one of the main
purposes of compaction was to reclaim disk space. I'm only playing around
with a small data set, so I can't tell how fast the data grows, and I'm
trying to plan my storage requirements. Is each newly generated set as
large as the previous one?
The reason I ask is that it seems a snapshot is...
>> Q1: Should the files with the lower index #'s (under the
>> data/{keyspace} directory) be manually deleted? Or do ALL of the files
>> in this directory need to be backed up?
> Do not ever delete files in your data directory if you care about data
> on that replica, unless they are from a column family that no longer
> exists on that server. There may be some duplicate data in the files,
> but if the files are in the data directory, as a general rule, they
> are there because they contain some set of data that is in none of the
> other SSTables.
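Benjamin's point is easier to see with a toy model (plain Python, not
Cassandra code): a read merges every SSTable generation and keeps the
newest value per column, so an older file can still hold columns that no
newer file ever rewrote, and deleting it would silently drop them.

# Toy model only -- not Cassandra code.  Each "sstable" maps
# row key -> {column: (timestamp, value)}; a read merges all generations
# and keeps the newest timestamp for each column.
def merged_read(row_key, generations):
    result = {}
    for sstable in generations:                     # oldest to newest
        for col, (ts, val) in sstable.get(row_key, {}).items():
            if col not in result or ts > result[col][0]:
                result[col] = (ts, val)
    return result

gen1 = {'user:42': {'name': (10, 'alice'), 'email': (10, 'a@example.org')}}
gen2 = {'user:42': {'email': (20, 'alice@example.org')}}  # only email rewritten

print(merged_read('user:42', [gen1, gen2]))
# -> {'name': (10, 'alice'), 'email': (20, 'alice@example.org')}
# Dropping gen1 because gen2 is "newer" would lose the 'name' column.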
... It seems a snapshot is implemented, unsurprisingly, as just a link to
the latest (highest-indexed) set, not the previous sets. So, obviously,
only the latest *.db files will get backed up. Therefore, the previous
sets must be worthless.
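Rather than guessing from file names, one way to check exactly what a
snapshot references is to compare inode numbers, since a hard link shares
its inode with the live file. A small sketch; the two directory arguments
are whatever your keyspace's data directory and one of its snapshot
directories actually are:

# Sketch: for each file in a snapshot directory, report whether it is a
# hard link to a file still present in the live data directory
# (same device and inode number) or is only referenced by the snapshot.
import os
import sys

def inode_map(directory):
    ids = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            st = os.stat(path)
            ids[name] = (st.st_dev, st.st_ino)
    return ids

if __name__ == '__main__':
    live_dir, snapshot_dir = sys.argv[1], sys.argv[2]
    live_inodes = set(inode_map(live_dir).values())
    for name, ident in sorted(inode_map(snapshot_dir).items()):
        if ident in live_inodes:
            print('%s: hard link to a live SSTable' % name)
        else:
            print('%s: only referenced by the snapshot' % name)

For example, if you saved the sketch as check_snapshot.py (the script name,
keyspace, and snapshot tag below are illustrative; only /var/lib/cassandra/data
is the stock default):
python check_snapshot.py /var/lib/cassandra/data/MyKeyspace
/var/lib/cassandra/data/MyKeyspace/snapshots/<tag>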