Re: Backups, Snapshots, SSTable Data Files, Compaction

Benjamin Coverston Mon, 06 Jun 2011 22:26:41 -0700

Hi AJ,

inline:


On 6/6/11 11:03 PM, AJ wrote:

Hi,
I am working on a backup strategy and am trying to understand what is going on in the data directory.
I notice that after a write to a CF and then a flush, a new set of data files is created with an index number incremented in their names, such as:
Initially:
Users-e-1-Filter.db
Users-e-1-Index.db
Users-e-1-Statistics.db

Then, after a write to the Users CF, followed by a flush:
Users-e-2-Filter.db
Users-e-2-Index.db
Users-e-2-Statistics.db
Currently, my data dir has about 16 sets. I thought that compaction (with nodetool) would clean up these files, but it doesn't. Neither does cleanup or repair.

You're not even talking about snapshots using nodetool snapshot yet. Also, nodetool compact does compact all of the live files; however, the compacted SSTables will not be cleaned up until a garbage collection is triggered or a capacity threshold is met.

Q1: Should the files with the lower index #'s (under the data/{keyspace} directory) be manually deleted? Or, do ALL of the files in this directory need to be backed-up?

Do not ever delete files in your data directory if you care about data on that replica, unless they are from a column family that no longer exists on that server. There may be some duplicate data in the files, but if the files are in the data directory, as a general rule, they are there because they contain some set of data that is in none of the other SSTables.
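
As a rough illustration of treating every set as part of the backup, here is a minimal Python sketch that groups file names by their index number. The name pattern (column family, a version letter, the index number, then the component) is only an assumption inferred from the example names above, and group_by_generation is a hypothetical helper, not part of Cassandra or nodetool.

```python
import re
from collections import defaultdict

# Assumed pattern, inferred from names like "Users-e-2-Filter.db":
# <column family>-<version>-<index number>-<component>.db
SSTABLE_NAME = re.compile(
    r"^(?P<cf>.+)-(?P<version>[a-z]+)-(?P<generation>\d+)-(?P<component>\w+)\.db$"
)

def group_by_generation(filenames):
    """Map (column family, index number) to the sorted list of its components."""
    groups = defaultdict(list)
    for name in filenames:
        match = SSTABLE_NAME.match(name)
        if match is None:
            continue  # not a file this sketch recognizes; leave it alone
        key = (match.group("cf"), int(match.group("generation")))
        groups[key].append(match.group("component"))
    return {key: sorted(components) for key, components in groups.items()}

# File names taken from the listing earlier in this thread.
files = [
    "Users-e-1-Filter.db", "Users-e-1-Index.db", "Users-e-1-Statistics.db",
    "Users-e-2-Filter.db", "Users-e-2-Index.db", "Users-e-2-Statistics.db",
]
sets = group_by_generation(files)
for (cf, generation), components in sorted(sets.items()):
    print(cf, generation, components)
```

Every set the sketch reports belongs in the backup; a lower index number does not mean the set is obsolete.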

Q2: Can someone elaborate on the structure of these files and whether they are interrelated? I'm guessing that maybe each incremental set is like an incremental or differential backup of the SSTable, but I'm not sure. The reason I ask is that I hope each set isn't a full copy of the data; e.g., if my data set size for a given CF is 1 TB, I will not end up with 16 TB worth of data files after 16 calls to flush... I suspect not, but I'm just double-checking ;o)

They are interrelated only in the sense that they contain data associated with the same column family. Each set may contain a complete, partial, or entirely independent set of the data depending on your write pattern and the frequency of minor compactions.
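
To make the overlap concrete, here is a toy Python model, not Cassandra's actual implementation. Each SSTable is represented as a dict from row key to a (timestamp, value) pair, and the hypothetical compact function keeps the newest value per key. It ignores columns, tombstones, and everything else a real compaction handles.

```python
def compact(sstables):
    """Merge toy SSTables, keeping the newest (timestamp, value) per key."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return merged

# Two flushes of the same column family (made-up data for illustration).
sstable_1 = {"alice": (1, "a@old.example"), "bob": (1, "b@example")}
sstable_2 = {"alice": (2, "a@new.example"), "carol": (2, "c@example")}

compacted = compact([sstable_1, sstable_2])
print(compacted)
```

In this model "alice" is duplicated across both sets, while "bob" exists only in the first one, so deleting the lower-numbered set by hand would lose that row.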

Q3: When and how are these extra files removed or reduced?

A GC or a threshold.


Thanks!


--
Ben Coverston
Director of Operations
DataStax -- The Apache Cassandra Company
http://www.datastax.com/
