Quote: "To be clear, "incremental backup" feature backs up the data being modified in that period, because it writes only those files to the incremental backup dir as hard links, between full snapshots."

I thought I had this clear, but your clarification confused me again.

My understanding, from all the answers I have received so far, is that a more accurate statement would be: the "incremental backup" feature backs up the SSTable files generated in that period. But there is no way to be sure that these SSTable files contain ONLY modified data, so the statement quoted above is not exactly right. I agree that all the data modified in that period will be in the incremental SSTable files, but a lot of other, unmodified data will be in them too.

Suppose 2 rows with different row keys are in the same memtable, and only the 2nd row is modified. When the memtable is flushed to an SSTable file, the file will contain both rows, and both will be in the incremental backup files. So nothing changed for the first row, but it will be in the incremental backup.

Likewise, if I have one row with one column and a new column is added, the whole row in the memtable is flushed to an SSTable file, which also goes into this incremental backup. Nothing changed for the first column, but it will still be in the incremental backup file.

The point I am trying to make is that this matters if I design an ETL to consume the incremental backup SSTable files. As in the examples above, I have to realize that the incremental backup SSTable files could, and most likely will, contain old data that has already been processed. That requires additional logic in the ETL to handle it, and any outside SSTable consumer has to pay attention to it.

Yong

Date: Tue, 17 Sep 2013 18:01:45 -0700
Subject: Re: questions related to the SSTable file
From: rc...@eventbrite.com
To: user@cassandra.apache.org
On Tue, Sep 17, 2013 at 5:46 PM, Takenori Sato <ts...@cloudian.com> wrote:

> So in fact, incremental backup of Cassandra is just hard link all the new
> SSTable files being generated during the incremental backup period. It could
> contain any data, not just the data being update/insert/delete in this
> period, correct?

Correct. But over time, some old enough SSTable files are usually shared across multiple snapshots.

To be clear, "incremental backup" feature backs up the data being modified in that period, because it writes only those files to the incremental backup dir as hard links, between full snapshots.

http://www.datastax.com/docs/1.0/operations/backup_restore

"When incremental backups are enabled (disabled by default), Cassandra hard-links each flushed SSTable to a backups directory under the keyspace data directory. This allows you to store backups offsite without transferring entire snapshots. Also, incremental backups combine with snapshots to provide a dependable, up-to-date backup mechanism."

What Takenori is referring to is that a full snapshot is in some ways an "incremental backup" because it shares hard linked SSTables with other snapshots.

=Rob
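
One piece of the consumer-side logic discussed in this thread can be sketched at the file level. Because backups and snapshots are hard links, every link to the same SSTable shares one inode, so an ETL could skip files it has already read by remembering (device, inode) pairs. This is only a sketch under assumptions: the directory and file names below are made up, the helpers file_identity and new_sstables are hypothetical and not part of Cassandra, and nothing here parses SSTable contents. It does not address old rows that reappear inside a newly flushed SSTable; that would still need row-level logic in the ETL.

```python
import os
import tempfile


def file_identity(path):
    """Return (device, inode), which is shared by every hard link to one file."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def new_sstables(directory, seen):
    """Yield paths in `directory` whose underlying file is not yet in `seen`."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        ident = file_identity(path)
        if ident not in seen:
            seen.add(ident)
            yield path


def demo():
    # Hypothetical layout: one data dir and two snapshot dirs in a temp folder.
    with tempfile.TemporaryDirectory() as root:
        data = os.path.join(root, "data")
        snap1 = os.path.join(root, "snapshot_1")
        snap2 = os.path.join(root, "snapshot_2")
        for d in (data, snap1, snap2):
            os.mkdir(d)

        # A first file is written, then hard-linked into snapshot_1.
        a = os.path.join(data, "a-Data.db")
        with open(a, "w") as f:
            f.write("row1")
        os.link(a, os.path.join(snap1, "a-Data.db"))

        # A second file is written; snapshot_2 links both the old and new file.
        b = os.path.join(data, "b-Data.db")
        with open(b, "w") as f:
            f.write("row2")
        os.link(a, os.path.join(snap2, "a-Data.db"))
        os.link(b, os.path.join(snap2, "b-Data.db"))

        # The consumer remembers what it has read across both passes.
        seen = set()
        first = [os.path.basename(p) for p in new_sstables(snap1, seen)]
        second = [os.path.basename(p) for p in new_sstables(snap2, seen)]
        return first, second


if __name__ == "__main__":
    print(demo())
```

In the demo, the second pass returns only b-Data.db, because a-Data.db in snapshot_2 is the same underlying file that was already read from snapshot_1. A real consumer would need to persist the `seen` set between runs.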