Yong,

It seems there is still a misunderstanding.

> But there is no way we can be sure that these SSTable files will ONLY
contain modified data. So the statement being quoted above is not exactly
right. I agree that all the modified data in that period will be in the
incremental sstable files, but a lot of other unmodified data will be in
them too.

A memtable (which becomes a new SSTable when flushed) contains only the data
modified since the last flush, as I explained with the example.
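
Here is a toy sketch of that behavior (my own illustration, not Cassandra's
actual classes): the memtable only ever holds what was written since the
previous flush, and flush() turns exactly that into a new SSTable.

import java.util.LinkedHashMap;
import java.util.Map;

// Toy model: the memtable holds ONLY the columns written since the last
// flush. flush() turns exactly that content into a new "SSTable" and
// starts over empty. This is an illustration, not Cassandra's code.
public class ToyMemtable {
    private Map<String, Map<String, String>> pending =
            new LinkedHashMap<String, Map<String, String>>();

    public void write(String rowKey, String column, String value) {
        Map<String, String> row = pending.get(rowKey);
        if (row == null) {
            row = new LinkedHashMap<String, String>();
            pending.put(rowKey, row);
        }
        row.put(column, value);
    }

    // The new SSTable contains nothing but what was written since the
    // previous flush.
    public Map<String, Map<String, String>> flush() {
        Map<String, Map<String, String>> sstable = pending;
        pending = new LinkedHashMap<String, Map<String, String>>();
        return sstable;
    }
}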

> If we have 2 rows with different row keys in the same memtable, and only
the 2nd row is modified, then when the memtable is flushed to an SSTable
file it will contain both rows, and both will be in the incremental backup
files. So for the first row, nothing changed, but it will still be in the
incremental backup.

Unless the first row is modified, it does not exist in the memtable at all.

> If I have one row with one column and now a new column is added, the whole
row in the memtable is flushed to an SSTable file, which also ends up in
this incremental backup. For the first column, nothing changed, but it will
still be in the incremental backup file.

For example, if it worked the way you describe, then Color-2 would also
contain the row Lavender, and Blue would include its existing column hex,
like the following. But it does not.

- Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #0000FF}}]
- Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]

--> what your understanding would imply:
- Color-2-Data.db: [{Lavender: {hex: #E6E6FA}}, {Green: {hex: #008000}},
{Blue: {hex: #0000FF, hex2: #2c86ff}}]
  * but the row Lavender and Blue's column hex have no changes
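
Running a toy sketch of the flush behavior over your Color example (again my
own illustration, not Cassandra's code; real SSTables also carry timestamps,
tombstones, etc.) produces exactly the two files above, not the merged one:

// Uses the ToyMemtable sketch from earlier in this mail.
public class ColorExample {
    public static void main(String[] args) {
        ToyMemtable memtable = new ToyMemtable();

        // First round of writes, then a flush -> Color-1-Data.db
        memtable.write("Lavender", "hex", "#E6E6FA");
        memtable.write("Blue", "hex", "#0000FF");
        System.out.println("Color-1: " + memtable.flush());
        // prints Color-1: {Lavender={hex=#E6E6FA}, Blue={hex=#0000FF}}

        // Second round: only Green and Blue's new column hex2 are written
        memtable.write("Green", "hex", "#008000");
        memtable.write("Blue", "hex2", "#2c86ff");
        System.out.println("Color-2: " + memtable.flush());
        // prints Color-2: {Green={hex=#008000}, Blue={hex2=#2c86ff}}
        // Lavender and Blue's original hex do not appear because they
        // were not touched after the first flush.
    }
}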


> The point I tried to make is that this is important if I design an ETL to
consume the incremental backup SSTable files. As in the above example, I
have to realize that the incremental backup sstable files could, and most
likely will, contain old data that was previously processed already. That
will require additional logic and responsibility in the ETL to handle it,
and any outside SSTable consumer will need to pay attention to it.

I suggest trying org.apache.cassandra.tools.SSTableExport; then you will see
what's going on under the hood.
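
For example, with the sstable2json script that ships with Cassandra (it is
just a wrapper around SSTableExport; the path below is made up, point it at
one of your own incremental backup files):

bin/sstable2json /var/lib/cassandra/data/MyKeyspace/Color-2-Data.db

Dump an incremental backup file that way and you will see it contains only
the rows and columns written since the previous flush.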

- Takenori

On Wed, Sep 18, 2013 at 10:51 AM, java8964 java8964 <java8...@hotmail.com> wrote:

> Quote:
>
> "
> To be clear, "incremental backup" feature backs up the data being modified
> in that period, because it writes only those files to the incremental
> backup dir as hard links, between full snapshots.
> "
>
> I thought I was clearer, but your clarification confused me again.
> From all the answers I have gotten so far, I believe the more accurate
> statement would be: the "incremental backup" feature backs up the SSTable
> files being generated in that period.
>
> But there is no way we can be sure that these SSTable files will ONLY
> contain modified data. So the statement being quoted above is not exactly
> right. I agree that all the modified data in that period will be in the
> incremental sstable files, but a lot of other unmodified data will be in
> them too.
>
> If we have 2 rows with different row keys in the same memtable, and only
> the 2nd row is modified, then when the memtable is flushed to an SSTable
> file it will contain both rows, and both will be in the incremental backup
> files. So for the first row, nothing changed, but it will still be in the
> incremental backup.
>
> If I have one row with one column and now a new column is added, the whole
> row in the memtable is flushed to an SSTable file, which also ends up in
> this incremental backup. For the first column, nothing changed, but it will
> still be in the incremental backup file.
>
> The point I tried to make is that this is important if I design an ETL to
> consume the incremental backup SSTable files. As in the above example, I
> have to realize that the incremental backup sstable files could, and most
> likely will, contain old data that was previously processed already. That
> will require additional logic and responsibility in the ETL to handle it,
> and any outside SSTable consumer will need to pay attention to it.
>
> Yong
>
> ------------------------------
> Date: Tue, 17 Sep 2013 18:01:45 -0700
>
> Subject: Re: questions related to the SSTable file
> From: rc...@eventbrite.com
> To: user@cassandra.apache.org
>
>
> On Tue, Sep 17, 2013 at 5:46 PM, Takenori Sato <ts...@cloudian.com> wrote:
>
> > So in fact, Cassandra's incremental backup just hard-links all the
> new SSTable files generated during the incremental backup period. It
> could contain any data, not just the data being updated/inserted/deleted in
> this period, correct?
>
> Correct.
>
> But over time, some old enough SSTable files are usually shared across
> multiple snapshots.
>
>
> To be clear, "incremental backup" feature backs up the data being modified
> in that period, because it writes only those files to the incremental
> backup dir as hard links, between full snapshots.
>
> http://www.datastax.com/docs/1.0/operations/backup_restore
> "
> When incremental backups are enabled (disabled by default), Cassandra
> hard-links each flushed SSTable to a backups directory under the keyspace
> data directory. This allows you to store backups offsite without
> transferring entire snapshots. Also, incremental backups combine with
> snapshots to provide a dependable, up-to-date backup mechanism.
> "
>
> What Takenori is referring to is that a full snapshot is in some ways an
> "incremental backup" because it shares hard linked SSTables with other
> snapshots.
>
> =Rob
>
>
