Great, thanks a lot to all for the help :) I finally took the dive and went with Razi's suggestions. In summary, this is what I did:

- turn off incremental backups on each of the nodes, in rolling fashion
- remove the 'backups' directory from each keyspace on each node

This ended up freeing up almost 350GB on each node - yay :)
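A rough sketch of that rolling procedure, assuming a Cassandra version whose nodetool can toggle incremental backups at runtime and the default data path (adjust both to your setup):

    # On each node, one at a time:
    nodetool disablebackup    # stop hardlinking newly flushed/streamed sstables
    nodetool statusbackup     # verify: should report "not running"

    # Also set incremental_backups: false in cassandra.yaml, since
    # nodetool only changes the running process, not the config file.

    # Remove the accumulated hardlinks; the live sstables in the
    # table directories themselves are untouched.
    rm -rf /var/lib/cassandra/data/*/*/backups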
Again, thanks a lot for the help, guys.

Kunal

On 12 January 2017 at 21:15, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <raziuddin.kh...@nih.gov> wrote:

Snapshots are slightly different from backups.

In my explanation of the hardlinks created in the backups folder, notice that compacted sstables never end up in the backups folder.

A snapshot, on the other hand, is meant to represent the data at a particular moment in time. Thus, the snapshots directory contains hardlinks to all active sstables at the time the snapshot was taken, which would include compacted sstables, as well as any sstables from memtable flushes or streamed from other nodes that exist in both the table directory and the backups directory.

So, that is the difference between snapshots and backups.

Best regards,
-Razi

From: Alain RODRIGUEZ <arodr...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, January 12, 2017 at 9:16 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Backups eating up disk space

My 2 cents,

> As I mentioned earlier, we're not currently using snapshots - it's only the backups that are bothering me right now.

I believe the backups folder is just the new name for the previously named snapshots folder. But I could be completely wrong - I haven't played that much with snapshots in new versions yet.

Anyway, some operations in Apache Cassandra can trigger a snapshot:

- Repair (when not using the parallel option, but sequential repairs instead)
- Truncating a table (by default)
- Dropping a table (by default)
- Maybe others I can't think of...?

If you want to clean up space but still keep a backup, you can run:

"nodetool clearsnapshot"
"nodetool snapshot <whatever>"

This way, and for a while, data won't be taking up extra space, as old files will be cleaned and new files will only be hardlinks, as detailed above. Then you might want to work out a proper backup policy, probably implying getting data out of the production servers (a lot of people use S3 or similar services). Or just do that from time to time, meaning you only ever keep one backup, and disk space behaviour will be hard to predict.

C*heers,
-----------------------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com
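A sketch of that clean-then-snapshot cycle, run per node; the snapshot tag and the S3 destination are made-up examples, and clearsnapshot's no-argument behavior (remove all snapshots) may differ between versions:

    # Remove all existing snapshots, then take a fresh one.
    # The new snapshot is hardlinks to the live sstables, so it costs
    # almost nothing until compaction rewrites those files.
    nodetool clearsnapshot
    nodetool snapshot -t backup-2017-01-12

    # Then, ideally, ship it off the node (tooling varies; this is
    # only the idea, not a tested pipeline):
    # tar czf - /var/lib/cassandra/data/*/*/snapshots/backup-2017-01-12 \
    #     | aws s3 cp - s3://my-backups/$(hostname)/backup-2017-01-12.tar.gz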
2017-01-12 6:42 GMT+01:00 Prasenjit Sarkar <prasenjit.sar...@datos.io>:

Hi Kunal,

Razi's post does give a very lucid description of how Cassandra manages the hard links inside the backup directory.

Where it needs clarification is the following:

--> Incremental backups is a system-wide setting, so it's an all-or-nothing approach.

--> As multiple people have stated, incremental backups do not create hard links to compacted sstables. However, this can bloat the size of your backups.

--> Again, as stated, it is general industry practice to place backups in a secondary storage location separate from the main production site. So it is best to move them to secondary storage before applying rm on the backups folder.

In my experience with production clusters, managing the backups folder across multiple nodes can be painful if the objective is to ever recover data. With the usual disclaimers, it is better to rely on third-party vendors to accomplish the needful rather than scripts/tablesnap.

Regards,
Prasenjit

On Wed, Jan 11, 2017 at 7:49 AM, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <raziuddin.kh...@nih.gov> wrote:

Hello Kunal,

Caveat: I am not a super-expert on Cassandra, but it helps to explain to others in order to eventually become an expert, so if my explanation is wrong, I would hope others would correct me. :)

The active sstables/data files are all the files located in the directory for the table.
You can safely remove all files under the backups/ directory, and the directory itself.
Removing any files that are current hard-links inside backups won't cause any issues, and I will explain why.

Have you looked at your cassandra.yaml file and checked the setting for incremental_backups? If it is set to true and you don't want to make new backups, you can set it to false, so that after you clean up, you will not have to clean up the backups again.

Explanation:

Let's look at the definition of incremental backups again: "Cassandra creates a hard link to each SSTable flushed or streamed locally in a backups subdirectory of the keyspace data."

Suppose we have a directory path: my_keyspace/my_table-some-uuid/backups/
In the rest of the discussion, when I refer to the "table directory", I explicitly mean the directory my_keyspace/my_table-some-uuid/; when I refer to the backups/ directory, I explicitly mean my_keyspace/my_table-some-uuid/backups/.

Suppose you have an sstable-A that was either flushed from a memtable or streamed from another node.
At this point, you have a hardlink to sstable-A in your table directory and a hardlink to sstable-A in your backups/ directory.
Suppose you have another sstable-B that was also either flushed from a memtable or streamed from another node.
At this point, you have a hardlink to sstable-B in your table directory and a hardlink to sstable-B in your backups/ directory.

Next, suppose compaction occurs, where sstable-A and sstable-B are compacted to produce sstable-C, representing all the data from A and B.
Now sstable-C will live in your main table directory, and the hardlinks to sstable-A and sstable-B will be deleted from the main table directory, but sstable-A and sstable-B will continue to exist in backups/.
At this point, in your main table directory you will have a hardlink to sstable-C; in your backups/ directory you will have hardlinks to sstable-A and sstable-B.

Thus, your main table directory is not cluttered with old un-compacted sstables, and only has the sstables (along with other files) that are actively being used.

To drive the point home...
Suppose you have another sstable-D that was either flushed from a memtable or streamed from another node.
At this point, in your main table directory you will have sstable-C and sstable-D. In your backups/ directory you will have hardlinks to sstable-A, sstable-B, and sstable-D.
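A toy shell session illustrating the hardlink mechanics just described; the paths and file names are made up, and a single Data.db file stands in for an sstable's full set of component files:

    # "Flush" an sstable and create its incremental-backup hardlink.
    mkdir -p my_table/backups
    echo data > my_table/sstable-A-Data.db
    ln my_table/sstable-A-Data.db my_table/backups/sstable-A-Data.db
    ls -l my_table my_table/backups     # same file, link count 2

    # "Compaction" removes the live copy; the backups/ link still
    # pins the bytes on disk (link count drops to 1).
    rm my_table/sstable-A-Data.db
    ls -l my_table/backups

    # Only removing the last hardlink actually frees the space.
    rm my_table/backups/sstable-A-Data.db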
Next, suppose compaction occurs where sstable-C and sstable-D are compacted to produce sstable-E, representing all the data from C and D.
Now sstable-E will live in your main table directory, and the hardlinks to sstable-C and sstable-D will be deleted from the main table directory, but sstable-D will continue to exist in backups/.
At this point, in your main table directory you will have a hardlink to sstable-E; in your backups/ directory you will have hardlinks to sstable-A, sstable-B, and sstable-D.

As you can see, the backups/ directory quickly accumulates all the un-compacted sstables, and so progressively uses up more and more space.
Also note that the backups/ directory does not contain sstables generated by compaction, such as sstable-C and sstable-E.
It is safe to delete the entire backups/ directory because all the data is represented in the compacted sstable-E.
I hope this explanation was clear and gives you confidence in using rm to delete the backups/ directory.

Best regards,
-Razi

From: Kunal Gangakhedkar <kgangakhed...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, January 11, 2017 at 6:47 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Backups eating up disk space

Thanks for the reply, Razi.

As I mentioned earlier, we're not currently using snapshots - it's only the backups that are bothering me right now.

So my next question pertains to this statement of yours:

> As far as I am aware, using rm is perfectly safe to delete the directories for snapshots/backups as long as you are careful not to delete your actively used sstable files and directories.

How do I find out which are the actively used sstables?
If by that you mean the main data files, does that mean I can safely remove all files ONLY under the "backups/" directory?
Or can removing any files that are current hard-links inside backups potentially cause issues?

Thanks,
Kunal

On 11 January 2017 at 01:06, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <raziuddin.kh...@nih.gov> wrote:

Hello Kunal,

I would take a look at the following configuration options in the cassandra.yaml:

Common automatic backup settings

incremental_backups:
http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__incremental_backups

(Default: false) Backs up data updated since the last snapshot was taken. When enabled, Cassandra creates a hard link to each SSTable flushed or streamed locally in a backups subdirectory of the keyspace data. Removing these links is the operator's responsibility.

snapshot_before_compaction:
http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__snapshot_before_compaction

(Default: false) Enables or disables taking a snapshot before each compaction. A snapshot is useful to back up data when there is a data format change. Be careful using this option: Cassandra does not clean up older snapshots automatically.

Advanced automatic backup setting

auto_snapshot:
http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__auto_snapshot

(Default: true) Enables or disables whether Cassandra takes a snapshot of the data before truncating a keyspace or dropping a table. To prevent data loss, DataStax strongly advises using the default setting. If you set auto_snapshot to false, you lose data on truncation or drop.
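Put together, a cassandra.yaml excerpt for "no automatic local backups, but keep the truncate/drop safety net" would look something like this (illustrative values, not a recommendation):

    incremental_backups: false         # no hardlinks into backups/
    snapshot_before_compaction: false  # no snapshot before each compaction
    auto_snapshot: true                # still snapshot on TRUNCATE/DROP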
nodetool also provides methods to manage snapshots: http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsNodetool.html

See the specific commands:

- nodetool clearsnapshot <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsClearSnapShot.html> - Removes one or more snapshots.
- nodetool listsnapshots <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsListSnapShots.html> - Lists snapshot names, size on disk, and true size.
- nodetool snapshot <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsSnapShot.html> - Takes a snapshot of one or more keyspaces, or of a table, to back up data.

As far as I am aware, using rm is perfectly safe for deleting the directories for snapshots/backups, as long as you are careful not to delete your actively used sstable files and directories. I think the nodetool clearsnapshot command is provided so that you don't accidentally delete actively used files. Last I used clearsnapshot (a very long time ago), I thought it left the directory behind, but this could have been fixed in newer versions (so you might want to check that).

HTH
-Razi

From: Jonathan Haddad <j...@jonhaddad.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, January 10, 2017 at 12:26 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Backups eating up disk space

If you remove the files from the backup directory, you would not have data loss in the case of a node going down. They're hard links to the same files that are in your data directory, created when an sstable is written to disk. At that time they take up (almost) no space, so they aren't a big deal; but when the sstable gets compacted, they stick around, so they end up not freeing space up.

Usually you use incremental backups as a means of moving the sstables off the node to a backup location. If you're not doing anything with them, they're just wasting space and you should disable incremental backups.

Some people take snapshots and then rely on incremental backups. Others use the tablesnap utility, which does sort of the same thing.
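To convince yourself these really are hardlinks and not copies, check the link count (the number right after the permissions in ls -l output); the keyspace, table, and file names below are made up:

    # Link count 2: the backups/ entry shares its inode - and its
    # bytes - with the live sstable. After compaction deletes the
    # live copy, the count drops to 1, but the bytes stay pinned
    # until the backups/ link is removed too.
    ls -l /var/lib/cassandra/data/my_keyspace/my_table-abc123/backups/

    # List every path sharing an inode with a given backup file.
    find /var/lib/cassandra/data -samefile \
        /var/lib/cassandra/data/my_keyspace/my_table-abc123/backups/mc-1-big-Data.db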
On Tue, Jan 10, 2017 at 9:18 AM Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:

Thanks for the quick reply, Jon.

But what about the case of a node/cluster going down? Would there be data loss if I remove these files manually?

How is it typically managed in production setups?
What are the best practices for the same?
Do people take snapshots on each node before removing the backups?

This is my first production deployment - so, still trying to learn.

Thanks,
Kunal

On 10 January 2017 at 21:36, Jonathan Haddad <j...@jonhaddad.com> wrote:

You can just delete them off the filesystem (rm).

On Tue, Jan 10, 2017 at 8:02 AM Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:

Hi all,

We have a 3-node Cassandra cluster with incremental backup set to true.
Each node has a 1TB data volume that stores Cassandra data.

The load in the output of 'nodetool status' comes to around 260GB on each node.
All our keyspaces use replication factor = 3.

However, the df output shows the data volumes consuming around 850GB of space.
I checked the keyspace directory structures - most of the space goes to <CASS_DATA_VOL>/data/<KEYSPACE>/<CF>/backups.

We have never manually run snapshots.

What is the typical procedure for clearing the backups?
Can it be done without taking the node offline?

Thanks,
Kunal
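For anyone hitting the same symptom, a minimal sketch of the comparison described above, assuming the default data path stands in for <CASS_DATA_VOL>:

    # Live data as Cassandra accounts for it (~260GB per node here).
    nodetool status

    # What the filesystem actually holds (~850GB here).
    df -h /var/lib/cassandra

    # Where the difference hides: incremental-backup hardlinks.
    du -sh /var/lib/cassandra/data/*/*/backups 2>/dev/null | sort -h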