Hi all,

Is it safe to delete the backup folders for the various CFs in the 'system' keyspace too? I seem to have missed them in the last cleanup, and the size_estimates and compactions_in_progress backups now appear to have grown large (>200G and ~6G respectively).
Can I remove them too?

Thanks,
Kunal

On 13 January 2017 at 18:30, Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
> Great, thanks a lot to all for the help :)
>
> I finally took the dive and went with Razi's suggestions. In summary, this is what I did:
>
> - turn off incremental backups on each of the nodes in rolling fashion
> - remove the 'backups' directory from each keyspace on each node
>
> This ended up freeing up almost 350GB on each node - yay :)
>
> Again, thanks a lot for the help, guys.
>
> Kunal
>
> On 12 January 2017 at 21:15, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <raziuddin.kh...@nih.gov> wrote:
>
>> Snapshots are slightly different from backups.
>>
>> In my explanation of the hardlinks created in the backups folder, notice that compacted sstables never end up in the backups folder.
>>
>> On the other hand, a snapshot is meant to represent the data at a particular moment in time. Thus, the snapshots directory contains hardlinks to all active sstables at the time the snapshot was taken, which would include compacted sstables, and any sstables from memtable flush or streamed from other nodes that exist in both the table directory and the backups directory.
>>
>> So, that would be the difference between snapshots and backups.
>>
>> Best regards,
>> -Razi
>>
>> *From:* Alain RODRIGUEZ <arodr...@gmail.com>
>> *Reply-To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Date:* Thursday, January 12, 2017 at 9:16 AM
>> *To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Subject:* Re: Backups eating up disk space
>>
>> My 2 cents,
>>
>> > As I mentioned earlier, we're not currently using snapshots - it's only the backups that are bothering me right now.
>>
>> I believe the backups folder is just the new name for the previously called snapshots folder. But I can be completely wrong - I haven't played that much with snapshots in new versions yet.
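Razi's snapshots-vs-backups distinction can be seen with a throwaway directory tree. This is a toy sketch using temp dirs that stand in for a real table directory (not real Cassandra output): `backups/` and `snapshots/` are distinct sibling subdirectories, not one directory renamed.

```shell
# Toy layout sketch (paths illustrative): 'backups' holds hardlinks made on
# flush/stream; 'snapshots/<tag>' holds hardlinks to ALL sstables live at
# snapshot time. They are separate directories under each table directory.
tbl=$(mktemp -d)                      # stands in for data/<keyspace>/<table>-<uuid>/
mkdir -p "$tbl/backups" "$tbl/snapshots/my-snap"
ls "$tbl"                             # shows: backups  snapshots
```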
>> Anyway, some operations in Apache Cassandra can trigger a snapshot:
>>
>> - Repair (when not using the parallel option but sequential repairs instead)
>> - Truncating a table (by default)
>> - Dropping a table (by default)
>> - Maybe others I can't think of...?
>>
>> If you want to clean space but still keep a backup you can run:
>>
>> "nodetool clearsnapshot"
>> "nodetool snapshot <whatever>"
>>
>> This way, and for a while, the data won't take up extra space, as the old files will be cleaned and the new files will only be hardlinks, as detailed above. Then you might want to work out a proper backup policy, probably one that gets the data out of the production servers (a lot of people use S3 or similar services). Or just do that from time to time, meaning you only ever keep one backup, and disk space behaviour will be hard to predict.
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>> France
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> 2017-01-12 6:42 GMT+01:00 Prasenjit Sarkar <prasenjit.sar...@datos.io>:
>>
>> Hi Kunal,
>>
>> Razi's post does give a very lucid description of how Cassandra manages the hard links inside the backup directory.
>>
>> Where it needs clarification is the following:
>>
>> --> Incremental backups is a system-wide setting, so it's an all-or-nothing approach.
>>
>> --> As multiple people have stated, incremental backups do not create hard links to compacted sstables; however, they can bloat the size of your backups.
>>
>> --> Again, as stated, it is general industry practice to place backups in a secondary storage location separate from the main production site.
>> So it is best to move the backups to that secondary storage before applying rm on the backups folder.
>>
>> In my experience with production clusters, managing the backups folder across multiple nodes can be painful if the objective is to ever recover data. With the usual disclaimers, it is better to rely on third-party vendors to accomplish this than on scripts/tablesnap.
>>
>> Regards,
>> Prasenjit
>>
>> On Wed, Jan 11, 2017 at 7:49 AM, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <raziuddin.kh...@nih.gov> wrote:
>>
>> Hello Kunal,
>>
>> Caveat: I am not a super-expert on Cassandra, but it helps to explain to others in order to eventually become an expert, so if my explanation is wrong, I hope others will correct me. :)
>>
>> The active sstables/data files are all the files located in the directory for the table.
>> You can safely remove all files under the backups/ directory, and the directory itself.
>> Removing any files that are currently hard-links inside backups won't cause any issues, and I will explain why.
>>
>> Have you looked at your cassandra.yaml file and checked the setting for incremental_backups? If it is set to true and you don't want to make new backups, you can set it to false, so that after you clean up, you will not have to clean up the backups again.
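The cassandra.yaml change Razi suggests is a one-line toggle; a sketch of the relevant excerpt (per-node file, and its location varies by install):

```yaml
# cassandra.yaml (one per node) - stop hardlinking new sstables into backups/.
# A restart picks up the file change; 'nodetool disablebackup' toggles it at
# runtime without a restart.
incremental_backups: false
```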
>> Explanation:
>>
>> Let's look at the definition of incremental backups again: "Cassandra creates a hard link to each SSTable flushed or streamed locally in a backups subdirectory of the keyspace data."
>>
>> Suppose we have a directory path: my_keyspace/my_table-some-uuid/backups/
>> In the rest of the discussion, when I refer to the "table directory", I explicitly mean the directory my_keyspace/my_table-some-uuid/.
>> When I refer to the backups/ directory, I explicitly mean my_keyspace/my_table-some-uuid/backups/.
>>
>> Suppose that you have an sstable-A that was either flushed from a memtable or streamed from another node.
>> At this point, you have a hardlink to sstable-A in your table directory, and a hardlink to sstable-A in your backups/ directory.
>> Suppose that you have another sstable-B that was also either flushed from a memtable or streamed from another node.
>> At this point, you have a hardlink to sstable-B in your table directory, and a hardlink to sstable-B in your backups/ directory.
>>
>> Next, suppose compaction were to occur, where sstable-A and sstable-B are compacted to produce sstable-C, representing all the data from A and B.
>> Now, sstable-C will live in your main table directory, and the hardlinks to sstable-A and sstable-B will be deleted from the main table directory, but sstable-A and sstable-B will continue to exist in backups/.
>> At this point, in your main table directory, you will have a hardlink to sstable-C. In your backups/ directory you will have hardlinks to sstable-A and sstable-B.
>>
>> Thus, your main table directory is not cluttered with old un-compacted sstables, and only has the sstables (along with other files) that are actively being used.
>>
>> To drive the point home...
>> Suppose that you have another sstable-D that was either flushed from a memtable or streamed from another node.
>> At this point, in your main table directory, you will have sstable-C and sstable-D. In your backups/ directory you will have hardlinks to sstable-A, sstable-B, and sstable-D.
>>
>> Next, suppose compaction were to occur, where sstable-C and sstable-D are compacted to produce sstable-E, representing all the data from C and D.
>> Now, sstable-E will live in your main table directory, and the hardlinks to sstable-C and sstable-D will be deleted from the main table directory, but sstable-D will continue to exist in backups/.
>> At this point, in your main table directory, you will have a hardlink to sstable-E. In your backups/ directory you will have hardlinks to sstable-A, sstable-B, and sstable-D.
>>
>> As you can see, the backups/ directory quickly accumulates all the un-compacted sstables, and progressively uses up more and more space.
>> Also, note that the backups/ directory does not contain sstables generated by compaction, such as sstable-C and sstable-E.
>> It is safe to delete the entire backups/ directory because all the data is represented in the compacted sstable-E.
>> I hope this explanation was clear and gives you confidence in using rm to delete the backups/ directory.
>>
>> Best regards,
>> -Razi
>>
>> *From:* Kunal Gangakhedkar <kgangakhed...@gmail.com>
>> *Reply-To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Date:* Wednesday, January 11, 2017 at 6:47 AM
>> *To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Subject:* Re: Backups eating up disk space
>>
>> Thanks for the reply, Razi.
>>
>> As I mentioned earlier, we're not currently using snapshots - it's only the backups that are bothering me right now.
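Razi's hardlink walkthrough can be reproduced with plain files and ln. This toy sketch (fake "sstable" files in a temp dir, not real Cassandra data) shows why removing the backups/ link never touches the live copy:

```shell
# Toy model of the walkthrough: flush creates a file plus a hardlink in
# backups/; "compaction" drops the live link; the backups/ link still holds
# the bytes, and deleting backups/ wholesale leaves the compacted file alone.
tbl=$(mktemp -d)                               # stands in for my_keyspace/my_table-some-uuid/
mkdir "$tbl/backups"
printf 'rows of A\n' > "$tbl/sstable-A"        # flushed/streamed sstable
ln "$tbl/sstable-A" "$tbl/backups/sstable-A"   # incremental-backup hardlink (same inode)
printf 'rows of A+B\n' > "$tbl/sstable-C"      # "compaction" writes C...
rm "$tbl/sstable-A"                            # ...and deletes the live link to A
cat "$tbl/backups/sstable-A"                   # prints: rows of A  (data survives via the other link)
rm -r "$tbl/backups"                           # safe: sstable-C still holds all the data
ls "$tbl"                                      # prints: sstable-C
```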
>> So my next question pertains to this statement of yours:
>>
>> > As far as I am aware, using *rm* is perfectly safe to delete the directories for snapshots/backups as long as you are careful not to delete your actively used sstable files and directories.
>>
>> How do I find out which are the actively used sstables?
>> If by that you mean the main data files, does that mean I can safely remove all files ONLY under the "backups/" directory?
>> Or can removing any files that are currently hard-links inside backups potentially cause issues?
>>
>> Thanks,
>> Kunal
>>
>> On 11 January 2017 at 01:06, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <raziuddin.kh...@nih.gov> wrote:
>>
>> Hello Kunal,
>>
>> I would take a look at the following configuration options in the cassandra.yaml:
>>
>> *Common automatic backup settings*
>>
>> *incremental_backups:*
>> http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__incremental_backups
>>
>> (Default: false) Backs up data updated since the last snapshot was taken. When enabled, Cassandra creates a hard link to each SSTable flushed or streamed locally in a backups subdirectory of the keyspace data. Removing these links is the operator's responsibility.
>>
>> *snapshot_before_compaction:*
>> http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__snapshot_before_compaction
>>
>> (Default: false) Enables or disables taking a snapshot before each compaction. A snapshot is useful to back up data when there is a data format change. Be careful using this option: Cassandra does not clean up older snapshots automatically.
>> *Advanced automatic backup setting*
>>
>> *auto_snapshot:*
>> http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__auto_snapshot
>>
>> (Default: true) Enables or disables whether Cassandra takes a snapshot of the data before truncating a keyspace or dropping a table. To prevent data loss, DataStax strongly advises using the default setting. If you set auto_snapshot to false, you lose data on truncation or drop.
>>
>> *nodetool* also provides methods to manage snapshots: http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsNodetool.html
>> See the specific commands:
>>
>> - nodetool clearsnapshot <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsClearSnapShot.html> - Removes one or more snapshots.
>> - nodetool listsnapshots <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsListSnapShots.html> - Lists snapshot names, size on disk, and true size.
>> - nodetool snapshot <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsSnapShot.html> - Takes a snapshot of one or more keyspaces, or of a table, to back up data.
>>
>> As far as I am aware, using *rm* is perfectly safe to delete the directories for snapshots/backups as long as you are careful not to delete your actively used sstable files and directories. I think the *nodetool clearsnapshot* command is provided so that you don't accidentally delete actively used files. Last I used *clearsnapshot* (a very long time ago), I thought it left the directory behind, but this could have been fixed in newer versions (so you might want to check that).
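Combining the nodetool commands above with Alain's "keep one fresh snapshot" idea might look like the sketch below. It is a dry run by default (it only prints the commands, so it can be read and tested without a live Cassandra node); the tag name is an arbitrary example.

```shell
# Snapshot rotation sketch: clear old snapshots, then take a dated one.
# Dry-run wrapper: prints each command instead of executing it.
run() { echo "+ $*"; }                # replace the body with: "$@"  to actually execute
run nodetool clearsnapshot                        # removes existing snapshots
run nodetool snapshot -t "weekly-$(date +%Y%m%d)" # fresh snapshot; hardlinks only, so cheap at first
```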
>> HTH,
>> -Razi
>>
>> *From:* Jonathan Haddad <j...@jonhaddad.com>
>> *Reply-To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Date:* Tuesday, January 10, 2017 at 12:26 PM
>> *To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Subject:* Re: Backups eating up disk space
>>
>> If you remove the files from the backup directory, you would not have data loss in the case of a node going down. They're hard links to the same files that are in your data directory, and are created when an sstable is written to disk. At that time they take up (almost) no space, so they aren't a big deal; but when the sstable gets compacted, they stick around, so they end up not freeing up space.
>>
>> Usually you use incremental backups as a means of moving the sstables off the node to a backup location. If you're not doing anything with them, they're just wasting space and you should disable incremental backups.
>>
>> Some people take snapshots and then rely on incremental backups. Others use the tablesnap utility, which does sort of the same thing.
>>
>> On Tue, Jan 10, 2017 at 9:18 AM Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>>
>> Thanks for the quick reply, Jon.
>>
>> But what about the case of a node/cluster going down? Would there be data loss if I remove these files manually?
>>
>> How is this typically managed in production setups?
>> What are the best practices for the same?
>> Do people take snapshots on each node before removing the backups?
>>
>> This is my first production deployment - so, still trying to learn.
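Jonathan's "(almost) no space at first" point is measurable with du, since a hardlink adds only a directory entry, not data. A toy sketch with a 1 MiB stand-in file:

```shell
# du counts the shared 1 MiB file once, even though two paths reference it.
d=$(mktemp -d); mkdir "$d/backups"
dd if=/dev/zero of="$d/big-Data.db" bs=1024 count=1024 2>/dev/null  # fake 1 MiB sstable
ln "$d/big-Data.db" "$d/backups/big-Data.db"                        # "incremental backup" link
du -sk "$d" | cut -f1    # ~1024 KiB plus directory overhead, not ~2048
```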
>> Thanks,
>> Kunal
>>
>> On 10 January 2017 at 21:36, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>
>> You can just delete them off the filesystem (rm).
>>
>> On Tue, Jan 10, 2017 at 8:02 AM Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>>
>> Hi all,
>>
>> We have a 3-node Cassandra cluster with incremental backup set to true.
>> Each node has a 1TB data volume that stores Cassandra data.
>>
>> The load in the output of 'nodetool status' comes up at around 260GB on each node.
>> All our keyspaces use replication factor = 3.
>>
>> However, the df output shows the data volumes consuming around 850GB of space.
>> I checked the keyspace directory structures - most of the space goes to <CASS_DATA_VOL>/data/<KEYSPACE>/<CF>/backups.
>>
>> We have never manually run snapshots.
>>
>> What is the typical procedure to clear the backups?
>> Can it be done without taking the node offline?
>>
>> Thanks,
>> Kunal
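Pulling the thread's answer together, the cleanup can be done with the node online, since only the backups/ hardlinks are removed. A sketch, with paths as assumptions (the default data_file_directories is /var/lib/cassandra/data; substitute your <CASS_DATA_VOL>):

```shell
# Size the backups first, then remove ONLY directories named 'backups' at the
# <keyspace>/<table>/backups depth - live sstables in the table dirs are untouched.
DATA_DIR=${DATA_DIR:-/var/lib/cassandra/data}
du -sch "$DATA_DIR"/*/*/backups 2>/dev/null | tail -n 1     # total space held by backups/
if [ -d "$DATA_DIR" ]; then
  find "$DATA_DIR" -mindepth 3 -maxdepth 3 -type d -name backups -exec rm -r {} +
fi
```

Pair this with disabling incremental backups (in cassandra.yaml or via `nodetool disablebackup`) so the directories do not simply regrow.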