Hi all,

Is it safe to delete the backup folders for the various CFs in the 'system' keyspace too? I seem to have missed them in the last cleanup, and the size_estimates and compactions_in_progress backups now appear to have grown large (>200G and ~6G respectively).
Can I remove them too?

Thanks,
Kunal

On 13 January 2017 at 18:30, Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
> Great, thanks a lot to all for the help :)
>
> I finally took the dive and went with Razi's suggestions. In summary, this is what I did:
>
> - turn off incremental backups on each of the nodes in rolling fashion
> - remove the 'backups' directory from each keyspace on each node
>
> This ended up freeing up almost 350GB on each node - yay :)
>
> Again, thanks a lot for the help, guys.
>
> Kunal
>
> On 12 January 2017 at 21:15, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <raziuddin.kh...@nih.gov> wrote:
>
>> Snapshots are slightly different from backups.
>>
>> In my explanation of the hardlinks created in the backups folder, notice that compacted sstables never end up in the backups folder.
>>
>> On the other hand, a snapshot is meant to represent the data at a particular moment in time. Thus, the snapshots directory contains hardlinks to all active sstables at the time the snapshot was taken, which would include compacted sstables, and any sstables from memtable flush or streamed from other nodes that exist in both the table directory and the backups directory.
>>
>> So, that would be the difference between snapshots and backups.
>>
>> Best regards,
>> -Razi
>>
>> *From:* Alain RODRIGUEZ <arodr...@gmail.com>
>> *Reply-To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Date:* Thursday, January 12, 2017 at 9:16 AM
>> *To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Subject:* Re: Backups eating up disk space
>>
>> My 2 cents,
>>
>> > As I mentioned earlier, we're not currently using snapshots - it's only the backups that are bothering me right now.
>>
>> I believe the backups folder is just the new name for the previously called snapshots folder. But I can be completely wrong - I haven't played that much with snapshots in new versions yet.
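Razi's snapshots-vs-backups distinction can be seen with a throwaway directory tree. This is a toy sketch using temp dirs that stand in for a real table directory (not real Cassandra output): `backups/` and `snapshots/` are distinct sibling subdirectories, not one directory renamed.

```shell
# Toy layout sketch (paths illustrative): 'backups' holds hardlinks made on
# flush/stream; 'snapshots/<tag>' holds hardlinks to ALL sstables live at
# snapshot time. They are separate directories under each table directory.
tbl=$(mktemp -d)                      # stands in for data/<keyspace>/<table>-<uuid>/
mkdir -p "$tbl/backups" "$tbl/snapshots/my-snap"
ls "$tbl"                             # shows: backups  snapshots
```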
>> Anyway, some operations in Apache Cassandra can trigger a snapshot:
>>
>> - Repair (when not using the parallel option but sequential repairs instead)
>> - Truncating a table (by default)
>> - Dropping a table (by default)
>> - Maybe others I can't think of...?
>>
>> If you want to clean space but still keep a backup you can run:
>>
>> "nodetool clearsnapshot"
>> "nodetool snapshot <whatever>"
>>
>> This way, and for a while, the data won't take up extra space, as the old files will be cleaned and the new files will only be hardlinks, as detailed above. Then you might want to work out a proper backup policy, probably one that gets the data out of the production servers (a lot of people use S3 or similar services). Or just do that from time to time, meaning you only ever keep one backup, and disk space behaviour will be hard to predict.
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>> France
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> 2017-01-12 6:42 GMT+01:00 Prasenjit Sarkar <prasenjit.sar...@datos.io>:
>>
>> Hi Kunal,
>>
>> Razi's post does give a very lucid description of how Cassandra manages the hard links inside the backup directory.
>>
>> Where it needs clarification is the following:
>>
>> --> Incremental backups is a system-wide setting, so it's an all-or-nothing approach.
>>
>> --> As multiple people have stated, incremental backups do not create hard links to compacted sstables; however, they can bloat the size of your backups.
>>
>> --> Again, as stated, it is general industry practice to place backups in a secondary storage location separate from the main production site.
>> So it is best to move the backups to that secondary storage before applying rm on the backups folder.
>>
>> In my experience with production clusters, managing the backups folder across multiple nodes can be painful if the objective is to ever recover data. With the usual disclaimers, it is better to rely on third-party vendors to accomplish this than on scripts/tablesnap.
>>
>> Regards,
>> Prasenjit
>>
>> On Wed, Jan 11, 2017 at 7:49 AM, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <raziuddin.kh...@nih.gov> wrote:
>>
>> Hello Kunal,
>>
>> Caveat: I am not a super-expert on Cassandra, but it helps to explain to others in order to eventually become an expert, so if my explanation is wrong, I hope others will correct me. :)
>>
>> The active sstables/data files are all the files located in the directory for the table.
>> You can safely remove all files under the backups/ directory, and the directory itself.
>> Removing any files that are currently hard-links inside backups won't cause any issues, and I will explain why.
>>
>> Have you looked at your cassandra.yaml file and checked the setting for incremental_backups? If it is set to true and you don't want to make new backups, you can set it to false, so that after you clean up, you will not have to clean up the backups again.
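The cassandra.yaml change Razi suggests is a one-line toggle; a sketch of the relevant excerpt (per-node file, and its location varies by install):

```yaml
# cassandra.yaml (one per node) - stop hardlinking new sstables into backups/.
# A restart picks up the file change; 'nodetool disablebackup' toggles it at
# runtime without a restart.
incremental_backups: false
```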
>> Explanation:
>>
>> Let's look at the definition of incremental backups again: "Cassandra creates a hard link to each SSTable flushed or streamed locally in a backups subdirectory of the keyspace data."
>>
>> Suppose we have a directory path: my_keyspace/my_table-some-uuid/backups/
>> In the rest of the discussion, when I refer to the "table directory", I explicitly mean the directory my_keyspace/my_table-some-uuid/.
>> When I refer to the backups/ directory, I explicitly mean my_keyspace/my_table-some-uuid/backups/.
>>
>> Suppose that you have an sstable-A that was either flushed from a memtable or streamed from another node.
>> At this point, you have a hardlink to sstable-A in your table directory, and a hardlink to sstable-A in your backups/ directory.
>> Suppose that you have another sstable-B that was also either flushed from a memtable or streamed from another node.
>> At this point, you have a hardlink to sstable-B in your table directory, and a hardlink to sstable-B in your backups/ directory.
>>
>> Next, suppose compaction were to occur, where sstable-A and sstable-B are compacted to produce sstable-C, representing all the data from A and B.
>> Now, sstable-C will live in your main table directory, and the hardlinks to sstable-A and sstable-B will be deleted from the main table directory, but sstable-A and sstable-B will continue to exist in backups/.
>> At this point, in your main table directory, you will have a hardlink to sstable-C. In your backups/ directory you will have hardlinks to sstable-A and sstable-B.
>>
>> Thus, your main table directory is not cluttered with old un-compacted sstables, and only has the sstables (along with other files) that are actively being used.
>>
>> To drive the point home...
>> Suppose that you have another sstable-D that was either flushed from a memtable or streamed from another node.
>> At this point, in your main table directory, you will have sstable-C and sstable-D. In your backups/ directory you will have hardlinks to sstable-A, sstable-B, and sstable-D.
>>
>> Next, suppose compaction were to occur, where sstable-C and sstable-D are compacted to produce sstable-E, representing all the data from C and D.
>> Now, sstable-E will live in your main table directory, and the hardlinks to sstable-C and sstable-D will be deleted from the main table directory, but sstable-D will continue to exist in backups/.
>> At this point, in your main table directory, you will have a hardlink to sstable-E. In your backups/ directory you will have hardlinks to sstable-A, sstable-B, and sstable-D.
>>
>> As you can see, the backups/ directory quickly accumulates all the un-compacted sstables, and progressively uses up more and more space.
>> Also, note that the backups/ directory does not contain sstables generated by compaction, such as sstable-C and sstable-E.
>> It is safe to delete the entire backups/ directory because all the data is represented in the compacted sstable-E.
>> I hope this explanation was clear and gives you confidence in using rm to delete the backups/ directory.
>>
>> Best regards,
>> -Razi
>>
>> *From:* Kunal Gangakhedkar <kgangakhed...@gmail.com>
>> *Reply-To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Date:* Wednesday, January 11, 2017 at 6:47 AM
>> *To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Subject:* Re: Backups eating up disk space
>>
>> Thanks for the reply, Razi.
>>
>> As I mentioned earlier, we're not currently using snapshots - it's only the backups that are bothering me right now.
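Razi's hardlink walkthrough can be reproduced with plain files and ln. This toy sketch (fake "sstable" files in a temp dir, not real Cassandra data) shows why removing the backups/ link never touches the live copy:

```shell
# Toy model of the walkthrough: flush creates a file plus a hardlink in
# backups/; "compaction" drops the live link; the backups/ link still holds
# the bytes, and deleting backups/ wholesale leaves the compacted file alone.
tbl=$(mktemp -d)                               # stands in for my_keyspace/my_table-some-uuid/
mkdir "$tbl/backups"
printf 'rows of A\n' > "$tbl/sstable-A"        # flushed/streamed sstable
ln "$tbl/sstable-A" "$tbl/backups/sstable-A"   # incremental-backup hardlink (same inode)
printf 'rows of A+B\n' > "$tbl/sstable-C"      # "compaction" writes C...
rm "$tbl/sstable-A"                            # ...and deletes the live link to A
cat "$tbl/backups/sstable-A"                   # prints: rows of A  (data survives via the other link)
rm -r "$tbl/backups"                           # safe: sstable-C still holds all the data
ls "$tbl"                                      # prints: sstable-C
```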
>> So my next question pertains to this statement of yours:
>>
>> > As far as I am aware, using *rm* is perfectly safe to delete the directories for snapshots/backups as long as you are careful not to delete your actively used sstable files and directories.
>>
>> How do I find out which are the actively used sstables?
>> If by that you mean the main data files, does that mean I can safely remove all files ONLY under the "backups/" directory?
>> Or can removing any files that are currently hard-links inside backups potentially cause issues?
>>
>> Thanks,
>> Kunal
>>
>> On 11 January 2017 at 01:06, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <raziuddin.kh...@nih.gov> wrote:
>>
>> Hello Kunal,
>>
>> I would take a look at the following configuration options in the cassandra.yaml:
>>
>> *Common automatic backup settings*
>>
>> *incremental_backups:*
>> http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__incremental_backups
>>
>> (Default: false) Backs up data updated since the last snapshot was taken. When enabled, Cassandra creates a hard link to each SSTable flushed or streamed locally in a backups subdirectory of the keyspace data. Removing these links is the operator's responsibility.
>>
>> *snapshot_before_compaction:*
>> http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__snapshot_before_compaction
>>
>> (Default: false) Enables or disables taking a snapshot before each compaction. A snapshot is useful to back up data when there is a data format change. Be careful using this option: Cassandra does not clean up older snapshots automatically.
>> *Advanced automatic backup setting*
>>
>> *auto_snapshot:*
>> http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__auto_snapshot
>>
>> (Default: true) Enables or disables whether Cassandra takes a snapshot of the data before truncating a keyspace or dropping a table. To prevent data loss, DataStax strongly advises using the default setting. If you set auto_snapshot to false, you lose data on truncation or drop.
>>
>> *nodetool* also provides methods to manage snapshots: http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsNodetool.html
>> See the specific commands:
>>
>> - nodetool clearsnapshot <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsClearSnapShot.html> - Removes one or more snapshots.
>> - nodetool listsnapshots <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsListSnapShots.html> - Lists snapshot names, size on disk, and true size.
>> - nodetool snapshot <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsSnapShot.html> - Takes a snapshot of one or more keyspaces, or of a table, to back up data.
>>
>> As far as I am aware, using *rm* is perfectly safe to delete the directories for snapshots/backups as long as you are careful not to delete your actively used sstable files and directories. I think the *nodetool clearsnapshot* command is provided so that you don't accidentally delete actively used files. Last I used *clearsnapshot* (a very long time ago), I thought it left the directory behind, but this could have been fixed in newer versions (so you might want to check that).
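Combining the nodetool commands above with Alain's "keep one fresh snapshot" idea might look like the sketch below. It is a dry run by default (it only prints the commands, so it can be read and tested without a live Cassandra node); the tag name is an arbitrary example.

```shell
# Snapshot rotation sketch: clear old snapshots, then take a dated one.
# Dry-run wrapper: prints each command instead of executing it.
run() { echo "+ $*"; }                # replace the body with: "$@"  to actually execute
run nodetool clearsnapshot                        # removes existing snapshots
run nodetool snapshot -t "weekly-$(date +%Y%m%d)" # fresh snapshot; hardlinks only, so cheap at first
```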
>> HTH,
>> -Razi
>>
>> *From:* Jonathan Haddad <j...@jonhaddad.com>
>> *Reply-To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Date:* Tuesday, January 10, 2017 at 12:26 PM
>> *To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Subject:* Re: Backups eating up disk space
>>
>> If you remove the files from the backup directory, you would not have data loss in the case of a node going down. They're hard links to the same files that are in your data directory, and are created when an sstable is written to disk. At that time they take up (almost) no space, so they aren't a big deal; but when the sstable gets compacted, they stick around, so they end up not freeing up space.
>>
>> Usually you use incremental backups as a means of moving the sstables off the node to a backup location. If you're not doing anything with them, they're just wasting space and you should disable incremental backups.
>>
>> Some people take snapshots and then rely on incremental backups. Others use the tablesnap utility, which does sort of the same thing.
>>
>> On Tue, Jan 10, 2017 at 9:18 AM Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>>
>> Thanks for the quick reply, Jon.
>>
>> But what about the case of a node/cluster going down? Would there be data loss if I remove these files manually?
>>
>> How is this typically managed in production setups?
>> What are the best practices for the same?
>> Do people take snapshots on each node before removing the backups?
>>
>> This is my first production deployment - so, still trying to learn.
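Jonathan's "(almost) no space at first" point is measurable with du, since a hardlink adds only a directory entry, not data. A toy sketch with a 1 MiB stand-in file:

```shell
# du counts the shared 1 MiB file once, even though two paths reference it.
d=$(mktemp -d); mkdir "$d/backups"
dd if=/dev/zero of="$d/big-Data.db" bs=1024 count=1024 2>/dev/null  # fake 1 MiB sstable
ln "$d/big-Data.db" "$d/backups/big-Data.db"                        # "incremental backup" link
du -sk "$d" | cut -f1    # ~1024 KiB plus directory overhead, not ~2048
```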
>> Thanks,
>> Kunal
>>
>> On 10 January 2017 at 21:36, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>
>> You can just delete them off the filesystem (rm).
>>
>> On Tue, Jan 10, 2017 at 8:02 AM Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>>
>> Hi all,
>>
>> We have a 3-node Cassandra cluster with incremental backup set to true.
>> Each node has a 1TB data volume that stores Cassandra data.
>>
>> The load in the output of 'nodetool status' comes up at around 260GB on each node.
>> All our keyspaces use replication factor = 3.
>>
>> However, the df output shows the data volumes consuming around 850GB of space.
>> I checked the keyspace directory structures - most of the space goes to <CASS_DATA_VOL>/data/<KEYSPACE>/<CF>/backups.
>>
>> We have never manually run snapshots.
>>
>> What is the typical procedure to clear the backups?
>> Can it be done without taking the node offline?
>>
>> Thanks,
>> Kunal
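Pulling the thread's answer together, the cleanup can be done with the node online, since only the backups/ hardlinks are removed. A sketch, with paths as assumptions (the default data_file_directories is /var/lib/cassandra/data; substitute your <CASS_DATA_VOL>):

```shell
# Size the backups first, then remove ONLY directories named 'backups' at the
# <keyspace>/<table>/backups depth - live sstables in the table dirs are untouched.
DATA_DIR=${DATA_DIR:-/var/lib/cassandra/data}
du -sch "$DATA_DIR"/*/*/backups 2>/dev/null | tail -n 1     # total space held by backups/
if [ -d "$DATA_DIR" ]; then
  find "$DATA_DIR" -mindepth 3 -maxdepth 3 -type d -name backups -exec rm -r {} +
fi
```

Pair this with disabling incremental backups (in cassandra.yaml or via `nodetool disablebackup`) so the directories do not simply regrow.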