Data is only ever deleted from Cassandra during a compaction. With SizeTieredCompactionStrategy, a compaction occurs only when a number of similarly sized sstables can be combined into a new sstable.
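[To make that grouping rule concrete, here is a toy Java sketch of size-tiered bucketing. The 0.5x-1.5x size window and the four-table threshold mirror SizeTieredCompactionStrategy's commonly documented defaults, but the code is purely illustrative, not Cassandra's implementation.]

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class SizeTieredSketch {

        // Group sstable sizes (in MB) into buckets of "similar" size:
        // a table joins a bucket when it is within 0.5x..1.5x of the
        // bucket's average size; otherwise it starts a new bucket.
        static List<List<Long>> buckets(List<Long> sizes) {
            List<Long> sorted = new ArrayList<>(sizes);
            Collections.sort(sorted);
            List<List<Long>> result = new ArrayList<>();
            for (long size : sorted) {
                List<Long> target = null;
                for (List<Long> bucket : result) {
                    double avg = bucket.stream()
                            .mapToLong(Long::longValue).average().getAsDouble();
                    if (size >= 0.5 * avg && size <= 1.5 * avg) {
                        target = bucket;
                        break;
                    }
                }
                if (target == null) {
                    target = new ArrayList<>();
                    result.add(target);
                }
                target.add(size);
            }
            return result;
        }

        public static void main(String[] args) {
            // Four ~100 MB tables form an eligible bucket; the single
            // ~50 GB product of a major compaction sits alone until
            // three more tables of comparable size exist.
            List<Long> sizes = List.of(95L, 100L, 105L, 110L, 51_200L);
            for (List<Long> bucket : buckets(sizes)) {
                boolean eligible = bucket.size() >= 4; // min_threshold
                System.out.println(bucket + (eligible ? "  -> compacts" : "  -> waits"));
            }
        }
    }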
When you perform a major compaction, all sstables are combined into one very
large sstable. As a result, any tombstoned data in that large sstable will
only be removed once a number of similarly large sstables exist, which means
tombstoned data may be trapped in that sstable for a very long time (or
indefinitely, depending on your use case).

-Mike

On Jul 11, 2013, at 9:31 AM, Brian Tarbox wrote:

> Perhaps I should already know this, but why is running a major compaction
> considered so bad? We're running 1.1.6.
>
> Thanks.
>
> On Thu, Jul 11, 2013 at 7:51 AM, Takenori Sato <ts...@cloudian.com> wrote:
>
> Hi,
>
> I think it is a common headache for users running a large Cassandra
> cluster in production.
>
> Running a major compaction is not the only cause; there are others. For
> example, I see two typical scenarios:
>
> 1. backup use case
> 2. active wide row
>
> In case 1, say a piece of data is removed a year later. The tombstone on
> the row is then one year away from the original row. To remove an expired
> row entirely, a compaction set has to include all of its fragments. So
> when are the original, one-year-old row and the tombstoned row included
> in the same compaction set? It is likely to take one year.
>
> In case 2, such an active wide row exists in most of the sstable files,
> and it typically contains many expired columns. But none of them will be
> removed entirely, because in practice a compaction set does not include
> all the row fragments.
>
> By the way, a very convenient MBean API is available:
> CompactionManager's forceUserDefinedCompaction. You can invoke a minor
> compaction on a file set you define. So the question is how to find an
> optimal set of sstable files.
>
> I wrote a tool that checks for garbage and prints out some useful
> information for finding such an optimal set.
>
> Here's a simple log output:
>
> # /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
> [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504071)]
> ===================================================================================
> ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
> ===================================================================================
> hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
> -----------------------------------------------------------------------------------
> TOTAL, 40, 40
> ===================================================================================
>
> REMAINNING_SSTABLE_FILES means any other sstable files that contain the
> respective row. So the following is an optimal set:
>
> # /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db
> [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504131)]
> ===================================================================================
> ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
> ===================================================================================
> hello5/100.txt.1373502926003, 223, 0, YES, YES
> -----------------------------------------------------------------------------------
> TOTAL, 223, 0
> ===================================================================================
>
> This tool relies on SSTableReader and an aggregation iterator, just as
> Cassandra does in compaction.
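[As an aside, a minimal sketch of what invoking that MBean over JMX might look like from Java is below. It assumes Cassandra's default JMX port (7199) and the two-argument (keyspace, comma-separated data files) form of forceUserDefinedCompaction exposed by the 1.0/1.1-era CompactionManagerMBean; check the signature on your version before relying on it.]

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class UserDefinedCompaction {
        public static void main(String[] args) throws Exception {
            // Cassandra listens for JMX on port 7199 by default.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName compactionManager =
                        new ObjectName("org.apache.cassandra.db:type=CompactionManager");
                // Invoke generically so no Cassandra jar is needed on the
                // client. The keyspace and file names are taken from the
                // example output above; adjust both for your cluster, and
                // note that the signature varies across Cassandra versions.
                mbs.invoke(compactionManager, "forceUserDefinedCompaction",
                        new Object[] { "UserData",
                                "Test5_BLOB-hc-3-Data.db,Test5_BLOB-hc-4-Data.db" },
                        new String[] { "java.lang.String", "java.lang.String" });
            } finally {
                connector.close();
            }
        }
    }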
> I was considering sharing this with the community, so let me know if
> anyone is interested.
>
> Ah, note that it is based on 1.0.7, so I will need to check and update it
> for newer versions.
>
> Thanks,
> Takenori
>
> On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez <tomas.nu...@groupalia.com> wrote:
>
> Hi,
>
> About a year ago we ran a major compaction in our Cassandra cluster (a
> n00b mistake, I know), and since then we've had huge sstables that never
> get compacted, so we were condemned to repeat the major compaction every
> once in a while (we are using SizeTieredCompaction strategy, and we have
> not yet evaluated LeveledCompaction, because it has its downsides and
> we've had no time to test them all in our environment).
>
> I was trying to find a way out of this situation (that is, something like
> a major compaction that writes small sstables rather than the huge one a
> major compaction produces), and I couldn't find it in the documentation.
> I tried cleanup and scrub/upgradesstables, but they don't do that (as the
> documentation states). Then I tried deleting all data on a node and
> bootstrapping it (or "nodetool rebuild"-ing it), hoping that this way the
> sstables would be cleaned of deleted records and updates. But the node
> just copied the sstables from another node as they were, cleaning
> nothing.
>
> So I tried a new approach: I switched the compaction strategy (SizeTiered
> to Leveled), forcing the sstables to be rewritten from scratch, and then
> switched it back (Leveled to SizeTiered). It took a while (but so does a
> major compaction), and it worked: I have smaller sstables, and I've
> regained a lot of disk space.
>
> I'm happy with the results, but it doesn't seem an orthodox way of
> "cleaning" the sstables. What do you think, is it something wrong or
> crazy? Is there a different way to achieve the same thing?
>
> Let's take an example. Suppose you have a write-only column family (no
> updates and no deletes, so no need for LeveledCompaction, because
> SizeTiered works perfectly and requires less I/O) and you mistakenly run
> a major compaction on it. After a few months you need more space, so you
> delete half the data, and you find that you're not freeing half the disk
> space, because most of those records were in the "major compacted"
> sstables. How can you free the disk space? Waiting will do you no good,
> because the huge sstable won't get compacted anytime soon. You can run
> another major compaction, but that would just postpone the real problem.
> Or you can switch compaction strategy and switch it back, as I just did.
> Is there any other way?
>
> --
> www.groupalia.com
> Tomàs Núñez
> IT-Sysprod
> Tel. +34 93 159 31 00
> Fax. +34 93 396 18 52
> Llull, 95-97, 2ª planta, 08005 Barcelona
> Skype: tomas.nunez.groupalia
> tomas.nu...@groupalia.com
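[Tomàs's strategy flip is a schema change. On a 1.1-era cluster it might look roughly like the following cassandra-cli session; the keyspace and column family names are borrowed from Takenori's example earlier in the thread, and the exact syntax varies by version.]

    $ cassandra-cli -h localhost
    [default@unknown] use UserData;
    [default@UserData] update column family Test5_BLOB
        with compaction_strategy = 'LeveledCompactionStrategy';

    ... wait for all sstables to be rewritten into levels ...

    [default@UserData] update column family Test5_BLOB
        with compaction_strategy = 'SizeTieredCompactionStrategy';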