Data is only ever deleted from Cassandra during a compaction. With SizeTieredCompactionStrategy, a compaction occurs only when a number of similarly sized sstables can be combined into a new sstable.
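[To make that grouping rule concrete, here is a toy Java sketch of size-tiered bucketing. The 0.5x-1.5x size window and the four-table threshold mirror SizeTieredCompactionStrategy's commonly documented defaults, but the code is purely illustrative, not Cassandra's implementation.]

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class SizeTieredSketch {

        // Group sstable sizes (in MB) into buckets of "similar" size:
        // a table joins a bucket when it is within 0.5x..1.5x of the
        // bucket's average size; otherwise it starts a new bucket.
        static List<List<Long>> buckets(List<Long> sizes) {
            List<Long> sorted = new ArrayList<>(sizes);
            Collections.sort(sorted);
            List<List<Long>> result = new ArrayList<>();
            for (long size : sorted) {
                List<Long> target = null;
                for (List<Long> bucket : result) {
                    double avg = bucket.stream()
                            .mapToLong(Long::longValue).average().getAsDouble();
                    if (size >= 0.5 * avg && size <= 1.5 * avg) {
                        target = bucket;
                        break;
                    }
                }
                if (target == null) {
                    target = new ArrayList<>();
                    result.add(target);
                }
                target.add(size);
            }
            return result;
        }

        public static void main(String[] args) {
            // Four ~100 MB tables form an eligible bucket; the single
            // ~50 GB product of a major compaction sits alone until
            // three more tables of comparable size exist.
            List<Long> sizes = List.of(95L, 100L, 105L, 110L, 51_200L);
            for (List<Long> bucket : buckets(sizes)) {
                boolean eligible = bucket.size() >= 4; // min_threshold
                System.out.println(bucket + (eligible ? "  -> compacts" : "  -> waits"));
            }
        }
    }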
When you perform a major compaction, all sstables are combined into one very
large sstable. As a result, any tombstoned data in that large sstable will
only be removed once a number of similarly large sstables exist, which means
tombstoned data may be trapped in that sstable for a very long time (or
indefinitely, depending on your use case).

-Mike

On Jul 11, 2013, at 9:31 AM, Brian Tarbox wrote:

> Perhaps I should already know this, but why is running a major compaction
> considered so bad? We're running 1.1.6.
>
> Thanks.
>
> On Thu, Jul 11, 2013 at 7:51 AM, Takenori Sato <ts...@cloudian.com> wrote:
>
> Hi,
>
> I think it is a common headache for users running a large Cassandra
> cluster in production.
>
> Running a major compaction is not the only cause; there are others. For
> example, I see two typical scenarios:
>
> 1. backup use case
> 2. active wide row
>
> In case 1, say a piece of data is removed a year later. The tombstone on
> the row is then one year away from the original row. To remove an expired
> row entirely, a compaction set has to include all of its fragments. So
> when are the original, one-year-old row and the tombstoned row included
> in the same compaction set? It is likely to take one year.
>
> In case 2, such an active wide row exists in most of the sstable files,
> and it typically contains many expired columns. But none of them will be
> removed entirely, because in practice a compaction set does not include
> all the row fragments.
>
> By the way, a very convenient MBean API is available:
> CompactionManager's forceUserDefinedCompaction. You can invoke a minor
> compaction on a file set you define. So the question is how to find an
> optimal set of sstable files.
>
> I wrote a tool that checks for garbage and prints out some useful
> information for finding such an optimal set.
>
> Here's a simple log output:
>
> # /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
> [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504071)]
> ===================================================================================
> ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
> ===================================================================================
> hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
> -----------------------------------------------------------------------------------
> TOTAL, 40, 40
> ===================================================================================
>
> REMAINNING_SSTABLE_FILES means any other sstable files that contain the
> respective row. So the following is an optimal set:
>
> # /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db
> [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504131)]
> ===================================================================================
> ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
> ===================================================================================
> hello5/100.txt.1373502926003, 223, 0, YES, YES
> -----------------------------------------------------------------------------------
> TOTAL, 223, 0
> ===================================================================================
>
> This tool relies on SSTableReader and an aggregation iterator, just as
> Cassandra does in compaction.
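[As an aside, a minimal sketch of what invoking that MBean over JMX might look like from Java is below. It assumes Cassandra's default JMX port (7199) and the two-argument (keyspace, comma-separated data files) form of forceUserDefinedCompaction exposed by the 1.0/1.1-era CompactionManagerMBean; check the signature on your version before relying on it.]

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class UserDefinedCompaction {
        public static void main(String[] args) throws Exception {
            // Cassandra listens for JMX on port 7199 by default.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName compactionManager =
                        new ObjectName("org.apache.cassandra.db:type=CompactionManager");
                // Invoke generically so no Cassandra jar is needed on the
                // client. The keyspace and file names are taken from the
                // example output above; adjust both for your cluster, and
                // note that the signature varies across Cassandra versions.
                mbs.invoke(compactionManager, "forceUserDefinedCompaction",
                        new Object[] { "UserData",
                                "Test5_BLOB-hc-3-Data.db,Test5_BLOB-hc-4-Data.db" },
                        new String[] { "java.lang.String", "java.lang.String" });
            } finally {
                connector.close();
            }
        }
    }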
> I was considering sharing this with the community, so let me know if
> anyone is interested.
>
> Ah, note that it is based on 1.0.7, so I will need to check and update it
> for newer versions.
>
> Thanks,
> Takenori
>
> On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez <tomas.nu...@groupalia.com> wrote:
>
> Hi,
>
> About a year ago we ran a major compaction in our Cassandra cluster (a
> n00b mistake, I know), and since then we've had huge sstables that never
> get compacted, so we were condemned to repeat the major compaction every
> once in a while (we are using SizeTieredCompaction strategy, and we have
> not yet evaluated LeveledCompaction, because it has its downsides and
> we've had no time to test them all in our environment).
>
> I was trying to find a way out of this situation (that is, something like
> a major compaction that writes small sstables rather than the huge one a
> major compaction produces), and I couldn't find it in the documentation.
> I tried cleanup and scrub/upgradesstables, but they don't do that (as the
> documentation states). Then I tried deleting all data on a node and
> bootstrapping it (or "nodetool rebuild"-ing it), hoping that this way the
> sstables would be cleaned of deleted records and updates. But the node
> just copied the sstables from another node as they were, cleaning
> nothing.
>
> So I tried a new approach: I switched the compaction strategy (SizeTiered
> to Leveled), forcing the sstables to be rewritten from scratch, and then
> switched it back (Leveled to SizeTiered). It took a while (but so does a
> major compaction), and it worked: I have smaller sstables, and I've
> regained a lot of disk space.
>
> I'm happy with the results, but it doesn't seem an orthodox way of
> "cleaning" the sstables. What do you think, is it something wrong or
> crazy? Is there a different way to achieve the same thing?
>
> Let's take an example. Suppose you have a write-only column family (no
> updates and no deletes, so no need for LeveledCompaction, because
> SizeTiered works perfectly and requires less I/O) and you mistakenly run
> a major compaction on it. After a few months you need more space, so you
> delete half the data, and you find that you're not freeing half the disk
> space, because most of those records were in the "major compacted"
> sstables. How can you free the disk space? Waiting will do you no good,
> because the huge sstable won't get compacted anytime soon. You can run
> another major compaction, but that would just postpone the real problem.
> Or you can switch compaction strategy and switch it back, as I just did.
> Is there any other way?
>
> --
> www.groupalia.com
> Tomàs Núñez
> IT-Sysprod
> Tel. +34 93 159 31 00
> Fax. +34 93 396 18 52
> Llull, 95-97, 2ª planta, 08005 Barcelona
> Skype: tomas.nunez.groupalia
> tomas.nu...@groupalia.com
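[Tomàs's strategy flip is a schema change. On a 1.1-era cluster it might look roughly like the following cassandra-cli session; the keyspace and column family names are borrowed from Takenori's example earlier in the thread, and the exact syntax varies by version.]

    $ cassandra-cli -h localhost
    [default@unknown] use UserData;
    [default@UserData] update column family Test5_BLOB
        with compaction_strategy = 'LeveledCompactionStrategy';

    ... wait for all sstables to be rewritten into levels ...

    [default@UserData] update column family Test5_BLOB
        with compaction_strategy = 'SizeTieredCompactionStrategy';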