There’s a seldom-discussed parameter called:

unchecked_tombstone_compaction

The documentation describes the option as follows:

True enables more aggressive than normal tombstone compactions. A single 
SSTable tombstone compaction runs without checking the likelihood of success. 
Cassandra 2.0.9 and later.

You’d need to upgrade to 2.0.9 or later, but by doing so and enabling 
unchecked_tombstone_compaction, you could encourage Cassandra to compact a 
single large SSTable on its own to purge tombstones.
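
As a rough sketch of what enabling it might look like in CQL (the keyspace and 
table names here are placeholders I made up, and this assumes the table stays 
on SizeTieredCompactionStrategy):

```cql
-- Hypothetical keyspace/table names; substitute your own.
-- Assumes Cassandra 2.0.9 or later, where this option is available.
ALTER TABLE my_keyspace.my_table
WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'unchecked_tombstone_compaction': 'true'
};
```

As far as I know, the compaction map is set as a whole, so any other 
sub-options you already rely on should be repeated in the same statement.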



From:  <erickramirezonl...@gmail.com> on behalf of Erick Ramirez
Reply-To:  "user@cassandra.apache.org"
Date:  Sunday, September 27, 2015 at 11:59 PM
To:  "user@cassandra.apache.org", Dongfeng Lu
Subject:  Re: How to remove huge files with all expired data sooner?

Hello, 

You should never run `nodetool compact` since this will result in a massive 
SSTable that will almost never get compacted out or take a very long time to 
get compacted out.

You are correct that there need to be 4 similar-sized SSTables for them to get 
compacted. If you want the expired data to be deleted quicker, try lowering the 
STCS `min_threshold` to 3 or even 2. Good luck!
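
For illustration only, lowering the threshold might look something like this 
in CQL (the keyspace and table names are made-up placeholders, and the value 2 
is just one of the two suggested above):

```cql
-- Hypothetical keyspace/table names; substitute your own.
-- Lets STCS compact as soon as 2 similar-sized SSTables exist
-- instead of waiting for the default of 4.
ALTER TABLE my_keyspace.my_table
WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'min_threshold': '2'
};
```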

Cheers,
Erick 


On Sat, Sep 26, 2015 at 4:40 AM, Dongfeng Lu <dlu66...@yahoo.com> wrote:
Hi, I have a table where I set TTL to only 7 days for all records and we keep 
pumping records in every day. In general, I would expect all data files for 
that table to have timestamps less than, say, 8 or 9 days old, giving the 
system some time to work its magic. However, I occasionally see some files 
more than 9 days old. Last Friday, I saw 4 large files, each about 10G in size, 
with timestamps about 5, 4, 3, and 2 weeks old. Interestingly, they are all 
gone this Monday, leaving 1 new file 9 GB in size.

The compaction strategy is SizeTieredCompactionStrategy, and I can understand 
why the above happened. It seems we have 10G of data every week, and when 
SizeTieredCompactionStrategy works to create various tiers, it just happens 
that the file size for the next tier is 10G, and all the data is packed into 
this huge file. Then it starts the next cycle. Another week goes by, and 
another 10G file is created. This process continues until the minimum number of 
files of the same size is reached, which I think is 4 by default. Then it 
starts to compact this set of 4 10G files. By this time, all data in these 4 
files has expired, so we end up with nothing, or with a much smaller file if 
there are still some records with TTL left.

I have many tables like this, and I'd like to reclaim that space sooner. What 
would be the best way to do it? Should I run "nodetool compact" when I see two 
large files that are 2 weeks old? Are there configuration parameters I can tune 
to achieve the same effect? I looked through all the CQL Compaction 
Subproperties for STCS, but I am not sure how they can help here. Any 
suggestions are welcome.

BTW, I am using Cassandra 2.0.6.

