Re: Alternate "major compaction"

Takenori Sato Thu, 11 Jul 2013 21:44:56 -0700

Hi,

I made the repository public. You can now check it out from here:


https://github.com/cloudian/support-tools

checksstablegarbage is the tool.

Enjoy, and any feedback is welcome.

Thanks,
- Takenori


On Thu, Jul 11, 2013 at 10:12 PM, srmore <comom...@gmail.com> wrote:

> Thanks Takenori,
> Looks like the tool provides some good info that people can use. It would
> be great if you could share it with the community.
>
>
>
>
> On Thu, Jul 11, 2013 at 6:51 AM, Takenori Sato <ts...@cloudian.com> wrote:
>
>> Hi,
>>
>> I think it is a common headache for users running a large Cassandra
>> cluster in production.
>>
>>
>> Running a major compaction is not the only cause; there are others. For
>> example, I see two typical scenarios.
>>
>> 1. backup use case
>> 2. active wide row
>>
>> In case 1, say a piece of data is removed a year after it was written.
>> This means the tombstone on the row is one year away from the original row.
>> To remove an expired row entirely, a compaction set has to include all the
>> fragments of that row. So when will the original, one-year-old row and the
>> tombstoned row end up in the same compaction set? It is likely to take one
>> year.
>>
>> In case 2, such an active wide row exists in most of the sstable
>> files, and it typically contains many expired columns. But none of them
>> will be removed entirely, because in practice a compaction set does not
>> include all the row fragments.
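
A minimal sketch of the rule both scenarios rely on: a row can be dropped entirely only when every sstable holding a fragment of it is in the compaction set. This is a simplified model, not Cassandra's code; it ignores gc_grace_seconds and bloom-filter checks, and the row keys and sstable names are hypothetical.

```python
from typing import Dict, Set


def purgeable_rows(row_locations: Dict[str, Set[str]],
                   compaction_set: Set[str]) -> Set[str]:
    """Return the row keys whose every fragment lies inside the compaction set."""
    return {
        key
        for key, sstables in row_locations.items()
        if sstables and sstables <= compaction_set
    }


# Hypothetical layout: which sstables hold a fragment of each row.
row_locations = {
    # Scenario 1: original row in an old sstable, tombstone written a year later.
    "backup/2012-07.tar": {"hc-1", "hc-90"},
    # Scenario 2: an active wide row spread over most sstables.
    "active_wide_row": {"hc-1", "hc-40", "hc-88", "hc-90"},
    # A row whose fragments all live in recent sstables.
    "recent_row": {"hc-88", "hc-90"},
}

# A typical minor compaction of two recent sstables cannot purge
# the backup row or the wide row.
print(sorted(purgeable_rows(row_locations, {"hc-88", "hc-90"})))
```

Only when the set is chosen to cover a row's fragments (for example hc-1 together with hc-90 for the backup row) does that row become removable.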
>>
>>
>> Btw, there is a very convenient MBean API available:
>> CompactionManager's forceUserDefinedCompaction. It lets you invoke a minor
>> compaction on a file set you define. So the question is how to find an
>> optimal set of sstable files.
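
A small sketch of preparing the arguments for that operation. It assumes the 1.0-era form of forceUserDefinedCompaction, which to my understanding takes a keyspace name and a comma-separated list of data file names on the org.apache.cassandra.db:type=CompactionManager MBean; later versions may differ, so check the CompactionManagerMBean of your release. The helper name and the directory layout assumption are mine, and the JMX call itself is left to whatever client you use.

```python
import posixpath
from typing import Iterable, Tuple

# Assumed MBean name; verify it against your Cassandra version.
MBEAN_NAME = "org.apache.cassandra.db:type=CompactionManager"


def user_defined_compaction_args(data_files: Iterable[str]) -> Tuple[str, str]:
    """Return (keyspace, comma-separated file names) for a set of -Data.db paths.

    Assumes the layout <data_dir>/<keyspace>/<name>-Data.db.
    """
    paths = list(data_files)
    if not paths:
        raise ValueError("need at least one sstable data file")
    # The keyspace is taken from the parent directory of each file.
    keyspaces = {posixpath.basename(posixpath.dirname(p)) for p in paths}
    if len(keyspaces) != 1:
        raise ValueError(f"files span several keyspaces: {sorted(keyspaces)}")
    names = ",".join(posixpath.basename(p) for p in paths)
    return keyspaces.pop(), names


keyspace, files = user_defined_compaction_args([
    "/cassandra_data/UserData/Test5_BLOB-hc-4-Data.db",
    "/cassandra_data/UserData/Test5_BLOB-hc-3-Data.db",
])
print(MBEAN_NAME, "forceUserDefinedCompaction", keyspace, files)
```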
>>
>> So I wrote a tool that checks for garbage and prints out some useful
>> information for finding such an optimal set.
>>
>> Here's a simple log output.
>>
>> # /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
>> [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504071)]
>> ===================================================================================
>> ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
>> ===================================================================================
>> hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
>> -----------------------------------------------------------------------------------
>> TOTAL, 40, 40
>> ===================================================================================
>>
>> REMAINNING_SSTABLE_FILES lists any other sstable files that contain the
>> respective row. So the following is an optimal set:
>>
>> # /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db
>> [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504131)]
>> ===================================================================================
>> ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
>> ===================================================================================
>> hello5/100.txt.1373502926003, 223, 0, YES, YES
>> -----------------------------------------------------------------------------------
>> TOTAL, 223, 0
>> ===================================================================================
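
The step from the first report to the second can be sketched as a small parser: read the row lines, collect the REMAINNING_SSTABLE_FILES column, and add those files to the one that was checked. This is only an illustration based on the output shown above. It assumes fields are separated by commas, that several remaining files would appear as extra comma-separated fields, and that row keys contain no commas; the separator lines in the sample are shortened.

```python
from typing import Dict, List

REPORT = """\
[Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504071)]
==========
ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
==========
hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
----------
TOTAL, 40, 40
==========
"""


def remaining_files(report: str) -> Dict[str, List[str]]:
    """Map each row key to the other sstable files that still contain it."""
    rows: Dict[str, List[str]] = {}
    in_table = False
    for raw in report.splitlines():
        line = raw.strip()
        if line.startswith("ROW_KEY,"):
            in_table = True          # row lines follow the column header
            continue
        if not in_table or not line or set(line) == {"="}:
            continue
        if set(line) == {"-"}:
            break                    # dashed line ends the rows; TOTAL follows
        fields = [f.strip() for f in line.split(",")]
        rows[fields[0]] = [f for f in fields[5:] if f]
    return rows


def optimal_set(checked_file: str, report: str) -> List[str]:
    """Checked file plus every file its rows still depend on."""
    files = {checked_file}
    for others in remaining_files(report).values():
        files.update(others)
    return sorted(files)


print(optimal_set("Test5_BLOB-hc-4-Data.db", REPORT))
```

Running the tool again on the resulting set, as in the second report above, shows whether any row still depends on a file outside it.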
>>
>> This tool relies on SSTableReader and an aggregation iterator, as
>> Cassandra does in compaction. I am considering sharing it with the
>> community, so let me know if anyone is interested.
>>
>> Ah, note that it is based on 1.0.7. So I will need to check and update
>> for newer versions.
>>
>> Thanks,
>> Takenori
>>
>>
>> On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez <tomas.nu...@groupalia.com> wrote:
>>
>>> Hi
>>>
>>> About a year ago, we did a major compaction in our Cassandra cluster (a
>>> n00b mistake, I know), and since then we've had huge sstables that never
>>> get compacted, and we were condemned to repeat the major compaction process
>>> every once in a while (we are using the SizeTiered compaction strategy, and
>>> we have not yet evaluated LeveledCompaction, because it has its downsides,
>>> and we've had no time to test all of them in our environment).
>>>
>>> I was trying to find a way to solve this situation (that is, to do
>>> something like a major compaction that writes small sstables rather than
>>> the huge one that major compaction produces), and I couldn't find one in
>>> the documentation. I tried cleanup and scrub/upgradesstables, but they
>>> don't do that (as the documentation states). Then I tried deleting all the
>>> data on a node and then bootstrapping it (or "nodetool rebuild"-ing it),
>>> hoping that this way the sstables would be cleaned of deleted records and
>>> updates. But the wiped node just copied the sstables from another node as
>>> they were, cleaning nothing.
>>>
>>> So I tried a new approach: I switched the sstable compaction strategy
>>> (SizeTiered to Leveled), forcing the sstables to be rewritten from scratch,
>>> and then switched it back (Leveled to SizeTiered). It took a while (but so
>>> does the major compaction process) and it worked: I have smaller sstables,
>>> and I've regained a lot of disk space.
>>>
>>> I'm happy with the results, but it doesn't seem an orthodox way of
>>> "cleaning" the sstables. What do you think, is it wrong or crazy?
>>> Is there a different way to achieve the same thing?
>>>
>>> Let me give an example:
>>> Suppose you have a write-only columnfamily (no updates and no deletes,
>>> so no need for LeveledCompaction, because SizeTiered works perfectly and
>>> requires less I/O) and you mistakenly run a major compaction on it. After a
>>> few months you need more space and you delete half the data, and you find
>>> out that you're not freeing half the disk space, because most of those
>>> records were in the "major compacted" sstables. How can you free the disk
>>> space? Waiting will do you no good, because the huge sstable won't get
>>> compacted anytime soon. You can run another major compaction, but that
>>> would just postpone the real problem. Then you can switch compaction
>>> strategy and switch it back, as I just did. Is there any other way?
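
Why waiting does not help can be sketched with a simplified model of size-tiered bucketing: sstables of similar size are grouped, and a group is compacted only once it holds enough files. The values 0.5, 1.5 and 4 are my understanding of the usual bucket bounds and minimum threshold, the sizes are hypothetical, and details such as the minimum sstable size are left out.

```python
from typing import List


def size_tiered_buckets(sizes: List[int],
                        bucket_low: float = 0.5,
                        bucket_high: float = 1.5) -> List[List[int]]:
    """Group sstable sizes into buckets of similar size (simplified)."""
    buckets: List[List[int]] = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            # Join a bucket only if the size is close to its current average.
            if avg * bucket_low < size < avg * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def compaction_candidates(sizes: List[int],
                          min_threshold: int = 4) -> List[List[int]]:
    """Buckets with enough similar-sized sstables to trigger a compaction."""
    return [b for b in size_tiered_buckets(sizes) if len(b) >= min_threshold]


# Hypothetical sizes in MB: one 200 GB sstable left by a major compaction,
# plus a handful of recent, much smaller sstables.
sizes_mb = [200_000, 50, 55, 60, 48, 400, 410]
print(size_tiered_buckets(sizes_mb))
print(compaction_candidates(sizes_mb))
```

In this model the 200 GB sstable sits alone in its bucket, so it becomes a candidate only after several other sstables of comparable size exist, which is why the deleted data in it stays on disk.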
>>>
>>> --
>>> Tomàs Núñez
>>> IT-Sysprod
>>> www.groupalia.com <http://es.groupalia.com/>
>>> Tel. +34 93 159 31 00  Fax. +34 93 396 18 52
>>> Llull, 95-97, 2º planta, 08005 Barcelona
>>> Skype: tomas.nunez.groupalia
>>> tomas.nu...@groupalia.com
>>> Twitter <http://twitter.com/#%21/groupaliaes>
>>> Facebook <https://www.facebook.com/GroupaliaEspana>
>>> Linkedin <http://www.linkedin.com/company/groupalia>
>>>
>>
>>
>
