Hi, I made the repository public. You can now check it out from here:
https://github.com/cloudian/support-tools

checksstablegarbage is the tool. Enjoy, and any feedback is welcome.

Thanks,
- Takenori

On Thu, Jul 11, 2013 at 10:12 PM, srmore <comom...@gmail.com> wrote:

> Thanks Takenori,
> Looks like the tool provides some good info that people can use. It would
> be great if you could share it with the community.
>
> On Thu, Jul 11, 2013 at 6:51 AM, Takenori Sato <ts...@cloudian.com> wrote:
>
>> Hi,
>>
>> I think this is a common headache for users running a large Cassandra
>> cluster in production.
>>
>> Running a major compaction is not the only cause; there are more. For
>> example, I see two typical scenarios:
>>
>> 1. backup use case
>> 2. active wide row
>>
>> In case 1, say a piece of data is removed a year later. That means the
>> tombstone on the row is one year away from the original row. To remove an
>> expired row entirely, a compaction set has to include all of its fragments.
>> So, when are the original, one-year-old row and the tombstoned row both
>> included in the same compaction set? It is likely to take one year.
>>
>> In case 2, such an active wide row exists in most of the sstable files,
>> and it typically contains many expired columns. But none of them would be
>> removed entirely, because a compaction set practically never includes all
>> of the row fragments.
>>
>> By the way, there is a very convenient MBean API available:
>> CompactionManager's forceUserDefinedCompaction. You can invoke a minor
>> compaction on a file set you define. So the question is how to find an
>> optimal set of sstable files.
>>
>> So I wrote a tool that checks for garbage and prints out some useful
>> information for finding such an optimal set.
>>
>> Here's a simple log output.
>>
>> # /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
>> [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504071)]
>> ===================================================================================
>> ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
>> ===================================================================================
>> hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
>> -----------------------------------------------------------------------------------
>> TOTAL, 40, 40
>> ===================================================================================
>>
>> REMAINNING_SSTABLE_FILES means any other sstable files that contain the
>> respective row. So, the following is an optimal set.
>>
>> # /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db
>> [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504131)]
>> ===================================================================================
>> ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
>> ===================================================================================
>> hello5/100.txt.1373502926003, 223, 0, YES, YES
>> -----------------------------------------------------------------------------------
>> TOTAL, 223, 0
>> ===================================================================================
>>
>> This tool relies on SSTableReader and an aggregation iterator, just as
>> Cassandra does in compaction. I was considering sharing this with the
>> community, so let me know if anyone is interested.
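Since the "optimal set" reported above is just a list of -Data.db files, it can be handed straight to that MBean. Below is a minimal Java sketch of calling forceUserDefinedCompaction over JMX. It assumes the default JMX port 7199 and the 1.0.x-style two-argument operation (keyspace name plus a comma-separated list of data files), so double-check the bean name and signature against your version; the file names are simply the ones from the example output above.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ForceUserDefinedCompaction {
    public static void main(String[] args) throws Exception {
        // Connect to the node's JMX endpoint (7199 is the usual default port).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName compactionManager =
                    new ObjectName("org.apache.cassandra.db:type=CompactionManager");

            // Assumed 1.0.x-era signature: keyspace name + comma-separated -Data.db files.
            // The file set here is the optimal set reported by checksstablegarbage above.
            mbs.invoke(compactionManager,
                    "forceUserDefinedCompaction",
                    new Object[] { "UserData",
                            "Test5_BLOB-hc-3-Data.db,Test5_BLOB-hc-4-Data.db" },
                    new String[] { "java.lang.String", "java.lang.String" });
        } finally {
            connector.close();
        }
    }
}

For a one-off run, pointing jconsole at the same MBean works just as well; the sketch is only worth it if you want to script feeding the tool's output back into compactions.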
>>
>> Ah, note that it is based on 1.0.7, so I will need to check and update it
>> for newer versions.
>>
>> Thanks,
>> Takenori
>>
>> On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núñez <tomas.nu...@groupalia.com> wrote:
>>
>>> Hi
>>>
>>> About a year ago we did a major compaction in our Cassandra cluster (a
>>> n00b mistake, I know), and since then we've had huge sstables that never
>>> get compacted, and we were condemned to repeat the major compaction
>>> process every once in a while (we are using the SizeTiered compaction
>>> strategy, and we haven't evaluated LeveledCompaction yet, because it has
>>> its downsides and we've had no time to test it in our environment).
>>>
>>> I was trying to find a way out of this situation (that is, something like
>>> a major compaction that writes small sstables instead of huge ones), and
>>> I couldn't find it in the documentation. I tried cleanup and
>>> scrub/upgradesstables, but they don't do that (as the documentation
>>> states). Then I tried deleting all the data on a node and bootstrapping
>>> it (or "nodetool rebuild"-ing it), hoping that the sstables would get
>>> cleaned of deleted records and updates that way. But the rebuilt node
>>> just copied the sstables from another node as they were, cleaning
>>> nothing.
>>>
>>> So I tried a new approach: I switched the compaction strategy (SizeTiered
>>> to Leveled), forcing the sstables to be rewritten from scratch, and then
>>> switched it back (Leveled to SizeTiered). It took a while (but so does a
>>> major compaction), and it worked: I have smaller sstables and I've
>>> regained a lot of disk space.
>>>
>>> I'm happy with the results, but it doesn't seem an orthodox way of
>>> "cleaning" the sstables. What do you think? Is it wrong or crazy? Is
>>> there a different way to achieve the same thing?
>>>
>>> Let's take an example. Suppose you have a write-only column family (no
>>> updates and no deletes, so no need for LeveledCompaction, because
>>> SizeTiered works perfectly and requires less I/O) and you mistakenly run
>>> a major compaction on it. After a few months you need more space, so you
>>> delete half the data, and you find out that you're not freeing half the
>>> disk space, because most of those records were in the "major compacted"
>>> sstables. How can you free the disk space? Waiting will do you no good,
>>> because the huge sstable won't get compacted anytime soon. You can run
>>> another major compaction, but that would just postpone the real problem.
>>> Or you can switch the compaction strategy and switch it back, as I just
>>> did. Is there any other way?
>>>
>>> --
>>> Tomàs Núñez
>>> IT-Sysprod
>>> Tel. +34 93 159 31 00  Fax. +34 93 396 18 52
>>> Llull, 95-97, 2ª planta, 08005 Barcelona
>>> Skype: tomas.nunez.groupalia
>>> tomas.nu...@groupalia.com
>>> www.groupalia.com
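For anyone who wants to try the strategy-switch workaround Tomàs describes, it roughly comes down to two schema updates with a wait in between. A rough cassandra-cli sketch follows, in the same spirit as the console examples above; the attribute and class names may differ between Cassandra versions (check your version's documentation first), and UserData / Test5_BLOB are just the example names from earlier in the thread.

# /opt/cassandra/bin/cassandra-cli -h localhost
[default@unknown] use UserData;
[default@UserData] update column family Test5_BLOB with compaction_strategy = 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy';

(wait for the node to finish rewriting the sstables; watch progress with "nodetool -h localhost compactionstats")

[default@UserData] update column family Test5_BLOB with compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy';

Expect the rewrite to take roughly as long as a major compaction, since every sstable is read and rewritten.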