On Sat, Feb 5, 2011 at 11:59 AM, buddhasystem <potek...@bnl.gov> wrote:
>
> Just wanted to see if someone with experience in running an actual service
> can advise me:
>
> how often do you run nodetool compact on your nodes? Do you stagger it in
> time, for each node? How badly is performance affected?
>
> I know this all seems too generic but then again no two clusters are created
> equal anyhow. Just wanted to get a feel.
>
> Thanks,
> Maxim
This is an interesting topic. Cassandra can now remove tombstones during non-major (minor) compactions, so for some use cases you may never need to trigger nodetool compact yourself just to clear tombstones. Workloads that do not do many updates or deletes have the least need to run major compactions by hand.

However, reads are more efficient when they have fewer, smaller SSTables to touch. If you have a quiet window, say 1AM-6AM, running a major compaction then can shrink your dataset significantly and make reads better.

Compaction can be more or less intensive; the largest factor is row size. Users with large rows probably see faster compactions, while those with many small rows see them take longer. You can also lower the priority of the compaction thread for experimentation.

As for performance, you want to get your cluster to a state where it is not compacting often. That may mean you need more nodes to handle the write load. I graph the compaction information from JMX (http://www.jointhegrid.com/cassandra/cassandra-cacti-m6.jsp) to get a feel for how often each node is compacting on average, and I cross-reference that with my read latency and IO graphs to see what impact compaction has on reads. (There is a quick command-line spot-check sketched below.)

Forcing a major compaction off-peak also lowers the chance that a big compaction kicks off during peak hours. I major compact a few cluster nodes each night through cron (gc grace is 3 days), staggered so the whole cluster is not compacting at once; a sample cron entry is sketched at the end. This has been good for keeping our data on disk as small as possible. The nightly major compaction uses IO, but I find it saves IO over the course of the day because each read seeks less on disk.
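If you just want to spot-check compaction on a node without setting up graphing, something like the following works. This is only a rough sketch: the hostname is a placeholder, the JMX port depends on your version (8080 on 0.7, 7199 later), and nodetool compactionstats may not be present in older releases.

    # show the compaction currently running and how far along it is
    nodetool -h cass-node1 -p 8080 compactionstats

    # watch disk utilization at the same time to see what the compaction costs you
    iostat -x 5

Watching iostat while a compaction runs is basically the manual version of cross-referencing the IO graphs mentioned above.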
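For the nightly job, the cron entry is just the nodetool call; shift the hour (or the night) per node so only a few nodes compact at a time. Again a sketch, not a recipe: the install path, user, JMX port, and keyspace name are all placeholders you would swap for your own.

    # /etc/cron.d/cassandra-compact on node1 -- major compact MyKeyspace at 2AM
    0 2 * * * cassandra /usr/local/cassandra/bin/nodetool -h localhost -p 8080 compact MyKeyspace >> /var/log/cassandra/compact.log 2>&1

    # on node2 run it an hour later (or on a different night) so the nodes don't all compact together
    0 3 * * * cassandra /usr/local/cassandra/bin/nodetool -h localhost -p 8080 compact MyKeyspace >> /var/log/cassandra/compact.log 2>&1

The point of the stagger is that with gc grace at 3 days you only need each node majored every couple of nights, so you never pay the IO cost on the whole ring at once.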