Hi Sylvain,

might I ask why repair cannot simply ignore anything that is older than gc-grace (like Aaron proposed)? I agree that repair should not process any tombstones or anything. But in my mind it sounds reasonable to make repair ignore timed-out data. Because the timestamp is created on the client, there is no reason to repair these, right?

We are using TTLs quite heavily, and I noticed that every repair increases the load of all nodes by 1-2 GB, where each node has about 20-30 GB of data. I don't know if this increases with the data volume. The data is mostly time-series data.

I even noticed an increase when running two repairs directly after each other. So even when data was just repaired, there is still data being transferred. I assume this is due to some columns timing out within that timeframe and the entire row being repaired.

regards,
Christian

On Thu, Nov 1, 2012 at 9:43 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
> > Is this a feature or a bug?
>
> Neither really. Repair doesn't do any gcable tombstone collection and
> it would be really hard to change that (besides, it's not its job). So
> if, when you run repair, there are sstables with tombstones that could
> be collected but are not yet, then yes, they will be streamed. Now the
> theory is that compaction will run often enough that gcable tombstones
> will be collected in a reasonably timely fashion and so you will never
> have lots of such tombstones in general (making the fact that repair
> streams them largely irrelevant). That being said, in practice, I don't
> doubt that there are a few scenarios like your own where this still can
> lead to doing too much useless work.
>
> I believe the main problem is that size tiered compaction has a
> tendency to not compact the largest sstables very often. Meaning that
> you could have large sstables with mostly gcable tombstones sitting
> around. In the upcoming Cassandra 1.2,
> https://issues.apache.org/jira/browse/CASSANDRA-3442 will fix that.
> Until then, if you are not afraid of a little bit of scripting, one
> option could be, before running a repair, to run a small script that
> would check the creation time of your sstables. If an sstable is old
> enough (for some value of "old enough" that depends on the TTL you use
> on all your columns), you may want to force a compaction (using the
> JMX call forceUserDefinedCompaction()) of that sstable. The goal being
> to get rid of a maximum of outdated tombstones before running the
> repair (you could also alternatively run a major compaction prior to
> the repair, but major compactions have a lot of nasty effects so I
> wouldn't recommend that a priori).
>
> --
> Sylvain
>
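
For what it's worth, the pre-repair check described above could be sketched roughly as follows. This is only an illustration, not a tested tool: the function name `old_sstables`, the `*-Data.db` file name pattern, and the use of file modification time as a stand-in for sstable creation time are all assumptions on my part (sstables are not rewritten after they are flushed or compacted, so mtime should be a reasonable proxy). The script only lists candidates; it does not make the JMX call itself.

```python
import time
from pathlib import Path


def old_sstables(data_dir, min_age_seconds, now=None):
    """Return the names of sstable data files older than min_age_seconds.

    Assumptions (not taken from the thread): data files match '*-Data.db'
    and their mtime approximates the sstable creation time.
    """
    now = time.time() if now is None else now
    candidates = []
    for path in sorted(Path(data_dir).glob("*-Data.db")):
        age = now - path.stat().st_mtime
        if age >= min_age_seconds:
            candidates.append(path.name)
    return candidates
```

One would pick `min_age_seconds` from the largest TTL in use plus gc_grace, so that an sstable older than that should contain mostly gcable tombstones, and then pass the returned file names to forceUserDefinedCompaction() through whatever JMX client is at hand before starting the repair.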