That makes sense. I am seeing some unexpected performance data in my tests, however; I will start another thread for that.
Thanks again!

On Fri, May 12, 2017 at 6:56 PM, Blake Eggleston <beggles...@apple.com> wrote:

> The start and end points of a range tombstone are basically stored as
> special-purpose rows alongside the normal data in an sstable. As part of a
> read, they're reconciled with the data from the other sstables into a
> single partition, just like the other rows. The only difference is that
> they don't contain any 'real' data, and, of course, they prevent 'deleted'
> data from being returned in the read. It's a bit more complicated than
> that, but that's the general idea.
>
> On May 12, 2017 at 6:23:01 AM, Stefano Ortolani (ostef...@gmail.com) wrote:
>
> Thanks a lot Blake, that definitely helps!
>
> I actually found a ticket regarding range tombstones and how they are
> accounted for: https://issues.apache.org/jira/browse/CASSANDRA-8527
>
> I am wondering now what happens when a node receives a read request. Are
> the range tombstones read before scanning the sstables? More
> interestingly, given that a single partition might be split across
> different levels, and that some range tombstones might be in L0 while all
> the rest of the data is in L1, are all the tombstones prefetched from
> _all_ the involved sstables before doing any table scan?
>
> Regards,
> Stefano
>
> On Thu, May 11, 2017 at 7:58 PM, Blake Eggleston <beggles...@apple.com> wrote:
>
>> Hi Stefano,
>>
>>> Based on what I understood reading the docs, if the ratio of garbage
>>> collectable tombstones exceeds the "tombstone_threshold", C* should
>>> start compacting and evicting.
>>
>> If there are no other normal compaction tasks to be run, LCS will attempt
>> to compact the sstables it estimates it will be able to drop the most
>> tombstones from. It does this by estimating the number of tombstones an
>> sstable has that have passed the gc grace period. Whether or not a
>> tombstone will actually be evicted is more complicated. Even if a
>> tombstone has passed gc grace, it can't be dropped if the data it's
>> deleting still exists in another sstable, otherwise the data would appear
>> to return. So, a tombstone won't be dropped if there is data for the same
>> partition in other sstables that is older than the tombstone being
>> evaluated for eviction.
>>
>>> I am quite puzzled however by what might happen when dealing with range
>>> tombstones. In that case a single tombstone might actually stand for an
>>> arbitrary number of normal tombstones. In other words, do range
>>> tombstones contribute to the "tombstone_threshold"? If so, how?
>>
>> From what I can tell, each end of the range tombstone is counted as a
>> single tombstone. So a range tombstone effectively contributes '2' to the
>> count of tombstones for an sstable. I'm not 100% sure, but I haven't seen
>> any sstable writing logic that tracks open tombstones and counts covered
>> cells as tombstones. So, it's likely that the effect of range tombstones
>> covering many rows is under-represented in the droppable tombstone
>> estimate.
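
To make that last point concrete, here is a toy calculation (plain Java; the class, variable names, and figures are invented for illustration and are not Cassandra code) showing how counting only the two bounds of a range tombstone can understate how much data is actually shadowed:

    // Toy model of a droppable-tombstone estimate that counts a range
    // tombstone as its two bound markers, regardless of how much it covers.
    public class TombstoneEstimateSketch {
        public static void main(String[] args) {
            long rowTombstones = 100;           // single-row deletions in the sstable
            long rangeTombstones = 5;           // range deletions
            long rowsCoveredPerRange = 10_000;  // rows shadowed by each range deletion
            long liveRows = 1_000_000;

            // As described above: each range tombstone contributes its two
            // bounds, i.e. 2 markers, no matter how many rows it covers.
            long countedTombstones = rowTombstones + 2 * rangeTombstones;

            // What the sstable "really" holds if every covered row counted.
            long effectivelyDeleted = rowTombstones + rangeTombstones * rowsCoveredPerRange;

            double estimatedRatio = (double) countedTombstones / (countedTombstones + liveRows);
            double effectiveRatio = (double) effectivelyDeleted / (effectivelyDeleted + liveRows);

            System.out.printf("estimated droppable ratio: %.4f%n", estimatedRatio); // ~0.0001
            System.out.printf("effective deleted ratio:   %.4f%n", effectiveRatio); // ~0.0477
        }
    }

With numbers like these, the estimate stays far below the default tombstone_threshold of 0.2 even though roughly 5% of the rows are shadowed by range deletions, which is exactly the under-representation described above.
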
>>
>>> I am also a bit confused by the "tombstone_compaction_interval". If I am
>>> dealing with a big partition in LCS which is receiving new records every
>>> day, and a weekly incremental repair job is continuously anticompacting
>>> the data and thus creating sstables, what is the likelihood of the
>>> default interval (10 days) actually being hit?
>>
>> It will be hit, but probably only in the repaired data. Once the data is
>> marked repaired, it shouldn't be anticompacted again, and should get old
>> enough to pass the compaction interval. That shouldn't be an issue
>> though, because you should be running repair often enough that data is
>> repaired before it can ever get past the gc grace period. Otherwise
>> you'll have other problems. Also, keep in mind that tombstone eviction is
>> a part of all compactions; it's just that occasionally a compaction is
>> run specifically for that purpose. Finally, you probably shouldn't run
>> incremental repair on data that is deleted. There is a design flaw in the
>> incremental repair used in pre-4.0 versions of Cassandra that can cause
>> consistency issues. It can also cause a *lot* of overstreaming, so you
>> might want to take a look at how much streaming your cluster is doing
>> with full repairs and with incremental repairs. It might actually be more
>> efficient to run full repairs.
>>
>> Hope that helps,
>>
>> Blake
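
The eviction rule described earlier in the thread (a tombstone must be past the gc grace period, and must not be shadowing older data that lives in sstables outside the compaction) boils down to roughly the following check. This is a simplified sketch with invented names, not Cassandra's actual compaction code:

    import java.util.List;

    // Toy version of the "can this tombstone be purged?" decision.
    public class TombstoneEvictionSketch {

        static boolean canDrop(long tombstoneTimestampMicros,
                               long localDeletionTimeSeconds,
                               long gcGraceSeconds,
                               long nowSeconds,
                               List<Long> minTimestampsOfOverlappingSstables) {
            // 1. The tombstone must be older than gc_grace_seconds.
            if (nowSeconds - localDeletionTimeSeconds < gcGraceSeconds)
                return false;

            // 2. No sstable outside the compaction that contains the same
            //    partition may hold data older than the tombstone, otherwise
            //    dropping it would let that data reappear.
            for (long minTimestamp : minTimestampsOfOverlappingSstables)
                if (minTimestamp < tombstoneTimestampMicros)
                    return false;

            return true;
        }

        public static void main(String[] args) {
            long now = 1_700_000_000L;                 // "current" time, seconds
            long gcGrace = 10L * 24 * 3600;            // default gc_grace_seconds: 10 days
            long deletedAt = now - 15L * 24 * 3600;    // tombstone written 15 days ago
            long tombstoneTs = deletedAt * 1_000_000;  // write timestamp, microseconds

            // Older data for the partition still sits in another sstable:
            // the tombstone is kept even though gc grace has passed.
            System.out.println(canDrop(tombstoneTs, deletedAt, gcGrace, now,
                                       List.of((deletedAt - 3600) * 1_000_000))); // false

            // No overlapping older data: the tombstone can be purged.
            System.out.println(canDrop(tombstoneTs, deletedAt, gcGrace, now,
                                       List.of((deletedAt + 3600) * 1_000_000))); // true
        }
    }

This is also why running repair well within gc_grace_seconds matters, per Blake's point above: if a deletion has not reached every replica by the time its tombstone becomes purgeable, the shadowed data can be resurrected.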