Have you taken a look at the new stuff introduced by https://issues.apache.org/jira/browse/CASSANDRA-7019 ? I think it may go a long way toward reducing the need for something as complicated as this, though it is an interesting idea as special handling for bulk deletes. If they were truly just sstables that contained only deletes, the logic from 7019 would probably go a long way. Then again, if you are bulk-inserting deletes, that is what you would end up with, so maybe it already works.
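To make that concrete (schema hypothetical): if a memtable only ever receives partition-level deletes before it flushes, the resulting sstable is nothing but row tombstones, e.g.:

    -- hypothetical table ks.events with partition key user_id
    DELETE FROM ks.events WHERE user_id = 42;
    DELETE FROM ks.events WHERE user_id = 43;
    -- ...only deletes, then flush:
    --   nodetool flush ks events
    -- the flushed sstable contains only partition tombstones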
-Jeremiah

> On Feb 13, 2018, at 6:04 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>
> On Tue, Feb 13, 2018 at 2:38 PM, Carl Mueller <carl.muel...@smartthings.com>
> wrote:
>
>> I'm in the process of doing my second major data purge from a cassandra
>> system.
>>
>> Almost all of my purging is done via row tombstones. While performing this
>> the second time, trying to cajole compaction (in 2.1.x,
>> LeveledCompaction) to goddamn actually compact the data, I've been
>> wondering why there isn't a separate set of sstable infrastructure set up
>> for row deletion tombstones.
>>
>> I'm imagining that row tombstones are written to separate sstables from
>> mainline data updates/appends and range/column tombstones.
>>
>> By writing them to separate sstables, the compaction system can
>> preferentially merge / process them when compacting sstables.
>>
>> This would create an additional sstable for lookup in the bloom filters,
>> granted. I had visions of short-circuiting the lookups to other sstables
>> if a row tombstone was present in one of the special row tombstone
>> sstables.
>>
>
> All of the above sounds really interesting to me, but I suspect it's a LOT
> of work to make it happen correctly.
>
> You'd almost end up with two sets of logs for the LSM - a tombstone
> log/generation and a data log/generation, where the tombstone logs would
> be read-only inputs to data compactions.
>
>
>> But that would only be possible if there were the notion of a "super row
>> tombstone" that permanently deleted a rowkey and invalidated all future
>> writes. Kind of like how a tombstone with a mistakenly huge timestamp
>> becomes a sneaky permanent tombstone, but intended. There could be a
>> special operation / statement to undo this permanent tombstone, and since
>> the row tombstones would be in their own dedicated sstables, they could
>> be processed and compacted more quickly, with prioritization by the
>> compactor.
>>
>
> This part sounds way less interesting to me (other than the fact that you
> can already do this with a timestamp in the future, but it'll gc away at
> gcgs).
>
>
>> I'm thinking there must be something I'm forgetting in the
>> read/write/compaction paths that invalidates this.
>>
>
> There are a lot of places where we do "smart" things to make sure we don't
> accidentally resurrect data. The read path includes old sstables for
> tombstones, for example. Those all need to be concretely identified and
> handled (and tested).
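For anyone wanting to try the future-timestamp trick Jeff mentions, it is roughly this (table hypothetical; CQL write timestamps are microseconds since epoch):

    -- a far-future write timestamp shadows any later write made with a
    -- normal (current-time) timestamp
    DELETE FROM ks.events USING TIMESTAMP 9999999999999999 WHERE user_id = 42;

The only way to "undo" it is another write with a still-higher timestamp, and, as Jeff notes, the tombstone is still eligible for purging once gc_grace_seconds has elapsed (as I understand it, purging keys off the tombstone's local deletion time, not its write timestamp).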