I'm not seeing any reason why CEP-21 would make this more difficult to implement, besides the fact that it hasn't landed yet.

There are two major potential pitfalls that CEP-21 would help us avoid:

1. Bit-errors beget further bit-errors, so we ought to be resistant to a high frequency of corruption events.
2. Token ownership changes should be avoided while attempting to stream a corrupted token range.

I found some data supporting (1): https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf

If we detect bit-errors and store them in system_distributed, then we need the capacity to throttle that load and ensure that consistency is maintained.
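A minimal sketch of what recording those corruption events could look like, assuming a hypothetical system_distributed.corrupt_token_ranges table and helper wiring; nothing below exists in Cassandra today:

    // Sketch only: records a detected bit-error into a hypothetical
    // system_distributed.corrupt_token_ranges table, throttled so a burst
    // of corruption events cannot flood the cluster with writes.
    import java.util.function.Consumer;
    import com.google.common.util.concurrent.RateLimiter;

    public class CorruptionEventRecorder
    {
        // At most one corruption record per second from this node (tunable).
        private final RateLimiter throttle = RateLimiter.create(1.0);
        private final Consumer<String> cql; // executes a CQL statement

        public CorruptionEventRecorder(Consumer<String> cql)
        {
            this.cql = cql;
        }

        public void record(String keyspace, String table, long rangeStart, long rangeEnd)
        {
            throttle.acquire(); // block rather than overwhelm system_distributed
            // Table name and schema are assumptions for illustration only.
            cql.accept(String.format(
                "INSERT INTO system_distributed.corrupt_token_ranges " +
                "(keyspace_name, table_name, range_start, range_end, detected_at) " +
                "VALUES ('%s', '%s', %d, %d, toTimestamp(now()))",
                keyspace, table, rangeStart, rangeEnd));
        }
    }

A real implementation would also need to decide what consistency level these writes use, since the corruption table must itself stay consistent under failure.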
When we attempt to rectify any bit-error by streaming data from peers, we implicitly take a lock on token ownership. A user needs to know that it is unsafe to change token ownership in a cluster that is currently in the process of repairing a corruption error on one of its instances' disks. CEP-21 makes this sequencing safe, and provides abstractions to better expose this information to operators.

--
Abe

> On Mar 9, 2023, at 10:55 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>
>> Personally, I'd like to see the fix for this issue come after CEP-21. It could be feasible to implement a fix before then, that detects bit-errors on the read path and refuses to respond to the coordinator, implicitly having speculative execution handle the retry against another replica while repair of that range happens. But that feels suboptimal to me when a better framework is on the horizon.
>
> I originally typed something in agreement with you, but the more I think about this, the more a node-local "reject queries for specific token ranges" degradation profile seems like it _could_ work. I don't see an obvious way to remove the need for a human-in-the-loop on fixing things in a pre-CEP-21 world without opening Pandora's box (Gossip + TMD + non-deterministic agreement on ownership state cluster-wide /cry).
>
> And even in a post-CEP-21 world, you're definitely in the "at what point is it better to declare a host dead and replace it" fuzzy territory, where there are no immediately correct answers.
>
> A system_distributed table of corrupt token ranges that are currently being rejected by replicas, with a mechanism to kick off a repair of those ranges, could be interesting.
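A minimal sketch of that node-local "reject queries for specific token ranges" idea, with hypothetical names; this is not existing Cassandra code:

    // Sketch only: a node-local registry of corrupt token ranges that the
    // read path consults before serving a replica read. Ranges are half-open
    // (start, end]; Murmur3 tokens are modeled as plain longs for brevity.
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    public class CorruptRangeRegistry
    {
        private record Range(long start, long end)
        {
            boolean contains(long token) { return token > start && token <= end; }
        }

        private final List<Range> corrupt = new CopyOnWriteArrayList<>();

        public void markCorrupt(long start, long end)
        {
            corrupt.add(new Range(start, end));
        }

        // Called once a successful repair of the range has completed.
        public void clear(long start, long end)
        {
            corrupt.removeIf(r -> r.start() == start && r.end() == end);
        }

        // The read path would call this and fail the replica read (letting
        // speculative retry go to another replica) instead of returning
        // potentially corrupt data.
        public boolean isRejected(long token)
        {
            for (Range r : corrupt)
                if (r.contains(token))
                    return true;
            return false;
        }
    }

A CopyOnWriteArrayList keeps the hot read path lock-free; corruption events should be rare, so the comparatively expensive writes are acceptable.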
>
> On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
>> Thanks for proposing this discussion, Bowen. I see a few different issues here:
>>
>> 1. How do we safely handle corruption of a handful of tokens without taking an entire instance offline for re-bootstrap? This includes refusing to serve read requests for the corrupted token(s), and correctly repairing the data.
>>
>> 2. How do we expose the corruption rate to operators, in a way that lets them decide whether a full disk replacement is worthwhile?
>>
>> 3. When CEP-21 lands, it should become feasible to support ownership draining, which would let us migrate read traffic for a given token range away from an instance where that range is corrupted. Is it worth planning a fix for this issue before CEP-21 lands?
>>
>> I'm also curious whether there's any existing literature on how different filesystems and storage media accommodate bit-errors (correctable and uncorrectable), so we can be consistent with those behaviors.
>>
>> Personally, I'd like to see the fix for this issue come after CEP-21. It could be feasible to implement a fix before then, that detects bit-errors on the read path and refuses to respond to the coordinator, implicitly having speculative execution handle the retry against another replica while repair of that range happens. But that feels suboptimal to me when a better framework is on the horizon.
>>
>> --
>> Abe
>>
>>> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev <dev@cassandra.apache.org> wrote:
>>>
>>> Hi Jeremiah,
>>>
>>> I'm fully aware of that, which is why I said that deleting the affected SSTable files is "less safe".
>>>
>>> If the "bad blocks" logic is implemented and the node aborts the current read query when hitting a bad block, it should remain safe, as the data in other SSTable files will not be used. The streamed data should contain the unexpired tombstones, and that's enough to keep the data consistent on the node.
>>>
>>> Cheers,
>>> Bowen
>>>
>>> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
>>>> It is actually more complicated than just removing the sstable and running repair.
>>>>
>>>> In the face of expired tombstones that might be covering data in other sstables, the only safe way to deal with a bad sstable is to wipe the token range covered by the bad sstable and rebuild/bootstrap that range (or wipe and rebuild the whole node, which is usually easier). If there are expired tombstones in play, they could have already been compacted away on the other replicas, but may not have been compacted away on the current replica, meaning the data they cover could still be present in other sstables on this node. Removing the sstable would resurrect that data. And pulling the range from other nodes does not help, because they may have already compacted away the tombstone, so you won't get it back.
>>>>
>>>> TL;DR: you can't just remove the one sstable; you have to remove all data in the token range covered by the sstable (i.e. all data that sstable may have had a tombstone covering). Then you can stream from the other nodes to get the data back.
>>>>
>>>> -Jeremiah
>>>>
>>>>> On Mar 8, 2023, at 7:24 AM, Bowen Song via dev <dev@cassandra.apache.org> wrote:
>>>>>
>>>>> At the moment, when a read error, such as an unrecoverable bit-error or data corruption, occurs in an SSTable data file, manual (or, to be precise, external) intervention is required to recover from the error, regardless of the disk_failure_policy configuration.
>>>>>
>>>>> Commonly, there are two approaches to recover from such an error:
>>>>>
>>>>> 1. The safer, but slower, recovery strategy: replace the entire node.
>>>>> 2. The less safe, but faster, recovery strategy: shut down the node, delete the affected SSTable file(s), then bring the node back online and run repair.
>>>>>
>>>>> Based on my understanding of Cassandra, it should be possible to recover from such an error by marking the affected token range in the existing SSTable as "corrupted" and stopping reads from it (e.g. via a "bad block" file or an in-memory record), and then streaming the affected token range from the healthy replicas. The corrupted SSTable file can then be removed upon the next successful compaction involving it, or alternatively an anti-compaction can be performed on it to remove the corrupted data.
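To make the "bad block" bookkeeping above concrete, here is a minimal sketch that persists corrupted ranges to a per-SSTable companion file; the file name, format, and classes are assumptions for illustration only:

    // Sketch only: persists corrupted token ranges for one SSTable to a
    // companion "<sstable>.badblocks" file, so the ranges survive restarts
    // and can be consulted on the read path until the SSTable is rewritten.
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.ArrayList;
    import java.util.List;

    public class BadBlockFile
    {
        private final Path path; // e.g. nb-1-big-Data.db.badblocks (hypothetical)

        public BadBlockFile(Path sstableData)
        {
            this.path = sstableData.resolveSibling(sstableData.getFileName() + ".badblocks");
        }

        // Append a corrupted range as a "start,end" line.
        public void markCorrupt(long start, long end)
        {
            try
            {
                Files.writeString(path, start + "," + end + System.lineSeparator(),
                                  StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
            catch (IOException e)
            {
                throw new UncheckedIOException(e);
            }
        }

        // Load all corrupted [start, end] ranges recorded for this SSTable.
        public List<long[]> load() throws IOException
        {
            List<long[]> ranges = new ArrayList<>();
            if (!Files.exists(path))
                return ranges;
            for (String line : Files.readAllLines(path))
            {
                String[] parts = line.split(",");
                ranges.add(new long[] { Long.parseLong(parts[0]), Long.parseLong(parts[1]) });
            }
            return ranges;
        }
    }

A real implementation would delete the companion file together with the SSTable once the compaction or anti-compaction described above removes the corrupted data.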
>>>>>
>>>>> The advantages of this strategy are:
>>>>>
>>>>> * Reduced node downtime: a node restart or replacement is not needed.
>>>>> * Less data streaming: only the affected token range must be streamed.
>>>>> * Faster recovery: less streaming, plus deferred compaction or anti-compaction.
>>>>> * It is no less safe than replacing the entire node.
>>>>> * The process can be automated internally, removing the need for operator input.
>>>>>
>>>>> The disadvantages are added complexity on the SSTable read path, and the risk that it masks disk failures from an operator who is not paying attention.
>>>>>
>>>>> What do you think about this?
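As a closing illustration of the "automated internally" point above, a rough sketch of a background task that repairs marked ranges and clears them on success; all interfaces and names here are assumptions, and the per-range repair is roughly what "nodetool repair -st <start> -et <end> <keyspace>" does today:

    // Sketch only: periodically scans a registry of corrupt token ranges and
    // kicks off a ranged repair for each, clearing the marker once the repair
    // succeeds. All names and interfaces are illustrative, not Cassandra's.
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class CorruptRangeRepairer
    {
        interface Registry
        {
            List<long[]> snapshot();          // currently marked [start, end] pairs
            void clear(long start, long end); // called after a successful repair
        }

        interface RangedRepair
        {
            boolean repair(String keyspace, long start, long end); // true on success
        }

        private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

        public void start(Registry registry, RangedRepair repair, String keyspace)
        {
            // Re-check every 10 minutes; failed repairs stay marked and are retried.
            scheduler.scheduleWithFixedDelay(() -> {
                for (long[] range : registry.snapshot())
                    if (repair.repair(keyspace, range[0], range[1]))
                        registry.clear(range[0], range[1]);
            }, 1, 10, TimeUnit.MINUTES);
        }
    }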