> Personally, I'd like to see the fix for this issue come after CEP-21. It
> could be feasible to implement a fix before then, that detects bit-errors on
> the read path and refuses to respond to the coordinator, implicitly having
> speculative execution handle the retry against another replica while repair
> of that range happens. But that feels suboptimal to me when a better
> framework is on the horizon.

I originally typed something in agreement with you, but the more I think about this, the more a node-local "reject queries for specific token ranges" degradation profile seems like it _could_ work. I don't see an obvious way to remove the need for a human-in-the-loop on fixing things in a pre-CEP-21 world without opening Pandora's box (Gossip + TMD + non-deterministic agreement on ownership state cluster-wide /cry).
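To make the "reject queries for specific token ranges" idea a bit more concrete, here is a minimal sketch of what a node-local guard on the read path might look like. This is purely illustrative (hypothetical class and method names, plain long tokens instead of Cassandra's Token types, wrap-around ranges ignored), not existing Cassandra code:

```java
// Hypothetical sketch only -- not existing Cassandra code. A node-local registry
// of token ranges this replica refuses to serve, consulted before answering a read.
// Failing fast (rather than answering with possibly-corrupt data) lets the
// coordinator's speculative execution retry against another replica.
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public final class RejectedTokenRanges {

    /** Closed token interval [left, right]; wrap-around ranges omitted for brevity. */
    public record TokenRange(long left, long right) {
        boolean contains(long token) {
            return token >= left && token <= right;
        }
    }

    private final List<TokenRange> rejected = new CopyOnWriteArrayList<>();

    /** Called when corruption is detected while reading a given range. */
    public void reject(TokenRange range) {
        rejected.add(range);
    }

    /** Read-path guard: refuse to serve reads whose token falls in a rejected range. */
    public void checkReadable(long token) {
        for (TokenRange range : rejected) {
            if (range.contains(token)) {
                // The coordinator sees a failure (or speculatively retries another
                // replica) instead of getting data from a known-bad range.
                throw new IllegalStateException(
                        "Token " + token + " is in locally rejected range " + range);
            }
        }
    }
}
```

The human-in-the-loop problem is then "only" about how these rejected ranges get repaired and cleared, which is where the pre-CEP-21 ownership-agreement mess comes in.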
And even in a post-CEP-21 world you're definitely in the "at what point is it better to declare a host dead and replace it" fuzzy territory, where there are no immediately correct answers. A system_distributed table of corrupt token ranges that are currently being rejected by replicas, with a mechanism to kick off a repair of those ranges, could be interesting.

On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
> Thanks for proposing this discussion, Bowen. I see a few different issues here:
>
> 1. How do we safely handle corruption of a handful of tokens without taking
> an entire instance offline for re-bootstrap? This includes refusal to serve
> read requests for the corrupted token(s), and correct repair of the data.
> 2. How do we expose the corruption rate to operators, in a way that lets them
> decide whether a full disk replacement is worthwhile?
> 3. When CEP-21 lands it should become feasible to support ownership draining,
> which would let us migrate read traffic for a given token range away from an
> instance where that range is corrupted. Is it worth planning a fix for this
> issue before CEP-21 lands?
>
> I'm also curious whether there's any existing literature on how different
> filesystems and storage media accommodate bit-errors (correctable and
> uncorrectable), so we can be consistent with those behaviors.
>
> Personally, I'd like to see the fix for this issue come after CEP-21. It
> could be feasible to implement a fix before then, that detects bit-errors on
> the read path and refuses to respond to the coordinator, implicitly having
> speculative execution handle the retry against another replica while repair
> of that range happens. But that feels suboptimal to me when a better
> framework is on the horizon.
>
> --
> Abe
>
>> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev <dev@cassandra.apache.org>
>> wrote:
>>
>> Hi Jeremiah,
>>
>> I'm fully aware of that, which is why I said that deleting the affected
>> SSTable files is "less safe".
>>
>> If the "bad blocks" logic is implemented and the node aborts the current read
>> query when hitting a bad block, it should remain safe, as the data in other
>> SSTable files will not be used. The streamed data should contain the
>> unexpired tombstones, and that's enough to keep the data consistent on the
>> node.
>>
>> Cheers,
>> Bowen
>>
>> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
>>> It is actually more complicated than just removing the sstable and running
>>> repair.
>>>
>>> In the face of expired tombstones that might be covering data in other
>>> sstables, the only safe way to deal with a bad sstable is to wipe the token
>>> range in the bad sstable and rebuild/bootstrap that range (or wipe/rebuild
>>> the whole node, which is usually the easier way). If there are expired
>>> tombstones in play, it means they could have already been compacted away on
>>> the other replicas, but may not have been compacted away on the current
>>> replica, meaning the data they cover could still be present in other
>>> sstables on this node. Removing the sstable will mean resurrecting that
>>> data. And pulling the range from other nodes does not help, because they
>>> can have already compacted away the tombstone, so you won't get it back.
>>>
>>> TL;DR: you can't just remove the one sstable; you have to remove all data in
>>> the token range covered by the sstable (aka all data that sstable may have
>>> had a tombstone covering). Then you can stream from the other nodes to get
>>> the data back.
>>>
>>> -Jeremiah
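To make the resurrection hazard Jeremiah describes concrete, here is a toy model (plain Java, not Cassandra's actual merge code) of last-write-wins merging across two SSTables, under the assumption that gc_grace has already expired so the other replicas no longer hold the tombstone:

```java
// Toy model (not Cassandra's merge code) of the resurrection hazard: the corrupt
// SSTable is the only file on this replica still holding a tombstone whose
// gc_grace has expired, so the other replicas have already compacted it away.
// Deleting just that file brings the shadowed row back, and repair cannot undo
// it because no replica still has the tombstone.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class TombstoneResurrectionDemo {

    record Cell(String key, String value, long timestamp, boolean tombstone) {}

    /** Merge one key across SSTables: highest timestamp wins, tombstones shadow data. */
    static Optional<Cell> read(String key, List<List<Cell>> sstables) {
        return sstables.stream()
                .flatMap(List::stream)
                .filter(c -> c.key().equals(key))
                .max(Comparator.comparingLong(Cell::timestamp))
                .filter(c -> !c.tombstone());
    }

    public static void main(String[] args) {
        List<Cell> oldSSTable     = List.of(new Cell("k1", "stale-value", 10, false));
        List<Cell> corruptSSTable = List.of(new Cell("k1", null,          20, true));

        // With the corrupt file present, the (expired) tombstone still shadows the row.
        System.out.println(read("k1", List.of(oldSSTable, corruptSSTable))); // Optional.empty

        // Delete only the corrupt file: the stale row is resurrected.
        System.out.println(read("k1", List.of(oldSSTable)));                 // Optional[Cell[k1, stale-value, ...]]
    }
}
```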
>>>> On Mar 8, 2023, at 7:24 AM, Bowen Song via dev <dev@cassandra.apache.org>
>>>> wrote:
>>>>
>>>> At the moment, when a read error, such as an unrecoverable bit error or data
>>>> corruption, occurs in the SSTable data files, regardless of the
>>>> disk_failure_policy configuration, manual (or, to be precise, external)
>>>> intervention is required to recover from the error.
>>>>
>>>> Commonly, there are two approaches to recover from such an error:
>>>>
>>>> 1. The safer, but slower, recovery strategy: replace the entire node.
>>>> 2. The less safe, but faster, recovery strategy: shut down the node, delete
>>>> the affected SSTable file(s), and then bring the node back online and run
>>>> repair.
>>>>
>>>> Based on my understanding of Cassandra, it should be possible to recover
>>>> from such an error by marking the affected token range in the existing
>>>> SSTable as "corrupted" and stopping reads from it (e.g. by creating a "bad
>>>> block" file or an in-memory record), and then streaming the affected token
>>>> range from the healthy replicas. The corrupted SSTable file can then be
>>>> removed upon the next successful compaction involving it, or alternatively
>>>> an anti-compaction can be performed on it to remove the corrupted data.
>>>>
>>>> The advantages of this strategy are:
>>>>
>>>> • Reduced node downtime - node restart or replacement is not needed
>>>> • Less data streaming is required - only the affected token range
>>>> • Faster recovery time - less streaming and delayed compaction or
>>>> anti-compaction
>>>> • No less safe than replacing the entire node
>>>> • This process can be automated internally, removing the need for
>>>> operator input
>>>>
>>>> The disadvantage is added complexity on the SSTable read path, and it may
>>>> mask disk failures from an operator who is not paying attention.
>>>>
>>>> What do you think about this?
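A very rough sketch of the bookkeeping the original proposal implies is below (hypothetical names, plain long tokens rather than Cassandra's Token types, no real Cassandra APIs): each SSTable gets a record of corrupted token ranges that the read path skips, and the same ranges are queued for streaming from healthy replicas, with the corrupt file itself going away in a later compaction or anti-compaction.

```java
// Hypothetical sketch of the proposed "bad block" bookkeeping -- not Cassandra code.
// Per-SSTable corrupted token ranges are skipped on read and queued for streaming
// repair; the corrupt file can be dropped once a later compaction (or an
// anti-compaction over the bad ranges) no longer needs it.
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public final class BadBlockRegistry {

    /** Closed token interval [left, right]; wrap-around ranges omitted for brevity. */
    public record TokenRange(long left, long right) {
        boolean contains(long token) { return token >= left && token <= right; }
    }

    // sstable file name -> corrupted token ranges in that file
    private final Map<String, List<TokenRange>> badRanges = new ConcurrentHashMap<>();

    // ranges waiting to be re-streamed from healthy replicas
    private final Queue<TokenRange> pendingRepair = new ConcurrentLinkedQueue<>();

    /** Called when a checksum/bit error is hit while reading `sstable` at `range`. */
    public void markCorrupted(String sstable, TokenRange range) {
        badRanges.computeIfAbsent(sstable, k -> new CopyOnWriteArrayList<>()).add(range);
        pendingRepair.add(range);
        // A real implementation would also persist this (the "bad block" sidecar
        // file) so the information survives a restart.
    }

    /** Read path: should this sstable be consulted for this token at all? */
    public boolean isReadable(String sstable, long token) {
        List<TokenRange> ranges = badRanges.getOrDefault(sstable, List.of());
        return ranges.stream().noneMatch(r -> r.contains(token));
    }

    /** Drained by a background task that streams these ranges from healthy replicas. */
    public Queue<TokenRange> pendingRepairRanges() {
        return pendingRepair;
    }
}
```

The safety argument in the thread hinges on isReadable() being checked before any data from the corrupt file is merged into a read, so that expired tombstones in the bad ranges are never silently dropped before the streamed replacement data arrives.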