> Personally, I'd like to see the fix for this issue come after CEP-21. It
> could be feasible to implement a fix before then that detects bit-errors on
> the read path and refuses to respond to the coordinator, implicitly letting
> speculative execution handle the retry against another replica while repair
> of that range happens. But that feels suboptimal to me when a better
> framework is on the horizon.
I originally typed something in agreement with you, but the more I think about
this, the more a node-local "reject queries for specific token ranges"
degradation profile seems like it _could_ work. I don't see an obvious way to
remove the need for a human-in-the-loop on fixing things in a pre-CEP-21 world
without opening Pandora's box (Gossip + TMD + non-deterministic agreement on
ownership state cluster-wide /cry).

And even in a post-CEP-21 world you're definitely in the "at what point is it
better to declare a host dead and replace it" fuzzy territory, where there are
no immediately correct answers.

A system_distributed table of corrupt token ranges currently being rejected by
replicas, with a mechanism to kick off a repair of those ranges, could be
interesting.
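
Purely as a sketch of what that could look like -- the table name, columns, and
key layout below are all assumptions on my part, modeled on how
SystemDistributedKeyspace declares its tables as CQL strings:

    public final class CorruptRangesSchema
    {
        // Hypothetical schema for corrupt token ranges under active rejection.
        public static final String PARTIAL_CORRUPT_RANGES_CQL =
            "CREATE TABLE IF NOT EXISTS system_distributed.partial_corrupt_ranges ("
            + "keyspace_name text,"
            + "table_name text,"
            + "endpoint inet,"               // replica currently rejecting the range
            + "range_start varint,"          // token bounds of the rejected range
            + "range_end varint,"
            + "detected_at timestamp,"
            + "repair_started_at timestamp," // set once a repair of the range kicks off
            + "PRIMARY KEY ((keyspace_name, table_name), endpoint, range_start))";
    }

A background task could scan this table and schedule a ranged repair for any
row where repair_started_at is unset.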

On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
> Thanks for proposing this discussion, Bowen. I see a few different issues here:
> 
> 1. How do we safely handle corruption of a handful of tokens without taking 
> an entire instance offline for re-bootstrap? This includes refusal to serve 
> read requests for the corrupted token(s), and correct repair of the data.
> 2. How do we expose the corruption rate to operators, in a way that lets them 
> decide whether a full disk replacement is worthwhile? (A sketch of one 
> possible signal follows this list.)
> 3. When CEP-21 lands, it should become feasible to support ownership draining, 
> which would let us migrate read traffic for a given token range away from an 
> instance where that range is corrupted. Is it worth planning a fix for this 
> issue before CEP-21 lands?
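> 
> For 2., a minimal sketch of the kind of signal I mean, using the Dropwizard
> metrics library Cassandra already ships with -- the class and metric names
> here are made up, not an existing metric:
> 
>     import com.codahale.metrics.Meter;
>     import com.codahale.metrics.MetricRegistry;
> 
>     // Hypothetical per-table meter of detected bit-errors, so operators can
>     // alert on the rate and decide whether the disk is worth replacing.
>     public class CorruptionMetrics
>     {
>         private final Meter corruptReads;
> 
>         public CorruptionMetrics(MetricRegistry registry, String ks, String table)
>         {
>             this.corruptReads = registry.meter(ks + "." + table + ".CorruptReads");
>         }
> 
>         public void markCorruptRead()
>         {
>             corruptReads.mark(); // 1m/5m/15m rates come for free from the Meter
>         }
>     }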
> 
> I'm also curious whether there's any existing literature on how different 
> filesystems and storage media accommodate bit-errors (correctable and 
> uncorrectable), so we can be consistent with those behaviors.
> 
> Personally, I'd like to see the fix for this issue come after CEP-21. It
> could be feasible to implement a fix before then that detects bit-errors on
> the read path and refuses to respond to the coordinator, implicitly letting
> speculative execution handle the retry against another replica while repair
> of that range happens. But that feels suboptimal to me when a better
> framework is on the horizon.
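> 
> To make that shape concrete, a toy model of the replica-side behavior -- none
> of these names are Cassandra's actual read path, and scheduleRangeRepair is a
> hypothetical helper:
> 
>     /** Toy model: fail the local read loudly on corruption rather than
>      *  return bad data. The coordinator gets no response from this replica,
>      *  so speculative execution retries against another one while repair
>      *  runs in the background. */
>     class ReplicaReadSketch
>     {
>         static class CorruptBlockException extends RuntimeException {}
> 
>         byte[] readPartition(long token)
>         {
>             byte[] data = readFromSSTable(token);  // stub for the real read
>             if (!checksumMatches(data))            // bit-error detected
>             {
>                 scheduleRangeRepair(token);        // hypothetical: repair this range
>                 throw new CorruptBlockException(); // never ack bad data upstream
>             }
>             return data;
>         }
> 
>         byte[] readFromSSTable(long token) { return new byte[0]; }   // stub
>         boolean checksumMatches(byte[] data) { return true; }        // stub
>         void scheduleRangeRepair(long token) { /* kick off ranged repair */ }
>     }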
> 
> --
> Abe
> 
>> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev <dev@cassandra.apache.org> 
>> wrote:
>> 
>> Hi Jeremiah,
>> 
>> I'm fully aware of that, which is why I said that deleting the affected 
>> SSTable files is "less safe".
>> 
>> If the "bad blocks" logic is implemented and the node abort the current read 
>> query when hitting a bad block, it should remain safe, as the data in other 
>> SSTable files will not be used. The streamed data should contain the 
>> unexpired tombstones, and that's enough to keep the data consistent on the 
>> node.
>> 
>> 
>> Cheers,
>> Bowen
>> 
>> 
>> 
>> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
>>> It is actually more complicated than just removing the sstable and running 
>>> repair.
>>> 
>>> In the face of expired tombstones that might be covering data in other 
>>> sstables, the only safe way to deal with a bad sstable is to wipe the token 
>>> range in the bad sstable and rebuild/bootstrap that range (or wipe/rebuild 
>>> the whole node, which is usually the easier way).  If there are expired 
>>> tombstones in play, it means they could have already been compacted away on 
>>> the other replicas, but may not have been compacted away on the current 
>>> replica, meaning the data they cover could still be present in other 
>>> sstables on this node.  Removing the sstable would mean resurrecting that 
>>> data.  And pulling the range from other nodes does not help, because they 
>>> may have already compacted away the tombstone, so you won't get it back.
>>> 
>>> TL;DR: you can't just remove the one sstable; you have to remove all data in 
>>> the token range covered by the sstable (aka all data that sstable may have 
>>> had a tombstone covering).  Then you can stream from the other nodes to get 
>>> the data back.
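>>> 
>>> To make the hazard concrete, a toy model of the timeline -- this is not
>>> Cassandra code, just maps standing in for sstables, with null as a
>>> tombstone:
>>> 
>>>     import java.util.HashMap;
>>>     import java.util.Map;
>>> 
>>>     public class ResurrectionSketch
>>>     {
>>>         public static void main(String[] args)
>>>         {
>>>             // On this replica: sstable-1 holds the original write...
>>>             Map<String, String> sstable1 = new HashMap<>();
>>>             sstable1.put("x", "v1");
>>> 
>>>             // ...sstable-2 holds the later delete, now past gc_grace_seconds.
>>>             Map<String, String> sstable2 = new HashMap<>();
>>>             sstable2.put("x", null); // tombstone
>>> 
>>>             // Peers have already compacted: tombstone AND covered data are gone.
>>>             Map<String, String> peer = new HashMap<>();
>>> 
>>>             // Operator deletes the corrupted sstable-2, then repairs from peers.
>>>             sstable2 = null;
>>>             System.out.println("streamed: " + peer.get("x")); // null - nothing comes back
>>> 
>>>             // ...but sstable-1 still answers reads: "x" -> "v1" is resurrected.
>>>             System.out.println("read(x) = " + sstable1.get("x"));
>>>         }
>>>     }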
>>> 
>>> -Jeremiah
>>> 
>>>> On Mar 8, 2023, at 7:24 AM, Bowen Song via dev <dev@cassandra.apache.org> 
>>>> wrote:
>>>> 
>>>> At the moment, when a read error, such as an unrecoverable bit error or 
>>>> data corruption, occurs in the SSTable data files, regardless of the 
>>>> disk_failure_policy configuration, manual (or, to be precise, external) 
>>>> intervention is required to recover from the error.
>>>> 
>>>> Commonly, there are two approaches to recovering from such an error:
>>>> 
>>>>  1. The safer, but slower, recovery strategy: replace the entire node.
>>>>  2. The less safe, but faster, recovery strategy: shut down the node, 
>>>> delete the affected SSTable file(s), then bring the node back online and 
>>>> run repair.
>>>> Based on my understanding of Cassandra, it should be possible to recover 
>>>> from such an error by marking the affected token range in the existing 
>>>> SSTable as "corrupted" and no longer reading from it (e.g. tracked in a 
>>>> "bad block" file or in memory), and then streaming the affected token 
>>>> range from the healthy replicas. The corrupted SSTable file can then be 
>>>> removed upon the next successful compaction involving it, or alternatively 
>>>> an anti-compaction can be performed on it to remove the corrupted data.
>>>> 
>>>> The advantages of this strategy are:
>>>> 
>>>>  • Reduced node downtime - no node restart or replacement is needed
>>>>  • Less data streaming - only the affected token range is streamed
>>>>  • Faster recovery time - less streaming, with compaction or 
>>>> anti-compaction deferred
>>>>  • No less safe than replacing the entire node
>>>>  • The process can be automated internally, removing the need for 
>>>> operator input
>>>> The disadvantages are added complexity on the SSTable read path, and the 
>>>> risk that it masks disk failures from an operator who is not paying 
>>>> attention.
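>>>> 
>>>> A minimal sketch of the "bad block" bookkeeping I have in mind -- all
>>>> names are assumptions, and persistence/streaming are only noted in
>>>> comments:
>>>> 
>>>>     import java.util.Map;
>>>>     import java.util.NavigableMap;
>>>>     import java.util.concurrent.ConcurrentHashMap;
>>>>     import java.util.concurrent.ConcurrentSkipListMap;
>>>> 
>>>>     // Reads that would touch a marked range abort instead of falling back
>>>>     // to other SSTables; recovery then streams just that range.
>>>>     public class BadBlockRegistry
>>>>     {
>>>>         // sstable id -> (range start token -> range end token)
>>>>         private final Map<String, NavigableMap<Long, Long>> badRanges =
>>>>             new ConcurrentHashMap<>();
>>>> 
>>>>         public void markCorrupt(String sstableId, long start, long end)
>>>>         {
>>>>             badRanges.computeIfAbsent(sstableId, id -> new ConcurrentSkipListMap<>())
>>>>                      .put(start, end);
>>>>             // Also: persist to a "bad blocks" file next to the sstable, and
>>>>             // trigger streaming of [start, end] from healthy replicas.
>>>>         }
>>>> 
>>>>         public boolean isCorrupt(String sstableId, long token)
>>>>         {
>>>>             NavigableMap<Long, Long> ranges = badRanges.get(sstableId);
>>>>             if (ranges == null)
>>>>                 return false;
>>>>             // nearest marked range starting at or before this token
>>>>             Map.Entry<Long, Long> e = ranges.floorEntry(token);
>>>>             return e != null && token <= e.getValue();
>>>>         }
>>>>     }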
>>>> 
>>>> What do you think about this?
>>>> 
