I'm not seeing any reason why CEP-21 would make this more difficult to implement, besides the fact that it hasn't landed yet.

There are two major potential pitfalls that CEP-21 would help us avoid:

1. Bit-errors beget further bit-errors, so we ought to be resistant to a high frequency of corruption events.
2. Token ownership changes should be avoided while attempting to stream a corrupted token range.

I found some data supporting (1): https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf

If we detect bit-errors and store them in system_distributed, then we need the capacity to throttle that load and ensure that consistency is maintained.
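A minimal sketch of what recording those corruption events could look like, assuming a hypothetical system_distributed.corrupt_token_ranges table and helper wiring; nothing below exists in Cassandra today:

    // Sketch only: records a detected bit-error into a hypothetical
    // system_distributed.corrupt_token_ranges table, throttled so a burst
    // of corruption events cannot flood the cluster with writes.
    import java.util.function.Consumer;
    import com.google.common.util.concurrent.RateLimiter;

    public class CorruptionEventRecorder
    {
        // At most one corruption record per second from this node (tunable).
        private final RateLimiter throttle = RateLimiter.create(1.0);
        private final Consumer<String> cql; // executes a CQL statement

        public CorruptionEventRecorder(Consumer<String> cql)
        {
            this.cql = cql;
        }

        public void record(String keyspace, String table, long rangeStart, long rangeEnd)
        {
            throttle.acquire(); // block rather than overwhelm system_distributed
            // Table name and schema are assumptions for illustration only.
            cql.accept(String.format(
                "INSERT INTO system_distributed.corrupt_token_ranges " +
                "(keyspace_name, table_name, range_start, range_end, detected_at) " +
                "VALUES ('%s', '%s', %d, %d, toTimestamp(now()))",
                keyspace, table, rangeStart, rangeEnd));
        }
    }

A real implementation would also need to decide what consistency level these writes use, since the corruption table must itself stay consistent under failure.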
When we attempt to rectify any bit-error by streaming data from peers, we implicitly take a lock on token ownership. A user needs to know that it is unsafe to change token ownership in a cluster that is currently in the process of repairing a corruption error on one of its instances' disks. CEP-21 makes this sequencing safe, and provides abstractions to better expose this information to operators.

--
Abe

> On Mar 9, 2023, at 10:55 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>
>> Personally, I'd like to see the fix for this issue come after CEP-21. It could be feasible to implement a fix before then, that detects bit-errors on the read path and refuses to respond to the coordinator, implicitly having speculative execution handle the retry against another replica while repair of that range happens. But that feels suboptimal to me when a better framework is on the horizon.
>
> I originally typed something in agreement with you, but the more I think about this, the more a node-local "reject queries for specific token ranges" degradation profile seems like it _could_ work. I don't see an obvious way to remove the need for a human-in-the-loop on fixing things in a pre-CEP-21 world without opening Pandora's box (Gossip + TMD + non-deterministic agreement on ownership state cluster-wide /cry).
>
> And even in a post-CEP-21 world, you're definitely in the "at what point is it better to declare a host dead and replace it" fuzzy territory, where there are no immediately correct answers.
>
> A system_distributed table of corrupt token ranges that are currently being rejected by replicas, with a mechanism to kick off a repair of those ranges, could be interesting.
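A minimal sketch of that node-local "reject queries for specific token ranges" idea, with hypothetical names; this is not existing Cassandra code:

    // Sketch only: a node-local registry of corrupt token ranges that the
    // read path consults before serving a replica read. Ranges are half-open
    // (start, end]; Murmur3 tokens are modeled as plain longs for brevity.
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    public class CorruptRangeRegistry
    {
        private record Range(long start, long end)
        {
            boolean contains(long token) { return token > start && token <= end; }
        }

        private final List<Range> corrupt = new CopyOnWriteArrayList<>();

        public void markCorrupt(long start, long end)
        {
            corrupt.add(new Range(start, end));
        }

        // Called once a successful repair of the range has completed.
        public void clear(long start, long end)
        {
            corrupt.removeIf(r -> r.start() == start && r.end() == end);
        }

        // The read path would call this and fail the replica read (letting
        // speculative retry go to another replica) instead of returning
        // potentially corrupt data.
        public boolean isRejected(long token)
        {
            for (Range r : corrupt)
                if (r.contains(token))
                    return true;
            return false;
        }
    }

A CopyOnWriteArrayList keeps the hot read path lock-free; corruption events should be rare, so the comparatively expensive writes are acceptable.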
>
> On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
>> Thanks for proposing this discussion, Bowen. I see a few different issues here:
>>
>> 1. How do we safely handle corruption of a handful of tokens without taking an entire instance offline for re-bootstrap? This includes refusing to serve read requests for the corrupted token(s), and correctly repairing the data.
>>
>> 2. How do we expose the corruption rate to operators, in a way that lets them decide whether a full disk replacement is worthwhile?
>>
>> 3. When CEP-21 lands, it should become feasible to support ownership draining, which would let us migrate read traffic for a given token range away from an instance where that range is corrupted. Is it worth planning a fix for this issue before CEP-21 lands?
>>
>> I'm also curious whether there's any existing literature on how different filesystems and storage media accommodate bit-errors (correctable and uncorrectable), so we can be consistent with those behaviors.
>>
>> Personally, I'd like to see the fix for this issue come after CEP-21. It could be feasible to implement a fix before then, that detects bit-errors on the read path and refuses to respond to the coordinator, implicitly having speculative execution handle the retry against another replica while repair of that range happens. But that feels suboptimal to me when a better framework is on the horizon.
>>
>> --
>> Abe
>>
>>> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev <dev@cassandra.apache.org> wrote:
>>>
>>> Hi Jeremiah,
>>>
>>> I'm fully aware of that, which is why I said that deleting the affected SSTable files is "less safe".
>>>
>>> If the "bad blocks" logic is implemented and the node aborts the current read query when hitting a bad block, it should remain safe, as the data in other SSTable files will not be used. The streamed data should contain the unexpired tombstones, and that's enough to keep the data consistent on the node.
>>>
>>> Cheers,
>>> Bowen
>>>
>>> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
>>>> It is actually more complicated than just removing the sstable and running repair.
>>>>
>>>> In the face of expired tombstones that might be covering data in other sstables, the only safe way to deal with a bad sstable is to wipe the token range covered by the bad sstable and rebuild/bootstrap that range (or wipe and rebuild the whole node, which is usually easier). If there are expired tombstones in play, they could have already been compacted away on the other replicas, but may not have been compacted away on the current replica, meaning the data they cover could still be present in other sstables on this node. Removing the sstable would resurrect that data. And pulling the range from other nodes does not help, because they may have already compacted away the tombstone, so you won't get it back.
>>>>
>>>> TL;DR: you can't just remove the one sstable; you have to remove all data in the token range covered by the sstable (i.e. all data that sstable may have had a tombstone covering). Then you can stream from the other nodes to get the data back.
>>>>
>>>> -Jeremiah
>>>>
>>>>> On Mar 8, 2023, at 7:24 AM, Bowen Song via dev <dev@cassandra.apache.org> wrote:
>>>>>
>>>>> At the moment, when a read error, such as an unrecoverable bit-error or data corruption, occurs in an SSTable data file, manual (or, to be precise, external) intervention is required to recover from the error, regardless of the disk_failure_policy configuration.
>>>>>
>>>>> Commonly, there are two approaches to recover from such an error:
>>>>>
>>>>> 1. The safer, but slower, recovery strategy: replace the entire node.
>>>>> 2. The less safe, but faster, recovery strategy: shut down the node, delete the affected SSTable file(s), then bring the node back online and run repair.
>>>>>
>>>>> Based on my understanding of Cassandra, it should be possible to recover from such an error by marking the affected token range in the existing SSTable as "corrupted" and stopping reads from it (e.g. via a "bad block" file or an in-memory record), and then streaming the affected token range from the healthy replicas. The corrupted SSTable file can then be removed upon the next successful compaction involving it, or alternatively an anti-compaction can be performed on it to remove the corrupted data.
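To make the "bad block" bookkeeping above concrete, here is a minimal sketch that persists corrupted ranges to a per-SSTable companion file; the file name, format, and classes are assumptions for illustration only:

    // Sketch only: persists corrupted token ranges for one SSTable to a
    // companion "<sstable>.badblocks" file, so the ranges survive restarts
    // and can be consulted on the read path until the SSTable is rewritten.
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.ArrayList;
    import java.util.List;

    public class BadBlockFile
    {
        private final Path path; // e.g. nb-1-big-Data.db.badblocks (hypothetical)

        public BadBlockFile(Path sstableData)
        {
            this.path = sstableData.resolveSibling(sstableData.getFileName() + ".badblocks");
        }

        // Append a corrupted range as a "start,end" line.
        public void markCorrupt(long start, long end)
        {
            try
            {
                Files.writeString(path, start + "," + end + System.lineSeparator(),
                                  StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
            catch (IOException e)
            {
                throw new UncheckedIOException(e);
            }
        }

        // Load all corrupted [start, end] ranges recorded for this SSTable.
        public List<long[]> load() throws IOException
        {
            List<long[]> ranges = new ArrayList<>();
            if (!Files.exists(path))
                return ranges;
            for (String line : Files.readAllLines(path))
            {
                String[] parts = line.split(",");
                ranges.add(new long[] { Long.parseLong(parts[0]), Long.parseLong(parts[1]) });
            }
            return ranges;
        }
    }

A real implementation would delete the companion file together with the SSTable once the compaction or anti-compaction described above removes the corrupted data.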
>>>>>
>>>>> The advantages of this strategy are:
>>>>>
>>>>> * Reduced node downtime: a node restart or replacement is not needed.
>>>>> * Less data streaming: only the affected token range must be streamed.
>>>>> * Faster recovery: less streaming, plus deferred compaction or anti-compaction.
>>>>> * It is no less safe than replacing the entire node.
>>>>> * The process can be automated internally, removing the need for operator input.
>>>>>
>>>>> The disadvantages are added complexity on the SSTable read path, and the risk that it masks disk failures from an operator who is not paying attention.
>>>>>
>>>>> What do you think about this?
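As a closing illustration of the "automated internally" point above, a rough sketch of a background task that repairs marked ranges and clears them on success; all interfaces and names here are assumptions, and the per-range repair is roughly what "nodetool repair -st <start> -et <end> <keyspace>" does today:

    // Sketch only: periodically scans a registry of corrupt token ranges and
    // kicks off a ranged repair for each, clearing the marker once the repair
    // succeeds. All names and interfaces are illustrative, not Cassandra's.
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class CorruptRangeRepairer
    {
        interface Registry
        {
            List<long[]> snapshot();          // currently marked [start, end] pairs
            void clear(long start, long end); // called after a successful repair
        }

        interface RangedRepair
        {
            boolean repair(String keyspace, long start, long end); // true on success
        }

        private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

        public void start(Registry registry, RangedRepair repair, String keyspace)
        {
            // Re-check every 10 minutes; failed repairs stay marked and are retried.
            scheduler.scheduleWithFixedDelay(() -> {
                for (long[] range : registry.snapshot())
                    if (repair.repair(keyspace, range[0], range[1]))
                        registry.clear(range[0], range[1]);
            }, 1, 10, TimeUnit.MINUTES);
        }
    }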