> I'm not seeing any reasons why CEP-21 would make this more difficult to implement

I think I communicated poorly - I was just trying to point out that there's a point at which a host limping along is better put down and replaced than piecemeal flagging range after range dead and working around it, and there's no immediately obvious "correct" answer to where that point is, regardless of what mechanism we're using to hold a cluster-wide view of topology.
> ...CEP-21 makes this sequencing safe...

For sure - I wouldn't advocate for any kind of "automated corrupt data repair" in a pre-CEP-21 world.

On Thu, Mar 9, 2023, at 2:56 PM, Abe Ratnofsky wrote:
> I'm not seeing any reasons why CEP-21 would make this more difficult to implement, besides the fact that it hasn't landed yet.
>
> There are two major potential pitfalls that CEP-21 would help us avoid:
> 1. Bit-errors beget further bit-errors, so we ought to be resistant to a high frequency of corruption events
> 2. Avoid token ownership changes when attempting to stream a corrupted token
>
> I found some data supporting (1) - https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf
>
> If we detect bit-errors and store them in system_distributed, then we need a capacity to throttle that load and ensure that consistency is maintained.
>
> When we attempt to rectify any bit-error by streaming data from peers, we implicitly take a lock on token ownership. A user needs to know that it is unsafe to change token ownership in a cluster that is currently in the process of repairing a corruption error on one of its instances' disks. CEP-21 makes this sequencing safe, and provides abstractions to better expose this information to operators.
>
> --
> Abe
>
>> On Mar 9, 2023, at 10:55 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>>
>>> Personally, I'd like to see the fix for this issue come after CEP-21. It could be feasible to implement a fix before then, that detects bit-errors on the read path and refuses to respond to the coordinator, implicitly having speculative execution handle the retry against another replica while repair of that range happens. But that feels suboptimal to me when a better framework is on the horizon.
>>
>> I originally typed something in agreement with you, but the more I think about this, the more a node-local "reject queries for specific token ranges" degradation profile seems like it _could_ work. I don't see an obvious way to remove the need for a human-in-the-loop on fixing things in a pre-CEP-21 world without opening Pandora's box (Gossip + TMD + non-deterministic agreement on ownership state cluster-wide /cry).
>>
>> And even in a post-CEP-21 world you're definitely in the "at what point is it better to declare a host dead and replace it" fuzzy territory, where there are no immediately correct answers.
>>
>> A system_distributed table of corrupt token ranges that are currently being rejected by replicas, with a mechanism to kick off a repair of those ranges, could be interesting.
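As a rough illustration only, such a table could look something like the sketch below (shown as a CQL string embedded in Java). The table name, columns, and status values are hypothetical - nothing like this exists or has been agreed on. Each row would record one token range that a replica is currently rejecting, so operators or an automated repair job could find it:

    // Illustrative sketch of the proposed system_distributed table.
    // All names are made up; no such table exists in Cassandra today.
    public final class CorruptRangeSchema
    {
        public static final String CREATE_CORRUPT_TOKEN_RANGES =
            "CREATE TABLE IF NOT EXISTS system_distributed.corrupt_token_ranges (" +
            "  keyspace_name text," +
            "  table_name    text," +
            "  host_id       uuid," +        // replica currently rejecting the range
            "  range_start   varint," +      // exclusive start token
            "  range_end     varint," +      // inclusive end token
            "  detected_at   timestamp," +
            "  repair_status text," +        // e.g. 'pending', 'streaming', 'repaired'
            "  PRIMARY KEY ((keyspace_name, table_name), host_id, range_start)" +
            ")";

        private CorruptRangeSchema() {}
    }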
>> On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
>>> Thanks for proposing this discussion Bowen. I see a few different issues here:
>>>
>>> 1. How do we safely handle corruption of a handful of tokens without taking an entire instance offline for re-bootstrap? This includes refusal to serve read requests for the corrupted token(s), and correct repair of the data.
>>> 2. How do we expose the corruption rate to operators, in a way that lets them decide whether a full disk replacement is worthwhile?
>>> 3. When CEP-21 lands it should become feasible to support ownership draining, which would let us migrate read traffic for a given token range away from an instance where that range is corrupted. Is it worth planning a fix for this issue before CEP-21 lands?
>>>
>>> I'm also curious whether there's any existing literature on how different filesystems and storage media accommodate bit-errors (correctable and uncorrectable), so we can be consistent with those behaviors.
>>>
>>> Personally, I'd like to see the fix for this issue come after CEP-21. It could be feasible to implement a fix before then, that detects bit-errors on the read path and refuses to respond to the coordinator, implicitly having speculative execution handle the retry against another replica while repair of that range happens. But that feels suboptimal to me when a better framework is on the horizon.
>>>
>>> --
>>> Abe
>>>
>>>> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev <dev@cassandra.apache.org> wrote:
>>>>
>>>> Hi Jeremiah,
>>>>
>>>> I'm fully aware of that, which is why I said that deleting the affected SSTable files is "less safe".
>>>>
>>>> If the "bad blocks" logic is implemented and the node aborts the current read query when hitting a bad block, it should remain safe, as the data in other SSTable files will not be used. The streamed data should contain the unexpired tombstones, and that's enough to keep the data consistent on the node.
>>>>
>>>> Cheers,
>>>> Bowen
>>>>
>>>> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
>>>>> It is actually more complicated than just removing the sstable and running repair.
>>>>>
>>>>> In the face of expired tombstones that might be covering data in other sstables, the only safe way to deal with a bad sstable is to wipe the token range in the bad sstable and rebuild/bootstrap that range (or wipe/rebuild the whole node, which is usually the easier way). If there are expired tombstones in play, it means they could have already been compacted away on the other replicas, but may not have been compacted away on the current replica, meaning the data they cover could still be present in other sstables on this node. Removing the sstable would mean resurrecting that data. And pulling the range from other nodes does not help, because they may have already compacted away the tombstone, so you won't get it back.
>>>>>
>>>>> TL;DR: you can't just remove the one sstable; you have to remove all data in the token range covered by the sstable (aka all data that sstable may have had a tombstone covering). Then you can stream from the other nodes to get the data back.
>>>>>
>>>>> -Jeremiah
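To make the resurrection hazard concrete, here is a toy illustration in plain Java (not Cassandra code; the maps simply stand in for sstable contents under the stated assumptions):

    import java.util.HashMap;
    import java.util.Map;

    public class ResurrectionDemo
    {
        public static void main(String[] args)
        {
            // Replica A's on-disk state: the corrupt sstable holds an expired
            // tombstone for k1, and an older sstable still holds the value that
            // tombstone was covering.
            Map<String, String> corruptSstable = Map.of("k1", "tombstone (gc_grace expired)");
            Map<String, String> olderSstable   = Map.of("k1", "old-value");
            System.out.println("Corrupt sstable being deleted: " + corruptSstable);

            // The healthy replicas already compacted both the expired tombstone
            // and the old value away, so they hold nothing for k1 and cannot
            // re-supply the tombstone via repair or streaming.
            Map<String, String> healthyReplica = Map.of();

            // "Delete the corrupt sstable and repair" on replica A: the tombstone
            // is gone, nothing comes back from peers, but the older sstable stays.
            Map<String, String> replicaAAfter = new HashMap<>(olderSstable);
            replicaAAfter.putAll(healthyReplica);

            // k1 now reads as "old-value" on A: previously deleted data has been
            // resurrected. Wiping the whole covered token range first would also
            // have removed the older sstable's copy of k1 before streaming.
            System.out.println("k1 on replica A after delete+repair: " + replicaAAfter.get("k1"));
        }
    }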
creating a "bad >>>>>> block" file or in memory), and then streaming the affected token range >>>>>> from the healthy replicas. The corrupted SSTable file can then be >>>>>> removed upon the next successful compaction involving it, or >>>>>> alternatively an anti-compaction is performed on it to remove the >>>>>> corrupted data. >>>>>> >>>>>> The advantage of this strategy is: >>>>>> >>>>>> • Reduced node down time - node restart or replacement is not needed >>>>>> • Less data streaming is required - only the affected token range >>>>>> • Faster recovery time - less streaming and delayed compaction or >>>>>> anti-compaction >>>>>> • No less safe than replacing the entire node >>>>>> • This process can be automated internally, removing the need for >>>>>> operator inputs >>>>>> The disadvantage is added complexity on the SSTable read path and it may >>>>>> mask disk failures from the operator who is not paying attention to it. >>>>>> >>>>>> What do you think about this? >>>>>>