Realized I’m somewhat mistaken here -

The repair of the surviving replicas would be necessary for correctness before 
the node with deleted data files can serve client/internode reads.

But repairing the node with deleted data files before bringing it back into the 
cluster is more of an optimization, to keep read repair for queries over the 
affected range from grinding traffic to a halt, than a requirement for safety.
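
To make the ordering concrete, here is a rough sketch of the sequence (the class 
and step names are purely illustrative, not Cassandra APIs or tooling):

    // Illustrative only -- these names do not correspond to any Cassandra class or tool.
    import java.util.List;

    public class CorruptionRecoveryPlan {

        record Step(String description, boolean requiredForCorrectness) {}

        // Ordering matters: repairing the surviving replicas is required before the
        // affected node serves reads; repairing the affected node itself only avoids
        // a flood of read repairs once it is back in the cluster.
        static final List<Step> STEPS = List.of(
            new Step("Stop the affected node and delete the corrupted SSTable file(s)", true),
            new Step("Repair the affected range among the surviving replicas (incl. paxos repair)", true),
            new Step("Repair the affected node itself before it serves client/internode reads", false)
        );

        public static void main(String[] args) {
            for (Step s : STEPS) {
                System.out.printf("%s (required for correctness: %b)%n",
                                  s.description(), s.requiredForCorrectness());
            }
        }
    }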

— Scott

> On Mar 8, 2023, at 8:54 AM, C. Scott Andreas <sc...@paradoxica.net> wrote:
> 
> For this to be safe, my understanding is that:
> 
> – A repair of the affected range would need to be completed among the 
> replicas without such corruption (including paxos repair).
> – And we'd need a mechanism to execute repair on the affected node without it 
> being available to respond to queries, either via the client protocol or via 
> internode (similar to a partial bootstrap).
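> 
> Purely as a hypothetical sketch of that second mechanism (none of these names 
> exist in Cassandra; they only illustrate the idea of a node that participates 
> in repair while refusing client and internode reads):
> 
>     // Hypothetical sketch only -- not an existing Cassandra interface.
>     public interface RepairOnlyMode {
>         /** Stop accepting client (native protocol) reads and internode read requests. */
>         void enterRepairOnlyMode();
> 
>         /** Run repair for the affected token ranges while reads are disabled. */
>         void repairAffectedRanges();
> 
>         /** Resume serving reads once repair of the affected ranges has completed. */
>         void resumeNormalService();
>     }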
> 
> My hunch is that the cases where this is desirable might be limited 
> though. It might allow one to limp along on a bad drive momentarily while a 
> proper replacement is bootstrapped, but typically with disk failures where 
> there's smoke there's fire - I wouldn't expect a drive reporting 
> uncorrectable errors / filesystem corruption to be long for this world.
> 
> Can you say more about the scenarios you have in mind?
> 
> – Scott
> 
>> On Mar 8, 2023, at 5:24 AM, Bowen Song via dev <dev@cassandra.apache.org> 
>> wrote:
>> 
>> 
>> At the moment, when a read error, such as an unrecoverable bit error or data 
>> corruption, occurs in an SSTable data file, manual (or, to be precise, external) 
>> intervention is required to recover from the error, regardless of the 
>> disk_failure_policy configuration.
>> 
>> Commonly, there are two approaches to recovering from such an error:
>> 
>> 1. The safer, but slower, recovery strategy: replace the entire node.
>> 2. The less safe, but faster, recovery strategy: shut down the node, delete the 
>>    affected SSTable file(s), then bring the node back online and run repair.
>> Based on my understanding of Cassandra, it should be possible to recover from 
>> such an error by marking the affected token range in the existing SSTable as 
>> "corrupted" and stopping reads from it (e.g. via a "bad block" file or an 
>> in-memory record), and then streaming the affected token range from the healthy 
>> replicas. The corrupted SSTable file can then be removed upon the next 
>> successful compaction involving it, or alternatively an anti-compaction can be 
>> performed on it to remove the corrupted data.
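>> 
>> As a rough illustration of the idea (all names below are made up for the 
>> purpose of illustration; they are not existing Cassandra classes), the read 
>> path could consult a per-SSTable record of corrupted token ranges:
>> 
>>     // Illustrative sketch only -- not Cassandra code. Tracks "bad" token
>>     // ranges per SSTable so reads can skip them until the data has been
>>     // re-streamed from healthy replicas.
>>     import java.util.List;
>>     import java.util.Map;
>>     import java.util.concurrent.ConcurrentHashMap;
>>     import java.util.concurrent.CopyOnWriteArrayList;
>> 
>>     class CorruptedRangeTracker {
>>         // Simplified token range (start, end]; real Cassandra ranges can wrap around.
>>         record TokenRange(long start, long end) {
>>             boolean contains(long token) { return token > start && token <= end; }
>>         }
>> 
>>         // SSTable file name -> token ranges known to be unreadable in that file.
>>         private final Map<String, List<TokenRange>> badRanges = new ConcurrentHashMap<>();
>> 
>>         // Called when a read error is detected in an SSTable, instead of
>>         // escalating straight to the disk_failure_policy.
>>         void markCorrupted(String sstable, TokenRange range) {
>>             badRanges.computeIfAbsent(sstable, k -> new CopyOnWriteArrayList<>()).add(range);
>>             // ...then trigger streaming of this range from the healthy replicas.
>>         }
>> 
>>         // Consulted on the read path: skip this SSTable for tokens in a bad range.
>>         boolean isCorrupted(String sstable, long token) {
>>             return badRanges.getOrDefault(sstable, List.of())
>>                             .stream()
>>                             .anyMatch(r -> r.contains(token));
>>         }
>> 
>>         // Called once the SSTable has been removed by compaction or anti-compaction.
>>         void forget(String sstable) {
>>             badRanges.remove(sstable);
>>         }
>>     }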
>> 
>> The advantages of this strategy are:
>> 
>> - Reduced node downtime - no node restart or replacement is needed
>> - Less data streaming - only the affected token range needs to be streamed
>> - Faster recovery time - less streaming, and the compaction or anti-compaction 
>>   can be deferred
>> - No less safe than replacing the entire node
>> - The process can be automated internally, removing the need for operator input
>> The disadvantages are the added complexity on the SSTable read path, and the 
>> risk of masking disk failures from an operator who is not paying close attention.
>> 
>> What do you think about this?
>> 
> 
