For this to be safe, my understanding is that:

– A repair of the affected range would need to be completed among the replicas without such corruption (including paxos repair).
– And we'd need a mechanism to execute repair on the affected node without it being available to respond to queries, either via the client protocol or via internode (similar to a partial bootstrap).

My hunch is that the examples where this is desirable might be limited, though. It might allow one to limp along on a bad drive momentarily while a proper replacement is bootstrapped, but typically with disk failures, where there's smoke there's fire - I wouldn't expect a drive reporting uncorrectable errors / filesystem corruption to be long for this world.

Can you say more about the scenarios you have in mind?

– Scott

On Mar 8, 2023, at 5:24 AM, Bowen Song via dev <dev@cassandra.apache.org> wrote:
At the moment, when a read error, such as an unrecoverable bit error or data corruption, occurs in the SSTable data files, regardless of the disk_failure_policy configuration, manual (or, to be precise, external) intervention is required to recover from the error.

Commonly, there are two approaches to recover from such an error:

1. The safer but slower recovery strategy: replace the entire node.
2. The less safe but faster recovery strategy: shut down the node, delete the affected SSTable file(s), then bring the node back online and run repair.
Based on my understanding of Cassandra, it should be possible to recover from such an error by marking the affected token range in the existing SSTable as "corrupted" and no longer reading from it (e.g. by recording it in a "bad block" file or in memory), and then streaming the affected token range from the healthy replicas. The corrupted SSTable file can then be removed upon the next successful compaction involving it, or alternatively an anti-compaction can be performed on it to remove the corrupted data.
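To make the idea a bit more concrete, here is a minimal sketch (in Java) of what such a "bad block" registry consulted on the read path could look like. All names here (CorruptedRangeRegistry, markCorrupted, isReadable) are made up for illustration and are not the existing Cassandra read-path API; tokens are modelled as plain longs for brevity:

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;

    /**
     * Minimal sketch of a per-SSTable "bad block" registry consulted on the
     * read path. Hypothetical names; not the existing Cassandra API.
     */
    final class CorruptedRangeRegistry {

        /** A closed token interval [left, right] known to be unreadable in one SSTable. */
        record BadRange(long left, long right) {
            boolean contains(long token) {
                return token >= left && token <= right;
            }
        }

        /** SSTable identifier (e.g. file name or generation) -> its corrupted ranges. */
        private final Map<String, List<BadRange>> badRanges = new ConcurrentHashMap<>();

        /** Record a corrupted range, e.g. after a checksum failure or I/O error. */
        void markCorrupted(String sstable, long left, long right) {
            badRanges.computeIfAbsent(sstable, k -> new CopyOnWriteArrayList<>())
                     .add(new BadRange(left, right));
            // A log message or metric here would keep the disk failure visible
            // to the operator.
        }

        /** Read-path check: if false, skip this SSTable and rely on other replicas. */
        boolean isReadable(String sstable, long token) {
            List<BadRange> ranges = badRanges.get(sstable);
            return ranges == null || ranges.stream().noneMatch(r -> r.contains(token));
        }

        /** Drop the markers once a compaction/anti-compaction has removed the corrupted data. */
        void clear(String sstable) {
            badRanges.remove(sstable);
        }
    }

Whether the markers live purely in memory or in a small "bad block" file mainly affects restarts: a file would let the node keep skipping the range after a restart until the corrupted SSTable is actually gone.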
The advantages of this strategy are:

1. Reduced node downtime - a node restart or replacement is not needed.
2. Less data streaming is required - only the affected token range.
3. Faster recovery time - less streaming, and delayed compaction or anti-compaction.
4. No less safe than replacing the entire node.
5. The process can be automated internally, removing the need for operator input (see the sketch below).

The disadvantage is added complexity on the SSTable read path, and it may mask disk failures from an operator who is not paying attention to it.

What do you think about this?
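Purely as an illustration of how the automated flow could hang together, here is a rough sketch reusing the registry above. StreamCoordinator and CompactionHooks are hypothetical stand-ins for the existing streaming and compaction machinery, not actual Cassandra classes:

    /**
     * Rough sketch of the automated recovery flow: mark the range as corrupted,
     * stream it back from healthy replicas, and drop the marker once compaction
     * has removed the corrupted data. Hypothetical names throughout.
     */
    final class CorruptionRecovery {

        interface StreamCoordinator {
            /** Stream the token range [left, right] of the table from healthy replicas. */
            void streamRangeFromReplicas(String table, long left, long right);
        }

        interface CompactionHooks {
            /** Run the callback after the next successful compaction involving the SSTable. */
            void onNextSuccessfulCompaction(String sstable, Runnable callback);
        }

        private final CorruptedRangeRegistry registry;
        private final StreamCoordinator streams;
        private final CompactionHooks compactions;

        CorruptionRecovery(CorruptedRangeRegistry registry,
                           StreamCoordinator streams,
                           CompactionHooks compactions) {
            this.registry = registry;
            this.streams = streams;
            this.compactions = compactions;
        }

        /** Invoked when a read hits an unrecoverable error in an SSTable. */
        void handleReadError(String table, String sstable, long left, long right) {
            // 1. Stop reading the affected token range from the corrupted SSTable.
            registry.markCorrupted(sstable, left, right);
            // 2. Re-fetch the affected range from the healthy replicas.
            streams.streamRangeFromReplicas(table, left, right);
            // 3. Once the corrupted data is gone from disk, the marker can be dropped.
            compactions.onNextSuccessfulCompaction(sstable, () -> registry.clear(sstable));
        }
    }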
