For this to be safe, my understanding is that:

– A repair of the affected range would need to be completed among the replicas
without such corruption (including Paxos repair).
– And we'd need a mechanism to execute repair on the affected node without it
being available to respond to queries, either via the client protocol or via
internode (similar to a partial bootstrap).

My hunch is that the cases where this is desirable might be limited, though. It
might allow one to limp along on a bad drive momentarily while a proper
replacement is bootstrapped, but typically with disk failures, where there's
smoke there's fire - I wouldn't expect a drive reporting uncorrectable errors /
filesystem corruption to be long for this world.

Can you say more about the scenarios you have in mind?

– Scott

On Mar 8, 2023, at 5:24 AM, Bowen Song via dev <dev@cassandra.apache.org> wrote:

At the moment, when a read error,
such as unrecoverable bit error
or data corruption, occurs in the SSTable data files, regardless
of the disk_failure_policy configuration, manual (or to be
precise, external) intervention is required to recover from the
error.

Commonly, there are two approaches to recovering from such an error:

– The safer, but slower recovery strategy: replace the entire node.
– The less safe, but faster recovery strategy: shut down the node, delete the
affected SSTable file(s), and then bring the node back online and run repair.

Based on my understanding of Cassandra, it should be possible to
recover from such an error by marking the affected token range in the
existing SSTable as "corrupted" and no longer reading from it (e.g. by
recording it in a "bad block" file or in memory), and then streaming the
affected token range from the healthy replicas. The corrupted
SSTable file can then be removed upon the next successful
compaction involving it, or alternatively an anti-compaction can be
performed on it to remove the corrupted data.

The advantages of this strategy are:

– Reduced node downtime - node restart or replacement is not needed
– Less data streaming is required - only the affected token range
– Faster recovery time - less streaming and delayed compaction or
anti-compaction
– No less safe than replacing the entire node
– This process can be automated internally, removing the need for operator
input

The disadvantages are added complexity on the SSTable read path and the risk of
masking disk failures from an operator who is not paying attention.

What do you think about this?
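
As a rough illustration of the bookkeeping discussed in this thread, here is a minimal sketch in Python (Cassandra itself is written in Java; Python is used only for brevity). It is not Cassandra code: `TokenRange`, `CorruptionTracker`, and all method names are hypothetical, and the sketch assumes non-wrapping token ranges that exclude the start token and include the end token. It tracks two separate things: which ranges of which SSTable the read path should skip, and which ranges the node should not serve at all until they have been streamed from healthy replicas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRange:
    """Token range (start, end]; wrap-around is ignored in this sketch."""
    start: int
    end: int

    def contains(self, token: int) -> bool:
        return self.start < token <= self.end


class CorruptionTracker:
    """Hypothetical in-memory "bad block" list for one node."""

    def __init__(self):
        # SSTable name -> ranges the read path must skip in that SSTable.
        self.quarantined = {}
        # Ranges the node must not serve until repair/streaming completes.
        self.awaiting_repair = set()

    def mark_corrupted(self, sstable: str, token_range: TokenRange) -> None:
        self.quarantined.setdefault(sstable, set()).add(token_range)
        self.awaiting_repair.add(token_range)

    def can_serve(self, token: int) -> bool:
        """False while the token's range has not yet been repaired."""
        return not any(r.contains(token) for r in self.awaiting_repair)

    def should_skip(self, sstable: str, token: int) -> bool:
        """True if the read path must not read this token from this SSTable."""
        return any(r.contains(token) for r in self.quarantined.get(sstable, ()))

    def repair_completed(self, token_range: TokenRange) -> None:
        # The range is served again from newly streamed data; the corrupted
        # SSTable stays quarantined until it is compacted away.
        self.awaiting_repair.discard(token_range)

    def sstable_removed(self, sstable: str) -> None:
        self.quarantined.pop(sstable, None)


tracker = CorruptionTracker()
bad = TokenRange(100, 200)
tracker.mark_corrupted("example-Data.db", bad)   # file name is illustrative
tracker.can_serve(150)                           # False until repair completes
tracker.repair_completed(bad)
tracker.can_serve(150)                           # True again
tracker.should_skip("example-Data.db", 150)      # still True until removal
```

Keeping the two sets separate reflects both halves of the discussion: the node stays out of the read path for the affected range until repair has finished, while the corrupted file remains quarantined until a compaction or anti-compaction removes it.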