At the moment, when a read error, such as an unrecoverable bit error or
data corruption, occurs in an SSTable data file, manual (or, to be
precise, external) intervention is required to recover from the error,
regardless of the disk_failure_policy configuration.
Commonly, there are two approaches to recover from such an error:
1. The safer but slower recovery strategy: replace the entire node.
2. The less safe but faster recovery strategy: shut down the node,
delete the affected SSTable file(s), then bring the node back online
and run repair (a sketch of this follows the list).
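For reference, strategy 2 typically looks something like the following
on a systemd-managed node. The keyspace/table names and the SSTable
generation here are illustrative; note that all components of the
corrupted SSTable share one filename prefix and must be removed
together:

    nodetool drain                    # flush memtables, stop accepting traffic
    sudo systemctl stop cassandra
    # remove every component (Data, Index, Statistics, ...) of the bad SSTable
    rm /var/lib/cassandra/data/ks/tbl-*/nb-42-big-*
    sudo systemctl start cassandra
    nodetool repair ks tbl            # re-sync the deleted data from replicas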
Based on my understanding of Cassandra, it should be possible to
recover from such an error by marking the affected token range in the
existing SSTable as "corrupted" and stopping reads from it (e.g. by
tracking "bad blocks" in a file or in memory), and then streaming the
affected token range from the healthy replicas. The corrupted SSTable
file can then be removed upon the next successful compaction involving
it, or alternatively an anti-compaction can be performed on it to
remove the corrupted data.
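To make the in-memory variant concrete, here is a minimal,
self-contained sketch of such a "bad block" registry. All names in it
(CorruptedRangeRegistry, markCorrupted, isReadable, and so on) are
hypothetical; it deliberately avoids Cassandra's internal APIs and
models tokens as plain longs, as in the Murmur3Partitioner token space:

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical in-memory registry of unreadable token ranges, per SSTable. */
    public final class CorruptedRangeRegistry {

        /** A half-open token range (start, end]. */
        record TokenRange(long start, long end) {
            boolean contains(long token) {
                return token > start && token <= end;
            }
        }

        // SSTable identifier (e.g. file name) -> ranges known to be unreadable.
        private final Map<String, Set<TokenRange>> badRanges = new ConcurrentHashMap<>();

        /** Called when a checksum or decompression failure is hit during a read. */
        public void markCorrupted(String sstable, TokenRange range) {
            badRanges.computeIfAbsent(sstable, k -> ConcurrentHashMap.newKeySet())
                     .add(range);
            // Log loudly so the disk failure is not masked from the operator.
            System.err.printf("Corrupted range %s in %s; scheduling streaming repair%n",
                              range, sstable);
        }

        /** The read path consults this before trusting data from the SSTable. */
        public boolean isReadable(String sstable, long token) {
            Set<TokenRange> ranges = badRanges.get(sstable);
            return ranges == null || ranges.stream().noneMatch(r -> r.contains(token));
        }

        /** Once compaction/anti-compaction rewrites the SSTable, drop its entries. */
        public void onSSTableRemoved(String sstable) {
            badRanges.remove(sstable);
        }
    }

On a corrupted-range hit, the read would be served from the other
replicas while a streaming repair of just that range is enqueued; once
compaction (or anti-compaction) replaces the file, onSSTableRemoved
clears its entries.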
The advantages of this strategy are:
* Reduced node downtime - no node restart or replacement is needed
* Less data streaming - only the affected token range is streamed
* Faster recovery - less streaming, and the compaction or
anti-compaction can be deferred
* No less safe than replacing the entire node
* The process can be automated internally, removing the need for
operator input
The disadvantages are added complexity on the SSTable read path, and
the risk that it masks disk failures from an operator who is not
paying attention.
What do you think about this?