At the moment, when a read error, such as an unrecoverable bit error or data corruption, occurs in an SSTable data file, manual (or, to be precise, external) intervention is required to recover from the error, regardless of the disk_failure_policy configuration.
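
For context, here is a rough sketch of the disk_failure_policy options
from cassandra.yaml as I understand them; the relevant point is that
none of them repairs the affected data - they only decide how much of
the node to take offline:

    // The disk_failure_policy options, roughly as documented. None of
    // these recovers the corrupted data; at best they route around it.
    enum DiskFailurePolicy
    {
        DIE,           // shut down gossip and client transports, kill the JVM
        STOP_PARANOID, // like STOP, but also on single-SSTable corruption
        STOP,          // shut down gossip and client transports
        BEST_EFFORT,   // stop using the failed disk, serve from what remains
        IGNORE         // log the error and carry on
    }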

Commonly, there are two approaches to recovering from such an error:

1. The safer, but slower, recovery strategy: replace the entire node.
2. The less safe, but faster, recovery strategy: shut down the node,
   delete the affected SSTable file(s), then bring the node back
   online and run repair.

Based on my understanding of Cassandra, it should be possible to recover from such an error by marking the affected token range in the existing SSTable as "corrupted" and stopping reads from it (e.g. by recording the bad ranges in a "bad block" file or in memory), and then streaming the affected token range from the healthy replicas. The corrupted SSTable file can then be removed upon the next successful compaction involving it, or alternatively an anti-compaction can be performed on it to remove the corrupted data.
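
To make the idea more concrete, below is a minimal, hypothetical sketch
of such a "bad block" registry in Java. Every type and method name is
invented for illustration - none of this is an existing Cassandra API:

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Records corrupted token ranges per SSTable. The read path would
    // consult it before serving from that SSTable, and the entry is
    // dropped once a compaction has rewritten the file.
    final class BadBlockRegistry
    {
        // SSTable identifier -> (range start token -> range end token)
        private final Map<String, NavigableMap<Long, Long>> badRanges =
                new ConcurrentHashMap<>();

        // Called when a read hits an unrecoverable error: remember the
        // affected range so later reads skip it and a targeted range
        // repair (streaming from healthy replicas) can be scheduled.
        void markCorrupted(String sstable, long startToken, long endToken)
        {
            badRanges.computeIfAbsent(sstable, k -> new ConcurrentSkipListMap<>())
                     .put(startToken, endToken);
        }

        // Read-path check: does this token fall in a known-bad range?
        boolean isCorrupted(String sstable, long token)
        {
            NavigableMap<Long, Long> ranges = badRanges.get(sstable);
            if (ranges == null)
                return false;
            Map.Entry<Long, Long> floor = ranges.floorEntry(token);
            return floor != null && token <= floor.getValue();
        }

        // Called after a successful compaction (or anti-compaction)
        // has rewritten the SSTable without the corrupted data.
        void clear(String sstable)
        {
            badRanges.remove(sstable);
        }
    }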

The advantages of this strategy are:

 * Reduced node downtime - node restart or replacement is not needed
 * Less data streaming is required - only the affected token range
 * Faster recovery time - less streaming, and the compaction or
   anti-compaction is deferred
 * No less safe than replacing the entire node
 * The process can be automated internally, removing the need for
   operator input (see the sketch after this list)
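
To sketch the automation mentioned in the last point above (again, all
names are invented, and this is only a sketch of the control flow, not
a real implementation):

    // Ties the pieces together, using the hypothetical BadBlockRegistry
    // from the earlier sketch.
    final class CorruptionRecoveryTask implements Runnable
    {
        private final BadBlockRegistry registry;
        private final String sstable;
        private final long startToken, endToken;

        CorruptionRecoveryTask(BadBlockRegistry registry, String sstable,
                               long startToken, long endToken)
        {
            this.registry = registry;
            this.sstable = sstable;
            this.startToken = startToken;
            this.endToken = endToken;
        }

        @Override
        public void run()
        {
            // 1. Stop serving the corrupted range from this SSTable.
            registry.markCorrupted(sstable, startToken, endToken);

            // 2. Stream only the affected token range from healthy
            //    replicas - conceptually a targeted range repair.
            streamRangeFromReplicas(startToken, endToken);

            // 3. The corrupted file is removed by the next compaction
            //    involving it (or by an anti-compaction), after which
            //    registry.clear(sstable) drops the bad-block entry.
        }

        private void streamRangeFromReplicas(long start, long end)
        {
            // Placeholder: a real implementation would hook into the
            // existing repair/streaming machinery.
        }
    }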

The disadvantages are the added complexity on the SSTable read path, and the risk that it masks disk failures from an operator who is not paying attention.

What do you think about this?
