At the moment, when a read error, such as an unrecoverable bit error or
data corruption, occurs in an SSTable data file, manual (or, to be
precise, external) intervention is required to recover from the error,
regardless of the disk_failure_policy configuration.
Commonly, there are two approaches to recover from such an error:
1. The safer but slower recovery strategy: replace the entire node.
2. The less safe but faster recovery strategy: shut down the node,
delete the affected SSTable file(s), then bring the node back online
and run repair (a sketch of this follows the list).
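For reference, strategy 2 typically looks something like the following
on a systemd-managed node. The keyspace/table names and the SSTable
generation here are illustrative; note that all components of the
corrupted SSTable share one filename prefix and must be removed
together:

    nodetool drain                    # flush memtables, stop accepting traffic
    sudo systemctl stop cassandra
    # remove every component (Data, Index, Statistics, ...) of the bad SSTable
    rm /var/lib/cassandra/data/ks/tbl-*/nb-42-big-*
    sudo systemctl start cassandra
    nodetool repair ks tbl            # re-sync the deleted data from replicas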
Based on my understanding of Cassandra, it should be possible to
recover from such an error by marking the affected token range in the
existing SSTable as "corrupted" and stopping reads from it (e.g. by
tracking "bad blocks" in a file or in memory), and then streaming the
affected token range from the healthy replicas. The corrupted SSTable
file can then be removed upon the next successful compaction involving
it, or alternatively an anti-compaction can be performed on it to
remove the corrupted data.
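To make the in-memory variant concrete, here is a minimal,
self-contained sketch of such a "bad block" registry. All names in it
(CorruptedRangeRegistry, markCorrupted, isReadable, and so on) are
hypothetical; it deliberately avoids Cassandra's internal APIs and
models tokens as plain longs, as in the Murmur3Partitioner token space:

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical in-memory registry of unreadable token ranges, per SSTable. */
    public final class CorruptedRangeRegistry {

        /** A half-open token range (start, end]. */
        record TokenRange(long start, long end) {
            boolean contains(long token) {
                return token > start && token <= end;
            }
        }

        // SSTable identifier (e.g. file name) -> ranges known to be unreadable.
        private final Map<String, Set<TokenRange>> badRanges = new ConcurrentHashMap<>();

        /** Called when a checksum or decompression failure is hit during a read. */
        public void markCorrupted(String sstable, TokenRange range) {
            badRanges.computeIfAbsent(sstable, k -> ConcurrentHashMap.newKeySet())
                     .add(range);
            // Log loudly so the disk failure is not masked from the operator.
            System.err.printf("Corrupted range %s in %s; scheduling streaming repair%n",
                              range, sstable);
        }

        /** The read path consults this before trusting data from the SSTable. */
        public boolean isReadable(String sstable, long token) {
            Set<TokenRange> ranges = badRanges.get(sstable);
            return ranges == null || ranges.stream().noneMatch(r -> r.contains(token));
        }

        /** Once compaction/anti-compaction rewrites the SSTable, drop its entries. */
        public void onSSTableRemoved(String sstable) {
            badRanges.remove(sstable);
        }
    }

On a corrupted-range hit, the read would be served from the other
replicas while a streaming repair of just that range is enqueued; once
compaction (or anti-compaction) replaces the file, onSSTableRemoved
clears its entries.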
The advantages of this strategy are:
* Reduced node downtime - no node restart or replacement is needed
* Less data streaming - only the affected token range is streamed
* Faster recovery - less streaming, and the compaction or
anti-compaction can be deferred
* No less safe than replacing the entire node
* The process can be automated internally, removing the need for
operator input
The disadvantages are added complexity on the SSTable read path, and
the risk that it masks disk failures from an operator who is not
paying attention.
What do you think about this?