Hi Jeremiah,
I'm fully aware of that, which is why I said that deleting the affected
SSTable files is "less safe".
If the "bad blocks" logic is implemented and the node abort the current
read query when hitting a bad block, it should remain safe, as the data
in other SSTable files will not be used. The streamed data should
contain the unexpired tombstones, and that's enough to keep the data
consistent on the node.
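For illustration only, here's a minimal sketch of what such a
read-path check could look like. BadBlockRegistry is a made-up name,
not an existing Cassandra class, and the byte-offset bookkeeping is
just one possible representation:

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    /** Hypothetical registry of known-bad byte ranges per SSTable file. */
    final class BadBlockRegistry {
        // file name -> start offset -> end offset (exclusive) of bad ranges
        private final ConcurrentMap<String, ConcurrentSkipListMap<Long, Long>> badRanges =
                new ConcurrentHashMap<>();

        void markBad(String sstable, long start, long end) {
            badRanges.computeIfAbsent(sstable, k -> new ConcurrentSkipListMap<>())
                     .put(start, end);
        }

        /** Abort the read (by throwing) rather than skip the bad block, so
         *  data in other SSTables is never served without its tombstones. */
        void checkReadable(String sstable, long offset) {
            NavigableMap<Long, Long> ranges = badRanges.get(sstable);
            if (ranges == null) return;
            Map.Entry<Long, Long> entry = ranges.floorEntry(offset);
            if (entry != null && offset < entry.getValue())
                throw new IllegalStateException(
                        "bad block in " + sstable + " at offset " + offset);
        }
    }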
Cheers,
Bowen
On 09/03/2023 15:58, Jeremiah D Jordan wrote:
It is actually more complicated than just removing the sstable and
running repair.
In the face of expired tombstones that might be covering data in other
sstables, the only safe way to deal with a bad sstable is to wipe the
token range in the bad sstable and rebuild/bootstrap that range (or
wipe/rebuild the whole node, which is usually the easier way). If
there are expired tombstones in play, they could have already been
compacted away on the other replicas, but may not have been compacted
away on the current replica, meaning the data they cover could still
be present in other sstables on this node. Removing the sstable would
resurrect that data. And pulling the range from other nodes does not
help, because they may have already compacted away the tombstone, so
you won't get it back.
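As a concrete, made-up example of how the resurrection happens:
1. Row X is written and ends up in sstable B on replicas R1, R2 and R3.
2. X is deleted; the tombstone lands in sstable A on all three replicas.
3. gc_grace_seconds elapses. On R2 and R3, a compaction merges A and B
and drops both the tombstone and X. On R1, A and B have not yet been
compacted together, so X still sits in B under A's expired tombstone.
4. Sstable A on R1 is found corrupt and simply deleted. X in sstable B
is now live again on R1, with nothing covering it.
5. A repair sees X on R1 but not on R2/R3 and streams it back to them:
the deleted row is resurrected cluster-wide.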
TL;DR: you can't just remove the one sstable; you have to remove all
data in the token range covered by the sstable (i.e. all data that
sstable may have had a tombstone covering). Then you can stream from
the other nodes to get the data back.
-Jeremiah
On Mar 8, 2023, at 7:24 AM, Bowen Song via dev
<dev@cassandra.apache.org> wrote:
At the moment, when a read error, such as an unrecoverable bit error
or data corruption, occurs in an SSTable data file, manual (or, to be
precise, external) intervention is required to recover from the error,
regardless of the disk_failure_policy configuration.
Commonly, there are two approaches to recover from such an error:
1. The safer, but slower recovery strategy: replace the entire node.
2. The less safe, but faster recovery strategy: shut down the node,
delete the affected SSTable file(s), then bring the node back online
and run repair.
Based on my understanding of Cassandra, it should be possible to
recover from such an error by marking the affected token range in the
existing SSTable as "corrupted" and no longer reading from it (e.g. by
recording the range in a "bad blocks" file or in memory), and then
streaming the affected token range from the healthy replicas. The
corrupted SSTable file can then be removed upon the next successful
compaction involving it, or alternatively an anti-compaction can be
performed on it to remove the corrupted data.
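A rough sketch of the proposed flow, purely to make the idea concrete.
None of these types exist in Cassandra today; TokenRange, BadBlockLog
and StreamCoordinator are all assumed names:

    import java.util.concurrent.CompletableFuture;

    /** Illustrative sketch of the proposed corrupted-range recovery. */
    final class CorruptRangeRecovery {
        interface TokenRange {}
        interface BadBlockLog { void record(String sstable, TokenRange range); }
        interface StreamCoordinator {
            CompletableFuture<Void> fetchFromReplicas(TokenRange range);
        }

        private final BadBlockLog badBlocks;
        private final StreamCoordinator streams;

        CorruptRangeRecovery(BadBlockLog badBlocks, StreamCoordinator streams) {
            this.badBlocks = badBlocks;
            this.streams = streams;
        }

        /** Called once a read error pins corruption to a token range. */
        CompletableFuture<Void> handleCorruption(String sstable, TokenRange range) {
            // 1. Record the bad range so reads against it abort instead of
            //    silently serving data from other sstables without the
            //    covering tombstones.
            badBlocks.record(sstable, range);
            // 2. Re-fetch the range (data plus unexpired tombstones) from
            //    healthy replicas. The corrupted sstable stays on disk until
            //    the next compaction or anti-compaction drops it.
            return streams.fetchFromReplicas(range);
        }
    }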
The advantages of this strategy are:
* Reduced node downtime - no node restart or replacement is needed
* Less data streaming is required - only the affected token range
* Faster recovery time - less streaming, and the compaction or
anti-compaction is deferred rather than done upfront
* No less safe than replacing the entire node
* The process can be automated internally, removing the need for
operator input
The disadvantages are the added complexity on the SSTable read path,
and the risk of masking disk failures from an operator who is not
paying attention.
What do you think about this?