/– A repair of the affected range would need to be completed among
the replicas without such corruption (including paxos repair)./
It can be safe without a repair by over-streaming the data from more
(or all) available replicas, either within the DC (when a LOCAL_* CL is
used) or across the whole cluster (when another CL is used), and then
performing a compaction locally on the streamed SSTables to get rid of
the duplicate data. Since the read error should only affect a fairly
limited range of tokens, over-streaming should not be an issue in
theory.
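To illustrate why the duplicates from over-streaming are harmless,
here is a rough, self-contained sketch (not real Cassandra code; the
class and method names are made up for illustration). It models each
replica as a map of key -> (value, write timestamp) and merges the
over-streamed rows the same way compaction effectively does, by
keeping the newest version of each cell:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: merging rows streamed from multiple replicas by the
// latest write timestamp (what a compaction effectively does per cell)
// collapses the duplicates into a single, consistent copy.
public class OverStreamMergeSketch {

    record Cell(String value, long writeTimestamp) {}

    // "Compaction" of over-streamed rows: keep the newest version of each key.
    static Map<String, Cell> merge(List<Map<String, Cell>> streamedFromReplicas) {
        Map<String, Cell> merged = new HashMap<>();
        for (Map<String, Cell> replica : streamedFromReplicas) {
            for (Map.Entry<String, Cell> e : replica.entrySet()) {
                merged.merge(e.getKey(), e.getValue(),
                        (a, b) -> a.writeTimestamp() >= b.writeTimestamp() ? a : b);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two replicas hold overlapping copies of the affected token range.
        Map<String, Cell> replica1 = Map.of("k1", new Cell("v1", 10), "k2", new Cell("v2", 20));
        Map<String, Cell> replica2 = Map.of("k1", new Cell("v1-new", 15), "k2", new Cell("v2", 20));

        // Over-streaming from both, then compacting, leaves exactly one row per key.
        System.out.println(merge(List.of(replica1, replica2)));
    }
}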
/– And we'd need a mechanism to execute repair on the affected node
without it being available to respond to queries, either via the
client protocol or via internode (similar to a partial bootstrap)./
The mechanism for not responding to queries already exists. I believe
there may be better ways to do this, but at a minimum, the affected
node could simply drop that read request silently, and the coordinator
will then automatically retry it on other replicas if speculative retry
is enabled, or the client may get a query failure (the "required
responses N, received responses N-1" error).
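As a toy model of that coordinator behaviour (purely illustrative, not
the actual coordinator code; the replica behaviour and counts are
made-up assumptions), a replica that silently drops the read only
causes a client-visible failure when no spare replica is tried:

import java.util.List;
import java.util.function.Supplier;

// Illustrative only: a toy coordinator that needs `required` replica
// responses. One replica silently drops the request; speculative retry lets
// the coordinator ask an extra replica instead of failing the query.
public class SpeculativeRetrySketch {

    static String read(List<Supplier<Boolean>> replicas, int required, boolean speculativeRetry) {
        int received = 0;
        int asked = 0;
        for (Supplier<Boolean> replica : replicas) {
            // Without speculative retry, only the first `required` replicas are contacted.
            if (asked == required && !speculativeRetry) break;
            asked++;
            if (replica.get()) received++;           // true = responded, false = dropped
            if (received == required) return "success";
        }
        return "required responses " + required + ", received responses " + received;
    }

    public static void main(String[] args) {
        // Three replicas; the first one silently drops the read request.
        List<Supplier<Boolean>> replicas = List.of(() -> false, () -> true, () -> true);

        System.out.println(read(replicas, 2, false)); // required responses 2, received responses 1
        System.out.println(read(replicas, 2, true));  // success
    }
}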
/My hunch is that the examples where this is desirable might be
limited though. It might allow one to limp along on a bad drive
momentarily while a proper replacement is bootstrapped, but
typically with disk failures where there's smoke there's fire - I
wouldn't expect a drive reporting uncorrectable errors / filesystem
corruption to be long for this world./
Actually no. Regardless of whether it's a mechanical hard drive or an
SSD, all drives have a certain uncorrectable bit-error rate (UBER).
For example, a consumer grade hard drive may have a UBER of 1 in 1e14,
which means that on average roughly every 11 TiB read will lead to an
unrecoverable read error, rendering an entire 512-byte or 4096-byte
sector unreadable. That's perfectly normal; the hard drive is still in
good health and may last for many more years, if not decades. Consumer
grade SSDs often have a UBER of 1 in 1e15, and data centre grade SSDs
have far better UBER than consumer grade drives, but even then the
best still have a UBER of about 1 in 1e17.
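As a quick sanity check of those numbers (a throwaway calculation,
nothing more):

// Throwaway calculation: expected volume read per uncorrectable bit error
// for a few typical UBER values (one error per N bits read, on average).
public class UberMath {
    public static void main(String[] args) {
        double bytesPerTib = Math.pow(2, 40);
        for (double bitsPerError : new double[] {1e14, 1e15, 1e17}) {
            double bytesPerError = bitsPerError / 8;    // 8 bits per byte
            System.out.printf("UBER 1 in %.0e -> ~%.1f TiB read per error on average%n",
                    bitsPerError, bytesPerError / bytesPerTib);
        }
    }
}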
When managing a cluster of hundreds of Cassandra nodes, each reading
hundreds (if not thousands) of GB of data per day, the probability of
hitting an uncorrectable bit error is pretty high. The Cassandra
cluster of approximately 300 nodes I manage hits this fairly often,
and replacing nodes for the sake of data consistency has become a
chore.
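To put rough numbers on that (the node count is mine, but the per-node
read volume and the UBER below are assumptions for illustration, not
measurements):

// Back-of-the-envelope: expected uncorrectable read errors across a cluster.
// The per-node read volume and the UBER are assumed values for illustration.
public class ClusterErrorRate {
    public static void main(String[] args) {
        int nodes = 300;
        double bytesReadPerNodePerDay = 500e9;   // assume 500 GB read per node per day
        double uber = 1e-15;                     // assume 1 error per 1e15 bits read

        double bitsReadPerDay = nodes * bytesReadPerNodePerDay * 8;
        double errorsPerDay = bitsReadPerDay * uber;
        System.out.printf("expected errors: ~%.1f per day, ~%.0f per year%n",
                errorsPerDay, errorsPerDay * 365);
    }
}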
On 08/03/2023 16:53, C. Scott Andreas wrote:
For this to be safe, my understanding is that:
– A repair of the affected range would need to be completed among the
replicas without such corruption (including paxos repair).
– And we'd need a mechanism to execute repair on the affected node
without it being available to respond to queries, either via the
client protocol or via internode (similar to a partial bootstrap).
My hunch is that the examples where this is desirable might be
limited though. It might allow one to limp along on a bad drive
momentarily while a proper replacement is bootstrapped, but typically
with disk failures where there's smoke there's fire - I wouldn't
expect a drive reporting uncorrectable errors / filesystem corruption
to be long for this world.
Can you say more about the scenarios you have in mind?
– Scott
On Mar 8, 2023, at 5:24 AM, Bowen Song via dev
<dev@cassandra.apache.org> wrote:
At the moment, when a read error, such as an unrecoverable bit error
or data corruption, occurs in the SSTable data files, regardless of
the disk_failure_policy configuration, manual (or to be precise,
external) intervention is required to recover from the error.
Commonly, there are two approaches to recover from such an error:
1. The safer, but slower recovery strategy: replace the entire node.
2. The less safe, but faster recovery strategy: shut down the node,
delete the affected SSTable file(s), and then bring the node back
online and run repair.
Based on my understanding of Cassandra, it should be possible to
recover from such an error by marking the affected token range in the
existing SSTable as "corrupted" and no longer reading from it (e.g. by
creating a "bad block" file or an in-memory record), and then
streaming the affected token range from the healthy replicas. The
corrupted SSTable file can then be removed upon the next successful
compaction involving it, or alternatively an anti-compaction can be
performed on it to remove the corrupted data.
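To make the idea a bit more concrete, here is a very rough sketch of
what the "bad block" bookkeeping could look like (all names are made
up for illustration; this is not a design for the actual read path):

import java.util.ArrayList;
import java.util.List;

// Illustrative only: per-SSTable bookkeeping of "corrupted" token ranges.
// A real implementation would sit on the SSTable read path and persist the
// ranges (e.g. in a "bad block" file); this only shows the bookkeeping idea.
public class CorruptedRangeSketch {

    record TokenRange(long start, long end) {            // [start, end), simplified
        boolean contains(long token) { return token >= start && token < end; }
    }

    private final List<TokenRange> corrupted = new ArrayList<>();

    // Called when a read error is detected: mark the affected range and
    // (elsewhere) trigger streaming of that range from healthy replicas.
    void markCorrupted(TokenRange range) { corrupted.add(range); }

    // Consulted on read: rows in a corrupted range must not be served from
    // this SSTable; they are served from the newly streamed data instead.
    boolean isReadable(long token) {
        return corrupted.stream().noneMatch(r -> r.contains(token));
    }

    public static void main(String[] args) {
        CorruptedRangeSketch sstable = new CorruptedRangeSketch();
        sstable.markCorrupted(new TokenRange(1000, 2000));
        System.out.println(sstable.isReadable(500));   // true
        System.out.println(sstable.isReadable(1500));  // false - read from streamed data
    }
}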
The advantages of this strategy are:
* Reduced node downtime - a node restart or replacement is not needed
* Less data streaming is required - only the affected token range
* Faster recovery time - less streaming, and delayed compaction or
anti-compaction
* No less safe than replacing the entire node
* This process can be automated internally, removing the need for
operator input
The disadvantage is added complexity on the SSTable read path, and it
may mask disk failures from an operator who is not paying attention to
them.
What do you think about this?