/– A repair of the affected range would need to be completed among
   the replicas without such corruption (including paxos repair)./

It can be made safe without a full repair by over-streaming the data for the affected range from more (or all) available replicas, either within the DC (when a LOCAL_* CL is used) or across the whole cluster (when other CLs are used), and then performing a compaction locally on the streamed SSTables to get rid of the duplicate data. Since a read error should only affect a fairly limited range of tokens, over-streaming in theory should not be an issue.
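
To illustrate why the duplicated data is harmless, here's a minimal, self-contained sketch (illustrative only, not Cassandra's actual classes) of the reconciliation that a local compaction of the over-streamed SSTables performs: duplicate cells received from several replicas merge deterministically to the one with the highest write timestamp.

    // Illustrative sketch only -- not Cassandra's internal API.
    import java.util.*;

    final class ReconcileDemo {
        /** A single column value together with its client-supplied write timestamp. */
        static final class Cell {
            final String key, value;
            final long writeTimestampMicros;
            Cell(String key, String value, long ts) {
                this.key = key; this.value = value; this.writeTimestampMicros = ts;
            }
        }

        /** Merge duplicate cells by key, keeping the latest write (last-write-wins). */
        static Map<String, Cell> reconcile(List<Cell> streamedCells) {
            Map<String, Cell> merged = new HashMap<>();
            for (Cell c : streamedCells) {
                merged.merge(c.key, c, (a, b) ->
                        a.writeTimestampMicros >= b.writeTimestampMicros ? a : b);
            }
            return merged;
        }

        public static void main(String[] args) {
            // The same partition streamed from two replicas, one of them stale.
            List<Cell> streamed = List.of(
                    new Cell("user:42", "alice@old.example", 1_000L),
                    new Cell("user:42", "alice@new.example", 2_000L));
            System.out.println(reconcile(streamed).get("user:42").value); // alice@new.example
        }
    }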


   /– And we'd need a mechanism to execute repair on the affected node
   without it being available to respond to queries, either via the
   client protocol or via internode (similar to a partial bootstrap)./

The mechanism to not respond to queries already exists. I believe there may be better ways to do this, but at a minimum, the affected node could simply drop that read request silently; the coordinator will then automatically retry it on another replica if speculative retry is enabled, or the client may get a query failure (the "required responses N, received responses N-1" error).
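
For illustration only (the class and method names below are made up, not the coordinator's actual code), here is a sketch of the response counting behind that behaviour: a silently dropped replica read is either speculatively retried on another replica or surfaces to the client as a failure.

    import java.util.*;
    import java.util.function.Function;

    final class CoordinatorSketch {
        /** Query replicas until `required` responses arrive; returns empty on failure. */
        static Optional<String> read(List<String> replicas, int required,
                                     boolean speculativeRetry,
                                     Function<String, Optional<String>> queryReplica) {
            int received = 0;
            String result = null;
            for (String replica : replicas) {
                if (received >= required) break;
                Optional<String> response = queryReplica.apply(replica);
                if (response.isPresent()) {
                    received++;
                    result = response.get();
                } else if (!speculativeRetry) {
                    break; // no retry: the client sees "required N, received N-1"
                }
                // with speculative retry the loop simply moves on to the next replica
            }
            return received >= required ? Optional.ofNullable(result) : Optional.empty();
        }

        public static void main(String[] args) {
            // Replica "b" holds the corrupted range and drops the request silently.
            Function<String, Optional<String>> query =
                    r -> r.equals("b") ? Optional.empty() : Optional.of("row from " + r);
            System.out.println(read(List.of("a", "b", "c"), 2, true, query));  // Optional[row from c]
            System.out.println(read(List.of("a", "b", "c"), 2, false, query)); // Optional.empty
        }
    }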


   /My hunch is that the examples where this is desirable might be
   limited though. It might allow one to limp along on a bad drive
   momentarily while a proper replacement is bootstrapped, but
   typically with disk failures where there's smoke there's fire - I
   wouldn't expect a drive reporting uncorrectable errors / filesystem
   corruption to be long for this world./

Actually, no. Regardless of whether it's a mechanical hard drive or an SSD, they all have a certain level of uncorrectable bit error rate (UBER).

For example, a consumer-grade hard drive may have an UBER of 1 in 1e14, which means that on average roughly every 11 TiB read will lead to an unrecoverable read error, resulting in an entire 512-byte or 4096-byte sector becoming unreadable. That's perfectly normal; the drive is still in good health and may last for many more years, if not decades. Consumer-grade SSDs often have an UBER of 1 in 1e15, and data-centre-grade SSDs have a far better UBER than consumer-grade drives, but even then, the best still have an UBER of about 1 in 1e17.
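
To make the arithmetic explicit, here is a tiny snippet reproducing the ~11 TiB figure from the UBER values above (taking 1 TiB as 2^40 bytes):

    public final class UberMath {
        public static void main(String[] args) {
            double tib = Math.pow(2, 40);            // bytes per TiB
            double[] uberBits = {1e14, 1e15, 1e17};  // bits read per expected error
            for (double bits : uberBits) {
                double bytesPerError = bits / 8.0;
                System.out.printf("UBER 1 in %.0e bits -> ~%.1f TiB per unrecoverable read error%n",
                        bits, bytesPerError / tib);
            }
            // Prints approximately: 11.4 TiB, 113.7 TiB, and 11368.7 TiB (~11.1 PiB)
        }
    }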

When managing a cluster of hundreds of Cassandra nodes, each reading hundreds (if not thousands) of GB of data per day, the probability of hitting an uncorrectable bit error is pretty high. The Cassandra cluster of approximately 300 nodes that I manage hits this fairly often, and replacing nodes for the sake of data consistency has become a chore.
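
As a rough illustration of why this happens so often at scale: the 300-node figure is from above, while the 500 GB/day per node and the UBER values below are assumptions picked purely for illustration, modelling errors as a Poisson process.

    public final class ClusterUreOdds {
        public static void main(String[] args) {
            int nodes = 300;
            double bytesReadPerNodePerDay = 500e9;            // assumed: 500 GB/day per node
            double bitsPerDay = nodes * bytesReadPerNodePerDay * 8;

            for (double uber : new double[]{1e-15, 1e-14}) {  // per-bit error probability
                double expectedErrorsPerDay = bitsPerDay * uber;
                double pAtLeastOnePerDay = 1 - Math.exp(-expectedErrorsPerDay);
                System.out.printf("UBER %.0e: ~%.1f expected errors/day, P(>=1 per day) = %.0f%%%n",
                        uber, expectedErrorsPerDay, pAtLeastOnePerDay * 100);
            }
            // ~1.2 errors/day (~70%) at 1e-15, ~12 errors/day (~100%) at 1e-14
        }
    }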


On 08/03/2023 16:53, C. Scott Andreas wrote:
For this to be safe, my understanding is that:

– A repair of the affected range would need to be completed among the replicas without such corruption (including paxos repair).
– And we'd need a mechanism to execute repair on the affected node without it being available to respond to queries, either via the client protocol or via internode (similar to a partial bootstrap).

My hunch is that the examples where this is desirable might be limited though. It might allow one to limp along on a bad drive momentarily while a proper replacement is bootstrapped, but typically with disk failures where there's smoke there's fire - I wouldn't expect a drive reporting uncorrectable errors / filesystem corruption to be long for this world.

Can you say more about the scenarios you have in mind?

– Scott

On Mar 8, 2023, at 5:24 AM, Bowen Song via dev <dev@cassandra.apache.org> wrote:


At the moment, when a read error, such as an unrecoverable bit error or data corruption, occurs in an SSTable data file, regardless of the disk_failure_policy configuration, manual (or, to be precise, external) intervention is required to recover from the error.

Commonly, there are two approaches to recovering from such an error:

 1. The safer, but slower recovery strategy: replace the entire node.
 2. The less safe, but faster recovery strategy: shut down the node,
    delete the affected SSTable file(s), and then bring the node back
    online and run repair.

Based on my understanding of Cassandra, it should be possible to recover from such an error by marking the affected token range in the existing SSTable as "corrupted" and stopping reads from it (e.g. by creating a "bad block" file or an in-memory record), and then streaming the affected token range from the healthy replicas. The corrupted SSTable file can then be removed upon the next successful compaction involving it, or alternatively an anti-compaction can be performed on it to remove the corrupted data.
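
Purely as a sketch of the bookkeeping this would need (none of these classes exist in Cassandra; the names are hypothetical): a per-SSTable record of corrupted token ranges that the read path consults, and that a streaming task can later use to re-fetch the same ranges from healthy replicas.

    import java.util.*;
    import java.util.concurrent.ConcurrentHashMap;

    final class CorruptedRangeTracker {
        /** A half-open token range (start exclusive, end inclusive). */
        static final class TokenRange {
            final long start, end;
            TokenRange(long start, long end) { this.start = start; this.end = end; }
            boolean contains(long token) { return token > start && token <= end; }
        }

        // SSTable file name -> token ranges known to be unreadable in that file
        private final Map<String, List<TokenRange>> badRanges = new ConcurrentHashMap<>();

        /** Called when a read hits an unrecoverable error, instead of failing the whole disk. */
        void markCorrupted(String sstable, TokenRange range) {
            badRanges.computeIfAbsent(sstable, k -> Collections.synchronizedList(new ArrayList<>()))
                     .add(range);
        }

        /** Read-path check: can this SSTable still be consulted for this token? */
        boolean isReadable(String sstable, long token) {
            return badRanges.getOrDefault(sstable, List.of())
                            .stream().noneMatch(r -> r.contains(token));
        }

        /** The ranges that need to be re-streamed from healthy replicas. */
        Collection<TokenRange> rangesToStream() {
            List<TokenRange> all = new ArrayList<>();
            badRanges.values().forEach(all::addAll);
            return all;
        }
    }

In this sketch, isReadable() would be checked before an SSTable is included in a read for a given token, and rangesToStream() would feed the subsequent streaming request to the healthy replicas.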

The advantages of this strategy are:

  * Reduced node downtime - node restart or replacement is not needed
  * Less data streaming is required - only the affected token range
  * Faster recovery time - less streaming and delayed compaction or
    anti-compaction
  * No less safe than replacing the entire node
  * This process can be automated internally, removing the need for
    operator input

The disadvantages are the added complexity on the SSTable read path, and that it may mask disk failures from an operator who is not paying attention.

What do you think about this?

