/– A repair of the affected range would need to be completed among
the replicas without such corruption (including paxos repair)./
It can be safe without a repair by over-streaming the data from more
(or all) available replicas, either within the DC (when a LOCAL_* CL is
used) or across the whole cluster (when another CL is used), and then
performing a compaction locally on the streamed SSTables to get rid of
the duplicate data. Since the read error should only affect a fairly
limited range of tokens, over-streaming should not be an issue in
theory.
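To illustrate why the duplicates from over-streaming are harmless,
here is a rough, self-contained sketch (not real Cassandra code; the
class and method names are made up for illustration). It models each
replica as a map of key -> (value, write timestamp) and merges the
over-streamed rows the same way compaction effectively does, by
keeping the newest version of each cell:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: merging rows streamed from multiple replicas by the
// latest write timestamp (what a compaction effectively does per cell)
// collapses the duplicates into a single, consistent copy.
public class OverStreamMergeSketch {

    record Cell(String value, long writeTimestamp) {}

    // "Compaction" of over-streamed rows: keep the newest version of each key.
    static Map<String, Cell> merge(List<Map<String, Cell>> streamedFromReplicas) {
        Map<String, Cell> merged = new HashMap<>();
        for (Map<String, Cell> replica : streamedFromReplicas) {
            for (Map.Entry<String, Cell> e : replica.entrySet()) {
                merged.merge(e.getKey(), e.getValue(),
                        (a, b) -> a.writeTimestamp() >= b.writeTimestamp() ? a : b);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two replicas hold overlapping copies of the affected token range.
        Map<String, Cell> replica1 = Map.of("k1", new Cell("v1", 10), "k2", new Cell("v2", 20));
        Map<String, Cell> replica2 = Map.of("k1", new Cell("v1-new", 15), "k2", new Cell("v2", 20));

        // Over-streaming from both, then compacting, leaves exactly one row per key.
        System.out.println(merge(List.of(replica1, replica2)));
    }
}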
/– And we'd need a mechanism to execute repair on the affected node
without it being available to respond to queries, either via the
client protocol or via internode (similar to a partial bootstrap)./
The mechanism for not responding to queries already exists. I believe
there may be better ways to do this, but at a minimum, the affected
node could simply drop that read request silently, and the coordinator
will then automatically retry it on other replicas if speculative retry
is enabled, or the client may get a query failure (the "required
responses N, received responses N-1" error).
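As a toy model of that coordinator behaviour (purely illustrative, not
the actual coordinator code; the replica behaviour and counts are
made-up assumptions), a replica that silently drops the read only
causes a client-visible failure when no spare replica is tried:

import java.util.List;
import java.util.function.Supplier;

// Illustrative only: a toy coordinator that needs `required` replica
// responses. One replica silently drops the request; speculative retry lets
// the coordinator ask an extra replica instead of failing the query.
public class SpeculativeRetrySketch {

    static String read(List<Supplier<Boolean>> replicas, int required, boolean speculativeRetry) {
        int received = 0;
        int asked = 0;
        for (Supplier<Boolean> replica : replicas) {
            // Without speculative retry, only the first `required` replicas are contacted.
            if (asked == required && !speculativeRetry) break;
            asked++;
            if (replica.get()) received++;           // true = responded, false = dropped
            if (received == required) return "success";
        }
        return "required responses " + required + ", received responses " + received;
    }

    public static void main(String[] args) {
        // Three replicas; the first one silently drops the read request.
        List<Supplier<Boolean>> replicas = List.of(() -> false, () -> true, () -> true);

        System.out.println(read(replicas, 2, false)); // required responses 2, received responses 1
        System.out.println(read(replicas, 2, true));  // success
    }
}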
/My hunch is that the examples where this is desirable might be
limited though. It might allow one to limp along on a bad drive
momentarily while a proper replacement is bootstrapped, but
typically with disk failures where there's smoke there's fire - I
wouldn't expect a drive reporting uncorrectable errors / filesystem
corruption to be long for this world./
Actually no. Regardless of whether it's a mechanical hard drive or an
SSD, all drives have a certain uncorrectable bit-error rate (UBER).
For example, a consumer grade hard drive may have a UBER of 1 in 1e14,
which means that on average roughly every 11 TiB read will lead to an
unrecoverable read error, rendering an entire 512-byte or 4096-byte
sector unreadable. That's perfectly normal; the hard drive is still in
good health and may last for many more years, if not decades. Consumer
grade SSDs often have a UBER of 1 in 1e15, and data centre grade SSDs
have far better UBER than consumer grade drives, but even then the
best still have a UBER of about 1 in 1e17.
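As a quick sanity check of those numbers (a throwaway calculation,
nothing more):

// Throwaway calculation: expected volume read per uncorrectable bit error
// for a few typical UBER values (one error per N bits read, on average).
public class UberMath {
    public static void main(String[] args) {
        double bytesPerTib = Math.pow(2, 40);
        for (double bitsPerError : new double[] {1e14, 1e15, 1e17}) {
            double bytesPerError = bitsPerError / 8;    // 8 bits per byte
            System.out.printf("UBER 1 in %.0e -> ~%.1f TiB read per error on average%n",
                    bitsPerError, bytesPerError / bytesPerTib);
        }
    }
}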
When managing a cluster of hundreds of Cassandra nodes, each reading
hundreds (if not thousands) of GB of data per day, the probability of
hitting an uncorrectable bit error is pretty high. The Cassandra
cluster of approximately 300 nodes I manage hits this fairly often,
and replacing nodes for the sake of data consistency has become a
chore.
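To put rough numbers on that (the node count is mine, but the per-node
read volume and the UBER below are assumptions for illustration, not
measurements):

// Back-of-the-envelope: expected uncorrectable read errors across a cluster.
// The per-node read volume and the UBER are assumed values for illustration.
public class ClusterErrorRate {
    public static void main(String[] args) {
        int nodes = 300;
        double bytesReadPerNodePerDay = 500e9;   // assume 500 GB read per node per day
        double uber = 1e-15;                     // assume 1 error per 1e15 bits read

        double bitsReadPerDay = nodes * bytesReadPerNodePerDay * 8;
        double errorsPerDay = bitsReadPerDay * uber;
        System.out.printf("expected errors: ~%.1f per day, ~%.0f per year%n",
                errorsPerDay, errorsPerDay * 365);
    }
}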
On 08/03/2023 16:53, C. Scott Andreas wrote:
For this to be safe, my understanding is that:
– A repair of the affected range would need to be completed among the
replicas without such corruption (including paxos repair).
– And we'd need a mechanism to execute repair on the affected node
without it being available to respond to queries, either via the
client protocol or via internode (similar to a partial bootstrap).
My hunch is that the examples where this is desirable might be
limited though. It might allow one to limp along on a bad drive
momentarily while a proper replacement is bootstrapped, but typically
with disk failures where there's smoke there's fire - I wouldn't
expect a drive reporting uncorrectable errors / filesystem corruption
to be long for this world.
Can you say more about the scenarios you have in mind?
– Scott
On Mar 8, 2023, at 5:24 AM, Bowen Song via dev
<dev@cassandra.apache.org> wrote:
At the moment, when a read error, such as an unrecoverable bit error
or data corruption, occurs in the SSTable data files, regardless of
the disk_failure_policy configuration, manual (or to be precise,
external) intervention is required to recover from the error.
Commonly, there are two approaches to recover from such an error:
1. The safer, but slower recovery strategy: replace the entire node.
2. The less safe, but faster recovery strategy: shut down the node,
delete the affected SSTable file(s), and then bring the node back
online and run repair.
Based on my understanding of Cassandra, it should be possible to
recover from such an error by marking the affected token range in the
existing SSTable as "corrupted" and no longer reading from it (e.g. by
creating a "bad block" file or an in-memory record), and then
streaming the affected token range from the healthy replicas. The
corrupted SSTable file can then be removed upon the next successful
compaction involving it, or alternatively an anti-compaction can be
performed on it to remove the corrupted data.
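To make the idea a bit more concrete, here is a very rough sketch of
what the "bad block" bookkeeping could look like (all names are made
up for illustration; this is not a design for the actual read path):

import java.util.ArrayList;
import java.util.List;

// Illustrative only: per-SSTable bookkeeping of "corrupted" token ranges.
// A real implementation would sit on the SSTable read path and persist the
// ranges (e.g. in a "bad block" file); this only shows the bookkeeping idea.
public class CorruptedRangeSketch {

    record TokenRange(long start, long end) {            // [start, end), simplified
        boolean contains(long token) { return token >= start && token < end; }
    }

    private final List<TokenRange> corrupted = new ArrayList<>();

    // Called when a read error is detected: mark the affected range and
    // (elsewhere) trigger streaming of that range from healthy replicas.
    void markCorrupted(TokenRange range) { corrupted.add(range); }

    // Consulted on read: rows in a corrupted range must not be served from
    // this SSTable; they are served from the newly streamed data instead.
    boolean isReadable(long token) {
        return corrupted.stream().noneMatch(r -> r.contains(token));
    }

    public static void main(String[] args) {
        CorruptedRangeSketch sstable = new CorruptedRangeSketch();
        sstable.markCorrupted(new TokenRange(1000, 2000));
        System.out.println(sstable.isReadable(500));   // true
        System.out.println(sstable.isReadable(1500));  // false - read from streamed data
    }
}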
The advantages of this strategy are:
* Reduced node downtime - a node restart or replacement is not needed
* Less data streaming is required - only the affected token range
* Faster recovery time - less streaming, and delayed compaction or
anti-compaction
* No less safe than replacing the entire node
* This process can be automated internally, removing the need for
operator input
The disadvantage is added complexity on the SSTable read path, and it
may mask disk failures from an operator who is not paying attention to
them.
What do you think about this?