From an operator's view, I think the most reliable indicator is not the
total count of corruption events, but the frequency of those events. Let
me try to explain that with some examples:
1. Many corruption events in a short period of time, then nothing after that:
the disk is probably still healthy (a rough frequency-tracking sketch follows).
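To make the "frequency, not total count" idea concrete, here is a minimal,
hypothetical sketch (not anything in Cassandra today) that tracks corruption
events in a sliding time window so an alert can fire on the rate rather than
the lifetime count; the class and method names are invented for illustration.

    import java.util.ArrayDeque;
    import java.util.Deque;

    /** Hypothetical sketch: count corruption events in a sliding time window,
     *  so an operator can alert on the rate of events rather than the total. */
    public class CorruptionEventWindow
    {
        private final long windowMillis;
        private final Deque<Long> eventTimestamps = new ArrayDeque<>();

        public CorruptionEventWindow(long windowMillis)
        {
            this.windowMillis = windowMillis;
        }

        /** Record a corruption event at the given wall-clock time (millis). */
        public synchronized void record(long nowMillis)
        {
            eventTimestamps.addLast(nowMillis);
            evictOlderThan(nowMillis - windowMillis);
        }

        /** Number of events seen within the window ending at nowMillis. */
        public synchronized int eventsInWindow(long nowMillis)
        {
            evictOlderThan(nowMillis - windowMillis);
            return eventTimestamps.size();
        }

        private void evictOlderThan(long cutoff)
        {
            while (!eventTimestamps.isEmpty() && eventTimestamps.peekFirst() < cutoff)
                eventTimestamps.removeFirst();
        }
    }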
/When we attempt to rectify any bit-error by streaming data from
peers, we implicitly take a lock on token ownership. A user needs to
know that it is unsafe to change token ownership in a cluster that
is currently in the process of repairing a corruption error on one
of its instances./
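A minimal sketch of the invariant described above, assuming a hypothetical
guard object consulted by any operation that changes token ownership; none of
these names exist in Cassandra, they only illustrate the "unsafe to move
tokens while a corruption repair is running" rule.

    import java.util.concurrent.atomic.AtomicInteger;

    /** Hypothetical sketch: refuse token-ownership changes while a corruption
     *  repair (streaming correct data from peers) is still in flight. */
    public class TokenOwnershipGuard
    {
        private final AtomicInteger activeCorruptionRepairs = new AtomicInteger();

        public void corruptionRepairStarted()  { activeCorruptionRepairs.incrementAndGet(); }
        public void corruptionRepairFinished() { activeCorruptionRepairs.decrementAndGet(); }

        /** Called before any operation that moves token ownership
         *  (decommission, move, bootstrap of a replacement, etc.). */
        public void checkTokenChangeAllowed()
        {
            if (activeCorruptionRepairs.get() > 0)
                throw new IllegalStateException(
                    "Token ownership changes are unsafe while a corruption repair is in progress");
        }
    }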
> there's a point at which a host limping along is better put down and replaced
I did a basic literature review and it looks like load (total program-erase
cycles), disk age, and operating temperature all lead to BER increases. We
don't need to build a whole model of disk failure; we could probably…
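Purely to illustrate the "we don't need a whole model" point, here is a
hypothetical heuristic with made-up thresholds; the inputs are the factors
named above (bit-error rate, program-erase cycles, disk age, temperature),
and nothing here is measured or proposed in the thread.

    /** Hypothetical sketch: a coarse "replace this disk?" heuristic based on the
     *  factors mentioned above. Every threshold is a made-up placeholder. */
    public class DiskReplacementHeuristic
    {
        private static final double MAX_ERRORS_PER_DAY = 5.0;      // placeholder
        private static final long   MAX_PE_CYCLES      = 3000;     // placeholder
        private static final int    MAX_AGE_DAYS       = 5 * 365;  // placeholder
        private static final int    MAX_TEMP_C         = 60;       // placeholder

        public static boolean shouldReplace(double bitErrorsPerDay, long peCycles, int ageDays, int tempC)
        {
            // A host far out of range on the error rate, or moderately elevated on
            // several factors, is probably better replaced than worked around.
            int score = 0;
            if (bitErrorsPerDay > MAX_ERRORS_PER_DAY) score += 2;
            if (peCycles > MAX_PE_CYCLES)             score += 1;
            if (ageDays > MAX_AGE_DAYS)               score += 1;
            if (tempC > MAX_TEMP_C)                   score += 1;
            return score >= 2;
        }
    }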
> I'm not seeing any reasons why CEP-21 would make this more difficult to
> implement
I think I communicated poorly - I was just trying to point out that there's a
point at which a host limping along is better put down and replaced than
piecemeal flagging range after range as dead and working around it.
I'm not seeing any reasons why CEP-21 would make this more difficult to
implement, besides the fact that it hasn't landed yet.
There are two major potential pitfalls that CEP-21 would help us avoid:
1. Bit-errors beget further bit-errors, so we ought to be resistant to a high
frequency of corruption events.
> Personally, I'd like to see the fix for this issue come after CEP-21. It
> could be feasible to implement a fix before then, that detects bit-errors on
> the read path and refuses to respond to the coordinator, implicitly having
> speculative execution handle the retry against another replica
Thanks for proposing this discussion, Bowen. I see a few different issues here:
1. How do we safely handle corruption of a handful of tokens without taking an
entire instance offline for re-bootstrap? This includes refusal to serve read
requests for the corrupted token(s), and correct repair of the affected range
(a refusal-to-serve sketch follows below).
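Here is a rough sketch of what that range-level refusal could look like: a
replica-local registry of corrupted token ranges that the read path consults,
throwing instead of answering so the coordinator's speculative retry goes to
another replica. Everything here (class names, the simplified long-based
token ranges) is hypothetical, not the actual Cassandra read path.

    import java.util.Set;
    import java.util.concurrent.CopyOnWriteArraySet;

    /** Hypothetical sketch: a replica-local registry of token ranges known to be
     *  corrupted. The read path consults it and refuses to answer for those
     *  tokens, so the coordinator's speculative retry targets another replica. */
    public class CorruptedRangeRegistry
    {
        /** Inclusive-start/exclusive-end token range; simplified to longs for the sketch. */
        public record TokenRange(long start, long end)
        {
            boolean contains(long token) { return token >= start && token < end; }
        }

        private final Set<TokenRange> corrupted = new CopyOnWriteArraySet<>();

        public void markCorrupted(TokenRange range) { corrupted.add(range); }
        public void markRepaired(TokenRange range)  { corrupted.remove(range); }

        /** Called by the (hypothetical) replica read handler before serving a key. */
        public void checkReadable(long token)
        {
            for (TokenRange range : corrupted)
                if (range.contains(token))
                    throw new IllegalStateException(
                        "Refusing to serve read for token " + token + ": range " + range +
                        " has known corruption and has not been repaired yet");
        }
    }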
Hi Jeremiah,
I'm fully aware of that, which is why I said that deleting the affected
SSTable files is "less safe".
If the "bad blocks" logic is implemented and the node abort the current
read query when hitting a bad block, it should remain safe, as the data
in other SSTable files will not b
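A small sketch of the "bad blocks" idea as described here: remember which
chunks of which SSTable failed to read, and abort any query that touches them
rather than silently skipping the SSTable. The names and the
file-name/offset representation are assumptions for illustration only.

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical sketch of "bad blocks": track which chunks of which SSTable
     *  failed checksum/decompression, and abort queries that need to read them
     *  instead of dropping the whole SSTable from the result. */
    public class BadBlockTracker
    {
        // sstable file name -> set of chunk offsets that failed to read
        private final Map<String, Set<Long>> badChunks = new ConcurrentHashMap<>();

        public void markBad(String sstable, long chunkOffset)
        {
            badChunks.computeIfAbsent(sstable, k -> ConcurrentHashMap.newKeySet()).add(chunkOffset);
        }

        /** Called before reading a chunk; aborting keeps the response correct
         *  because we never pretend the data in that chunk does not exist. */
        public void checkChunk(String sstable, long chunkOffset)
        {
            Set<Long> bad = badChunks.get(sstable);
            if (bad != null && bad.contains(chunkOffset))
                throw new IllegalStateException(
                    "Aborting read: " + sstable + " chunk at offset " + chunkOffset + " is marked bad");
        }
    }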
It is actually more complicated than just removing the sstable and running
repair.
In the face of expired tombstones that might be covering data in other sstables
(removing the sstable that holds a tombstone can resurrect the data it was
shadowing), the only safe way to deal with a bad sstable is to wipe the token
range in the bad sstable and rebuild/bootstrap that range (or wipe/re…
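A toy example of the hazard described here, using a drastically simplified
merge model (nothing like the real storage engine): with the tombstone-holding
"sstable" present the key reads as deleted, and with that sstable removed the
stale value in the other sstable comes back.

    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    /** Hypothetical, drastically simplified model of SSTable merge, only to show
     *  the resurrection hazard: remove the SSTable holding a tombstone and the
     *  older value it shadowed in another SSTable returns. Not Cassandra code. */
    public class TombstoneResurrectionDemo
    {
        record Cell(String value, long timestamp, boolean tombstone) {}

        /** Merge picks the cell with the highest timestamp; a winning tombstone means "no data". */
        static Optional<String> read(String key, List<Map<String, Cell>> sstables)
        {
            Cell winner = null;
            for (Map<String, Cell> sstable : sstables)
            {
                Cell c = sstable.get(key);
                if (c != null && (winner == null || c.timestamp() > winner.timestamp()))
                    winner = c;
            }
            return (winner == null || winner.tombstone()) ? Optional.empty() : Optional.of(winner.value());
        }

        public static void main(String[] args)
        {
            Map<String, Cell> oldData   = Map.of("k", new Cell("stale-value", 1, false));
            Map<String, Cell> tombstone = Map.of("k", new Cell(null, 2, true));

            System.out.println(read("k", List.of(oldData, tombstone))); // Optional.empty  (correctly deleted)
            System.out.println(read("k", List.of(oldData)));            // Optional[stale-value] (resurrected!)
        }
    }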
On Wed, Mar 8, 2023 at 5:25 AM Bowen Song via dev wrote:
> At the moment, when a read error, such as unrecoverable bit error or data
> corruption, occurs in the SSTable data files, regardless of the
> disk_failure_policy configuration, manual (or to be precise, external)
> intervention is required to recover from the error.
/– A repair of the affected range would need to be completed among
the replicas without such corruption (including paxos repair)./
It can be safe without a repair by over-streaming the data from more (or
all) available replicas, either within the DC (when LOCAL_* CL is used)
or across the whole cluster…
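A sketch of the replica selection implied here, assuming a hypothetical helper
that picks streaming sources by consistency level: local-DC replicas for
LOCAL_* levels, all replicas of the range otherwise. The class and record
names are invented for illustration.

    import java.util.List;
    import java.util.stream.Collectors;

    /** Hypothetical sketch: choose which replicas to over-stream the corrupted
     *  range from, based on the consistency level in use. */
    public class OverStreamSourceSelector
    {
        public record Replica(String address, String datacenter) {}

        public static List<Replica> sources(String consistencyLevel, String localDc, List<Replica> allReplicas)
        {
            if (consistencyLevel.startsWith("LOCAL_"))
                return allReplicas.stream()
                                  .filter(r -> r.datacenter().equals(localDc))
                                  .collect(Collectors.toList());
            return allReplicas; // stream from all replicas across the cluster
        }
    }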
Realized I’m somewhat mistaken here -
The repair of the surviving replicas would be necessary for correctness before
the node with deleted data files is able to serve client/internode reads.
But the repair of the node with deleted data files prior to being brought back
into the cluster is more…
For this to be safe, my understanding is that:
– A repair of the affected range would need to be completed among the replicas
without such corruption (including paxos repair).
– And we'd need a mechanism to execute repair on the affected node without it
being available to respond to queries, either…
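The second bullet could look something like the following hypothetical
"repair-only" node mode, in which repair streaming is allowed but client and
internode reads are refused until repair completes; this is a sketch of the
requirement, not an existing Cassandra feature.

    /** Hypothetical sketch: a node state in which repair streams are accepted
     *  but client and internode reads are refused, so the node cannot serve
     *  data for the affected range before it has been repaired. */
    public class NodeReadGate
    {
        public enum Mode { NORMAL, REPAIR_ONLY }

        private volatile Mode mode = Mode.NORMAL;

        public void enterRepairOnlyMode() { mode = Mode.REPAIR_ONLY; }
        public void enterNormalMode()     { mode = Mode.NORMAL; }

        /** Guard on the (hypothetical) read paths: client reads and internode read requests. */
        public void checkReadsAllowed()
        {
            if (mode == Mode.REPAIR_ONLY)
                throw new IllegalStateException(
                    "Node is in repair-only mode; reads are refused until repair completes");
        }

        /** Repair/streaming sessions remain allowed in either mode. */
        public boolean repairAllowed() { return true; }
    }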
At the moment, when a read error, such as unrecoverable bit error or
data corruption, occurs in the SSTable data files, regardless of the
disk_failure_policy configuration, manual (or to be precise, external)
intervention is required to recover from the error.
Commonly, there are two approaches to…
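For context, here is a rough, illustrative mapping of the disk_failure_policy
options to what they do when a corruption error surfaces (simplified and
approximate, not the actual Cassandra implementation); note that none of the
options repairs the data, which is the manual-intervention gap being discussed.

    /** Illustrative sketch (not Cassandra's real code): every disk_failure_policy
     *  option either stops the node or keeps serving around the error, but none
     *  of them repairs the corrupted data. */
    public class DiskFailurePolicySketch
    {
        public enum Policy { DIE, STOP_PARANOID, STOP, BEST_EFFORT, IGNORE }

        public static String onCorruptionError(Policy policy)
        {
            switch (policy)
            {
                case DIE:           return "shut the node down entirely";
                case STOP_PARANOID:
                case STOP:          return "stop transports so the node no longer serves requests";
                case BEST_EFFORT:   return "work around the failing disk and keep serving from the rest";
                case IGNORE:        return "log the error and keep going";
                default:            return "unknown policy";
            }
        }
    }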