errose28 commented on PR #8405:
URL: https://github.com/apache/ozone/pull/8405#issuecomment-2860628521

   Thanks for checking this out @sodonnel. I can improve the motivation at the top of this doc, but the driving factor is the same as for any changes we have made to replication manager, reconstruction, or reconciliation: as a storage system, we must prioritize data durability over everything else, and we should never deliberately reduce data durability.
   
   > My observation from past problems on HDFS is that partially failed disks 
are a very large problem. They are hard to detect and sometimes reads on them 
can block for a very long time, resulting in hard to explain slow reads. I'd be 
more in favor of failing bad volumes completely,
   
   This is conflating two different issues with partially failed volumes: performance and durability. This doc is only concerned with data durability, which is the more important of the two. If a disk is causing performance problems, that should be identified with metrics and alerting, which is another area we don't handle well, but that would be a different proposal. We should not remove readable replicas without first copying them just to improve system performance.
   
   > The system is intended to handle the abrupt loss of a datanode or disk at any time, so what is driving the need for this proposal? Are volumes being failed too easily resulting in data loss?
   
   There is a difference between losing copies of data because of an external issue we are responding to, and losing copies of data because we removed them ourselves. In the latter case we are in control, and we need to make new copies before removing existing ones. For reference, our handling of unhealthy replicas previously did not do this (we deleted them on sight), and that was rightfully changed.
   
   > If volumes are being failed too eagerly, then for what reason? Disk full, checksum errors, outright failed reads?
   
   This seems to imply that there is an exact set of criteria for failing a volume, and that anything outside of that is either "too eager" or "not eager enough". Disk failures are a fuzzy problem, and I don't think such an exact set of criteria exists. The purpose of adding an intermediate state is to safely account for this unknown, rather than pin down a binary definition of volume health that becomes closely tied to our durability guarantees.
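   
   To make the intermediate state concrete, here is a minimal sketch of what it could look like. The names and comments are hypothetical illustrations, not the actual classes proposed in the doc:
   
   ```java
   // Hypothetical sketch of a three-state volume health model; not the
   // actual Ozone implementation or its class names.
   public enum VolumeHealth {
     HEALTHY,   // passes scans; eligible for new writes
     DEGRADED,  // still readable but throwing errors; stop placing new data
                // here and copy existing replicas off before any removal
     FAILED     // unreadable; replicas on the volume are treated as lost
   }
   ```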
   
   > We do have mechanisms to repair bad containers already (scanner and 
reconcilor), so that part is handled.
   
   This is true. An alternate proposal would be to keep the current criteria we are using for volume failure, discard all of the checks that this doc currently proposes for moving a volume to degraded health, and let the scanner + reconciler fix things as we go. I considered this approach and I'm actually not opposed to it; my hesitation was that it seems irresponsible to treat volumes that are frequently throwing errors the same as if they were totally healthy. We still cannot choose to fail these reachable volumes without first copying all their data, though.
   
   > What is considered an IO error which can trigger an on-demand scan? Is it a checksum validation or an unexpected EOF / data length error? Are we keeping a sliding window count of each unique block so that 10 failures on the same block only count as 1 rather than 10?
   
   Everything listed here could trigger an on-demand scan. Currently the on-demand volume scanner is plugged into the `catch` blocks of most datanode IO paths. The sliding windows are planned to be tracked at a per-disk level, but this raises a good point: if one bad sector becomes hot, it may artificially make the volume seem worse than it is based purely on scan counts. Deduplicating errors by block within the window, as sketched below, would address that.
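   
   As a rough illustration of the deduplication suggested above, the window could track distinct failed blocks per volume rather than raw error counts. This is only a sketch under assumed names (`VolumeErrorWindow`, `recordFailure`), not the implementation planned in the doc:
   
   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   
   // Hypothetical per-volume failure window that counts each block at most
   // once, so one hot bad sector cannot inflate the volume's error count.
   public class VolumeErrorWindow {
     private final long windowMillis;
     // Latest failure time per block ID; one entry == one counted failure.
     private final Map<Long, Long> failuresByBlock = new ConcurrentHashMap<>();
   
     public VolumeErrorWindow(long windowMillis) {
       this.windowMillis = windowMillis;
     }
   
     // Record an IO error for a block; repeats on the same block overwrite.
     public void recordFailure(long blockId) {
       failuresByBlock.put(blockId, System.currentTimeMillis());
     }
   
     // Number of distinct blocks that failed inside the sliding window.
     public int distinctFailures() {
       long cutoff = System.currentTimeMillis() - windowMillis;
       failuresByBlock.values().removeIf(ts -> ts < cutoff);
       return failuresByBlock.size();
     }
   }
   ```
   
   With this shape, 10 read errors against the same hot block count as one failure toward the degraded threshold, while errors spread across many distinct blocks count in full.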
   
   Overall I agree that there is complexity involved here, and I am not tied to this particular solution. One alternate proposal could be to improve our disk health metrics and dashboards, perhaps surfacing some of that information in Recon, to alert operators when disks have reached a degraded state. But at that point the safe way out would be disk decommissioning, which would be a new feature that looks similar to this one.
   
   Regardless of the proposal, I do think we need change in this area. As stated at the top of the doc, our only two options for handling partial volume failures today are to reduce durability by removing all data on a disk that is potentially still readable, or to swallow disk errors with the scanner and continue putting new data on the volume as if nothing is wrong.
   

