errose28 commented on PR #8405: URL: https://github.com/apache/ozone/pull/8405#issuecomment-2860628521
Thanks for checking this out @sodonnel. I can improve the motivation at the top of this doc, but the driving factor is the same as any changes we have made to replication manager, reconstruction, or reconciliation: as a storage system, we must prioritize data durability over everything else, and we should never deliberately reduce data durability.

> My observation from past problems on HDFS is that partially failed disks are a very large problem. They are hard to detect and sometimes reads on them can block for a very long time, resulting in hard to explain slow reads. I'd be more in favor of failing bad volumes completely,

This is conflating two different issues with partially failed volumes: performance and durability. This doc is only concerned with data durability, which is more important. If a disk is causing performance problems, then that should be identified with metrics and alerting, which we also don't do well, but that would be a different proposal. We should not remove readable replicas without first copying them just to improve system performance.

> The system is intended to handle the abrupt loss of a datanode or disk at any time, so what is driving the need for this proposal? Are volumes being failed too easily resulting in dataloss?

There is a difference between losing copies of data because of an external issue we are responding to, and losing copies of data because we removed them ourselves. In the latter case we are in control, and need to make new copies before removing existing ones. For reference, our handling of unhealthy replicas previously did not do this (we deleted them on sight) and this was rightfully changed.

> If volumes are being failed too eagerly, then for what reason? Disk full, checksum errors, outright failed reads?

This seems to imply that there is an exact set of criteria to fail a volume, and anything outside of that is either "too eager" or "not eager enough".
Disk failures are a fuzzy problem and I don't think such an exact set of criteria exists. The purpose of adding an intermediate state is to safely account for this unknown, rather than pin down a binary definition of volume health that becomes closely tied to our durability guarantees.

> We do have mechanisms to repair bad containers already (scanner and reconciler), so that part is handled.

This is true. An alternate proposal would be to keep the current criteria we are using for volume failure, and discard all the checks this doc currently proposes for moving a volume to degraded health. Then let scanner + reconciler fix things as we go. I considered this approach and I'm actually not opposed to it; my hesitation was that it seems irresponsible to treat volumes that are frequently throwing errors the same as if they are totally healthy. We cannot choose to fail these reachable volumes without first copying all their data, though.

> What is considered an IO error which can trigger an on-demand scan? Is it a checksum validation or an unexpected EOF / data length error? Are we keeping a sliding window count of each unique block so that 10 failures on the same block only counts as 1 rather than 10?

Everything listed here could trigger an on-demand scan. Currently the on-demand volume scanner is plugged into the `catch` blocks of most datanode IO paths. The sliding windows are planned to be tracked at a per-disk level, but this raises a good point: if one bad sector becomes hot, it may artificially make the volume seem worse than it is purely based on scan counts.

Overall I agree that there is complexity involved here, and I am not tied to this particular solution. One alternate proposal could be to improve our disk health metrics and dashboards, maybe putting some info in Recon, to alert when disks have reached a degraded state. But at that point the safe way out would be disk decommissioning, which would be a new feature that looks similar to this one.
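To make the sliding-window deduplication idea concrete, here is a minimal sketch (not Ozone's actual implementation; the class and method names are hypothetical) of a per-volume error window that counts each failed block at most once within the window, so 10 failures on the same hot block count as 1 rather than 10:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical per-volume sliding-window error counter. Repeated IO errors
 * on the same block within the window are counted once, so one hot bad
 * sector does not inflate the volume's apparent failure rate.
 */
public class VolumeErrorWindow {
  private final long windowMillis;
  // Each entry: {timestampMillis, blockId} for a distinct failed block.
  private final Deque<long[]> events = new ArrayDeque<>();
  private final Set<Long> blocksInWindow = new HashSet<>();

  public VolumeErrorWindow(long windowMillis) {
    this.windowMillis = windowMillis;
  }

  /** Record an IO error; ignored if this block already failed within the window. */
  public void recordError(long blockId, long nowMillis) {
    expire(nowMillis);
    if (blocksInWindow.add(blockId)) {
      events.addLast(new long[]{nowMillis, blockId});
    }
  }

  /** Number of distinct blocks that have failed within the window. */
  public int distinctFailures(long nowMillis) {
    expire(nowMillis);
    return events.size();
  }

  private void expire(long nowMillis) {
    while (!events.isEmpty() && nowMillis - events.peekFirst()[0] > windowMillis) {
      blocksInWindow.remove(events.peekFirst()[1]);
      events.removeFirst();
    }
  }

  public static void main(String[] args) {
    VolumeErrorWindow w = new VolumeErrorWindow(60_000); // 60s window
    for (int i = 0; i < 10; i++) {
      w.recordError(42L, 1_000 + i); // 10 errors on the same block
    }
    w.recordError(7L, 2_000);        // one error on a different block
    System.out.println(w.distinctFailures(2_000));   // 2, not 11
    System.out.println(w.distinctFailures(120_000)); // 0 once the window slides past
  }
}
```

A real version would likely also track the error kind (checksum mismatch vs. EOF vs. read failure) per entry, since the threshold for degrading a volume may differ by error type.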
Regardless of the proposal, I do think we need change in this area. As stated at the top of the doc, currently our only two options for handling partial volume failures are to reduce durability by removing all data on a disk that is potentially still readable, or to swallow disk errors with the scanner and continue to put new data on the volume as if nothing is wrong.