Thank you, Ethan, for working on this important piece of work. It looks like we do not validate data checksums on write by default. I am wondering whether background scanning should validate data with priorities: for example, files that have never been scanned before should be prioritized more aggressively than data that was already scanned. I know this might add some complexity, but let's think about it; a rough sketch of the idea is below.
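A minimal sketch of the ordering I have in mind (all class names here are made up for illustration; this is not Ozone's actual scanner code):

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.Optional;
import java.util.PriorityQueue;

/** Orders scan candidates so that never-scanned data is checked first. */
public class PrioritizedScanQueue {

  /** Minimal stand-in for a scannable unit, e.g. a container or block file. */
  public static class ScanCandidate {
    final String id;
    final Optional<Instant> lastScanned; // empty if never scanned before

    ScanCandidate(String id, Optional<Instant> lastScanned) {
      this.id = id;
      this.lastScanned = lastScanned;
    }
  }

  // Never-scanned candidates sort first (they map to Instant.MIN), followed
  // by the candidates whose last scan is oldest.
  private final PriorityQueue<ScanCandidate> queue = new PriorityQueue<>(
      Comparator.comparing((ScanCandidate c) ->
          c.lastScanned.orElse(Instant.MIN)));

  public void add(ScanCandidate candidate) {
    queue.add(candidate);
  }

  public ScanCandidate next() {
    return queue.poll();
  }
}
```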
For the other disk IO check: if we can reach a verdict within 45 minutes, that seems OK to me. The problem with more aggressive validation is that it causes more IO, which has a performance impact. Do we have a mechanism today to learn about disk issues from ongoing writes? Could IO exceptions be a trigger to validate the disk?

Regards,
Uma

On Thu, Jun 15, 2023 at 3:46 PM Ethan Rose <er...@cloudera.com.invalid> wrote:

> Hi Ozone devs,
>
> I am currently working on HDDS-8782
> <https://issues.apache.org/jira/browse/HDDS-8782> to improve the checks
> run by Ozone's volume scanner. This is a thread that periodically runs
> in the background of datanodes to check the health of the volumes/disks
> configured with hdds.datanode.dir and determine whether or not they have
> failed. The existing checks need some improvement, so I am looking for
> input on a better implementation. The following aspects should be
> considered:
>
> 1. A health check that is too strict may fail the volume unnecessarily
> due to intermittent IO failures. This would trigger alerts and
> replication when they are not required.
> 2. A health check that is too lenient may take a long time to detect a
> disk failure, or miss it entirely. This leaves the data vulnerable, as
> it will not be replicated when it should be.
> 3. The strictness of the check should have sensible defaults, but allow
> configuration if required.
> 4. The reason for volume failure should be clearly returned for logging.
>
> Ozone's master branch currently uses `DiskChecker#checkDir` from Hadoop
> to assess disk health. This call only checks directory existence and
> permissions, which can be cached, and is not a good indication of
> hardware failure. There is also `DiskChecker#checkDirWithDiskIO`, which
> writes a file and syncs it as part of the check. Even that check has
> issues:
>
> - In some cases booleans are used instead of exceptions, which masks the
> cause of the error. Violates 4.
> - Aspects like the size of the file written to the disk, the number of
> files written, and the number of failures tolerated are not
> configurable. Violates 3, and possibly 1 or 2 if the default values are
> not good.
> - The check does not read the data back to verify that the contents
> match. Violates 2.
>
> The code to implement such checks is simple, so I propose implementing
> our own set of checks in Ozone for fine-grained control. In general,
> those checks should contain at least these three aspects:
>
> 1. A check that the volume's directory exists.
> 2. A check that the datanode has rwx permission on the directory.
> 3. An IO operation consisting of the following steps:
>    1. Write x bytes to a file.
>    2. Sync the file to the disk.
>    3. Read x bytes back.
>    4. Make sure the read bytes match what was written.
>    5. Delete the file.
>
> If either of the first two checks fails, the volume should be failed
> immediately. More graceful handling of these errors is proposed in
> HDDS-8785 <https://issues.apache.org/jira/browse/HDDS-8785>, but it is
> out of scope for this change. The third check is more ambiguous. We have
> the following options to adjust:
>
> - The size of the file written.
> - How many files are written as part of each volume scan. One scan could
> read/write only one file, or it could do a few to increase the odds of
> catching a problem.
> - How frequently back-to-back scans of the same volume are allowed.
> Since there is both a background and an on-demand volume scanner, there
> is a "cool down" period between scans to prevent a volume from being
> repeatedly scanned in a short period of time. This is a good throttling
> mechanism, but if it is too high it can slow down failure detection when
> multiple scans are required to determine failure. See the next point.
> - How many failures must be encountered before failing the volume. These
> failures could span multiple volume scans, or be contained in one scan
> using repeated IO operations. Some options:
>   - Require x consecutive failures before the volume is failed. If there
> is a success in between, the failure count is cleared.
>   - Require a certain percentage of the last x checks to pass. For
> example, at least 8 of the last 10 checks must pass or the volume will
> be declared unhealthy.
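To make the quoted proposal concrete, here is a minimal sketch of the three checks (the class, method, and file names are hypothetical, not Ozone's actual API; errors surface as IOExceptions so the failure reason is preserved for logging, per point 4):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

public class VolumeHealthCheck {

  /** Throws an IOException describing exactly which step failed. */
  public static void check(File volumeDir, int fileSizeBytes)
      throws IOException {
    // 1. The volume's directory must exist.
    if (!volumeDir.isDirectory()) {
      throw new IOException("Volume directory missing: " + volumeDir);
    }
    // 2. The datanode needs rwx permission on the directory.
    if (!volumeDir.canRead() || !volumeDir.canWrite()
        || !volumeDir.canExecute()) {
      throw new IOException("Insufficient permissions on: " + volumeDir);
    }
    // 3. IO round trip: write, sync, read back, verify, delete.
    byte[] written = new byte[fileSizeBytes];
    ThreadLocalRandom.current().nextBytes(written);
    File checkFile = new File(volumeDir, ".volume-health-check");
    try {
      try (FileOutputStream out = new FileOutputStream(checkFile)) {
        out.write(written);
        out.getFD().sync(); // force the bytes through the OS cache to disk
      }
      byte[] read = new byte[fileSizeBytes];
      try (FileInputStream in = new FileInputStream(checkFile)) {
        // A single read is enough for a small local file in this sketch.
        if (in.read(read) != fileSizeBytes || !Arrays.equals(written, read)) {
          throw new IOException("Read-back mismatch on: " + volumeDir);
        }
      }
    } finally {
      checkFile.delete(); // best-effort cleanup
    }
  }
}
```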
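And a sketch of the percentage-of-last-x-checks policy from the last bullet, again with made-up names and defaults:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Declares a volume unhealthy when too many recent checks have failed. */
public class ScanResultWindow {
  private final int windowSize;      // e.g. consider the last 10 checks
  private final int failuresToFail;  // e.g. 3 failures => unhealthy

  private final Deque<Boolean> results = new ArrayDeque<>();

  public ScanResultWindow(int windowSize, int failuresToFail) {
    this.windowSize = windowSize;
    this.failuresToFail = failuresToFail;
  }

  /** Records one check result and reports whether the volume is healthy. */
  public synchronized boolean recordAndCheck(boolean passed) {
    results.addLast(passed);
    if (results.size() > windowSize) {
      results.removeFirst();
    }
    long failures = results.stream().filter(r -> !r).count();
    return failures < failuresToFail;
  }
}
```

With windowSize=10 and failuresToFail=3, at least 8 of the last 10 checks must pass, matching the example in the quoted list.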
> FWIW, I have a draft PR <https://github.com/apache/ozone/pull/4867> out
> for this that adds some failure checks, but I am not happy with them.
> They currently require 3 consecutive scans to fail and leave the default
> gap between volume checks at 15 minutes. This means a disk could have a
> 66% IO failure rate and still be considered "healthy", and it could take
> 45 minutes to determine that there is a failure.
>
> Handling disk failures well is one of the primary motivations for using
> Ozone, so I appreciate your insights in this key area.
>
> Ethan