Hi Ozone devs,

I am currently working on HDDS-8782 <https://issues.apache.org/jira/browse/HDDS-8782> to improve the checks that are run by Ozone's volume scanner. This is a thread that periodically runs in the background of datanodes to check the health of volumes/disks configured with hdds.datanode.dir and determine whether they have failed. The existing checks need some improvement, so I am looking for input on a better implementation. The following aspects should be considered:

1. A health check that is too strict may fail the volume unnecessarily due to intermittent IO failures. This would trigger alerts and replication when they are not required.
2. A health check that is too lenient may take a long time to detect a disk failure, or miss it entirely. This leaves the data vulnerable because it will not be replicated when it should be.
3. The strictness of the check should have sensible defaults, but allow configuration if required.
4. The reason for volume failure should be clearly reported for logging.
Ozone's master branch currently uses `DiskChecker#checkDir` from Hadoop to assess disk health. This call only checks directory existence and permissions, results which can be served from cache, so it is not a good indication of hardware failure. There is also `DiskChecker#checkDirWithDiskIO`, which writes a file and syncs it as part of the check. Even this check has issues:

- In some cases booleans are used instead of exceptions, which masks the cause of the error. Violates 4.
- Aspects like the size of the file written to the disk, the number of files written, and the number of failures tolerated are not configurable. Violates 3, and possibly 1 or 2 if the default values are not good.
- The check does not read the data back to verify that the contents match what was written. Violates 2.

The code to implement such checks is simple, so I propose implementing our own set of checks in Ozone for fine-grained control. In general, those checks should probably contain at least these three aspects:

1. A check that the volume's directory exists.
2. A check that the datanode has rwx permissions on the directory.
3. An IO operation consisting of the following steps (see the sketch in the P.S. below):
   1. Write x bytes to a file.
   2. Sync the file to the disk.
   3. Read x bytes back.
   4. Verify that the bytes read match what was written.
   5. Delete the file.

If either of the first two checks fails, the volume should be failed immediately. More graceful handling of these errors is proposed in HDDS-8785 <https://issues.apache.org/jira/browse/HDDS-8785>, but that is out of scope for this change.

The third check is more ambiguous. We have the following options to adjust:

- The size of the file written.
- How many files are written as part of each volume scan.
  - One scan could read/write only one file, or it could do a few to increase the odds of catching a problem.
- How frequently back-to-back scans of the same volume are allowed.
  - Since there is both a background and an on-demand volume scanner, there is a "cool down" period between scans to prevent a volume from being repeatedly scanned in a short period of time.
  - This is a good throttling mechanism, but if it is too long, it can slow down failure detection when multiple scans are required to determine failure. See the next point.
- How many failures must be encountered before failing the volume. These failures could span multiple volume scans, or be contained in one scan using repeated IO operations. Some options for this are:
  - Require x consecutive failures before the volume is failed. If there is a success in between, the failure count is cleared.
  - Require a certain percentage of the last x checks to pass before the volume is considered healthy. For example, at least 8 of the last 10 checks must pass, or the volume will be declared unhealthy. (A sketch of this option is also in the P.S. below.)

FWIW, I have a draft PR <https://github.com/apache/ozone/pull/4867> out for this that has some failure checks added, but I am not happy with them. They currently require 3 consecutive scans to fail, and leave the default gap between volume checks at 15 minutes. This means a disk could have a 66% IO failure rate (two out of every three scans failing) and still be considered "healthy", and it could take 45 minutes to determine that there is a failure.

Tolerating disk failures is one of the primary motivations for using Ozone, so I appreciate your insights in this key area.

Ethan
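P.S. To make the discussion concrete, here is a rough sketch of what the IO portion of the check (aspect 3 above) could look like. This is not code from the draft PR; the class and method names (`VolumeIOCheck`, `checkVolume`) are made up for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical sketch of the proposed IO check:
 * write, sync, read back, verify, delete.
 */
public final class VolumeIOCheck {

  /**
   * Runs one write/sync/read/verify/delete cycle against the volume
   * directory. Throws IOException describing the failed step, so the
   * caller can log the exact reason for volume failure (requirement 4).
   */
  public static void checkVolume(File volumeDir, int numBytes)
      throws IOException {
    byte[] written = new byte[numBytes];
    ThreadLocalRandom.current().nextBytes(written);

    File checkFile = new File(volumeDir,
        ".volume-check-" + UUID.randomUUID());
    try {
      // Steps 1 and 2: write the bytes and sync them, so the write
      // reaches the device instead of stopping at the page cache.
      try (FileOutputStream out = new FileOutputStream(checkFile)) {
        out.write(written);
        out.getFD().sync();
      }

      // Step 3: read the bytes back.
      byte[] read = new byte[numBytes];
      try (FileInputStream in = new FileInputStream(checkFile)) {
        int off = 0;
        while (off < numBytes) {
          int n = in.read(read, off, numBytes - off);
          if (n < 0) {
            throw new IOException("Unexpected EOF after " + off
                + " of " + numBytes + " bytes on " + volumeDir);
          }
          off += n;
        }
      }

      // Step 4: verify the contents match what was written.
      if (!Arrays.equals(written, read)) {
        throw new IOException("Data corruption detected on " + volumeDir);
      }
    } finally {
      // Step 5: delete the file, even if an earlier step failed.
      Files.deleteIfExists(checkFile.toPath());
    }
  }
}
```

One caveat: the read-back will likely be served from the OS page cache rather than the physical disk, so this exercises the write path more than the read path. Bypassing the cache (O_DIRECT-style reads) would be stronger but adds platform-specific complexity.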
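And here is a sketch of the sliding-window option for counting failures. Again, the names and semantics are just for illustration, not a proposal for the final config keys; in particular, the fail-fast behavior on a partially filled window is one possible design choice among several.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sliding-window health tracker: the volume is declared
 * failed once fewer than minPassing of the last windowSize checks
 * can possibly pass.
 */
public final class VolumeHealthWindow {
  private final int windowSize;   // e.g. consider the 10 most recent checks
  private final int minPassing;   // e.g. at least 8 of those 10 must pass
  private final Deque<Boolean> results = new ArrayDeque<>();
  private int passed;

  public VolumeHealthWindow(int windowSize, int minPassing) {
    this.windowSize = windowSize;
    this.minPassing = minPassing;
  }

  /** Records one check result; returns whether the volume is still healthy. */
  public synchronized boolean record(boolean checkPassed) {
    results.addLast(checkPassed);
    if (checkPassed) {
      passed++;
    }
    if (results.size() > windowSize) {
      // Evict the oldest result so only the last windowSize checks count.
      if (results.removeFirst()) {
        passed--;
      }
    }
    // Healthy as long as the volume could still reach minPassing successes
    // once the window fills; this fails fast when early results already
    // make that impossible.
    int remaining = windowSize - results.size();
    return passed + remaining >= minPassing;
  }
}
```

With `new VolumeHealthWindow(10, 8)` this matches the "8 of the last 10 checks must pass" example above. Unlike the 3-consecutive-failures rule in the draft PR, a disk failing two out of every three scans would be declared unhealthy after a handful of scans instead of staying "healthy" indefinitely.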