Hi Ozone devs,

I am currently working on HDDS-8782
<https://issues.apache.org/jira/browse/HDDS-8782> to improve the checks
that are run by Ozone's volume scanner. This is a thread that periodically
runs in the background on datanodes to check the health of the volumes/disks
configured with hdds.datanode.dir and determine whether they have failed.
The existing checks need some improvement, so I am looking for input on a
better implementation. The following aspects should be considered:
1. A health check that is too strict may fail the volume unnecessarily due
to intermittent IO failures. This would trigger alerts and replication when
they are not required.
2. A health check that is too lenient may take a long time to detect a disk
failure, or may miss it entirely. This leaves data vulnerable, as it will
not be replicated when it should be.
3. The strictness of the check should have a sensible default, but allow
configuration if required.
4. The reason for volume failure should be clearly returned for logging.
Ozone's master branch currently uses `DiskChecker#checkDir` from Hadoop to
assess disk health. This call only checks directory existence and
permissions, information that can be served from cache without touching the
disk, so it is not a good indication of hardware failure. There is also
`DiskChecker#checkDirWithDiskIO`, which writes a file and syncs it as part
of the check. Even this check has issues:
- In some cases booleans are used instead of exceptions, which masks the
cause of the error. This violates point 4.
- Aspects like the size of the file written to the disk, the number of
files written, and the number of failures tolerated are not configurable.
This violates point 3, and possibly points 1 or 2 if the default values are
not good.
- The check does not read the data back to verify that the contents match
what was written. This violates point 2.
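
For reference, here is roughly how the existing check is invoked (the path
is made up for illustration):

```java
import java.io.File;

import org.apache.hadoop.util.DiskChecker;
import org.apache.hadoop.util.DiskChecker.DiskErrorException;

// Roughly what the volume scanner relies on today. checkDir only
// verifies that the directory exists and has rwx permissions, so a
// failing disk can still pass if that metadata is served from cache.
try {
  DiskChecker.checkDir(new File("/data/hdds"));
} catch (DiskErrorException e) {
  // The volume would be marked failed here.
}
```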

The code to implement such checks is simple, so I propose implementing our
own set of checks in Ozone for fine-grained control. In general those checks
should probably contain at least these three aspects (a rough sketch of the
third follows the list):
1. A check that the volume's directory exists.
2. A check that the datanode has rwx permission on the directory.
3. An IO operation consisting of the following steps:
  1. Write x bytes to a file.
  2. Sync the file to the disk.
  3. Read x bytes back.
  4. Make sure the read bytes match what was written.
  5. Delete the file.
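
To make the third check concrete, here is a minimal sketch of the IO
operation; the file name, sync behavior, and cleanup handling are
placeholders, not a committed design:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

final class VolumeIOCheck {
  /** Writes, syncs, reads back, verifies, and deletes a small file. */
  static void checkVolumeIO(Path volumeDir, int numBytes) throws IOException {
    Path file = volumeDir.resolve(".volume-check");  // placeholder name
    byte[] written = new byte[numBytes];
    ThreadLocalRandom.current().nextBytes(written);
    try (FileChannel channel = FileChannel.open(file,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE,
        StandardOpenOption.TRUNCATE_EXISTING)) {
      // Steps 1 and 2: write x bytes and sync them to the disk.
      channel.write(ByteBuffer.wrap(written));
      channel.force(true);
    }
    try {
      // Steps 3 and 4: read the bytes back and verify the contents.
      byte[] read = Files.readAllBytes(file);
      if (!Arrays.equals(written, read)) {
        throw new IOException("Data read does not match data written");
      }
    } finally {
      // Step 5: delete the file.
      Files.deleteIfExists(file);
    }
  }
}
```

Throwing an exception rather than returning a boolean keeps the root cause
available for logging (point 4), and numBytes would come from configuration
(point 3).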

If either of the first two checks fails, the volume should be failed
immediately. More graceful handling of these errors is proposed in HDDS-8785
<https://issues.apache.org/jira/browse/HDDS-8785>, but that is out of scope
for this change. The third check is more ambiguous. We have the following
options to adjust:
- The size of the file written.
- How many files are written as part of each volume scan.
  - One scan could read/write only one file, or it could do a few to
increase the odds of catching a problem.
- How frequently back-to-back scans of the same volume are allowed.
  - Since there are both a background and an on-demand volume scanner, there
is a "cool down" period between scans to prevent a volume from being
repeatedly scanned in a short period of time.
  - This is a good throttling mechanism, but if the cool down is too long,
it can slow down failure detection when multiple scans are required to
determine failure. See the next point.
- How many failures must be encountered before failing the volume. These
failures could span multiple volume scans, or be contained in one scan
using repeated IO operations. Some options for this are (a sketch of the
second follows the list):
  - Require x consecutive failures before the volume is failed. If there is
a success in between, the failure count is cleared.
  - Require a certain percentage of the last x checks to pass, otherwise
the volume is declared unhealthy. For example, of the last 10 checks, at
least 8 must pass or the volume will be declared unhealthy.
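
To make the second option concrete, here is a minimal sketch of a
sliding-window tracker; the class name and parameters are placeholders and
would be wired to configuration in practice:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Tracks the results of the last windowSize checks for one volume. */
final class VolumeHealthWindow {
  private final Deque<Boolean> results = new ArrayDeque<>();
  private final int windowSize;        // e.g. the last 10 checks
  private final int minPassesRequired; // e.g. 8 of the last 10

  VolumeHealthWindow(int windowSize, int minPassesRequired) {
    this.windowSize = windowSize;
    this.minPassesRequired = minPassesRequired;
  }

  /** Records one check result and reports whether the volume is healthy. */
  synchronized boolean recordAndCheck(boolean passed) {
    results.addLast(passed);
    if (results.size() > windowSize) {
      results.removeFirst();
    }
    long passes = results.stream().filter(r -> r).count();
    // Only declare failure once a full window has been observed, so a
    // single early hiccup does not immediately fail the volume.
    return results.size() < windowSize || passes >= minPassesRequired;
  }
}
```

Compared to requiring consecutive failures, a window like this tolerates a
one-off hiccup (point 1) while still catching a disk that fails a meaningful
fraction of its IO (point 2).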

FWIW I have a draft PR <https://github.com/apache/ozone/pull/4867> out for
this that has some failure checks added, but I am not happy with them. They
currently require 3 consecutive scans to fail and leave the default gap
between volume checks at 15 minutes. This means a disk could have a 66% IO
failure rate and still be considered "healthy", because a repeating
fail-fail-pass pattern never reaches 3 consecutive failures. It could also
take 45 minutes (3 scans, 15 minutes apart) to determine that a volume has
failed.

Handling disk failures well is one of the primary motivations for using
Ozone, so I would appreciate your insights in this key area.

Ethan
