I totally agree. We need write checksums on by default, and I am not sure of the historical reason they were left off when they were added in HDDS-5623 <https://issues.apache.org/jira/browse/HDDS-5623>. We should at least test to quantify the performance difference of on vs. off before we flip the switch, though.
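Since the thread keeps coming back to what "verify checksums on write" buys us, here is a minimal sketch of the idea (the class and method names are hypothetical, not Ozone's actual code): the server recomputes a checksum over the bytes it received and compares it to the checksum the client supplied, so a bit flip introduced by a bad NIC or cable is caught before the data is committed.

```java
import java.util.zip.CRC32;

/**
 * Hypothetical sketch of checksum verification on the write path; this is
 * NOT Ozone's actual implementation, just an illustration of the concept.
 */
public class WriteChecksumSketch {

  /** Returns true if the received bytes match the client-supplied checksum. */
  static boolean verifyOnWrite(byte[] receivedData, long clientChecksum) {
    CRC32 crc = new CRC32();
    crc.update(receivedData, 0, receivedData.length);
    return crc.getValue() == clientChecksum;
  }

  public static void main(String[] args) {
    byte[] data = "chunk payload".getBytes();
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    long clientChecksum = crc.getValue();

    System.out.println(verifyOnWrite(data, clientChecksum)); // true

    data[0] ^= 0x01; // simulate a bit flip from faulty hardware in transit
    System.out.println(verifyOnWrite(data, clientChecksum)); // false
  }
}
```

With verification off, the corrupted second write would be accepted and the damage only noticed later (if at all) by a scanner.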
Write checksums are a good topic for another thread, but they are tangential to background disk failure detection, which can happen after data has already been written.

On Wed, Jun 21, 2023 at 8:27 AM Stephen O'Donnell <sodonn...@cloudera.com.invalid> wrote:

> Why is write checksum validation not turned on by default? I have seen cases on HDFS where the "verify checksums on write" feature caught data corruption problems caused by faulty hardware / network cables before they were able to propagate into the system.
>
> The only reason I can think of for not enabling them would be write performance, but a small speedup in writes should not be preferred over data integrity.
>
> On Wed, Jun 21, 2023 at 12:45 AM Ethan Rose <er...@cloudera.com.invalid> wrote:
>
> > Hi Uma,
> >
> > The datanode-side checksums on write are still turned off. IOExceptions on the read/write path will trigger on-demand container and volume/disk scans. We could add containers to the on-demand scanning queue after they are closed for an initial scan, but this may place unnecessary burden on that thread. Even this is still not a replacement for chunk-level checksum checks on write, since if all 3 replicas are corrupted during the write process we cannot recover because the data is already committed. The scanner would only identify the problem too late. For these reasons we should work toward a point where we can turn write checksums on by default.
> >
> > To determine scanning priority, each container file has a timestamp recording the last time it was scanned, which may have been never. The background container data scanner iterates over a list of containers sorted in ascending order by last-scanned timestamp, so those that were scanned farthest in the past (or never scanned) will be scanned first.
> > The iterator comes from a ConcurrentSkipListMap, which Java defines as "weakly consistent", so newly added containers may not show up until the scanner finishes its existing iteration and obtains a new iterator. This is probably for the best, since it prevents write workloads from starving bit rot detection on older data, and it helps us define an upper bound on the time to scan a volume without having to worry about disruption from ongoing writes.
> >
> > Ethan
> >
> > On Tue, Jun 20, 2023 at 11:32 AM Uma Maheswara Rao Gangumalla <umaganguma...@gmail.com> wrote:
> >
> > > Thank you Ethan for taking on this important work.
> > > It looks like we do not validate data checksums on write by default. I am thinking we should validate data with priorities in background scanning. Example: files which were never scanned before should be prioritized more aggressively than data which was already scanned. I know this might add some complexity, but let's think about it.
> > >
> > > For the other disk IO checking, if we can determine failure within 45 mins, that seems OK to me. The problem is that more aggressive validation causes more IO and results in a perf impact.
> > > Do we have a mechanism today to learn about disk issues from ongoing writes? Could IO exceptions be a trigger to validate the disk?
> > >
> > > Regards,
> > > Uma
> > >
> > > On Thu, Jun 15, 2023 at 3:46 PM Ethan Rose <er...@cloudera.com.invalid> wrote:
> > >
> > > > Hi Ozone devs,
> > > >
> > > > I am currently working on HDDS-8782 <https://issues.apache.org/jira/browse/HDDS-8782> to improve the checks that are run by Ozone's volume scanner.
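The scan ordering described earlier in the thread (containers sorted ascending by last-scanned timestamp, iterated weakly consistently) can be sketched as follows; the class and field names here are hypothetical illustrations, not Ozone's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Hypothetical sketch of scan-priority ordering: a skip-list map keyed by
 * last-scanned time iterates oldest-first, and its weakly consistent
 * iterator may not see containers added mid-pass until the next pass.
 */
public class ScanOrderSketch {
  public static void main(String[] args) {
    // Key: last-scanned epoch millis (0 = never scanned); value: container id.
    ConcurrentSkipListMap<Long, String> byLastScanned = new ConcurrentSkipListMap<>();
    byLastScanned.put(0L, "container-3");     // never scanned: highest priority
    byLastScanned.put(1_000L, "container-1"); // scanned longest ago
    byLastScanned.put(5_000L, "container-2"); // scanned most recently

    for (Map.Entry<Long, String> e : byLastScanned.entrySet()) {
      // A concurrent put() during this loop may or may not be visible to
      // this iterator (weakly consistent); it is guaranteed to appear once
      // the scanner starts a fresh iteration.
      System.out.println(e.getValue());
    }
  }
}
```

A real implementation would also have to handle two containers sharing the same timestamp (e.g. with a composite key), which this sketch ignores.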
> > > > This is a thread that periodically runs in the background of datanodes to check the health of volumes/disks configured with hdds.datanode.dir and determine whether or not they have failed. The existing checks need some improvement, so I am looking for input on a better implementation. The following aspects should be considered:
> > > > 1. A health check that is too strict may fail the volume unnecessarily due to intermittent IO failures. This would trigger alerts and replication when they are not required.
> > > > 2. A health check that is too lenient may take a long time to detect a disk failure or miss it entirely. This leaves the data vulnerable, as it will not be replicated when it should be.
> > > > 3. The strictness of the check should be set with sensible defaults, but allow configuration if required.
> > > > 4. The reason for volume failure should be clearly returned for logging.
> > > >
> > > > Ozone's master branch is currently using `DiskChecker#checkDir` from Hadoop to assess disk health. This call only checks directory existence and permissions, which can be cached, and is not a good indication of hardware failure. There is also `DiskChecker#checkDirWithDiskIO`, which writes a file and syncs it as part of the check. Even this check has issues:
> > > > - In some cases booleans are used instead of exceptions, which masks the cause of the error. Violates 4.
> > > > - Aspects like the size of the file written back to the disk, the number of files written, and the number of failures tolerated are not configurable. Violates 3, and possibly 1 or 2 if the default values are not good.
> > > > - The check does not read back the data written to verify that the contents match. Violates 2.
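As a rough illustration of the missing read-back step, a volume check might write a probe file, fsync it, read it back, compare contents, and clean up; everything here (names, sizes, the probe-file scheme) is an illustrative assumption, not the proposed implementation.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.SecureRandom;
import java.util.Arrays;

/**
 * Hypothetical sketch of a write/sync/read-back volume check;
 * NOT Ozone's actual scanner code.
 */
public class VolumeIoCheckSketch {

  /** Throws IOException with a clear cause on failure (requirement 4). */
  static void checkVolume(Path volumeDir, int numBytes) throws IOException {
    Path probe = volumeDir.resolve(".disk-check-" + System.nanoTime());
    byte[] expected = new byte[numBytes];
    new SecureRandom().nextBytes(expected); // random contents, not a fixed pattern
    try {
      // Write the probe file and force the bytes through to the device.
      try (FileChannel ch = FileChannel.open(probe,
          StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
        ch.write(ByteBuffer.wrap(expected));
        ch.force(true); // fsync, including file metadata
      }
      // The read-back and comparison that DiskChecker#checkDirWithDiskIO lacks.
      byte[] actual = Files.readAllBytes(probe);
      if (!Arrays.equals(expected, actual)) {
        throw new IOException("Read-back mismatch on volume " + volumeDir);
      }
    } finally {
      Files.deleteIfExists(probe); // always clean up the probe file
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("volume");
    checkVolume(dir, 4096);
    System.out.println("volume healthy");
  }
}
```

One known limitation of this approach: the read may be served from the OS page cache rather than the platters, so even a read-back check is a heuristic rather than a guarantee.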
> > > > The code to implement such checks is simple, so I propose implementing our own set of checks in Ozone for fine-grained control. In general those checks should probably contain at least these three aspects:
> > > > 1. A check that the volume's directory exists.
> > > > 2. A check that the datanode has rwx permission on the directory.
> > > > 3. An IO operation consisting of the following steps:
> > > >    1. Write x bytes to a file.
> > > >    2. Sync the file to the disk.
> > > >    3. Read x bytes back.
> > > >    4. Make sure the read bytes match what was written.
> > > >    5. Delete the file.
> > > >
> > > > If either of the first two checks fails, the volume should be failed immediately. More graceful handling of these errors is proposed in HDDS-8785 <https://issues.apache.org/jira/browse/HDDS-8785>, but it is out of scope for this current change. The third check is a bit more ambiguous. We have the following options to adjust:
> > > > - The size of the file written.
> > > > - How many files are written as part of each volume scan.
> > > >   - One scan could read/write only one file, or it could do a few to increase the odds of catching a problem.
> > > > - How frequently back-to-back scans of the same volume are allowed.
> > > >   - Since there is a background and an on-demand volume scanner, there is a "cool down" period between scans to prevent a volume from being repeatedly scanned in a short period of time.
> > > >   - This is a good throttling mechanism, but if it is too high, it can slow down failure detection when multiple scans are required to determine failure. See the next point.
> > > > - How many failures must be encountered before failing the volume. These failures could span multiple volume scans, or be contained in one scan using repeated IO operations.
> > > > Some options for this are:
> > > >   - Require x consecutive failures before the volume is failed. If there is a success in between, the failure count is cleared.
> > > >   - Require a certain percentage of the last x checks to fail before the volume is unhealthy. For example, of the last 10 checks, at least 8 must pass or the volume will be declared unhealthy.
> > > >
> > > > FWIW, I have a draft PR <https://github.com/apache/ozone/pull/4867> out for this that has some failure checks added, but I am not happy with them. They currently require 3 consecutive scans to fail and leave the default volume check gap at 15 minutes. This means you could have a 66% IO failure rate and still have a "healthy" disk. It could also take 45 minutes to determine whether there is a failure.
> > > >
> > > > Disk failures are one of the primary motivations for using Ozone, so I appreciate your insights in this key area.
> > > >
> > > > Ethan
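The second option floated above (a pass-rate threshold over a sliding window of recent checks) can be sketched as follows; the class name and thresholds are illustrative, not proposed defaults.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a sliding-window failure policy: the volume is
 * declared failed when fewer than minPasses of the last windowSize checks
 * passed. NOT the draft PR's actual implementation.
 */
public class ScanFailurePolicySketch {
  private final int windowSize;
  private final int minPasses;
  private final Deque<Boolean> window = new ArrayDeque<>();

  ScanFailurePolicySketch(int windowSize, int minPasses) {
    this.windowSize = windowSize;
    this.minPasses = minPasses;
  }

  /** Records one scan result; returns true if the volume is still healthy. */
  boolean record(boolean passed) {
    window.addLast(passed);
    if (window.size() > windowSize) {
      window.removeFirst(); // evict the oldest result
    }
    long passes = window.stream().filter(b -> b).count();
    // Only judge once the window is full, so one early intermittent failure
    // does not condemn the volume (the over-strictness concern, point 1).
    return window.size() < windowSize || passes >= minPasses;
  }

  public static void main(String[] args) {
    // The example from the thread: of the last 10 checks, at least 8 must pass.
    ScanFailurePolicySketch policy = new ScanFailurePolicySketch(10, 8);
    boolean healthy = true;
    for (int i = 0; i < 10; i++) {
      healthy = policy.record(i != 2 && i != 5 && i != 7); // 3 failures
    }
    System.out.println(healthy ? "healthy" : "failed"); // 7 < 8 passes: failed
  }
}
```

Unlike a consecutive-failure counter, this window is not reset by a single success between failures, which addresses the "66% IO failure rate but still healthy" gap described above.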