Why is write checksum validation not turned on by default? I have seen cases
on HDFS where the "verify checksums on write" feature caught data corruption
caused by faulty hardware or network cables before it could propagate into
the system. The only reason I can think of for not enabling it is write
performance, but a small speedup on the write path should not be preferred
over data integrity.
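For readers less familiar with the feature: "verify checksums on write"
generally means the datanode recomputes a checksum over each chunk it
receives and compares it against the checksum the client sent, rejecting the
write on a mismatch. A minimal sketch of the idea in Java; the class and
method names are illustrative, not Ozone's actual write path:

```java
import java.io.IOException;
import java.util.zip.CRC32;

/** Illustrative sketch only; not Ozone's actual write-path code. */
public final class WritePathChecksumSketch {

  /**
   * Recompute the checksum over the received bytes and compare it with the
   * checksum the client sent. A mismatch means the data was corrupted in
   * flight (NIC, cable, memory), so the write is rejected before the chunk
   * is committed to the replicas.
   */
  static void verifyOnWrite(byte[] chunk, long expectedChecksum) throws IOException {
    CRC32 crc = new CRC32();
    crc.update(chunk, 0, chunk.length);
    if (crc.getValue() != expectedChecksum) {
      throw new IOException("Checksum mismatch on write: expected "
          + expectedChecksum + " but computed " + crc.getValue());
    }
  }
}
```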
On Wed, Jun 21, 2023 at 12:45 AM Ethan Rose <er...@cloudera.com.invalid> wrote:

> Hi Uma,
>
> The datanode-side checksums on write are still turned off. IO exceptions
> on the read/write path will trigger on-demand container and volume/disk
> scans. We could add containers to the on-demand scanning queue for an
> initial scan after they are closed, but this may place unnecessary burden
> on that thread. Even this is still not a replacement for chunk-level
> checksum checks on write: if all 3 replicas are corrupted during the write
> process, we cannot recover because the data is already committed, and the
> scanner would only identify the problem too late. For these reasons we
> should work toward a point where we can turn write checksums on by
> default.
>
> To determine scanning priority, each container file has a timestamp
> stating the last time it was scanned, which may be never. The background
> container data scanner iterates over a list of containers sorted in
> ascending order by last-scanned timestamp, so those that were scanned
> farthest in the past (or never scanned) are scanned first. The iteration
> is backed by a ConcurrentSkipListMap, whose iterators Java defines as
> "weakly consistent", so newly added containers may not show up until the
> scanner finishes its existing iteration and obtains a new iterator. This
> is probably for the best, since it prevents write workloads from starving
> bit rot detection on older data and helps us define an upper bound on the
> time to scan a volume without having to worry about disruption from
> ongoing writes.
>
> Ethan
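A minimal sketch of the last-scanned ordering Ethan describes, assuming a
composite key to break timestamp ties; the class and method names are
hypothetical, not Ozone's actual scanner code:

```java
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Hypothetical sketch: containers sorted ascending by the time they were
 * last scanned, with 0 meaning "never scanned" so new containers sort
 * first alongside the stalest data.
 */
public final class ScanQueueSketch {

  /** Composite key so two containers scanned at the same instant do not collide. */
  record ScanKey(long lastScannedMillis, long containerId) implements Comparable<ScanKey> {
    @Override
    public int compareTo(ScanKey o) {
      int byTime = Long.compare(lastScannedMillis, o.lastScannedMillis());
      return byTime != 0 ? byTime : Long.compare(containerId, o.containerId());
    }
  }

  private final ConcurrentSkipListMap<ScanKey, Long> queue = new ConcurrentSkipListMap<>();

  void add(long containerId, long lastScannedMillis) {
    queue.put(new ScanKey(lastScannedMillis, containerId), containerId);
  }

  void runOnePass() {
    // The map's iterators are weakly consistent: containers added while this
    // loop runs may not appear until the next pass obtains a fresh iterator,
    // so heavy write workloads cannot starve scans of older data.
    for (long containerId : queue.values()) {
      // scanContainer(containerId); // stalest containers are visited first
    }
  }
}
```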
> On Tue, Jun 20, 2023 at 11:32 AM Uma Maheswara Rao Gangumalla <
> umaganguma...@gmail.com> wrote:
>
> > Thank you Ethan for working on this important work.
> > It looks like we do not validate data checksums on write by default. I
> > am thinking that we should validate data with priorities in background
> > scanning. For example: files that were never scanned before should be
> > prioritized more aggressively than data that was already scanned. I
> > know this might add some complexity, but let's think about it.
> >
> > For the other disk IO checking, if we can determine failure within 45
> > minutes, that seems OK to me. The problem is that more aggressive
> > validation causes more IO, which impacts performance.
> > Do we have a mechanism today to learn about disk issues from ongoing
> > writes? Could IO exceptions be a trigger to validate the disk?
> >
> > Regards,
> > Uma
> >
> > On Thu, Jun 15, 2023 at 3:46 PM Ethan Rose <er...@cloudera.com.invalid>
> > wrote:
> >
> > > Hi Ozone devs,
> > >
> > > I am currently working on HDDS-8782
> > > <https://issues.apache.org/jira/browse/HDDS-8782> to improve the
> > > checks that are run by Ozone's volume scanner. This is a thread that
> > > periodically runs in the background of datanodes to check the health
> > > of volumes/disks configured with hdds.datanode.dir and determine
> > > whether or not they have failed. The existing checks need some
> > > improvement, so I am looking for input on a better implementation.
> > > The following aspects should be considered:
> > > 1. A health check that is too strict may fail the volume
> > > unnecessarily due to intermittent IO failures. This would trigger
> > > alerts and replication when they are not required.
> > > 2. A health check that is too lenient may take a long time to detect
> > > a disk failure or miss it entirely. This leaves the data vulnerable,
> > > as it will not be replicated when it should be.
> > > 3. The strictness of the check should be set with sensible defaults,
> > > but allow configuration if required.
> > > 4. The reason for volume failure should be clearly returned for
> > > logging.
> > >
> > > Ozone's master branch currently uses `DiskChecker#checkDir` from
> > > Hadoop to assess disk health. This call only checks directory
> > > existence and permissions, which can be cached, so it is not a good
> > > indication of hardware failure. There is also
> > > `DiskChecker#checkDirWithDiskIO`, which writes a file and syncs it as
> > > part of the check. Even this check has issues:
> > > - In some cases booleans are used instead of exceptions, which masks
> > > the cause of the error. Violates 4.
> > > - Aspects like the size of the file written to the disk, the number
> > > of files written, and the number of failures tolerated are not
> > > configurable. Violates 3, and possibly 1 or 2 if the default values
> > > are not good.
> > > - The check does not read back the data written to verify that the
> > > contents match. Violates 2.
> > >
> > > The code to implement such checks is simple, so I propose
> > > implementing our own set of checks in Ozone for fine-grained control.
> > > In general those checks should probably contain at least these three
> > > aspects:
> > > 1. A check that the volume's directory exists.
> > > 2. A check that the datanode has rwx permissions on the directory.
> > > 3. An IO operation, sketched below, consisting of the following steps:
> > >    1. Write x bytes to a file.
> > >    2. Sync the file to the disk.
> > >    3. Read x bytes back.
> > >    4. Make sure the read bytes match what was written.
> > >    5. Delete the file.
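A minimal sketch of the proposed write-sync-read-verify-delete probe,
assuming a plain FileChannel round trip is acceptable; the names are
illustrative and not taken from the draft PR:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative sketch of the proposed IO check; not the draft PR's code. */
public final class VolumeIoCheckSketch {

  /** Throws IOException if any step of the write-sync-read-verify-delete probe fails. */
  static void probe(Path volumeDir, int numBytes) throws IOException {
    Path file = volumeDir.resolve(".volume-check-" + System.nanoTime());
    byte[] written = new byte[numBytes];
    ThreadLocalRandom.current().nextBytes(written);
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE_NEW,
        StandardOpenOption.READ, StandardOpenOption.WRITE)) {
      ch.write(ByteBuffer.wrap(written));   // 1. write x bytes
      ch.force(true);                       // 2. sync file data and metadata to disk
      ByteBuffer read = ByteBuffer.allocate(numBytes);
      while (read.hasRemaining() && ch.read(read, read.position()) >= 0) {
        // 3. read x bytes back (note: without direct IO this may be served
        // from the OS page cache rather than the disk itself)
      }
      if (read.hasRemaining() || !Arrays.equals(written, read.array())) {
        throw new IOException("Volume check on " + volumeDir
            + ": data read back does not match data written"); // 4. verify
      }
    } finally {
      Files.deleteIfExists(file);           // 5. delete the probe file
    }
  }
}
```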
> > > If either of the first two checks fails, the volume should be failed
> > > immediately. More graceful handling of these errors is proposed in
> > > HDDS-8785 <https://issues.apache.org/jira/browse/HDDS-8785>, but it
> > > is out of scope for this change. The third check is more ambiguous.
> > > We have the following options to adjust:
> > > - The size of the file written.
> > > - How many files are written as part of each volume scan. One scan
> > > could read/write only one file, or it could do a few to increase the
> > > odds of catching a problem.
> > > - How frequently back-to-back scans of the same volume are allowed.
> > > Since there are both background and on-demand volume scanners, there
> > > is a "cool down" period between scans to prevent a volume from being
> > > repeatedly scanned in a short period of time. This is a good
> > > throttling mechanism, but if it is too high, it can slow down failure
> > > detection when multiple scans are required to determine failure. See
> > > the next point.
> > > - How many failures must be encountered before failing the volume.
> > > These failures could span multiple volume scans, or be contained in
> > > one scan using repeated IO operations. Some options for this (one is
> > > sketched at the end of this message) are:
> > >   - Require x consecutive failures before the volume is failed. If
> > >     there is a success in between, the failure count is cleared.
> > >   - Require a certain percentage of the last x checks to fail before
> > >     the volume is unhealthy. For example, of the last 10 checks, at
> > >     least 8 must pass or the volume will be declared unhealthy.
> > >
> > > FWIW I have a draft PR <https://github.com/apache/ozone/pull/4867>
> > > out for this that has some failure checks added, but I am not happy
> > > with them. They currently require 3 consecutive scans to fail and
> > > leave the default gap between volume checks at 15 minutes. This means
> > > a disk could have a 66% IO failure rate and still be considered
> > > "healthy", and it could take 45 minutes to determine that there is a
> > > failure.
> > >
> > > Robust handling of disk failures is one of the primary motivations
> > > for using Ozone, so I appreciate your insights in this key area.
> > >
> > > Ethan
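Finally, a minimal sketch of the second failure-threshold option above
("of the last 10 checks, at least 8 must pass"), with hypothetical names:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of an "at least minPasses of the last windowSize
 * checks must pass" policy, e.g. new SlidingWindowHealth(10, 8).
 */
public final class SlidingWindowHealth {

  private final int windowSize;
  private final int minPasses;
  private final Deque<Boolean> results = new ArrayDeque<>();
  private int passes;

  SlidingWindowHealth(int windowSize, int minPasses) {
    this.windowSize = windowSize;
    this.minPasses = minPasses;
  }

  /** Record one check result and report whether the volume is still healthy. */
  synchronized boolean recordAndCheck(boolean passed) {
    results.addLast(passed);
    if (passed) {
      passes++;
    }
    if (results.size() > windowSize) {
      if (results.removeFirst()) { // evict oldest result from the window
        passes--;
      }
    }
    // Policy choice: only judge once a full window of results is available.
    return results.size() < windowSize || passes >= minPasses;
  }
}
```

Compared with the consecutive-failure counter in the draft PR, a window like
this tolerates intermittent blips (point 1 in the original list) while still
bounding how long a mostly-failing disk can be reported as healthy (point 2).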