I totally agree. We need write checksums on by default, and I am not sure of the historical reason they were left off when they were added in HDDS-5623 <https://issues.apache.org/jira/browse/HDDS-5623>. We should at least test to quantify the performance difference of on vs. off before we flip the switch, though.
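Since the thread keeps coming back to what "verify checksums on write" buys us, here is a minimal sketch of the idea (the class and method names are hypothetical, not Ozone's actual code): the server recomputes a checksum over the bytes it received and compares it to the checksum the client supplied, so a bit flip introduced by a bad NIC or cable is caught before the data is committed.

```java
import java.util.zip.CRC32;

/**
 * Hypothetical sketch of checksum verification on the write path; this is
 * NOT Ozone's actual implementation, just an illustration of the concept.
 */
public class WriteChecksumSketch {

  /** Returns true if the received bytes match the client-supplied checksum. */
  static boolean verifyOnWrite(byte[] receivedData, long clientChecksum) {
    CRC32 crc = new CRC32();
    crc.update(receivedData, 0, receivedData.length);
    return crc.getValue() == clientChecksum;
  }

  public static void main(String[] args) {
    byte[] data = "chunk payload".getBytes();
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    long clientChecksum = crc.getValue();

    System.out.println(verifyOnWrite(data, clientChecksum)); // true

    data[0] ^= 0x01; // simulate a bit flip from faulty hardware in transit
    System.out.println(verifyOnWrite(data, clientChecksum)); // false
  }
}
```

With verification off, the corrupted second write would be accepted and the damage only noticed later (if at all) by a scanner.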
Write checksums are a good topic for another thread, but they are tangential to background disk failure detection, which can happen after data has already been written.

On Wed, Jun 21, 2023 at 8:27 AM Stephen O'Donnell <sodonn...@cloudera.com.invalid> wrote:

> Why is write checksum validation not turned on by default? I have seen cases on HDFS where the "verify checksums on write" feature caught data corruption problems caused by faulty hardware / network cables before they were able to propagate into the system.
>
> The only reason I can think of for not enabling them would be write performance, but a small speedup in writes should not be preferred over data integrity.
>
> On Wed, Jun 21, 2023 at 12:45 AM Ethan Rose <er...@cloudera.com.invalid> wrote:
>
> > Hi Uma,
> >
> > The datanode-side checksums on write are still turned off. IOExceptions on the read/write path will trigger on-demand container and volume/disk scans. We could add containers to the on-demand scanning queue after they are closed for an initial scan, but this may place unnecessary burden on that thread. Even this is still not a replacement for chunk-level checksum checks on write, since if all 3 replicas are corrupted during the write process we cannot recover because the data is already committed. The scanner would only identify the problem too late. For these reasons we should work toward a point where we can turn write checksums on by default.
> >
> > To determine scanning priority, each container file has a timestamp recording the last time it was scanned, which may have been never. The background container data scanner iterates over a list of containers sorted in ascending order by last-scanned timestamp, so those that were scanned farthest in the past (or never scanned) will be scanned first.
> > The iterator comes from a ConcurrentSkipListMap, which Java defines as "weakly consistent", so newly added containers may not show up until the scanner finishes its existing iteration and obtains a new iterator. This is probably for the best, since it prevents write workloads from starving bit rot detection on older data, and it helps us define an upper bound on the time to scan a volume without having to worry about disruption from ongoing writes.
> >
> > Ethan
> >
> > On Tue, Jun 20, 2023 at 11:32 AM Uma Maheswara Rao Gangumalla <umaganguma...@gmail.com> wrote:
> >
> > > Thank you Ethan for taking on this important work.
> > > It looks like we do not validate data checksums on write by default. I am thinking we should validate data with priorities in background scanning. Example: files which were never scanned before should be prioritized more aggressively than data which was already scanned. I know this might add some complexity, but let's think about it.
> > >
> > > For the other disk IO checking, if we can determine failure within 45 mins, that seems OK to me. The problem is that more aggressive validation causes more IO and results in a perf impact.
> > > Do we have a mechanism today to learn about disk issues from ongoing writes? Could IO exceptions be a trigger to validate the disk?
> > >
> > > Regards,
> > > Uma
> > >
> > > On Thu, Jun 15, 2023 at 3:46 PM Ethan Rose <er...@cloudera.com.invalid> wrote:
> > >
> > > > Hi Ozone devs,
> > > >
> > > > I am currently working on HDDS-8782 <https://issues.apache.org/jira/browse/HDDS-8782> to improve the checks that are run by Ozone's volume scanner.
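The scan ordering described earlier in the thread (containers sorted ascending by last-scanned timestamp, iterated weakly consistently) can be sketched as follows; the class and field names here are hypothetical illustrations, not Ozone's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Hypothetical sketch of scan-priority ordering: a skip-list map keyed by
 * last-scanned time iterates oldest-first, and its weakly consistent
 * iterator may not see containers added mid-pass until the next pass.
 */
public class ScanOrderSketch {
  public static void main(String[] args) {
    // Key: last-scanned epoch millis (0 = never scanned); value: container id.
    ConcurrentSkipListMap<Long, String> byLastScanned = new ConcurrentSkipListMap<>();
    byLastScanned.put(0L, "container-3");     // never scanned: highest priority
    byLastScanned.put(1_000L, "container-1"); // scanned longest ago
    byLastScanned.put(5_000L, "container-2"); // scanned most recently

    for (Map.Entry<Long, String> e : byLastScanned.entrySet()) {
      // A concurrent put() during this loop may or may not be visible to
      // this iterator (weakly consistent); it is guaranteed to appear once
      // the scanner starts a fresh iteration.
      System.out.println(e.getValue());
    }
  }
}
```

A real implementation would also have to handle two containers sharing the same timestamp (e.g. with a composite key), which this sketch ignores.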
> > > > This is a thread that periodically runs in the background of datanodes to check the health of volumes/disks configured with hdds.datanode.dir and determine whether or not they have failed. The existing checks need some improvement, so I am looking for input on a better implementation. The following aspects should be considered:
> > > > 1. A health check that is too strict may fail the volume unnecessarily due to intermittent IO failures. This would trigger alerts and replication when they are not required.
> > > > 2. A health check that is too lenient may take a long time to detect a disk failure or miss it entirely. This leaves the data vulnerable, as it will not be replicated when it should be.
> > > > 3. The strictness of the check should be set with sensible defaults, but allow configuration if required.
> > > > 4. The reason for volume failure should be clearly returned for logging.
> > > >
> > > > Ozone's master branch is currently using `DiskChecker#checkDir` from Hadoop to assess disk health. This call only checks directory existence and permissions, which can be cached, and is not a good indication of hardware failure. There is also `DiskChecker#checkDirWithDiskIO`, which writes a file and syncs it as part of the check. Even this check has issues:
> > > > - In some cases booleans are used instead of exceptions, which masks the cause of the error. Violates 4.
> > > > - Aspects like the size of the file written back to the disk, the number of files written, and the number of failures tolerated are not configurable. Violates 3, and possibly 1 or 2 if the default values are not good.
> > > > - The check does not read back the data written to verify that the contents match. Violates 2.
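As a rough illustration of the missing read-back step, a volume check might write a probe file, fsync it, read it back, compare contents, and clean up; everything here (names, sizes, the probe-file scheme) is an illustrative assumption, not the proposed implementation.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.SecureRandom;
import java.util.Arrays;

/**
 * Hypothetical sketch of a write/sync/read-back volume check;
 * NOT Ozone's actual scanner code.
 */
public class VolumeIoCheckSketch {

  /** Throws IOException with a clear cause on failure (requirement 4). */
  static void checkVolume(Path volumeDir, int numBytes) throws IOException {
    Path probe = volumeDir.resolve(".disk-check-" + System.nanoTime());
    byte[] expected = new byte[numBytes];
    new SecureRandom().nextBytes(expected); // random contents, not a fixed pattern
    try {
      // Write the probe file and force the bytes through to the device.
      try (FileChannel ch = FileChannel.open(probe,
          StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
        ch.write(ByteBuffer.wrap(expected));
        ch.force(true); // fsync, including file metadata
      }
      // The read-back and comparison that DiskChecker#checkDirWithDiskIO lacks.
      byte[] actual = Files.readAllBytes(probe);
      if (!Arrays.equals(expected, actual)) {
        throw new IOException("Read-back mismatch on volume " + volumeDir);
      }
    } finally {
      Files.deleteIfExists(probe); // always clean up the probe file
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("volume");
    checkVolume(dir, 4096);
    System.out.println("volume healthy");
  }
}
```

One known limitation of this approach: the read may be served from the OS page cache rather than the platters, so even a read-back check is a heuristic rather than a guarantee.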
> > > > The code to implement such checks is simple, so I propose implementing our own set of checks in Ozone for fine-grained control. In general those checks should probably contain at least these three aspects:
> > > > 1. A check that the volume's directory exists.
> > > > 2. A check that the datanode has rwx permission on the directory.
> > > > 3. An IO operation consisting of the following steps:
> > > >    1. Write x bytes to a file.
> > > >    2. Sync the file to the disk.
> > > >    3. Read x bytes back.
> > > >    4. Make sure the read bytes match what was written.
> > > >    5. Delete the file.
> > > >
> > > > If either of the first two checks fails, the volume should be failed immediately. More graceful handling of these errors is proposed in HDDS-8785 <https://issues.apache.org/jira/browse/HDDS-8785>, but it is out of scope for this current change. The third check is a bit more ambiguous. We have the following options to adjust:
> > > > - The size of the file written.
> > > > - How many files are written as part of each volume scan.
> > > >   - One scan could read/write only one file, or it could do a few to increase the odds of catching a problem.
> > > > - How frequently back-to-back scans of the same volume are allowed.
> > > >   - Since there is a background and an on-demand volume scanner, there is a "cool down" period between scans to prevent a volume from being repeatedly scanned in a short period of time.
> > > >   - This is a good throttling mechanism, but if it is too high, it can slow down failure detection when multiple scans are required to determine failure. See the next point.
> > > > - How many failures must be encountered before failing the volume. These failures could span multiple volume scans, or be contained in one scan using repeated IO operations.
> > > > Some options for this are:
> > > >   - Require x consecutive failures before the volume is failed. If there is a success in between, the failure count is cleared.
> > > >   - Require a certain percentage of the last x checks to fail before the volume is unhealthy. For example, of the last 10 checks, at least 8 must pass or the volume will be declared unhealthy.
> > > >
> > > > FWIW, I have a draft PR <https://github.com/apache/ozone/pull/4867> out for this that has some failure checks added, but I am not happy with them. They currently require 3 consecutive scans to fail and leave the default volume check gap at 15 minutes. This means you could have a 66% IO failure rate and still have a "healthy" disk. It could also take 45 minutes to determine whether there is a failure.
> > > >
> > > > Disk failures are one of the primary motivations for using Ozone, so I appreciate your insights in this key area.
> > > >
> > > > Ethan
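The second option floated above (a pass-rate threshold over a sliding window of recent checks) can be sketched as follows; the class name and thresholds are illustrative, not proposed defaults.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a sliding-window failure policy: the volume is
 * declared failed when fewer than minPasses of the last windowSize checks
 * passed. NOT the draft PR's actual implementation.
 */
public class ScanFailurePolicySketch {
  private final int windowSize;
  private final int minPasses;
  private final Deque<Boolean> window = new ArrayDeque<>();

  ScanFailurePolicySketch(int windowSize, int minPasses) {
    this.windowSize = windowSize;
    this.minPasses = minPasses;
  }

  /** Records one scan result; returns true if the volume is still healthy. */
  boolean record(boolean passed) {
    window.addLast(passed);
    if (window.size() > windowSize) {
      window.removeFirst(); // evict the oldest result
    }
    long passes = window.stream().filter(b -> b).count();
    // Only judge once the window is full, so one early intermittent failure
    // does not condemn the volume (the over-strictness concern, point 1).
    return window.size() < windowSize || passes >= minPasses;
  }

  public static void main(String[] args) {
    // The example from the thread: of the last 10 checks, at least 8 must pass.
    ScanFailurePolicySketch policy = new ScanFailurePolicySketch(10, 8);
    boolean healthy = true;
    for (int i = 0; i < 10; i++) {
      healthy = policy.record(i != 2 && i != 5 && i != 7); // 3 failures
    }
    System.out.println(healthy ? "healthy" : "failed"); // 7 < 8 passes: failed
  }
}
```

Unlike a consecutive-failure counter, this window is not reset by a single success between failures, which addresses the "66% IO failure rate but still healthy" gap described above.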