Why is write checksum validation not turned on by default? I have seen cases
on HDFS where the "verify checksums on write" feature caught data corruption
caused by faulty hardware or network cables before it could propagate into
the system. The only reason I can think of for not enabling it is write
performance, but a small speedup on the write path should not be preferred
over data integrity.
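For readers less familiar with the feature: "verify checksums on write"
generally means the datanode recomputes a checksum over each chunk it
receives and compares it against the checksum the client sent, rejecting the
write on a mismatch. A minimal sketch of the idea in Java; the class and
method names are illustrative, not Ozone's actual write path:

```java
import java.io.IOException;
import java.util.zip.CRC32;

/** Illustrative sketch only; not Ozone's actual write-path code. */
public final class WritePathChecksumSketch {

  /**
   * Recompute the checksum over the received bytes and compare it with the
   * checksum the client sent. A mismatch means the data was corrupted in
   * flight (NIC, cable, memory), so the write is rejected before the chunk
   * is committed to the replicas.
   */
  static void verifyOnWrite(byte[] chunk, long expectedChecksum) throws IOException {
    CRC32 crc = new CRC32();
    crc.update(chunk, 0, chunk.length);
    if (crc.getValue() != expectedChecksum) {
      throw new IOException("Checksum mismatch on write: expected "
          + expectedChecksum + " but computed " + crc.getValue());
    }
  }
}
```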
On Wed, Jun 21, 2023 at 12:45 AM Ethan Rose <er...@cloudera.com.invalid> wrote:

> Hi Uma,
>
> The datanode-side checksums on write are still turned off. IO exceptions
> on the read/write path will trigger on-demand container and volume/disk
> scans. We could add containers to the on-demand scanning queue for an
> initial scan after they are closed, but this may place unnecessary burden
> on that thread. Even this is still not a replacement for chunk-level
> checksum checks on write: if all 3 replicas are corrupted during the write
> process, we cannot recover because the data is already committed, and the
> scanner would only identify the problem too late. For these reasons we
> should work toward a point where we can turn write checksums on by
> default.
>
> To determine scanning priority, each container file has a timestamp
> stating the last time it was scanned, which may be never. The background
> container data scanner iterates over a list of containers sorted in
> ascending order by last-scanned timestamp, so those that were scanned
> farthest in the past (or never scanned) are scanned first. The iteration
> is backed by a ConcurrentSkipListMap, whose iterators Java defines as
> "weakly consistent", so newly added containers may not show up until the
> scanner finishes its existing iteration and obtains a new iterator. This
> is probably for the best, since it prevents write workloads from starving
> bit rot detection on older data and helps us define an upper bound on the
> time to scan a volume without having to worry about disruption from
> ongoing writes.
>
> Ethan
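A minimal sketch of the last-scanned ordering Ethan describes, assuming a
composite key to break timestamp ties; the class and method names are
hypothetical, not Ozone's actual scanner code:

```java
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Hypothetical sketch: containers sorted ascending by the time they were
 * last scanned, with 0 meaning "never scanned" so new containers sort
 * first alongside the stalest data.
 */
public final class ScanQueueSketch {

  /** Composite key so two containers scanned at the same instant do not collide. */
  record ScanKey(long lastScannedMillis, long containerId) implements Comparable<ScanKey> {
    @Override
    public int compareTo(ScanKey o) {
      int byTime = Long.compare(lastScannedMillis, o.lastScannedMillis());
      return byTime != 0 ? byTime : Long.compare(containerId, o.containerId());
    }
  }

  private final ConcurrentSkipListMap<ScanKey, Long> queue = new ConcurrentSkipListMap<>();

  void add(long containerId, long lastScannedMillis) {
    queue.put(new ScanKey(lastScannedMillis, containerId), containerId);
  }

  void runOnePass() {
    // The map's iterators are weakly consistent: containers added while this
    // loop runs may not appear until the next pass obtains a fresh iterator,
    // so heavy write workloads cannot starve scans of older data.
    for (long containerId : queue.values()) {
      // scanContainer(containerId); // stalest containers are visited first
    }
  }
}
```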
> On Tue, Jun 20, 2023 at 11:32 AM Uma Maheswara Rao Gangumalla <
> umaganguma...@gmail.com> wrote:
>
> > Thank you Ethan for working on this important work.
> > It looks like we do not validate data checksums on write by default. I
> > am thinking that we should validate data with priorities in background
> > scanning. For example: files that were never scanned before should be
> > prioritized more aggressively than data that was already scanned. I
> > know this might add some complexity, but let's think about it.
> >
> > For the other disk IO checking, if we can determine failure within 45
> > minutes, that seems OK to me. The problem is that more aggressive
> > validation causes more IO, which impacts performance.
> > Do we have a mechanism today to learn about disk issues from ongoing
> > writes? Could IO exceptions be a trigger to validate the disk?
> >
> > Regards,
> > Uma
> >
> > On Thu, Jun 15, 2023 at 3:46 PM Ethan Rose <er...@cloudera.com.invalid>
> > wrote:
> >
> > > Hi Ozone devs,
> > >
> > > I am currently working on HDDS-8782
> > > <https://issues.apache.org/jira/browse/HDDS-8782> to improve the
> > > checks that are run by Ozone's volume scanner. This is a thread that
> > > periodically runs in the background of datanodes to check the health
> > > of volumes/disks configured with hdds.datanode.dir and determine
> > > whether or not they have failed. The existing checks need some
> > > improvement, so I am looking for input on a better implementation.
> > > The following aspects should be considered:
> > > 1. A health check that is too strict may fail the volume
> > > unnecessarily due to intermittent IO failures. This would trigger
> > > alerts and replication when they are not required.
> > > 2. A health check that is too lenient may take a long time to detect
> > > a disk failure or miss it entirely. This leaves the data vulnerable,
> > > as it will not be replicated when it should be.
> > > 3. The strictness of the check should be set with sensible defaults,
> > > but allow configuration if required.
> > > 4. The reason for volume failure should be clearly returned for
> > > logging.
> > >
> > > Ozone's master branch currently uses `DiskChecker#checkDir` from
> > > Hadoop to assess disk health. This call only checks directory
> > > existence and permissions, which can be cached, so it is not a good
> > > indication of hardware failure. There is also
> > > `DiskChecker#checkDirWithDiskIO`, which writes a file and syncs it as
> > > part of the check. Even this check has issues:
> > > - In some cases booleans are used instead of exceptions, which masks
> > > the cause of the error. Violates 4.
> > > - Aspects like the size of the file written to the disk, the number
> > > of files written, and the number of failures tolerated are not
> > > configurable. Violates 3, and possibly 1 or 2 if the default values
> > > are not good.
> > > - The check does not read back the data written to verify that the
> > > contents match. Violates 2.
> > >
> > > The code to implement such checks is simple, so I propose
> > > implementing our own set of checks in Ozone for fine-grained control.
> > > In general those checks should probably contain at least these three
> > > aspects:
> > > 1. A check that the volume's directory exists.
> > > 2. A check that the datanode has rwx permissions on the directory.
> > > 3. An IO operation, sketched below, consisting of the following steps:
> > >    1. Write x bytes to a file.
> > >    2. Sync the file to the disk.
> > >    3. Read x bytes back.
> > >    4. Make sure the read bytes match what was written.
> > >    5. Delete the file.
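A minimal sketch of the proposed write-sync-read-verify-delete probe,
assuming a plain FileChannel round trip is acceptable; the names are
illustrative and not taken from the draft PR:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative sketch of the proposed IO check; not the draft PR's code. */
public final class VolumeIoCheckSketch {

  /** Throws IOException if any step of the write-sync-read-verify-delete probe fails. */
  static void probe(Path volumeDir, int numBytes) throws IOException {
    Path file = volumeDir.resolve(".volume-check-" + System.nanoTime());
    byte[] written = new byte[numBytes];
    ThreadLocalRandom.current().nextBytes(written);
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE_NEW,
        StandardOpenOption.READ, StandardOpenOption.WRITE)) {
      ch.write(ByteBuffer.wrap(written));   // 1. write x bytes
      ch.force(true);                       // 2. sync file data and metadata to disk
      ByteBuffer read = ByteBuffer.allocate(numBytes);
      while (read.hasRemaining() && ch.read(read, read.position()) >= 0) {
        // 3. read x bytes back (note: without direct IO this may be served
        // from the OS page cache rather than the disk itself)
      }
      if (read.hasRemaining() || !Arrays.equals(written, read.array())) {
        throw new IOException("Volume check on " + volumeDir
            + ": data read back does not match data written"); // 4. verify
      }
    } finally {
      Files.deleteIfExists(file);           // 5. delete the probe file
    }
  }
}
```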
> > > If either of the first two checks fails, the volume should be failed
> > > immediately. More graceful handling of these errors is proposed in
> > > HDDS-8785 <https://issues.apache.org/jira/browse/HDDS-8785>, but it
> > > is out of scope for this change. The third check is more ambiguous.
> > > We have the following options to adjust:
> > > - The size of the file written.
> > > - How many files are written as part of each volume scan. One scan
> > > could read/write only one file, or it could do a few to increase the
> > > odds of catching a problem.
> > > - How frequently back-to-back scans of the same volume are allowed.
> > > Since there are both background and on-demand volume scanners, there
> > > is a "cool down" period between scans to prevent a volume from being
> > > repeatedly scanned in a short period of time. This is a good
> > > throttling mechanism, but if it is too high, it can slow down failure
> > > detection when multiple scans are required to determine failure. See
> > > the next point.
> > > - How many failures must be encountered before failing the volume.
> > > These failures could span multiple volume scans, or be contained in
> > > one scan using repeated IO operations. Some options for this (one is
> > > sketched at the end of this message) are:
> > >   - Require x consecutive failures before the volume is failed. If
> > >     there is a success in between, the failure count is cleared.
> > >   - Require a certain percentage of the last x checks to fail before
> > >     the volume is unhealthy. For example, of the last 10 checks, at
> > >     least 8 must pass or the volume will be declared unhealthy.
> > >
> > > FWIW I have a draft PR <https://github.com/apache/ozone/pull/4867>
> > > out for this that has some failure checks added, but I am not happy
> > > with them. They currently require 3 consecutive scans to fail and
> > > leave the default gap between volume checks at 15 minutes. This means
> > > a disk could have a 66% IO failure rate and still be considered
> > > "healthy", and it could take 45 minutes to determine that there is a
> > > failure.
> > >
> > > Robust handling of disk failures is one of the primary motivations
> > > for using Ozone, so I appreciate your insights in this key area.
> > >
> > > Ethan
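Finally, a minimal sketch of the second failure-threshold option above
("of the last 10 checks, at least 8 must pass"), with hypothetical names:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of an "at least minPasses of the last windowSize
 * checks must pass" policy, e.g. new SlidingWindowHealth(10, 8).
 */
public final class SlidingWindowHealth {

  private final int windowSize;
  private final int minPasses;
  private final Deque<Boolean> results = new ArrayDeque<>();
  private int passes;

  SlidingWindowHealth(int windowSize, int minPasses) {
    this.windowSize = windowSize;
    this.minPasses = minPasses;
  }

  /** Record one check result and report whether the volume is still healthy. */
  synchronized boolean recordAndCheck(boolean passed) {
    results.addLast(passed);
    if (passed) {
      passes++;
    }
    if (results.size() > windowSize) {
      if (results.removeFirst()) { // evict oldest result from the window
        passes--;
      }
    }
    // Policy choice: only judge once a full window of results is available.
    return results.size() < windowSize || passes >= minPasses;
  }
}
```

Compared with the consecutive-failure counter in the draft PR, a window like
this tolerates intermittent blips (point 1 in the original list) while still
bounding how long a mostly-failing disk can be reported as healthy (point 2).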