Nick Holland writes:
> First of all, I've been informed who Greg Oster is...a/the maintainer
> of RAIDframe.
Guilty as charged.

> So, let's start by acknowledging his superior knowledge in the area
> (possibly a little bias, but his knowledge of this topic is to be
> respected).

Well.... "superior" is rather strong... Let's just say I've done a bit
more than just "dabbled" in some of these areas :) And I'll readily
admit to being a little biased towards RAIDframe too....

> I am NOT a file system expert. I am barely file system aware. Some
> readers of my posts might confuse my knowledge of the OpenBSD boot
> process and disk layout process as being file-system knowledgeable.
> That would be a big error -- very different topics!
>
> Greg Oster wrote:
> > Nick Holland writes:
> >> Greg Oster wrote:
[snip]
> >> > 6) Do an md5 checksum of each of the parts of the mirror, and
> >> > see if they differ. (they shouldn't, but I bet they do!!)
> >>
> >> I think the md5 test of the mirror elements is bogus here.
> >> I don't care if an unallocated block is different. I care if the
> >> files are different. I might not even care about that much. See
> >> below...
> >
> > Umm.... There is still a non-zero chance that metadata on one disk
> > will be different than metadata on the other, or that data on one
> > disk will be different than the other...
>
> I'll agree to that ('specially following later results). But I do
> not see the point of getting excited about a difference in
> non-allocated data. My test is lame; yours is too strict. I can't
> think of a test that is "just right". :-/

Yes... I acknowledge that my test isn't something one would see in
"typical use"... but testing needs to cover the atypical cases too...

> > Your results here might lead us to wonder why RAID systems all
> > worry about keeping the mirrors in sync.. just think of all the
> > cycles that could be saved if they didn't bother!! ;)
>
> Actually, that occurred to me, yes.
> HOWEVER, I wish to point out (again) that I am NOT a file system
> expert.
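For anyone who wants to try the per-component checksum test themselves,
here's a rough sketch. The device names are stand-ins: on a real box
you'd point DISK_A/DISK_B at the raw component devices (e.g. /dev/rwd0d
and /dev/rwd1d) with the ccd unconfigured and the file systems
unmounted. As written, it runs against two scratch files so it's safe
to paste anywhere:

```shell
# Compare the two halves of a mirror by raw checksum.
# NOTE: the defaults below are scratch files, NOT real devices.
DISK_A=${DISK_A:-/tmp/mirror_a.img}
DISK_B=${DISK_B:-/tmp/mirror_b.img}

# Self-contained demo: fabricate two identical 1MB "components".
dd if=/dev/zero of="$DISK_A" bs=1024 count=1024 2>/dev/null
cp "$DISK_A" "$DISK_B"

# OpenBSD ships md5(1); md5sum(1) is the usual name elsewhere.
checksum() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | cut -d' ' -f1
    else
        md5 -q "$1"
    fi
}

sum_a=$(checksum "$DISK_A")
sum_b=$(checksum "$DISK_B")

if [ "$sum_a" = "$sum_b" ]; then
    echo "components identical"
else
    echo "components DIFFER: mirror halves are out of sync"
fi
```

Remember the caveat above, though: a mismatch here may just be
unallocated blocks, which is not, by itself, proof of file damage.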
> Every OS and most HW based systems seem to compulsively rebuild
> mirrors. I think it is best to assume they know something I don't. :)

:)

> I know of two HW RAID systems which aren't so compulsive: both the
> Accusys and the Arco IDE mirroring boxes seem to be indifferent to
> powerdowns (they should be indifferent to crashes, as they'll just
> finish the last writes without the OS's help). Come to think of it,
> the "cheapie" BIOS-assisted SW-RAID cards I've played with on Windows
> seem to do the same thing -- I'm guessing they just don't try to
> optimize writes, so it is "write to disk 0, write the same thing to
> disk 1, and don't let anything else happen in between." I've heard
> they don't perform as well as some of the "pure software mirroring"
> solutions, so that may be evidence of this.

Hmm... The one advantage these hardware cards should have is the
ability to keep track of the "last n sectors written". By keeping
track of that in a battery-backed-up manner, it'd be ~trivial for them
to know which sectors were "in flight" at the time the lights went
out. That makes for pretty quick recovery of the RAID set -- perhaps
to the point where one doesn't think it's doing anything at all... (I
don't know -- I'll admit that I'm more of a fan of software RAID
because I know how the bits are stored, and I know I can get them off
the disk again. Even though hardware RAID runs software at some level,
I've yet to use a hardware RAID array that hasn't threatened to eat my
data at some time or another..)

[snip]

> >> Yes, ccd(4) mirroring is not for every application. But for some,
> >> it can be useful. My above-mentioned DNS/DHCP server is an example
> >> -- I'd like to keep two copies of constantly changing data. If I
> >> lose one, I'd like to have rapid repair. If I lose them both, it
> >> will not be the end of the world.
> >
> > I don't have a problem with people using ccd mirroring for data
> > they don't care about...
> > I do have a problem when they haven't fully understood the
> > implications, and believe it is doing something that it isn't!
>
> Yes. I agree with you whole-heartedly on this. I've been working on
> a ccd(4) mirroring FAQ entry for a few months. It will have some
> pretty big disclaimers -- bigger now, as I have verified, at least in
> part, some of your concerns. It also has some pretty big disclaimers
> about RAID in general. My experience has been that most people are
> idiots about how they implement any form of RAID (most notably,
> assuming some magic will happen in the recovery process).

Most people only go "half way".... They test to the point where it
looks like things will work in redundant mode, and then they put it
into production. They don't actually test or go through the "what
will I need to do if a disk goes south" part of the procedure... They
also don't tend to think about recovering from a "what happens if two
disks think they are dead in a RAID 5 set" condition.

> ...
> Let's get to the results of my second and third sets of tests...
>
> First, I did eight untarrings of src.tar.gz (from one file on a
> non-mirrored partition to eight different destinations). As it was
> running, I realized I had forgotten to delete the Maildir I had
> (partly) unpacked before, so I launched an "rm -r" on that one.
>
> This rather anemic machine was pretty much unusable by this point.
> I'll need some better hardware before I get too much more ugly than
> it is currently. :) This machine has a pair of 4G IDE drives, 64M
> RAM, and a Celeron 333. 64M of RAM should mean not much was
> cached...much more than that was written to disk, though I did use
> only one copy of src.tar.gz, so some caching was probably taking
> place there.
>
> Initial comparison produced some weird results...massive numbers of
> "No such file or directory" messages...until I realized the
> src.tar.gz file I used contained symlinks to non-existent things in
> the obj directory. So..yeah.
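Speaking of the "disk goes south" drill: with RAIDframe, at least, it's
cheap to actually practice it before production. A hypothetical
run-through follows -- the device and set names (wd1a, raid0) are made
up, and the raidctl(8) flags are from my memory, so verify them against
the man page before trusting this on a real set. As written, DRY_RUN
makes the script only print the commands:

```shell
# Disk-failure drill sketch for a two-component RAIDframe mirror.
# DRY_RUN defaults to "echo" so nothing is actually touched.
DRY_RUN=${DRY_RUN:-echo}

$DRY_RUN raidctl -f /dev/wd1a raid0   # pretend one component died
$DRY_RUN raidctl -s raid0             # status: set should be degraded
# ...swap in the replacement disk, disklabel it to match, then:
$DRY_RUN raidctl -R /dev/wd1a raid0   # reconstruct in place
$DRY_RUN raidctl -s raid0             # verify both components optimal
```

The point is less the exact flags than that you've walked the whole
recovery path once while nothing is at stake.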
> Expected. But it would also mask other errors, so I deleted them
> from both test file systems.

Oh.. yup. (I recall seeing those symlinks on a non-OpenBSD box, and
wondering about their value to me :) )

> Here were my results after that:
>
> # diff -ur /home/test /mnt/test
> Only in /mnt/test/1/gnu/egcs/gcc: cp
> Only in /mnt/test/2/gnu/egcs/gcc: cccp.1
> Only in /mnt/test/2/gnu/egcs/gcc: cccp.c
> Only in /mnt/test/2/gnu/egcs/gcc: cexp.y
> Only in /mnt/test/2/gnu/egcs/gcc: collect2.c
> # diff -ur /home/Maildir/ /mnt/Maildir/
>
> #
>
> So, we DID have errors due to different content on the two disks on
> the untarring, none on the rm'ing.

As you've discovered, it's not easy to make these errors appear "at
will"... It's basically a race condition, and even though you know
the race is there, getting it to manifest itself can be Hard...

> I hate ambiguous failures; I'd much rather have a spectacular
> failure. So, I repeated your tests, following a little closer to
> your guidelines, and your revised guidelines.
>
> I only had space for five copies of src.tar.gz on the smaller ccd(4)
> mirrored partition (and even then, I was at 104% utilization!), so I
> only got five simultaneous reads going here.

That should be sufficient...

> So here's the plan:
> 500M /home partition
> 1G /var partition
> on ccd mirroring.
>
> # ls /home
> src.tar.gz src1.tar.gz src2.tar.gz src3.tar.gz src4.tar.gz
>
> Run the following script:
> --------
[snip recipe]
> --------
>
> Wait for /var to get around 70% full (starting from 1% full)...
> (*thrash*thrash*thrash*)
> Yank the cord when df shows /var is at 70%...takes a while, this
> thing is not fast.
>
> Reboot (still mirroring). fsck runs, lots of errors.
>
> Get rid of the various obj symlinks:
> # cd /var/test
> # find . -type l -name obj | xargs rm
>
> Split the mirror, mount the second half of /var on /mnt
>
> # diff -ur /var/test /mnt/test
> #
>
> Ummmmmmm...no errors?
> That wasn't what I expected.
Good thing the previous test was successful ;) It can be really hard
to trigger these things... My concern with even suggesting these
tests was that you actually wouldn't see any difference after the
fsck, in that the fsck would likely "fix" most of the problems....
And it's entirely possible that running even 10 of these tests
wouldn't trigger the race... But such is the nature of this beast...

> These ARE rather old IDE drives with an old IDE interface....I
> suspect newer drives and interfaces, or SCSI drives with better
> support for concurrent disk activity, might produce more spectacular
> failures.

Yes, perhaps... Adjusting the "read mixture" would probably help here
too... (i.e. do the "50 dd's of src.tar.gz to /dev/null") My guess is
the drives were on different IDE channels -- actually... having them
share a channel might add to the entertainment value of the tests as
well :) I'm just glad you saw the failure mode at least once, so I
didn't have to say "believe me, it can happen, try it again!" ;)

Later...

Greg Oster
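P.S. For the curious, the "read mixture" idea can be sketched as a
fan-out of background dd readers. SRC and the reader count are
placeholders -- on the test box you'd point SRC at src.tar.gz on the
mirrored partition; as written, it fabricates a small scratch file so
the sketch runs anywhere:

```shell
# Keep the mirror busy with many concurrent sequential readers
# while the untars run in another terminal.
SRC=${SRC:-/tmp/readtest.bin}
READERS=${READERS:-50}

# Scratch input so the demo is self-contained; use a real, big file
# on the ccd for an actual stress run.
[ -f "$SRC" ] || dd if=/dev/zero of="$SRC" bs=1024 count=256 2>/dev/null

i=0
while [ "$i" -lt "$READERS" ]; do
    dd if="$SRC" of=/dev/null bs=64k 2>/dev/null &
    i=$((i + 1))
done
wait   # reap all the background readers
echo "started and reaped $READERS readers"
```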