Re: Updated CCD Mirroring HOWTO

Greg Oster Fri, 02 Dec 2005 22:38:59 -0800

Nick Holland writes:
> Greg Oster wrote:
> ...
> > Here's what I'd encourage you (or anyone else) to do:
> 
> actually, I'd encourage you to try your own test.  Results were interesting.


Well... as we see, you did *your* version of the test, not mine ;) 

> > 1) Create a ccd as you describe in the HOWTO and mount the filesystem.
> 
> used my own instructions, if you don't mind. :)
> Softdeps on.  That may matter.  Or it may not.  Not sure.

Shouldn't be a big deal either way..

> > 2) Start extracting 5 copies of src.tar.gz onto the filesystem (
> > simultaneously is preferred, but basically anything that will generate 
> > a lot of IO here is what is needed).
> 
> I wussed out here.  Did one unpacking of a Maildir in a .tgz file.  But
> lots of IO, lots of thrashing, disks were basically saturated with work,
> processor was waiting for disk.  Lots of tiny files.  On the other hand,
> that's a lot more activity than this machine will ever see in production.

Um... that's just one thread of IO... 64K (or whatever MAXPHYS is) 
presented, in sequence, to the underlying driver.  A rather boring 
sequence of IO, with not much chance for one disk to get ahead or 
behind the other in terms of servicing requests.  The "5" was there 
for a reason :)  So, actually, was src.tar.gz.  To make things more 
interesting, do a whole mess of reads from the ccd while you're 
doing the 5 extractions (preferably for something that isn't cached). 
(If I were testing this on my machine, I'd likely start with 10 
different copies of src.tar.gz on the ccd, and then extract all 
10 simultaneously (to different destinations on the ccd).  Once that 
was going, I'd then start about 50 dd's of the src.tar.gz files,
each dd starting about 10 seconds after the previous.   When all 
IO had begun, I'd wait a few minutes and *then* pull the rug out 
from the system.  But I didn't expect anyone to push their system 
that hard for this test, and so went with 5, and just one copy of 
src.tar.gz in an unspecified location :) )
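
A minimal sketch of the heavier load described above, for anyone who wants
to script it.  Everything named here is an assumption made for
illustration, not part of the original test: the mount point /mnt/ccd, the
copyN/ and outN/ directory layout, and the helper names build_load_plan and
run_plan.  The plan starts all extractions at once, then one dd reader
every `stagger` seconds.

```python
import subprocess
import time


def build_load_plan(ccd_mount, extractors=10, readers=50, stagger=10):
    """Return (start_delay_seconds, argv) pairs for the stress load.

    Assumed (hypothetical) layout: ccd_mount/copyN/src.tar.gz are the
    archive copies and ccd_mount/outN are existing, empty directories.
    """
    archives = [f"{ccd_mount}/copy{i}/src.tar.gz" for i in range(extractors)]
    plan = []
    # Writers: every extraction starts at time 0, each to its own destination.
    for i, archive in enumerate(archives):
        plan.append((0, ["tar", "-xzf", archive, "-C", f"{ccd_mount}/out{i}"]))
    # Readers: one dd per slot, each `stagger` seconds after the previous,
    # cycling through the archive copies.
    for j in range(readers):
        source = archives[j % len(archives)]
        argv = ["dd", f"if={source}", "of=/dev/null", "bs=64k"]
        plan.append(((j + 1) * stagger, argv))
    return plan


def run_plan(plan):
    """Start each command at its scheduled delay; return the Popen handles."""
    start = time.monotonic()
    procs = []
    for delay, argv in sorted(plan, key=lambda entry: entry[0]):
        remaining = delay - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
        procs.append(subprocess.Popen(argv))
    return procs
```

Calling run_plan(build_load_plan("/mnt/ccd")) would start the load; the
reset or power pull then happens a few minutes after the last reader has
started, while IO is still in flight.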

> My first (and second) test was copying the 86M .tgz file, but that was
> horribly uninteresting.  Resetting the machine well into the copy
> resulted in a zero-byte file after fsck.  Truncated.  Not a big
> surprise, really.
> 
> > 3) After that's been going for a while, and while still in progress, 
> > pull the power from the machine.
> 
> Drop power mid write, you are risking your disk.  Yes, I have spiked
> disks with a nail gun to test RAID in the past, but didn't feel like
> possibly toasting two disks by powering down the machine mid-write at
> this time.  This system has purpose for me. :)

Heh.. my RAID test box has a disk in an external case.. disk 'failure' 
is simulated by powering off that case... I don't know how many power 
outages that poor little disk has seen :) 
 
> So, I hit the reset button on the machine.  That should give something
> similar to (though admittedly, not identical to) a crash.

Yes, should suffice for this test ...

> No, hitting the reset is NOT the same as a power outage.  It isn't the
> same as a crash either -- in the latter case, I'm going to say that it is
> just different, not easier or harder...so my test is only one kind of
> failure (and I REALLY didn't feel like pulling a memory module out to
> simulate a HW failure... :)
> 
> > 4) Fire the machine back up, configure the ccd again, and run fsck a 
> >    few times to make sure the ccd filesystem is "clean".
> 
> once did the job.  Second fsck came up clean.  Don't expect different
> results on the third or fourth...
> 
> > 5) Now unconfigure the ccd.
> 
> mounted each separately as a non-mirrored ccd file system.
> 
> > 6) Do an md5 checksum of each of the parts of the mirror, and see if 
> > they differ.  (they shouldn't, but I bet they do!!)
> 
> I think the md5 test of the mirror elements is bogus here.
> I don't care if an unallocated block is different. I care if the files
> are different.  I might not even care about that much.  See below...

Umm.... There is still a non-zero chance that metadata on one disk 
will be different than metadata on the other, or that data on one 
disk will be different than the other...
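
A hedged sketch of the step 6 comparison at the raw-component level, where
metadata differences would show up as well.  It assumes the ccd is
unconfigured and that both components are the same size and readable as
plain files or raw devices; the function name compare_components and the
64K block size are choices made here, not anything ccd provides.  Besides
the two md5 digests it returns the byte offsets of the blocks that differ,
which says where the halves disagree rather than only that they do.

```python
import hashlib


def compare_components(path_a, path_b, block_size=64 * 1024):
    """Read both mirror halves in lockstep.

    Returns (md5_of_a, md5_of_b, offsets), where offsets lists the starting
    byte offset of every block whose contents differ between the two.
    """
    md5_a, md5_b = hashlib.md5(), hashlib.md5()
    differing = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if not block_a and not block_b:
                break  # both components exhausted
            md5_a.update(block_a)
            md5_b.update(block_b)
            if block_a != block_b:
                differing.append(offset)
            offset += block_size
    return md5_a.hexdigest(), md5_b.hexdigest(), differing
```

An empty offset list together with matching digests means the two halves
were byte-identical at the time of the read; anything else is a mismatch
that ccd itself would not have reported.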

> > If they differ, tell me how ccd detected that difference, and how it 
> > warned you that if the primary drive died that you'd have incorrect 
> > data.  If they don't differ, go buy a lottery ticket, cause it's
> > your lucky day! ;) 
> 
> I used diff(1) to compare the two trees created by splitting the mirror.
> 
> No difference found.  i.e., ccd(4) mirroring passed a somewhat
> simplified version of your test.  I even modified one of the files to
> make sure I didn't blow the diff command usage...  188M of files in the
> tree, no differences.
> 
> I will admit I was pleasantly surprised, though not totally shocked that
> it did.

With only one IO thread, I'm not overly surprised with these results...
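
For reference, the file-level comparison Nick describes can be sketched as
below.  This is an illustration, not the diff(1) invocation he actually
ran: the function name diff_trees is made up here, and the two roots are
assumed to be the mount points of the separately mounted halves.  It
compares file contents only, so it cannot see differences in metadata or
in unallocated blocks, and it does not handle broken symlinks.

```python
import filecmp
import os


def diff_trees(root_a, root_b):
    """Return sorted relative paths that exist on only one side or whose
    contents differ between the two trees."""

    def files_under(root):
        found = set()
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                found.add(os.path.relpath(full, root))
        return found

    in_a, in_b = files_under(root_a), files_under(root_b)
    problems = list(in_a ^ in_b)  # present on one side only
    for rel in in_a & in_b:
        # shallow=False forces a byte-by-byte comparison instead of
        # trusting matching size and mtime.
        same = filecmp.cmp(os.path.join(root_a, rel),
                           os.path.join(root_b, rel),
                           shallow=False)
        if not same:
            problems.append(rel)
    return sorted(problems)
```

An empty result says only that the files visible through the filesystem
match, which is a weaker statement than the two components being identical.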
 
> My first clue was what happened when I tried to interrupt the copy of a
> single very large file to the ccd(4) file system.  Even though many
> megabytes had been transferred, by the time fsck got finished, the file
> had been truncated to zero bytes (this test was repeated twice, same
> results each time).  Zero byte files tend to match pretty well. :)
> 
> I haven't looked closely at the code, but I rather suspect that the
> ccd(4) code sends the same data out to both disks at very close to the
> same time, without wandering off to do other things in between.  In
> order for things to get out of sync, the "event" would have to happen
> between the time data was sent to the first disk and before it got sent
> to the second.  

Right.  That's what the original test was attempting to do.

> I'm not sure, but I suspect there are relatively few
> times you will get a software crash that would cause that (yes, your
> disk IO code could crash,  but I suspect if that was prone to happening,
> you have much bigger problems on your hands!).  However, that doesn't
> cover power outages, HW failure, or careless hitting of the reset button.
> 
> But let's think about this a moment...
> 
> The file system IS wrong.  I was untarring a big .tgz file, and what is
> on the file system does not match what was in the .tgz file, as it
> hadn't finished!  If that was a critical task, my mail spool is hosed
> right now, and needs to be fixed.  fsck didn't magically finish the job,
> it just cleaned up the loose ends.  It lets your system reboot, but that
> isn't the same as saying, "nothing happened".  fsck makes the file
> system consistent, but it can't complete the interrupted job.  I think
> people forget this sometimes.  I think I forget it sometimes. :)
> 
> So, that IS an error.  That's expected when the system goes down hard,
> mirror, no mirror, ccd(4), raid(4), hardware, whatever.  It's going to
> be incomplete, and possibly badly wrong (and maybe corrupted beyond
> repair).  Ok, let's say you are right, let's say my test is a fluke (and
> I'll be quick to say, YES, I am sure under some circumstances, you WILL
> end up with a data mismatch between disks!).  Which disk is "right"?
> BOTH are wrong, just differently wrong.  Which one becomes the "master"
> during the remirror?  I've worked with a lot of Netware servers with SW
> disk mirroring, a system I consider the best SW mirroring I've seen,
> never figured that one out.  It makes a decision, it copies one to the
> other.  What if that decision is wrong?  Well, who cares, they are BOTH
> wrong, pick one and move on.

Right.  In the face of no evidence to the contrary, RAIDframe picks 
the "primary", and assumes that is the most correct.

> If the data being written when the event happens matters, you have to
> re-do whatever you were doing, restore from backup, back out a
> transaction on a TTS system, or otherwise, deal with it.  That process
> will probably "heal" the active files on the ccd(4) set, having
> re-written both of them.

Right.  For stuff you detect is "wrong".  And you'd need to do that 
with a RAID setup as well.  But again.. it's not the ccd/raid that's 
telling you that something is amiss.  If you don't detect that 
something is wrong and things aren't really in sync, that's what's 
going to come back to bite you at a later date... (Murphy likely 
marks it on his calendar right away...)

It's not this sort of stuff that is the concern -- it's coming to a 
machine that's just come up after some failure, and not knowing a) 
exactly what it was doing for file IO when it died or b) whether the 
mirror is really in sync.  Do you really know that all the meta-data 
is in sync in this case?  What verifies that it is?  

> On the other hand, if the data being written at the time of the crash is
> something like logs, hey, it's undesirable to lose them, but does it
> really matter that the two disks are different?  There was a nasty
> event, the data is going to be wrong (or missing or .. ), regardless.
> 
> 
> The machine I was testing on is going to be my new in-house logging
> DNS/DHCP server.  I'm using ccd(4) on the /var partition (where the logs
> will end up) and on the /home partition (the rest will be
> dumped/restored  weekly).  The only files that will be regularly written
> to are going to be log files.  If I end up with an event that causes the
> drives to get out of sync, I really can't imagine a scenario where this
> causes me problems that wouldn't be just as bad without mirroring.  If
> these logs are rotated, within a few days, I should be back to having
> all active files in sync.

I just hope that Murphy doesn't find you :) 
 
> Short version:
> I recognize your concern.  I suspect you are right, the disks could get
> out of sync.  I was a bit concerned about this for a while myself.
> However, the more I think about this, the more I keep coming to the "so
> what?" conclusion.

My data might not be worth much, but it's mine, and I guess that's 
what makes it more valuable to me... :)  And so "so what" isn't the 
conclusion I come to here...  Even less so when it's someone else's 
data on the line....  (For me, "close enough" isn't good enough in a 
mirroring setting, unless you can really guarantee that you'll never 
need to read a particular block that's not in sync.)

Your results here might lead us to wonder why RAID systems all worry 
about keeping the mirrors in sync.. just think of all the cycles that 
could be saved if they didn't bother!!  ;) 

>  My three tests indicated one can't universally even
> demonstrate a difference in the written files, though I'd want to repeat
> it an infinite number more times before I say "and there never will be a
> difference". :)

If you have time, I'd try the test as originally outlined, or as 
modified to have reading (or, better yet, heavy reading) being done 
from the ccd mirror...  As "interesting" as your results are, they 
a) don't surprise me and b) don't have much to do with the test in question.

> Yes, ccd(4) mirroring is not for every application.  But for some, it
> can be useful.  My above mentioned DNS/DHCP server is an example -- I'd
> like to keep two copies of constantly changing data.  If I lose one, I'd
> like to have rapid repair.  If I lose them both, it will not be the end
> of the world. 

I don't have a problem with people using ccd mirroring for data they 
don't care about...  I do have a problem when they haven't fully 
understood the implications, and believe it is doing something that 
it isn't! 

> I'm less likely to lose them both with ccd(4) than I am
> without any mirroring.  This is good.  It isn't worth the effort of a
> RAIDframe kernel to me, it isn't worth the price of an Accusys box to me.
>
> Nick.
> (shoulda bought a lottery ticket)

Perhaps :)  Although the test you did didn't leave much room for luck
to be required at all :)

Thanks for posting your thoughts...

Later...

Greg Oster
