On 27-Jul-09, at 3:44 PM, Frank Middleton wrote:

> On 07/27/09 01:27 PM, Eric D. Mudama wrote:

>> Everyone on this list seems to blame lying hardware for ignoring
>> commands, but disks are relatively mature and I can't believe that
>> major OEMs would qualify disks or other hardware that willingly ignore
>> commands.

> You are absolutely correct, but if the cache flush command never makes
> it to the disk, then it won't see it. The contention is that by not
> relaying the cache flush to the disk,

No - by COMPLETELY ignoring the flush.

> VirtualBox caused the OP to lose
> his pool.

> IMO this argument is bogus because AFAIK the OP didn't actually power
> his system down, so the data would still have been in the cache, and
> would presumably eventually have been written. The out-of-order writes
> theory is also somewhat dubious, since he was able to write 10TB without
> VB relaying the cache flushes.

Huh? Of course he could. The guest didn't crash while he was doing it!

The corruption occurred when the guest crashed (iirc). And the "out of order theory" need not be the *only possible* explanation, but it *is* sufficient.

> This is all highly hardware dependent,

Not in the least. It's a logical problem.

> and AFAIK no one ever asked the OP what hardware he had, instead,
> blasting him for running VB on MSWindows.

Which is certainly not relevant to my hypothesis of what broke. I don't care what host he is running. The argument is the same for all.

> Since IIRC he was using raw
> disk access, it is questionable whether or not MS was to blame, but
> in general it simply shouldn't be possible to lose a pool under
> any conditions.

How about "when flushes are ignored"?


> It does raise the question of what happens in general if a cache
> flush doesn't happen - if, for example, a system crashes in such a way
> that it requires a power cycle to restart, and the cache never gets
> flushed.

Previous explanations have not dented your misunderstanding one iota.

The problem is not that an attempted flush did not complete. It was that any and all flushes *prior to crash* were ignored. This is where the failure mode diverges from real hardware.

Again, look:

A B C FLUSH D E F FLUSH<CRASH>

Note that it does not matter *at all* whether the 2nd flush completed. What matters from an integrity point of view is that the *previous* flush completed (and synchronously). Visualise this under the two scenarios:

1) Real hardware: barring actual defects, the first flush guaranteed that A, B, C were written (otherwise D would never have been issued). The integrity of the system is intact regardless of whether the 2nd flush completed.

2) VirtualBox: the flush never happened. The integrity of the system is lost, or at best unknown, if it depends on A, B, C all completing before D.
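
To make the difference concrete, here is a toy sketch of those two scenarios. The Disk class, block names, and the 50% survival chance are all invented for illustration; this is not VirtualBox's actual code, just a model of a volatile write cache in front of durable media:

import random

class Disk:
    """Toy model of a disk (or virtual disk) with a volatile write cache."""

    def __init__(self, honours_flush):
        self.honours_flush = honours_flush
        self.cache = []   # acknowledged writes, not yet durable
        self.media = []   # writes actually on durable media

    def write(self, block):
        self.cache.append(block)          # acknowledged from the cache

    def flush(self):
        if self.honours_flush:
            # Contract: when flush() returns, every prior write is on media.
            self.media.extend(self.cache)
            self.cache.clear()
        # else: the command is acknowledged and silently dropped

    def crash(self):
        # Whatever is still cached lands (or not) in arbitrary order;
        # model that as a random subset surviving.
        self.media.extend(b for b in self.cache if random.random() < 0.5)
        self.cache.clear()

def run(honours_flush):
    d = Disk(honours_flush)
    for b in "ABC":
        d.write(b)
    d.flush()              # the flush that matters for A, B, C
    for b in "DEF":
        d.write(b)
    d.crash()              # crash before/around the second flush
    return d.media

random.seed(0)
print("flush honoured:", run(True))    # A, B, C always present
print("flush ignored :", run(False))   # A, B, C present only by luck

Run it a few times without the fixed seed and the "flush ignored" case will happily drop C while keeping D, which is exactly the out-of-order exposure described above.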


...

> Of course the ZIL isn't a journal in the traditional sense, and
> AFAIK it has no undo capability the way that a DBMS usually has,
> but it needs to be structured so that bizarre things that happen
> when something as robust as Solaris crashes don't cause data loss.

A lot of engineering effort has been expended in UFS and ZFS to achieve just that. Which is why it's so nutty to undermine that by violating semantics in lower layers.

> The nightmare scenario is when one disk of a mirror begins to
> fail and the system comes to a grinding halt where even stop-a
> doesn't respond, and a power cycle is the only way out. Who
> knows what writes may or may not have been issued or what the
> state of the disk cache might be at such a time.

Again, if the flush semantics are respected*, this is not a problem.

--Toby

* - "When this operation completes, previous writes are verifiably on durable media**."

** - Durable media meaning physical media in a bare metal environment, and potentially "virtual media" in a virtualised environment.
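
For completeness: the same contract is what an application relies on at the syscall level when it forces a barrier with fsync before issuing dependent writes. A rough sketch (the function name and log path are illustrative; the guarantee only holds if every layer underneath - OS, hypervisor, controller, drive - actually passes the flush through):

import os

def append_with_barrier(path, records):
    # Append records, then do not return until they are on durable media
    # (in the sense of footnote * above - assuming the flush is honoured).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for r in records:
            os.write(fd, r)
        os.fsync(fd)        # the "FLUSH" in the A B C FLUSH D E F example
    finally:
        os.close(fd)

append_with_barrier("/tmp/example.log", [b"A", b"B", b"C"])
# Only now is it safe to issue writes that depend on A, B, C being durable.
append_with_barrier("/tmp/example.log", [b"D", b"E", b"F"])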



> -- Frank


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
