On 27-Jul-09, at 3:44 PM, Frank Middleton wrote:

> On 07/27/09 01:27 PM, Eric D. Mudama wrote:

>> Everyone on this list seems to blame lying hardware for ignoring
>> commands, but disks are relatively mature and I can't believe that
>> major OEMs would qualify disks or other hardware that willingly ignore
>> commands.

> You are absolutely correct, but if the cache flush command never makes
> it to the disk, then it won't see it. The contention is that by not
> relaying the cache flush to the disk,

No - by COMPLETELY ignoring the flush.

> VirtualBox caused the OP to lose
> his pool.

> IMO this argument is bogus because AFAIK the OP didn't actually power
> his system down, so the data would still have been in the cache, and
> would presumably eventually have been written. The out-of-order writes
> theory is also somewhat dubious, since he was able to write 10TB without
> VB relaying the cache flushes.

Huh? Of course he could. The guest didn't crash while he was doing it!

The corruption occurred when the guest crashed (iirc). And the "out of order theory" need not be the *only possible* explanation, but it *is* sufficient.

> This is all highly hardware dependent,

Not in the least. It's a logical problem.

> and AFAIK no one ever asked the OP what hardware he had, instead,
> blasting him for running VB on MSWindows.

Which is certainly not relevant to my hypothesis of what broke. I don't care what host he is running. The argument is the same for all.

> Since IIRC he was using raw
> disk access, it is questionable whether or not MS was to blame, but
> in general it simply shouldn't be possible to lose a pool under
> any conditions.

How about "when flushes are ignored"?


> It does raise the question of what happens in general if a cache
> flush doesn't happen - if, for example, a system crashes in such a way
> that it requires a power cycle to restart, and the cache never gets
> flushed.

Previous explanations have not dented your misunderstanding one iota.

The problem is not that an attempted flush did not complete. It was that any and all flushes *prior to crash* were ignored. This is where the failure mode diverges from real hardware.

Again, look:

A B C FLUSH D E F FLUSH<CRASH>

Note that it does not matter *at all* whether the 2nd flush completed. What matters from an integrity point of view is that the *previous* flush completed (and synchronously). Visualise this under the two scenarios:

1) Real hardware: barring actual defects, the first flush guaranteed that A, B, C were written (otherwise D would never have been issued). The integrity of the system is intact regardless of whether the 2nd flush completed.

2) VirtualBox: the flush never happened. The integrity of the system is lost, or at best unknown, if it depends on A, B, C all completing before D.
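
To make the difference concrete, here is a toy sketch of those two scenarios. The Disk class, block names, and the 50% survival chance are all invented for illustration; this is not VirtualBox's actual code, just a model of a volatile write cache in front of durable media:

import random

class Disk:
    """Toy model of a disk (or virtual disk) with a volatile write cache."""

    def __init__(self, honours_flush):
        self.honours_flush = honours_flush
        self.cache = []   # acknowledged writes, not yet durable
        self.media = []   # writes actually on durable media

    def write(self, block):
        self.cache.append(block)          # acknowledged from the cache

    def flush(self):
        if self.honours_flush:
            # Contract: when flush() returns, every prior write is on media.
            self.media.extend(self.cache)
            self.cache.clear()
        # else: the command is acknowledged and silently dropped

    def crash(self):
        # Whatever is still cached lands (or not) in arbitrary order;
        # model that as a random subset surviving.
        self.media.extend(b for b in self.cache if random.random() < 0.5)
        self.cache.clear()

def run(honours_flush):
    d = Disk(honours_flush)
    for b in "ABC":
        d.write(b)
    d.flush()              # the flush that matters for A, B, C
    for b in "DEF":
        d.write(b)
    d.crash()              # crash before/around the second flush
    return d.media

random.seed(0)
print("flush honoured:", run(True))    # A, B, C always present
print("flush ignored :", run(False))   # A, B, C present only by luck

Run it a few times without the fixed seed and the "flush ignored" case will happily drop C while keeping D, which is exactly the out-of-order exposure described above.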


...

> Of course the ZIL isn't a journal in the traditional sense, and
> AFAIK it has no undo capability the way that a DBMS usually has,
> but it needs to be structured so that bizarre things that happen
> when something as robust as Solaris crashes don't cause data loss.

A lot of engineering effort has been expended in UFS and ZFS to achieve just that. Which is why it's so nutty to undermine that by violating semantics in lower layers.

> The nightmare scenario is when one disk of a mirror begins to
> fail and the system comes to a grinding halt where even stop-a
> doesn't respond, and a power cycle is the only way out. Who
> knows what writes may or may not have been issued or what the
> state of the disk cache might be at such a time.

Again, if the flush semantics are respected*, this is not a problem.

--Toby

* - "When this operation completes, previous writes are verifiably on durable media**."

** - Durable media meaning physical media in a bare metal environment, and potentially "virtual media" in a virtualised environment.
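
For completeness: the same contract is what an application relies on at the syscall level when it forces a barrier with fsync before issuing dependent writes. A rough sketch (the function name and log path are illustrative; the guarantee only holds if every layer underneath - OS, hypervisor, controller, drive - actually passes the flush through):

import os

def append_with_barrier(path, records):
    # Append records, then do not return until they are on durable media
    # (in the sense of footnote * above - assuming the flush is honoured).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for r in records:
            os.write(fd, r)
        os.fsync(fd)        # the "FLUSH" in the A B C FLUSH D E F example
    finally:
        os.close(fd)

append_with_barrier("/tmp/example.log", [b"A", b"B", b"C"])
# Only now is it safe to issue writes that depend on A, B, C being durable.
append_with_barrier("/tmp/example.log", [b"D", b"E", b"F"])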



> -- Frank


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
