Thanks for following up with this, Russel.
On Jul 31, 2009, at 7:11 AM, Russel wrote:
After all the discussion here about VB, and all the finger pointing,
I raised a bug on VB about flushing.
Remember I am using RAW disks via the SATA emulation in VB;
the disks are WD 2TB drives. Also remember the HOST machine
NEVER crashed or stopped, BUT the guest OS (OpenSolaris) was
hung, and so I powered off the VIRTUAL host.
OK, this is what the VB engineer had to say after reading this and
another thread I had pointed him to. (He missed the fact I was
using RAW - not surprising, as it's a rather long thread now!)
===============================
Just looked at those two threads, and from what I saw all vital
information is missing - no hint whatsoever on how the user set up
his disks, nothing about what errors should be dealt with and so on.
So it's hard to say anything sensible, especially as people seem most
interested in assigning blame to some product. ZFS doesn't deserve
this, and VirtualBox doesn't deserve this either.
In the first place, there is absolutely no difference in how the IDE
and SATA devices handle the flush command. The documentation just
wasn't updated to talk about the SATA controller. Thanks for
pointing this out, it will be fixed in the next major release. If
you want to get the information straight away: just replace
"piix3ide" with "ahci", and all other flushing behavior settings
apply as well. See a bit further below for what it buys you (or not).
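For reference, the knob he is describing appears to be the IgnoreFlush
extra-data key. The documented IDE form is roughly:

  VBoxManage setextradata "VM name" \
    "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0

and, following the hint above, the SATA variant should just swap the
device name (adjust the VM name and the LUN number to match your disk):

  VBoxManage setextradata "VM name" \
    "VBoxInternal/Devices/ahci/0/LUN#0/Config/IgnoreFlush" 0

A value of 0 means flushes are passed through to the host rather than
ignored.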
What I haven't mentioned is the rationale behind the current
behavior. The reason for ignoring flushes is simple: the biggest
competitor does it by default as well, and one gets beaten up by
every reviewer if VirtualBox is just a few percent slower than you
know what. Forget about arguing with reviewers.
That said, a bit about what flushing can achieve - or not. Just keep
in mind that VirtualBox doesn't really buffer anything. In the IDE
case every read and write request gets handed more or less straight
(depending on the image format complexity) to the host OS. So there
is absolutely nothing which can be lost if one assumes the host OS
doesn't crash.
In the SATA case things are slightly more complicated. If you're
using anything but raw disks or flat file VMDKs, the behavior is
100% identical to IDE. If you use raw disks or flat file VMDKs, we
activate NCQ support in the SATA device code, which means that the
guest can push through a number of commands at once, and they get
handled on the host via async I/O. Again - if the host OS works
reliably there is nothing to lose.
The problem with this line of reasoning is that since the data is not
yet on the medium, a fault that occurs between the flush request and
the bogus ack goes undetected. When the disk says "the data is on the
medium," the OS trusts that the data really is on the medium, with
no errors.
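Put differently, for a file-backed disk on a POSIX host, honoring the
guest's flush roughly means an fsync() plus error propagation before
the ack. A minimal sketch, purely illustrative and not the actual
VirtualBox code:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      /* "backing.img" stands in for whatever backs the virtual disk */
      int fd = open("backing.img", O_WRONLY | O_CREAT, 0644);
      if (fd < 0) { perror("open"); return 1; }

      const char buf[] = "guest data";
      if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) {
          perror("write");     /* data may still be only in the host page cache */
          return 1;
      }

      /* Honoring the guest's flush: push to stable storage and report
       * any error. Ignoring the flush means acking without this call. */
      if (fsync(fd) != 0) {
          perror("fsync");     /* the error the guest never sees if flush is a nop */
          return 1;
      }

      close(fd);
      return 0;
  }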
This problem also affects "hardware" RAID arrays which provide
nonvolatile caches. If the array acks a write and flush, but the
data is not yet committed to medium, then if the disk fails, the
data must remain in nonvolatile cache until it can be committed
to the medium. A use case may help: suppose the power goes
out. Most arrays have enough battery to last for some time. But
if power isn't restored prior to the batteries discharging, then
there is a risk of data loss.
For ZFS, cache flush requests are not gratuitous. One critical
case is the uberblock or label update. ZFS does:
1. update labels 0 and 2
2. flush
3. check for errors
4. update labels 1 and 3
5. flush
6. check for errors
Making flush a no-op destroys the ability to check for errors,
thus breaking the trust between ZFS and the data on the medium.
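To make the ordering concrete, here is a rough sketch of that sequence
in C; write_label() and flush_cache() are hypothetical stand-ins, not
the actual ZFS routines:

  #include <stdio.h>

  /* pretend these talk to the device; here they just report success */
  static int write_label(int id) { printf("write label %d\n", id); return 0; }
  static int flush_cache(void)   { printf("flush\n"); return 0; }

  int update_labels(void)
  {
      if (write_label(0) != 0 || write_label(2) != 0)  /* 1. update labels 0 and 2 */
          return -1;
      if (flush_cache() != 0)                          /* 2. flush, 3. check for errors */
          return -1;                                   /* labels 1 and 3 still hold the old state */
      if (write_label(1) != 0 || write_label(3) != 0)  /* 4. update labels 1 and 3 */
          return -1;
      if (flush_cache() != 0)                          /* 5. flush, 6. check for errors */
          return -1;
      return 0;
  }

  int main(void) { return update_labels() != 0; }

If the flush silently returns success without reaching the medium, both
error checks pass even when the labels never made it to disk, which is
exactly the broken trust described above.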
-- richard
The only thing flushing can potentially improve is the behavior
when the host OS crashes. But that depends on many assumptions about
what the respective OS does, what the filesystems do, and so on.
Hope those facts can be the basis of a real discussion. Feel free to
raise any issue you have in this context, as long as it's not purely
hypothetical.
===================================
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss