On 06/19/2012 11:55 AM, Phil Frost wrote:
> I want to guarantee that fsync() doesn't return until writes have made
> it to physical storage. In particular, I care about PostgreSQL
> database integrity.
Well, this is proving very frustrating. I still don't know whether I'm
chasing behavior that simply isn't implemented or behavior that isn't
working in my environment, but I'm quite sure something is wrong here.

I dug around in the source code (3.2.0 kernel from Debian
squeeze-backports) a bit, and I'm CCing drbd-dev since I don't imagine
many users read the code. I have next to no experience with block device
programming, but I did find documentation in the kernel [1] that gave me
some useful things to grep for, specifically REQ_FLUSH and REQ_FUA. I
found evidence that DRBD supports these, in drbd_main.c:

static u32 bio_flags_to_wire(struct drbd_conf *mdev, unsigned long bi_rw)
{
    if (mdev->agreed_pro_version >= 95)
        return  (bi_rw & REQ_SYNC ? DP_RW_SYNC : 0) |
            (bi_rw & REQ_FUA ? DP_FUA : 0) |
            (bi_rw & REQ_FLUSH ? DP_FLUSH : 0) |
            (bi_rw & REQ_DISCARD ? DP_DISCARD : 0);
    else
        return bi_rw & REQ_SYNC ? DP_RW_SYNC : 0;
}

This appears to be responsible for encoding the block request flags into
a network format for the peer, and there is an inverse function in
drbd_receiver.c. However, [1] also says that block device drivers must
call blk_queue_flush() to advertise support for REQ_FUA and REQ_FLUSH.
(Strictly, it says this of "request_fn based" drivers. I don't know
exactly what that means, but I think it applies here.) grep tells me
DRBD doesn't call it anywhere, though I do see it in other drivers I
recognize: MD, loop, xen-blkfront, etc.
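
For what it's worth, here is how I understand the wire encoding, written
as a standalone sketch so it can be compiled outside the kernel. The SK_*
bit values and both function names are made up for illustration (the real
REQ_* and DP_* constants live in the kernel and DRBD headers), and the
inverse is only my guess at the shape of the receiver side, not a copy of
what is in drbd_receiver.c.

```c
#include <stdint.h>

/* Hypothetical bit values, chosen only so the two sets differ. */
#define SK_REQ_SYNC    (1ul << 0)
#define SK_REQ_FUA     (1ul << 1)
#define SK_REQ_FLUSH   (1ul << 2)

#define SK_DP_RW_SYNC  (1u << 4)
#define SK_DP_FUA      (1u << 5)
#define SK_DP_FLUSH    (1u << 6)

/* Sender side: translate request flags to wire flags.
 * Peers older than protocol 95 only get the sync hint. */
uint32_t sk_flags_to_wire(int agreed_pro_version, unsigned long bi_rw)
{
    if (agreed_pro_version >= 95)
        return ((bi_rw & SK_REQ_SYNC)  ? SK_DP_RW_SYNC : 0) |
               ((bi_rw & SK_REQ_FUA)   ? SK_DP_FUA     : 0) |
               ((bi_rw & SK_REQ_FLUSH) ? SK_DP_FLUSH   : 0);
    return (bi_rw & SK_REQ_SYNC) ? SK_DP_RW_SYNC : 0;
}

/* Receiver side (assumed shape): translate wire flags back. */
unsigned long sk_wire_to_flags(uint32_t dpf)
{
    return ((dpf & SK_DP_RW_SYNC) ? SK_REQ_SYNC  : 0) |
           ((dpf & SK_DP_FUA)     ? SK_REQ_FUA   : 0) |
           ((dpf & SK_DP_FLUSH)   ? SK_REQ_FLUSH : 0);
}
```

The point is only that a FUA or FLUSH bit survives the round trip when
both peers speak protocol 95 or later, and is silently dropped otherwise.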
So my hypothesis is that DRBD has the code to pass REQ_FUA and
REQ_FLUSH through to the underlying device, but never sees those
flags because it doesn't claim to support them. They get stripped
off by the block I/O layer, which figures the best it can do is drain
the queue, and that is clearly the Wrong Thing.
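
To spell out what I think is happening, here is a small userspace model
of the stripping step, based on my reading of [1]: a driver that
advertises no flush support gets its empty flushes completed before they
reach it, and has the FLUSH/FUA bits removed from requests that carry
data. This is not kernel code. The MODEL_* bit values, the two structs
and model_submit() are all hypothetical, and the model does not attempt
to represent any queue draining.

```c
#include <stdbool.h>

/* Hypothetical bit values, not the kernel's. */
#define MODEL_REQ_FLUSH (1u << 0)
#define MODEL_REQ_FUA   (1u << 1)
#define MODEL_REQ_SYNC  (1u << 2)

struct model_queue {
    /* What the driver advertised; 0 means it never advertised anything. */
    unsigned int flush_flags;
};

struct model_bio {
    unsigned int rw;          /* request flags */
    unsigned int nr_sectors;  /* 0 for an empty flush */
};

/* Returns true if the request would still be handed to the driver.
 * May clear flush-related bits in bio->rw as a side effect. */
bool model_submit(const struct model_queue *q, struct model_bio *bio)
{
    if ((bio->rw & (MODEL_REQ_FLUSH | MODEL_REQ_FUA)) && !q->flush_flags) {
        /* Driver claims no volatile cache: drop the flush semantics. */
        bio->rw &= ~(MODEL_REQ_FLUSH | MODEL_REQ_FUA);
        if (bio->nr_sectors == 0)
            return false;  /* empty flush is completed right here */
    }
    return true;
}
```

If this model is right, a driver with flush_flags == 0 can never observe
REQ_FLUSH or REQ_FUA, no matter how carefully it forwards them.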
Unfortunately, I don't feel very qualified in this area, so can anyone
tell me if I'm totally off base here? Any suggestions on how I might
test this?
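
One crude test I can think of is to time a loop of small overwrites, each
followed by fsync(), and see whether the rate is physically plausible. As
I understand it, a single 7200 rpm disk turns 120 times per second, so
rewriting the same block with a real flush each time should not manage
much more than that. This assumes plain rotating disks with no
battery-backed write cache; if that doesn't match the hardware, the
number means little. The function name, path and iteration count below
are my own choices, and this is a sketch rather than a proper benchmark.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Overwrite the first 512 bytes of `path` `iterations` times, calling
 * fsync() after each write. Returns fsync() calls per second, or -1.0
 * on any error. The file is created or truncated. */
double fsync_rate(const char *path, int iterations)
{
    char buf[512];
    struct timespec start, end;
    double secs;
    int fd, i;

    if (iterations <= 0)
        return -1.0;
    memset(buf, 'x', sizeof(buf));

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iterations; i++) {
        /* Same offset every time, so each commit targets the same block. */
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf) ||
            fsync(fd) != 0) {
            close(fd);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);

    secs = (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return secs > 0.0 ? (double)iterations / secs : -1.0;
}
```

Calling fsync_rate() with, say, 1000 iterations on a file on the DRBD
device and again on a file on the bare backing disk should give two
numbers to compare. A rate in the thousands on rotating media would
suggest the flushes are not reaching the platters.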
[1] http://www.mjmwired.net/kernel/Documentation/block/writeback_cache_control.txt