On Sat, Jan 24, 2015 at 05:32:41PM +0000, Taylor R Campbell wrote:
>    Date: Sat, 24 Jan 2015 14:34:24 +0100
>    From: Manuel Bouyer <[email protected]>
> 
>    I guess what happens is:
>    vndthread is blocked trying to get the vn_lock because the  thread
>    doing the sys_fsync() is holding it.
>    The thread in sys_fsync() waits for v_numoutput to go down to 0,
>    but this won't happen because vndthread has increased v_numoutput but
>    has not queued the I/O yet.
> 
> Whatever v_numoutput is accounting needs to be done alongside the
> accounting if anything is going to cv_wait on it.  It is clearly wrong
> API design (no surprise) that there are many places in the tree that
> read and modify v_numoutput directly.  For example, why does
> genfs_do_io increment v_numoutput by 2??
> 
> Step #1 is to identify what v_numoutput is supposed to be accounting,
> document it, and move all uses of it to a single abstraction in
> vfs_vnode.c...

Yes, I agree it's not clear what this is supposed to do. This could
probably be improved (though not trivially).

> 
>    Should a thread hold the vnode lock (vn_lock()) before increasing 
> v_numoutput ?
> 
> No, I don't think so.  It seems to be the number of pending bufs that
> have been submitted to the disk device to satisfy writes to the vnode,
> so there's no invariant related to the per-file-system state.
> 
> It looks to me as though it is wrong to
> 
> (a) increase v_numoutput,
> (b) wait for the vnode lock, and then
> (c) submit the buf you just counted in v_numoutput,
> 
> because we can't lock the vnode until the buf is completed if fsync
> began waiting between (a) and (b).  (`Submit' in (c) means sending to
> a disk, e.g. with VOP_STRATEGY, or, in this case, just calling
> nestiobuf_done.)
> 
> On the other hand, it is also wrong to
> 
> (a) submit a buf, and then
> (b) increase v_numoutput,
> 
> because there may be outstanding bufs which an fsync initiated between
> (a) and (b) won't notice.
> 
> But is that wrong for the parent buf of a nestiobuf transaction?
> These don't actually correspond to data themselves, and it seems that
> the purpose of the v_numoutput accounting is to notify fsync when all
> data submitted to disk have been written.  So based on this analysis I
> suspect it would be safe to move the v_numoutput++ to just before the
> call to nestiobuf_done in handle_with_strategy.

No, it's not: the call to nestiobuf_done() may be a no-op (it probably is
in most cases) when skipped is 0. The real biodone() may happen earlier,
so we need to do the v_numoutput++ before that.
The attached patch increases v_numoutput for the parent buf just before
queuing the last nested buf. I think this is the correct fix (or should
we call it a workaround?). With this I couldn't reproduce the problem
on my test machine.

-- 
Manuel Bouyer <[email protected]>
     NetBSD: 26 years of experience will always make the difference
--
Index: vnd.c
===================================================================
RCS file: /cvsroot/src/sys/dev/vnd.c,v
retrieving revision 1.219.8.2
diff -u -p -u -r1.219.8.2 vnd.c
--- vnd.c       5 Jul 2012 18:12:46 -0000       1.219.8.2
+++ vnd.c       24 Jan 2015 21:41:11 -0000
@@ -785,12 +785,6 @@ handle_with_strategy(struct vnd_softc *v
 
        flags = obp->b_flags;
 
-       if (!(flags & B_READ)) {
-               vp = bp->b_vp;
-               mutex_enter(vp->v_interlock);
-               vp->v_numoutput++;
-               mutex_exit(vp->v_interlock);
-       }
 
        /* convert to a byte offset within the file. */
        bn = obp->b_rawblkno * vnd->sc_dkdev.dk_label->d_secsize;
@@ -862,6 +856,21 @@ handle_with_strategy(struct vnd_softc *v
                            nbp->vb_buf.b_flags, nbp->vb_buf.b_data,
                            nbp->vb_buf.b_bcount);
 #endif
+               if (!(flags & B_READ) && resid == sz) {
+                       struct vnode *w_vp;
+                       /*
+                        * This is the last nested buf; account for
+                        * the parent buf's write too.
+                        * This has to be done last, so that fsync
+                        * won't wait for the parent write, which
+                        * has no chance to complete before all
+                        * nested bufs have been queued.
+                        */
+                       w_vp = bp->b_vp;
+                       mutex_enter(w_vp->v_interlock);
+                       w_vp->v_numoutput++;
+                       mutex_exit(w_vp->v_interlock);
+               }
                VOP_STRATEGY(vp, nbp);
                bn += sz;
        }
