On Sat, 22 Dec 2007, Kostik Belousov wrote:

On Sun, Dec 23, 2007 at 04:08:09AM +1100, Bruce Evans wrote:
On Sat, 22 Dec 2007, Kostik Belousov wrote:
Yes, rewriting the syncer is the right solution. It probably cannot be done
quickly enough. If the yield workaround provide mitigation for now, it
shall go in.

I don't think rewriting the syncer just for this is the right solution.
Rewriting the syncer so that it schedules actual i/o more efficiently
might involve a solution.  Better scheduling would probably take more
CPU and increase the problem.
I think that we can easily predict what vnode(s) become dirty at the
places where we do vn_start_write().

This works for writes to regular files at most.  There are also reads
(for ffs, these set IN_ATIME unless the file system is mounted with
noatime) and directory operations.  By grepping for IN_CHANGE, I get
78 places in ffs alone where dirtying of the inode occurs or is scheduled
to occur (ffs = /sys/ufs).  The efficiency of "marking" timestamps,
especially for atimes, depends on just setting a flag in normal operation
and picking up coalesced settings of the flag later, often at sync time
by scanning all vnodes.

Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is
needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH().
There are 4 places in vfs and 13 places in 6 file systems:
...

Only file systems that support writing need it (for VOP_SYNC() and for
MNT_RELOAD), else there would be many more places.  There would also
be more places if MNT_RELOAD support were not missing for some file
systems.

Ok, since you talked about this first :). I already made the following
patch, but did not published it since I still did not inspected all
callers of MNT_VNODE_FOREACH() for safety of dropping mount interlock.
It shall be safe, but better to check. Also, I postponed the check
until it was reported that yielding does solve the original problem.

Good.  I'd still like to unobfuscate the function call.

diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
index 14acc5b..046af82 100644
--- a/sys/kern/vfs_mount.c
+++ b/sys/kern/vfs_mount.c
@@ -1994,6 +1994,12 @@ __mnt_vnode_next(struct vnode **mvp, struct mount *mp)
        mtx_assert(MNT_MTX(mp), MA_OWNED);

        KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
+       if ((*mvp)->v_yield++ == 500) {
+               MNT_IUNLOCK(mp);
+               (*mvp)->v_yield = 0;
+               uio_yield();

Another unobfuscation is to not name this uio_yield().

+               MNT_ILOCK(mp);
+       }
        vp = TAILQ_NEXT(*mvp, v_nmntvnodes);
        while (vp != NULL && vp->v_type == VMARKER)
                vp = TAILQ_NEXT(vp, v_nmntvnodes);
diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h
index dc70417..6e3119b 100644
--- a/sys/sys/vnode.h
+++ b/sys/sys/vnode.h
@@ -131,6 +131,7 @@ struct vnode {
                struct socket   *vu_socket;     /* v unix domain net (VSOCK) */
                struct cdev     *vu_cdev;       /* v device (VCHR, VBLK) */
                struct fifoinfo *vu_fifoinfo;   /* v fifo (VFIFO) */
+               int             vu_yield;       /*   yield count (VMARKER) */
        } v_un;

        /*

Putting the count in the union seems fragile at best.  Even if nothing
can access the marker vnode, you need to context-switch its old contents
while using it for the count, in case its old contents is used.  Vnode-
printing routines might still be confused.

Bruce
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "[EMAIL PROTECTED]"

Reply via email to