On Tue, 12 Aug 2008, Bruce Evans wrote:
On Sat, 9 Aug 2008, John Baldwin wrote:
> You failed to note that not using D_TRACKCLOSE doesn't actually reliably
> close devices.
No, I noted that vfs doesn't count last closes right. Unreliability of
last-close is a special case of that. But using D_TRACKCLOSE mostly makes
things worse. To avoid the vfs bugs in tracked closes, it is necessary
for individual drivers to understand vfs's bugs better than vfs does.
> I set D_TRACKCLOSE on bpf because under load at work (lots of concurrent
> nmap's) bpf devices would sometimes never get fully closed, so you'd have
> an unopened bpf device that was no longer available.
It is unclear how that helps. vfs's bugs must be really large for the
last-close to go completely missing. kib fixed some of them a few months
ago (before you changed bpf I think). bpf is especially simple since it
is exclusive-access, so the vfs bugs should be smaller for it.
> D_TRACKCLOSE gives much saner semantics for devices: each open() of an fd
> calls the 'd_open' method, and the last close of that fd (which could be
> in another process) calls 'd_close'.
This is insane semantics for most devices, and D_TRACKCLOSE doesn't give it.
Most devices don't care about the difference between file descriptors and
files -- they want d_close to be called only when the last file descriptor
on the device is closed. !D_TRACKCLOSE gives this, modulo the vfs bugs.
D_TRACKCLOSE doesn't give these semantics: revoke() closes everything
in 1 step and then calls d_close(), giving an N-1 correspondence between
d_open()s and d_close()s (modulo the vfs bugs giving a fuzzier
correspondence). (IIRC, at least before kib's fixes, it was possible
for revoke() to cause any of 0, 1 or 2 calls to d_close(). I think 1
call always gets as far as devfs_close() but this is only guaranteed
to reach d_close() with D_TRACKCLOSE.) This is not a bug, but all
drivers that depend on the correspondence being 1-1 are broken. Drivers
that enforce "exclusive" access like bpf have N=1, so they probably
work modulo the vfs bugs, and they have a good chance of not being
affected by the vfs bug that gives an extra devfs_close() and thus an
extra d_close().
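
To make that concrete, here is a minimal sketch (hypothetical "mydev"
names, not taken from any real driver) of the bookkeeping a D_TRACKCLOSE
driver ends up doing if it still wants to free things on "last close":
it has to keep its own open count, and the comments mark where the
revoke() behaviour above breaks the 1-1 assumption.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/conf.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx mydev_mtx;
MTX_SYSINIT(mydev_mtx, &mydev_mtx, "mydev", MTX_DEF);
static int mydev_opens;         /* the driver's own idea of the open count */

static int
mydev_open(struct cdev *dev, int oflags, int devtype, struct thread *td)
{

        mtx_lock(&mydev_mtx);
        mydev_opens++;
        mtx_unlock(&mydev_mtx);
        return (0);
}

static int
mydev_close(struct cdev *dev, int fflag, int devtype, struct thread *td)
{

        mtx_lock(&mydev_mtx);
        /*
         * The call count cannot be trusted: after revoke() one d_close()
         * may stand for N opens (the count never reaches 0 and resources
         * leak), and the extra-devfs_close() case would drive it negative.
         * Clamping only papers over the mismatch; it does not fix it.
         */
        if (mydev_opens > 0)
                mydev_opens--;
        if (mydev_opens == 0) {
                /* free per-device resources here */
        }
        mtx_unlock(&mydev_mtx);
        return (0);
}

static struct cdevsw mydev_cdevsw = {
        .d_version =    D_VERSION,
        .d_flags =      D_TRACKCLOSE,
        .d_open =       mydev_open,
        .d_close =      mydev_close,
        .d_name =       "mydev",
};

Nothing in this sketch can tell an ordinary last close apart from a
revoke(), which is the point: the driver would have to know more about
vfs's accounting than vfs tells it.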
I checked that bpf panics (even under UP) due to the obvious bugs in
its d_close():
# Generate lots of network activity using something like:
sysctl net.inet.icmp.icmplim=0; ping -fq localhost &
# Race to panic eventually:
while :; do tcpdump -i lo0 & sleep 0.001; revoke /dev/bpf0; done
Most or all device drivers have obvious bugs in their d_close(); bpf
is just a bit easier to understand and more likely to cause a panic
than most device drivers, since it is simple and frees resources. A
panic is very likely when si_drv1 is freed, and si_drv1 is only locked
accidentally.
% static int
% bpfclose(struct cdev *dev, int flags, int fmt, struct thread *td)
% {
%         struct bpf_d *d = dev->si_drv1;
%
%         BPFD_LOCK(d);
%         if (d->bd_state == BPF_WAITING)
%                 callout_stop(&d->bd_callout);
%         d->bd_state = BPF_IDLE;
%         BPFD_UNLOCK(d);
%         funsetown(&d->bd_sigio);
%         mtx_lock(&bpf_mtx);
%         if (d->bd_bif)
%                 bpf_detachd(d);
%         mtx_unlock(&bpf_mtx);
%         selwakeuppri(&d->bd_sel, PRINET);
% #ifdef MAC
%         mac_bpfdesc_destroy(d);
% #endif /* MAC */
%         knlist_destroy(&d->bd_sel.si_note);
%         bpf_freed(d);
%         dev->si_drv1 = NULL;
%         free(d, M_BPF);
Lots of possibly-in-use data structures are destroyed here.
%
%         return (0);
% }
%
% static int
% bpfread(struct cdev *dev, struct uio *uio, int ioflag)
% {
%         struct bpf_d *d = dev->si_drv1;
There is no locking for si_drv1 here and elsewhere. There used to be
locking -- long ago, there was implicit locking for non-SMP non-preemptive
kernels. Then there was Giant locking.
d can be either NULL or garbage here. (It is null if revoke() just
cleared si_drv1 and there is an implicit or accidental memory barrier
that makes the change visible here. It is garbage if revoke() has
started destroying it but hasn't yet set si_drv1 to NULL when we read
si_drv1, or if revoke() starts destroying it after we read si_drv1.)
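
For illustration only (hypothetical "mydev" names, and not a fix): even
the obvious guard one might be tempted to add at this point only narrows
the window, because nothing stops a concurrent d_close() from clearing
si_drv1 and freeing the softc between the check and the lock.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/conf.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/uio.h>

struct mydev_softc {
        struct mtx      sc_mtx;
        /* ... */
};

static int
mydev_read(struct cdev *dev, struct uio *uio, int ioflag)
{
        struct mydev_softc *sc;

        sc = dev->si_drv1;
        if (sc == NULL)         /* d_close() may already have run... */
                return (ENXIO);
        /* ...or may run right here, so sc can now point at freed memory. */
        mtx_lock(&sc->sc_mtx);
        /* ... */
        mtx_unlock(&sc->sc_mtx);
        return (0);
}

The quoted code doesn't even have the NULL check, but adding one wouldn't
make d safe to lock.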
%         int timed_out;
%         int error;
%
%         /*
%          * Restrict application to use a buffer the same size as
%          * as kernel buffers.
%          */
%         if (uio->uio_resid != d->bd_bufsize)
%                 return (EINVAL);
If we're lucky after losing the race, then d is null and we get a null
pointer panic here. I didn't observe this.
%
%         BPFD_LOCK(d);
If we're unlucky after losing the race, then we start corrupting *d here.
%                 error = msleep(d, &d->bd_mtx, PRINET|PCATCH,
%                      "bpf", d->bd_rtout);
Since we block, we implemented panic(9) even with implicit or explicit
Giant locking. The sleep gives a very large race window...
%                 if (error == EINTR || error == ERESTART) {
%                         BPFD_UNLOCK(d);
%                         return (error);
%                 }
...however, we now have locking for d, and bpfclose() blocks on this,
so I think we don't implement panic(9) here. We have a broken revoke()
instead -- it should do a forced close without waiting, but instead it
waits for one of the BPFD_UNLOCK() calls above or later in the function.
%                 if (error == EWOULDBLOCK) {
%                         /*
%                          * On a timeout, return what's in the buffer,
%                          * which may be nothing. If there is something
%                          * in the store buffer, we can rotate the buffers.
%                          */
%                         if (d->bd_hbuf)
%                                 /*
%                                  * We filled up the buffer in between
%                                  * getting the timeout and arriving
%                                  * here, so we don't need to rotate.
%                                  */
%                                 break;
%
%                         if (d->bd_slen == 0) {
%                                 BPFD_UNLOCK(d);
%                                 return (0);
%                         }
%                         ROTATE_BUFFERS(d);
%                         break;
%                 }
%         }
%         /*
%          * At this point, we know we have something in the hold slot.
%          */
%         BPFD_UNLOCK(d);
If bpfclose() was blocked on d, then our d is especially likely to become
corrupt just after here.
%
%         /*
%          * Move data from hold buffer into user space.
%          * We know the entire buffer is transferred since
%          * we checked above that the read buffer is bpf_bufsize bytes.
%          *
%          * XXXRW: More synchronization needed here: what if a second thread
%          * issues a read on the same fd at the same time? Don't want this
%          * getting invalidated.
%          */
%         error = bpf_uiomove(d, d->bd_hbuf, d->bd_hlen, uio);
%
%         BPFD_LOCK(d);
%         d->bd_fbuf = d->bd_hbuf;
%         d->bd_hbuf = NULL;
%         d->bd_hlen = 0;
%         bpf_buf_reclaimed(d);
%         BPFD_UNLOCK(d);
%
%         return (error);
% }
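
For comparison, here is a minimal sketch of an alternative that avoids
keying per-open state off si_drv1 altogether: the devfs cdevpriv KPI
(devfs_set_cdevpriv()/devfs_get_cdevpriv(), where available), which has
devfs own the per-open data and run its destructor exactly once per
struct file, even if the vnode is revoked first. The "mybpf" names are
hypothetical and this is not the bpf code; error handling is trimmed.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/conf.h>
#include <sys/malloc.h>
#include <sys/uio.h>

static MALLOC_DEFINE(M_MYBPF, "mybpf", "per-open descriptor state");

struct mybpf_d {
        /* per-descriptor buffers, locks, ... */
        int     dummy;
};

static void
mybpf_dtor(void *data)
{

        /* Run once per struct file, when its last reference goes away. */
        free(data, M_MYBPF);
}

static int
mybpf_open(struct cdev *dev, int oflags, int devtype, struct thread *td)
{
        struct mybpf_d *d;
        int error;

        d = malloc(sizeof(*d), M_MYBPF, M_WAITOK | M_ZERO);
        error = devfs_set_cdevpriv(d, mybpf_dtor);
        if (error != 0)
                free(d, M_MYBPF);
        return (error);
}

static int
mybpf_read(struct cdev *dev, struct uio *uio, int ioflag)
{
        struct mybpf_d *d;
        int error;

        /* Fails cleanly instead of chasing a stale si_drv1. */
        error = devfs_get_cdevpriv((void **)&d);
        if (error != 0)
                return (error);
        /* ... use *d under its own lock ... */
        return (0);
}

If I read the KPI right, the per-open data cannot be destroyed out from
under a thread that reached d_read through an open file descriptor, so
the si_drv1 races annotated above don't arise for such readers.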
Bruce