Module Name: src Committed By: martin Date: Tue Aug 1 14:49:06 UTC 2023
Modified Files: src/sys/dev/dkwedge [netbsd-10]: dk.c src/sys/kern [netbsd-10]: subr_disk.c src/sys/sys [netbsd-10]: disk.h Log Message: Pull up following revision(s) (requested by riastradh in ticket #284): sys/dev/dkwedge/dk.c 1.125-1.158 sys/kern/subr_disk.c 1.135-1.137 sys/sys/disk.h 1.78 dk(4): Explain why dk_rawopens can't overflow and assert it. dk(4): Restore assertions in dklastclose. We only enter dklastclose if the wedge is open (sc->sc_dk.dk_openmask != 0), which can happen only if dkfirstopen has succeeded, in which case we hold a dk_rawopens reference to the parent that prevents anyone else from closing it. Hence sc->sc_parent->dk_rawopens > 0. On open, sc->sc_parent->dk_rawvp is set to nonnull, and it is only reset to null on close. Hence if the parent is still open, as it must be here, sc->sc_parent->dk_rawvp must be nonnull. dk(4): Avoid holding dkwedges_lock while allocating array. This is not great -- we shouldn't be choosing the unit number here anyway; we should just let autoconf do it for us -- but it's better than potentially blocking any dk_openlock or dk_rawlock (which are sometimes held when waiting for dkwedges_lock) for memory allocation. dk(4): KNF: return (v) -> return v. No functional change intended. dk(4): KNF: Whitespace. No functional change intended. dk(4): Omit needless void * cast. No functional change intended. dk(4): Fix typo in comment: dkstrategy, not dkstragegy. No functional change intended. dk(4): ENXIO, not ENODEV, means no such device. ENXIO is `device not configured', meaning there is no such device. ENODEV is `operation not supported by device', meaning the device is there but refuses the operation, like writing to a read-only medium. Exception: For undefined ioctl commands, it's not ENODEV _or_ ENXIO, but rather ENOTTY, because why make any of this obvious when you could make it obscure Unix lore? dk(4): KNF: Sort includes. No functional change intended. dk(4): <sys/rwlock.h> for rwlock(9). dk(4): Prevent races in access to struct dkwedge_softc::sc_size. Rules: 1. Only ever increases, never decreases. (Decreases require removing and readding the wedge.) 2. Increases are serialized by dk_openlock. 3. Reads can happen unlocked in any context where the softc is valid. Access is gathered into dkwedge_size* subroutines -- don't touch sc_size outside these. For now, we use rwlock(9) to keep the reasoning simple. This should be done with atomics on 64-bit platforms and a seqlock on 32-bit platforms to avoid contention. However, we can do that in a later change. dk(4): Move CFDRIVER_DECL and CFATTACH_DECL3_NEW earlier in file. Follows the pattern of most drivers, and will be necessary for referencing dk_cd in dk_bdevsw and dk_cdevsw soon, to prevent open/detach races. No functional change intended. dk(4): Don't touch dkwedges or ndkwedges outside dkwedges_lock. dk(4): Assert parent vp is nonnull before we stash it away. Let's enable early attribution if this goes wrong. If it's not the parent's first open, also assert the parent vp is already nonnull. dk(4): Assert dkwedges[unit] is the sc we're about to free. dk(4): Require dk_openlock in dk_set_geometry. Not strictly necessary but this makes reasoning easier and documents with an assertion how disk_set_info is serialized. disk(9): Fix use-after-free race with concurrent disk_set_info. This can happen with dk(4), which allows wedges to have their size increased without destroying and recreating the device instance. Drivers which allow concurrent disk_set_info and disk_ioctl must serialize disk_set_info with dk_openlock. dk(4): Add null d_cancel routine to devsw. This way, dkclose is guaranteed that dkopen, dkread, dkwrite, dkioctl, &c., have all returned before it runs. For block opens, setting d_cancel also guarantees that any buffered writes are flushed with vinvalbuf before dkclose is called. dk(4): Fix callout detach race. 1. Set a flag sc_iostop under the lock sc_iolock so dkwedge_detach and dkstart don't race over it. 2. Decline to schedule the callout if sc_iostop is set. The callout is already only ever scheduled while the lock is held. 3. Use callout_halt to wait for any concurrent callout to complete. At this point, it can't reschedule itself. Without this change, the callout could be concurrently rescheduling itself as we issue callout_stop, leading to use-after-free later. dk(4): Use disk_begindetach and rely on vdevgone to close instances. The first step is to decide whether we can detach (if forced, yes; if not forced, only if not already open), and prevent new opens if so. There's no need to start closing open instances at this point -- we're just making a decision to detach, and preventing new opens by transitioning state that dkopen will respect[*]. The second step is to force all open instances to close. This is done by vdevgone. By the time vdevgone returns, there can be no open instances, so if there _were_ any, closing them via vdevgone will have passed through dklastclose. After that point, there can be no opens and no I/O operations, so dk_openmask must already be zero and the bufq must be empty. Thus, there's no need to have an explicit call to dklastclose (via dkwedge_cleanup_parent) before or after making the decision to detach. [*] Currently access to this state is racy: nothing serializes dkwedge_detach's state transition with dkopen's test. TBD in a separate commit shortly. dk(4): Set .d_cfdriver and .d_devtounit to plug open/detach race. This way, opening dkN or rdkN will wait if attach or detach is still in progress, and vdevgone will wake up such pending opens and make them fail. So it is no longer possible for a wedge to be detached after dkopen has already started using it. For now, we use a custom .d_devtounit function that looks up the autoconf unit number via the dkwedges array, which conceivably may use an independent unit numbering system -- nothing guarantees they match up. (In practice they will mostly match up, but concurrent wedge creation could lead to different numbering.) Eventually this should be changed so the two numbering systems match, which would let us delete the new dkunit function and just use dev_minor_unit like many other drivers can. dk(4): Take a read-lock on dkwedges_lock if we're only reading. - dkwedge_find_by_name - dkwedge_find_by_parent - dkwedge_print_wnames dk(4): Omit needless locking in dksize, dkdump. All the members these use are stable after initialization, except for the wedge size, which dkwedge_size safely reads a snapshot of without locking in the caller. dk(4): dkdump: Simplify. No functional change intended. dk(4): Narrow the scope of the device numbering lookup on detach. Just need it for vdevgone, order relative to other things in detach doesn't matter. No functional change intended. disk(9): Fix missing unlock in error branch in previous change. dk(4): Fix racy access to sc->sc_dk.dk_openmask in dkwedge_delall1. Need sc->sc_parent->dk_rawlock for this, as used in dkopen/dkclose. dk(4): Convert tests to assertions in various devsw operations. .d_cancel, .d_strategy, .d_read, .d_write, .d_ioctl, and .d_discard are only ever used between successful .d_open return and entry to .d_close. .d_open doesn't return until sc is nonnull and sc_state is RUNNING, and dkwedge_detach waits for the last .d_close before setting sc_state to DEAD. So there is no possibility for sc to be null or for sc_state to be anything other than RUNNING or DYING. There is a small functional change here but only in the event of a race: in the short window between when dkwedge_detach is entered, and when .d_close runs, any I/O operations (read, write, ioctl, &c.) may be issued that would have failed with ENXIO before. This shouldn't matter for anything: disk I/O operations are supposed to complete reasonably promptly, and these operations _could_ have begun milliseconds prior, before dkwedge_detach was entered, so it's not a significant distinction. Notes: - .d_open must still contend with trying to open a nonexistent wedge, of course. - .d_close must also contend with closing a nonexistent wedge, in case there were two calls to open in quick succession and the first failed while the second hadn't yet determined it would fail. - .d_size and .d_dump are used from ddb without any open/close. dk(4): Fix lock assertion in size increase: parent's, not wedge's. dk(4): Rename label for consistency. No functional change intended. dk(4): dkclose must handle a dying wedge too to close the parent. Otherwise the parent open leaks on detach (or revoke) when the wedge was open and had to be forcibly closed. Fixes assertion sc->sc_dk.dk_openmask == 0. ioctl(DIOCRMWEDGES): Delete only idle wedges. Don't forcibly delete busy wedges. Fixes accidental destruction of the busy wedge that the root file system is mounted on, triggered by syzbot's ioctl(DIOCRMWEDGES). dk(4): Omit needless sc_iopend, sc_dkdrn mechanism. vdevgone guarantees that all instances are closed by the time it returns, which in turn guarantees all I/O operations (read, write, ioctl, &c.) have completed, and, if the block device is open, vinvalbuf(V_SAVE) -> vflushbuf has completed, which forces all buffered transfers to be issued and waits for them to complete. So by the time vdevgone returns, no further transfers can be submitted and the bufq must be empty. dk(4): Fix typo: sc_state, not sc_satte. Had tested a patch series, but not every patch in it, and I inadvertently fixed the typo in a later patch in the series, not in the one I committed. dk(4): Make it clearer that dkopen EROFS branch doesn't leak. It looked like we may need to sometimes call dklastclose in error branch for the case of (flags & ~sc->sc_mode & FWRITE) != 0, but it is not actually possible to reach that case: if the caller requested read/write, and the parent is read-only, and it is the first time we've opened the parent, then dkfirstopen will fail with EROFS so we never get there. But this is confusing and it looked like the error branch is wrong, so let's rearrange the conditional to make it clearer that we cannot goto out after dkfirstopen has succeeded. And then assert that the case cannot happen when we do call dkfirstopen. dk(4): Need pdk->dk_openlock to read pdk->dk_wedges. To generate a diff of this commit: cvs rdiff -u -r1.124 -r1.124.4.1 src/sys/dev/dkwedge/dk.c cvs rdiff -u -r1.134 -r1.134.4.1 src/sys/kern/subr_disk.c cvs rdiff -u -r1.77 -r1.77.10.1 src/sys/sys/disk.h Please note that diffs are not public domain; they are subject to the copyright notices on the relevant files.