Date: Sat, 24 Jan 2015 14:34:24 +0100 From: Manuel Bouyer <[email protected]>
I guess what happens is: vndthread is blocked trying to get the vn_lock because the thread doing the sys_fsync() is holding it. The thread in sys_fsync() waits for v_numoutput to go down to 0, but this won't happen because vndthread has increased v_numoutput but has not queued the I/O yet. Whatever v_numoutput is accounting needs to be done alongside the accounting if anything is going to cv_wait on it. It is clearly wrong API design (no surprise) that there are many places in the tree that read and modify v_numoutput directly. For example, why does genfs_do_io increment v_numoutput by 2?? Step #1 is to identify what v_numoutput is supposed to be accounting, document it, and move all uses of it to a single abstraction in vfs_vnode.c... Should a thread hold the vnode lock (vn_lock()) before increasing v_numoutput ? No, I don't think so. It seems to be the number of pending bufs that have been submitted to the disk device to satisfy writes to the vnode, so there's no invariant related to the per-file-system state. It looks to me as though it is wrong to (a) increase v_numoutput, (b) wait for the vnode lock, and then (c) submit the buf you just counted in v_numoutput, because we can't lock the vnode until the buf is completed if fsync began waiting between (a) and (b). (`Submit' in (c) means sending to a disk, e.g. with VOP_STRATEGY, or, in this case, just calling nestiobuf_done.) On the other hand, it is also wrong to (a) submit a buf, and then (b) increase v_numoutput, because there may be outstanding bufs which an fsync initiated between (a) and (b) won't notice. But is that wrong for the parent buf of a nestiobuf transaction? These don't actually correspond to data themselves, and it seems that the purpose of the v_numoutput accounting is to notify fsync when all data submitted to disk have been written. So based on this analysis I suspect it would be safe to move the v_numoutput++ to just before the call to nestiobuf_done in handle_with_strategy. Aside: It looks like the v_numoutput += 2 in genfs_do_io represents the parent buf of the nestiobuf transactions, and reusing that buf for another iodone callback which calls vwakeup.
