On 18/11/22(Fri) 21:33, Mark Kettenis wrote:
> > Date: Thu, 17 Nov 2022 20:23:37 +0100
> > From: Mark Kettenis <[email protected]>
> >
> > > From: Jeremie Courreges-Anglas <[email protected]>
> > > Date: Thu, 17 Nov 2022 18:00:21 +0100
> > >
> > > On Tue, Nov 15 2022, Martin Pieuchot <[email protected]> wrote:
> > > > UVM vnode objects include a reference count to keep track of the number
> > > > of processes that have the corresponding pages mapped in their VM space.
> > > >
> > > > When the last process referencing a given library or executable dies,
> > > > the reaper will munmap this object on its behalf. When this happens it
> > > > doesn't free the associated pages to speed-up possible re-use of the
> > > > file. Instead the pages are placed on the inactive list but stay ready
> > > > to be pmap_enter()'d without requiring I/O as soon as a new process
> > > > needs to access them.
> > > >
> > > > The mechanism to keep pages populated, known as UVM_VNODE_CANPERSIST,
> > > > doesn't work well with swapping [0]. For some reason, when the page
> > > > daemon wants to free pages on the inactive list, it tries to flush
> > > > the pages to disk and calls panic(9) because it needs a valid
> > > > reference to the vnode to do so.
> > > >
> > > > This indicates that the mechanism described above, which seems to work
> > > > fine for RO mappings, is currently buggy in more complex situations.
> > > > Flushing the pages when the last reference of the UVM object is dropped
> > > > also doesn't seem to be enough as bluhm@ reported [1].
> > > >
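The persist lifecycle described above can be sketched as a toy userspace
model (names such as `uvn_model` are invented for illustration; this is not
the kernel's code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a UVM vnode object: a reference count plus resident
 * pages that may persist on the inactive list after the last unmap. */
struct uvn_model {
	int  refs;           /* processes mapping the object */
	bool pages_resident; /* pages still cached, no I/O needed */
};

static void
uvn_model_map(struct uvn_model *u)
{
	u->refs++;
	u->pages_resident = true;
}

/* With CANPERSIST the pages stay resident when refs drops to 0,
 * ready to be pmap_enter()'d again without I/O; without it they
 * are freed when the last reference goes away. */
static void
uvn_model_unmap(struct uvn_model *u, bool canpersist)
{
	assert(u->refs > 0);
	if (--u->refs == 0 && !canpersist)
		u->pages_resident = false;
}
```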
> > > > The diff below, which has already been committed and reverted, gets rid of
> > > > the UVM_VNODE_CANPERSIST logic. I'd like to commit it again now that
> > > > the arm64 caching bug has been found and fixed.
> > > >
> > > > Getting rid of this logic means more I/O will be generated and pages
> > > > might have a faster reuse cycle. I'm aware this might introduce a small
> > > > slowdown,
> > >
> > > Numbers for my usual make -j4 in libc,
> > > on an Unmatched riscv64 box, now:
> > > 16m32.65s real 21m36.79s user 30m53.45s system
> > > 16m32.37s real 21m33.40s user 31m17.98s system
> > > 16m32.63s real 21m35.74s user 31m12.01s system
> > > 16m32.13s real 21m36.12s user 31m06.92s system
> > > After:
> > > 19m14.15s real 21m09.39s user 36m51.33s system
> > > 19m19.11s real 21m02.61s user 36m58.46s system
> > > 19m21.77s real 21m09.23s user 37m03.85s system
> > > 19m09.39s real 21m08.96s user 36m36.00s system
> > >
> > > 4 cores amd64 VM, before (-current plus an other diff):
> > > 1m54.31s real 2m47.36s user 4m24.70s system
> > > 1m52.64s real 2m45.68s user 4m23.46s system
> > > 1m53.47s real 2m43.59s user 4m27.60s system
> > > After:
> > > 2m34.12s real 2m51.15s user 6m20.91s system
> > > 2m34.30s real 2m48.48s user 6m23.34s system
> > > 2m37.07s real 2m49.60s user 6m31.53s system
> > >
> > > > however I believe we should work towards loading files from the
> > > > buffer cache to save I/O cycles instead of having another layer of
> > > > cache.
> > > > Such work isn't trivial and making sure the vnode <-> UVM relation is
> > > > simple and well understood is the first step in this direction.
> > > >
> > > > I'd appreciate it if the diff below could be tested on many
> > > > architectures, including the offending rpi4.
> > >
> > > Mike has already tested a make build on a riscv64 Unmatched. I have
> > > also run regress in sys, lib/libc and lib/libpthread on that arch. As
> > > far as I can see this looks stable on my machine, but what I really care
> > > about is the riscv64 bulk build cluster (I'm going to start another
> > > bulk build soon).
> > >
> > > > Comments? Oks?
> > >
> > > The performance drop in my microbenchmark kinda worries me but it's only
> > > a microbenchmark...
> >
> > I wouldn't call this a microbenchmark. I fear this is typical for
> > builds of anything on clang architectures. And I expect it to be
> > worse on single-processor machines where *every* time we execute clang
> > or lld all the pages are thrown away upon exit and will need to be
> > read back from disk again for the next bit of C code we compile.
> >
> > Having mmap'ed reads go through the buffer cache should indeed help
> > here, but that smells like UBC and even if it isn't, it is something
> > we don't have and will be tricky to get right.
> >
> > The discussions we had during h2k22 made me understand this thing a
> > bit better. Is there any reason why we can't keep a reference to the
> > vnode and have the pagedaemon drop it when it drops the pages? Is
> > there a chance that we run out of vnodes this way? I vaguely remember
> > beck@ saying that we can have tons of vnodes now.
>
> Maybe that isn't the best idea. So here is what may be an actual fix
> for the "macppc panic: vref used where vget required" problem we're
> seeing that prompted us to look at UVM_VNODE_CANPERSIST.
>
> If we look back at the race that bluhm@ posted:
>
> panic: vref used where vget required
> Stopped at db_enter+0x24: lwz r11,12(r1)
> TID PID UID PRFLAGS PFLAGS CPU COMMAND
> 192060 78628 21 0x2 0 0 c++
> *132472 74971 0 0x14000 0x200 1K pagedaemon
> db_enter() at db_enter+0x20
> panic(91373c) at panic+0x158
> vref(23b8fa20) at vref+0xac
> uvm_vnp_uncache(e7eb7c50) at uvm_vnp_uncache+0x88
> ffs_write(7ed2e423) at ffs_write+0x3b0
> VOP_WRITE(23b8fa20,e7eb7c50,40,1ff3f60) at VOP_WRITE+0x48
> uvn_io(7ed2e423,a93d0c,a93814,ffffffff,e4010000) at uvn_io+0x264
> uvn_put(414b673a,e7eb7dd4,24f00070,5326e90) at uvn_put+0x64
> uvm_pager_put(0,0,e7eb7d70,6ee0b8,2000000,80000000,0) at uvm_pager_put+0x15c
> uvmpd_scan_inactive(0) at uvmpd_scan_inactive+0x224
> uvmpd_scan() at uvmpd_scan+0x158
> uvm_pageout(7e932633) at uvm_pageout+0x398
> fork_trampoline() at fork_trampoline+0x14
> end trace frame: 0x0, count: 2
>
> I don't think there is anything complicated going on. The pagedaemon
> is paging out a page associated with a persisting vnode. That page
> isn't mapped anymore, so nobody holds a reference to the vnode, and
> it sits happily on the freelist to either be reused or recycled.
>
> The code responsible for paging out this page is uvn_put(), which
> calls uvn_io() and eventually ffs_write() *without* holding a
> reference to the vnode. I believe that is wrong; ffs_write() clearly
> assumes that its caller holds a reference to the vnode. But this
> isn't checked until uvm_vnp_uncache() attempts to grab another
> reference, which triggers the diagnostic panic we're seeing.
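The diagnostic that fires can be modeled in a few lines: vref() may only
bump an already-held reference, while vget() is the legal way to take the
first one (toy code with invented names, not the kernel's vfs):

```c
#include <assert.h>
#include <stdbool.h>

struct vnode_model {
	int  usecount;       /* held references */
	bool being_recycled; /* vnode is being cleaned for reuse */
};

/* vref(9) may only be used on a vnode that already has a reference;
 * calling it with usecount == 0 is exactly the "vref used where vget
 * required" diagnostic panic. */
static void
vref_model(struct vnode_model *vp)
{
	assert(vp->usecount > 0 && "vref used where vget required");
	vp->usecount++;
}

/* vget(9) is allowed to take the first reference, but it can fail,
 * e.g. when the vnode is being recycled. */
static int
vget_model(struct vnode_model *vp)
{
	if (vp->being_recycled)
		return 1;	/* EBUSY-style failure */
	vp->usecount++;
	return 0;
}
```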
>
> So I think uvn_put() should make sure it holds a reference to the
> vnode. And since it won't hold one if called by the pagedaemon, this
> means we should use vget() to get one. Now there may be a risk of a
> deadlock here. The pagedaemon may be trying to page out a page that
> is associated with a vnode that is already in the process of being
> recycled. But cleaning the vnode as part of the recycling process
> requires the vmobjlock that we're holding. So use LK_NOWAIT and trust
> that the page will be written to disk and freed as the associated
> vnode gets cleaned. That means the pagedaemon can simply skip the
> page, so we return VM_PAGER_PEND.
>
> Thoughts? If this works, I probably should add some comments
> explaining what's going on here.
I like the idea and the diff.
I don't like the use of VM_PAGER_PEND. This return value is used to
indicate that I/O has been started asynchronously. That currently never
happens with vnodes, so we'd be introducing a new, tricky error code path.
Plus, the page hasn't been flushed to disk when uvm_pager_put() returns.
I'd rather return VM_PAGER_AGAIN, which would leave the page on the queue;
once the page daemon unlocks `vmobjlock' the racing thread will be able
to proceed.
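To make the difference between the two policies concrete, here is a toy
model of the put path and the pagedaemon's reaction to its return value
(the constants and function names are invented for illustration, mirroring
VM_PAGER_PEND/VM_PAGER_AGAIN; this is not the actual uvn_put()):

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-ins for the pager return codes. */
enum { PAGER_OK, PAGER_PEND, PAGER_AGAIN };

struct pq_model {
	bool on_queue; /* page still on the inactive queue */
	bool flushed;  /* page actually written to disk */
};

/* Model of the put routine under the proposed fix: if the vnode
 * reference cannot be taken (vget() fails because the vnode is being
 * cleaned), either pretend I/O is pending or ask for a retry. */
static int
put_model(struct pq_model *pg, bool vget_ok, bool use_again)
{
	if (!vget_ok)
		return use_again ? PAGER_AGAIN : PAGER_PEND;
	pg->flushed = true;
	return PAGER_OK;
}

/* AGAIN leaves the page on the queue to be retried once the vmobjlock
 * is released; PEND drops it from consideration even though no I/O
 * was actually started. */
static void
pagedaemon_model(struct pq_model *pg, int ret)
{
	pg->on_queue = (ret == PAGER_AGAIN);
}
```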
> Index: uvm/uvm_vnode.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
> retrieving revision 1.130
> diff -u -p -r1.130 uvm_vnode.c
> --- uvm/uvm_vnode.c 20 Oct 2022 13:31:52 -0000 1.130
> +++ uvm/uvm_vnode.c 18 Nov 2022 19:56:51 -0000
> @@ -899,11 +899,18 @@ uvn_cluster(struct uvm_object *uobj, vof
> int
> uvn_put(struct uvm_object *uobj, struct vm_page **pps, int npages, int flags)
> {
> + struct uvm_vnode *uvn = (struct uvm_vnode *)uobj;
> int retval;
>
> KASSERT(rw_write_held(uobj->vmobjlock));
>
> + KASSERT(uvn->u_flags & UVM_VNODE_VALID);
> + if (vget(uvn->u_vnode, LK_NOWAIT))
> + return VM_PAGER_PEND;
> +
> retval = uvn_io((struct uvm_vnode*)uobj, pps, npages, flags, UIO_WRITE);
> +
> + vrele(uvn->u_vnode);
>
> return retval;
> }