On Sun, Jun 28, 2020 at 9:57 PM Rick Macklem wrote:
>
> Just in case you were waiting for another email, I have now run several
> cycles of the kernel build over NFS on a recent head kernel with the
> one line change and it has not hung.
>
> I don't know if this is the correct fix, but it would be nice to get something
> into head to fix this.
>
> If I don't hear anything in the next few days, I'll put it in a PR so it
> doesn't get forgotten.
>
> rick

Thanks for the follow through on this.

I think the patch is not complete. It looks like the problem is that
for systems that do not have UMA_MD_SMALL_ALLOC, we do

    uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc);

but we haven't set an appropriate free function. This is probably why
UMA_ZONE_NOFREE was originally there. When NOFREE was removed, it was
appropriate for systems with uma_small_alloc.

So by default we get page_free as our free function. That calls
kmem_free, which calls vmem_free ... but we do our allocs with
vmem_xalloc. I'm not positive, but I think the problem is that in
effect we vmem_xalloc -> vmem_free, not vmem_xfree.

Three possible fixes:

 1: The one you tested, but this is not best for systems with
    uma_small_alloc.
 2: Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC.
 3: Actually provide an appropriate vmem_bt_free function.

I think we should just do option 2 with a comment; it's simple and it's
what we used to do. I'm not sure how much benefit we would see from
option 3, but it's more work.

Ryan
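
A rough sketch of what option 2 above might look like, written as a standalone C fragment rather than the actual kernel change. The flag values and the helper name vmem_bt_zone_flags() are assumptions made for illustration only; in the kernel the real flags are defined in sys/vm/uma.h, and the result would be passed as the flags argument when the vmem btag zone is created.

```c
/*
 * Illustrative stand-ins so the fragment compiles outside the kernel.
 * The real definitions live in sys/vm/uma.h and their values may differ.
 */
#define UMA_ZONE_NOFREE	0x0020
#define UMA_ZONE_VM	0x0080

/*
 * Build with -DUMA_MD_SMALL_ALLOC to model a platform that has
 * uma_small_alloc; leave it undefined to model one that does not.
 */

/* Hypothetical helper: compute the flags for the vmem btag zone. */
static int
vmem_bt_zone_flags(void)
{
	int flags = UMA_ZONE_VM;

#ifndef UMA_MD_SMALL_ALLOC
	/*
	 * Without uma_small_alloc the zone uses vmem_bt_alloc as its
	 * allocator but has no matching free function, so keep its
	 * slabs from ever being freed.
	 */
	flags |= UMA_ZONE_NOFREE;
#endif
	return (flags);
}
```

Compiled without UMA_MD_SMALL_ALLOC, the helper returns both flags, which matches the UMA_ZONE_VM | UMA_ZONE_NOFREE combination tested earlier in the thread; compiled with it, only UMA_ZONE_VM is returned.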
>
>
> From: owner-freebsd-curr...@freebsd.org
> on behalf of Rick Macklem
> Sent: Thursday, June 18, 2020 11:42 PM
> To: Ryan Libby
> Cc: Konstantin Belousov; Jeff Roberson; freebsd-current@freebsd.org
> Subject: Re: r358252 causes intermittent hangs where processes are stuck
> sleeping on btalloc
>
> Ryan Libby wrote:
> >On Mon, Jun 15, 2020 at 5:06 PM Rick Macklem wrote:
> >>
> >> Rick Macklem wrote:
> >> >r358098 will hang fairly easily, in 1-3 cycles of the kernel build over NFS.
> >> >I thought this was the culprit, since I did 6 cycles of r358097 without a hang.
> >> >However, I just got a hang with r358097, but it looks rather different.
> >> >The r358097 hang did not have any processes sleeping on btalloc. They
> >> >appeared to be waiting on two different locks in the buffer cache.
> >> >As such, I think it might be a different problem. (I'll admit I should have
> >> >made notes about this one before rebooting, but I was frustrated that
> >> >it happened and rebooted before looking at it in much detail.)
> >> Ok, so I did 10 cycles of the kernel build over NFS for r358096 and never
> >> got a hang.
> >> --> It seems that r358097 is the culprit and r358098 makes it easier
> >> to reproduce.
> >> --> Basically runs out of kernel memory.
> >>
> >> It is not obvious if I can revert these two commits without reverting
> >> other ones, since there were a bunch of vm changes after these.
> >>
> >> I'll take a look, but if you guys have any ideas on how to fix this, please
> >> let me know.
> >>
> >> Thanks, rick
> >
> >Interesting. Could you try re-adding UMA_ZONE_NOFREE to the vmem btag
> >zone to see if that rescues it, on whatever base revision gets you a
> >reliable repro?
> Good catch! That seems to fix it. I've done 8 cycles of kernel build over
> NFS without a hang (normally I'd get one in the first 1-3 cycles).
>
> I don't know if the intent was to delete UMA_ZONE_VM and r358097
> had a typo in it and deleted UMA_ZONE_NOFREE or ???
>
> Anyhow, I just put it back to UMA_ZONE_VM | UMA_ZONE_NOFREE and
> the hangs seem to have gone away.
>
> The small patch I did is attached, in case that isn't what you meant.
>
> I'll run a few more cycles just in case, but I think this fixes it.
>
> Thanks, rick
>
> >
> > Jeff, to fill you in, I have been getting intermittent hangs on a Pentium 4
> > (single core i386) with 1.25Gbytes ram when doing kernel builds using
> > head kernels from this winter. (I also saw one when doing a kernel build
> > on UFS, so they aren't NFS specific, although easier to reproduce that way.)
> > After a typical hang, there will be a bunch of processes sleeping on "btalloc"
> > and several processes holding the following lock:
> > exclusive sx lock @ vm/vm_map.c:4761
> > - I have seen hangs where that is the only lock held by any process except
> >   the interrupt thread.
> > - I have also seen processes waiting on the following locks:
> > kern/subr_vmem.c:1343
> > kern/subr_vmem.c:633
> >
> > I can't be absolutely sure r358098 is the culprit, but it seems to make the
> > problem more reproducible.
> >
> > If anyone has a patch suggestion, I can test it.
> > Otherwise, I will continue to test r358097 and earlier, to try and see what hangs
> > occur. (I've done 8 cycles of testing of r356776 without difficulties, but that
> > doesn't guarantee it isn't broken.)
> >
> > There is a bunch more of