On Tue, 20 Dec 2022, Mark Johnston wrote:
On Sun, Dec 18, 2022 at 10:52:58AM -0600, Kyle Evans wrote:
On Sat, Dec 17, 2022 at 11:22 AM Gleb Smirnoff <gleb...@freebsd.org> wrote:
Zhenlei,
On Fri, Dec 16, 2022 at 06:30:57PM +0800, Zhenlei Huang wrote:
Z> I managed to repeat this issue on CURRENT/14 with this small snip:
Z>
Z> -------------------------------------------
Z> #!/bin/sh
Z>
Z> # test jail name
Z> n="test_ref_leak"
Z>
Z> jail -c name=$n path=/ vnet persist
Z> # The following line trigger jail pr_ref leak
Z> jexec $n ifconfig lo0 inet 127.0.0.1/8
Z>
Z> jail -R $n
Z>
Z> # wait a moment
Z> sleep 1
Z>
Z> jls -j $n
Z>
Z> After DDB debugging and tracing , it seems that is triggered by a combine of
[1] and [2]
Z>
Z> [1] https://reviews.freebsd.org/rGfec8a8c7cbe4384c7e61d376f3aa5be5ac895915
<https://reviews.freebsd.org/rGfec8a8c7cbe4384c7e61d376f3aa5be5ac895915>
Z> [2] https://reviews.freebsd.org/rGeb93b99d698674e3b1cc7139fda98e2b175b8c5b
<https://reviews.freebsd.org/rGeb93b99d698674e3b1cc7139fda98e2b175b8c5b>
Z>
Z>
Z> In [1] the per-VNET uma zone is shared with the global one.
Z> `pcbinfo->ipi_zone = pcbstor->ips_zone;`
Z>
Z> In [2] unref `inp->inp_cred` is deferred called in inpcb_dtor() by
uma_zfree_smr() .
Z>
Z> Unfortunately inps freed by uma_zfree_smr() are cached and inpcb_dtor() is
not called immediately ,
Z> thus leaking `inp->inp_cred` ref and hence `prison->pr_ref`.
Z>
Z> And it is also not possible to free up the cache by per-VNET SYSUNINIT
tcp_destroy / udp_destroy / rip_destroy.
This is known issue and I'd prefer not to call it a problem. The "leak" of a
jail
happens only if machine is idle wrt the networking activity.
Getting back to the problem that started this thread - the epair(4)s not
immediately
popping back to prison0. IMHO, the problem again lies in the design of if_vmove
and
epair(4) in particular. The if_vmove shall not exist, instead we should do a
full
if_attach() and if_detach(). The state of an ifnet when it undergoes if_vmove
doesn't
carry any useful information. With Alexander melifaro@ we discussed better
options
for creating or attaching interfaces to jails that if_vmove. Until they are
ready
the most easy workaround to deal with annoying epair(4) come back problem is to
remove it manually before destroying a jail, like I did in 80fc25025ff.
It still behaved much better prior to eb93b99d6986, which you and Mark
were going to work on a solution for to allow the cred "leak" to close
up much more quickly. CC markj@, since I think it's been six months
since the last time I inquired about it, making this a good time to do
it again...
I spent some time trying to see if we could fix this in UMA/SMR and
talked to Jeff about it a bit. At this point I don't think it's the
right approach, at least for now. Really we have a composability
problem where different layers are using different techniques to signal
that they're done with a particular piece of memory, and they just
aren't compatible.
One thing I tried is to implement a UMA function which walks over all
SMR zones and synchronizes all cached items (so that their destructors
are called). This is really expensive, at minimum it has to bind to all
A semi-unrelated question -- do we have any documentation around SMR
in the tree which is not in subr_smr.c?
(I have to admit I find it highly confusing that the acronym is more
easily found as "Shingled Magnetic Recording (SMR)" in a different
header file).
CPUs in the system so that it can flush per-CPU buckets. If
jail_deref() calls that function, the bug goes away at least in my
limited testing, but its use is really a layering violation.
We could, say, periodically scan cached UMA/SMR items and invoke their
destructors, but for most SMR consumers this is unnecessary, and again
there's a layering problem: the inpcb layer shouldn't "know" that it has
to do that for its zones, since it's the jail layer that actually cares.
It also seems kind of strange that dying jails still occupy a slot in
the jail namespace. I don't really understand why the existence of a
dying jail prevents creation of a new jail with the same name, but
presumably there's a good reason for it?
You can create a new jail but if you have (physical) resources tied to
the old one which are not released, then you are stuck (physical
network interfaces for example).
Now my inclination is to try and fix this in the inpcb layer, by not
accessing the inp_cred at all in the lookup path until we hold the inpcb
lock, and then releasing the cred ref before freeing a PCB to its zone.
I think this is doable based on a few observations:
- When doing an SMR-protected lookup, we always lock the returned inpcb
before handing it to the caller. So we could in principle perform
inp_cred checking after acquiring the lock but before returning.
- If there are no jailed PCBs in a hash chain in_pcblookup_hash_locked()
always scans the whole chain.
- If we match only one PCB in a lookup, we can probably(?) return that
PCB without dereferencing the cred pointer at all. If not, then the
scan only has to keep track of a fixed number of PCBs before picking
which one to return. So it looks like we can perform a lockless scan
and keep track of matches on the stack, then lock the matched PCBs and
perform prison checks if necessary, without making the common case
more expensive.
In fact there is a parallel thread on freebsd-jail which reports that
this inp_cred access is a source of frequent cache misses. I was
surprised to see that the scan calls prison_flag() before even checking
the PCB's local address. So if the hash chain is large then we're
potentially performing a lot of unnecessary memory accesses (though
presumably it's common for most of the PCBs to be sharing a single
cred?). In particular we can perhaps solve two problems at once.
I haven't heard back after I sent the test program there; I hope that
can be solved independently first and any optimisations can then come.
Any thoughts? Are there some fundamental reasons this can't work?
--
Bjoern A. Zeeb r15:7