On 14 Mar 2022, at 7:44, Michael Gmelin wrote:
On Sun, 13 Mar 2022 17:53:44 +0000
"Bjoern A. Zeeb" <bzeeb-li...@lists.zabbadoz.net> wrote:

On 13 Mar 2022, at 17:45, Michael Gmelin wrote:

On 13. Mar 2022, at 18:16, Bjoern A. Zeeb
<bzeeb-li...@lists.zabbadoz.net> wrote:

On 13 Mar 2022, at 16:33, Michael Gmelin wrote:
It's important to point out that this only happens with
kern.ncpu>1. With kern.ncpu==1 nothing gets stuck.

This perfectly fits into the picture, since, as pointed out by
Johan,
the first commit that is affected[0] is about multicore support.

Ignore my ignorance, what is the default of net.isr.maxthreads and
net.isr.bindthreads (in stable/13) these days?


My tests were on CURRENT and I’m afk, but according to cgit[0][1],
max is 1 and bind is 0.

Would it make sense to repeat the test with max=-1?

I’d say yes, I’d also bind, but that’s just me.

I would almost assume Kristof running with -1 by default (but he can
chime in on that).

I tried various configuration permutations, all with ncpu=2:

- 14.0-CURRENT #0 main-n253697-f1d450ddee6
- 13.1-BETA1 #0 releng/13.1-n249974-ad329796bdb
- net.isr.maxthreads: -1 (which results in 2 threads), 1, 2
- net.isr.bindthreads: -1, 0, 1, 2
- net.isr.dispatch: direct, deferred

All resulting in the same behavior (hang after a few seconds). They all work ok when running on a single core instance (threads=1 in this case).

I also ran the same test on 13.0-RELEASE-p7 for
comparison (unsurprisingly, it's ok).

I placed the script to reproduce the issue on freefall for your
convenience, so running it is as simple as:

    fetch https://people.freebsd.org/~grembo/hang_epair.sh
    # inspect content
    sh hang_epair.sh

or, if you feel lucky

    fetch -o - https://people.freebsd.org/~grembo/hang_epair.sh | sh

With that script I can also reproduce the problem.

I’ve experimented with this hack:

        diff --git a/sys/net/if_epair.c b/sys/net/if_epair.c
        index c39434b31b9f..1e6bb07ccc4e 100644
        --- a/sys/net/if_epair.c
        +++ b/sys/net/if_epair.c
@@ -415,7 +415,10 @@ epair_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)

                case SIOCSIFMEDIA:
                case SIOCGIFMEDIA:
        +               printf("KP: %s() SIOCGIFMEDIA\n", __func__);
                        sc = ifp->if_softc;
+ taskqueue_enqueue(epair_tasks.tq[0], &sc->queues[0].tx_task);
        +
                        error = ifmedia_ioctl(ifp, ifr, &sc->media, cmd);
                        break;

That kicks the receive code whenever I `ifconfig epair0a`, and I see a little more traffic every time I do so. That suggests pretty strongly that there’s an issue with how we dispatch work to the handler thread. So presumably there’s a race between epair_menq() and epair_tx_start_deferred().

epair_menq() tries to only enqueue the receive work if there’s nothing in the buf_ring, on the grounds that if there is the previous packet scheduled the work. Clearly there’s an issue there.

I’ll try to dig into that in the next few days.

Kristof

Reply via email to