> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Tuesday, 5 November 2024 16.55
> 
> On Tue, 5 Nov 2024 09:49:39 +0100
> Morten Brørup <m...@smartsharesystems.com> wrote:
> 
> > > I suspect AF_PACKET provides an intermediate step which can
> > > buffer more or spread out the work.
> > 
> > Agree. It's a Linux scheduling issue.
> > 
> > With DPDK polling, there is no interrupt in the kernel scheduler.
> > If the CPU core running the DPDK polling thread is running some
> > other thread when the packets arrive on the hardware, the DPDK
> > polling thread is NOT scheduled immediately, but has to wait for
> > the kernel scheduler to switch to this thread instead of the
> > other thread.
> > 
> > Quite a lot of time can pass before this happens - the kernel
> > scheduler does not know that the DPDK polling thread has urgent
> > work pending.
> > 
> > And the number of RX descriptors needs to be big enough to absorb
> > all packets arriving during the scheduling delay.
> > 
> > It is not well described how to *guarantee* that nothing but the
> > DPDK polling thread runs on a dedicated CPU core.
> 
> That's why any non-trivial DPDK application needs to run on
> isolated CPUs.
Exactly. And it is non-trivial and not well described how to do this, especially in virtual environments.

E.g. I ran some scheduling latency tests earlier today, and frequently observed 500-1000 us of scheduling latency under VMware vSphere ESXi. This requires a large number of RX descriptors to absorb without packet loss. (Disclaimer: The virtual machine configuration had not been optimized. Tweaking the knobs offered by the hypervisor might improve this.)

The exact same firmware (same kernel, rootfs, libraries, applications etc.) running directly on our purpose-built hardware has scheduling latency very close to the kernel's default "timerslack" (50 us).
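
For reference, the kind of measurement I mean is a wakeup latency probe, similar in spirit to cyclictest. A minimal sketch (not the exact test code I ran) could look like this:

    #define _POSIX_C_SOURCE 200112L
    #include <stdio.h>
    #include <time.h>

    /* Minimal wakeup latency probe: request a wakeup at an absolute
     * time and measure how much later the thread actually runs. The
     * overshoot is scheduling latency plus timer slack. */
    int main(void)
    {
        const long interval_ns = 1000000; /* request a wakeup every 1 ms */
        struct timespec next, now;
        long max_ns = 0;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (int i = 0; i < 10000; i++) {
            next.tv_nsec += interval_ns;
            if (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec++;
            }
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            clock_gettime(CLOCK_MONOTONIC, &now);
            long lat_ns = (now.tv_sec - next.tv_sec) * 1000000000L +
                          (now.tv_nsec - next.tv_nsec);
            if (lat_ns > max_ns)
                max_ns = lat_ns;
        }
        printf("max wakeup latency: %ld us\n", max_ns / 1000);
        return 0;
    }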
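
To put 500-1000 us into perspective for RX ring sizing: the number of descriptors needed to ride out a scheduling delay is roughly the packet rate times the delay. A back-of-the-envelope calculation, assuming worst-case 64 byte frames at 10 GbE line rate (illustrative numbers, not measurements from the test above):

    #include <stdio.h>

    /* RX ring sizing back-of-the-envelope: descriptors needed to
     * absorb a scheduling delay without drops at 10 GbE line rate
     * with minimum-size frames. Illustrative numbers only. */
    int main(void)
    {
        const double link_bps = 10e9;
        /* 64 B frame + 20 B preamble/IFG = 84 B per packet on the wire */
        const double pps = link_bps / (84 * 8);    /* ~14.88 Mpps */
        const double delay_us[] = { 50.0, 500.0, 1000.0 };

        for (int i = 0; i < 3; i++)
            printf("%6.0f us delay -> %6.0f descriptors\n",
                   delay_us[i], pps * delay_us[i] / 1e6);
        return 0;
    }

I.e. the latencies I observed under ESXi translate to roughly 7,500-15,000 descriptors at line rate with small packets, which is beyond the maximum RX ring size of some common NICs.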
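
As for the isolation itself, the usual starting point on bare metal Linux is a kernel command line along these lines (core numbers purely illustrative, and the details vary with kernel version - this is a sketch, not a recipe):

    # Keep the scheduler, the timer tick and RCU callbacks away
    # from cores 2-3 (illustrative core numbers)
    isolcpus=2-3 nohz_full=2-3 rcu_nocbs=2-3

plus pinning the DPDK application to those cores with the EAL core list option (e.g. "-l 2-3"). Even then, per-CPU kernel threads can still run on the isolated cores, which is part of why a *guarantee* is so hard to give - and in a VM, the hypervisor scheduler adds a layer the guest cannot control at all.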