On Tue, Jan 26, 2016 at 6:52 AM, David Young <dyo...@pobox.com> wrote:
> On Mon, Jan 25, 2016 at 04:47:32PM +0900, Ryota Ozaki wrote:
>> On Mon, Jan 25, 2016 at 3:53 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
>> > On Mon, Jan 25, 2016 at 1:06 PM, Taylor R Campbell
>> > <campbell+netbsd-tech-k...@mumble.net> wrote:
>> >> Date: Mon, 25 Jan 2016 11:25:16 +0900
>> >> From: Ryota Ozaki <ozak...@netbsd.org>
>> >>
>> >> On Tue, Jan 19, 2016 at 2:22 PM, Ryota Ozaki <ozak...@netbsd.org>
>> >> wrote:
>> >> (snip)
>> >> >> (a) a per-CPU pktq that never distributes packets to another CPU, or
>> >> >> (b) a single-CPU pktq, to be used only from the CPU to which the
>> >> >> device's (queue's) interrupt handler is bound.
>> >> >>
>> >> > I'll rewrite the patch as you suggest (I prefer (a) for now).
>> >>
>> >> While rewriting it, I felt that it would end up being a lesser version
>> >> of pktqueue. So I think it may be better to change pktqueue to have a
>> >> flag to not distribute packets between CPUs than to implement another
>> >> one that duplicates pktqueue. Here is a patch with that approach:
>> >> http://www.netbsd.org/~ozaki-r/pktq-without-ipi.diff
>> >>
>> >> If we call pktq_create with PKTQ_F_NO_DISTRIBUTION, pktqueue doesn't
>> >> set up an IPI for the softint and never calls softint_schedule_cpu
>> >> (i.e., it never distributes packets).
>> >>
>> >> How about that approach?
>> >>
>> >> Some disjointed thoughts:
>> >>
>> >> 1. I don't think you actually need to change pktq(9). It looks like
>> >> if you pass in cpu_index(curcpu()) for the hash, it will consistently
>> >> use the current CPU, for which softint_schedule_cpu has a special case
>> >> that avoids ipi. So I don't expect it's substantially different from
>> >> <https://www.netbsd.org/~ozaki-r/softint-if_input.diff> -- though
>> >> maybe measurements will show my analysis is wrong!
>> >
>> > My intention is to prevent ipi_register in pktq_create, so that we
>> > don't need the ipi_sysinit movement...
>> >
>> >>
>> >> 2. Even though you avoid ipi(9), you're still using pcq(9), which
>> >> requires interprocessor synchronization -- but that is an unnecessary
>> >> cost because you're simply passing packets from hardintr to softintr
>> >> context on a single CPU. So that's why I specifically suggested ifq,
>> >> not pcq or pktqueue.
>> >
>> > ...though, right. The membars in pcq(9) are just overhead.
>> >
>> > Okay, I'll implement softint + percpu ifqs.
>> >
>> >>
>> >> 3. Random thought: If we do polling, I wonder whether instead of (or
>> >> in addition to) polling for up to (say) 100 packets in a softint, we
>> >> really ought to poll for arbitrarily many packets in a kthread with
>> >> KTHREAD_TS, so that we don't need to go back and forth between
>> >> hardintr/softintr during high throughput, but we also don't starve
>> >> user threads in that case.
>> >
>> > Actually that was a POC implementation, just to measure how efficient
>> > polling is (or not). So I don't intend to use the implementation as
>> > it is.
>> >
>> >>
>> >> I seem to recall starvation of user threads is what motivated matt@ to
>> >> split packet processing between a softint and a workqueue, depending
>> >> on the load, in bcmeth(4) (sys/arch/arm/broadcom/bcm53xx_eth.c).
>> >> Maybe he can comment on this? Have you studied how this driver works,
>> >> and maybe pq3etsec(4) too, which also does polling?
>> >
>> > I had read pq3etsec(4) but not bcmeth(4). pq3etsec(4) seems to use
>> > only softint.
>> >
>> > Anyway, I was also concerned about user thread starvation while
>> > implementing polling on wm(4). So the combined use of softint and
>> > workqueue sounds good. (FreeBSD's igb driver also uses a similar
>> > technique, IIUC.)
>>
>> Hmm, I misunderstood a bit. bcmeth(4) kicks a softint OR a workqueue
>> from the HW interrupt depending on the load (I thought the HW interrupt
>> always ran a softint and the softint kicked a workqueue if there were
>> more incoming packets). I'm curious about the throughput and latency
>> of this approach :)
>
> I tried once to make network-processing softints provide opportunities
> for user threads to run, but I realized after struggling with it that
> I was essentially solving a scheduling problem when we already had an
> adequate scheduler in the kernel. I ended up using a timesharing thread
> to process the Rx ring and a very basic hardware Rx-interrupt handler,
> kind of like this:
>
>         hardware interrupt handler:
>                 disable interrupts
>                 wake processing thread
>
>         processing thread:
>                 loop forever:
>                         enable interrupts
>                         wait for wakeup
>                         for each Rx packet on ring:
>                                 process packet
>
> That stopped the user-tickle watchdog from firing. It was handy having
> a full-fledged thread context to process packets in. But there were
> trade-offs. As Matt Thomas pointed out to me, if it takes longer for
> the NIC to read the next packet off of the network than it takes your
> thread to process the current packet, then your Rx thread is going to go
> back to sleep again after every single packet. So there's potentially
> a lot of context-switch overhead and latency when you're receiving
> back-to-back large packets.
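
Just to check that I follow, I guess that model would look roughly like
the following in a driver. This is only a sketch to make the discussion
concrete: xxx_disable_rxintr()/xxx_enable_rxintr()/xxx_rxeof() and the
softc layout are made up, locking and error handling are simplified,
and I haven't tested any of it.

#include <sys/param.h>
#include <sys/condvar.h>
#include <sys/device.h>
#include <sys/intr.h>
#include <sys/kthread.h>
#include <sys/lwp.h>
#include <sys/mutex.h>

struct xxx_softc;
static void xxx_disable_rxintr(struct xxx_softc *); /* placeholder */
static void xxx_enable_rxintr(struct xxx_softc *);  /* placeholder */
static void xxx_rxeof(struct xxx_softc *);          /* placeholder */

struct xxx_softc {
        device_t        sc_dev;
        kmutex_t        sc_rx_lock;     /* IPL_NET, shared with hardintr */
        kcondvar_t      sc_rx_cv;
        bool            sc_rx_work;
        lwp_t           *sc_rx_lwp;
};

/* Hardware Rx interrupt: mask Rx interrupts and wake the Rx thread. */
static int
xxx_intr(void *arg)
{
        struct xxx_softc *sc = arg;

        xxx_disable_rxintr(sc);         /* placeholder: mask Rx intr on chip */

        mutex_enter(&sc->sc_rx_lock);
        sc->sc_rx_work = true;
        cv_signal(&sc->sc_rx_cv);
        mutex_exit(&sc->sc_rx_lock);

        return 1;
}

/* Timesharing Rx thread: drain the ring, then unmask Rx interrupts. */
static void
xxx_rx_thread(void *arg)
{
        struct xxx_softc *sc = arg;

        for (;;) {
                mutex_enter(&sc->sc_rx_lock);
                while (!sc->sc_rx_work)
                        cv_wait(&sc->sc_rx_cv, &sc->sc_rx_lock);
                sc->sc_rx_work = false;
                mutex_exit(&sc->sc_rx_lock);

                xxx_rxeof(sc);          /* placeholder: process Rx descriptors */
                xxx_enable_rxintr(sc);  /* placeholder: unmask Rx intr on chip */
        }
}

/* Called once at attach time. */
static void
xxx_attach_rxthread(struct xxx_softc *sc)
{
        mutex_init(&sc->sc_rx_lock, MUTEX_DEFAULT, IPL_NET);
        cv_init(&sc->sc_rx_cv, "xxxrx");
        (void)kthread_create(PRI_NONE, KTHREAD_MPSAFE | KTHREAD_TS, NULL,
            xxx_rx_thread, sc, &sc->sc_rx_lwp, "%s rx",
            device_xname(sc->sc_dev));
}

If that sketch is right, every wakeup is a full context switch, which is
exactly the trade-off you describe for back-to-back large packets.
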
IIUC, bcmeth(4) solves (a part of?) the issue by using both a softint
and a workqueue; if the load is not too high, only the softint is
dispatched, so there is less context-switch overhead than always using
an Rx thread. However, I'm not sure whether the approach actually works
well.

> ISTR Matt had some ideas how context switches could be made faster, or
> h/w interrupt handlers could have an "ordinary" thread context, or the
> scheduler could control the rate of softints, or all of the above. I
> don't know if there's been any progress along those lines in the mean
> time.

He left some notes at http://www.netbsd.org/~matt/smpnet , but I'm not
sure whether they are related to the above ideas. I don't think any of
them are in -current yet.

  ozaki-r
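
P.S. To make my mental model of the softint-or-workqueue dispatch
concrete, here is a very rough sketch. It is only my guess at the
general shape, not what bcm53xx_eth.c actually does; yyy_rx_pending(),
yyy_rxeof() and the threshold are made up, and ring locking / interrupt
masking are omitted.

#include <sys/param.h>
#include <sys/atomic.h>
#include <sys/intr.h>
#include <sys/workqueue.h>

#define YYY_RX_LOAD_THRESHOLD   32      /* made-up load threshold */

struct yyy_softc;
static unsigned yyy_rx_pending(struct yyy_softc *);  /* placeholder */
static void yyy_rxeof(struct yyy_softc *);            /* placeholder */

struct yyy_softc {
        void                    *sc_rx_si;      /* softint cookie */
        struct workqueue        *sc_rx_wq;
        struct work             sc_rx_wk;
        volatile unsigned       sc_rx_wk_pending;
};

/*
 * Hardware interrupt: under light load run the usual softint; under
 * heavy load defer to a workqueue thread so that the scheduler can
 * balance Rx processing against user threads.
 */
static int
yyy_intr(void *arg)
{
        struct yyy_softc *sc = arg;

        if (yyy_rx_pending(sc) < YYY_RX_LOAD_THRESHOLD)
                softint_schedule(sc->sc_rx_si);
        else if (atomic_swap_uint(&sc->sc_rx_wk_pending, 1) == 0)
                workqueue_enqueue(sc->sc_rx_wq, &sc->sc_rx_wk, NULL);
        return 1;
}

static void
yyy_rx_softint(void *arg)
{
        struct yyy_softc *sc = arg;

        yyy_rxeof(sc);          /* placeholder: drain the Rx ring */
}

static void
yyy_rx_work(struct work *wk, void *arg)
{
        struct yyy_softc *sc = arg;

        yyy_rxeof(sc);          /* placeholder: drain the Rx ring */
        atomic_swap_uint(&sc->sc_rx_wk_pending, 0);
}

/* Called once at attach time. */
static void
yyy_attach_rx(struct yyy_softc *sc)
{
        sc->sc_rx_si = softint_establish(SOFTINT_NET | SOFTINT_MPSAFE,
            yyy_rx_softint, sc);
        (void)workqueue_create(&sc->sc_rx_wq, "yyyrx", yyy_rx_work, sc,
            PRI_NONE, IPL_NET, WQ_MPSAFE);
}

Whether something like this actually wins over always using a softint
(or always using an Rx thread) is exactly what I'd like to measure.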