From: Jeff Garzik <[EMAIL PROTECTED]>
Date: Mon, 08 Oct 2007 21:13:59 -0400
> If you assume a scheduler implementation where each prio band is mapped
> to a separate CPU, you can certainly see where some CPUs could be
> substantially idle while others are overloaded, largely depending on the
> data workload (and priority contained within).

Right, which is why Peter added the prio DRR scheduler stuff for TX
multiqueue (see net/sched/sch_prio.c:rr_qdisc_ops) because this is what
the chips do.

But this doesn't get us to where we want to be as Peter has been
explaining a bit these past few days.

Ok, we're talking a lot but not pouring much concrete, let's start
doing that. I propose:

1) A library for transmit load balancing functions, with an interface
   that can be made visible to userspace. I can write this and test
   it on real multiqueue hardware.

   The whole purpose of this library is to set skb->queue_mapping
   based upon the load balancing function.

   Facilities will be added to handle virtualization port selection
   based upon destination MAC address as one of the "load balancing"
   methods.

2) Switch the default qdisc away from pfifo_fast to a new DRR fifo
   with load balancing using the code in #1. I think this is kind
   of in the territory of what Peter said he is working on.

   I know this is controversial, but realistically I doubt users
   benefit at all from the prioritization that pfifo provides. They
   will, on the other hand, benefit from TX queue load balancing on
   fast interfaces.

3) Work on discovering a way to make the locking on transmit as
   localized to the current thread of execution as possible. Things
   like RCU and statistic replication, techniques we use widely
   elsewhere in the stack, begin to come to mind.

I also want to point out another issue. Any argument wrt. reordering
is specious at best because right now reordering from qdisc to device
happens anyways.

And that's because we drop the qdisc lock first, then we grab the
transmit lock on the device and submit the packet.
So, after we drop the qdisc lock, another cpu can get the qdisc lock,
get the next packet (perhaps a lower priority one) and then sneak in
to get the device transmit lock before the first thread can, and thus
the packets will be submitted out of order.

This, along with other things, makes me believe that ordering really
doesn't matter in practice. And therefore, in practice, we can treat
everything from the qdisc to the real hardware as a FIFO even if
something else is going on inside the black box which might reorder
packets on the wire.
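
To make the reordering window concrete, here is a toy userspace model in
Python, not kernel code. It replays a fixed interleaving of the two steps
described above: "dequeue" (done under the qdisc lock) and "xmit" (done
under the device transmit lock). The function name, the schedule format
and the packet labels are all made up for illustration; the locks are
implicit because each step in the schedule runs atomically.

```python
from collections import deque


def replay(schedule, packets):
    """Replay (cpu, step) pairs and return the order packets reach the device.

    Hypothetical model: 'dequeue' stands for taking the next packet while
    holding the qdisc lock, 'xmit' for submitting it while holding the
    device transmit lock. Nothing here is a real kernel interface.
    """
    qdisc = deque(packets)
    in_hand = {}  # cpu -> packet dequeued but not yet submitted
    wire = []     # order in which the device sees the packets
    for cpu, step in schedule:
        if step == "dequeue":
            in_hand[cpu] = qdisc.popleft()
        elif step == "xmit":
            wire.append(in_hand.pop(cpu))
        else:
            raise ValueError(f"unknown step: {step}")
    return wire


# cpu 0 dequeues first, but cpu 1 wins the race for the transmit lock.
racy = [(0, "dequeue"), (1, "dequeue"), (1, "xmit"), (0, "xmit")]
# No overlap: each cpu finishes both steps before the other starts.
serial = [(0, "dequeue"), (0, "xmit"), (1, "dequeue"), (1, "xmit")]

print(replay(racy, ["A", "B"]))    # ['B', 'A']
print(replay(serial, ["A", "B"]))  # ['A', 'B']
```

The racy schedule is exactly the case above: the second packet overtakes
the first in the gap between the two locks.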
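
As a rough picture of what the library in #1 might do, here is a second
userspace sketch in Python, again not kernel code and not a proposed
interface. The Skb class is a toy stand-in for struct sk_buff, and
flow_hash_select, make_mac_select and set_queue_mapping are hypothetical
names. The only idea taken from the proposal is that a pluggable function
picks a TX queue and the result is stored in queue_mapping, with
destination MAC lookup as one such function. CRC32 is used only because
it is deterministic, it is an assumption and not a suggested hash.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Skb:
    """Toy stand-in for struct sk_buff; only the fields this sketch needs."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    dst_mac: str
    queue_mapping: int = 0


def flow_hash_select(skb, num_queues):
    """Hypothetical balancer: hash the flow tuple so a flow stays on one queue."""
    key = f"{skb.src_ip}|{skb.dst_ip}|{skb.src_port}|{skb.dst_port}".encode()
    return zlib.crc32(key) % num_queues


def make_mac_select(mac_to_queue, default_queue=0):
    """Hypothetical balancer for virtualization: destination MAC picks the queue."""
    def select(skb, num_queues):
        return mac_to_queue.get(skb.dst_mac, default_queue) % num_queues
    return select


def set_queue_mapping(skb, num_queues, select):
    """The one job of the library in this model: fill in skb.queue_mapping."""
    skb.queue_mapping = select(skb, num_queues)
    return skb.queue_mapping


if __name__ == "__main__":
    pkt = Skb("10.0.0.1", "10.0.0.2", 1234, 80, "aa:bb:cc:00:00:01")
    print(set_queue_mapping(pkt, 4, flow_hash_select))
    by_mac = make_mac_select({"aa:bb:cc:00:00:01": 2})
    print(set_queue_mapping(pkt, 4, by_mac))
```

Keeping the selection function separate from the code that writes
queue_mapping is what would let the qdisc in #2 swap balancing methods
without caring how the queue was chosen.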