On 2 Sep, 2014, at 1:14 am, Aaron Wood wrote:

>> For the purposes of shaping, the CPU shouldn't need to touch the majority of 
>> the payload - only the headers, which are relatively small.  The bulk of the 
>> payload should DMA from one NIC to RAM, then DMA back out of RAM to the 
>> other NIC.  It has to do that anyway to route them, and without shaping 
>> there'd be more of them to handle.  The difference might be in the data 
>> structures used by the shaper itself, but I think those are also reasonably 
>> compact.  It doesn't even have to touch userspace, since it's not acting as 
>> the endpoint as my PowerBook was during my tests.
> 
> In an ideal case, yes.  But is that how this gets managed?  (I have no idea, 
> I'm certainly not a kernel developer).

It would be monumentally stupid to integrate two GigE MACs onto an SoC, and 
then to call it a "network processor", without adequate DMA support.  I don't 
think Atheros are that stupid.

Here's a more detailed datasheet:
        
http://pdf.datasheetarchive.com/indexerfiles/Datasheets-SW6/DSASW00118777.pdf

"Another memory factor is the ability to support multiple I/O operations in 
parallel via the WNPU's various ports. The on-chip SRAM in AR7100 WNPUs has 5 
ports that enable simultaneous access to and from five sources: the two gigabit 
Ethernet ports, the PCI port, the USB 2.0 port and the MIPS processor."

It's a reasonable question, however, whether the driver uses that support 
properly.  Mainline Linux kernel code seems to support the SoC but not the 
Ethernet; if it were just a minor variant of some other Atheros hardware, I'd 
have expected to see it integrated into one of the existing drivers.  Or maybe 
it is, and my greps just aren't showing it.

At minimum, however, there are MMIO ranges reported for each MAC during 
OpenWRT's boot sequence.  That's where the ring buffers are.  The most the CPU 
has to do is read each packet from RAM and write it into those buffers, or vice 
versa for receive - I think that's what my PowerBook has to do.  Ideally, a 
bog-standard DMA engine would take over that simple duty.  Either way, that's 
something that has to happen whether it's shaped or not, so it's unlikely to be 
our problem.

The same goes for the wireless MACs, incidentally.  These are standard ath9k 
mini-PCI cards, and the drivers *are* in mainline.  There shouldn't be any 
surprises with them.

> If the packet data is getting moved about from buffer to buffer (for instance 
> to do the htb calculations?) could that substantially change the processing 
> load?

The qdiscs only deal with packet headers and socket-buffer metadata, not the full packet data.  
Even then, they largely pass pointers around, inserting the headers into linked 
lists rather than copying them into arrays.  I believe a lot of attention has 
been directed at cache-friendliness in this area, and the MIPS caches are of 
conventional type.
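
As a rough illustration of the "pass pointers around" point, here is a minimal Python sketch of a FIFO that links packet descriptors together rather than copying payloads.  The names Packet and PointerFifo are made up for this example and are not kernel structures; the real qdiscs are C code working on socket buffers, so treat this purely as a model of the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    """Hypothetical packet descriptor: small metadata plus a reference to the payload."""
    length: int
    payload: bytes                      # referenced, never copied, by the queue
    next: Optional["Packet"] = None     # link to the following descriptor


class PointerFifo:
    """FIFO that links descriptors together instead of copying packet data."""

    def __init__(self) -> None:
        self.head: Optional[Packet] = None
        self.tail: Optional[Packet] = None
        self.backlog_bytes = 0

    def enqueue(self, pkt: Packet) -> None:
        # Only the link fields and a byte counter are touched here.
        pkt.next = None
        if self.tail is None:
            self.head = pkt
        else:
            self.tail.next = pkt
        self.tail = pkt
        self.backlog_bytes += pkt.length

    def dequeue(self) -> Optional[Packet]:
        pkt = self.head
        if pkt is None:
            return None
        self.head = pkt.next
        if self.head is None:
            self.tail = None
        pkt.next = None
        self.backlog_bytes -= pkt.length
        return pkt
```

The cost of enqueue and dequeue here is the same for a 64-byte packet as for a 1500-byte one, which is the property that matters for the argument above.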

>> Which brings me back to the timers, and other items of black magic.
> 
> Which would point to under-utilizing the processor core, while still having 
> high load? (I'm not seeing that, I'm curious if that would be the case).

It probably wouldn't manifest as high system load.  Rather, poor timer 
resolution or latency would show up as excessive delays between packets, during 
which the CPU is idle.  The packet egress times may turn out to be quantised - 
that would be a smoking gun, if detectable.
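
One possible way to look for that quantisation, sketched in Python: take packet egress timestamps from a capture, compute the gaps between consecutive packets, and count how many gaps sit close to a whole multiple of a candidate timer tick.  The function names, the 5% tolerance and the choice of candidate tick are all assumptions made for this sketch, and it says nothing about how accurate the capture timestamps themselves are.

```python
def inter_packet_gaps(timestamps):
    """Gaps (in seconds) between consecutive egress timestamps."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


def quantised_fraction(timestamps, tick, tolerance=0.05):
    """Fraction of gaps lying within tolerance*tick of a non-zero multiple of tick.

    A value near 1.0 suggests egress times are aligned to the tick;
    the tolerance default is an arbitrary choice for illustration.
    """
    gaps = inter_packet_gaps(timestamps)
    if not gaps:
        return 0.0
    hits = 0
    for gap in gaps:
        multiple = round(gap / tick)
        if multiple >= 1 and abs(gap - multiple * tick) <= tolerance * tick:
            hits += 1
    return hits / len(gaps)
```

For example, quantised_fraction(times, 0.001) would test against a 1ms tick.  Gaps that are spread evenly would be expected to land in the window only around 10% of the time with the default tolerance, so a fraction well above that is the thing to look for.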

>> Incidentally, transfer speed benchmarks involving wireless will certainly be 
>> limited by the wireless link.  I assume that's not a factor here.
> 
> That's the usual suspicion.  But these are RF-chamber, short-range lab setups 
> where the radios are running at full speed in perfect environments...

Sure.  But even turbocharged 'n' gear tops out at 450Mbps signalling, and much 
less than that is available even theoretically for TCP/IP throughput.  My point 
is that you're probably not running *your* tests over wireless.

> What this makes me realize is that I should go instrument the cpu stats with 
> each of the various operating modes:
> 
> * no shaping, anywhere
> * egress shaping
> * egress and ingress shaping at various limited levels:
>     * 10Mbps
>     * 20Mbps
>     * 50Mbps
>     * 100Mbps

Smaller increments at the high end of the range may prove to be useful.  I 
would expect the CPU usage to climb nonlinearly (busy-waiting) if there's a 
bottleneck in a peripheral device, such as the PCI bus.  The way the kernel 
classifies that usage may also be revealing.
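
To look at how the kernel classifies the usage, a minimal Python sketch could snapshot the aggregate "cpu" line of /proc/stat before and after each test run and compare the share of time in each column.  The helper names are invented for this example.  Forwarding and qdisc work usually runs in softirq context on Linux, so that column is a reasonable guess for where to look first, but which column would actually grow on this hardware is an assumption.

```python
# Column order of the aggregate "cpu" line in /proc/stat (first eight fields).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from the text of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Older kernels may report fewer columns; zip simply stops early.
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def usage_shares(before, after):
    """Share of elapsed CPU time spent in each column between two snapshots."""
    deltas = {k: after.get(k, 0) - before.get(k, 0) for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: d / total for k, d in deltas.items()}


def read_proc_stat(path="/proc/stat"):
    """Read the raw text; only meaningful on a Linux system."""
    with open(path) as f:
        return f.read()
```

Usage would be along the lines of: before = parse_cpu_line(read_proc_stat()), run the transfer, after = parse_cpu_line(read_proc_stat()), then print usage_shares(before, after) for each shaping mode and rate.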

> Heck, what about running HTB simply from a 1ms timer instead of from a data 
> driven timer?

That might be what's already happening.  We have to figure that out before we 
can work out a solution.
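
For a sense of scale, here is some back-of-envelope arithmetic in Python for a shaper that only releases packets on a fixed tick.  It assumes 1500-byte packets and release strictly on tick boundaries; both are assumptions for illustration and not a description of what HTB actually does.

```python
def bytes_per_tick(rate_bps, tick_s):
    """Bytes a shaper at rate_bps may release during one tick of tick_s seconds."""
    return rate_bps * tick_s / 8


def packets_per_tick(rate_bps, tick_s, packet_bytes=1500):
    """The same budget expressed in packets of an assumed size."""
    return bytes_per_tick(rate_bps, tick_s) / packet_bytes


if __name__ == "__main__":
    # Rates taken from the test list quoted above; 1ms tick as suggested.
    for mbps in (10, 20, 50, 100):
        rate = mbps * 1_000_000
        print(mbps, "Mbps:", bytes_per_tick(rate, 0.001), "bytes,",
              round(packets_per_tick(rate, 0.001), 2), "packets per tick")
```

At 100Mbps a 1ms tick corresponds to 12500 bytes, i.e. a burst of several full-size packets per tick, while at 10Mbps the budget is less than one full-size packet per tick.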

 - Jonathan Morton

_______________________________________________
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel
