Improving OCTEON II 10G Ethernet performance

2016-08-24 Thread Ed Swierk
I'm trying to migrate from the Octeon SDK to a vanilla Linux 4.4
kernel for a Cavium OCTEON II (CN6880) board running in 64-bit
little-endian mode. So far I've gotten most of the hardware features I
need working, including XAUI/RXAUI, USB, boot bus and I2C, with a
fairly small set of patches.
https://github.com/skyportsystems/linux/compare/master...octeon2

The biggest remaining hurdle is improving 10G Ethernet performance:
iperf -P 10 on the SDK kernel gets close to 10 Gbit/sec throughput,
while on my 4.4 kernel, it tops out around 1 Gbit/sec.

Comparing the octeon-ethernet driver in the SDK
(http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/tree/drivers/net/ethernet/octeon?h=apaliwal/octeon)
against the one in 4.4, the latter appears to use only a single CPU
core for the rx path. It's not clear to me whether there is a similar
issue on the tx side, or whether other bottlenecks exist.
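
For reference, my reading of the 4.4 rx setup boils down to roughly the
following. This is a from-memory sketch, not the actual code: the symbol
names (cvm_oct_napi_poll, cvm_oct_do_interrupt, cvm_oct_device,
pow_receive_group, OCTEON_IRQ_WORKQ0) are the driver's, but the function
name and body here are simplified. Everything funnels through a single
POW/SSO group, with one IRQ and one NAPI context, so all rx processing
lands on whichever CPU that one interrupt is routed to:

/* Single rx group: one NAPI context and one work-queue IRQ for everything. */
static struct napi_struct cvm_oct_napi;

static void cvm_oct_rx_initialize_sketch(struct net_device *dev_for_napi)
{
        int irq = OCTEON_IRQ_WORKQ0 + pow_receive_group;

        netif_napi_add(dev_for_napi, &cvm_oct_napi, cvm_oct_napi_poll,
                       32 /* NAPI weight */);
        napi_enable(&cvm_oct_napi);

        /* All rx work-queue interrupts go to one handler on one group. */
        if (request_irq(irq, cvm_oct_do_interrupt, 0, "Ethernet",
                        cvm_oct_device))
                panic("Could not acquire Ethernet IRQ %d\n", irq);

        disable_irq_nosync(irq);
}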

I started trying to port multi-CPU rx from the SDK octeon-ethernet
driver, but had trouble teasing out just the necessary bits without
following a maze of dependencies on unrelated functions. (Dragging
major parts of the SDK wholesale into 4.4 defeats the purpose of
switching to a vanilla kernel, and doesn't bring us closer to getting
octeon-ethernet out of staging.)

Has there been any work on the octeon-ethernet driver since this patch
set? https://www.linux-mips.org/archives/linux-mips/2015-08/msg00338.html

Any hints on what to pick out of the SDK code to improve 10G
performance would be appreciated.

--Ed


Re: [PATCH 0/9] staging: octeon: multi rx group (queue) support

2016-08-30 Thread Ed Swierk
Hi Aaro,

On Tue, Aug 30, 2016 at 11:47 AM, Aaro Koskinen wrote:
> This series implements multiple RX group support that should improve
> the networking performance on multi-core OCTEONs. Basically we register
> IRQ and NAPI for each group, and ask the HW to select the group for
> the incoming packets based on hash.
>
> Tested on EdgeRouter Lite with a simple forwarding test using two flows
> and 16 RX groups distributed between two cores - the routing throughput
> is roughly doubled.
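
As I understand the series (a rough paraphrase, not the actual patch:
the struct and parameter names oct_rx_group and receive_group_order come
from the series, but the code below is only a sketch, and the function
and array names are mine), the core idea is one IRQ plus one NAPI
context per SSO/POW group, with the poll function finding its group via
container_of():

struct oct_rx_group {
        int irq;
        int group;
        struct napi_struct napi;
};

static struct oct_rx_group rx_groups[16];

static void cvm_oct_rx_groups_init_sketch(struct net_device *dev_for_napi,
                                          int receive_group_order)
{
        int i;

        for (i = 0; i < (1 << receive_group_order); i++) {
                struct oct_rx_group *grp = &rx_groups[i];

                grp->group = i;
                grp->irq = OCTEON_IRQ_WORKQ0 + i;

                netif_napi_add(dev_for_napi, &grp->napi, cvm_oct_napi_poll,
                               32 /* NAPI weight */);
                napi_enable(&grp->napi);

                /* cvm_oct_napi_poll() recovers its group with
                 * container_of(napi, struct oct_rx_group, napi). */
                if (request_irq(grp->irq, cvm_oct_do_interrupt, 0,
                                "Ethernet", grp))
                        pr_err("octeon: request_irq(%d) failed\n", grp->irq);
        }
}

The hash-based steering of incoming packets to a particular group is done
by the hardware, configured separately in the series.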

I applied the series to my 4.4.19 tree, which involved backporting a
bunch of other patches from master, most of them trivial.

When I test it on a Cavium OCTEON II (CN6880) board, I get an immediate
crash (bus error) in the netif_receive_skb() call from cvm_oct_poll().
Replacing the rx_group argument to cvm_oct_poll() with an int group, and
instead dereferencing rx_group->group once in the caller
(cvm_oct_napi_poll()), makes the crash disappear. Apparently there's some
race in dereferencing rx_group from within cvm_oct_poll().

With this workaround in place, I can send and receive on XAUI
interfaces, but don't see any performance improvement. I'm guessing I
need to set receive_group_order > 0. But any value between 1 and 4
seems to break rx altogether. When I ping another host I see both
request and response on the wire, and the interface counters increase,
but the response doesn't make it back to ping.

Is some other configuration needed to make use of multiple rx groups?

--Ed


Re: [PATCH 0/9] staging: octeon: multi rx group (queue) support

2016-08-31 Thread Ed Swierk
Aaro Koskinen wrote:
> Oops, looks like I tested without CONFIG_NET_POLL_CONTROLLER enabled
> and that seems to be broken. Sorry.

I'm not using CONFIG_NET_POLL_CONTROLLER either; the problem is in the
normal cvm_oct_napi_poll() path.

Here's my workaround:

--- a/drivers/staging/octeon/ethernet-rx.c
+++ b/drivers/staging/octeon/ethernet-rx.c
@@ -159,7 +159,7 @@ static inline int cvm_oct_check_rcv_error(cvmx_wqe_t *work)
return 0;
 }

-static int cvm_oct_poll(struct oct_rx_group *rx_group, int budget)
+static int cvm_oct_poll(int group, int budget)
 {
const int   coreid = cvmx_get_core_num();
u64 old_group_mask;
@@ -181,13 +181,13 @@ static int cvm_oct_poll(struct oct_rx_group *rx_group, int budget)
if (OCTEON_IS_MODEL(OCTEON_CN68XX)) {
old_group_mask = cvmx_read_csr(CVMX_SSO_PPX_GRP_MSK(coreid));
cvmx_write_csr(CVMX_SSO_PPX_GRP_MSK(coreid),
-  BIT(rx_group->group));
+  BIT(group));
cvmx_read_csr(CVMX_SSO_PPX_GRP_MSK(coreid)); /* Flush */
} else {
old_group_mask = cvmx_read_csr(CVMX_POW_PP_GRP_MSKX(coreid));
cvmx_write_csr(CVMX_POW_PP_GRP_MSKX(coreid),
   (old_group_mask & ~0xFFFFull) |
-  BIT(rx_group->group));
+  BIT(group));
}

if (USE_ASYNC_IOBDMA) {
@@ -212,15 +212,15 @@ static int cvm_oct_poll(struct oct_rx_group *rx_group, int budget)
if (!work) {
if (OCTEON_IS_MODEL(OCTEON_CN68XX)) {
cvmx_write_csr(CVMX_SSO_WQ_IQ_DIS,
-  BIT(rx_group->group));
+  BIT(group));
cvmx_write_csr(CVMX_SSO_WQ_INT,
-  BIT(rx_group->group));
+  BIT(group));
} else {
union cvmx_pow_wq_int wq_int;

wq_int.u64 = 0;
-   wq_int.s.iq_dis = BIT(rx_group->group);
-   wq_int.s.wq_int = BIT(rx_group->group);
+   wq_int.s.iq_dis = BIT(group);
+   wq_int.s.wq_int = BIT(group);
cvmx_write_csr(CVMX_POW_WQ_INT, wq_int.u64);
}
break;
@@ -447,7 +447,7 @@ static int cvm_oct_napi_poll(struct napi_struct *napi, int budget)
 napi);
int rx_count;

-   rx_count = cvm_oct_poll(rx_group, budget);
+   rx_count = cvm_oct_poll(rx_group->group, budget);

if (rx_count < budget) {
/* No more work */
@@ -466,7 +466,7 @@ static int cvm_oct_napi_poll(struct napi_struct *napi, int budget)
  */
 void cvm_oct_poll_controller(struct net_device *dev)
 {
-   cvm_oct_poll(oct_rx_group, 16);
+   cvm_oct_poll(oct_rx_group->group, 16);
 }
 #endif

> Can you see multiple ethernet IRQs in /proc/interrupts and their
> counters increasing?
> 
> With receive_group_order=4 you should see 16 IRQs.

I see the 16 IRQs, and the first one does increase. But packets don't make
it to the application.

--Ed


Re: [PATCH 0/9] staging: octeon: multi rx group (queue) support

2016-08-31 Thread Ed Swierk
On 8/31/16 14:20, Aaro Koskinen wrote:
> On Wed, Aug 31, 2016 at 09:20:07AM -0700, Ed Swierk wrote:
>> Here's my workaround:
> 
> [...]
> 
>> -static int cvm_oct_poll(struct oct_rx_group *rx_group, int budget)
>> +static int cvm_oct_poll(int group, int budget)
>>  {
>>  const int   coreid = cvmx_get_core_num();
>>  u64 old_group_mask;
>> @@ -181,13 +181,13 @@ static int cvm_oct_poll(struct oct_rx_group *rx_group, int budget)
>>  if (OCTEON_IS_MODEL(OCTEON_CN68XX)) {
>>  old_group_mask = cvmx_read_csr(CVMX_SSO_PPX_GRP_MSK(coreid));
>>  cvmx_write_csr(CVMX_SSO_PPX_GRP_MSK(coreid),
>> -   BIT(rx_group->group));
>> +   BIT(group));
>> @@ -447,7 +447,7 @@ static int cvm_oct_napi_poll(struct napi_struct *napi, int budget)
>>   napi);
>>  int rx_count;
>>
>> -rx_count = cvm_oct_poll(rx_group, budget);
>> +rx_count = cvm_oct_poll(rx_group->group, budget);
> 
> I'm confused - there should be no difference?!

I can't figure out the difference either. I get a crash within the first
couple of packets, while with the workaround I can't get it to crash at
all. It always bombs in netif_receive_skb(), which isn't particularly
close to any rx_group pointer dereference.

# ping 172.16.100.253
PING 172.16.100.253 (172.16.100.253): 56 data bytes
Data bus error, epc == 803fd4ac, ra == 801943d8
Oops[#1]:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.19+ #94
task: 80863e80 ti: 8084 task.ti: 8084
$ 0   :  80126078 ef7bdef7bdef7bdf 815d3860
$ 4   : 80e045c8 81aae950 81aae950 
$ 8   : 81aae950 0038 0070 03bf
$12   : 5400 03bd  
$16   : 81aae950 81aae950 80e045c8 
$20   : 00fa 0001 5c02e0fa 
$24   : 0062 80548468  
$28   : 8084 808436d0 80feba38 801943d8
Hi: 
Lo: 05198e3760c0
epc   : 803fd4ac __list_add_rcu+0x7c/0xa0
ra: 801943d8 __lock_acquire+0xd94/0x1bf0
Status: 10008ce2  KX SX UX KERNEL EXL
Cause : 40808c1c (ExcCode 07)
PrId  : 000d910a (Cavium Octeon II)
Modules linked in:
Process swapper/0 (pid: 0, threadinfo=8084, task=80863e80, tls=)
Stack : 80863e80 808646c8 81aae950 801943d8
  00fa 808646c0  0002
   8057ab90 80864690 80870990
  0001   0017
   80193e08 0017 80864688
  0001 8057ab90 808a7d28 80007f4b7500
  80007a0b52e8 0001 807f 80007f768068
  8085fac8 8019568c  
  808a7d10 80645e60 80007f4a8600 0254
  808a7d58 8057ab90 0008 80007f7680a0
  ...
Call Trace:
[<__list_add_rcu at list_debug.c:97 (discriminator 2)>] __list_add_rcu+0x7c/0xa0
[] __lock_acquire+0xd94/0x1bf0
[] lock_acquire+0x50/0x78
[<__raw_read_lock at rwlock_api_smp.h:150 (inlined by) _raw_read_lock at spinlock.c:223>] _raw_read_lock+0x4c/0x90
[] raw_local_deliver+0x58/0x1e8
[] ip_local_deliver_finish+0x118/0x4a8
[] ip_local_deliver+0x68/0xe0
[] ip_rcv+0x398/0x478
[<__netif_receive_skb_core at dev.c:3948>] __netif_receive_skb_core+0x764/0x818
[] netif_receive_skb_internal+0x148/0x214
[] cvm_oct_napi_poll+0x790/0xa2c
[] net_rx_action+0x130/0x2e0
[] __do_softirq+0x1f0/0x318
[] irq_exit+0x64/0xcc
[] octeon_irq_ciu2+0x154/0x1c4
[] plat_irq_dispatch+0x70/0x108
[] ret_from_irq+0x0/0x4
[] __r4k_wait+0x20/0x40
[] cpu_startup_entry+0x154/0x1d0
[] start_kernel+0x538/0x554

Presumably there's some sort of race condition that my change doesn't
really fix but happens to avoid by dereferencing rx_group just once early
on?

--Ed


Re: [PATCH v2 00/11] staging: octeon: multi rx group (queue) support

2016-08-31 Thread Ed Swierk
On 8/31/16 13:57, Aaro Koskinen wrote:
> This series implements multiple RX group support that should improve
> the networking performance on multi-core OCTEONs. Basically we register
> IRQ and NAPI for each group, and ask the HW to select the group for
> the incoming packets based on hash.
> 
> Tested on EdgeRouter Lite with a simple forwarding test using two flows
> and 16 RX groups distributed between two cores - the routing throughput
> is roughly doubled.
> 
> Also tested with EBH5600 (8 cores) and EBB6800 (16 cores) by sending
> and receiving traffic in both directions using SGMII interfaces.

With this series on 4.4.19, rx works with receive_group_order > 0.
Setting receive_group_order=4, I do see 16 Ethernet interrupts. I tried
fiddling with various smp_affinity values (e.g. setting them all to the
same mask, or assigning a different one to each interrupt, or giving a
few to some and a few to others), as well as different values for
rps_cpus. 10-thread parallel iperf performance varies between 0.5 and 1.5
Gbit/sec total depending on the particular settings.

With the SDK kernel I get over 8 Gbit/sec. It seems to be achieving that
using just one interrupt (not even a separate one for tx, as far as I can
tell) pegged to CPU 0 (the default smp_affinity). I must be missing some
other major configuration tweak, perhaps specific to 10G.

Can you run a test on the EBB6800 with the interfaces in 10G mode?

--Ed