Improving OCTEON II 10G Ethernet performance
I'm trying to migrate from the Octeon SDK to a vanilla Linux 4.4 kernel for a Cavium OCTEON II (CN6880) board running in 64-bit little-endian mode. So far I've gotten most of the hardware features I need working, including XAUI/RXAUI, USB, boot bus and I2C, with a fairly small set of patches:

https://github.com/skyportsystems/linux/compare/master...octeon2

The biggest remaining hurdle is improving 10G Ethernet performance: iperf -P 10 on the SDK kernel gets close to 10 Gbit/sec throughput, while on my 4.4 kernel it tops out around 1 Gbit/sec.

Comparing the octeon-ethernet driver in the SDK (http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/tree/drivers/net/ethernet/octeon?h=apaliwal/octeon) against the one in 4.4, the latter appears to utilize only a single CPU core for the rx path. It's not clear to me if there is a similar issue on the tx side, or other bottlenecks.

I started trying to port multi-CPU rx from the SDK octeon-ethernet driver, but had trouble teasing out just the necessary bits without following a maze of dependencies on unrelated functions. (Dragging major parts of the SDK wholesale into 4.4 defeats the purpose of switching to a vanilla kernel, and doesn't bring us closer to getting octeon-ethernet out of staging.)

Has there been any work on the octeon-ethernet driver since this patch set?

https://www.linux-mips.org/archives/linux-mips/2015-08/msg00338.html

Any hints on what to pick out of the SDK code to improve 10G performance would be appreciated.

--Ed
Re: [PATCH 0/9] staging: octeon: multi rx group (queue) support
Hi Aaro,

On Tue, Aug 30, 2016 at 11:47 AM, Aaro Koskinen wrote:
> This series implements multiple RX group support that should improve
> the networking performance on multi-core OCTEONs. Basically we register
> IRQ and NAPI for each group, and ask the HW to select the group for
> the incoming packets based on hash.
>
> Tested on EdgeRouter Lite with a simple forwarding test using two flows
> and 16 RX groups distributed between two cores - the routing throughput
> is roughly doubled.

I applied the series to my 4.4.19 tree, which involved backporting a bunch of other patches from master, most of them trivial.

When I test it on a Cavium Octeon 2 (CN6880) board, I get an immediate crash (bus error) in the netif_receive_skb() call from cvm_oct_poll(). Replacing the rx_group argument to cvm_oct_poll() with int group, and dereferencing rx_group->group in the caller (cvm_oct_napi_poll()) instead, makes the crash disappear. Apparently there's some race in dereferencing rx_group from within cvm_oct_poll().

With this workaround in place, I can send and receive on XAUI interfaces, but don't see any performance improvement. I'm guessing I need to set receive_group_order > 0. But any value between 1 and 4 seems to break rx altogether: when I ping another host I see both request and response on the wire, and the interface counters increase, but the response doesn't make it back to ping.

Is some other configuration needed to make use of multiple rx groups?

--Ed
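For context, the per-group mechanism Aaro describes above (one IRQ and one NAPI context per RX group, with the hardware hashing flows to groups) boils down to roughly the following. This is a minimal sketch only, not the actual patch: the oct_rx_group fields other than group and napi, the IRQ numbering, the NAPI weight, the IRQ name and the helper names are assumptions for illustration, and cvm_oct_napi_poll() stands for the driver's existing poll function.

/* Sketch of per-group RX setup; not the actual patch code. */
#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct oct_rx_group {
        int group;              /* SSO/POW group this context services */
        int irq;                /* interrupt line for that group (assumed mapping) */
        struct napi_struct napi;
};

/* Forward declaration; the real driver already defines this poll function. */
static int cvm_oct_napi_poll(struct napi_struct *napi, int budget);

static irqreturn_t oct_rx_group_irq(int irq, void *data)
{
        struct oct_rx_group *rx_group = data;

        /* Hand the group over to NAPI polling; the real handler also
         * masks the group's work-queue interrupt first. */
        napi_schedule(&rx_group->napi);
        return IRQ_HANDLED;
}

static int oct_rx_group_init(struct net_device *dev,
                             struct oct_rx_group *rx_group,
                             int group, int irq_base)
{
        rx_group->group = group;
        rx_group->irq = irq_base + group;       /* assumption: one IRQ per group */

        netif_napi_add(dev, &rx_group->napi, cvm_oct_napi_poll, 64);
        napi_enable(&rx_group->napi);

        return request_irq(rx_group->irq, oct_rx_group_irq, 0,
                           "octeon-ethernet-rx", rx_group);
}

With one IRQ/NAPI pair per group and the hardware spreading flows across groups by hash, the rx path can run on several cores instead of being pinned to one.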
Re: [PATCH 0/9] staging: octeon: multi rx group (queue) support
Aaro Koskinen wrote:
> Oops, looks like I tested without CONFIG_NET_POLL_CONTROLLER enabled
> and that seems to be broken. Sorry.

I'm not using CONFIG_NET_POLL_CONTROLLER either; the problem is in the normal cvm_oct_napi_poll() path. Here's my workaround:

--- a/drivers/staging/octeon/ethernet-rx.c
+++ b/drivers/staging/octeon/ethernet-rx.c
@@ -159,7 +159,7 @@ static inline int cvm_oct_check_rcv_error(cvmx_wqe_t *work)
 	return 0;
 }
 
-static int cvm_oct_poll(struct oct_rx_group *rx_group, int budget)
+static int cvm_oct_poll(int group, int budget)
 {
 	const int coreid = cvmx_get_core_num();
 	u64 old_group_mask;
@@ -181,13 +181,13 @@ static int cvm_oct_poll(struct oct_rx_group *rx_group, int budget)
 	if (OCTEON_IS_MODEL(OCTEON_CN68XX)) {
 		old_group_mask = cvmx_read_csr(CVMX_SSO_PPX_GRP_MSK(coreid));
 		cvmx_write_csr(CVMX_SSO_PPX_GRP_MSK(coreid),
-			       BIT(rx_group->group));
+			       BIT(group));
 		cvmx_read_csr(CVMX_SSO_PPX_GRP_MSK(coreid)); /* Flush */
 	} else {
 		old_group_mask = cvmx_read_csr(CVMX_POW_PP_GRP_MSKX(coreid));
 		cvmx_write_csr(CVMX_POW_PP_GRP_MSKX(coreid),
 			       (old_group_mask & ~0xFFFFull) |
-			       BIT(rx_group->group));
+			       BIT(group));
 	}
 
 	if (USE_ASYNC_IOBDMA) {
@@ -212,15 +212,15 @@ static int cvm_oct_poll(struct oct_rx_group *rx_group, int budget)
 		if (!work) {
 			if (OCTEON_IS_MODEL(OCTEON_CN68XX)) {
 				cvmx_write_csr(CVMX_SSO_WQ_IQ_DIS,
-					       BIT(rx_group->group));
+					       BIT(group));
 				cvmx_write_csr(CVMX_SSO_WQ_INT,
-					       BIT(rx_group->group));
+					       BIT(group));
 			} else {
 				union cvmx_pow_wq_int wq_int;
 
 				wq_int.u64 = 0;
-				wq_int.s.iq_dis = BIT(rx_group->group);
-				wq_int.s.wq_int = BIT(rx_group->group);
+				wq_int.s.iq_dis = BIT(group);
+				wq_int.s.wq_int = BIT(group);
 				cvmx_write_csr(CVMX_POW_WQ_INT, wq_int.u64);
 			}
 			break;
@@ -447,7 +447,7 @@ static int cvm_oct_napi_poll(struct napi_struct *napi, int budget)
 						     napi);
 	int rx_count;
 
-	rx_count = cvm_oct_poll(rx_group, budget);
+	rx_count = cvm_oct_poll(rx_group->group, budget);
 
 	if (rx_count < budget) {
 		/* No more work */
@@ -466,7 +466,7 @@ static int cvm_oct_napi_poll(struct napi_struct *napi, int budget)
  */
 void cvm_oct_poll_controller(struct net_device *dev)
 {
-	cvm_oct_poll(oct_rx_group, 16);
+	cvm_oct_poll(oct_rx_group->group, 16);
 }
 #endif

> Can you see multiple ethernet IRQs in /proc/interrupts and their
> counters increasing?
>
> With receive_group_order=4 you should see 16 IRQs.

I see the 16 IRQs, and the first one does increase. But packets don't make it to the application.

--Ed
Re: [PATCH 0/9] staging: octeon: multi rx group (queue) support
On 8/31/16 14:20, Aaro Koskinen wrote:
> On Wed, Aug 31, 2016 at 09:20:07AM -0700, Ed Swierk wrote:
>> Here's my workaround:
>
> [...]
>
>> -static int cvm_oct_poll(struct oct_rx_group *rx_group, int budget)
>> +static int cvm_oct_poll(int group, int budget)
>> {
>> 	const int coreid = cvmx_get_core_num();
>> 	u64 old_group_mask;
>> @@ -181,13 +181,13 @@ static int cvm_oct_poll(struct oct_rx_group *rx_group, int budget)
>> 	if (OCTEON_IS_MODEL(OCTEON_CN68XX)) {
>> 		old_group_mask = cvmx_read_csr(CVMX_SSO_PPX_GRP_MSK(coreid));
>> 		cvmx_write_csr(CVMX_SSO_PPX_GRP_MSK(coreid),
>> -			       BIT(rx_group->group));
>> +			       BIT(group));
>> @@ -447,7 +447,7 @@ static int cvm_oct_napi_poll(struct napi_struct *napi, int budget)
>> 						     napi);
>> 	int rx_count;
>>
>> -	rx_count = cvm_oct_poll(rx_group, budget);
>> +	rx_count = cvm_oct_poll(rx_group->group, budget);
>
> I'm confused - there should be no difference?!

I can't figure out the difference either. I get a crash within the first couple packets, while with the workaround I can't get it to crash at all. It always bombs in netif_receive_skb(), which isn't very close to any rx_group pointer dereference.

# ping 172.16.100.253
PING 172.16.100.253 (172.16.100.253): 56 data bytes
Data bus error, epc == 803fd4ac, ra == 801943d8
Oops[#1]:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.19+ #94
task: 80863e80 ti: 8084 task.ti: 8084
$ 0 : 80126078 ef7bdef7bdef7bdf 815d3860
$ 4 : 80e045c8 81aae950 81aae950
$ 8 : 81aae950 0038 0070 03bf
$12 : 5400 03bd
$16 : 81aae950 81aae950 80e045c8
$20 : 00fa 0001 5c02e0fa
$24 : 0062 80548468
$28 : 8084 808436d0 80feba38 801943d8
Hi  :
Lo  : 05198e3760c0
epc : 803fd4ac __list_add_rcu+0x7c/0xa0
ra  : 801943d8 __lock_acquire+0xd94/0x1bf0
Status: 10008ce2 KX SX UX KERNEL EXL
Cause : 40808c1c (ExcCode 07)
PrId  : 000d910a (Cavium Octeon II)
Modules linked in:
Process swapper/0 (pid: 0, threadinfo=8084, task=80863e80, tls=)
Stack : 80863e80 808646c8 81aae950 801943d8 00fa 808646c0 0002 8057ab90
        80864690 80870990 0001 0017 80193e08 0017 80864688 0001
        8057ab90 808a7d28 80007f4b7500 80007a0b52e8 0001 807f 80007f768068 8085fac8
        8019568c 808a7d10 80645e60 80007f4a8600 0254 808a7d58 8057ab90 0008
        80007f7680a0
        ...
Call Trace:
[<__list_add_rcu at list_debug.c:97 (discriminator 2)>] __list_add_rcu+0x7c/0xa0
[] __lock_acquire+0xd94/0x1bf0
[] lock_acquire+0x50/0x78
[<__raw_read_lock at rwlock_api_smp.h:150 (inlined by) _raw_read_lock at spinlock.c:223>] _raw_read_lock+0x4c/0x90
[] raw_local_deliver+0x58/0x1e8
[] ip_local_deliver_finish+0x118/0x4a8
[] ip_local_deliver+0x68/0xe0
[] ip_rcv+0x398/0x478
[<__netif_receive_skb_core at dev.c:3948>] __netif_receive_skb_core+0x764/0x818
[] netif_receive_skb_internal+0x148/0x214
[] cvm_oct_napi_poll+0x790/0xa2c
[] net_rx_action+0x130/0x2e0
[] __do_softirq+0x1f0/0x318
[] irq_exit+0x64/0xcc
[] octeon_irq_ciu2+0x154/0x1c4
[] plat_irq_dispatch+0x70/0x108
[] ret_from_irq+0x0/0x4
[] __r4k_wait+0x20/0x40
[] cpu_startup_entry+0x154/0x1d0
[] start_kernel+0x538/0x554

Presumably there's some sort of race condition that my change doesn't really fix but happens to avoid by dereferencing rx_group just once early on?

--Ed
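To make the "dereference once" idea concrete, the shape of the NAPI handler with the workaround applied looks roughly like this. This is a sketch only, assuming the driver's existing struct oct_rx_group and the int-taking cvm_oct_poll() from the diff above; the use of READ_ONCE() and the comments about the suspected race are illustrative, not a diagnosis of the actual bug.

static int cvm_oct_napi_poll(struct napi_struct *napi, int budget)
{
        struct oct_rx_group *rx_group = container_of(napi, struct oct_rx_group,
                                                     napi);
        /* Snapshot the group number once; cvm_oct_poll() then works on a
         * plain int and never touches the rx_group pointer again while
         * packets are being delivered to the stack. */
        int group = READ_ONCE(rx_group->group);
        int rx_count;

        rx_count = cvm_oct_poll(group, budget);

        if (rx_count < budget) {
                /* No more work: leave polling mode and let the group's
                 * interrupt be re-enabled, as the real driver does. */
                napi_complete(napi);
        }
        return rx_count;
}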
Re: [PATCH v2 00/11] staging: octeon: multi rx group (queue) support
On 8/31/16 13:57, Aaro Koskinen wrote:
> This series implements multiple RX group support that should improve
> the networking performance on multi-core OCTEONs. Basically we register
> IRQ and NAPI for each group, and ask the HW to select the group for
> the incoming packets based on hash.
>
> Tested on EdgeRouter Lite with a simple forwarding test using two flows
> and 16 RX groups distributed between two cores - the routing throughput
> is roughly doubled.
>
> Also tested with EBH5600 (8 cores) and EBB6800 (16 cores) by sending
> and receiving traffic in both directions using SGMII interfaces.

With this series on 4.4.19, rx works with receive_group_order > 0. Setting receive_group_order=4, I do see 16 Ethernet interrupts.

I tried fiddling with various smp_affinity values (e.g. setting them all to the same mask, or assigning a different one to each interrupt, or giving a few to some and a few to others), as well as different values for rps_cpus. 10-thread parallel iperf performance varies between 0.5 and 1.5 Gbit/sec total depending on the particular settings.

With the SDK kernel I get over 8 Gbit/sec. It seems to be achieving that using just one interrupt (not even a separate one for tx, as far as I can tell) pegged to CPU 0 (the default smp_affinity).

I must be missing some other major configuration tweak, perhaps specific to 10G. Can you run a test on the EBB6800 with the interfaces in 10G mode?

--Ed
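For readers trying to reproduce the numbers above, the relationship between receive_group_order, the interrupt count, and the hash-based spreading the series relies on can be sketched as follows. This is illustrative user-space C, not driver code: the hash function is a stand-in for the SSO/PIP hardware's flow tagging, and the addresses and ports are made up.

#include <stdio.h>

/* Stand-in for the hardware's flow hash/tag computation. */
static unsigned int flow_hash(unsigned int saddr, unsigned int daddr,
                              unsigned int sport, unsigned int dport)
{
        return saddr ^ daddr ^ (sport << 16) ^ dport;
}

int main(void)
{
        unsigned int receive_group_order = 4;   /* module parameter value used above */
        unsigned int num_groups = 1u << receive_group_order;   /* 16 groups -> 16 IRQs */

        /* Two different flows usually hash to different groups, so their
         * packets can be polled by different cores. */
        printf("flow A -> group %u of %u\n",
               flow_hash(0xc0a80001, 0xc0a80002, 5001, 40000) & (num_groups - 1),
               num_groups);
        printf("flow B -> group %u of %u\n",
               flow_hash(0xc0a80003, 0xc0a80002, 5001, 40001) & (num_groups - 1),
               num_groups);
        return 0;
}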