On 05/09/2018 11:48 AM, Eric Dumazet wrote:
On 05/09/2018 11:43 AM, Ben Greear wrote:
On 05/08/2018 10:10 AM, Eric Dumazet wrote:
On 05/08/2018 09:44 AM, Ben Greear wrote:
Hello,
I am trying to track down a performance regression that appears to be between
4.13
and 4.14.
I first saw the problem with a hacked version of pktgen on some ixgbe NICs.
4.13 can do
right at 10G bi-directional on two ports, and 4.14 and later can do only about
6Gbps.
I also tried with user-space UDP traffic on a stock kernel, and I can get about
3.2Gbps combined tx+rx
on 4.14 and about 4.4Gbps on 4.13.
Attempting to bisect seems to be triggering a weirdness in git, and also lots
of commits
crash or do not bring up networking, which makes the bisect difficult.
Looking at perf top, it would appear that some lock is probably to blame.
perf record -a -g -e cycles:pp sleep 5
perf report
Then you'll be able to tell us which lock (or call graph) is killing your perf.
I seem to be chasing multiple issues. For 4.13, at least part of my problem
was that LOCKDEP was enabled,
during my bisect, though it does NOT appear enabled in 4.16. I think maybe
CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING
in 4.16, or something like that? My 4.16 .config does have
CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it:
[greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config
CONFIG_LOCKDEP_SUPPORT=y
For 4.16, I am disabling RETRAMPOLINE...are there any other such things I need
to disable to keep from getting a performance hit from the spectre-related bug
fixes? At this point, I do not care about the security implications.
greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config
# CONFIG_RETPOLINE is not set
Thanks,
Ben
No idea really, you mention a 4.13 -> 4.14 regression and jump then to 4.16 :/
I initially saw the problem in 4.16, then bisected, and 4.14 still showed the
issue.
4.13 works, but only when I use a .config I originally built for 4.13, not the
4.16 .config
that I ended up using with the bisect (make oldconfig, accept all defaults). I
originally
configured 4.16 with a .config that had lockdep enabled, then manually tried to
disable it
through 'make xconfig'. I think that must leave "CONFIG_LOCKDEP=y" in the
.config, which
screws up older builds during bisect, perhaps?
Before doing a (painful) dissection, the perf output would immediately tell you
if
something is really wrong on your .config.
I didn't realize lockdep might be an issue at the time, but here is a 'bad' run
from
a 4.13+ (plus pktgen hacks). I guess lockdep is why this runs slowly, but I
see no obvious
proof of that in the output:
4.13+, patched pktgen, 6Gbps throughput, on commit
906dde0f355bd97c080c215811ae7db1137c4af8
Samples: 26K of event 'cycles:pp', Event count (approx.): 20119166736
Children Self Command Shared Object Symbol
+ 87.97% 0.00% kpktgend_1 [kernel.kallsyms] [k]
ret_from_fork
+ 87.97% 0.00% kpktgend_1 [kernel.kallsyms] [k] kthread
+ 86.89% 5.42% kpktgend_1 [kernel.kallsyms] [k]
pktgen_thread_worker
+ 33.75% 0.18% kpktgend_1 [kernel.kallsyms] [k]
getnstimeofday64
+ 32.77% 4.47% kpktgend_1 [kernel.kallsyms] [k]
__getnstimeofday64
+ 24.60% 10.91% kpktgend_1 [kernel.kallsyms] [k]
lock_acquire
+ 23.59% 0.03% kpktgend_1 [kernel.kallsyms] [k]
__do_softirq
+ 23.55% 0.07% kpktgend_1 [kernel.kallsyms] [k]
net_rx_action
+ 22.29% 0.47% kpktgend_1 [kernel.kallsyms] [k]
getRelativeCurNs
+ 21.33% 1.71% kpktgend_1 [kernel.kallsyms] [k]
ixgbe_poll
+ 15.79% 0.02% kpktgend_1 [kernel.kallsyms] [k]
ret_from_intr
+ 15.78% 0.01% kpktgend_1 [kernel.kallsyms] [k] do_IRQ
+ 15.34% 0.01% kpktgend_1 [kernel.kallsyms] [k] irq_exit
+ 13.95% 10.00% kpktgend_1 [kernel.kallsyms] [k]
ip_send_check
+ 13.80% 13.80% kpktgend_1 [kernel.kallsyms] [k]
__lock_acquire.isra.31
+ 12.98% 0.53% kpktgend_1 [kernel.kallsyms] [k]
pktgen_finalize_skb
+ 12.31% 0.20% kpktgend_1 [kernel.kallsyms] [k]
timestamp_skb.isra.24
+ 11.68% 0.13% kpktgend_1 [kernel.kallsyms] [k]
napi_gro_receive
+ 11.36% 0.25% kpktgend_1 [kernel.kallsyms] [k]
netif_receive_skb_internal
+ 10.93% 0.00% swapper [kernel.kallsyms] [k]
verify_cpu
+ 10.93% 0.00% swapper [kernel.kallsyms] [k]
cpu_startup_entry
+ 10.92% 0.02% swapper [kernel.kallsyms] [k] do_idle
+ 10.71% 0.00% swapper [kernel.kallsyms] [k]
cpuidle_enter
+ 10.71% 0.00% swapper [kernel.kallsyms] [k]
call_cpuidle
+ 10.66% 0.06% swapper [kernel.kallsyms] [k]
cpuidle_enter_state
+ 9.78% 0.00% swapper [kernel.kallsyms] [k]
x86_64_start_kernel
+ 9.78% 0.00% swapper [kernel.kallsyms] [k]
x86_64_start_reservations
+ 9.78% 0.00% swapper [kernel.kallsyms] [k]
start_kernel
+ 9.78% 0.00% swapper [kernel.kallsyms] [k]
rest_init
+ 9.25% 0.00% swapper [kernel.kallsyms] [k]
ret_from_intr
+ 9.25% 0.00% swapper [kernel.kallsyms] [k] do_IRQ
+ 9.24% 0.30% kpktgend_1 [kernel.kallsyms] [k]
pktgen_alloc_skb
+ 9.14% 0.00% swapper [kernel.kallsyms] [k] irq_exit
+ 8.89% 0.00% swapper [kernel.kallsyms] [k]
__do_softirq
+ 8.82% 0.01% swapper [kernel.kallsyms] [k]
net_rx_action
+ 8.80% 0.47% kpktgend_1 [kernel.kallsyms] [k]
__local_bh_enable_ip
+ 8.67% 6.87% kpktgend_1 [kernel.kallsyms] [k]
lock_release
+ 8.57% 1.10% kpktgend_1 [kernel.kallsyms] [k]
__netdev_alloc_skb
+ 8.28% 0.00% kpktgend_1 [kernel.kallsyms] [k]
do_softirq.part.17
+ 8.28% 0.00% kpktgend_1 [kernel.kallsyms] [k]
do_softirq_own_stack
+ 8.01% 0.74% swapper [kernel.kallsyms] [k]
ixgbe_poll
+ 6.85% 2.28% kpktgend_1 [kernel.kallsyms] [k]
__build_skb
+ 4.89% 0.07% kpktgend_1 [kernel.kallsyms] [k]
__netif_receive_skb
+ 4.87% 0.75% kpktgend_1 [kernel.kallsyms] [k]
__netif_receive_skb_core
+ 4.66% 0.16% kpktgend_1 [kernel.kallsyms] [k]
_raw_spin_lock
+ 4.37% 0.06% swapper [kernel.kallsyms] [k]
napi_gro_receive
+ 4.23% 0.10% swapper [kernel.kallsyms] [k]
netif_receive_skb_internal
Here is a 'bad' run of 4.16.0 + my full out-of-tree patches, including my
normal pktgen changes
(pktgen code should be nearly identical between the run above and the run here)
Samples: 26K of event 'cycles:pp', Event count (approx.): 19888249841
Children Self Command Shared Object Symbol
+ 74.24% 5.38% kpktgend_1 [kernel.kallsyms] [k]
pktgen_thread_worker
+ 73.89% 0.00% kpktgend_1 [kernel.kallsyms] [k]
ret_from_fork
+ 73.89% 0.00% kpktgend_1 [kernel.kallsyms] [k] kthread
+ 27.31% 0.25% kpktgend_1 [kernel.kallsyms] [k]
getnstimeofday64
+ 26.15% 4.24% kpktgend_1 [kernel.kallsyms] [k]
__getnstimeofday64
+ 22.64% 0.02% kpktgend_1 [kernel.kallsyms] [k]
__softirqentry_text_s
+ 22.62% 0.06% kpktgend_1 [kernel.kallsyms] [k]
net_rx_action
+ 22.59% 2.04% kpktgend_1 [kernel.kallsyms] [k]
ixgbe_poll
+ 21.49% 7.92% kpktgend_1 [kernel.kallsyms] [k]
lock_acquire
+ 17.78% 17.78% kpktgend_1 [kernel.kallsyms] [k]
__lock_acquire.isra.3
+ 15.95% 0.39% kpktgend_1 [kernel.kallsyms] [k]
getRelativeCurNs
+ 14.49% 0.02% kpktgend_1 [kernel.kallsyms] [k]
ret_from_intr
+ 14.48% 0.00% kpktgend_1 [kernel.kallsyms] [k] do_IRQ
+ 14.35% 0.01% kpktgend_1 [kernel.kallsyms] [k]
irq_exit
+ 13.46% 9.90% kpktgend_1 [kernel.kallsyms] [k]
ip_send_check
+ 11.36% 0.18% kpktgend_1 [kernel.kallsyms] [k]
napi_gro_receive
+ 11.25% 0.17% kpktgend_1 [kernel.kallsyms] [k]
netif_receive_skb_int
+ 10.37% 0.16% kpktgend_1 [kernel.kallsyms] [k]
timestamp_skb.isra.25
+ 8.85% 0.45% kpktgend_1 [kernel.kallsyms] [k]
__local_bh_enable_ip
+ 8.61% 0.02% swapper [kernel.kallsyms] [k] do_idle
+ 8.60% 0.00% swapper [kernel.kallsyms] [k]
secondary_startup_64
+ 8.60% 0.00% swapper [kernel.kallsyms] [k]
cpu_startup_entry
+ 8.37% 0.05% swapper [kernel.kallsyms] [k]
cpuidle_enter_state
+ 8.35% 7.57% kpktgend_1 [kernel.kallsyms] [k]
lock_release
+ 8.32% 0.00% kpktgend_1 [kernel.kallsyms] [k]
do_softirq.part.16
+ 8.31% 0.00% kpktgend_1 [kernel.kallsyms] [k]
do_softirq_own_stack
+ 7.16% 0.36% kpktgend_1 [kernel.kallsyms] [k]
pktgen_alloc_skb
+ 7.09% 1.14% kpktgend_1 [kernel.kallsyms] [k]
__netdev_alloc_skb
+ 6.87% 0.00% swapper [kernel.kallsyms] [k]
start_kernel
+ 6.55% 0.00% swapper [kernel.kallsyms] [k]
ret_from_intr
+ 6.55% 0.01% swapper [kernel.kallsyms] [k] do_IRQ
+ 6.51% 0.01% swapper [kernel.kallsyms] [k]
irq_exit
+ 6.34% 0.02% swapper [kernel.kallsyms] [k]
__softirqentry_text_s
+ 6.25% 0.01% swapper [kernel.kallsyms] [k]
net_rx_action
+ 6.06% 0.45% swapper [kernel.kallsyms] [k]
ixgbe_poll
+ 5.35% 2.11% kpktgend_1 [kernel.kallsyms] [k]
__build_skb
I tried testing with stock kernels and UDP user-space traffic, but results
varied widely
from run to run, so it was useless for bisecting.
I have irqbalance disabled on this system in order to pin pktgen process and
ixgbe IRQs
to be efficient, so maybe that is why user-space UDP traffic works so
unreliably. My hacked pktgen
reliably shows good on my good 4.13 kernel and reliably bad on higher kernels I
have compiled
so far, so I decided it was the only useful way to bisect this problem.
I will now try 4.14 and higher again with lockdep definitely disabled.
Thanks,
Ben
--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc http://www.candelatech.com