On Tue, Mar 27, 2018 at 2:37 AM, Jesper Dangaard Brouer <bro...@redhat.com> wrote:
> On Mon, 26 Mar 2018 14:58:02 -0700
> William Tu <u9012...@gmail.com> wrote:
>
>> > Again high count for NMI ?!?
>> >
>> > Maybe you just forgot to tell perf that you want it to decode the
>> > bpf_prog correctly?
>> >
>> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
>> >
>> > Enable via:
>> >   $ sysctl net/core/bpf_jit_kallsyms=1
>> >
>> > And use perf report (while BPF is STILL LOADED):
>> >
>> >   $ perf report --kallsyms=/proc/kallsyms
>> >
>> > E.g. for emailing this you can use this command:
>> >
>> >   $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms \
>> >       --no-children --stdio -g none | head -n 40
>> >
>>
>> Thanks, I followed the steps. The result of l2fwd:
>>
>> # Total Lost Samples: 119
>> #
>> # Samples: 2K of event 'cycles:ppp'
>> # Event count (approx.): 25675705627
>> #
>> # Overhead  CPU  Command  Shared Object     Symbol
>> # ........  ...  .......  ................  ..............................
>> #
>>     10.48%  013  xdpsock  xdpsock           [.] main
>>      9.77%  013  xdpsock  [kernel.vmlinux]  [k] clflush_cache_range
>>      8.45%  013  xdpsock  [kernel.vmlinux]  [k] nmi
>>      8.07%  013  xdpsock  [kernel.vmlinux]  [k] xsk_sendmsg
>>      7.81%  013  xdpsock  [kernel.vmlinux]  [k] __domain_mapping
>>      4.95%  013  xdpsock  [kernel.vmlinux]  [k] ixgbe_xmit_frame_ring
>>      4.66%  013  xdpsock  [kernel.vmlinux]  [k] skb_store_bits
>>      4.39%  013  xdpsock  [kernel.vmlinux]  [k] syscall_return_via_sysret
>>      3.93%  013  xdpsock  [kernel.vmlinux]  [k] pfn_to_dma_pte
>>      2.62%  013  xdpsock  [kernel.vmlinux]  [k] __intel_map_single
>>      2.53%  013  xdpsock  [kernel.vmlinux]  [k] __alloc_skb
>>      2.36%  013  xdpsock  [kernel.vmlinux]  [k] iommu_no_mapping
>>      2.21%  013  xdpsock  [kernel.vmlinux]  [k] alloc_skb_with_frags
>>      2.07%  013  xdpsock  [kernel.vmlinux]  [k] skb_set_owner_w
>>      1.98%  013  xdpsock  [kernel.vmlinux]  [k] __kmalloc_node_track_caller
>>      1.94%  013  xdpsock  [kernel.vmlinux]  [k] ksize
>>      1.84%  013  xdpsock  [kernel.vmlinux]  [k] validate_xmit_skb_list
>>      1.62%  013  xdpsock  [kernel.vmlinux]  [k] kmem_cache_alloc_node
>>      1.48%  013  xdpsock  [kernel.vmlinux]  [k] __kmalloc_reserve.isra.37
>>      1.21%  013  xdpsock  xdpsock           [.] xq_enq
>>      1.08%  013  xdpsock  [kernel.vmlinux]  [k] intel_alloc_iova
>>
>
> You did use net/core/bpf_jit_kallsyms=1 and the correct perf command for
> decoding bpf_prog symbols, so the perf top#3 'nmi' is likely a real NMI
> call... which looks wrong.
>

Thanks, you're right. Let me dig more into this NMI behavior.
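As a first check (just a rough sketch of what I plan to try; the xdpsock
invocation below is a placeholder and the exact flags/interface depend on
the sample version and setup), I can watch whether the NMI count in
/proc/interrupts actually climbs during a run and whether the NMI
watchdog is enabled:

  $ cat /proc/sys/kernel/nmi_watchdog   # 1 means watchdog NMIs are being generated
  $ grep NMI /proc/interrupts           # per-CPU NMI counts before the run
  $ timeout 10 ./xdpsock -i <iface> -l  # placeholder l2fwd run
  $ grep NMI /proc/interrupts           # compare the counts after the run

If the counts barely move, the nmi samples in the report are more likely
a symbol-attribution artifact than real NMI load.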
>
>> And l2fwd under "perf stat" looks OK to me. There are few context
>> switches, the CPU is fully utilized, and 1.17 insn per cycle seems OK.
>>
>>  Performance counter stats for 'CPU(s) 6':
>>
>>      10000.787420      cpu-clock (msec)          #    1.000 CPUs utilized
>>                24      context-switches          #    0.002 K/sec
>>                 0      cpu-migrations            #    0.000 K/sec
>>                 0      page-faults               #    0.000 K/sec
>>    22,361,333,647      cycles                    #    2.236 GHz
>>    13,458,442,838      stalled-cycles-frontend   #   60.19% frontend cycles idle
>>    26,251,003,067      instructions              #    1.17  insn per cycle
>>                                                  #    0.51  stalled cycles per insn
>>     4,938,921,868      branches                  #  493.853 M/sec
>>         7,591,739      branch-misses             #    0.15% of all branches
>>
>>      10.000835769 seconds time elapsed
>
> This perf stat also indicates something is wrong.
>
> The 1.17 insn per cycle is NOT okay, it is too low (compared to what I
> usually see, e.g. 2.36 insn per cycle).
>
> It clearly says you have 'stalled-cycles-frontend' and '60.19% frontend
> cycles idle'. This means your CPU has an issue/bottleneck fetching
> instructions. Explained by Andi Kleen here [1]
>
> [1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
>

Thanks for the link! It's definitely weird that my frontend (fetch and
decode) stall rate is so high. I assume the xdpsock code is small and
should fit entirely in the icache. However, another perf stat run on
xdpsock l2fwd shows:

    13,720,109,581      stalled-cycles-frontend   #   60.01% frontend cycles idle      (23.82%)
   <not supported>      stalled-cycles-backend
         7,994,837      branch-misses             #    0.16% of all branches           (23.80%)
       996,874,424      bus-cycles                #   99.679 M/sec                     (23.80%)
    18,942,220,445      ref-cycles                # 1894.067 M/sec                     (28.56%)
       100,983,226      LLC-loads                 #   10.097 M/sec                     (23.80%)
         4,897,089      LLC-load-misses           #    4.85% of all LL-cache hits      (23.80%)
        66,659,889      LLC-stores                #    6.665 M/sec                     (9.52%)
             8,373      LLC-store-misses          #    0.837 K/sec                     (9.52%)
       158,178,410      LLC-prefetches            #   15.817 M/sec                     (9.52%)
         3,011,180      LLC-prefetch-misses       #    0.301 M/sec                     (9.52%)
     8,190,383,109      dTLB-loads                #  818.971 M/sec                     (9.52%)
        20,432,204      dTLB-load-misses          #    0.25% of all dTLB cache hits    (9.52%)
     3,729,504,674      dTLB-stores               #  372.920 M/sec                     (9.52%)
           992,231      dTLB-store-misses         #    0.099 M/sec                     (9.52%)
   <not supported>      dTLB-prefetches
   <not supported>      dTLB-prefetch-misses
            11,619      iTLB-loads                #    0.001 M/sec                     (9.52%)
         1,874,756      iTLB-load-misses          # 16135.26% of all iTLB cache hits   (14.28%)

The iTLB-load-misses count is extremely high, and that is probably the
cause of the frontend stalls. Do you know of any way to improve the
iTLB hit rate?

Thanks
William
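P.S. To narrow this down, my plan (a rough sketch; these are the generic
perf event names and their support/exact counts vary by CPU and perf
version) is to rerun perf stat on the same CPU with only the
frontend-related events, so the iTLB counters are not multiplexed as
heavily as in the run above:

  $ perf stat -C 6 -e cycles,instructions,stalled-cycles-frontend,\
    iTLB-loads,iTLB-load-misses,L1-icache-load-misses sleep 10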
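I will also try Andi's toplev from the pmu-tools link above to see which
frontend category dominates (again just a sketch; I still need to check
the toplev options for restricting the measurement to one CPU):

  $ toplev.py -l2 sleep 10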