On Tue, 27 Mar 2018 17:06:50 -0700
William Tu <u9012...@gmail.com> wrote:

> On Tue, Mar 27, 2018 at 2:37 AM, Jesper Dangaard Brouer
> <bro...@redhat.com> wrote:
> > On Mon, 26 Mar 2018 14:58:02 -0700
> > William Tu <u9012...@gmail.com> wrote:
> >  
> >> > Again high count for NMI ?!?
> >> >
> >> > Maybe you just forgot to tell perf that you want it to decode the
> >> > bpf_prog correctly?
> >> >
> >> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
> >> >
> >> > Enable via:
> >> >  $ sysctl net/core/bpf_jit_kallsyms=1
> >> >
> >> > And use perf report (while BPF is STILL LOADED):
> >> >
> >> >  $ perf report --kallsyms=/proc/kallsyms
> >> >
> >> > E.g. for emailing this you can use this command:
> >> >
> >> >  $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms 
> >> > --no-children --stdio -g none | head -n 40
> >> >  
> >>
> >> Thanks, I followed the steps; here is the result for l2fwd:
> >> # Total Lost Samples: 119
> >> #
> >> # Samples: 2K of event 'cycles:ppp'
> >> # Event count (approx.): 25675705627
> >> #
> >> # Overhead  CPU  Command  Shared Object       Symbol
> >> # ........  ...  .......  ..................  ..................................
> >> #
> >>     10.48%  013  xdpsock  xdpsock             [.] main
> >>      9.77%  013  xdpsock  [kernel.vmlinux]    [k] clflush_cache_range
> >>      8.45%  013  xdpsock  [kernel.vmlinux]    [k] nmi
> >>      8.07%  013  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
> >>      7.81%  013  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
> >>      4.95%  013  xdpsock  [kernel.vmlinux]    [k] ixgbe_xmit_frame_ring
> >>      4.66%  013  xdpsock  [kernel.vmlinux]    [k] skb_store_bits
> >>      4.39%  013  xdpsock  [kernel.vmlinux]    [k] syscall_return_via_sysret
> >>      3.93%  013  xdpsock  [kernel.vmlinux]    [k] pfn_to_dma_pte
> >>      2.62%  013  xdpsock  [kernel.vmlinux]    [k] __intel_map_single
> >>      2.53%  013  xdpsock  [kernel.vmlinux]    [k] __alloc_skb
> >>      2.36%  013  xdpsock  [kernel.vmlinux]    [k] iommu_no_mapping
> >>      2.21%  013  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
> >>      2.07%  013  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
> >>      1.98%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_node_track_caller
> >>      1.94%  013  xdpsock  [kernel.vmlinux]    [k] ksize
> >>      1.84%  013  xdpsock  [kernel.vmlinux]    [k] validate_xmit_skb_list
> >>      1.62%  013  xdpsock  [kernel.vmlinux]    [k] kmem_cache_alloc_node
> >>      1.48%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_reserve.isra.37
> >>      1.21%  013  xdpsock  xdpsock             [.] xq_enq
> >>      1.08%  013  xdpsock  [kernel.vmlinux]    [k] intel_alloc_iova
> >>  
> >
> > You did use net/core/bpf_jit_kallsyms=1 and the correct perf command for
> > decoding bpf_prog symbols, so the perf top #3 'nmi' is likely a real NMI
> > call... which looks wrong.
> >  
> Thanks, you're right. Let me dig more on this NMI behavior.
> 
> >  
> >> And l2fwd under "perf stat" looks OK to me. There are few context
> >> switches, the CPU is fully utilized, and 1.17 insn per cycle seems ok.
> >>
> >> Performance counter stats for 'CPU(s) 6':
> >>   10000.787420      cpu-clock (msec)          #    1.000 CPUs utilized
> >>             24      context-switches          #    0.002 K/sec
> >>              0      cpu-migrations            #    0.000 K/sec
> >>              0      page-faults               #    0.000 K/sec
> >> 22,361,333,647      cycles                    #    2.236 GHz
> >> 13,458,442,838      stalled-cycles-frontend   #   60.19% frontend cycles idle
> >> 26,251,003,067      instructions              #    1.17  insn per cycle
> >>                                               #    0.51  stalled cycles per insn
> >>  4,938,921,868      branches                  #  493.853 M/sec
> >>      7,591,739      branch-misses             #    0.15% of all branches
> >>   10.000835769 seconds time elapsed  
> >
> > This perf stat also indicates something is wrong.
> >
> > The 1.17 insn per cycle is NOT okay; it is too low compared to what I
> > usually see (e.g. 2.36 insn per cycle).
> >
> > It clearly shows 'stalled-cycles-frontend' at '60.19% frontend cycles
> > idle'.  This means your CPU has a bottleneck fetching instructions, as
> > explained by Andi Kleen here [1].
> >
> > [1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
> >  
> thanks for the link!
>
> It's definitely weird that my frontend-cycle (fetch and decode) stalls
> are so high.
>
> I assume this xdpsock code is small and should fit entirely in the icache.
> However, another perf stat run on xdpsock l2fwd shows:
> 
> 13,720,109,581      stalled-cycles-frontend   # 60.01% frontend cycles idle      (23.82%)
> <not supported>     stalled-cycles-backend
>      7,994,837      branch-misses             # 0.16% of all branches            (23.80%)
>    996,874,424      bus-cycles                # 99.679 M/sec                     (23.80%)
> 18,942,220,445      ref-cycles                # 1894.067 M/sec                   (28.56%)
>    100,983,226      LLC-loads                 # 10.097 M/sec                     (23.80%)
>      4,897,089      LLC-load-misses           # 4.85% of all LL-cache hits       (23.80%)
>     66,659,889      LLC-stores                # 6.665 M/sec                      (9.52%)
>          8,373      LLC-store-misses          # 0.837 K/sec                      (9.52%)
>    158,178,410      LLC-prefetches            # 15.817 M/sec                     (9.52%)
>      3,011,180      LLC-prefetch-misses       # 0.301 M/sec                      (9.52%)
>  8,190,383,109      dTLB-loads                # 818.971 M/sec                    (9.52%)
>     20,432,204      dTLB-load-misses          # 0.25% of all dTLB cache hits     (9.52%)
>  3,729,504,674      dTLB-stores               # 372.920 M/sec                    (9.52%)
>        992,231      dTLB-store-misses         # 0.099 M/sec                      (9.52%)
> <not supported>     dTLB-prefetches
> <not supported>     dTLB-prefetch-misses
>         11,619      iTLB-loads                # 0.001 M/sec                      (9.52%)
>      1,874,756      iTLB-load-misses          # 16135.26% of all iTLB cache hits (14.28%)

What was the sample period for this perf stat?

> I have super high iTLB-load-misses. This is probably the cause of the
> high frontend stalls.

It looks very strange that your iTLB-loads are only 11,619, while the
iTLB-load-misses are much higher at 1,874,756.
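A quick back-of-the-envelope check (assuming perf's "% of all iTLB cache
hits" column here is simply misses divided by loads) reproduces the odd
ratio from the output above:

```shell
# Recompute the iTLB miss ratio from the perf stat numbers quoted above.
# Assumption: the percentage perf printed is misses/loads * 100.
loads=11619
misses=1874756
python3 -c "print(f'{100 * $misses / $loads:.2f}%')"   # prints 16135.26%
```

A ratio far above 100% like this usually means the iTLB-loads counter is
not counting what one might expect (on many Intel parts it counts only
loads that miss the first-level ITLB), rather than a measurement error.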

> Do you know any way to improve iTLB hit rate?

The xdpsock code should be small enough to fit in the iCache, but it
might be laid out in memory in an unfortunate way.  You could try
rearranging the C code (look at the objdump alignments).

If you want to know the details about code alignment issues, and how to
troubleshoot them, you should read this VERY excellent blog post by
Denis Bakhvalov:
https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
