On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski <pstaszew...@itcare.pl> wrote:
> On 08.11.2018 at 20:12, Paweł Staszewski wrote:
> > CPU load is lower than for the ConnectX-4 - but it looks like the
> > bandwidth limit is the same :)
> > But also after reaching 60Gbit/60Gbit
> >
> > bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> > input: /proc/net/dev type: rate
> > -  iface              Rx              Tx             Total
> > ==========================================================================
> >
> >   enp175s0:      45.09 Gb/s      15.09 Gb/s      60.18 Gb/s
> >   enp216s0:      15.14 Gb/s      45.19 Gb/s      60.33 Gb/s
> > --------------------------------------------------------------------------
> >
> >   total:         60.45 Gb/s      60.48 Gb/s     120.93 Gb/s
>
> Today it reached 65/65 Gbit/s.
> But starting from 60 Gbit/s RX / 60 Gbit/s TX the NICs start to drop
> packets (with 50% CPU on all 28 cores) - so there is still CPU power
> left to use :)

This is weird! How do you see / measure these drops?

> So I checked other stats.
> softnet_stats shows on average 1k squeezed per sec:

Is the output below the raw counters, not per-sec values?  It would be
valuable to see the per-sec stats instead (a minimal shell sketch for a
quick per-sec view is also included further down)... I use this tool:
 https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl

> cpu          total  dropped  squeezed  collision  rps  flow_limit
>   0          18554        0         1          0    0           0
>   1          16728        0         1          0    0           0
>   2          18033        0         1          0    0           0
>   3          17757        0         1          0    0           0
>   4          18861        0         0          0    0           0
>   5              0        0         1          0    0           0
>   6              2        0         1          0    0           0
>   7              0        0         1          0    0           0
>   8              0        0         0          0    0           0
>   9              0        0         1          0    0           0
>  10              0        0         0          0    0           0
>  11              0        0         1          0    0           0
>  12             50        0         1          0    0           0
>  13            257        0         0          0    0           0
>  14     3629115363        0   3353259          0    0           0
>  15      255167835        0   3138271          0    0           0
>  16     4240101961        0   3036130          0    0           0
>  17      599810018        0   3072169          0    0           0
>  18      432796524        0   3034191          0    0           0
>  19       41803906        0   3037405          0    0           0
>  20      900382666        0   3112294          0    0           0
>  21      620926085        0   3086009          0    0           0
>  22       41861198        0   3023142          0    0           0
>  23     4090425574        0   2990412          0    0           0
>  24     4264870218        0   3010272          0    0           0
>  25      141401811        0   3027153          0    0           0
>  26      104155188        0   3051251          0    0           0
>  27     4261258691        0   3039765          0    0           0
>  28              4        0         1          0    0           0
>  29              4        0         0          0    0           0
>  30              0        0         1          0    0           0
>  31              0        0         0          0    0           0
>  32              3        0         1          0    0           0
>  33              1        0         1          0    0           0
>  34              0        0         1          0    0           0
>  35              0        0         0          0    0           0
>  36              0        0         1          0    0           0
>  37              0        0         1          0    0           0
>  38              0        0         1          0    0           0
>  39              0        0         1          0    0           0
>  40              0        0         0          0    0           0
>  41              0        0         1          0    0           0
>  42      299758202        0   3139693          0    0           0
>  43     4254727979        0   3103577          0    0           0
>  44     1959555543        0   2554885          0    0           0
>  45     1675702723        0   2513481          0    0           0
>  46     1908435503        0   2519698          0    0           0
>  47     1877799710        0   2537768          0    0           0
>  48     2384274076        0   2584673          0    0           0
>  49     2598104878        0   2593616          0    0           0
>  50     1897566829        0   2530857          0    0           0
>  51     1712741629        0   2489089          0    0           0
>  52     1704033648        0   2495892          0    0           0
>  53     1636781820        0   2499783          0    0           0
>  54     1861997734        0   2541060          0    0           0
>  55     2113521616        0   2555673          0    0           0
>
> So I raised the netdev backlog and budget to really high values:
> 524288 for netdev_budget and the same for the backlog.

Does it affect the squeezed counters?  Notice, this (crazy) huge
netdev_budget limit will also be bounded by
/proc/sys/net/core/netdev_budget_usecs.

> This raised softirqs from about 600k/sec to 800k/sec for NET_TX/NET_RX.

Hmmm, this could indicate that not enough NAPI bulking is occurring. I have
a BPF tool that can give you some insight into NAPI bulking and into
softirqs starting from idle vs. the ksoftirqd kthread. It is called
'napi_monitor'; could you try to run it, so we can try to understand this?
You can find the tool here:
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/napi_monitor_user.c
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/napi_monitor_kern.c

> But after these changes I have fewer packet drops.
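In case running the perl script is not convenient, the following is a
minimal, untested shell sketch for that quick per-sec view (it assumes GNU
awk for strtonum(), and that column 3 of /proc/net/softnet_stat is the hex
time_squeeze counter, one row per CPU); softnet_stat.pl gives a much more
detailed per-CPU breakdown:

  # Sum the 'squeezed' (time_squeeze) column over all CPUs
  sum_squeezed() {
      awk '{ s += strtonum("0x" $3) } END { print s }' /proc/net/softnet_stat
  }

  # Print the system-wide squeezed events per second
  prev=$(sum_squeezed)
  while sleep 1; do
      cur=$(sum_squeezed)
      echo "squeezed/sec: $((cur - prev))"
      prev=$cur
  done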
>
> Below is the perf top output at the maximum traffic reached:
>
>    PerfTop:   72230 irqs/sec  kernel:99.4%  exact: 0.0%  [4000Hz cycles],
>               (all, 56 CPUs)
> ------------------------------------------------------------------------------------------
>
>     12.62%  [kernel]  [k] mlx5e_skb_from_cqe_mpwrq_linear
>      8.44%  [kernel]  [k] mlx5e_sq_xmit
>      6.69%  [kernel]  [k] build_skb
>      5.21%  [kernel]  [k] fib_table_lookup
>      3.54%  [kernel]  [k] memcpy_erms
>      3.20%  [kernel]  [k] mlx5e_poll_rx_cq
>      2.25%  [kernel]  [k] vlan_do_receive
>      2.20%  [kernel]  [k] mlx5e_post_rx_mpwqes
>      2.02%  [kernel]  [k] mlx5e_handle_rx_cqe_mpwrq
>      1.95%  [kernel]  [k] __dev_queue_xmit
>      1.83%  [kernel]  [k] dev_gro_receive
>      1.79%  [kernel]  [k] tcp_gro_receive
>      1.73%  [kernel]  [k] ip_finish_output2
>      1.63%  [kernel]  [k] mlx5e_poll_tx_cq
>      1.49%  [kernel]  [k] ipt_do_table
>      1.38%  [kernel]  [k] inet_gro_receive
>      1.31%  [kernel]  [k] __netif_receive_skb_core
>      1.30%  [kernel]  [k] _raw_spin_lock
>      1.28%  [kernel]  [k] mlx5_eq_int
>      1.24%  [kernel]  [k] irq_entries_start
>      1.19%  [kernel]  [k] __build_skb
>      1.15%  [kernel]  [k] swiotlb_map_page
>      1.02%  [kernel]  [k] vlan_dev_hard_start_xmit
>      0.94%  [kernel]  [k] pfifo_fast_dequeue
>      0.92%  [kernel]  [k] ip_route_input_rcu
>      0.86%  [kernel]  [k] kmem_cache_alloc
>      0.80%  [kernel]  [k] mlx5e_xmit
>      0.79%  [kernel]  [k] dev_hard_start_xmit
>      0.78%  [kernel]  [k] _raw_spin_lock_irqsave
>      0.74%  [kernel]  [k] ip_forward
>      0.72%  [kernel]  [k] tasklet_action_common.isra.21
>      0.68%  [kernel]  [k] pfifo_fast_enqueue
>      0.67%  [kernel]  [k] netif_skb_features
>      0.66%  [kernel]  [k] skb_segment
>      0.60%  [kernel]  [k] skb_gro_receive
>      0.56%  [kernel]  [k] validate_xmit_skb.isra.142
>      0.53%  [kernel]  [k] skb_release_data
>      0.51%  [kernel]  [k] mlx5e_page_release
>      0.51%  [kernel]  [k] ip_rcv_core.isra.20.constprop.25
>      0.51%  [kernel]  [k] __qdisc_run
>      0.50%  [kernel]  [k] tcp4_gro_receive
>      0.49%  [kernel]  [k] page_frag_free
>      0.46%  [kernel]  [k] kmem_cache_free_bulk
>      0.43%  [kernel]  [k] kmem_cache_free
>      0.42%  [kernel]  [k] try_to_wake_up
>      0.39%  [kernel]  [k] _raw_spin_lock_irq
>      0.39%  [kernel]  [k] find_busiest_group
>      0.37%  [kernel]  [k] __memcpy
>
> Remember, those tests are now on two separate ConnectX-5 cards,
> connected to two separate PCIe x16 gen 3.0 slots.

That is strange... I still suspect some HW NIC issue. Can you provide
ethtool stats info via this tool:
 https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

 $ ethtool_stats.pl --dev enp175s0 --dev enp216s0

The tool removes zero-stats counters and reports per-sec stats, which
makes it easier to spot what is relevant for the given workload.

Can you give the output from:

 $ ethtool --show-priv-flags DEVICE

I want you to experiment with (a minimal command sketch for both ports
follows at the end of this mail):

 ethtool --set-priv-flags DEVICE rx_striding_rq off

I think you have already played with 'rx_cqe_compress', right?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
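A minimal command sketch of the priv-flags experiment suggested above (the
device names are taken from the quoted bwm-ng output, and the exact flag
names can differ between mlx5 driver/firmware versions, so treat this as a
sketch rather than a recipe):

  for dev in enp175s0 enp216s0; do
      # Show the current private flags (look for rx_striding_rq and
      # rx_cqe_compress in the list)
      ethtool --show-priv-flags $dev

      # Disable the striding RQ on this port
      ethtool --set-priv-flags $dev rx_striding_rq off

      # Confirm the flag actually changed
      ethtool --show-priv-flags $dev | grep rx_striding_rq
  done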