On 10.11.2018 at 20:34, Jesper Dangaard Brouer wrote:
On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski <pstaszew...@itcare.pl>
wrote:
On 08.11.2018 at 20:12, Paweł Staszewski wrote:
CPU load is lower than for the ConnectX-4 - but it looks like the bandwidth
limit is the same :)
But also after reaching 60Gbit/60Gbit:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
- iface Rx Tx Total
==========================================================================
enp175s0: 45.09 Gb/s 15.09 Gb/s 60.18 Gb/s
enp216s0: 15.14 Gb/s 45.19 Gb/s 60.33 Gb/s
--------------------------------------------------------------------------
total: 60.45 Gb/s 60.48 Gb/s 120.93 Gb/s
Today it reached 65/65 Gbit/s.
But starting from 60Gbit/s RX / 60Gbit/s TX the NICs start to drop packets
(with 50% CPU on all 28 cores) - so there is still CPU power to use :).
This is weird!
How do you see / measure these drops?
A simple ICMP test like ping -i 0.1
I am testing by pinging a management IP address on a VLAN that is attached
to one NIC (the side that is more stressed with RX).
And another ICMP test is forwarded through this router - to a host behind it.
Both measurements show the same loss ratio, from 0.1 to 0.5%, after reaching
~45Gbit/s on the RX side - depending on how hard the RX side is pushed,
drops vary between 0.1 and 0.5 - even 0.6% :)
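A minimal sketch of that kind of ICMP check (10.0.0.1 is just a placeholder
for the management / behind-router address; an interval below 0.2s needs root):

  # quiet mode; the summary line reports the "% packet loss" quoted above
  $ ping -i 0.1 -c 6000 -q 10.0.0.1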
So I checked other stats.
softnet_stat shows an average of 1k squeezed per second:
Is the output below the raw counters, not per sec?
It would be valuable to see the per-sec stats instead...
I use this tool:
https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl
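If the Perl tool is not handy, a rough per-second delta of the squeezed
column (3rd hex field of /proc/net/softnet_stat) can be sketched like this,
assuming GNU awk for strtonum:

  while true; do
    # field 3 is time_squeeze; counters in this file are hexadecimal
    awk '{ printf "cpu%d %d\n", NR-1, strtonum("0x"$3) }' \
        /proc/net/softnet_stat > /tmp/softnet.now
    [ -f /tmp/softnet.prev ] && paste /tmp/softnet.prev /tmp/softnet.now | \
        awk '{ d = $4 - $2; if (d > 0) print $1, d "/s squeezed" }'
    mv /tmp/softnet.now /tmp/softnet.prev
    sleep 1
  done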
cpu total dropped squeezed collision rps flow_limit
0 18554 0 1 0 0 0
1 16728 0 1 0 0 0
2 18033 0 1 0 0 0
3 17757 0 1 0 0 0
4 18861 0 0 0 0 0
5 0 0 1 0 0 0
6 2 0 1 0 0 0
7 0 0 1 0 0 0
8 0 0 0 0 0 0
9 0 0 1 0 0 0
10 0 0 0 0 0 0
11 0 0 1 0 0 0
12 50 0 1 0 0 0
13 257 0 0 0 0 0
14 3629115363 0 3353259 0 0 0
15 255167835 0 3138271 0 0 0
16 4240101961 0 3036130 0 0 0
17 599810018 0 3072169 0 0 0
18 432796524 0 3034191 0 0 0
19 41803906 0 3037405 0 0 0
20 900382666 0 3112294 0 0 0
21 620926085 0 3086009 0 0 0
22 41861198 0 3023142 0 0 0
23 4090425574 0 2990412 0 0 0
24 4264870218 0 3010272 0 0 0
25 141401811 0 3027153 0 0 0
26 104155188 0 3051251 0 0 0
27 4261258691 0 3039765 0 0 0
28 4 0 1 0 0 0
29 4 0 0 0 0 0
30 0 0 1 0 0 0
31 0 0 0 0 0 0
32 3 0 1 0 0 0
33 1 0 1 0 0 0
34 0 0 1 0 0 0
35 0 0 0 0 0 0
36 0 0 1 0 0 0
37 0 0 1 0 0 0
38 0 0 1 0 0 0
39 0 0 1 0 0 0
40 0 0 0 0 0 0
41 0 0 1 0 0 0
42 299758202 0 3139693 0 0 0
43 4254727979 0 3103577 0 0 0
44 1959555543 0 2554885 0 0 0
45 1675702723 0 2513481 0 0 0
46 1908435503 0 2519698 0 0 0
47 1877799710 0 2537768 0 0 0
48 2384274076 0 2584673 0 0 0
49 2598104878 0 2593616 0 0 0
50 1897566829 0 2530857 0 0 0
51 1712741629 0 2489089 0 0 0
52 1704033648 0 2495892 0 0 0
53 1636781820 0 2499783 0 0 0
54 1861997734 0 2541060 0 0 0
55 2113521616 0 2555673 0 0 0
So I raised the netdev backlog and budget to really high values:
524288 for netdev_budget and the same for the backlog.
Does it affect the squeezed counters?
A little - but not much.
After changing the budget from 65536 to 524k, the number of squeezed counters
for all CPUs went from 1.5k per second down to 0.9-1k per second - but
increasing it above 524k changes nothing - still 0.9 to 1k/s squeezed.
Notice, this (crazy) huge netdev_budget limit will also be limited
by /proc/sys/net/core/netdev_budget_usecs.
Yes, I changed that as well, to 1000 / 2000 / 3000 / 4000 - not much
difference in squeezed - I can't even see the difference.
This raised softirqs from about 600k/sec to 800k/sec for NET_TX/NET_RX.
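For reference, the knobs being tuned above are the usual net.core sysctls;
a sketch with the values mentioned in this thread (not a recommendation):

  $ sysctl -w net.core.netdev_budget=524288
  $ sysctl -w net.core.netdev_max_backlog=524288
  $ sysctl -w net.core.netdev_budget_usecs=4000   # also tried 1000/2000/3000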
Hmmm, this could indicate that not enough NAPI bulking is occurring.
I have a BPF tool that can give you some insight into NAPI bulking and
softirq idle/kthread starting, called 'napi_monitor'. Could you try to
run it, so we can try to understand this? You can find the tool here:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/napi_monitor_user.c
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/napi_monitor_kern.c
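A quicker sanity check, assuming bpftrace is installed and the kernel's
napi:napi_poll tracepoint exposes the work/budget fields, is a histogram of
how many packets each NAPI poll handles (low values suggest poor bulking):

  $ bpftrace -e 'tracepoint:napi:napi_poll { @work = hist(args->work); }'
  # Ctrl-C prints the histogram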
Yes, I will try it.
But after these changes I have fewer packet drops.
Below is perf top from the max traffic reached:
PerfTop: 72230 irqs/sec kernel:99.4% exact: 0.0% [4000Hz
cycles], (all, 56 CPUs)
------------------------------------------------------------------------------------------
12.62% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear
8.44% [kernel] [k] mlx5e_sq_xmit
6.69% [kernel] [k] build_skb
5.21% [kernel] [k] fib_table_lookup
3.54% [kernel] [k] memcpy_erms
3.20% [kernel] [k] mlx5e_poll_rx_cq
2.25% [kernel] [k] vlan_do_receive
2.20% [kernel] [k] mlx5e_post_rx_mpwqes
2.02% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq
1.95% [kernel] [k] __dev_queue_xmit
1.83% [kernel] [k] dev_gro_receive
1.79% [kernel] [k] tcp_gro_receive
1.73% [kernel] [k] ip_finish_output2
1.63% [kernel] [k] mlx5e_poll_tx_cq
1.49% [kernel] [k] ipt_do_table
1.38% [kernel] [k] inet_gro_receive
1.31% [kernel] [k] __netif_receive_skb_core
1.30% [kernel] [k] _raw_spin_lock
1.28% [kernel] [k] mlx5_eq_int
1.24% [kernel] [k] irq_entries_start
1.19% [kernel] [k] __build_skb
1.15% [kernel] [k] swiotlb_map_page
1.02% [kernel] [k] vlan_dev_hard_start_xmit
0.94% [kernel] [k] pfifo_fast_dequeue
0.92% [kernel] [k] ip_route_input_rcu
0.86% [kernel] [k] kmem_cache_alloc
0.80% [kernel] [k] mlx5e_xmit
0.79% [kernel] [k] dev_hard_start_xmit
0.78% [kernel] [k] _raw_spin_lock_irqsave
0.74% [kernel] [k] ip_forward
0.72% [kernel] [k] tasklet_action_common.isra.21
0.68% [kernel] [k] pfifo_fast_enqueue
0.67% [kernel] [k] netif_skb_features
0.66% [kernel] [k] skb_segment
0.60% [kernel] [k] skb_gro_receive
0.56% [kernel] [k] validate_xmit_skb.isra.142
0.53% [kernel] [k] skb_release_data
0.51% [kernel] [k] mlx5e_page_release
0.51% [kernel] [k] ip_rcv_core.isra.20.constprop.25
0.51% [kernel] [k] __qdisc_run
0.50% [kernel] [k] tcp4_gro_receive
0.49% [kernel] [k] page_frag_free
0.46% [kernel] [k] kmem_cache_free_bulk
0.43% [kernel] [k] kmem_cache_free
0.42% [kernel] [k] try_to_wake_up
0.39% [kernel] [k] _raw_spin_lock_irq
0.39% [kernel] [k] find_busiest_group
0.37% [kernel] [k] __memcpy
Remember, those tests are now on two separate ConnectX-5 cards connected to
two separate PCIe x16 gen 3.0 slots.
That is strange... I still suspect some HW NIC issue. Can you provide
ethtool stats info via this tool:
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
$ ethtool_stats.pl --dev enp175s0 --dev enp216s0
The tool removes zero-stats counters and reports per-sec stats. That makes
it easier to spot what is relevant for the given workload.
Yes, Mellanox just has too many counters that are always 0 for my case :)
Will try this also.
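In the meantime, a crude way to hide the always-zero counters (without the
per-sec rates the Perl tool gives) is to filter plain ethtool -S output:

  # keep only non-zero "name: value" lines
  $ ethtool -S enp175s0 | awk -F': ' 'NR > 1 && $2 + 0 != 0'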
Can you give the output from:
$ ethtool --show-priv-flags DEVICE
I want you to experiment with:
ethtool --show-priv-flags enp175s0
Private flags for enp175s0:
rx_cqe_moder : on
tx_cqe_moder : off
rx_cqe_compress : off
rx_striding_rq : on
rx_no_csum_complete: off
ethtool --set-priv-flags DEVICE rx_striding_rq off
OK, I will first check on a test server whether this will reset my
interface and not produce a kernel panic :)
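A sketch of that test-box check - toggling the flag, verifying it took
effect, and watching the kernel log (expect the interface to reset):

  $ ethtool --set-priv-flags enp175s0 rx_striding_rq off
  $ ethtool --show-priv-flags enp175s0 | grep rx_striding_rq
  $ dmesg | tail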
I think you have already played with 'rx_cqe_compress', right?
Yes - and compression increases the number of IRQs but does not do much for
bandwidth - same limit of 60-64Gbit/s total RX+TX on one 100G port.
And what is weird - that limit is overall symmetric - because if, for
example, the 100G port is receiving 42G of traffic and transmitting 20G of
traffic, and I flood the RX side with pktgen or other traffic (for example
ICMP) at 1/2/3/4/5G - then the receiving side increases by 1/2/3/4/5Gbit of
traffic but the transmitting side goes down by the same amount.
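For completeness, a minimal pktgen sketch of that kind of RX-side flood
(device, destination IP and MAC are placeholders; run on the sender, not on
the router under test):

  modprobe pktgen
  # bind the device to one pktgen kernel thread
  echo "rem_device_all"      > /proc/net/pktgen/kpktgend_0
  echo "add_device enp175s0" > /proc/net/pktgen/kpktgend_0
  # placeholder destination and small packets; count 0 = run until stopped
  echo "count 0"                   > /proc/net/pktgen/enp175s0
  echo "pkt_size 64"               > /proc/net/pktgen/enp175s0
  echo "dst 10.0.0.2"              > /proc/net/pktgen/enp175s0
  echo "dst_mac aa:bb:cc:dd:ee:ff" > /proc/net/pktgen/enp175s0
  echo "start" > /proc/net/pktgen/pgctrl   # results in /proc/net/pktgen/enp175s0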