On 01.11.2018 at 10:22, Jesper Dangaard Brouer wrote:
On Wed, 31 Oct 2018 23:20:01 +0100
Paweł Staszewski <pstaszew...@itcare.pl> wrote:
On 31.10.2018 at 23:09, Eric Dumazet wrote:
On 10/31/2018 02:57 PM, Paweł Staszewski wrote:
Hi
So maybe someone will be interested in how the Linux kernel handles
normal traffic (not pktgen :) )
Pawel is this live production traffic?
Yes, I moved the server from the test lab to production to check (risking
a little, but this is traffic switched to the backup router :) )
I know Yoel (Cc) is very interested to know the real-life limitation of
Linux as a router, especially with VLANs like you use.
So yes, this is real-life traffic, real users - normal mixed internet
traffic being forwarded (including DDoS attacks :) )
Server HW configuration:
CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
NICs: 2x 100G Mellanox ConnectX-4 (connected to PCIe x16, 8 GT/s)
Server software:
FRR - as routing daemon
enp175s0f0 (100G) - 16 VLANs from upstreams (28 RSS bound to local NUMA node)
enp175s0f1 (100G) - 343 VLANs to clients (28 RSS bound to local NUMA node)
Maximum traffic that server can handle:
Bandwidth
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
\         iface                   Rx                   Tx                Total
==============================================================================
     enp175s0f1:          28.51 Gb/s           37.24 Gb/s           65.74 Gb/s
     enp175s0f0:          38.07 Gb/s           28.44 Gb/s           66.51 Gb/s
------------------------------------------------------------------------------
          total:          66.58 Gb/s           65.67 Gb/s          132.25 Gb/s
Actually a rather impressive number for a Linux router.
Packets per second:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
- iface Rx Tx Total
==============================================================================
enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s
enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s
------------------------------------------------------------------------------
total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s
Average packet size:
(28.51*10^9/8)/5248589 = 678.99 bytes
(38.07*10^9/8)/3557944 = 1337.49 bytes
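The averages above are simply the byte rate divided by the packet rate. A minimal Python sketch of the same arithmetic, using the Rx figures from the bwm-ng output (the function name is only illustrative):

```python
def avg_packet_size(gbit_per_s: float, packets_per_s: float) -> float:
    """Average packet size in bytes from a bit rate (Gb/s) and a packet rate (P/s)."""
    bytes_per_s = gbit_per_s * 1e9 / 8  # convert bits to bytes
    return bytes_per_s / packets_per_s

# Rx figures taken from the bwm-ng output in this mail
print(f"enp175s0f1: {avg_packet_size(28.51, 5248589):.1f} bytes")
print(f"enp175s0f0: {avg_packet_size(38.07, 3557944):.1f} bytes")
```

The results should land close to the 679 and 1337 byte figures quoted above, up to rounding.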
After reaching those limits, the NICs on the upstream side (more RX
traffic) start to drop packets.
I just don't understand why the server can't handle more bandwidth
(~40 Gbit/s is the limit where all CPUs are at 100% utilization) while
pps on the RX side is still increasing.
I was thinking that maybe I had reached some PCIe x16 limit - but x16 at
8 GT/s is 126 Gbit/s - and also when testing with pktgen I can reach more
bandwidth and pps (about 4x more compared to normal internet traffic).
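For reference, the 126 Gbit figure matches the raw data rate of a PCIe Gen3 x16 link once the 128b/130b line encoding is taken into account. A quick sketch of that arithmetic follows; it ignores TLP/DLLP protocol overhead, so the rate usable by the NIC is somewhat lower in practice:

```python
def pcie_gen3_gbit(lanes: int, gt_per_s: float = 8.0) -> float:
    """Raw data rate in Gbit/s of a PCIe Gen3 link using 128b/130b encoding."""
    return lanes * gt_per_s * 128 / 130

print(f"x16 @ 8 GT/s: {pcie_gen3_gbit(16):.1f} Gbit/s")
```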
And I am wondering if there is something that can be improved here.
Some more information / counters / stats and perf top below:
Perf top flame graph:
https://uploadfiles.io/7zo6u
Thanks a lot for the flame graph!
System configuration(long):
cat /sys/devices/system/node/node1/cpulist
14-27,42-55
cat /sys/class/net/enp175s0f0/device/numa_node
1
cat /sys/class/net/enp175s0f1/device/numa_node
1
Hint: grep can give you nicer output than cat:
$ grep -H . /sys/class/net/*/device/numa_node
Sure:
grep -H . /sys/class/net/*/device/numa_node
/sys/class/net/enp175s0f0/device/numa_node:1
/sys/class/net/enp175s0f1/device/numa_node:1
ip -s -d link ls dev enp175s0f0
6: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
mode DEFAULT group default qlen 8192
link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0
addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536
gso_max_segs 65535
RX: bytes packets errors dropped overrun mcast
184142375840858 141347715974 2 2806325 0 85050528
TX: bytes packets errors dropped carrier collsns
99270697277430 172227994003 0 0 0 0
ip -s -d link ls dev enp175s0f1
7: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
mode DEFAULT group default qlen 8192
link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0
addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536
gso_max_segs 65535
RX: bytes packets errors dropped overrun mcast
99686284170801 173507590134 61 669685 0 100304421
TX: bytes packets errors dropped carrier collsns
184435107970545 142383178304 0 0 0 0
You have increased the default (1000) qlen to 8192, why?
I was checking whether a higher tx queue length would change anything,
but there was no change for settings 1000, 4096, 8192.
But yes, I do not use any traffic shaping there like hfsc/htb etc.
- just the default qdisc mq 0: root pfifo_fast
tc qdisc show dev enp175s0f1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :38 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :37 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :36 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
...
...
And vlans are noqueue
tc -s -d qdisc show dev vlan1521
qdisc noqueue 0: root refcnt 2
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
What is weird is that no counters are increasing, but there is traffic
in/out on those VLANs
ip -s -d link ls dev vlan1521
87: vlan1521@enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0
vlan protocol 802.1Q id 1521 <REORDER_HDR> addrgenmode eui64
numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
RX: bytes packets errors dropped overrun mcast
562964218394 1639370761 0 0 0 0
TX: bytes packets errors dropped carrier collsns
1417648713052 618271312 0 0 0 0
What default qdisc do you run?... Looking through your very detailed main
email report (I do love the details you give!), you run
pfifo_fast_dequeue, thus this 8192 qlen is actually having an effect.
I would like to know if and how much qdisc_dequeue bulking is happening
in this setup? Can you run:
perf-stat-hist -m 8192 -P2 qdisc:qdisc_dequeue packets
The perf-stat-hist is from Brendan Gregg's git-tree:
https://github.com/brendangregg/perf-tools
https://github.com/brendangregg/perf-tools/blob/master/misc/perf-stat-hist
./perf-stat-hist -m 8192 -P2 qdisc:qdisc_dequeue packets
Tracing qdisc:qdisc_dequeue, power-of-2, max 8192, until Ctrl-C...
^C
Range : Count Distribution
-> -1 : 0 | |
0 -> 0 : 43768349 |######################################|
1 -> 1 : 43895249 |######################################|
2 -> 3 : 352 |# |
4 -> 7 : 228 |# |
8 -> 15 : 135 |# |
16 -> 31 : 73 |# |
32 -> 63 : 7 |# |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> : 0 | |
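To put the histogram in numbers, a small Python sketch can sum the buckets and compute how many qdisc_dequeue calls returned two or more packets. The counts are copied from the output above; the variable names are only illustrative:

```python
# Counts copied from the perf-stat-hist output: (low, high) packets -> dequeue calls
hist = {
    (0, 0): 43768349,
    (1, 1): 43895249,
    (2, 3): 352,
    (4, 7): 228,
    (8, 15): 135,
    (16, 31): 73,
    (32, 63): 7,
}

total = sum(hist.values())
# A dequeue counts as "bulked" when it returned at least 2 packets
bulked = sum(count for (low, _high), count in hist.items() if low >= 2)

print(f"dequeue calls: {total}")
print(f"calls returning 2+ packets: {bulked} ({100 * bulked / total:.4f}%)")
```

With these counts, only a few hundred calls out of roughly 87 million dequeued more than one packet, so practically no dequeue bulking is happening in this setup.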
./softnet.sh
cpu total dropped squeezed collision rps flow_limit
PerfTop: 108490 irqs/sec kernel:99.6% exact: 0.0% [4000Hz cycles], (all, 56 CPUs)
------------------------------------------------------------------------------------------
26.78% [kernel] [k] queued_spin_lock_slowpath
This is highly suspect.
I agree! 26.78% is spent in queued_spin_lock_slowpath. Hint: if you see
_raw_spin_lock then it is likely not a contended lock, but if you see
queued_spin_lock_slowpath in a perf-report your workload is likely in
trouble.
A call graph (perf record -a -g sleep 1; perf report --stdio)
would tell what is going on.
perf report:
https://ufile.io/rqp0h
Thanks for the output (my 30" screen is just large enough to see the
full output). Together with the flame-graph, it is clear that this
lock happens in the page allocator code.
Section copied out:
mlx5e_poll_tx_cq
  |
   --16.34%--napi_consume_skb
             |
             |--12.65%--__free_pages_ok
             |          |
             |           --11.86%--free_one_page
             |                     |
             |                     |--10.10%--queued_spin_lock_slowpath
             |                     |
             |                      --0.65%--_raw_spin_lock
             |
             |--1.55%--page_frag_free
             |
              --1.44%--skb_release_data
Let me explain what (I think) happens. The mlx5 driver RX-page recycle
mechanism is not effective in this workload, and pages have to go
through the page allocator. The lock contention happens during mlx5
DMA TX completion cycle. And the page allocator cannot keep up at
these speeds.
One solution is to extend the page allocator with a bulk free API. (This
has been on my TODO list for a long time, but I don't have a
micro-benchmark that tricks the driver page-recycle into failing.) It
should fit nicely, as I can see that kmem_cache_free_bulk() does get
activated (bulk freeing SKBs), which means that DMA TX completion does
have a bulk of packets.
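As a rough illustration of why a bulk free API would help, here is a toy Python model (not kernel code) that counts how often the zone lock would be taken when pages are freed one at a time versus in batches. The page count and batch size are hypothetical values chosen only for the example:

```python
import math

def lock_acquisitions(pages: int, batch: int = 1) -> int:
    """Lock acquisitions needed to free `pages` pages, `batch` pages per lock hold."""
    return math.ceil(pages / batch)

pages = 64  # hypothetical number of pages freed in one TX completion cycle
print("one page per lock hold:", lock_acquisitions(pages))
print("bulk free, batch of 16:", lock_acquisitions(pages, batch=16))
```

Fewer acquisitions per completion cycle means fewer chances for CPUs to collide on the same lock, which is the contention showing up as queued_spin_lock_slowpath.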
We can (and should) also improve the page recycle scheme in the driver.
After LPC, I have a project with Tariq and Ilias (Cc'ed) to improve the
page_pool, and we will (attempt) to generalize this, for both high-end
mlx5 and more low-end ARM64-boards (macchiatobin and espressobin).
The MM people are working in parallel to improve the performance of
order-0 page returns. Thus, the explicit page bulk free API might
actually become less important. I actually think (Cc.) Aaron has a
patchset he would like you to test, which removes the (zone->)lock
you hit in free_one_page().
Ok - Thank You Jesper