On 01.11.2018 at 10:22, Jesper Dangaard Brouer wrote:
On Wed, 31 Oct 2018 23:20:01 +0100
Paweł Staszewski <pstaszew...@itcare.pl> wrote:

On 31.10.2018 at 23:09, Eric Dumazet wrote:
On 10/31/2018 02:57 PM, Paweł Staszewski wrote:
Hi

So maybe someone will be interested in how the Linux kernel handles
normal traffic (not pktgen :) )
Pawel, is this live production traffic?
Yes, I moved the server from the test lab to production to check (risking a little - but this is traffic that can be switched to a backup router :) )


I know Yoel (Cc) is very interested to know the real-life limitation of
Linux as a router, especially with VLANs like you use.
So yes, this is real-life traffic, real users - normal mixed internet traffic being forwarded (including DDoSes :) )

Server HW configuration:

CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

NICs: 2x 100G Mellanox ConnectX-4 (connected to PCIe x16, 8 GT/s)


Server software:

FRR - as routing daemon

enp175s0f0 (100G) - 16 VLANs from upstreams (28 RSS queues bound to the local NUMA node)

enp175s0f1 (100G) - 343 VLANs to clients (28 RSS queues bound to the local NUMA node - see the affinity sketch below)
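(For reference, a minimal sketch of how such a binding can be done - an illustration, not the exact script used on this box; Mellanox's set_irq_affinity helper scripts do the same job:)

  # Pin each of the NIC's MSI-X IRQs round-robin onto the CPUs of its
  # local NUMA node (hypothetical helper; interface name as above).
  IF=enp175s0f0
  NODE=$(cat /sys/class/net/$IF/device/numa_node)
  CPUS=($(lscpu -p=CPU,NODE | awk -F, -v n="$NODE" '!/^#/ && $2==n {print $1}'))
  i=0
  for irq in $(ls /sys/class/net/$IF/device/msi_irqs); do
      echo "${CPUS[$((i % ${#CPUS[@]}))]}" > /proc/irq/$irq/smp_affinity_list
      i=$((i + 1))
  done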


Maximum traffic that the server can handle:

Bandwidth

   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
    input: /proc/net/dev type: rate
    \         iface                   Rx                   Tx                Total
==============================================================================
         enp175s0f1:          28.51 Gb/s           37.24 Gb/s           65.74 Gb/s
         enp175s0f0:          38.07 Gb/s           28.44 Gb/s           66.51 Gb/s
------------------------------------------------------------------------------
              total:          66.58 Gb/s           65.67 Gb/s          132.25 Gb/s

Actually a rather impressive number for a Linux router.

Packets per second:

   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
    input: /proc/net/dev type: rate
    -         iface                   Rx                   Tx                Total
==============================================================================
         enp175s0f1:      5248589.00 P/s       3486617.75 P/s      8735207.00 P/s
         enp175s0f0:      3557944.25 P/s       5232516.00 P/s      8790460.00 P/s
------------------------------------------------------------------------------
              total:      8806533.00 P/s       8719134.00 P/s     17525668.00 P/s

Average packet size:
   (28.51*10^9/8)/5248589 =  678.99 bytes
   (38.07*10^9/8)/3557944 = 1337.49 bytes
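(For what it's worth, the same average can be computed live from the /proc/net/dev counters that bwm-ng reads - a hypothetical snippet, not part of the original mail:)

  # Average RX packet size for one interface over a 1 s window
  # (integer division; adjust IF as needed).
  IF=enp175s0f0
  read b1 p1 < <(awk -v i="$IF:" '$1==i {print $2, $3}' /proc/net/dev)
  sleep 1
  read b2 p2 < <(awk -v i="$IF:" '$1==i {print $2, $3}' /proc/net/dev)
  echo "avg RX packet size: $(( (b2 - b1) / (p2 - p1) )) bytes"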


After reaching those limits, the NICs on the upstream side (which carry more RX
traffic) start to drop packets.


I just don't understand why the server can't handle more bandwidth
(~40Gbit/s is the limit where all CPUs hit 100% utilization) - while the pps
on the RX side keeps increasing.

I was thinking that maybe I had reached some PCIe x16 limit - but x16 at 8 GT/s
is ~126Gbit/s - and also when testing with pktgen I can reach more bandwidth
and pps (like 4x more compared to normal internet traffic).
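(A quick way to rule out PCIe link down-training - a hypothetical check, not quoted from the original mail:)

  # Negotiated PCIe link speed/width of the NIC; 8 GT/s x16 (PCIe 3.0)
  # gives roughly 126 Gbit/s usable per direction.
  grep -H . /sys/class/net/enp175s0f0/device/current_link_speed \
            /sys/class/net/enp175s0f0/device/current_link_width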

And I am wondering if there is something that can be improved here.



Some more information / counters / stats and perf top below:

Perf top flame graph:

https://uploadfiles.io/7zo6u
Thanks a lot for the flame graph!

System configuration (long):


cat /sys/devices/system/node/node1/cpulist
14-27,42-55
cat /sys/class/net/enp175s0f0/device/numa_node
1
cat /sys/class/net/enp175s0f1/device/numa_node
1

Hint: grep can give you nicer output than cat:

$ grep -H . /sys/class/net/*/device/numa_node
Sure:
grep -H . /sys/class/net/*/device/numa_node
/sys/class/net/enp175s0f0/device/numa_node:1
/sys/class/net/enp175s0f1/device/numa_node:1

ip -s -d link ls dev enp175s0f0
6: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
    link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped overrun mcast
    184142375840858 141347715974 2       2806325 0       85050528
    TX: bytes  packets  errors  dropped carrier collsns
    99270697277430 172227994003 0       0       0       0

ip -s -d link ls dev enp175s0f1
7: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
    link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped overrun mcast
    99686284170801 173507590134 61      669685  0       100304421
    TX: bytes  packets  errors  dropped carrier collsns
    184435107970545 142383178304 0       0       0       0

You have increased the default (1000) qlen to 8192, why?
I was checking whether a higher txqueuelen would change anything,
but there was no change with settings of 1000, 4096 or 8192.
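(For reference, a txqueuelen change like the ones tried above would be applied roughly as follows - hypothetical commands, the exact ones are not quoted here:)

  ip link set dev enp175s0f1 txqueuelen 1000   # default
  ip link set dev enp175s0f1 txqueuelen 4096
  ip link set dev enp175s0f1 txqueuelen 8192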
But yes, I do not use any traffic shaping there like HFSC/HTB etc.
- just the default qdisc: mq 0: root with pfifo_fast leaves.
tc qdisc show dev enp175s0f1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :38 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :37 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :36 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
...
...



And the VLANs are noqueue:
tc -s -d qdisc show dev vlan1521
qdisc noqueue 0: root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0


What is weird is that the qdisc counters are not increasing, even though there is traffic in/out on those VLANs:

ip -s -d link ls dev vlan1521
87: vlan1521@enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0
    vlan protocol 802.1Q id 1521 <REORDER_HDR> addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped overrun mcast
    562964218394 1639370761 0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    1417648713052 618271312 0       0       0       0


What default qdisc do you run?... Looking through your very detailed main
email report (I do love the details you give!): you run
pfifo_fast_dequeue, thus this 8192 qlen is actually having an effect.

I would like to know if, and how much, qdisc_dequeue bulking is happening
in this setup. Can you run:

  perf-stat-hist -m 8192 -P2 qdisc:qdisc_dequeue packets

The perf-stat-hist is from Brendan Gregg's git-tree:
  https://github.com/brendangregg/perf-tools
  https://github.com/brendangregg/perf-tools/blob/master/misc/perf-stat-hist

 ./perf-stat-hist -m 8192 -P2 qdisc:qdisc_dequeue packets
Tracing qdisc:qdisc_dequeue, power-of-2, max 8192, until Ctrl-C...
^C
            Range          : Count    Distribution
              -> -1        : 0 |                                      |
            0 -> 0         : 43768349 |######################################|
            1 -> 1         : 43895249 |######################################|
            2 -> 3         : 352 |#                                     |
            4 -> 7         : 228 |#                                     |
            8 -> 15        : 135 |#                                     |
           16 -> 31        : 73 |#                                     |
           32 -> 63        : 7 |#                                     |
           64 -> 127       : 0 |                                      |
          128 -> 255       : 0 |                                      |
          256 -> 511       : 0 |                                      |
          512 -> 1023      : 0 |                                      |
         1024 -> 2047      : 0 |                                      |
         2048 -> 4095      : 0 |                                      |
         4096 -> 8191      : 0 |                                      |
         8192 ->           : 0 |                                      |

./softnet.sh
cpu      total    dropped   squeezed  collision        rps flow_limit




     PerfTop:  108490 irqs/sec  kernel:99.6%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
------------------------------------------------------------------------------------------

      26.78%  [kernel]       [k] queued_spin_lock_slowpath
This is highly suspect.

I agree! -- 26.78% spent in queued_spin_lock_slowpath.  Hint: if you see
_raw_spin_lock then it is likely not a contended lock, but if you see
queued_spin_lock_slowpath in a perf report, your workload is likely in
trouble.


A call graph (perf record -a -g sleep 1; perf report --stdio)
would tell us what is going on.
perf report:
https://ufile.io/rqp0h

Thanks for the output (my 30" screen is just large enough to see the
full output).  Together with the flame-graph, it is clear that this
lock happens in the page allocator code.

Section copied out:

   mlx5e_poll_tx_cq
   |
    --16.34%--napi_consume_skb
              |
              |--12.65%--__free_pages_ok
              |          |
              |           --11.86%--free_one_page
              |                     |
              |                     |--10.10%--queued_spin_lock_slowpath
              |                     |
              |                      --0.65%--_raw_spin_lock
              |
              |--1.55%--page_frag_free
              |
               --1.44%--skb_release_data


Let me explain what (I think) happens.  The mlx5 driver RX-page recycle
mechanism is not effective in this workload, and pages have to go
through the page allocator.  The lock contention happens during the mlx5
DMA TX completion cycle.  And the page allocator cannot keep up at
these speeds.
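(One way to see this is to check how often the driver's RX page cache actually recycles pages versus falling back to the page allocator - a hypothetical check, not part of the original mail; the exact counter names depend on the mlx5 driver version:)

  ethtool -S enp175s0f0 | grep -E 'rx_cache_(reuse|full|empty|busy)'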

One solution is to extend the page allocator with a bulk free API.  (This has
been on my TODO list for a long time, but I don't have a
micro-benchmark that tricks the driver page-recycle into failing.)  It should
fit nicely, as I can see that kmem_cache_free_bulk() does get
activated (bulk freeing SKBs), which means that DMA TX completion does
have a bulk of packets.
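(A hypothetical way to confirm that the bulk path is being hit under load - not part of the original mail, and it assumes kmem_cache_free_bulk() is probeable on this kernel:)

  perf probe --add kmem_cache_free_bulk
  perf stat -a -e probe:kmem_cache_free_bulk sleep 10
  perf probe --del kmem_cache_free_bulk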

We can (and should) also improve the page recycle scheme in the driver.
After LPC, I have a project with Tariq and Ilias (Cc'ed) to improve the
page_pool, and we will (attempt to) generalize this for both high-end
mlx5 and more low-end ARM64 boards (macchiatobin and espressobin).

The MM people are working in parallel to improve the performance of
order-0 page returns.  Thus, the explicit bulk page free API might
actually become less important.  I actually think Aaron (Cc'ed) has a
patchset he would like you to test, which removes the (zone->)lock
you hit in free_one_page().

OK - thank you, Jesper.

