On 01.11.2018 at 21:37, Saeed Mahameed wrote:
On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:
On 01.11.2018 at 10:50, Saeed Mahameed wrote:
On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:
Hi
So maybe someone will be interested in how the Linux kernel handles
normal traffic (not pktgen :) )
Server HW configuration:
CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
NICs: 2x 100G Mellanox ConnectX-4 (connected to x16 PCIe 8GT)
Server software:
FRR - as routing daemon
enp175s0f0 (100G) - 16 VLANs from upstreams (28 RSS queues bound to the local NUMA node)
enp175s0f1 (100G) - 343 VLANs to clients (28 RSS queues bound to the local NUMA node)
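(For context, "bound to the local NUMA node" here means pinning the 28 channel IRQs of each port to the CPUs of the NUMA node the NIC is attached to. A minimal sketch of how that is typically done, assuming the mlnx-tools set_irq_affinity_cpulist.sh helper is available and that the node-local cpulist is 14-27,42-55 - both of these are assumptions, adjust to the actual topology:

  # which NUMA node the NIC sits on
  cat /sys/class/net/enp175s0f0/device/numa_node
  # pin all of the interface's channel IRQs to that node's CPUs
  set_irq_affinity_cpulist.sh 14-27,42-55 enp175s0f0
  set_irq_affinity_cpulist.sh 14-27,42-55 enp175s0f1
)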
Maximum traffic the server can handle:
Bandwidth
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
\ iface                    Rx              Tx           Total
==============================================================================
  enp175s0f1:       28.51 Gb/s      37.24 Gb/s      65.74 Gb/s
  enp175s0f0:       38.07 Gb/s      28.44 Gb/s      66.51 Gb/s
------------------------------------------------------------------------------
  total:            66.58 Gb/s      65.67 Gb/s     132.25 Gb/s
Packets per second:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
- iface                    Rx                 Tx                Total
==============================================================================
  enp175s0f1:      5248589.00 P/s     3486617.75 P/s     8735207.00 P/s
  enp175s0f0:      3557944.25 P/s     5232516.00 P/s     8790460.00 P/s
------------------------------------------------------------------------------
  total:           8806533.00 P/s     8719134.00 P/s    17525668.00 P/s
After reaching those limits, the NICs on the upstream side (which receive more RX traffic) start to drop packets.
I just don't understand why the server can't handle more bandwidth (~40 Gbit/s is the point where all CPUs are at 100% util) while the pps on the RX side keeps increasing.
Where do you see 40 Gb/s? You showed that both ports on the same NIC (same PCIe link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25 Gb/s, which aligns with your PCIe link limit. What am I missing?
Hmm, yes, that was my concern as well - I can't find information anywhere on whether that bandwidth figure is uni- or bidirectional. If the 126 Gbit for x16 8GT applies to one direction at a time, then with traffic flowing both ways each direction gets 126/2 ≈ 63 Gbit, which would fit the total bandwidth seen on both ports.
I think it is bidirectional.
So yes - I think we are hitting some other problem there. PCIe is most probably full duplex with a max bandwidth of 126 Gbit per direction, so RX can do 126 Gbit and at the same time TX should also do 126 Gbit.
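For reference, a back-of-the-envelope calculation of the raw PCIe Gen3 x16 limit, assuming 8 GT/s per lane and 128b/130b encoding (protocol overhead not included):

  per lane:       8 GT/s * 128/130   ~ 7.88 Gbit/s
  per direction:  7.88 Gbit/s * 16   ~ 126 Gbit/s
  full duplex:    ~126 Gbit/s RX and ~126 Gbit/s TX at the same time

Effective throughput is lower once TLP headers and completion/descriptor traffic are accounted for.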
Maybe this also explains why the CPU load rises rapidly when going from 120 Gbit/s total to 132 Gbit (the bwm-ng counters come from /proc/net/dev, so there can be some error in the readings when offloading (GRO/GSO/TSO) is enabled on the NICs).
I was thinking that maybe I had reached some PCIe x16 limit - but x16 8GT is 126 Gbit - and also, when testing with pktgen, I can reach more bandwidth and pps (around 4x more compared to normal internet traffic).
Are you forwarding when using pktgen as well, or are you just testing the RX-side pps?
Yes, pktgen was tested on a single port, RX only.
I can also check forwarding to rule out PCIe limits.
So this explains why you get more RX pps: since TX is idle, the PCIe link is free to do only RX.
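As a sanity check, the negotiated link speed and width can be read from lspci (a sketch; af:00.0 is only a guess at the ConnectX-4 address based on the enp175s0f0 name, substitute the real one from lspci | grep Mellanox):

  lspci -vv -s af:00.0 | grep -E 'LnkCap|LnkSta'

It should report Speed 8GT/s, Width x16; a downtrained link (x8 or 5GT/s) would halve the budget calculated above.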
[...]
ethtool -S enp175s0f1
NIC statistics:
rx_packets: 173730800927
rx_bytes: 99827422751332
tx_packets: 142532009512
tx_bytes: 184633045911222
tx_tso_packets: 25989113891
tx_tso_bytes: 132933363384458
tx_tso_inner_packets: 0
tx_tso_inner_bytes: 0
tx_added_vlan_packets: 74630239613
tx_nop: 2029817748
rx_lro_packets: 0
rx_lro_bytes: 0
rx_ecn_mark: 0
rx_removed_vlan_packets: 173730800927
rx_csum_unnecessary: 0
rx_csum_none: 434357
rx_csum_complete: 173730366570
rx_csum_unnecessary_inner: 0
rx_xdp_drop: 0
rx_xdp_redirect: 0
rx_xdp_tx_xmit: 0
rx_xdp_tx_full: 0
rx_xdp_tx_err: 0
rx_xdp_tx_cqe: 0
tx_csum_none: 38260960853
tx_csum_partial: 36369278774
tx_csum_partial_inner: 0
tx_queue_stopped: 1
tx_queue_dropped: 0
tx_xmit_more: 748638099
tx_recover: 0
tx_cqes: 73881645031
tx_queue_wake: 1
tx_udp_seg_rem: 0
tx_cqe_err: 0
tx_xdp_xmit: 0
tx_xdp_full: 0
tx_xdp_err: 0
tx_xdp_cqes: 0
rx_wqe_err: 0
rx_mpwqe_filler_cqes: 0
rx_mpwqe_filler_strides: 0
rx_buff_alloc_err: 0
rx_cqe_compress_blks: 0
rx_cqe_compress_pkts: 0
If this is a PCIe bottleneck, it might be useful to enable CQE compression (to reduce PCIe completion-descriptor transactions). You should see the rx_cqe_compress_pkts counter above increasing when it is enabled.
$ ethtool --set-priv-flags enp175s0f1 rx_cqe_compress on
$ ethtool --show-priv-flags enp175s0f1
Private flags for p6p1:
rx_cqe_moder : on
cqe_moder : off
rx_cqe_compress : on
...
try this on both interfaces.
Done
ethtool --show-priv-flags enp175s0f1
Private flags for enp175s0f1:
rx_cqe_moder : on
tx_cqe_moder : off
rx_cqe_compress : on
rx_striding_rq : off
rx_no_csum_complete: off
ethtool --show-priv-flags enp175s0f0
Private flags for enp175s0f0:
rx_cqe_moder : on
tx_cqe_moder : off
rx_cqe_compress : on
rx_striding_rq : off
rx_no_csum_complete: off
Did it help reduce the load on the PCIe link? Do you see more pps?
What is the ratio between rx_cqe_compress_pkts and the overall RX packets?
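A quick way to eyeball that ratio, as a sketch (counter names taken from the ethtool -S output above; both counters are cumulative since driver load, so deltas over a fixed window are more meaningful):

  ethtool -S enp175s0f1 | awk '/rx_cqe_compress_pkts:/ {c=$2} /^ *rx_packets:/ {p=$2} END {printf "compressed: %.1f%%\n", 100*c/p}'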
[...]
ethtool -S enp175s0f0
NIC statistics:
rx_packets: 141574897253
rx_bytes: 184445040406258
tx_packets: 172569543894
tx_bytes: 99486882076365
tx_tso_packets: 9367664195
tx_tso_bytes: 56435233992948
tx_tso_inner_packets: 0
tx_tso_inner_bytes: 0
tx_added_vlan_packets: 141297671626
tx_nop: 2102916272
rx_lro_packets: 0
rx_lro_bytes: 0
rx_ecn_mark: 0
rx_removed_vlan_packets: 141574897252
rx_csum_unnecessary: 0
rx_csum_none: 23135854
rx_csum_complete: 141551761398
rx_csum_unnecessary_inner: 0
rx_xdp_drop: 0
rx_xdp_redirect: 0
rx_xdp_tx_xmit: 0
rx_xdp_tx_full: 0
rx_xdp_tx_err: 0
rx_xdp_tx_cqe: 0
tx_csum_none: 127934791664
It is a good idea to look into this: TX is not requesting HW checksumming for a lot of packets. Maybe you are wasting a lot of CPU on calculating checksums, or maybe these are just the rx_csum_complete packets being forwarded.
tx_csum_partial: 13362879974
tx_csum_partial_inner: 0
tx_queue_stopped: 232561
TX queues are stalling, which could be an indication of the PCIe bottleneck.
tx_queue_dropped: 0
tx_xmit_more: 1266021946
tx_recover: 0
tx_cqes: 140031716469
tx_queue_wake: 232561
tx_udp_seg_rem: 0
tx_cqe_err: 0
tx_xdp_xmit: 0
tx_xdp_full: 0
tx_xdp_err: 0
tx_xdp_cqes: 0
rx_wqe_err: 0
rx_mpwqe_filler_cqes: 0
rx_mpwqe_filler_strides: 0
rx_buff_alloc_err: 0
rx_cqe_compress_blks: 0
rx_cqe_compress_pkts: 0
rx_page_reuse: 0
rx_cache_reuse: 16625975793
rx_cache_full: 54161465914
rx_cache_empty: 258048
rx_cache_busy: 54161472735
rx_cache_waive: 0
rx_congst_umr: 0
rx_arfs_err: 0
ch_events: 40572621887
ch_poll: 40885650979
ch_arm: 40429276692
ch_aff_change: 0
ch_eq_rearm: 0
rx_out_of_buffer: 2791690
rx_if_down_packets: 74
rx_vport_unicast_packets: 141843476308
rx_vport_unicast_bytes: 185421265403318
tx_vport_unicast_packets: 172569484005
tx_vport_unicast_bytes: 100019940094298
rx_vport_multicast_packets: 85122935
rx_vport_multicast_bytes: 5761316431
tx_vport_multicast_packets: 6452
tx_vport_multicast_bytes: 643540
rx_vport_broadcast_packets: 22423624
rx_vport_broadcast_bytes: 1390127090
tx_vport_broadcast_packets: 22024
tx_vport_broadcast_bytes: 1321440
rx_vport_rdma_unicast_packets: 0
rx_vport_rdma_unicast_bytes: 0
tx_vport_rdma_unicast_packets: 0
tx_vport_rdma_unicast_bytes: 0
rx_vport_rdma_multicast_packets: 0
rx_vport_rdma_multicast_bytes: 0
tx_vport_rdma_multicast_packets: 0
tx_vport_rdma_multicast_bytes: 0
tx_packets_phy: 172569501577
rx_packets_phy: 142871314588
rx_crc_errors_phy: 0
tx_bytes_phy: 100710212814151
rx_bytes_phy: 187209224289564
tx_multicast_phy: 6452
tx_broadcast_phy: 22024
rx_multicast_phy: 85122933
rx_broadcast_phy: 22423623
rx_in_range_len_errors_phy: 2
rx_out_of_range_len_phy: 0
rx_oversize_pkts_phy: 0
rx_symbol_err_phy: 0
tx_mac_control_phy: 0
rx_mac_control_phy: 0
rx_unsupported_op_phy: 0
rx_pause_ctrl_phy: 0
tx_pause_ctrl_phy: 0
rx_discards_phy: 920161423
OK, this port seems to be suffering more; RX is congested, maybe due to the PCIe bottleneck.
Yes, this side is receiving more traffic - the second port does about 10 Gbit/s more TX.
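To see whether that congestion keeps growing under load, the two relevant drop counters can be watched directly (a sketch; as I understand these counters, rx_discards_phy counts packets dropped at the port for lack of buffers and rx_out_of_buffer counts packets dropped because the RX rings had no free descriptors):

  watch -d -n 1 "ethtool -S enp175s0f0 | grep -E 'rx_discards_phy|rx_out_of_buffer'"

Both increasing together points at the host side (PCIe/CPU/ring sizing) rather than the wire.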
[...]
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:      17    0.00    0.00   16.60    0.00    0.00   52.10    0.00    0.00    0.00   31.30
Average:      18    0.00    0.00   13.90    0.00    0.00   61.20    0.00    0.00    0.00   24.90
Average:      19    0.00    0.00    9.99    0.00    0.00   70.33    0.00    0.00    0.00   19.68
Average:      20    0.00    0.00    9.00    0.00    0.00   73.00    0.00    0.00    0.00   18.00
Average:      21    0.00    0.00    8.70    0.00    0.00   73.90    0.00    0.00    0.00   17.40
Average:      22    0.00    0.00   15.42    0.00    0.00   58.56    0.00    0.00    0.00   26.03
Average:      23    0.00    0.00   10.81    0.00    0.00   71.67    0.00    0.00    0.00   17.52
Average:      24    0.00    0.00   10.00    0.00    0.00   71.80    0.00    0.00    0.00   18.20
Average:      25    0.00    0.00   11.19    0.00    0.00   71.13    0.00    0.00    0.00   17.68
Average:      26    0.00    0.00   11.00    0.00    0.00   70.80    0.00    0.00    0.00   18.20
Average:      27    0.00    0.00   10.01    0.00    0.00   69.57    0.00    0.00    0.00   20.42
The NUMA-local cores are not at 100% util; you have around 20% idle on each one.
Yes - not 100% CPU - but the difference between 80% and 100% is only enough to push an additional 1-2 Gbit/s.
Yes, but it doesn't look like the bottleneck is the CPU, although it is close to being one :)..
Average:      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      29    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      32    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      33    0.00    0.00    3.90    0.00    0.00    0.00    0.00    0.00    0.00   96.10
Average:      34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      35    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      36    0.10    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.70
Average:      37    0.20    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   99.50
Average:      38    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      39    0.00    0.00    2.60    0.00    0.00    0.00    0.00    0.00    0.00   97.40
Average:      40    0.00    0.00    0.90    0.00    0.00    0.00    0.00    0.00    0.00   99.10
Average:      41    0.10    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   99.40
Average:      42    0.00    0.00    9.91    0.00    0.00   70.67    0.00    0.00    0.00   19.42
Average:      43    0.00    0.00   15.90    0.00    0.00   57.50    0.00    0.00    0.00   26.60
Average:      44    0.00    0.00   12.20    0.00    0.00   66.20    0.00    0.00    0.00   21.60
Average:      45    0.00    0.00   12.00    0.00    0.00   67.50    0.00    0.00    0.00   20.50
Average:      46    0.00    0.00   12.90    0.00    0.00   65.50    0.00    0.00    0.00   21.60
Average:      47    0.00    0.00   14.59    0.00    0.00   60.84    0.00    0.00    0.00   24.58
Average:      48    0.00    0.00   13.59    0.00    0.00   61.74    0.00    0.00    0.00   24.68
Average:      49    0.00    0.00   18.36    0.00    0.00   53.29    0.00    0.00    0.00   28.34
Average:      50    0.00    0.00   15.32    0.00    0.00   58.86    0.00    0.00    0.00   25.83
Average:      51    0.00    0.00   17.60    0.00    0.00   55.20    0.00    0.00    0.00   27.20
Average:      52    0.00    0.00   15.92    0.00    0.00   56.06    0.00    0.00    0.00   28.03
Average:      53    0.00    0.00   13.00    0.00    0.00   62.30    0.00    0.00    0.00   24.70
Average:      54    0.00    0.00   13.20    0.00    0.00   61.50    0.00    0.00    0.00   25.30
Average:      55    0.00    0.00   14.59    0.00    0.00   58.64    0.00    0.00    0.00   26.77
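For reference, a per-CPU breakdown like the one above can be collected with sysstat's mpstat, e.g. sampling every second for ten seconds and printing the averages:

  mpstat -P ALL 1 10

The high %soft on the NUMA-local cores is softirq time, i.e. mostly the NAPI poll and forwarding path, while the fully idle cores (28-41) are the remote node the IRQs are not pinned to.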
ethtool -k enp175s0f0
Features for enp175s0f0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
ethtool -c enp175s0f0
Coalesce parameters for enp175s0f0:
Adaptive RX: off TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32703
rx-usecs: 256
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
ethtool -g enp175s0f0
Ring parameters for enp175s0f0:
Pre-set maximums:
RX: 8192
RX Mini: 0
RX Jumbo: 0
TX: 8192
Current hardware settings:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
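Since the RX side is the one dropping, the rings could also be pushed toward the preset maximum shown above (a sketch using the standard ethtool -G option; larger rings absorb bursts at the cost of memory and cache footprint, so it is worth measuring drops before and after):

  ethtool -G enp175s0f0 rx 8192 tx 8192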
I also changed the coalesce params a little - the best values for this config are:
ethtool -c enp175s0f0
Coalesce parameters for enp175s0f0:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32573
rx-usecs: 40
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 8
tx-frames: 8
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
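For reference, those values would be applied with something like this (standard ethtool -C options, run on both ports):

  ethtool -C enp175s0f0 adaptive-rx off adaptive-tx off rx-usecs 40 rx-frames 128 tx-usecs 8 tx-frames 8
  ethtool -C enp175s0f1 adaptive-rx off adaptive-tx off rx-usecs 40 rx-frames 128 tx-usecs 8 tx-frames 8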
Fewer drops on the RX side - and more pps forwarded overall.
How much improvement? Maybe we can improve our adaptive RX coalescing to be efficient for this workload.
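One way to quantify the improvement, as a sketch: run the same traffic for a fixed window under each setting and compare the deltas of forwarded packets and drops, e.g.

  ethtool -C enp175s0f0 adaptive-rx on
  sleep 60; ethtool -S enp175s0f0 | grep -E 'rx_packets:|rx_discards_phy'
  ethtool -C enp175s0f0 adaptive-rx off rx-usecs 40 rx-frames 128
  sleep 60; ethtool -S enp175s0f0 | grep -E 'rx_packets:|rx_discards_phy'

That would also give concrete numbers for tuning the adaptive coalescing algorithm against this forwarding workload.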