On 01.11.2018 at 21:37, Saeed Mahameed wrote:
On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:
On 01.11.2018 at 10:50, Saeed Mahameed wrote:
On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:
Hi
So maybe someone will be interested in how the Linux kernel handles
normal traffic (not pktgen :) )
Server HW configuration:
CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
NICs: 2x 100G Mellanox ConnectX-4 (connected to x16 PCIe 8GT)
Server software:
FRR - as routing daemon
enp175s0f0 (100G) - 16 VLANs from upstreams (28 RSS queues bound to the local NUMA node)
enp175s0f1 (100G) - 343 VLANs to clients (28 RSS queues bound to the local NUMA node)
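(For context, "bound to the local NUMA node" here means pinning the 28 channel IRQs of each port to the CPUs of the NUMA node the NIC is attached to. A minimal sketch of how that is typically done, assuming the mlnx-tools set_irq_affinity_cpulist.sh helper is available and that the node-local cpulist is 14-27,42-55 - both of these are assumptions, adjust to the actual topology:

  # which NUMA node the NIC sits on
  cat /sys/class/net/enp175s0f0/device/numa_node
  # pin all of the interface's channel IRQs to that node's CPUs
  set_irq_affinity_cpulist.sh 14-27,42-55 enp175s0f0
  set_irq_affinity_cpulist.sh 14-27,42-55 enp175s0f1
)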
Maximum traffic the server can handle:
Bandwidth
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
\ iface                    Rx              Tx           Total
==============================================================================
  enp175s0f1:       28.51 Gb/s      37.24 Gb/s      65.74 Gb/s
  enp175s0f0:       38.07 Gb/s      28.44 Gb/s      66.51 Gb/s
------------------------------------------------------------------------------
  total:            66.58 Gb/s      65.67 Gb/s     132.25 Gb/s
Packets per second:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
- iface                    Rx                 Tx                Total
==============================================================================
  enp175s0f1:      5248589.00 P/s     3486617.75 P/s     8735207.00 P/s
  enp175s0f0:      3557944.25 P/s     5232516.00 P/s     8790460.00 P/s
------------------------------------------------------------------------------
  total:           8806533.00 P/s     8719134.00 P/s    17525668.00 P/s
After reaching those limits, the NICs on the upstream side (which receive more RX traffic) start to drop packets.
I just don't understand why the server can't handle more bandwidth (~40 Gbit/s is the point where all CPUs are at 100% util) while the pps on the RX side keeps increasing.
Where do you see 40 Gb/s? You showed that both ports on the same NIC (same PCIe link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25 Gb/s, which aligns with your PCIe link limit. What am I missing?
Hmm, yes, that was my concern as well - I can't find information anywhere on whether that bandwidth figure is uni- or bidirectional. If the 126 Gbit for x16 8GT applies to one direction at a time, then with traffic flowing both ways each direction gets 126/2 ≈ 63 Gbit, which would fit the total bandwidth seen on both ports.
I think it is bidirectional.
So yes - I think we are hitting some other problem there. PCIe is most probably full duplex with a max bandwidth of 126 Gbit per direction, so RX can do 126 Gbit and at the same time TX should also do 126 Gbit.
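For reference, a back-of-the-envelope calculation of the raw PCIe Gen3 x16 limit, assuming 8 GT/s per lane and 128b/130b encoding (protocol overhead not included):

  per lane:       8 GT/s * 128/130   ~ 7.88 Gbit/s
  per direction:  7.88 Gbit/s * 16   ~ 126 Gbit/s
  full duplex:    ~126 Gbit/s RX and ~126 Gbit/s TX at the same time

Effective throughput is lower once TLP headers and completion/descriptor traffic are accounted for.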
Maybe this also explains why the CPU load rises rapidly when going from 120 Gbit/s total to 132 Gbit (the bwm-ng counters come from /proc/net/dev, so there can be some error in the readings when offloading (GRO/GSO/TSO) is enabled on the NICs).
I was thinking that maybe I had reached some PCIe x16 limit - but x16 8GT is 126 Gbit - and also, when testing with pktgen, I can reach more bandwidth and pps (around 4x more compared to normal internet traffic).
Are you forwarding when using pktgen as well, or are you just testing the RX-side pps?
Yes, pktgen was tested on a single port, RX only.
I can also check forwarding to rule out PCIe limits.
So this explains why you get more RX pps: since TX is idle, the PCIe link is free to do only RX.
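As a sanity check, the negotiated link speed and width can be read from lspci (a sketch; af:00.0 is only a guess at the ConnectX-4 address based on the enp175s0f0 name, substitute the real one from lspci | grep Mellanox):

  lspci -vv -s af:00.0 | grep -E 'LnkCap|LnkSta'

It should report Speed 8GT/s, Width x16; a downtrained link (x8 or 5GT/s) would halve the budget calculated above.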
[...]
ethtool -S enp175s0f1
NIC statistics:
rx_packets: 173730800927
rx_bytes: 99827422751332
tx_packets: 142532009512
tx_bytes: 184633045911222
tx_tso_packets: 25989113891
tx_tso_bytes: 132933363384458
tx_tso_inner_packets: 0
tx_tso_inner_bytes: 0
tx_added_vlan_packets: 74630239613
tx_nop: 2029817748
rx_lro_packets: 0
rx_lro_bytes: 0
rx_ecn_mark: 0
rx_removed_vlan_packets: 173730800927
rx_csum_unnecessary: 0
rx_csum_none: 434357
rx_csum_complete: 173730366570
rx_csum_unnecessary_inner: 0
rx_xdp_drop: 0
rx_xdp_redirect: 0
rx_xdp_tx_xmit: 0
rx_xdp_tx_full: 0
rx_xdp_tx_err: 0
rx_xdp_tx_cqe: 0
tx_csum_none: 38260960853
tx_csum_partial: 36369278774
tx_csum_partial_inner: 0
tx_queue_stopped: 1
tx_queue_dropped: 0
tx_xmit_more: 748638099
tx_recover: 0
tx_cqes: 73881645031
tx_queue_wake: 1
tx_udp_seg_rem: 0
tx_cqe_err: 0
tx_xdp_xmit: 0
tx_xdp_full: 0
tx_xdp_err: 0
tx_xdp_cqes: 0
rx_wqe_err: 0
rx_mpwqe_filler_cqes: 0
rx_mpwqe_filler_strides: 0
rx_buff_alloc_err: 0
rx_cqe_compress_blks: 0
rx_cqe_compress_pkts: 0
If this is a PCIe bottleneck, it might be useful to enable CQE compression (to reduce PCIe completion-descriptor transactions). You should see the rx_cqe_compress_pkts counter above increasing when it is enabled.
$ ethtool --set-priv-flags enp175s0f1 rx_cqe_compress on
$ ethtool --show-priv-flags enp175s0f1
Private flags for p6p1:
rx_cqe_moder : on
cqe_moder : off
rx_cqe_compress : on
...
try this on both interfaces.
Done
ethtool --show-priv-flags enp175s0f1
Private flags for enp175s0f1:
rx_cqe_moder : on
tx_cqe_moder : off
rx_cqe_compress : on
rx_striding_rq : off
rx_no_csum_complete: off
ethtool --show-priv-flags enp175s0f0
Private flags for enp175s0f0:
rx_cqe_moder : on
tx_cqe_moder : off
rx_cqe_compress : on
rx_striding_rq : off
rx_no_csum_complete: off
Did it help reduce the load on the PCIe link? Do you see more pps?
What is the ratio between rx_cqe_compress_pkts and the overall RX packets?
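A quick way to eyeball that ratio, as a sketch (counter names taken from the ethtool -S output above; both counters are cumulative since driver load, so deltas over a fixed window are more meaningful):

  ethtool -S enp175s0f1 | awk '/rx_cqe_compress_pkts:/ {c=$2} /^ *rx_packets:/ {p=$2} END {printf "compressed: %.1f%%\n", 100*c/p}'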
[...]
ethtool -S enp175s0f0
NIC statistics:
rx_packets: 141574897253
rx_bytes: 184445040406258
tx_packets: 172569543894
tx_bytes: 99486882076365
tx_tso_packets: 9367664195
tx_tso_bytes: 56435233992948
tx_tso_inner_packets: 0
tx_tso_inner_bytes: 0
tx_added_vlan_packets: 141297671626
tx_nop: 2102916272
rx_lro_packets: 0
rx_lro_bytes: 0
rx_ecn_mark: 0
rx_removed_vlan_packets: 141574897252
rx_csum_unnecessary: 0
rx_csum_none: 23135854
rx_csum_complete: 141551761398
rx_csum_unnecessary_inner: 0
rx_xdp_drop: 0
rx_xdp_redirect: 0
rx_xdp_tx_xmit: 0
rx_xdp_tx_full: 0
rx_xdp_tx_err: 0
rx_xdp_tx_cqe: 0
tx_csum_none: 127934791664
It is a good idea to look into this: TX is not requesting HW checksumming for a lot of packets. Maybe you are wasting a lot of CPU on calculating checksums, or maybe these are just the rx_csum_complete packets being forwarded.
tx_csum_partial: 13362879974
tx_csum_partial_inner: 0
tx_queue_stopped: 232561
TX queues are stalling, which could be an indication of the PCIe bottleneck.
tx_queue_dropped: 0
tx_xmit_more: 1266021946
tx_recover: 0
tx_cqes: 140031716469
tx_queue_wake: 232561
tx_udp_seg_rem: 0
tx_cqe_err: 0
tx_xdp_xmit: 0
tx_xdp_full: 0
tx_xdp_err: 0
tx_xdp_cqes: 0
rx_wqe_err: 0
rx_mpwqe_filler_cqes: 0
rx_mpwqe_filler_strides: 0
rx_buff_alloc_err: 0
rx_cqe_compress_blks: 0
rx_cqe_compress_pkts: 0
rx_page_reuse: 0
rx_cache_reuse: 16625975793
rx_cache_full: 54161465914
rx_cache_empty: 258048
rx_cache_busy: 54161472735
rx_cache_waive: 0
rx_congst_umr: 0
rx_arfs_err: 0
ch_events: 40572621887
ch_poll: 40885650979
ch_arm: 40429276692
ch_aff_change: 0
ch_eq_rearm: 0
rx_out_of_buffer: 2791690
rx_if_down_packets: 74
rx_vport_unicast_packets: 141843476308
rx_vport_unicast_bytes: 185421265403318
tx_vport_unicast_packets: 172569484005
tx_vport_unicast_bytes: 100019940094298
rx_vport_multicast_packets: 85122935
rx_vport_multicast_bytes: 5761316431
tx_vport_multicast_packets: 6452
tx_vport_multicast_bytes: 643540
rx_vport_broadcast_packets: 22423624
rx_vport_broadcast_bytes: 1390127090
tx_vport_broadcast_packets: 22024
tx_vport_broadcast_bytes: 1321440
rx_vport_rdma_unicast_packets: 0
rx_vport_rdma_unicast_bytes: 0
tx_vport_rdma_unicast_packets: 0
tx_vport_rdma_unicast_bytes: 0
rx_vport_rdma_multicast_packets: 0
rx_vport_rdma_multicast_bytes: 0
tx_vport_rdma_multicast_packets: 0
tx_vport_rdma_multicast_bytes: 0
tx_packets_phy: 172569501577
rx_packets_phy: 142871314588
rx_crc_errors_phy: 0
tx_bytes_phy: 100710212814151
rx_bytes_phy: 187209224289564
tx_multicast_phy: 6452
tx_broadcast_phy: 22024
rx_multicast_phy: 85122933
rx_broadcast_phy: 22423623
rx_in_range_len_errors_phy: 2
rx_out_of_range_len_phy: 0
rx_oversize_pkts_phy: 0
rx_symbol_err_phy: 0
tx_mac_control_phy: 0
rx_mac_control_phy: 0
rx_unsupported_op_phy: 0
rx_pause_ctrl_phy: 0
tx_pause_ctrl_phy: 0
rx_discards_phy: 920161423
OK, this port seems to be suffering more; RX is congested, maybe due to the PCIe bottleneck.
Yes, this side is receiving more traffic - the second port does about 10 Gbit/s more TX.
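To see whether that congestion keeps growing under load, the two relevant drop counters can be watched directly (a sketch; as I understand these counters, rx_discards_phy counts packets dropped at the port for lack of buffers and rx_out_of_buffer counts packets dropped because the RX rings had no free descriptors):

  watch -d -n 1 "ethtool -S enp175s0f0 | grep -E 'rx_discards_phy|rx_out_of_buffer'"

Both increasing together points at the host side (PCIe/CPU/ring sizing) rather than the wire.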
[...]
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:      17    0.00    0.00   16.60    0.00    0.00   52.10    0.00    0.00    0.00   31.30
Average:      18    0.00    0.00   13.90    0.00    0.00   61.20    0.00    0.00    0.00   24.90
Average:      19    0.00    0.00    9.99    0.00    0.00   70.33    0.00    0.00    0.00   19.68
Average:      20    0.00    0.00    9.00    0.00    0.00   73.00    0.00    0.00    0.00   18.00
Average:      21    0.00    0.00    8.70    0.00    0.00   73.90    0.00    0.00    0.00   17.40
Average:      22    0.00    0.00   15.42    0.00    0.00   58.56    0.00    0.00    0.00   26.03
Average:      23    0.00    0.00   10.81    0.00    0.00   71.67    0.00    0.00    0.00   17.52
Average:      24    0.00    0.00   10.00    0.00    0.00   71.80    0.00    0.00    0.00   18.20
Average:      25    0.00    0.00   11.19    0.00    0.00   71.13    0.00    0.00    0.00   17.68
Average:      26    0.00    0.00   11.00    0.00    0.00   70.80    0.00    0.00    0.00   18.20
Average:      27    0.00    0.00   10.01    0.00    0.00   69.57    0.00    0.00    0.00   20.42
The NUMA-local cores are not at 100% util; you have around 20% idle on each one.
Yes - not 100% CPU - but the difference between 80% and 100% is only enough to push an additional 1-2 Gbit/s.
Yes, but it doesn't look like the bottleneck is the CPU, although it is close to being one :)..
Average:      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      29    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      32    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      33    0.00    0.00    3.90    0.00    0.00    0.00    0.00    0.00    0.00   96.10
Average:      34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      35    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      36    0.10    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.70
Average:      37    0.20    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   99.50
Average:      38    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      39    0.00    0.00    2.60    0.00    0.00    0.00    0.00    0.00    0.00   97.40
Average:      40    0.00    0.00    0.90    0.00    0.00    0.00    0.00    0.00    0.00   99.10
Average:      41    0.10    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   99.40
Average:      42    0.00    0.00    9.91    0.00    0.00   70.67    0.00    0.00    0.00   19.42
Average:      43    0.00    0.00   15.90    0.00    0.00   57.50    0.00    0.00    0.00   26.60
Average:      44    0.00    0.00   12.20    0.00    0.00   66.20    0.00    0.00    0.00   21.60
Average:      45    0.00    0.00   12.00    0.00    0.00   67.50    0.00    0.00    0.00   20.50
Average:      46    0.00    0.00   12.90    0.00    0.00   65.50    0.00    0.00    0.00   21.60
Average:      47    0.00    0.00   14.59    0.00    0.00   60.84    0.00    0.00    0.00   24.58
Average:      48    0.00    0.00   13.59    0.00    0.00   61.74    0.00    0.00    0.00   24.68
Average:      49    0.00    0.00   18.36    0.00    0.00   53.29    0.00    0.00    0.00   28.34
Average:      50    0.00    0.00   15.32    0.00    0.00   58.86    0.00    0.00    0.00   25.83
Average:      51    0.00    0.00   17.60    0.00    0.00   55.20    0.00    0.00    0.00   27.20
Average:      52    0.00    0.00   15.92    0.00    0.00   56.06    0.00    0.00    0.00   28.03
Average:      53    0.00    0.00   13.00    0.00    0.00   62.30    0.00    0.00    0.00   24.70
Average:      54    0.00    0.00   13.20    0.00    0.00   61.50    0.00    0.00    0.00   25.30
Average:      55    0.00    0.00   14.59    0.00    0.00   58.64    0.00    0.00    0.00   26.77
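For reference, a per-CPU breakdown like the one above can be collected with sysstat's mpstat, e.g. sampling every second for ten seconds and printing the averages:

  mpstat -P ALL 1 10

The high %soft on the NUMA-local cores is softirq time, i.e. mostly the NAPI poll and forwarding path, while the fully idle cores (28-41) are the remote node the IRQs are not pinned to.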
ethtool -k enp175s0f0
Features for enp175s0f0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
ethtool -c enp175s0f0
Coalesce parameters for enp175s0f0:
Adaptive RX: off TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32703
rx-usecs: 256
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
ethtool -g enp175s0f0
Ring parameters for enp175s0f0:
Pre-set maximums:
RX: 8192
RX Mini: 0
RX Jumbo: 0
TX: 8192
Current hardware settings:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
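Since the RX side is the one dropping, the rings could also be pushed toward the preset maximum shown above (a sketch using the standard ethtool -G option; larger rings absorb bursts at the cost of memory and cache footprint, so it is worth measuring drops before and after):

  ethtool -G enp175s0f0 rx 8192 tx 8192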
I also changed the coalesce params a little - the best values for this config are:
ethtool -c enp175s0f0
Coalesce parameters for enp175s0f0:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32573
rx-usecs: 40
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 8
tx-frames: 8
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
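For reference, those values would be applied with something like this (standard ethtool -C options, run on both ports):

  ethtool -C enp175s0f0 adaptive-rx off adaptive-tx off rx-usecs 40 rx-frames 128 tx-usecs 8 tx-frames 8
  ethtool -C enp175s0f1 adaptive-rx off adaptive-tx off rx-usecs 40 rx-frames 128 tx-usecs 8 tx-frames 8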
Fewer drops on the RX side - and more pps forwarded overall.
How much improvement? Maybe we can improve our adaptive RX coalescing to be efficient for this workload.
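One way to quantify the improvement, as a sketch: run the same traffic for a fixed window under each setting and compare the deltas of forwarded packets and drops, e.g.

  ethtool -C enp175s0f0 adaptive-rx on
  sleep 60; ethtool -S enp175s0f0 | grep -E 'rx_packets:|rx_discards_phy'
  ethtool -C enp175s0f0 adaptive-rx off rx-usecs 40 rx-frames 128
  sleep 60; ethtool -S enp175s0f0 | grep -E 'rx_packets:|rx_discards_phy'

That would also give concrete numbers for tuning the adaptive coalescing algorithm against this forwarding workload.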