On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:
> 
> On 01.11.2018 at 10:50, Saeed Mahameed wrote:
> > On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:
> > > Hi
> > > 
> > > So maybe someone will be interested in how the Linux kernel handles
> > > normal traffic (not pktgen :) )
> > > 
> > > Server HW configuration:
> > > 
> > > CPU: Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> > > 
> > > NICs: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
> > > 
> > > Server software:
> > > 
> > > FRR - as routing daemon
> > > 
> > > enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to
> > > the local numa node)
> > > 
> > > enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to
> > > the local numa node)
> > > 
> > > Maximum traffic that the server can handle:
> > > 
> > > Bandwidth
> > > 
> > > bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> > >   input: /proc/net/dev type: rate
> > >   \ iface              Rx            Tx           Total
> > >   ==========================================================
> > >   enp175s0f1:    28.51 Gb/s    37.24 Gb/s     65.74 Gb/s
> > >   enp175s0f0:    38.07 Gb/s    28.44 Gb/s     66.51 Gb/s
> > >   ----------------------------------------------------------
> > >   total:         66.58 Gb/s    65.67 Gb/s    132.25 Gb/s
> > > 
> > > Packets per second:
> > > 
> > > bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> > >   input: /proc/net/dev type: rate
> > >   - iface                 Rx                Tx               Total
> > >   ==================================================================
> > >   enp175s0f1:    5248589.00 P/s    3486617.75 P/s    8735207.00 P/s
> > >   enp175s0f0:    3557944.25 P/s    5232516.00 P/s    8790460.00 P/s
> > >   ------------------------------------------------------------------
> > >   total:         8806533.00 P/s    8719134.00 P/s   17525668.00 P/s
> > > 
> > > After reaching those limits, the NICs on the upstream side (more RX
> > > traffic) start to drop packets.
> > > 
> > > I just don't understand why the server can't handle more bandwidth
> > > (~40Gbit/s is the limit where all CPUs are at 100% util) - when the
> > > pps on the RX side increases.
> > > 
> > 
> > Where do you see 40 Gb/s ? You showed that both ports on the same NIC (
> > same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25
> > Gb/s, which aligns with your pcie link limit, so what am I missing ?
> 
> hmm yes that was my concern also - because I can't find information
> anywhere on whether that bandwidth is uni- or bidirectional - so if
> 126Gbit for x16 8GT is unidir - then bidir will be 126/2 ~63Gbit -
> which would fit the total bw on both ports
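(As a side note on the 126Gbit figure: that is just the raw arithmetic for a
PCIe 3.0 x16 link - 8 GT/s per lane, 16 lanes, 128b/130b encoding - and says
nothing by itself about how it is shared between directions under this
workload, which is exactly the open question here. A minimal back-of-envelope
check, assuming those spec values:

    # raw PCIe 3.0 x16 payload bandwidth, before descriptor/doorbell overhead
    echo "8 * 16 * 128 / 130" | bc -l    # ~126.03 Gbit/s

Real forwarded traffic also spends link capacity on completion descriptors
and doorbells, so the usable number is lower than this in practice.)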
I think it is bidir
> This can maybe also explain why the cpu load is rising so rapidly
> between 120Gbit/s in total and 132Gbit (the bwm-ng counters are from
> /proc/net/dev - so there can be some error in reading them when
> offloading (gro/gso/tso) is enabled on the NICs)
> 
> > > I was thinking that maybe I reached some pcie x16 limit - but x16
> > > 8GT is 126Gbit - and also when testing with pktgen I can reach more
> > > bw and pps (like 4x more compared to normal internet traffic)
> > 
> > Are you forwarding when using pktgen as well, or are you just testing
> > the RX side pps ?
> 
> Yes, pktgen was tested on a single port, RX only.
> I can also check forwarding, to eliminate the pcie limits.

So this explains why you have more RX pps, since TX is idle and the pcie
link is free to do only RX.

[...]

> > > ethtool -S enp175s0f1
> > > NIC statistics:
> > >      rx_packets: 173730800927
> > >      rx_bytes: 99827422751332
> > >      tx_packets: 142532009512
> > >      tx_bytes: 184633045911222
> > >      tx_tso_packets: 25989113891
> > >      tx_tso_bytes: 132933363384458
> > >      tx_tso_inner_packets: 0
> > >      tx_tso_inner_bytes: 0
> > >      tx_added_vlan_packets: 74630239613
> > >      tx_nop: 2029817748
> > >      rx_lro_packets: 0
> > >      rx_lro_bytes: 0
> > >      rx_ecn_mark: 0
> > >      rx_removed_vlan_packets: 173730800927
> > >      rx_csum_unnecessary: 0
> > >      rx_csum_none: 434357
> > >      rx_csum_complete: 173730366570
> > >      rx_csum_unnecessary_inner: 0
> > >      rx_xdp_drop: 0
> > >      rx_xdp_redirect: 0
> > >      rx_xdp_tx_xmit: 0
> > >      rx_xdp_tx_full: 0
> > >      rx_xdp_tx_err: 0
> > >      rx_xdp_tx_cqe: 0
> > >      tx_csum_none: 38260960853
> > >      tx_csum_partial: 36369278774
> > >      tx_csum_partial_inner: 0
> > >      tx_queue_stopped: 1
> > >      tx_queue_dropped: 0
> > >      tx_xmit_more: 748638099
> > >      tx_recover: 0
> > >      tx_cqes: 73881645031
> > >      tx_queue_wake: 1
> > >      tx_udp_seg_rem: 0
> > >      tx_cqe_err: 0
> > >      tx_xdp_xmit: 0
> > >      tx_xdp_full: 0
> > >      tx_xdp_err: 0
> > >      tx_xdp_cqes: 0
> > >      rx_wqe_err: 0
> > >      rx_mpwqe_filler_cqes: 0
> > >      rx_mpwqe_filler_strides: 0
> > >      rx_buff_alloc_err: 0
> > >      rx_cqe_compress_blks: 0
> > >      rx_cqe_compress_pkts: 0
> > 
> > If this is a pcie bottleneck it might be useful to enable CQE
> > compression (to reduce PCIe completion descriptor transactions).
> > You should see the above rx_cqe_compress_pkts increasing when
> > enabled.
> > 
> > $ ethtool --set-priv-flags enp175s0f1 rx_cqe_compress on
> > $ ethtool --show-priv-flags enp175s0f1
> > Private flags for p6p1:
> > rx_cqe_moder       : on
> > cqe_moder          : off
> > rx_cqe_compress    : on
> > ...
> > 
> > Try this on both interfaces.
> 
> Done
> 
> ethtool --show-priv-flags enp175s0f1
> Private flags for enp175s0f1:
> rx_cqe_moder       : on
> tx_cqe_moder       : off
> rx_cqe_compress    : on
> rx_striding_rq     : off
> rx_no_csum_complete: off
> 
> ethtool --show-priv-flags enp175s0f0
> Private flags for enp175s0f0:
> rx_cqe_moder       : on
> tx_cqe_moder       : off
> rx_cqe_compress    : on
> rx_striding_rq     : off
> rx_no_csum_complete: off
> 

Did it help reduce the load on the pcie ? Do you see more pps ?
What is the ratio between rx_cqe_compress_pkts and the overall rx packets ?

[...]
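(One quick way to get that ratio straight from the counters - a rough sketch
assuming the mlx5 counter names shown above; both counters are cumulative, so
the interesting number is the delta taken after compression was enabled:

    # share of RX packets that arrived via compressed CQEs
    ethtool -S enp175s0f1 | awk '
        /rx_cqe_compress_pkts:/ { c = $2 }
        /^ *rx_packets:/        { p = $2 }
        END { if (p > 0) printf "compressed: %.1f%% of rx_packets\n", 100 * c / p }'

The same command on enp175s0f0 shows whether the second port benefits as well.)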
> > > ethtool -S enp175s0f0
> > > NIC statistics:
> > >      rx_packets: 141574897253
> > >      rx_bytes: 184445040406258
> > >      tx_packets: 172569543894
> > >      tx_bytes: 99486882076365
> > >      tx_tso_packets: 9367664195
> > >      tx_tso_bytes: 56435233992948
> > >      tx_tso_inner_packets: 0
> > >      tx_tso_inner_bytes: 0
> > >      tx_added_vlan_packets: 141297671626
> > >      tx_nop: 2102916272
> > >      rx_lro_packets: 0
> > >      rx_lro_bytes: 0
> > >      rx_ecn_mark: 0
> > >      rx_removed_vlan_packets: 141574897252
> > >      rx_csum_unnecessary: 0
> > >      rx_csum_none: 23135854
> > >      rx_csum_complete: 141551761398
> > >      rx_csum_unnecessary_inner: 0
> > >      rx_xdp_drop: 0
> > >      rx_xdp_redirect: 0
> > >      rx_xdp_tx_xmit: 0
> > >      rx_xdp_tx_full: 0
> > >      rx_xdp_tx_err: 0
> > >      rx_xdp_tx_cqe: 0
> > >      tx_csum_none: 127934791664
> > 
> > It is a good idea to look into this, tx is not requesting hw tx
> > csumming for a lot of packets, maybe you are wasting a lot of cpu on
> > calculating the csum, or maybe this is just the rx csum complete..
> > 
> > >      tx_csum_partial: 13362879974
> > >      tx_csum_partial_inner: 0
> > >      tx_queue_stopped: 232561
> > 
> > TX queues are stalling, which could be an indication of the pcie
> > bottleneck.
> > 
> > >      tx_queue_dropped: 0
> > >      tx_xmit_more: 1266021946
> > >      tx_recover: 0
> > >      tx_cqes: 140031716469
> > >      tx_queue_wake: 232561
> > >      tx_udp_seg_rem: 0
> > >      tx_cqe_err: 0
> > >      tx_xdp_xmit: 0
> > >      tx_xdp_full: 0
> > >      tx_xdp_err: 0
> > >      tx_xdp_cqes: 0
> > >      rx_wqe_err: 0
> > >      rx_mpwqe_filler_cqes: 0
> > >      rx_mpwqe_filler_strides: 0
> > >      rx_buff_alloc_err: 0
> > >      rx_cqe_compress_blks: 0
> > >      rx_cqe_compress_pkts: 0
> > >      rx_page_reuse: 0
> > >      rx_cache_reuse: 16625975793
> > >      rx_cache_full: 54161465914
> > >      rx_cache_empty: 258048
> > >      rx_cache_busy: 54161472735
> > >      rx_cache_waive: 0
> > >      rx_congst_umr: 0
> > >      rx_arfs_err: 0
> > >      ch_events: 40572621887
> > >      ch_poll: 40885650979
> > >      ch_arm: 40429276692
> > >      ch_aff_change: 0
> > >      ch_eq_rearm: 0
> > >      rx_out_of_buffer: 2791690
> > >      rx_if_down_packets: 74
> > >      rx_vport_unicast_packets: 141843476308
> > >      rx_vport_unicast_bytes: 185421265403318
> > >      tx_vport_unicast_packets: 172569484005
> > >      tx_vport_unicast_bytes: 100019940094298
> > >      rx_vport_multicast_packets: 85122935
> > >      rx_vport_multicast_bytes: 5761316431
> > >      tx_vport_multicast_packets: 6452
> > >      tx_vport_multicast_bytes: 643540
> > >      rx_vport_broadcast_packets: 22423624
> > >      rx_vport_broadcast_bytes: 1390127090
> > >      tx_vport_broadcast_packets: 22024
> > >      tx_vport_broadcast_bytes: 1321440
> > >      rx_vport_rdma_unicast_packets: 0
> > >      rx_vport_rdma_unicast_bytes: 0
> > >      tx_vport_rdma_unicast_packets: 0
> > >      tx_vport_rdma_unicast_bytes: 0
> > >      rx_vport_rdma_multicast_packets: 0
> > >      rx_vport_rdma_multicast_bytes: 0
> > >      tx_vport_rdma_multicast_packets: 0
> > >      tx_vport_rdma_multicast_bytes: 0
> > >      tx_packets_phy: 172569501577
> > >      rx_packets_phy: 142871314588
> > >      rx_crc_errors_phy: 0
> > >      tx_bytes_phy: 100710212814151
> > >      rx_bytes_phy: 187209224289564
> > >      tx_multicast_phy: 6452
> > >      tx_broadcast_phy: 22024
> > >      rx_multicast_phy: 85122933
> > >      rx_broadcast_phy: 22423623
> > >      rx_in_range_len_errors_phy: 2
> > >      rx_out_of_range_len_phy: 0
> > >      rx_oversize_pkts_phy: 0
> > >      rx_symbol_err_phy: 0
> > >      tx_mac_control_phy: 0
> > >      rx_mac_control_phy: 0
> > >      rx_unsupported_op_phy: 0
> > >      rx_pause_ctrl_phy: 0
> > >      tx_pause_ctrl_phy: 0
> > >      rx_discards_phy: 920161423
> > 
> > Ok, this port seems to be suffering more, RX is congested, maybe due
> > to the pcie bottleneck.
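(When chasing these discards it can help to watch the counters over time rather
than as absolute totals - a rough sketch using the counters quoted above; as far
as I understand them, rx_discards_phy counts packets dropped at the port for
lack of receive buffers, while rx_out_of_buffer counts packets dropped because
the driver had no RX descriptors posted:

    # print the drop-related counters once per second to see when they grow
    while sleep 1; do
        date +%T
        ethtool -S enp175s0f0 | grep -E 'rx_discards_phy|rx_out_of_buffer|rx_buff_alloc_err|tx_queue_stopped'
    done

If the discards only climb together with the traffic peaks, that points at the
same congestion discussed above rather than at a steady misconfiguration.)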
> 
> Yes, this side is receiving more traffic - the second port is doing
> ~10G more tx

[...]

> > > Average:      17    0.00    0.00   16.60    0.00    0.00   52.10    0.00    0.00    0.00   31.30
> > > Average:      18    0.00    0.00   13.90    0.00    0.00   61.20    0.00    0.00    0.00   24.90
> > > Average:      19    0.00    0.00    9.99    0.00    0.00   70.33    0.00    0.00    0.00   19.68
> > > Average:      20    0.00    0.00    9.00    0.00    0.00   73.00    0.00    0.00    0.00   18.00
> > > Average:      21    0.00    0.00    8.70    0.00    0.00   73.90    0.00    0.00    0.00   17.40
> > > Average:      22    0.00    0.00   15.42    0.00    0.00   58.56    0.00    0.00    0.00   26.03
> > > Average:      23    0.00    0.00   10.81    0.00    0.00   71.67    0.00    0.00    0.00   17.52
> > > Average:      24    0.00    0.00   10.00    0.00    0.00   71.80    0.00    0.00    0.00   18.20
> > > Average:      25    0.00    0.00   11.19    0.00    0.00   71.13    0.00    0.00    0.00   17.68
> > > Average:      26    0.00    0.00   11.00    0.00    0.00   70.80    0.00    0.00    0.00   18.20
> > > Average:      27    0.00    0.00   10.01    0.00    0.00   69.57    0.00    0.00    0.00   20.42
> > 
> > The numa cores are not at 100% util, you have around 20% of idle on
> > each one.
> 
> Yes - not 100% cpu - but the difference between 80% and 100% is like
> pushing an additional 1-2Gbit/s

Yes, but it doesn't look like the bottleneck is the cpu, although it is
close to being one :)..

> > > Average:      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      29    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      32    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      33    0.00    0.00    3.90    0.00    0.00    0.00    0.00    0.00    0.00   96.10
> > > Average:      34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      35    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      36    0.10    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.70
> > > Average:      37    0.20    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   99.50
> > > Average:      38    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      39    0.00    0.00    2.60    0.00    0.00    0.00    0.00    0.00    0.00   97.40
> > > Average:      40    0.00    0.00    0.90    0.00    0.00    0.00    0.00    0.00    0.00   99.10
> > > Average:      41    0.10    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   99.40
> > > Average:      42    0.00    0.00    9.91    0.00    0.00   70.67    0.00    0.00    0.00   19.42
> > > Average:      43    0.00    0.00   15.90    0.00    0.00   57.50    0.00    0.00    0.00   26.60
> > > Average:      44    0.00    0.00   12.20    0.00    0.00   66.20    0.00    0.00    0.00   21.60
> > > Average:      45    0.00    0.00   12.00    0.00    0.00   67.50    0.00    0.00    0.00   20.50
> > > Average:      46    0.00    0.00   12.90    0.00    0.00   65.50    0.00    0.00    0.00   21.60
> > > Average:      47    0.00    0.00   14.59    0.00    0.00   60.84    0.00    0.00    0.00   24.58
> > > Average:      48    0.00    0.00   13.59    0.00    0.00   61.74    0.00    0.00    0.00   24.68
> > > Average:      49    0.00    0.00   18.36    0.00    0.00   53.29    0.00    0.00    0.00   28.34
> > > Average:      50    0.00    0.00   15.32    0.00    0.00   58.86    0.00    0.00    0.00   25.83
> > > Average:      51    0.00    0.00   17.60    0.00    0.00   55.20    0.00    0.00    0.00   27.20
> > > Average:      52    0.00    0.00   15.92    0.00    0.00   56.06    0.00    0.00    0.00   28.03
> > > Average:      53    0.00    0.00   13.00    0.00    0.00   62.30    0.00    0.00    0.00   24.70
> > > Average:      54    0.00    0.00   13.20    0.00    0.00   61.50    0.00    0.00    0.00   25.30
> > > Average:      55    0.00    0.00   14.59    0.00    0.00   58.64    0.00    0.00    0.00   26.77
> > > 
> > > 
> > > ethtool -k enp175s0f0
> > > Features for enp175s0f0:
> > > rx-checksumming: on
> > > tx-checksumming: on
> > > tx-checksum-ipv4: on
> > > tx-checksum-ip-generic: off [fixed]
> > > tx-checksum-ipv6: on
> > > tx-checksum-fcoe-crc: off [fixed]
> > > tx-checksum-sctp: off [fixed]
> > > scatter-gather: on
> > > tx-scatter-gather: on
> > > tx-scatter-gather-fraglist: off [fixed]
> > > tcp-segmentation-offload: on
> > > tx-tcp-segmentation: on
> > > tx-tcp-ecn-segmentation: off [fixed]
> > > tx-tcp-mangleid-segmentation: off
> > > tx-tcp6-segmentation: on
> > > udp-fragmentation-offload: off
> > > generic-segmentation-offload: on
> > > generic-receive-offload: on
> > > large-receive-offload: off [fixed]
> > > rx-vlan-offload: on
> > > tx-vlan-offload: on
> > > ntuple-filters: off
> > > receive-hashing: on
> > > highdma: on [fixed]
> > > rx-vlan-filter: on
> > > vlan-challenged: off [fixed]
> > > tx-lockless: off [fixed]
> > > netns-local: off [fixed]
> > > tx-gso-robust: off [fixed]
> > > tx-fcoe-segmentation: off [fixed]
> > > tx-gre-segmentation: on
> > > tx-gre-csum-segmentation: on
> > > tx-ipxip4-segmentation: off [fixed]
> > > tx-ipxip6-segmentation: off [fixed]
> > > tx-udp_tnl-segmentation: on
> > > tx-udp_tnl-csum-segmentation: on
> > > tx-gso-partial: on
> > > tx-sctp-segmentation: off [fixed]
> > > tx-esp-segmentation: off [fixed]
> > > tx-udp-segmentation: on
> > > fcoe-mtu: off [fixed]
> > > tx-nocache-copy: off
> > > loopback: off [fixed]
> > > rx-fcs: off
> > > rx-all: off
> > > tx-vlan-stag-hw-insert: on
> > > rx-vlan-stag-hw-parse: off [fixed]
> > > rx-vlan-stag-filter: on [fixed]
> > > l2-fwd-offload: off [fixed]
> > > hw-tc-offload: off
> > > esp-hw-offload: off [fixed]
> > > esp-tx-csum-hw-offload: off [fixed]
> > > rx-udp_tunnel-port-offload: on
> > > tls-hw-tx-offload: off [fixed]
> > > tls-hw-rx-offload: off [fixed]
> > > rx-gro-hw: off [fixed]
> > > tls-hw-record: off [fixed]
> > > 
> > > ethtool -c enp175s0f0
> > > Coalesce parameters for enp175s0f0:
> > > Adaptive RX: off  TX: on
> > > stats-block-usecs: 0
> > > sample-interval: 0
> > > pkt-rate-low: 0
> > > pkt-rate-high: 0
> > > dmac: 32703
> > > 
> > > rx-usecs: 256
> > > rx-frames: 128
> > > rx-usecs-irq: 0
> > > rx-frames-irq: 0
> > > 
> > > tx-usecs: 8
> > > tx-frames: 128
> > > tx-usecs-irq: 0
> > > tx-frames-irq: 0
> > > 
> > > rx-usecs-low: 0
> > > rx-frame-low: 0
> > > tx-usecs-low: 0
> > > tx-frame-low: 0
> > > 
> > > rx-usecs-high: 0
> > > rx-frame-high: 0
> > > tx-usecs-high: 0
> > > tx-frame-high: 0
> > > 
> > > ethtool -g enp175s0f0
> > > Ring parameters for enp175s0f0:
> > > Pre-set maximums:
> > > RX:        8192
> > > RX Mini:   0
> > > RX Jumbo:  0
> > > TX:        8192
> > > Current hardware settings:
> > > RX:        4096
> > > RX Mini:   0
> > > RX Jumbo:  0
> > > TX:        4096
> > > 
> 
> I also changed the coalesce params a little - and the best for this
> config are:
> 
> ethtool -c enp175s0f0
> Coalesce parameters for enp175s0f0:
> Adaptive RX: off  TX: off
> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
> dmac: 32573
> 
> rx-usecs: 40
> rx-frames: 128
> rx-usecs-irq: 0
> rx-frames-irq: 0
> 
> tx-usecs: 8
> tx-frames: 8
> tx-usecs-irq: 0
> tx-frames-irq: 0
> 
> rx-usecs-low: 0
> rx-frame-low: 0
> tx-usecs-low: 0
> tx-frame-low: 0
> 
> rx-usecs-high: 0
> rx-frame-high: 0
> tx-usecs-high: 0
> tx-frame-high: 0
> 
> Fewer drops on the RX side - and more pps forwarded overall.

How much improvement ? Maybe we can improve our adaptive rx coalescing to
be efficient for this workload.
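(For reference, the quoted "best" settings can be applied with something along
these lines - a sketch, not a general recommendation; the values are just the
ones reported to work best for this particular config, and the same would be
applied to enp175s0f1:

    # disable adaptive moderation and pin the coalescing values quoted above
    ethtool -C enp175s0f0 adaptive-rx off adaptive-tx off \
            rx-usecs 40 rx-frames 128 tx-usecs 8 tx-frames 8
    ethtool -c enp175s0f0    # verify the new settings

Lowering rx-usecs from 256 to 40 means interrupts fire sooner and the rings are
drained earlier, trading some cpu for fewer RX drops, which matches the effect
described above.)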