Hi Shahaf,

On Mon, Jul 24, 2017 at 03:36:37PM +0300, Shahaf Shuler wrote:
> Update the guides with:
> * New supported features.
> * Supported OFED and FW versions.
> * Quick start guide.
> * Performance tunning guide.
>
> Signed-off-by: Shahaf Shuler <shah...@mellanox.com>
> Acked-by: Nelio Laranjeiro <nelio.laranje...@6wind.com>

Thanks, QSG and performance tuning are especially useful. I have several
comments though (mostly nits), please see below.

> ---
>  doc/guides/nics/mlx4.rst | 161 +++++++++++++++++++++++++++++++---
>  doc/guides/nics/mlx5.rst | 220 +++++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 343 insertions(+), 38 deletions(-)
>
> diff --git a/doc/guides/nics/mlx4.rst b/doc/guides/nics/mlx4.rst
> index f1f26d4f9..23e14e52a 100644
> --- a/doc/guides/nics/mlx4.rst
> +++ b/doc/guides/nics/mlx4.rst
> @@ -1,5 +1,6 @@
>  ..  BSD LICENSE
>      Copyright 2012-2015 6WIND S.A.
> +    Copyright 2015 Mellanox.

I know several files got this wrong, but the ending period is not necessary
here. On the previous line it is actually part of the "6WIND S.A." name.

By the way, I intend to submit a patch soon to fix it in existing files with
additional clean up on top.

>
>  Redistribution and use in source and binary forms, with or without
>  modification, are permitted provided that the following conditions
> @@ -76,6 +77,7 @@ Compiling librte_pmd_mlx4 causes DPDK to be linked against libibverbs.
>  Features
>  --------
>
> +- Multi arch support: x86 and Power8.

Isn't "POWER8" always written all caps? Also see next comment.

>  - RSS, also known as RCA, is supported. In this mode the number of
>    configured RX queues must be a power of two.
>  - VLAN filtering is supported.
> @@ -87,16 +89,7 @@ Features
>  - Inner L3/L4 (IP, TCP and UDP) TX/RX checksum offloading and validation.
>  - Outer L3 (IP) TX/RX checksum offloading and validation for VXLAN frames.
>  - Secondary process TX is supported.
> -
> -Limitations
> ------------
> -
> -- RSS hash key cannot be modified.
> -- RSS RETA cannot be configured
> -- RSS always includes L3 (IPv4/IPv6) and L4 (UDP/TCP). They cannot be
> -  dissociated.
> -- Hardware counters are not implemented (they are software counters).
> -- Secondary process RX is not supported.
> +- Rx interrupts.
>
>  Configuration
>  -------------
> @@ -244,8 +237,8 @@ DPDK and must be installed separately:
>
>  Currently supported by DPDK:
>
> -- Mellanox OFED **4.0-2.0.0.0**.
> -- Firmware version **2.40.7000**.
> +- Mellanox OFED **4.1**.
> +- Firmware version **2.36.5000** and above.
>  - Supported architectures: **x86_64** and **POWER8**.

So x86_64 and POWER8 then? (not "x86" as in "32 bit")

Actually I'm not sure architecture support can be considered a PMD feature
given that DPDK itself inevitably supports a larger set. I suggest dropping
the change made to the "Features" section above.

>
>  Getting Mellanox OFED
> @@ -273,6 +266,150 @@ Supported NICs
>
>  * Mellanox(R) ConnectX(R)-3 Pro 40G MCX354A-FCC_Ax (2*40G)
>
> +Quick Start guide
> +------------------
> +
> +1. Download latest Mellanox OFED. For more info check the `prerequisites`_.
> +
> +2. Install the required libraries and kernel modules either by installing
> +   only the required set, or by installing the entire Mellanox OFED:
> +
> +   For Bare metal use:
> +
> +   .. code-block:: console
> +
> +      ./mlnxofedinstall
> +
> +   For SR-IOV Hypervisors use:
> +
> +   .. code-block:: console
> +
> +      ./mlnxofedinstall --enable-sriov -hypervisor
> +
> +   For SR-IOV Virtual machine use:
> +
> +   .. code-block:: console
> +
> +      ./mlnxofedinstall --guest
> +
> +3. Verify the firmware is the correct one:
> +
> +   .. code-block:: console
> +
> +      ibv_devinfo
> +
> +4. Set all ports links to ethernet, follow instruction on the screen:

ethernet => Ethernet

> +
> +   .. code-block:: console
> +
> +      connectx_port_config
> +

You might want to describe the manual method as well:

 PCI=0001:02:03.4
 echo eth > "/sys/bus/pci/devices/$PCI/mlx4_port0"
 echo eth > "/sys/bus/pci/devices/$PCI/mlx4_port1"

(actually I think this is what connectx_port_config does internally)

> +5. In case of bare metal or Hypervisor, config the optimized steering mode
> +   by adding the following line to ``/etc/modprobe.d/mlx4_core.conf``:
> +
> +   .. code-block:: console
> +
> +      options mlx4_core log_num_mgm_entry_size=-7
> +
> +   .. note::
> +
> +      If VLAN filtering is used, set log_num_mgm_entry_size=-1.
> +      Performance degradation can occur on this case

Missing period.

> +
> +6. Restart the driver:
> +
> +   .. code-block:: console
> +
> +      /etc/init.d/openibd restart
> +   or:
> +
> +   .. code-block:: console
> +
> +      service openibd restart
> +
> +7. Enable MLX4 PMD on the ``.config`` file:
> +
> +   .. code-block:: console
> +
> +      CONFIG_RTE_LIBRTE_MLX4_PMD=y
> +

Looks like this duplicates the note about CONFIG_RTE_LIBRTE_MLX4_PMD in the
first section of this document. Maybe it should be removed.

> +8. Compile DPDK and you are ready to go:
> +
> +   .. code-block:: console
> +
> +      make config T=<cpu arch, compiler, ..>
> +      make

How about linking to the relevant build documentation instead of providing
an example? Otherwise we'll have to maintain it.

> +
> +

Extra line (I think). The style in this file uses only one empty line to
separate sections.

> +Limitations and known issues
> +----------------------------
> +
> +- RSS hash key cannot be modified.
> +- RSS RETA cannot be configured
> +- RSS always includes L3 (IPv4/IPv6) and L4 (UDP/TCP). They cannot be
> +  dissociated.
> +- Hardware counters are not implemented (they are software counters).
> +- Secondary process RX is not supported.
> +

I suggest leaving this section unchanged and in its original spot to make
the diff shorter.

> +Performance tunning
> +-------------------

tunning => tuning

> +
> +1. Verify the optimized steering mode is configured

Missing period or colon?

> +
> +   .. code-block:: console
> +
> +      cat /sys/module/mlx4_core/parameters/log_num_mgm_entry_size
> +
> +2. Use environment variable MLX4_INLINE_RECV_SIZE=64 to get maximum
> +   performance for 64B messages.
> +
> +3. Use the CPU near local NUMA node to which the PCIe adapter is connected,
> +   for better performance. For Virtual Machines (VM), verify that the right CPU

"Virtual Machines (VM)" => either "virtual machines" or "VMs", I think the
reader understands what they are at this point.

> +   and NUMA node are pinned for the VM according to the above. Run

And you should remove "for the VM".

> +
> +   .. code-block:: console
> +
> +      lstopo-no-graphics
> +
> +   to identify the NUMA node to which the PCIe adapter is connected.
> +
> +4. If more than one adapter is used, and root complex capabilities enables
> +   to put both adapters on the same NUMA node without PCI bandwidth degredation,

degredation => degradation

> +   it is recommended to locate both adapters on the same NUMA node.
> +   This in order to forward packets from one to the other without
> +   NUMA performance penalty.
> +
> +5. Disable pause frames

Missing period or colon.

> +
> +   .. code-block:: console
> +
> +      ethtool -A <netdev> rx off tx off
> +
> +6. Verify IO non-posted prefetch is disabled by default. This can be checked
> +   via the BIOS configuration. Please contact you server provider for more
> +   information about the settings.
> +
> +.. hint::
> +
> +   On Some machines, depends on the machine intergrator, it is beneficial

Some => some
intergrator => integrator

> +   to set the PCI max read request parameter to 1K. This can be
> +   done in the following way:
> +
> +   To query the read request size use:
> +
> +   .. code-block:: console
> +
> +      setpci -s <NIC PCI address> 68.w
> +
> +   If the output is different than 3XXX, set it by:
> +
> +   .. code-block:: console
> +
> +      setpci -s <NIC PCI address> 68.w=3XXX
> +
> +   The XXX can be different on different systems. Make sure to configure
> +   according to the setpci output.
> +
>  Usage example
>  -------------
>
> diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
> index a68b7adc0..8accd754b 100644
> --- a/doc/guides/nics/mlx5.rst
> +++ b/doc/guides/nics/mlx5.rst
> @@ -1,5 +1,6 @@
>  ..  BSD LICENSE
>      Copyright 2015 6WIND S.A.
> +    Copyright 2015 Mellanox.

Same nit about the period.

>
>  Redistribution and use in source and binary forms, with or without
>  modification, are permitted provided that the following conditions
> @@ -64,6 +65,9 @@ physical memory (or memory that does not belong to the current process).
>  This capability allows the PMD to coexist with kernel network interfaces
>  which remain functional, although they stop receiving unicast packets as
>  long as they share the same MAC address.
> +This means legacy linux control tools (for example:  ethtool, ifconfig and

Extra space before "ethtool".

> +more) can operate on the same network interfaces that owned by the DPDK
> +application.
>
>  Enabling librte_pmd_mlx5 causes DPDK applications to be linked against
>  libibverbs.
> @@ -71,6 +75,7 @@ libibverbs.
>  Features
>  --------
>
> +- Multi arch support: x86, Power8, ARMv8.

I think this line should not be added, for the same reasons as mlx4.

>  - Multiple TX and RX queues.
>  - Support for scattered TX and RX frames.
>  - IPv4, IPv6, TCPv4, TCPv6, UDPv4 and UDPv6 RSS on any number of queues.
> @@ -92,14 +97,8 @@ Features
>  - RSS hash result is supported.
>  - Hardware TSO.
>  - Hardware checksum TX offload for VXLAN and GRE.
> -
> -Limitations
> ------------
> -
> -- Inner RSS for VXLAN frames is not supported yet.
> -- Port statistics through software counters only.
> -- Hardware checksum RX offloads for VXLAN inner header are not supported yet.
> -- Secondary process RX is not supported.

Limitations should stay here for a shorter diff.

> +- RX interrupts
> +- Statistics query including Basic, Extended and per queue.
>
>  Configuration
>  -------------
> @@ -156,13 +155,12 @@ Run-time configuration
>  - ``rxq_cqe_comp_en`` parameter [int]
>
>    A nonzero value enables the compression of CQE on RX side. This feature
> -  allows to save PCI bandwidth and improve performance at the cost of a
> -  slightly higher CPU usage. Enabled by default.
> +  allows to save PCI bandwidth and improve performance. Enabled by default.
>
>    Supported on:
>
> -  - x86_64 with ConnectX4 and ConnectX4 LX
> -  - Power8 with ConnectX4 LX
> +  - x86_64 with ConnectX-4, ConnectX-4LX and ConnectX-5.
> +  - Power8 and ARMv8 with ConnectX-4LX and ConnectX-5.

Power8 => POWER8, and how about "ConnectX-4LX" => "ConnectX-4 LX"?

>
>  - ``txq_inline`` parameter [int]
>
> @@ -170,17 +168,26 @@ Run-time configuration
>    Can improve PPS performance when PCI back pressure is detected and may be
>    useful for scenarios involving heavy traffic on many queues.
>
> -  It is not enabled by default (set to 0) since the additional software
> -  logic necessary to handle this mode can lower performance when back
> +  Since the additional software logic necessary to handle this mode this

How about:

 Because additional software logic is necessary to handle this mode, this

> +  option should be used with care, as it can lower performance when back
>    pressure is not expected.
>
>  - ``txqs_min_inline`` parameter [int]
>
>    Enable inline send only when the number of TX queues is greater or equal
>    to this value.
> -
>    This option should be used in combination with ``txq_inline`` above.

Removing the empty line causes both lines to be coalesced into a single
paragraph. If that's the intent, you should move the contents of the second
line to the end of the first one.

>
> +  On ConnectX-4/ConnectX-4LX:

How about "ConnectX-4, ConnectX-4 LX and ConnectX-5 without Enhanced MPW"?

> +
> +  - disabled by default. in case ``txq_inline`` is set recommendation is 4.

How about:

 - Disabled by default.
 - In case ``txq_inline`` is set, recommendation is 4.

> +
> +  On ConnectX-5:

"On ConnectX-5 with Enhanced MPW enabled"

> +
> +  - when Enhanced MPW is enabled, it is set to 8 by default.

How about:

 - Set to 8 by default.

> +  - otherwise disabled by default. in case ``txq_inline`` is set
> +    use same values as ConnectX-4/ConnectX-4LX.

With the above changes, no need for such duplication.

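Put together, the ``txqs_min_inline`` entry could then read roughly as
follows. This is only a sketch assembling the suggestions above; it assumes
the empty line before "This option should be used..." is kept, and the exact
wording is up to you:

 - ``txqs_min_inline`` parameter [int]

   Enable inline send only when the number of TX queues is greater or equal
   to this value.

   This option should be used in combination with ``txq_inline`` above.

   On ConnectX-4, ConnectX-4 LX and ConnectX-5 without Enhanced MPW:

   - Disabled by default.
   - In case ``txq_inline`` is set, recommendation is 4.

   On ConnectX-5 with Enhanced MPW enabled:

   - Set to 8 by default.
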
> +
>  - ``txq_mpw_en`` parameter [int]
>
>    A nonzero value enables multi-packet send (MPS) for ConnectX-4 Lx and
> @@ -221,9 +228,7 @@ Run-time configuration
>
>    A nonzero value enables hardware TSO.
>    When hardware TSO is enabled, packets marked with TCP segmentation
> -  offload will be divided into segments by the hardware.
> -
> -  Disabled by default.
> +  offload will be divided into segments by the hardware. Disabled by default.

Is coalescing on purpose?

>
>  Prerequisites
>  -------------
> @@ -279,13 +284,13 @@ DPDK and must be installed separately:
>
>  Currently supported by DPDK:
>
> -- Mellanox OFED version: **4.0-2.0.0.0**
> +- Mellanox OFED version: **4.1**.
>  - firmware version:
>
> -  - ConnectX-4: **12.18.2000**
> -  - ConnectX-4 Lx: **14.18.2000**
> -  - ConnectX-5: **16.19.1200**
> -  - ConnectX-5 Ex: **16.19.1200**
> +  - ConnectX-4: **12.20.1010** and above.
> +  - ConnectX-4 Lx: **14.20.1010** and above.
> +  - ConnectX-5: **16.20.1010** and above.
> +  - ConnectX-5 Ex: **16.20.1010** and above.
>
>  Getting Mellanox OFED
>  ~~~~~~~~~~~~~~~~~~~~~
> @@ -330,10 +335,103 @@ Supported NICs
>  * Mellanox(R) ConnectX(R)-5 100G MCX556A-ECAT (2x100G)
>  * Mellanox(R) ConnectX(R)-5 Ex EN 100G MCX516A-CDAT (2x100G)
>
> -Known issues
> -------------
> +Quick Start guide
> +------------------

"Quick Start guide" => either "Quick start guide" or "Quick Start Guide"

> +
> +1. Download latest Mellanox OFED. For more info check the `prerequisites`_.
> +
> +
> +2. Install the required libraries and kernel modules either by installing
> +   only the required set, or by installing the entire Mellanox OFED:
> +
> +   .. code-block:: console
> +
> +      ./mlnxofedinstall
> +
> +3. Verify the firmware is the correct one:
> +
> +   .. code-block:: console
> +
> +      ibv_devinfo
> +
> +4. Verify all ports links are set to Ethernet:
> +
> +   .. code-block:: console
> +
> +      mlxconfig -d <mst device> query | grep LINK_TYPE
> +      LINK_TYPE_P1 ETH(2)
> +      LINK_TYPE_P2 ETH(2)
> +
> +   If the Links are not in the current protocol move the to Ethernet:

Links => links
the => them
"the current protocol" is rather unclear, how about:

 Link types may have to be configured to Ethernet:

> +
> +   .. code-block:: console
> +
> +      mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3
> +
> +   * LINK_TYPE_P1=<1|2|3> , 1=Infiniband 2=Ethernet 3=VPI(auto-sense)
> +
> +   For Hypervisors verify SR-IOV is enabled on the NIC:

Hypervisors => hypervisors

> +
> +   .. code-block:: console
> +
> +      mlxconfig -d <mst device> query | grep SRIOV_EN
> +      SRIOV_EN True(1)
> +
> +   If Needed, set enable the set the relevant fields:

Needed => needed

>
> -* **Flow pattern without any specific vlan will match for vlan packets as well.**
> +   .. code-block:: console
> +
> +      mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
> +      mlxfwreset -d <mst device> reset
> +
> +5. Restart the driver:
> +
> +   .. code-block:: console
> +
> +      /etc/init.d/openibd restart
> +   or:
> +
> +   .. code-block:: console
> +
> +      service openibd restart
> +
> +   If port link protocol was changed need to reset the fw as well:

How about:

 If link type was changed, firmware must be reset as well:

> +
> +   .. code-block:: console
> +
> +      mlxfwreset -d <mst device> reset
> +
> +   For Hypervisors, after reset write the sysfs number of Virtual Functions

Hypervisors => hypervisors
Virtual Functions => virtual functions (why all the caps?)

> +   needed for the PF.

<< Inserting an empty line might make sense here.

> +   The following is an example of a standard Linux kernel generated file that
> +   is available in the new kernels:

You did not provide a specific kernel version. It's a rather old feature
actually, and since it is documented for almost all other PMDs, how about:

 To dynamically instantiate a given number of virtual functions (VFs):

> +
> +   .. code-block:: console
> +
> +      echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
> +
> +

Extra empty line.

> +6. Enable MLX5 PMD in the ``.config`` file :
> +
> +   .. code-block:: console
> +
> +      CONFIG_RTE_LIBRTE_MLX5_PMD=y
> +
> +7. Compile DPDK and you are ready to go:
> +
> +   .. code-block:: console
> +
> +      make config T=<cpu arch, compiler, ..>
> +      make

Same comments for 6. and 7. as their mlx4 counterparts.

> +
> +Limitations and Known issues
> +----------------------------
> +
> +- Inner RSS for VXLAN frames is not supported yet.
> +- Port statistics through software counters only.
> +- Hardware checksum RX offloads for VXLAN inner header are not supported yet.
> +- Secondary process RX is not supported.
> +- Flow pattern without any specific vlan will match for vlan packets as well:

I suggest leaving this section in its original spot.

>
>    When VLAN spec is not specified in the pattern, the matching rule will be
>    created with VLAN as a wild card.
>    Meaning, the flow rule::
> @@ -350,6 +448,76 @@ Known issues
>
>    Will match any ipv4 packet (VLAN included).
>
> +Performance tunning
> +-------------------

tunning => tuning

> +
> +1. Configure aggressive CQE Zipping for maximum performance

Missing period or colon.

> +
> +   .. code-block:: console
> +
> +      mlxconfig -d <mst device> s CQE_COMPRESSION=1
> +
> +   To set it back to the default CQE Zipping mode use

Missing period or colon.

> +
> +   .. code-block:: console
> +
> +      mlxconfig -d <mst device> s CQE_COMPRESSION=0
> +
> +2. In case of Virtualization:

Virtualization => virtualization

> +
> +   - Make sure that Hypervisor kernel is 3.16 or newer.

Hypervisor => hypervisor

> +   - Configure boot with "iommu=pt".

How about `` `` instead of ""?

> +   - Use 1G huge pages.
> +   - Make sure to allocate a VM on huge pages.
> +   - Make sure to set CPU pinning.
> +
> +3. Use the CPU near local NUMA node to which the PCIe adapter is connected,
> +   for better performance. For Virtual Machines (VM), verify that the right CPU

"Virtual Machines (VM)" => either "virtual machines" or "VMs", I think the
reader understands what they are at this point.

> +   and NUMA node are pinned for the VM according to the above. Run

And you should remove "for the VM".

> +
> +   .. code-block:: console
> +
> +      lstopo-no-graphics
> +
> +   to identify the NUMA node to which the PCIe adapter is connected.
> +
> +4. If more than one adapter is used, and root complex capabilities enables
> +   to put both adapters on the same NUMA node without PCI bandwidth degredation,

degredation => degradation

> +   it is recommended to locate both adapters on the same NUMA node.
> +   This in order to forward packets from one to the other without
> +   NUMA performance penalty.
> +
> +5. Disable pause frames

Missing period or colon.

> +
> +   .. code-block:: console
> +
> +      ethtool -A <netdev> rx off tx off
> +
> +6. Verify IO non-posted prefetch is disabled by default. This can be checked
> +   via the BIOS configuration. Please contact you server provider for more
> +   information about the settings.
> +
> +.. hint::
> +
> +   On Some machines, depends on the machine intergrator, it is beneficial

Some => some
intergrator => integrator

> +   to set the PCI max read request parameter to 1K. This can be
> +   done in the following way:
> +
> +   To query the read request size use:
> +
> +   .. code-block:: console
> +
> +      setpci -s <NIC PCI address> 68.w
> +
> +   If the output is different than 3XXX, set it by:
> +
> +   .. code-block:: console
> +
> +      setpci -s <NIC PCI address> 68.w=3XXX
> +
> +   The XXX can be different on different systems. Make sure to configure
> +   according to the setpci output.
> +
>  Notes for testpmd
>  -----------------
>
> --
> 2.12.0
>

-- 
Adrien Mazarguil
6WIND