The packet rate with 64-byte packets over a 100 Gbps line can reach 148.8
million packets per second. The receive descriptor used by ConnectX NICs to
specify the packet data buffer is 16 bytes in size, so the PCIe bandwidth
required just for the NIC to read the descriptors from host memory is
148.8M * 16B = ~2.38 GB per second, roughly one sixth of the total
bandwidth of a PCIe x16 Gen 3 slot. To mitigate this overhead, Mellanox
NICs provide the Multi-Packet Receive Queue (MPRQ) feature: the descriptor
specifies a single linear buffer that accepts multiple packets, each placed
into its own stride within that buffer. The current mlx5 PMD implementation
allows a packet to be received into a single stride only; a packet cannot
be placed into multiple adjacent strides. This means the stride size must
be large enough to hold packets up to the MTU size. The maximal stride size
is limited by hardware capabilities; for example, ConnectX-5 supports
strides up to 8KB. Hence, if the MPRQ feature is enabled, the maximal
supported MTU is limited by the maximal stride size (minus the space
reserved for the HEAD_ROOM).
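
For reference, here is a minimal sketch of the arithmetic behind these
figures. The only assumption beyond the text above is the standard 20 bytes
of per-frame preamble/IPG overhead on the wire:

#include <stdio.h>

int main(void)
{
    const double line_rate  = 100e9; /* 100 Gbps line rate            */
    const double frame_size = 64;    /* minimal Ethernet frame, bytes */
    const double wire_ovhd  = 20;    /* preamble + SFD + IFG, bytes   */
    const double desc_size  = 16;    /* Rx descriptor size, bytes     */

    /* Packet rate at the minimal frame size. */
    double pps = line_rate / ((frame_size + wire_ovhd) * 8);
    /* PCIe read bandwidth spent on descriptor fetches alone. */
    double desc_bw = pps * desc_size;

    printf("packet rate:    %.1f Mpps\n", pps / 1e6);      /* ~148.8 */
    printf("descriptor b/w: %.2f GB/s\n", desc_bw / 1e9);  /* ~2.38  */
    return 0;
}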

The MPRQ feature is crucial for supporting the full line rate with
small-size packets over fast links and must be enabled if the full line
rate is desired. In order to support an MTU exceeding the stride size, the
MPRQ feature should be extended to allow a packet to take more than one
stride, i.e. receiving a packet into multiple adjacent strides should be
implemented.

What prevents a packet from being received into multiple strides is that
the data buffer must be preceded by some HEAD_ROOM space. In the current
implementation, the PMD borrows the HEAD_ROOM space from the tail of the
preceding stride. If a packet took multiple strides, the tail of a stride
could be overwritten with packet data, and that memory could no longer be
borrowed to provide the HEAD_ROOM space for the next packet. Three ways to
resolve this issue are proposed:

1. Copy part of the packet data into a dedicated mbuf in order to free the
memory needed for the next packet's HEAD_ROOM. The copy is only needed for
the range of packet sizes where the tail of the stride is occupied by
received data. For example, with a stride size of 8KB, a HEAD_ROOM size of
128B, and an MTU of 9000B, the data copy would happen for packet sizes in
the range 8064-8192 bytes. The dedicated mbuf is then linked into the mbuf
chain to build a multi-segment packet: the first mbuf points to the stride
as an external buffer, the second mbuf contains the copied data, and the
tail of the stride is free to be used as the HEAD_ROOM of the next packet
(see the sketch after this list).

2. Provide the HEAD_ROOM as a dedicated mbuf linked as the first one into
the packet mbuf chain. Not all applications and DPDK routines support this
approach; for example, rte_vlan_insert() assumes the HEAD_ROOM immediately
precedes the packet data. Hence, this solution does not look appropriate.

3. The approaches above assume that the application and PMDs support
multi-segment packets; if they do not, the entire packet data should be
copied into a single mbuf.
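
For illustration, below is a minimal sketch of approach (1), assuming
multi-segment Rx is supported. It relies on the generic rte_mbuf external
buffer API; the function name and its parameters (the stride region
pointer/length, packet length, HEAD_ROOM size, mempool and shared info) are
hypothetical and do not reflect the actual mlx5 PMD datapath code:

#include <rte_mbuf.h>
#include <rte_memcpy.h>

/*
 * Approach (1) sketch: the head mbuf references the stride memory as an
 * external buffer (zero-copy), and only the data that would collide with
 * the next packet's HEAD_ROOM is copied into a second, chained mbuf.
 *
 * 'strides'     - start of the contiguous stride region holding the packet
 * 'strides_len' - total length of the strides consumed by the packet
 * 'pkt_len'     - received packet length
 * 'head_room'   - HEAD_ROOM size to keep free at the region tail
 * 'shinfo'      - shared info controlling when the strides are returned
 */
static struct rte_mbuf *
mprq_rx_split_tail(struct rte_mempool *mp, void *strides,
		   rte_iova_t strides_iova, uint16_t strides_len,
		   uint16_t pkt_len, uint16_t head_room,
		   struct rte_mbuf_ext_shared_info *shinfo)
{
	struct rte_mbuf *head, *tail;
	/* Bytes overlapping the tail area needed for the next HEAD_ROOM. */
	uint16_t overlap = pkt_len > strides_len - head_room ?
			   pkt_len - (strides_len - head_room) : 0;
	uint16_t in_stride = pkt_len - overlap;

	head = rte_pktmbuf_alloc(mp);
	if (head == NULL)
		return NULL;
	/* First segment: no copy, just point into the stride memory. */
	rte_pktmbuf_attach_extbuf(head, strides, strides_iova,
				  strides_len, shinfo);
	rte_pktmbuf_append(head, in_stride);
	if (overlap != 0) {
		/* Second segment: copy the overlapping tail out of the
		 * stride, freeing that memory for the next packet's
		 * HEAD_ROOM. Assumes mbufs from 'mp' provide at least
		 * 'overlap' bytes of tailroom. */
		tail = rte_pktmbuf_alloc(mp);
		if (tail == NULL) {
			rte_pktmbuf_free(head);
			return NULL;
		}
		rte_memcpy(rte_pktmbuf_append(tail, overlap),
			   (char *)strides + in_stride, overlap);
		rte_pktmbuf_chain(head, tail);
	}
	return head;
}

The copy is bounded by the HEAD_ROOM size (128 bytes in the example above),
so its cost stays small compared to copying the entire packet.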

To configure one of the approaches above, a new devarg is proposed:
mprq_log_stride_size - specifies the desired stride size (log2). If this
parameter is not specified, the mlx5 PMD tries to support MPRQ in the
existing fashion, in compatibility mode. Otherwise, the copy of the
overlapping data is engaged, and the exact mode depends on whether
multi-segment packet support is enabled: if Rx scattering is enabled,
approach (1) is used; if not, approach (3) is engaged.
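
For illustration only, the stride size could be set to 8KB (2^13) together
with the existing mprq_en devarg, e.g. when launching testpmd (the PCI
address is a placeholder and the exact EAL device option spelling depends
on the DPDK version in use):

    testpmd -w 0000:82:00.0,mprq_en=1,mprq_log_stride_size=13 -- -i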

Signed-off-by: Viacheslav Ovsiienko <viachesl...@mellanox.com>
