On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
This patch aligns the Virtio-net header on a cache-line boundary to
optimize cache utilization, as it puts the Virtio-net header (which
is always accessed) on the same cache line as the packet header.
For example, with an application that forwards packets at the L2 level,
a single cache line will be accessed with this patch, instead of two
before.
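A minimal sketch of the layout change being described (illustrative
only, not the actual patch code; the helper name is made up and a
12-byte mergeable vnet header is assumed):

    #include <stddef.h>
    #include <rte_common.h>  /* RTE_PTR_ALIGN_CEIL */
    #include <rte_memory.h>  /* RTE_CACHE_LINE_SIZE */

    /* Before: packet data starts on a cache-line boundary, so the
     * vnet header (at data - hdr_len) sits on the preceding 64B line,
     * and an L2 forwarder touches two lines even for small packets.
     * After: the header itself starts on a cache-line boundary and
     * the packet data follows it on the same line. */
    static inline char *
    place_vnet_hdr(char *buf, size_t hdr_len, char **pkt_data)
    {
        char *hdr = RTE_PTR_ALIGN_CEIL(buf, RTE_CACHE_LINE_SIZE);

        *pkt_data = hdr + hdr_len; /* e.g. hdr + 12: now unaligned */
        return hdr;
    }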
I'm assuming you were testing pkt size <= (64 - hdr_size)?
No, I tested with 64-byte packets only.
I ran some more tests this morning with different packet sizes, and
also changed the mbuf size on the guest side to get multi-buffer
packets:
+-------+--------+--------+-------------------------+
| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
+-------+--------+--------+-------------------------+
|    64 |   2048 |  11.05 |                   11.78 |
|   128 |   2048 |  10.66 |                   11.48 |
|   256 |   2048 |  10.47 |                   11.21 |
|   512 |   2048 |  10.22 |                   10.88 |
|  1024 |   2048 |   7.65 |                    7.84 |
|  1500 |   2048 |   6.25 |                    6.45 |
|  2000 |   2048 |   5.31 |                    5.43 |
|  2048 |   2048 |   5.32 |                    4.25 |
|  1500 |    512 |   3.89 |                    3.98 |
|  2048 |    512 |   1.96 |                    2.02 |
+-------+--------+--------+-------------------------+
Overall, we can see it is beneficial in all cases but one.
The only case showing a drop is 2048/2048, which is explained by the
fact that two buffers are now needed, as the vnet header plus the
packet no longer fits in 2048 bytes (see the worked example below).
It could be fixed by aligning the vnet header to the preceding cache
line, inside the headroom.
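A worked example of the buffer-count arithmetic (illustrative only; it
assumes a 12-byte mergeable vnet header and a 2048-byte receive area,
and the helper name is made up):

    #include <stdio.h>

    /* With the patch, the header consumes part of the 2048B receive
     * area: 12 + 2048 = 2060 > 2048, so a 2048B packet needs two
     * buffers. Before, the header lived in the headroom, outside
     * that area. */
    static unsigned int
    nb_rx_bufs(unsigned int pkt_len, unsigned int hdr_len,
               unsigned int buf_len)
    {
        return (pkt_len + hdr_len + buf_len - 1) / buf_len;
    }

    int
    main(void)
    {
        printf("%u\n", nb_rx_bufs(2048, 12, 2048)); /* 2: the drop above */
        printf("%u\n", nb_rx_bufs(2000, 12, 2048)); /* 1 */
        return 0;
    }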
Also, in the case of multi-buffer packets, the next segments are now
aligned on a cache-line boundary, instead of a cache-line boundary
minus the vnet header size as before.
The other thing is that this patch always makes the packet data
cache-unaligned for the first packet, which makes Zhihong's
optimization on memcpy (for big packets) useless.
commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
Author: Zhihong Wang <zhihong.w...@intel.com>
Date: Tue Dec 6 20:31:06 2016 -0500
eal: optimize aligned memcpy on x86
This patch optimizes rte_memcpy for well-aligned cases, where both
the dst and src addresses are aligned to the maximum MOV width. It
introduces a dedicated function called rte_memcpy_aligned to handle
the aligned cases with a simplified instruction stream. The existing
rte_memcpy is renamed as rte_memcpy_generic. The selection between
the two is done at the entry of rte_memcpy.
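Based on that description, the entry-point dispatch presumably looks
like this (a sketch, not necessarily the exact code; ALIGNMENT_MASK
masks the maximum MOV width, e.g. 0x1F when 32-byte AVX loads are
used):

    static inline void *
    rte_memcpy(void *dst, const void *src, size_t n)
    {
        /* both pointers aligned to the widest MOV width: take the
         * simplified aligned path, otherwise fall back to generic */
        if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
            return rte_memcpy_aligned(dst, src, n);
        else
            return rte_memcpy_generic(dst, src, n);
    }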
The existing rte_memcpy is for generic cases: it handles unaligned
copies and makes stores aligned, and it even makes loads aligned for
micro-architectures like Ivy Bridge. However, alignment handling
comes at a price: it adds extra load/store instructions, which can
sometimes cause complications.
Take DPDK vhost memcpy with the mergeable Rx buffer feature as an
example: the copy is aligned and remote, and there is a header write
along with it which is also remote. In this case the memcpy
instruction stream should be simplified, to reduce extra loads/stores
and therefore reduce the probability of pipeline stalls caused by
full load/store buffers, so that the actual memcpy instructions are
issued and the H/W prefetcher goes to work as early as possible.
This patch is tested on Ivy Bridge, Haswell and Skylake; it provides
up to 20% gain for Virtio Vhost PVP traffic, with packet sizes
ranging from 64 to 1500 bytes.
The test can also be conducted without NIC, by setting loopback
traffic between Virtio and Vhost. For example, modify the macro
TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
rebuild and start testpmd in both host and guest, then "start" on
one side and "start tx_first 32" on the other.
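For example, for 1500-byte packets the change would be (assuming the
macro still lives in app/test-pmd/testpmd.h, with 64 as its default):

    /* app/test-pmd/testpmd.h */
    -#define TXONLY_DEF_PACKET_LEN 64
    +#define TXONLY_DEF_PACKET_LEN 1500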
I did run some loopback tests with large packets too, and I see a
small gain with my patch (fwd io on both ends):
+-------+--------+--------+-------------------------+
| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
+-------+--------+--------+-------------------------+
|  1500 |   2048 |   4.05 |                    4.14 |
+-------+--------+--------+-------------------------+
Signed-off-by: Zhihong Wang <zhihong.w...@intel.com>
Reviewed-by: Yuanhan Liu <yuanhan....@linux.intel.com>
Tested-by: Lei Yao <lei.a....@intel.com>
Does this need to be cache-line aligned?
I also tried aligning the packet data on a 16-byte boundary, basically
by putting the header at a HEADROOM + 4 bytes offset (with the 12-byte
mergeable vnet header, the packet data then lands on a 16-byte
boundary), but I didn't measure any gain on Haswell, and even saw a
drop on SandyBridge.
I understand your point regarding the aligned memcpy, but I'm
surprised I don't see its expected superiority in my benchmarks.
Any thoughts?
Cheers,
Maxime
Signed-off-by: Maxime Coquelin <maxime.coque...@redhat.com>
---
Hi,
I'm sending this patch as an RFC because I get strange results on
SandyBridge.
For micro-benchmarks, I measure a +6% gain on Haswell, but I see a
big performance drop on SandyBridge (~-18%).
When running a PVP benchmark on SandyBridge, though, I measure a +4%
performance gain.
So I'd like to call for testing of this patch, especially PVP-like
testing on newer architectures.
Regarding SandyBridge, I would be interested to know whether we
should take the performance drop into account, as for example we had
one patch in the last release that caused a performance drop on SB
but was merged anyway.
Sorry, would you remind me which patch it is?
--yliu