On Wed, Mar 01, 2017 at 08:36:24AM +0100, Maxime Coquelin wrote:
>
>
> On 02/23/2017 06:49 AM, Yuanhan Liu wrote:
> >On Wed, Feb 22, 2017 at 10:36:36AM +0100, Maxime Coquelin wrote:
> >>
> >>
> >>On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
> >>>On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
> >>>>This patch aligns the Virtio-net header on a cache-line boundary to
> >>>>optimize cache utilization, as it puts the Virtio-net header (which
> >>>>is always accessed) on the same cache line as the packet header.
> >>>>
> >>>>For example with an application that forwards packets at L2 level,
> >>>>a single cache-line will be accessed with this patch, instead of
> >>>>two before.
> >>>
> >>>I'm assuming you were testing pkt size <= (64 - hdr_size)?
> >>
> >>No, I tested with 64 bytes packets only.
> >
> >Oh, my bad, I overlooked it. While you were saying "a single cache
> >line", I was thinking putting the virtio net hdr and the "whole"
> >packet data in single cache line, which is not possible for pkt
> >size 64B.
> >
> >>I run some more tests this morning with different packet sizes,
> >>and also with changing the mbuf size on guest side to have multi-
> >>buffers packets:
> >>
> >>+-------+--------+--------+-------------------------+
> >>| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> >>+-------+--------+--------+-------------------------+
> >>|    64 |   2048 |  11.05 |                   11.78 |
> >>|   128 |   2048 |  10.66 |                   11.48 |
> >>|   256 |   2048 |  10.47 |                   11.21 |
> >>|   512 |   2048 |  10.22 |                   10.88 |
> >>|  1024 |   2048 |   7.65 |                    7.84 |
> >>|  1500 |   2048 |   6.25 |                    6.45 |
> >>|  2000 |   2048 |   5.31 |                    5.43 |
> >>|  2048 |   2048 |   5.32 |                    4.25 |
> >>|  1500 |    512 |   3.89 |                    3.98 |
> >>|  2048 |    512 |   1.96 |                    2.02 |
> >>+-------+--------+--------+-------------------------+
> >
> >Could you share more info, say is it a PVP test? Is mergeable on?
> >What's the fwd mode?
>
> No, this is not a PVP benchmark, I have neither another server nor a packet
> generator connected to my Haswell machine back-to-back.
>
> This is a simple micro-benchmark, vhost PMD in txonly, Virtio PMD in
> rxonly. In this configuration, mergeable is ON and no offload disabled
> in QEMU cmdline.
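
The cache-line argument in the patch description quoted above can be checked with a little arithmetic. The sketch below is only an illustration, not code from the patch: it assumes a 64-byte cache line, a 128-byte mbuf headroom, a 12-byte virtio-net header (mergeable on), a buffer that starts on a cache-line boundary, and an L2 forwarder that reads only the 14-byte Ethernet header. The helper names are made up for the example.

```python
CACHE_LINE = 64   # bytes; typical x86 cache-line size (assumption)
HEADROOM = 128    # assumed mbuf headroom in bytes
VNET_HDR = 12     # assumed virtio-net header size with mergeable buffers
ETH_HDR = 14      # Ethernet header, the only packet bytes an L2 forwarder reads


def lines_touched(*regions):
    """Return the set of cache-line indices covered by (offset, length) regions."""
    touched = set()
    for offset, length in regions:
        first = offset // CACHE_LINE
        last = (offset + length - 1) // CACHE_LINE
        touched.update(range(first, last + 1))
    return touched


def layout(hdr_offset):
    """Regions for the vnet header and the Ethernet header that follows it."""
    return (hdr_offset, VNET_HDR), (hdr_offset + VNET_HDR, ETH_HDR)


# Before: the header sits just below the headroom, packet data starts at HEADROOM.
before = lines_touched(*layout(HEADROOM - VNET_HDR))
# After: the header itself starts on the cache-line boundary.
after = lines_touched(*layout(HEADROOM))

print("cache lines touched before:", sorted(before))
print("cache lines touched after: ", sorted(after))
```

Under these assumptions the old layout spans two cache lines (the header ends exactly where the packet data begins), while the aligned layout keeps both headers in one.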
Okay, I see. So the boost, as you have stated, comes from reducing two
cache line accesses to one. Before that, vhost writes 2 cache lines, while
the virtio pmd reads 2 cache lines: one for reading the header, another
one for reading the ether header, for updating xstats (there is no ether
access in the fwd mode you tested).

> That's why I would be interested in more testing on recent hardware
> with PVP benchmark. Is it something that could be run in Intel lab?

I think Yao Lei could help on that? But as stated, I think it may
break the performance for big packets. And I also wouldn't expect a big
boost even for 64B in a PVP test, judging that it's only a 6% boost in
micro benchmarking.

--yliu
>
> I did some more trials, and I think that most of the gain seen in this
> microbenchmark could happen in fact on vhost side.
> Indeed, I monitored the number of packets dequeued at each .rx_pkt_burst()
> call, and I can see there are packets in the vq only once every 20
> calls. On Vhost side, monitoring shows that it always succeeds to write
> its bursts, i.e. the vq is never full.
>
> >>>>In case of multi-buffers packets, next segments will be aligned on
> >>>>a cache-line boundary, instead of cache-line boundary minus size of
> >>>>vnet header before.
> >>>
> >>>Another thing is, this patch always makes the pkt data cache
> >>>unaligned for the first packet, which makes Zhihong's optimization
> >>>on memcpy (for big packet) useless.
> >>>
> >>>    commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
> >>>    Author: Zhihong Wang <zhihong.w...@intel.com>
> >>>    Date:   Tue Dec 6 20:31:06 2016 -0500
> >>
> >>I did run some loopback test with large packet also, and I see a small gain
> >>with my patch (fwd io on both ends):
> >>
> >>+-------+--------+--------+-------------------------+
> >>| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> >>+-------+--------+--------+-------------------------+
> >>|  1500 |   2048 |   4.05 |                    4.14 |
> >>+-------+--------+--------+-------------------------+
> >
> >Weird, that basically means Zhihong's patch doesn't work? Could you add
> >one more column here: what's the data when roll back to the point without
> >Zhihong's commit?
>
> I add this to my ToDo list, don't expect results before next week.
>
> >>>
> >>>    Signed-off-by: Zhihong Wang <zhihong.w...@intel.com>
> >>>    Reviewed-by: Yuanhan Liu <yuanhan....@linux.intel.com>
> >>>    Tested-by: Lei Yao <lei.a....@intel.com>
> >>
> >>Does this need to be cache-line aligned?
> >
> >Nope, the alignment size is different with different platforms. AVX512
> >needs a 64B alignment, while AVX2 needs 32B alignment.
> >
> >>I also tried to align pkt on a 16-byte boundary, basically putting header
> >>at HEADROOM + 4 bytes offset, but I didn't measure any gain on
> >>Haswell,
> >
> >The fast rte_memcpy path (when dst & src is well aligned) on Haswell
> >(with AVX2) requires 32B alignment. Even the 16B boundary would make
> >it into the slow path. From this point of view, the extra pad does
> >not change anything. Thus, no gain is expected.
> >
> >>and even a drop on SandyBridge.
> >
> >That's weird, SandyBridge requires the 16B alignment, meaning the extra
> >pad should put it into fast path of rte_memcpy, whereas the performance
> >is worse.
>
> Thanks for the info, I will run more tests to explain this.
>
> Cheers,
> Maxime
> >
> >--yliu
> >
> >>I understand your point regarding aligned memcpy, but I'm surprised I
> >>don't see its expected superiority with my benchmarks.
> >>Any thoughts?
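
For reference, the alignment reasoning in this thread can be written down as a small calculation. This is only a sketch under stated assumptions: a 128-byte headroom, a 12-byte virtio-net header, a buffer starting on a 64-byte boundary, and the per-platform alignment figures exactly as quoted above (16B for SandyBridge, 32B for AVX2, 64B for AVX512). The layout labels and helper names are hypothetical, and the check looks only at the offset of the packet data in the mbuf. It does not model how rte_memcpy actually selects its path.

```python
HEADROOM = 128   # assumed mbuf headroom in bytes
VNET_HDR = 12    # assumed virtio-net header size with mergeable buffers

# Alignment wanted for the fast copy path, taken from the figures in the thread.
REQUIRED_ALIGN = {"SandyBridge": 16, "Haswell (AVX2)": 32, "AVX512": 64}

# Hypothetical labels for the three header placements discussed.
HDR_OFFSETS = {
    "v17.02": HEADROOM - VNET_HDR,
    "hdr on cache line": HEADROOM,
    "hdr at HEADROOM + 4": HEADROOM + 4,
}


def data_offset(hdr_offset):
    """Packet data starts right after the virtio-net header."""
    return hdr_offset + VNET_HDR


def is_aligned(offset, align):
    """True when offset is a multiple of align."""
    return offset % align == 0


def fast_path_table():
    """Map (layout, platform) to whether the packet data offset is aligned."""
    return {
        (name, platform): is_aligned(data_offset(off), align)
        for name, off in HDR_OFFSETS.items()
        for platform, align in REQUIRED_ALIGN.items()
    }


table = fast_path_table()
for (name, platform), ok in table.items():
    offset = data_offset(HDR_OFFSETS[name])
    state = "aligned" if ok else "unaligned"
    print(f"{name:20s} {platform:15s} data at {offset:3d}: {state}")
```

With these numbers, the HEADROOM + 4 placement puts the packet data on a 16-byte boundary but not on a 32-byte one, which matches the expectation stated above: no change on Haswell, and an aligned copy on SandyBridge.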