Hi, Maxime: > -----Original Message----- > From: Maxime Coquelin [mailto:maxime.coque...@redhat.com] > Sent: Tuesday, December 6, 2016 4:30 PM > To: Yang, Zhiyong <zhiyong.y...@intel.com>; dev@dpdk.org > Cc: yuanhan....@linux.intel.com; Richardson, Bruce > <bruce.richard...@intel.com>; Ananyev, Konstantin > <konstantin.anan...@intel.com>; Pierre Pfister (ppfister) > <ppfis...@cisco.com> > Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset > and related test > > > > On 12/06/2016 07:33 AM, Yang, Zhiyong wrote: > > Hi, Maxime: > > > >> -----Original Message----- > >> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com] > >> Sent: Friday, December 2, 2016 6:01 PM > >> To: Yang, Zhiyong <zhiyong.y...@intel.com>; dev@dpdk.org > >> Cc: yuanhan....@linux.intel.com; Richardson, Bruce > >> <bruce.richard...@intel.com>; Ananyev, Konstantin > >> <konstantin.anan...@intel.com> > >> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset > >> and related test > >> > >> Hi Zhiyong, > >> > >> On 12/05/2016 09:26 AM, Zhiyong Yang wrote: > >>> DPDK code has met performance drop badly in some case when calling > >>> glibc function memset. Reference to discussions about memset in > >>> http://dpdk.org/ml/archives/dev/2016-October/048628.html > >>> It is necessary to introduce more high efficient function to fix it. > >>> One important thing about rte_memset is that we can get clear > >>> control on what instruction flow is used. > >>> > >>> This patchset introduces rte_memset to bring more high efficient > >>> implementation, and will bring obvious perf improvement, especially > >>> for small N bytes in the most application scenarios. > >>> > >>> Patch 1 implements rte_memset in the file rte_memset.h on IA > >>> platform The file supports three types of instruction sets including > >>> sse & avx (128bits), avx2(256bits) and avx512(512bits). rte_memset > >>> makes use of vectorization and inline function to improve the perf > >>> on IA. In addition, cache line and memory alignment are fully taken > >>> into > >> consideration. > >>> > >>> Patch 2 implements functional autotest to validates the function > >>> whether to work in a right way. > >>> > >>> Patch 3 implements performance autotest separately in cache and > memory. > >>> > >>> Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring > >>> 3%~4% performance improvements on IA platform from virtio/vhost > >>> non-mergeable loopback testing. > >>> > >>> Zhiyong Yang (4): > >>> eal/common: introduce rte_memset on IA platform > >>> app/test: add functional autotest for rte_memset > >>> app/test: add performance autotest for rte_memset > >>> lib/librte_vhost: improve vhost perf using rte_memset > >>> > >>> app/test/Makefile | 3 + > >>> app/test/test_memset.c | 158 +++++++++ > >>> app/test/test_memset_perf.c | 348 > +++++++++++++++++++ > >>> doc/guides/rel_notes/release_17_02.rst | 11 + > >>> .../common/include/arch/x86/rte_memset.h | 376 > >> +++++++++++++++++++++ > >>> lib/librte_eal/common/include/generic/rte_memset.h | 51 +++ > >>> lib/librte_vhost/virtio_net.c | 18 +- > >>> 7 files changed, 958 insertions(+), 7 deletions(-) create mode > >>> 100644 app/test/test_memset.c create mode 100644 > >>> app/test/test_memset_perf.c create mode 100644 > >>> lib/librte_eal/common/include/arch/x86/rte_memset.h > >>> create mode 100644 > >> lib/librte_eal/common/include/generic/rte_memset.h > >>> > >> > >> Thanks for the series, idea looks good to me. > >> > >> Wouldn't be worth to also use rte_memset in Virtio PMD (not > >> compiled/tested)? : > >> > > > > I think rte_memset maybe can bring some benefit here, but , I'm not > > clear how to enter the branch and test it. :) > > Indeed, you will need Pierre's patch: > [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set > > Thanks, > Maxime > > Thank you Maxime. I can see a little, but not obviously performance improvement here. You know, memset(hdr, 0, head_size); only consumes fewer cycles for virtio pmd. head_size only 10 or 12 bytes. I optimize rte_memset perf further for N=8~15 bytes. The main purpose of Introducing rte_memset is that we can use it to avoid perf drop issue instead of glibc memset on some platform, I think.
> > > >> diff --git a/drivers/net/virtio/virtio_rxtx.c > >> b/drivers/net/virtio/virtio_rxtx.c > >> index 22d97a4..a5f70c4 100644 > >> --- a/drivers/net/virtio/virtio_rxtx.c > >> +++ b/drivers/net/virtio/virtio_rxtx.c > >> @@ -287,7 +287,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, > >> struct rte_mbuf *cookie, > >> rte_pktmbuf_prepend(cookie, head_size); > >> /* if offload disabled, it is not zeroed below, do it now > >> */ > >> if (offload == 0) > >> - memset(hdr, 0, head_size); > >> + rte_memset(hdr, 0, head_size); > >> } else if (use_indirect) { > >> /* setup tx ring slot to point to indirect > >> * descriptor list stored in reserved region. > >> > >> Cheers, > >> Maxime