Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions

2017-01-08 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monja...@6wind.com]
> Sent: Tuesday, January 3, 2017 4:41 AM
> To: Wang, Zhihong ; Ravi Kerur
> 
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp
> functions
> 
> 2016-06-07 11:09, Wang, Zhihong:
> > From: Ravi Kerur [mailto:rke...@gmail.com]
> > > Zhilong, Thomas,
> > >
> > > If there is enough interest within DPDK community I can work on adding
> support
> > > for 'unaligned access' and 'test cases' for it. Please let me know either
> way.
> >
> > Hi Ravi,
> >
> > This rte_memcmp is proved with better performance than glibc's in aligned
> > cases, I think it has good value to DPDK lib.
> >
> > Though we don't have memcmp in critical pmd data path, it offers a better
> > choice for applications who do.
> 
> Re-thinking about this series, could it be some values to have a rte_memcmp
> implementation?

I think this series (rte_memcmp included) could help:

 1. Potentially better performance in hot paths.

 2. Agile for tuning.

 3. Avoid performance complications -- unusual but possible,
like the glibc memset issue I met while working on vhost
enqueue.

> What is the value compared to glibc one? Why not working on glibc?

As for working on glibc, wider design considerations and test
coverage might be needed, and we'd face different release
cycles -- could we keep the same agility? Also, working with old
glibc versions could be a problem.
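
Just for context, the kind of compare such a routine is built around is a
fixed-size, equality-only check on aligned data, e.g. with SSE2 intrinsics
as in the sketch below. This is illustrative only, not the proposed
rte_memcmp itself, and the function name is made up:

#include <emmintrin.h>  /* SSE2 */

/*
 * Compare two 16-byte blocks and return non-zero when they are equal.
 * Unlike memcmp() it only answers equal/not-equal, which is what most
 * lookup-style hot paths need, and it compiles to a few instructions.
 */
static inline int
eq16(const void *a, const void *b)
{
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);

        return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xFFFF;
}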



Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions

2017-01-10 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monja...@6wind.com]
> Sent: Monday, January 9, 2017 7:09 PM
> To: Wang, Zhihong 
> Cc: Ravi Kerur ; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp
> functions
> 
> 2017-01-09 05:29, Wang, Zhihong:
> > From: Thomas Monjalon [mailto:thomas.monja...@6wind.com]
> > > 2016-06-07 11:09, Wang, Zhihong:
> > > > From: Ravi Kerur [mailto:rke...@gmail.com]
> > > > > Zhilong, Thomas,
> > > > >
> > > > > If there is enough interest within DPDK community I can work on
> adding
> > > support
> > > > > for 'unaligned access' and 'test cases' for it. Please let me know 
> > > > > either
> > > way.
> > > >
> > > > Hi Ravi,
> > > >
> > > > This rte_memcmp is proved with better performance than glibc's in
> aligned
> > > > cases, I think it has good value to DPDK lib.
> > > >
> > > > Though we don't have memcmp in critical pmd data path, it offers a
> better
> > > > choice for applications who do.
> > >
> > > Re-thinking about this series, could it be some values to have a
> rte_memcmp
> > > implementation?
> >
> > I think this series (rte_memcmp included) could help:
> >
> >  1. Potentially better performance in hot paths.
> >
> >  2. Agile for tuning.
> >
> >  3. Avoid performance complications -- unusual but possible,
> > like the glibc memset issue I met while working on vhost
> > enqueue.
> >
> > > What is the value compared to glibc one? Why not working on glibc?
> >
> > As for working on glibc, wider design considerations and test
> > coverage might be needed, and we'd face different release
> > cycles -- could we keep the same agility? Also, working with old
> > glibc versions could be a problem.
> 
> Probably we need both: add the optimized version in DPDK while working
> on a glibc optimization.
> This strategy could be applicable to memcpy, memcmp and memset.

This would help in the long run if it turns out to be feasible.


Re: [dpdk-dev] [PATCH] vhost: optimize vhost memcpy

2016-12-05 Thread Wang, Zhihong
> I like this function a lot, since it's really simple and straightforward!
> Moreover, it performs better.
> 
> But, I don't quite like how this function is proposed:
> 
> - rte_movX are more like internal help functions that should be used only
>   in corresponding rte_memcpy.h file.
> 
> - It's a good optimization, however, it will not benefit for other use
>   cases, though vhost is the most typical case here.
> 
> - The optimization proves to be good for X86, but think there is no
>   guarantee it may behave well for other platforms, say ARM.
> 
> I still would suggest you to go this way: move this function into x86's
> rte_memcpy.h and call it when the data is well aligned.


Do you mean to add something like rte_memcpy_aligned() in 
lib/librte_eal/common/include/generic/rte_memcpy.h?

I thought of this way before, and didn't choose it because it requires
changes in eal. But it would be a clean solution; I'd certainly like
to implement it this way if people are okay with it.


Thanks
Zhihong


> 
>   --yliu


Re: [dpdk-dev] [PATCH] vhost: optimize vhost memcpy

2016-12-06 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan@linux.intel.com]
> Sent: Monday, December 5, 2016 6:37 PM
> To: Wang, Zhihong 
> Cc: dev@dpdk.org; Thomas Monjalon 
> Subject: Re: [PATCH] vhost: optimize vhost memcpy
> 
> On Mon, Dec 05, 2016 at 10:27:00AM +, Wang, Zhihong wrote:
> > > I like this function a lot, since it's really simple and straightforward!
> > > Moreover, it performs better.
> > >
> > > But, I don't quite like how this function is proposed:
> > >
> > > - rte_movX are more like internal help functions that should be used only
> > >   in corresponding rte_memcpy.h file.
> > >
> > > - It's a good optimization, however, it will not benefit for other use
> > >   cases, though vhost is the most typical case here.
> > >
> > > - The optimization proves to be good for X86, but think there is no
> > >   guarantee it may behave well for other platforms, say ARM.
> > >
> > > I still would suggest you to go this way: move this function into x86's
> > > rte_memcpy.h and call it when the data is well aligned.
> >
> >
> > Do you mean to add something like rte_memcpy_aligned() in
> > lib/librte_eal/common/include/generic/rte_memcpy.h?
> 
> Yes, but this one is not supposed to be exported as a public API.
> It should be called inside rte_memcpy (when data is well aligned).
> In this way, only rte_memcpy is exposed, and nothing else should
> be changed.

Yes, I agree this is a better way to introduce this patch. I'll send out v2.
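
Roughly something like the sketch below -- keep the aligned routine
file-local and let rte_memcpy() dispatch to it. The alignment check and
the plain memcpy() fallbacks are placeholders here, not the final code:

#include <stdint.h>
#include <string.h>

#define ALIGN_MASK (16 - 1)     /* assumed vector alignment, e.g. SSE width */

/* internal helper, not exported: copy data known to be well aligned */
static inline void *
rte_memcpy_aligned(void *dst, const void *src, size_t n)
{
        /* an optimized aligned copy loop would go here */
        return memcpy(dst, src, n);
}

/* the public entry point stays the only exposed symbol */
static inline void *
rte_memcpy_sketch(void *dst, const void *src, size_t n)
{
        /* take the aligned path only when dst, src and n are all aligned */
        if (!(((uintptr_t)dst | (uintptr_t)src | n) & ALIGN_MASK))
                return rte_memcpy_aligned(dst, src, n);

        return memcpy(dst, src, n);     /* generic path */
}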

> 
>   --yliu
> >
> > I thought of this way before, and didn't choose it because it requires
> > changes in eal. But it would be a clean solution, I'd certainly like
> > to implement it this way if people are okay with it.
> >
> >
> > Thanks
> > Zhihong
> >
> >
> > >
> > >   --yliu


Re: [dpdk-dev] [PATCH v2 6/8] mbuf: use 2 bytes for port and nb segments

2017-07-04 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Olivier MATZ
> Sent: Tuesday, April 18, 2017 9:03 PM
> To: Yuanhan Liu 
> Cc: dev@dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ;
> m...@smartsharesystems.com; Chilikin, Andrey ;
> jblu...@infradead.org; nelio.laranje...@6wind.com;
> arybche...@solarflare.com; thomas.monja...@6wind.com;
> jerin.ja...@caviumnetworks.com
> Subject: Re: [dpdk-dev] [PATCH v2 6/8] mbuf: use 2 bytes for port and nb
> segments
> 
> Hi Yuanhan,
> 
> On Thu, 6 Apr 2017 13:45:23 +0800, Yuanhan Liu
>  wrote:
> > Hi Olivier,
> >
> > On Tue, Apr 04, 2017 at 06:28:05PM +0200, Olivier Matz wrote:
> > > Change the size of m->port and m->nb_segs to 16 bits.
> >
> > But all the ethdev APIs are still using 8 bits. 16 bits won't really
> > take effect without updating those APIs. Any plans?
> >
> > --yliu
> 
> Yes, there is some work in ethdev, drivers and in example apps to
> make the change effective. I think we could define a specific type for
> a port number, maybe rte_eth_port_num_t. Using this type could be a
> first step (for 17.08) before switching to 16 bits (17.11?).
> 
> I'll do the change and send a rfc.

Ping ;) Is this still in your plan?

Thanks
Zhihong

> 
> Regards,
> Olivier


[dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support to the TX path

2016-10-27 Thread Wang, Zhihong
Hi Maxime,

It seems the indirect desc feature is causing serious performance
degradation on the Haswell platform: about a 20% drop for both
mrg=on and mrg=off (--txqflags=0xf00, non-vector version),
for both iofwd and macfwd.

I'm using RC2, and the CPU is Xeon E5-2699 v3 @ 2.30GHz.

Could you please verify if this is true in your test?


Thanks
Zhihong

> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Monday, October 17, 2016 10:15 PM
> To: Yuanhan Liu 
> Cc: Wang, Zhihong ; Xie, Huawei
> ; dev at dpdk.org; vkaplans at redhat.com;
> mst at redhat.com; stephen at networkplumber.org
> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support
> to the TX path
> 
> 
> 
> On 10/17/2016 03:21 PM, Yuanhan Liu wrote:
> > On Mon, Oct 17, 2016 at 01:23:23PM +0200, Maxime Coquelin wrote:
> >>> On my side, I just setup 2 Windows 2016 VMs, and confirm the issue.
> >>> I'll continue the investigation early next week.
> >>
> >> The root cause is identified.
> >> When INDIRECT_DESC feature is negotiated, Windows guest uses indirect
> >> for both Tx and Rx descriptors, whereas Linux guests (Virtio PMD &
> >> virtio-net kernel driver) use indirect only for Tx.
> >> I'll implement indirect support for the Rx path in vhost lib, but the
> >> change will be too big for -rc release.
> >> I propose in the mean time to disable INDIRECT_DESC feature in vhost
> >> lib, we can still enable it locally for testing.
> >>
> >> Yuanhan, is it ok for you?
> >
> > That's okay.
> I'll send a patch to disable it then.
> 
> >
> >>
> >>> Has anyone already tested Windows guest with vhost-net, which also
> has
> >>> indirect descs support?
> >>
> >> I tested and confirm it works with vhost-net.
> >
> > I'm a bit confused then. IIRC, vhost-net also doesn't support indirect
> > for Rx path, right?
> 
> No, it does support it actually.
> I thought it didn't support too, I misread the Kernel implementation of
> vhost-net and virtio-net. Acutally, virtio-net makes use of indirect
> in Rx path when mergeable buffers is disabled.
> 
> The confusion certainly comes from me, sorry about that.
> 
> Maxime


[dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support to the TX path

2016-10-27 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Thursday, October 27, 2016 5:55 PM
> To: Wang, Zhihong ; Yuanhan Liu
> ; stephen at networkplumber.org; Pierre
> Pfister (ppfister) 
> Cc: Xie, Huawei ; dev at dpdk.org;
> vkaplans at redhat.com; mst at redhat.com
> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support
> to the TX path
> 
> 
> 
> On 10/27/2016 11:10 AM, Maxime Coquelin wrote:
> > Hi Zhihong,
> >
> > On 10/27/2016 11:00 AM, Wang, Zhihong wrote:
> >> Hi Maxime,
> >>
> >> Seems indirect desc feature is causing serious performance
> >> degradation on Haswell platform, about 20% drop for both
> >> mrg=on and mrg=off (--txqflags=0xf00, non-vector version),
> >> both iofwd and macfwd.
> > I tested PVP (with macswap on guest) and Txonly/Rxonly on an Ivy Bridge
> > platform, and didn't faced such a drop.
> > Have you tried to pass indirect_desc=off to qemu cmdline to see if you
> > recover the performance?
> >
> > Yuanhan, which platform did you use when you tested it with zero copy?
> >
> >>
> >> I'm using RC2, and the CPU is Xeon E5-2699 v3 @ 2.30GHz.
> >>
> >> Could you please verify if this is true in your test?
> > I'll try -rc1/-rc2 on my platform, and let you know.
> As a first test, I tried again Txonly from the guest to the host (Rxonly),
> where Tx indirect descriptors are used, on my E5-2665 @2.40GHz:
> v16.11-rc1: 10.81Mpps
> v16.11-rc2: 10.91Mpps
> 
> -rc2 is even slightly better in my case.
> Could you please run the same test on your platform?

I meant to use RC2 for both host and guest, and compare the
perf between indirect=0 and indirect=1.

I use PVP traffic, and tried both testpmd and OVS as the forwarding
engine in the host, with testpmd in the guest.

Thanks
Zhihong

> 
> And could you provide me more info on your fwd bench?
> Do you use dpdk-pktgen on host, or you do fwd on howt with a real NIC
> also?
> 
> Thanks,
> Maxime
> > Thanks,
> > Maxime
> >
> >>
> >>
> >> Thanks
> >> Zhihong
> >>
> >>> -Original Message-
> >>> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> >>> Sent: Monday, October 17, 2016 10:15 PM
> >>> To: Yuanhan Liu 
> >>> Cc: Wang, Zhihong ; Xie, Huawei
> >>> ; dev at dpdk.org; vkaplans at redhat.com;
> >>> mst at redhat.com; stephen at networkplumber.org
> >>> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
> >>> support
> >>> to the TX path
> >>>
> >>>
> >>>
> >>> On 10/17/2016 03:21 PM, Yuanhan Liu wrote:
> >>>> On Mon, Oct 17, 2016 at 01:23:23PM +0200, Maxime Coquelin wrote:
> >>>>>> On my side, I just setup 2 Windows 2016 VMs, and confirm the issue.
> >>>>>> I'll continue the investigation early next week.
> >>>>>
> >>>>> The root cause is identified.
> >>>>> When INDIRECT_DESC feature is negotiated, Windows guest uses
> indirect
> >>>>> for both Tx and Rx descriptors, whereas Linux guests (Virtio PMD &
> >>>>> virtio-net kernel driver) use indirect only for Tx.
> >>>>> I'll implement indirect support for the Rx path in vhost lib, but the
> >>>>> change will be too big for -rc release.
> >>>>> I propose in the mean time to disable INDIRECT_DESC feature in vhost
> >>>>> lib, we can still enable it locally for testing.
> >>>>>
> >>>>> Yuanhan, is it ok for you?
> >>>>
> >>>> That's okay.
> >>> I'll send a patch to disable it then.
> >>>
> >>>>
> >>>>>
> >>>>>> Has anyone already tested Windows guest with vhost-net, which
> also
> >>> has
> >>>>>> indirect descs support?
> >>>>>
> >>>>> I tested and confirm it works with vhost-net.
> >>>>
> >>>> I'm a bit confused then. IIRC, vhost-net also doesn't support indirect
> >>>> for Rx path, right?
> >>>
> >>> No, it does support it actually.
> >>> I thought it didn't support too, I misread the Kernel implementation of
> >>> vhost-net and virtio-net. Acutally, virtio-net makes use of indirect
> >>> in Rx path when mergeable buffers is disabled.
> >>>
> >>> The confusion certainly comes from me, sorry about that.
> >>>
> >>> Maxime


[dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support to the TX path

2016-10-28 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Thursday, October 27, 2016 6:46 PM
> To: Maxime Coquelin 
> Cc: Wang, Zhihong ;
> stephen at networkplumber.org; Pierre Pfister (ppfister)
> ; Xie, Huawei ; dev at 
> dpdk.org;
> vkaplans at redhat.com; mst at redhat.com
> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support
> to the TX path
> 
> On Thu, Oct 27, 2016 at 12:35:11PM +0200, Maxime Coquelin wrote:
> >
> >
> > On 10/27/2016 12:33 PM, Yuanhan Liu wrote:
> > >On Thu, Oct 27, 2016 at 11:10:34AM +0200, Maxime Coquelin wrote:
> > >>Hi Zhihong,
> > >>
> > >>On 10/27/2016 11:00 AM, Wang, Zhihong wrote:
> > >>>Hi Maxime,
> > >>>
> > >>>Seems indirect desc feature is causing serious performance
> > >>>degradation on Haswell platform, about 20% drop for both
> > >>>mrg=on and mrg=off (--txqflags=0xf00, non-vector version),
> > >>>both iofwd and macfwd.
> > >>I tested PVP (with macswap on guest) and Txonly/Rxonly on an Ivy
> Bridge
> > >>platform, and didn't faced such a drop.
> > >
> > >I was actually wondering that may be the cause. I tested it with
> > >my IvyBridge server as well, I saw no drop.
> > >
> > >Maybe you should find a similar platform (Haswell) and have a try?
> > Yes, that's why I asked Zhihong whether he could test Txonly in guest to
> > see if issue is reproducible like this.
> 
> I have no Haswell box, otherwise I could do a quick test for you. IIRC,
> he tried to disable the indirect_desc feature, then the performance
> recovered. So, it's likely the indirect_desc is the culprit here.
> 
> > I will be easier for me to find an Haswell machine if it has not to be
> > connected back to back to and HW/SW packet generator.

In fact a simple loopback test will also do, without pktgen.

Start testpmd in both host and guest, then do "start" in one
and "start tx_first 32" in the other.

The perf drop is about 24% in my test.

> 
> Makes sense.
> 
>   --yliu
> >
> > Thanks,
> > Maxime
> >
> > >
> > >   --yliu
> > >
> > >>Have you tried to pass indirect_desc=off to qemu cmdline to see if you
> > >>recover the performance?
> > >>
> > >>Yuanhan, which platform did you use when you tested it with zero copy?
> > >>
> > >>>
> > >>>I'm using RC2, and the CPU is Xeon E5-2699 v3 @ 2.30GHz.
> > >>>
> > >>>Could you please verify if this is true in your test?
> > >>I'll try -rc1/-rc2 on my platform, and let you know.
> > >>
> > >>Thanks,
> > >>Maxime
> > >>
> > >>>
> > >>>
> > >>>Thanks
> > >>>Zhihong
> > >>>
> > >>>>-Original Message-
> > >>>>From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> > >>>>Sent: Monday, October 17, 2016 10:15 PM
> > >>>>To: Yuanhan Liu 
> > >>>>Cc: Wang, Zhihong ; Xie, Huawei
> > >>>>; dev at dpdk.org; vkaplans at redhat.com;
> > >>>>mst at redhat.com; stephen at networkplumber.org
> > >>>>Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
> support
> > >>>>to the TX path
> > >>>>
> > >>>>
> > >>>>
> > >>>>On 10/17/2016 03:21 PM, Yuanhan Liu wrote:
> > >>>>>On Mon, Oct 17, 2016 at 01:23:23PM +0200, Maxime Coquelin wrote:
> > >>>>>>>On my side, I just setup 2 Windows 2016 VMs, and confirm the
> issue.
> > >>>>>>>I'll continue the investigation early next week.
> > >>>>>>
> > >>>>>>The root cause is identified.
> > >>>>>>When INDIRECT_DESC feature is negotiated, Windows guest uses
> indirect
> > >>>>>>for both Tx and Rx descriptors, whereas Linux guests (Virtio PMD &
> > >>>>>>virtio-net kernel driver) use indirect only for Tx.
> > >>>>>>I'll implement indirect support for the Rx path in vhost lib, but the
> > >>>>>>change will be too big for -rc release.
> > >>>>>>I propose in the mean time to disable INDIRECT_DESC feature in
> vhost
> > >>>>>>lib, we can still enable it locally for testing.
> > >>>>>>
> > >>>>>>Yuanhan, is it ok for you?
> > >>>>>
> > >>>>>That's okay.
> > >>>>I'll send a patch to disable it then.
> > >>>>
> > >>>>>
> > >>>>>>
> > >>>>>>>Has anyone already tested Windows guest with vhost-net, which
> also
> > >>>>has
> > >>>>>>>indirect descs support?
> > >>>>>>
> > >>>>>>I tested and confirm it works with vhost-net.
> > >>>>>
> > >>>>>I'm a bit confused then. IIRC, vhost-net also doesn't support indirect
> > >>>>>for Rx path, right?
> > >>>>
> > >>>>No, it does support it actually.
> > >>>>I thought it didn't support too, I misread the Kernel implementation of
> > >>>>vhost-net and virtio-net. Acutally, virtio-net makes use of indirect
> > >>>>in Rx path when mergeable buffers is disabled.
> > >>>>
> > >>>>The confusion certainly comes from me, sorry about that.
> > >>>>
> > >>>>Maxime


[dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support to the TX path

2016-10-31 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Friday, October 28, 2016 3:42 PM
> To: Wang, Zhihong ; Yuanhan Liu
> 
> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
> ; Xie, Huawei ; dev at 
> dpdk.org;
> vkaplans at redhat.com; mst at redhat.com
> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support
> to the TX path
> 
> 
> 
> On 10/28/2016 02:49 AM, Wang, Zhihong wrote:
> >
> >> > -Original Message-
> >> > From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> >> > Sent: Thursday, October 27, 2016 6:46 PM
> >> > To: Maxime Coquelin 
> >> > Cc: Wang, Zhihong ;
> >> > stephen at networkplumber.org; Pierre Pfister (ppfister)
> >> > ; Xie, Huawei ;
> dev at dpdk.org;
> >> > vkaplans at redhat.com; mst at redhat.com
> >> > Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
> support
> >> > to the TX path
> >> >
> >> > On Thu, Oct 27, 2016 at 12:35:11PM +0200, Maxime Coquelin wrote:
> >>> > >
> >>> > >
> >>> > > On 10/27/2016 12:33 PM, Yuanhan Liu wrote:
> >>>> > > >On Thu, Oct 27, 2016 at 11:10:34AM +0200, Maxime Coquelin
> wrote:
> >>>>> > > >>Hi Zhihong,
> >>>>> > > >>
> >>>>> > > >>On 10/27/2016 11:00 AM, Wang, Zhihong wrote:
> >>>>>> > > >>>Hi Maxime,
> >>>>>> > > >>>
> >>>>>> > > >>>Seems indirect desc feature is causing serious performance
> >>>>>> > > >>>degradation on Haswell platform, about 20% drop for both
> >>>>>> > > >>>mrg=on and mrg=off (--txqflags=0xf00, non-vector version),
> >>>>>> > > >>>both iofwd and macfwd.
> >>>>> > > >>I tested PVP (with macswap on guest) and Txonly/Rxonly on an
> Ivy
> >> > Bridge
> >>>>> > > >>platform, and didn't faced such a drop.
> >>>> > > >
> >>>> > > >I was actually wondering that may be the cause. I tested it with
> >>>> > > >my IvyBridge server as well, I saw no drop.
> >>>> > > >
> >>>> > > >Maybe you should find a similar platform (Haswell) and have a try?
> >>> > > Yes, that's why I asked Zhihong whether he could test Txonly in guest
> to
> >>> > > see if issue is reproducible like this.
> >> >
> >> > I have no Haswell box, otherwise I could do a quick test for you. IIRC,
> >> > he tried to disable the indirect_desc feature, then the performance
> >> > recovered. So, it's likely the indirect_desc is the culprit here.
> >> >
> >>> > > I will be easier for me to find an Haswell machine if it has not to be
> >>> > > connected back to back to and HW/SW packet generator.
> > In fact simple loopback test will also do, without pktgen.
> >
> > Start testpmd in both host and guest, and do "start" in one
> > and "start tx_first 32" in another.
> >
> > Perf drop is about 24% in my test.
> >
> 
> Thanks, I never tried this test.
> I managed to find an Haswell platform (Intel(R) Xeon(R) CPU E5-2699 v3
> @ 2.30GHz), and can reproduce the problem with the loop test you
> mention. I see a performance drop about 10% (8.94Mpps/8.08Mpps).
> Out of curiosity, what are the numbers you get with your setup?

Hi Maxime,

Let's align our test case to RC2, mrg=on, loopback, on Haswell.
My results below:
 1. indirect=1: 5.26 Mpps
 2. indirect=0: 6.54 Mpps

It's about a 24% drop.

> 
> As I never tried this test, I run it again on my Sandy Bridge setup, and
> I also see a performance regression, this time of 4%.
> 
> If I understand correctly the test, only 32 packets are allocated,
> corresponding to a single burst, which is less than the queue size.
> So it makes sense that the performance is lower with this test case.

Actually it's 32 bursts of 32 packets each, so 1024 packets in
total, enough to fill the queue.

Thanks
Zhihong

> 
> Thanks,
> Maxime


[dpdk-dev] [dpdk-stable] [PATCH v4 1/6] vhost: fix windows vm hang

2016-09-05 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Monday, September 5, 2016 1:25 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; maxime.coquelin at redhat.com;
> yuanhan.liu at linux.intel.com; thomas.monjalon at 6wind.com;
> stable at dpdk.org
> Subject: Re: [dpdk-stable] [PATCH v4 1/6] vhost: fix windows vm hang
> 
> On Mon, Aug 29, 2016 at 11:35:59PM -0400, Zhihong Wang wrote:
> > This patch fixes a Windows VM compatibility issue in DPDK 16.07 vhost
> code,
> > which causes the guest to hang once any packets are enqueued when
> mrg_rxbuf
> > is turned on.
> 
> This commit log lacks two important pieces: why does the hang happen and
> how does your patch fix it.

Okay, I'll add it in v5.

> 
> > How to test?
> >
> >  1. Start testpmd in the host with a vhost port.
> >
> >  2. Start a Windows VM image with qemu and connect to the vhost port.
> >
> >  3. Start io forwarding with tx_first in host testpmd.
> >
> > For 16.07 code, the Windows VM will hang once any packets are enqueued.
> >
> > Cc: 
> > Signed-off-by: Zhihong Wang 
> > ---
> >  lib/librte_vhost/vhost_rxtx.c | 17 -
> >  1 file changed, 12 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> > index 08a73fd..5806f99 100644
> > --- a/lib/librte_vhost/vhost_rxtx.c
> > +++ b/lib/librte_vhost/vhost_rxtx.c
> > @@ -384,6 +384,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net
> *dev, struct vhost_virtqueue *vq,
> > uint16_t start_idx = vq->last_used_idx;
> > uint16_t cur_idx = start_idx;
> > uint64_t desc_addr;
> > +   uint32_t desc_chain_head;
> > +   uint32_t desc_chain_len;
> 
> What's the point of introducing "desc_chain_len"? It has the same value
> of desc_offset.

No it's not: desc_offset is the offset within the current desc only.
That's where the old code goes wrong.

If you take a look at the virtio spec:

/* le32 is used here for ids for padding reasons. */
struct vring_used_elem {
        /* Index of start of used descriptor chain. */
        le32 id;
        /* Total length of the descriptor chain which was written to. */
        le32 len;
};
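
To make the difference concrete, below is a small self-contained toy model
(illustrative only, not the vhost code): the per-descriptor offset resets
every time we move to the next descriptor in the chain, while the chain
length keeps accumulating, and only the latter is the right value for
vring_used_elem.len.

#include <stdio.h>
#include <stdint.h>

struct toy_desc { uint32_t len; int next; };    /* simplified vring_desc */

int main(void)
{
        struct toy_desc chain[3] = { {100, 1}, {100, 2}, {100, -1} };
        uint32_t to_write = 250;        /* bytes copied into the chain */
        uint32_t desc_offset = 0;       /* offset inside the current desc */
        uint32_t desc_chain_len = 0;    /* total bytes written to the chain */
        int cur = 0;

        while (to_write > 0 && cur >= 0) {
                uint32_t room = chain[cur].len - desc_offset;
                uint32_t copy = to_write < room ? to_write : room;

                desc_offset += copy;            /* per-desc, resets below */
                desc_chain_len += copy;         /* whole chain, accumulates */
                to_write -= copy;

                if (desc_offset == chain[cur].len) {
                        cur = chain[cur].next;
                        desc_offset = 0;
                }
        }

        /* prints "chain len 250, last desc offset 50": only the former is
         * the right value for used->ring[].len */
        printf("chain len %u, last desc offset %u\n",
                desc_chain_len, desc_offset);
        return 0;
}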

> 
>   --yliu


[dpdk-dev] [PATCH v4 2/6] vhost: rewrite enqueue

2016-09-07 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Wednesday, September 7, 2016 1:33 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; maxime.coquelin at redhat.com;
> thomas.monjalon at 6wind.com
> Subject: Re: [PATCH v4 2/6] vhost: rewrite enqueue
> 
> Hmmm, yet another email didn't send out successfully. Resend.
> 
> BTW, please work out v5 on top of the latest next-virtio tree.
> 
> Thanks.

Okay. Thanks.

> 
>   --yliu
> 
> On Mon, Sep 05, 2016 at 02:39:25PM +0800, Yuanhan Liu wrote:
> 
> On Mon, Aug 29, 2016 at 11:36:00PM -0400, Zhihong Wang wrote:
> > This patch implements the vhost logic from scratch into a single function
> > designed for high performance and better maintainability.
> >
> > This is the baseline version of the new code, more optimization will be
> > added in the following patches in this patch set.
> >
> > ---
> > Changes in v4:
> >
> >  1. Refactor the code for clearer logic.
> >
> >  2. Add PRINT_PACKET for debugging.
> >
> > ---
> > Changes in v3:
> >
> >  1. Rewrite enqueue and delete the obsolete in the same patch.
> 
> Change log should go >
> 
> > Signed-off-by: Zhihong Wang 
> > ---
> 
> ... here, after the SoB.
> 
> >  lib/librte_vhost/vhost_rxtx.c | 525 
> > -
> -
> >  1 file changed, 145 insertions(+), 380 deletions(-)
> >
> > diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> > index 5806f99..629e8ae 100644
> > --- a/lib/librte_vhost/vhost_rxtx.c
> > +++ b/lib/librte_vhost/vhost_rxtx.c
> > @@ -91,7 +91,7 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx,
> uint32_t qp_nb)
> > return (is_tx ^ (idx & 1)) == 0 && idx < qp_nb * VIRTIO_QNUM;
> >  }
> >
> > -static void
> > +static inline void __attribute__((always_inline))
> >  virtio_enqueue_offload(struct rte_mbuf *m_buf, struct virtio_net_hdr
> *net_hdr)
> >  {
> > if (m_buf->ol_flags & PKT_TX_L4_MASK) {
> > @@ -112,6 +112,10 @@ virtio_enqueue_offload(struct rte_mbuf *m_buf,
> struct virtio_net_hdr *net_hdr)
> > cksum));
> > break;
> > }
> > +   } else {
> > +   net_hdr->flags = 0;
> > +   net_hdr->csum_start = 0;
> > +   net_hdr->csum_offset = 0;
> > }
> >
> > if (m_buf->ol_flags & PKT_TX_TCP_SEG) {
> > @@ -122,437 +126,198 @@ virtio_enqueue_offload(struct rte_mbuf
> *m_buf, struct virtio_net_hdr *net_hdr)
> > net_hdr->gso_size = m_buf->tso_segsz;
> > net_hdr->hdr_len = m_buf->l2_len + m_buf->l3_len
> > + m_buf->l4_len;
> > +   } else {
> > +   net_hdr->gso_type = 0;
> > +   net_hdr->hdr_len = 0;
> > +   net_hdr->gso_size = 0;
> > }
> >  }
> >
> > -static inline void
> > -copy_virtio_net_hdr(struct virtio_net *dev, uint64_t desc_addr,
> > -   struct virtio_net_hdr_mrg_rxbuf hdr)
> > +static inline void __attribute__((always_inline))
> > +update_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +   uint32_t desc_chain_head, uint32_t desc_chain_len)
> >  {
> > -   if (dev->vhost_hlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
> > -   *(struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc_addr =
> hdr;
> > -   else
> > -   *(struct virtio_net_hdr *)(uintptr_t)desc_addr = hdr.hdr;
> > +   uint32_t used_idx_round = vq->last_used_idx & (vq->size - 1);
> 
> I'd suggest to use "used_idx", instead of "used_idx_round".
> 
> > +
> > +   vq->used->ring[used_idx_round].id = desc_chain_head;
> > +   vq->used->ring[used_idx_round].len = desc_chain_len;
> > +   vhost_log_used_vring(dev, vq, offsetof(struct vring_used,
> > +   ring[used_idx_round]),
> > +   sizeof(vq->used->ring[used_idx_round]));
> >  }
> >
> > -static inline int __attribute__((always_inline))
> > -copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > - struct rte_mbuf *m, uint16_t desc_idx)
> > +static inline uint32_t __attribute__((always_inline))
> > +enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +   uint16_t avail_idx, struct rte_mbuf *mbuf,
> > +   uint

[dpdk-dev] [PATCH v5 2/6] vhost: rewrite enqueue

2016-09-14 Thread Wang, Zhihong
> > +   desc_current =
> > +   vq->avail->ring[(vq->last_used_idx)
> &
> > +   (vq->size - 1)];
> > +   desc_chain_head = desc_current;
> > +   desc = &vq->desc[desc_current];
> > +   desc_addr = gpa_to_vva(dev, desc->addr);
> > +   if (unlikely(!desc_addr))
> > +   goto error;
> >
> > -   desc = &vq->desc[desc->next];
> > -   desc_addr = gpa_to_vva(dev, desc->addr);
> > -   if (unlikely(!desc_addr))
> > -   return -1;
> > -
> > -   desc_offset = 0;
> > -   desc_avail  = desc->len;
> > +   desc_chain_len = 0;
> > +   desc_offset = 0;
> As I commented on v3, there is code duplication between next flag, and
> mrg buf cases:
> desc_offset = 0;
> 
> and:
> 
> desc = &vq->desc[desc_current];
> desc_addr = gpa_to_vva(dev, desc->addr);
> if (unlikely(!desc_addr))
>  goto error;
> 

Do you mean to add something like:

static inline int __attribute__((always_inline))
get_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
                uint32_t desc_idx, struct vring_desc **desc,
                uint64_t *desc_addr)
{
        *desc = &vq->desc[desc_idx];
        *desc_addr = gpa_to_vva(dev, (*desc)->addr);
        if (unlikely(!(*desc_addr)))
                return -1;

        return 0;
}


> Regards,
> Maxime


[dpdk-dev] [PATCH v5 2/6] vhost: rewrite enqueue

2016-09-14 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Tuesday, September 13, 2016 12:27 AM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: yuanhan.liu at linux.intel.com; thomas.monjalon at 6wind.com
> Subject: Re: [PATCH v5 2/6] vhost: rewrite enqueue
> 
> 
> 
> On 09/09/2016 05:39 AM, Zhihong Wang wrote:
> >
> > +static inline void __attribute__((always_inline))
> > +notify_guest(struct virtio_net *dev, struct vhost_virtqueue *vq)
> > +{
> > rte_smp_wmb();
> > -
> > -   *(volatile uint16_t *)&vq->used->idx += count;
> > -   vq->last_used_idx += count;
> > -   vhost_log_used_vring(dev, vq,
> > -   offsetof(struct vring_used, idx),
> > -   sizeof(vq->used->idx));
> > -
> > -   /* flush used->idx update before we read avail->flags. */
> Please don't remove comments if not justified.
> Here the comment is important, as it explains why the barrier is needed.

Okay.

> > +   *(volatile uint16_t *)&vq->used->idx = vq->last_used_idx;
> > +   vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
> > +   sizeof(vq->used->idx));
> > rte_mb();
> > -
> > -   /* Kick the guest if necessary. */
> > if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
> > && (vq->callfd >= 0))
> > eventfd_write(vq->callfd, (eventfd_t)1);
> > -   return count;
> >  }


[dpdk-dev] [PATCH v5 5/6] vhost: batch update used ring

2016-09-14 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Monday, September 12, 2016 11:46 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: yuanhan.liu at linux.intel.com; thomas.monjalon at 6wind.com
> Subject: Re: [PATCH v5 5/6] vhost: batch update used ring
> 
> 
> 
> On 09/09/2016 05:39 AM, Zhihong Wang wrote:
> > This patch enables batch update of the used ring for better efficiency.
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> > Changes in v4:
> >
> >  1. Free shadow used ring in the right place.
> >
> >  2. Add failure check for shadow used ring malloc.
> >
> >  lib/librte_vhost/vhost.c  | 20 --
> >  lib/librte_vhost/vhost.h  |  4 +++
> >  lib/librte_vhost/vhost_user.c | 31 +
> >  lib/librte_vhost/virtio_net.c | 64
> +++
> >  4 files changed, 101 insertions(+), 18 deletions(-)
> >
> > diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> > index 46095c3..cb31cdd 100644
> > --- a/lib/librte_vhost/vhost.c
> > +++ b/lib/librte_vhost/vhost.c
> > @@ -119,10 +119,26 @@ cleanup_device(struct virtio_net *dev, int
> destroy)
> >  static void
> >  free_device(struct virtio_net *dev)
> >  {
> > +   struct vhost_virtqueue *vq_0;
> > +   struct vhost_virtqueue *vq_1;
> > uint32_t i;
> >
> > -   for (i = 0; i < dev->virt_qp_nb; i++)
> > -   rte_free(dev->virtqueue[i * VIRTIO_QNUM]);
> > +   for (i = 0; i < dev->virt_qp_nb; i++) {
> > +   vq_0 = dev->virtqueue[i * VIRTIO_QNUM];
> > +   if (vq_0->shadow_used_ring) {
> > +   rte_free(vq_0->shadow_used_ring);
> > +   vq_0->shadow_used_ring = NULL;
> > +   }
> > +
> > +   vq_1 = dev->virtqueue[i * VIRTIO_QNUM + 1];
> > +   if (vq_1->shadow_used_ring) {
> > +   rte_free(vq_1->shadow_used_ring);
> > +   vq_1->shadow_used_ring = NULL;
> > +   }
> > +
> > +   /* malloc together, free together */
> > +   rte_free(vq_0);
> > +   }
> >
> > rte_free(dev);
> >  }
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index 9707dfc..381dc27 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -85,6 +85,10 @@ struct vhost_virtqueue {
> >
> > /* Physical address of used ring, for logging */
> > uint64_tlog_guest_addr;
> > +
> > +   /* Shadow used ring for performance */
> > +   struct vring_used_elem  *shadow_used_ring;
> > +   uint32_tshadow_used_idx;
> >  } __rte_cache_aligned;
> >
> >  /* Old kernels have no such macro defined */
> > diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> > index eee99e9..d7cf1ed 100644
> > --- a/lib/librte_vhost/vhost_user.c
> > +++ b/lib/librte_vhost/vhost_user.c
> > @@ -193,7 +193,21 @@ static int
> >  vhost_user_set_vring_num(struct virtio_net *dev,
> >  struct vhost_vring_state *state)
> >  {
> > -   dev->virtqueue[state->index]->size = state->num;
> > +   struct vhost_virtqueue *vq;
> > +
> > +   vq = dev->virtqueue[state->index];
> > +   vq->size = state->num;
> > +   if (!vq->shadow_used_ring) {
> > +   vq->shadow_used_ring = rte_malloc(NULL,
> > +   vq->size * sizeof(struct vring_used_elem),
> > +   RTE_CACHE_LINE_SIZE);
> > +   if (!vq->shadow_used_ring) {
> > +   RTE_LOG(ERR, VHOST_CONFIG,
> > +   "Failed to allocate memory"
> > +   " for shadow used ring.\n");
> > +   return -1;
> > +   }
> > +   }
> >
> > return 0;
> >  }
> > @@ -611,14 +625,21 @@ static int
> >  vhost_user_get_vring_base(struct virtio_net *dev,
> >   struct vhost_vring_state *state)
> >  {
> > +   struct vhost_virtqueue *vq;
> > +
> > /* We have to stop the queue (virtio) if it is running. */
> > if (dev->flags & VIRTIO_DEV_RUNNING) {
> > dev->flags &= ~VIRTIO_DEV_RUNNING;
> > notify_ops->destroy_device(dev->vid);
> > }
> >
> > +   vq = dev->virtqueue[state->index];
> > /* H

[dpdk-dev] [PATCH v5 5/6] vhost: batch update used ring

2016-09-18 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Sunday, September 18, 2016 10:56 AM
> To: Maxime Coquelin 
> Cc: Wang, Zhihong ; dev at dpdk.org;
> thomas.monjalon at 6wind.com
> Subject: Re: [PATCH v5 5/6] vhost: batch update used ring
> 
> On Thu, Sep 15, 2016 at 06:38:06PM +0200, Maxime Coquelin wrote:
> > >>>+static inline void __attribute__((always_inline))
> > >>>+flush_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > >>>+uint32_t used_idx_start)
> > >>>+{
> > >>>+if (used_idx_start + vq->shadow_used_idx < vq->size) {
> > >>>+rte_memcpy(&vq->used->ring[used_idx_start],
> > >>>+&vq->shadow_used_ring[0],
> > >>>+vq->shadow_used_idx *
> > >>>+sizeof(struct vring_used_elem));
> > >>>+vhost_log_used_vring(dev, vq,
> > >>>+offsetof(struct vring_used,
> > >>>+ring[used_idx_start]),
> > >>>+vq->shadow_used_idx *
> > >>>+sizeof(struct vring_used_elem));
> > >>>+} else {
> > >>>+uint32_t part_1 = vq->size - used_idx_start;
> > >>>+uint32_t part_2 = vq->shadow_used_idx - part_1;
> > >>>+
> > >>>+rte_memcpy(&vq->used->ring[used_idx_start],
> > >>>+&vq->shadow_used_ring[0],
> > >>>+part_1 *
> > >>>+sizeof(struct vring_used_elem));
> > >>>+vhost_log_used_vring(dev, vq,
> > >>>+offsetof(struct vring_used,
> > >>>+ring[used_idx_start]),
> > >>>+part_1 *
> > >>>+sizeof(struct vring_used_elem));
> > >>>+rte_memcpy(&vq->used->ring[0],
> > >>>+&vq->shadow_used_ring[part_1],
> > >>>+part_2 *
> > >>>+sizeof(struct vring_used_elem));
> > >>>+vhost_log_used_vring(dev, vq,
> > >>>+offsetof(struct vring_used,
> > >>>+ring[0]),
> > >>>+part_2 *
> > >>>+sizeof(struct vring_used_elem));
> > >>>+}
> > >>> }
> > >>Is expanding the code done for performance purpose?
> > >
> > >Hi Maxime,
> > >
> > >Yes theoretically this has the least branch number.
> > >And I think the logic is simpler this way.
> > Ok, in that case, maybe you could create a function to
> > do the rte_memcpy and the vhost_log_used on a given range.
> 
> Agreed, that will be better; it could avoid repeating similar code
> block 3 times.

Okay. Thanks for the suggestion, Maxime and Yuanhan.
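
Something like the sketch below, reusing the names from the patch (it
assumes the surrounding vhost types and vhost_log_used_vring() as in the
patch, and is not the final code):

/* copy one contiguous range of shadow entries into the used ring and log it */
static inline void
flush_used_ring_range(struct virtio_net *dev, struct vhost_virtqueue *vq,
                uint32_t used_idx, uint32_t shadow_idx, uint32_t count)
{
        rte_memcpy(&vq->used->ring[used_idx],
                        &vq->shadow_used_ring[shadow_idx],
                        count * sizeof(struct vring_used_elem));
        vhost_log_used_vring(dev, vq,
                        offsetof(struct vring_used, ring[used_idx]),
                        count * sizeof(struct vring_used_elem));
}

static inline void
flush_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
                uint32_t used_idx_start)
{
        if (used_idx_start + vq->shadow_used_idx < vq->size) {
                flush_used_ring_range(dev, vq, used_idx_start, 0,
                                vq->shadow_used_idx);
        } else {
                /* the shadow ring wraps: flush it as two contiguous ranges */
                uint32_t part_1 = vq->size - used_idx_start;
                uint32_t part_2 = vq->shadow_used_idx - part_1;

                flush_used_ring_range(dev, vq, used_idx_start, 0, part_1);
                flush_used_ring_range(dev, vq, 0, part_1, part_2);
        }
}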

> 
> > I don't have a strong opinion on this, if Yuanhan is fine
> > with current code, that's ok for me.
> 
> From what I know, that's kind of DPDK prefered way, to expand code
> when necessary. For example, 9ec201f5d6e7 ("mbuf: provide bulk
> allocation").
> 
> So I'm fine with it.
> 
>   --yliu


[dpdk-dev] [PATCH v5 2/6] vhost: rewrite enqueue

2016-09-19 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Sunday, September 18, 2016 10:19 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; maxime.coquelin at redhat.com;
> thomas.monjalon at 6wind.com
> Subject: Re: [PATCH v5 2/6] vhost: rewrite enqueue
> 
> On Thu, Sep 08, 2016 at 11:39:24PM -0400, Zhihong Wang wrote:
> > This patch implements the vhost logic from scratch into a single function
> > designed for high performance and better maintainability.
> 
> As always, your commit log just states what have been done, but doesn't
> tell why such changes have been made. For example, you said "it's designed
> for high performance", then you'd better explain why your version would
> introduce high performance. You need a reason, as well as some numbers
> (percent change) to prove it: it's not that right to keep the numbers
> inside: I'm sure people outside intel are also willing and happy to know
> those numbers.
> 
> For this patch, I think it's more about the maintainability improvement
> but not performance: the performance tunning patches are done later
> after all.
> 
> Another example is, in patch 6, you said "It reduces CPU pipeline stall
> cycles significantly", but you didn't say why there is pipeline stall
> before and why your patch reduces it.
> 
> All those are important things that deserves some explanation. So, I'd
> ask you to re-visit all your patches in this set, to think what you
> could add to make the commit better and more informative.

Okay. I'll add a more detailed commit log.

> 
> Besides that, I think this patchset looks fine to me. I may just need
> another time to look it more carefully, then I think I can merge (v6).
> 
> BTW, thanks for the great work!
> 
>   --yliu
> 
> > This is the baseline version of the new code, more optimization will be
> > added in the following patches in this patch set.
> >
> > Signed-off-by: Zhihong Wang 
> > ---


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-09-21 Thread Wang, Zhihong


> -Original Message-
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Wednesday, September 21, 2016 4:50 PM
> To: Maxime Coquelin 
> Cc: Wang, Zhihong ; dev at dpdk.org;
> yuanhan.liu at linux.intel.com
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> Hi Maxime,
> 
> On 22 August 2016 at 16:11, Maxime Coquelin
>  wrote:
> > Hi Zhihong,
> >
> > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> >>
> >> This patch set optimizes the vhost enqueue function.
> >>
> ...
> 
> >
> > My setup consists of one host running a guest.
> > The guest generates as much 64bytes packets as possible using
> 
> Have you tested with other different packet size?
> My testing shows that performance is dropping when packet size is more
> than 256.


Hi Jianbo,

Thanks for reporting this.

 1. Are you running the vector frontend with mrg_rxbuf=off?

 2. Could you please specify what CPU you're running? Is it Haswell
or Ivy Bridge?

 3. What percentage of drop are you seeing?

This is expected, because I've already found the root cause and
the way to optimize it, but since it missed the v0 deadline and
requires changes in eal/memcpy, I postponed it to the next release.

After the upcoming optimization, the performance for packets larger
than 256B will improve, and the new code will be much faster than
the current code.


Thanks
Zhihong


> 
> > pktgen-dpdk. The hosts forwards received packets back to the guest
> > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> > physical CPUs.
> >
> > I tested it with and without your v1 patch, with and without
> > rx-mergeable feature turned ON.
> > Results are the average of 8 runs of 60 seconds:
> >
> > Rx-Mergeable ON : 7.72Mpps
> > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> > Rx-Mergeable OFF: 10.52Mpps
> > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> >
> > Regards,
> > Maxime


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-09-22 Thread Wang, Zhihong


> -Original Message-
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Wednesday, September 21, 2016 8:54 PM
> To: Wang, Zhihong 
> Cc: Maxime Coquelin ; dev at dpdk.org;
> yuanhan.liu at linux.intel.com
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 21 September 2016 at 17:27, Wang, Zhihong 
> wrote:
> >
> >
> >> -Original Message-
> >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> >> Sent: Wednesday, September 21, 2016 4:50 PM
> >> To: Maxime Coquelin 
> >> Cc: Wang, Zhihong ; dev at dpdk.org;
> >> yuanhan.liu at linux.intel.com
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> Hi Maxime,
> >>
> >> On 22 August 2016 at 16:11, Maxime Coquelin
> >>  wrote:
> >> > Hi Zhihong,
> >> >
> >> > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> >> >>
> >> >> This patch set optimizes the vhost enqueue function.
> >> >>
> >> ...
> >>
> >> >
> >> > My setup consists of one host running a guest.
> >> > The guest generates as much 64bytes packets as possible using
> >>
> >> Have you tested with other different packet size?
> >> My testing shows that performance is dropping when packet size is more
> >> than 256.
> >
> >
> > Hi Jianbo,
> >
> > Thanks for reporting this.
> >
> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> >
> >  2. Could you please specify what CPU you're running? Is it Haswell
> > or Ivy Bridge?
> >
> >  3. How many percentage of drop are you seeing?
> >
> > This is expected by me because I've already found the root cause and
> > the way to optimize it, but since it missed the v0 deadline and
> > requires changes in eal/memcpy, I postpone it to the next release.
> >
> > After the upcoming optimization the performance for packets larger
> > than 256 will be improved, and the new code will be much faster than
> > the current code.
> >
> 
> Sorry, I tested on an ARM server, but I wonder if there is the same
> issue for x86 platform.


For the mrg_rxbuf=off path there might be a slight drop for packets larger
than 256B (~3% for 512B and ~1% for 1024B), and no drop for other cases.

This is not a bug or issue; we only need to enhance memcpy to complete
the whole optimization, which should be done in a separate patch, but
unfortunately it misses this release window.


> 
> >> > pktgen-dpdk. The hosts forwards received packets back to the guest
> >> > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> >> > physical CPUs.
> >> >
> >> > I tested it with and without your v1 patch, with and without
> >> > rx-mergeable feature turned ON.
> >> > Results are the average of 8 runs of 60 seconds:
> >> >
> >> > Rx-Mergeable ON : 7.72Mpps
> >> > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> >> > Rx-Mergeable OFF: 10.52Mpps
> >> > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> >> >


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-09-22 Thread Wang, Zhihong


> -Original Message-
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Thursday, September 22, 2016 1:48 PM
> To: Yuanhan Liu 
> Cc: Wang, Zhihong ; Maxime Coquelin
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 22 September 2016 at 10:29, Yuanhan Liu 
> wrote:
> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> > My setup consists of one host running a guest.
> >> >> > The guest generates as much 64bytes packets as possible using
> >> >>
> >> >> Have you tested with other different packet size?
> >> >> My testing shows that performance is dropping when packet size is
> more
> >> >> than 256.
> >> >
> >> >
> >> > Hi Jianbo,
> >> >
> >> > Thanks for reporting this.
> >> >
> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> >> >
> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> 
> >> >  2. Could you please specify what CPU you're running? Is it Haswell
> >> > or Ivy Bridge?
> >> >
> It's an ARM server.
> 
> >> >  3. How many percentage of drop are you seeing?
> The testing result:
> size (bytes)  improvement (%)
> 64              3.92
> 128            11.51
> 256            24.16
> 512           -13.79
> 1024          -22.51
> 1500          -12.22
> A correction is that performance is dropping if byte size is larger than 512.


Jianbo,

Could you please verify whether this patch really causes the enqueue perf
to drop?

You can test the enqueue path only by setting the guest to do rxonly, and
compare the mpps with "show port stats all" in the guest.


Thanks
Zhihong

> 
> >> >
> >> > This is expected by me because I've already found the root cause and
> >> > the way to optimize it, but since it missed the v0 deadline and
> >> > requires changes in eal/memcpy, I postpone it to the next release.
> >> >
> >> > After the upcoming optimization the performance for packets larger
> >> > than 256 will be improved, and the new code will be much faster than
> >> > the current code.
> >> >
> >>
> >> Sorry, I tested on an ARM server, but I wonder if there is the same
> >> issue for x86 platform.
> >
> > Would you please provide more details? Say, answer the two left
> > questions from Zhihong?
> >
> > Thanks.
> >
> > --yliu


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-09-22 Thread Wang, Zhihong


> -Original Message-
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Thursday, September 22, 2016 5:02 PM
> To: Wang, Zhihong 
> Cc: Yuanhan Liu ; Maxime Coquelin
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 22 September 2016 at 14:58, Wang, Zhihong 
> wrote:
> >
> >
> >> -Original Message-
> >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> >> Sent: Thursday, September 22, 2016 1:48 PM
> >> To: Yuanhan Liu 
> >> Cc: Wang, Zhihong ; Maxime Coquelin
> >> ; dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 22 September 2016 at 10:29, Yuanhan Liu
> 
> >> wrote:
> >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> >> > My setup consists of one host running a guest.
> >> >> >> > The guest generates as much 64bytes packets as possible using
> >> >> >>
> >> >> >> Have you tested with other different packet size?
> >> >> >> My testing shows that performance is dropping when packet size is
> >> more
> >> >> >> than 256.
> >> >> >
> >> >> >
> >> >> > Hi Jianbo,
> >> >> >
> >> >> > Thanks for reporting this.
> >> >> >
> >> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> >> >> >
> >> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> >>
> >> >> >  2. Could you please specify what CPU you're running? Is it Haswell
> >> >> > or Ivy Bridge?
> >> >> >
> >> It's an ARM server.
> >>
> >> >> >  3. How many percentage of drop are you seeing?
> >> The testing result:
> >> size (bytes)  improvement (%)
> >> 64              3.92
> >> 128            11.51
> >> 256            24.16
> >> 512           -13.79
> >> 1024          -22.51
> >> 1500          -12.22
> >> A correction is that performance is dropping if byte size is larger than 
> >> 512.
> >
> >
> > Jianbo,
> >
> > Could you please verify does this patch really cause enqueue perf to drop?
> >
> > You can test the enqueue path only by set guest to do rxonly, and compare
> > the mpps by show port stats all in the guest.
> >
> >
> Tested with testpmd, host: txonly, guest: rxonly
> size (bytes)  improvement (%)
> 64              4.12
> 128             6
> 256             2.65
> 512            -1.12
> 1024           -7.02



Your numbers are a little bit hard for me to understand. This patch's
optimization contains 2 parts:

 1. ring operation: works for both mrg_rxbuf on and off

 2. remote write ordering: works for mrg_rxbuf=on only

So, for mrg_rxbuf=off, if this patch is good for 64B packets, then it
shouldn't do anything bad for larger packets.

This is the gain on x86 platform: host iofwd between nic and vhost,
guest rxonly.

nic2vm  enhancement
64      21.83%
128     16.97%
256      6.34%
512      0.01%
1024     0.00%

I suspect there's some complication in ARM's micro-arch.

Could you try v6 and apply all patches except the last one:
[PATCH v6 6/6] vhost: optimize cache access

And see if there's still perf drop?


Thanks
Zhihong



[dpdk-dev] [PATCH v6 2/6] vhost: rewrite enqueue

2016-09-22 Thread Wang, Zhihong


> -Original Message-
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Thursday, September 22, 2016 5:58 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Maxime Coquelin ;
> Yuanhan Liu ; Thomas Monjalon
> 
> Subject: Re: [dpdk-dev] [PATCH v6 2/6] vhost: rewrite enqueue
> 
> On 20 September 2016 at 10:00, Zhihong Wang 
> wrote:
> > This patch implements the vhost logic from scratch into a single function
> > to improve maintainability. This is the baseline version of the new code,
> > more optimization will be added in the following patches in this patch set.
> >
> > In the existing code there're 2 callbacks for vhost enqueue:
> >
> >  *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
> >
> >  *  virtio_dev_rx for mrg_rxbuf turned off cases.
> >
> > Having 2 callback paths increases maintenance effort. Also, the
> performance
> > of the existing code is not optimal, especially when the mrg_rxbuf feature
> > turned on.
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> > Changes in v6:
> .
> 
> > -/*
> > - * Returns -1 on fail, 0 on success
> > - */
> > -static inline int
> > -reserve_avail_buf_mergeable(struct vhost_virtqueue *vq, uint32_t size,
> > -   uint16_t *end, struct buf_vector *buf_vec)
> > +uint16_t
> > +rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
> > +   struct rte_mbuf **pkts, uint16_t count)
> >  {
> > -   uint16_t cur_idx;
> > +   struct vhost_virtqueue *vq;
> > +   struct virtio_net *dev;
> > +   uint32_t is_mrg_rxbuf = 0;
> > +   uint32_t pkt_idx  = 0;
> > +   uint32_t pkt_left = count;
> 
> Is pkt_left really needed?

It's a matter of coding style since there's no underlying difference.
I prefer this way personally.

> 
> > uint16_t avail_idx;
> > -   uint32_t allocated = 0;
> > -   uint32_t vec_idx = 0;
> > -   uint16_t tries = 0;
> 


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-09-23 Thread Wang, Zhihong


> -Original Message-
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Thursday, September 22, 2016 10:42 PM
> To: Wang, Zhihong 
> Cc: Yuanhan Liu ; Maxime Coquelin
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 22 September 2016 at 18:04, Wang, Zhihong 
> wrote:
> >
> >
> >> -Original Message-
> >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> >> Sent: Thursday, September 22, 2016 5:02 PM
> >> To: Wang, Zhihong 
> >> Cc: Yuanhan Liu ; Maxime Coquelin
> >> ; dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 22 September 2016 at 14:58, Wang, Zhihong 
> >> wrote:
> >> >
> >> >
> >> >> -----Original Message-
> >> >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> >> >> Sent: Thursday, September 22, 2016 1:48 PM
> >> >> To: Yuanhan Liu 
> >> >> Cc: Wang, Zhihong ; Maxime Coquelin
> >> >> ; dev at dpdk.org
> >> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >> >>
> >> >> On 22 September 2016 at 10:29, Yuanhan Liu
> >> 
> >> >> wrote:
> >> >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> >> >> > My setup consists of one host running a guest.
> >> >> >> >> > The guest generates as much 64bytes packets as possible using
> >> >> >> >>
> >> >> >> >> Have you tested with other different packet size?
> >> >> >> >> My testing shows that performance is dropping when packet size is
> >> >> more
> >> >> >> >> than 256.
> >> >> >> >
> >> >> >> >
> >> >> >> > Hi Jianbo,
> >> >> >> >
> >> >> >> > Thanks for reporting this.
> >> >> >> >
> >> >> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> >> >> >> >
> >> >> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> >> >>
> >> >> >> >  2. Could you please specify what CPU you're running? Is it Haswell
> >> >> >> > or Ivy Bridge?
> >> >> >> >
> >> >> It's an ARM server.
> >> >>
> >> >> >> >  3. How many percentage of drop are you seeing?
> >> >> The testing result:
> >> >> size (bytes)  improvement (%)
> >> >> 64              3.92
> >> >> 128            11.51
> >> >> 256            24.16
> >> >> 512           -13.79
> >> >> 1024          -22.51
> >> >> 1500          -12.22
> >> >> A correction is that performance is dropping if byte size is larger 
> >> >> than 512.
> >> >
> >> >
> >> > Jianbo,
> >> >
> >> > Could you please verify does this patch really cause enqueue perf to 
> >> > drop?
> >> >
> >> > You can test the enqueue path only by set guest to do rxonly, and compare
> >> > the mpps by show port stats all in the guest.
> >> >
> >> >
> >> Tested with testpmd, host: txonly, guest: rxonly
> >> size (bytes)  improvement (%)
> >> 64              4.12
> >> 128             6
> >> 256             2.65
> >> 512            -1.12
> >> 1024           -7.02
> >
> >
> >
> > I think your number is little bit hard to understand for me, this patch's
> > optimization contains 2 parts:
> >
> >  1. ring operation: works for both mrg_rxbuf on and off
> >
> >  2. remote write ordering: works for mrg_rxbuf=on only
> >
> > So, for mrg_rxbuf=off, if this patch is good for 64B packets, then it
> > shouldn't do anything bad for larger packets.
> >
> > This is the gain on x86 platform: host iofwd between nic and vhost,
> > guest rxonly.
> >
> > nic2vm  enhancement
> > 64      21.83%
> > 128     16.97%
> > 256      6.34%
> > 512      0.01%
> > 1024     0.00%
> >
> I bootup a VM with 2 virtual port, and stress the traffic between them.
> First, I stressed with pktgen-dpdk in VM, and did iofwd in host.
> Then, as you told, I did rxonly in VM, and txonly in host.
> 
> > I suspect there's some complication in ARM's micro-arch.
> >
> > Could you try v6 and apply all patches except the last one:
> > [PATCH v6 6/6] vhost: optimize cache access
> >
> > And see if there's still perf drop?
> >
> The last patch can improve the performance. The drop is actually
> caused by the second patch.


This is expected because the 2nd patch is just a baseline and all optimization
patches are organized in the rest of this patch set.

I think you can do a bottleneck analysis on ARM to see what's slowing down
the perf; there might be some micro-arch complications there, most likely
in memcpy.

Do you use glibc's memcpy? I suggest hand-crafting it on your own.

Could you publish the mrg_rxbuf=on data also, since it's the more widely
used case in terms of spec integrity?
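
For reference, a minimal sketch of the kind of hand-crafted inner copy loop
I mean, written with NEON intrinsics since this is an ARM server. The helper
names and the multiple-of-64-bytes assumption are illustrative only, not a
drop-in rte_memcpy replacement:

#include <arm_neon.h>
#include <stdint.h>

/* copy one 64B block with four 16B vector loads/stores */
static inline void
copy64_neon(uint8_t *dst, const uint8_t *src)
{
        vst1q_u8(dst, vld1q_u8(src));
        vst1q_u8(dst + 16, vld1q_u8(src + 16));
        vst1q_u8(dst + 32, vld1q_u8(src + 32));
        vst1q_u8(dst + 48, vld1q_u8(src + 48));
}

/*
 * Copy n bytes where n is a multiple of 64; a real implementation also
 * needs a tail path for the remaining bytes.
 */
static inline void
copy_blocks_neon(uint8_t *dst, const uint8_t *src, uint32_t n)
{
        uint32_t i;

        for (i = 0; i < n; i += 64)
                copy64_neon(dst + i, src + i);
}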


Thanks
Zhihong


> 
> Jianbo


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-09-25 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Friday, September 23, 2016 9:41 PM
> To: Jianbo Liu 
> Cc: dev at dpdk.org; Wang, Zhihong ; Yuanhan Liu
> ; Maxime Coquelin
> 
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> 2016-09-23 18:41, Jianbo Liu:
> > On 23 September 2016 at 10:56, Wang, Zhihong 
> wrote:
> > .
> > > This is expected because the 2nd patch is just a baseline and all 
> > > optimization
> > > patches are organized in the rest of this patch set.
> > >
> > > I think you can do bottleneck analysis on ARM to see what's slowing down 
> > > the
> > > perf, there might be some micro-arch complications there, mostly likely in
> > > memcpy.
> > >
> > > Do you use glibc's memcpy? I suggest to hand-crafted it on your own.
> > >
> > > Could you publish the mrg_rxbuf=on data also? Since it's more widely used
> > > in terms of spec integrity.
> > >
> > I don't think it will be helpful for you, considering the differences
> > between x86 and arm.


Hi Jianbo,

This patch does help on ARM for small packets like 64B-sized ones,
which actually proves the similarity between x86 and ARM in terms
of the caching optimization in this patch.

My estimation is based on:

 1. The last patch is for mrg_rxbuf=on, and since you said it helps
    perf, we can ignore it for now when we discuss mrg_rxbuf=off

 2. Vhost enqueue perf =
    Ring overhead + Virtio header overhead + Data memcpy overhead

 3. This patch helps small-packet traffic, which means it helps
    ring + virtio header operations

 4. So, when you say perf drops when packet size is larger than 512B,
    this is most likely caused by memcpy on ARM not working well
    with this patch

I'm not saying glibc's memcpy is not good enough; it's just that
this is a rather special use case. And since we see that specialized
memcpy + this patch gives significantly better performance than other
combinations on x86, we suggest hand-crafting a specialized memcpy
for it.
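To give an idea of what "specialized" means here, a minimal sketch for ARMv8
(illustrative only -- the helper name and the NEON intrinsics usage are my
own, not part of this patch set): for the fixed-size copies in the enqueue
hot path, all the size/alignment branching of a general-purpose memcpy can
be dropped.

#include <arm_neon.h>
#include <stdint.h>

/* Copy 64 bytes with four 128-bit NEON loads/stores, no size branching.
 * A building block for the enqueue hot path; a real implementation would
 * add variants for other sizes and for the trailing bytes. */
static inline void
copy64(uint8_t *dst, const uint8_t *src)
{
	vst1q_u8(dst,      vld1q_u8(src));
	vst1q_u8(dst + 16, vld1q_u8(src + 16));
	vst1q_u8(dst + 32, vld1q_u8(src + 32));
	vst1q_u8(dst + 48, vld1q_u8(src + 48));
}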

Of course on ARM this is still just my speculation, and we need to
either prove it or find the actual root cause.

It would be **REALLY HELPFUL** if you could help test this patch on
ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful
on ARM at all, since mrg_rxbuf=on is the more widely used case.


Thanks
Zhihong


> > So please move on with this patchset...
> 
> Jianbo,
> I don't understand.
> You said that the 2nd patch is a regression:
> -   volatile uint16_t   last_used_idx;
> +   uint16_tlast_used_idx;
> 
> And the overrall series lead to performance regression
> for packets > 512 B, right?
> But we don't know wether you have tested the v6 or not.
> 
> Zhihong talked about some improvements possible in rte_memcpy.
> ARM64 is using libc memcpy in rte_memcpy.
> 
> Now you seem to give up.
> Does it mean you accept having a regression in 16.11 release?
> Are you working on rte_memcpy?


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-09-26 Thread Wang, Zhihong


> -Original Message-
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Monday, September 26, 2016 1:13 PM
> To: Wang, Zhihong 
> Cc: Thomas Monjalon ; dev at dpdk.org; Yuanhan
> Liu ; Maxime Coquelin
> 
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 25 September 2016 at 13:41, Wang, Zhihong 
> wrote:
> >
> >
> >> -Original Message-
> >> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> >> Sent: Friday, September 23, 2016 9:41 PM
> >> To: Jianbo Liu 
> >> Cc: dev at dpdk.org; Wang, Zhihong ; Yuanhan Liu
> >> ; Maxime Coquelin
> >> 
> 
> > This patch does help in ARM for small packets like 64B sized ones,
> > this actually proves the similarity between x86 and ARM in terms
> > of caching optimization in this patch.
> >
> > My estimation is based on:
> >
> >  1. The last patch are for mrg_rxbuf=on, and since you said it helps
> > perf, we can ignore it for now when we discuss mrg_rxbuf=off
> >
> >  2. Vhost enqueue perf =
> > Ring overhead + Virtio header overhead + Data memcpy overhead
> >
> >  3. This patch helps small packets traffic, which means it helps
> > ring + virtio header operations
> >
> >  4. So, when you say perf drop when packet size larger than 512B,
> > this is most likely caused by memcpy in ARM not working well
> > with this patch
> >
> > I'm not saying glibc's memcpy is not good enough, it's just that
> > this is a rather special use case. And since we see specialized
> > memcpy + this patch give better performance than other combinations
> > significantly on x86, we suggest to hand-craft a specialized memcpy
> > for it.
> >
> > Of course on ARM this is still just my speculation, and we need to
> > either prove it or find the actual root cause.
> >
> > It can be **REALLY HELPFUL** if you could help to test this patch on
> > ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful
> > to ARM at all, since mrg_rxbuf=on the more widely used cases.
> >
> Actually it's worse than mrg_rxbuf=off.

I mean compare the perf of original vs. original + patch with
mrg_rxbuf turned on. Is there any perf improvement?



[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-09-26 Thread Wang, Zhihong


> -Original Message-
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Monday, September 26, 2016 1:39 PM
> To: Wang, Zhihong 
> Cc: Thomas Monjalon ; dev at dpdk.org; Yuanhan
> Liu ; Maxime Coquelin
> 
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 26 September 2016 at 13:25, Wang, Zhihong 
> wrote:
> >
> >
> >> -Original Message-
> >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> >> Sent: Monday, September 26, 2016 1:13 PM
> >> To: Wang, Zhihong 
> >> Cc: Thomas Monjalon ; dev at dpdk.org;
> Yuanhan
> >> Liu ; Maxime Coquelin
> >> 
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 25 September 2016 at 13:41, Wang, Zhihong 
> >> wrote:
> >> >
> >> >
> >> >> -Original Message-----
> >> >> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> >> >> Sent: Friday, September 23, 2016 9:41 PM
> >> >> To: Jianbo Liu 
> >> >> Cc: dev at dpdk.org; Wang, Zhihong ; Yuanhan
> Liu
> >> >> ; Maxime Coquelin
> >> >> 
> >> 
> >> > This patch does help in ARM for small packets like 64B sized ones,
> >> > this actually proves the similarity between x86 and ARM in terms
> >> > of caching optimization in this patch.
> >> >
> >> > My estimation is based on:
> >> >
> >> >  1. The last patch are for mrg_rxbuf=on, and since you said it helps
> >> > perf, we can ignore it for now when we discuss mrg_rxbuf=off
> >> >
> >> >  2. Vhost enqueue perf =
> >> > Ring overhead + Virtio header overhead + Data memcpy overhead
> >> >
> >> >  3. This patch helps small packets traffic, which means it helps
> >> > ring + virtio header operations
> >> >
> >> >  4. So, when you say perf drop when packet size larger than 512B,
> >> > this is most likely caused by memcpy in ARM not working well
> >> > with this patch
> >> >
> >> > I'm not saying glibc's memcpy is not good enough, it's just that
> >> > this is a rather special use case. And since we see specialized
> >> > memcpy + this patch give better performance than other combinations
> >> > significantly on x86, we suggest to hand-craft a specialized memcpy
> >> > for it.
> >> >
> >> > Of course on ARM this is still just my speculation, and we need to
> >> > either prove it or find the actual root cause.
> >> >
> >> > It can be **REALLY HELPFUL** if you could help to test this patch on
> >> > ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful
> >> > to ARM at all, since mrg_rxbuf=on the more widely used cases.
> >> >
> >> Actually it's worse than mrg_rxbuf=off.
> >
> > I mean compare the perf of original vs. original + patch with
> > mrg_rxbuf turned on. Is there any perf improvement?
> >
> Yes, orig + patch + on is better than orig + on, but orig + patch + on
> is worse than orig + patch + off.


Hi Jianbo,

That's the way it is for virtio; if you compare with the current enqueue
code, the mrg-on perf is even slower.

We should compare:

 1. mrg on: orig vs. orig + patch

 2. mrg off: orig vs. orig + patch

There's more memory touched in the frontend, and that brings down the
performance when mrg is on.

Finally, even though mrg on is slower, it's still the mainstream use case
as far as I know.


Thanks
Zhihong



[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-09-27 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Tuesday, September 27, 2016 6:21 PM
> To: Jianbo Liu 
> Cc: Wang, Zhihong ; Maxime Coquelin
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On Thu, Sep 22, 2016 at 05:01:41PM +0800, Jianbo Liu wrote:
> > On 22 September 2016 at 14:58, Wang, Zhihong 
> wrote:
> > >
> > >
> > >> -Original Message-
> > >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> > >> Sent: Thursday, September 22, 2016 1:48 PM
> > >> To: Yuanhan Liu 
> > >> Cc: Wang, Zhihong ; Maxime Coquelin
> > >> ; dev at dpdk.org
> > >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> > >>
> > >> On 22 September 2016 at 10:29, Yuanhan Liu  > >> linux.intel.com>
> > >> wrote:
> > >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> > >> >> >> > My setup consists of one host running a guest.
> > >> >> >> > The guest generates as much 64bytes packets as possible using
> > >> >> >>
> > >> >> >> Have you tested with other different packet size?
> > >> >> >> My testing shows that performance is dropping when packet size is
> > >> more
> > >> >> >> than 256.
> > >> >> >
> > >> >> >
> > >> >> > Hi Jianbo,
> > >> >> >
> > >> >> > Thanks for reporting this.
> > >> >> >
> > >> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> > >> >> >
> > >> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> > >>
> > >> >> >  2. Could you please specify what CPU you're running? Is it Haswell
> > >> >> > or Ivy Bridge?
> > >> >> >
> > >> It's an ARM server.
> > >>
> > >> >> >  3. How many percentage of drop are you seeing?
> > >> The testing result:
> > >> size (bytes) improvement (%)
> > >> 64      3.92
> > >> 128    11.51
> > >> 256    24.16
> > >> 512   -13.79
> > >> 1024  -22.51
> > >> 1500  -12.22
> > >> A correction is that performance is dropping if byte size is larger than 
> > >> 512.
> > >
> > >
> > > Jianbo,
> > >
> > > Could you please verify does this patch really cause enqueue perf to drop?
> > >
> > > You can test the enqueue path only by set guest to do rxonly, and compare
> > > the mpps by show port stats all in the guest.
> > >
> > >
> > Tested with testpmd, host: txonly, guest: rxonly
> > size (bytes) improvement (%)
> > 64     4.12
> > 128    6
> > 256    2.65
> > 512   -1.12
> > 1024  -7.02
> 
> There is a difference between Zhihong's code and the old I spotted in
> the first time: Zhihong removed the avail_idx prefetch. I understand
> the prefetch becomes a bit tricky when mrg-rx code path is considered;
> thus, I didn't comment on that.
> 
> That's one of the difference that, IMO, could drop a regression. I then
> finally got a chance to add it back.
> 
> A rough test shows it improves the performance of 1400B packet size greatly
> in the "txonly in host and rxonly in guest" case: +33% is the number I get
> with my test server (Ivybridge).

Thanks Yuanhan! I'll validate this on x86.
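
The prefetch being discussed is roughly the following (a sketch; field
names follow lib/librte_vhost, and the exact placement in the code may
differ):

/* Prefetch the avail ring slot for the next descriptor index we will
 * read, so the cache miss overlaps with the copy work already queued. */
rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);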

> 
> I guess this might/would help your case as well. Mind to have a test
> and tell me the results?
> 
> BTW, I made it in rush; I haven't tested the mrg-rx code path yet.
> 
> Thanks.
> 
>   --yliu


Re: [dpdk-dev] [PATCH v5 07/14] ring: make bulk and burst fn return vals consistent

2017-04-12 Thread Wang, Zhihong
Hi Bruce,

This patch changes the behavior and causes some existing code to
malfunction, e.g. bond_ethdev_stop() will get stuck here:

while (rte_ring_dequeue(port->rx_ring, &pkt) != -ENOENT)
rte_pktmbuf_free(pkt);

Another example in test/test/virtual_pmd.c: virtual_ethdev_stop().
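
A drain loop that keys off success rather than a specific error code would
keep working under either convention -- a minimal sketch, assuming
rte_ring_dequeue still returns 0 on success:

struct rte_mbuf *pkt;

/* Drain until a dequeue fails (ring empty), without depending on the
 * exact negative value returned on failure. */
while (rte_ring_dequeue(port->rx_ring, (void **)&pkt) == 0)
	rte_pktmbuf_free(pkt);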


Thanks
Zhihong

> -Original Message-
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Wednesday, March 29, 2017 9:10 PM
> To: olivier.m...@6wind.com
> Cc: dev@dpdk.org; Richardson, Bruce 
> Subject: [dpdk-dev] [PATCH v5 07/14] ring: make bulk and burst fn return
> vals consistent
> 
> The bulk fns for rings returns 0 for all elements enqueued and negative
> for no space. Change that to make them consistent with the burst functions
> in returning the number of elements enqueued/dequeued, i.e. 0 or N.
> This change also allows the return value from enq/deq to be used directly
> without a branch for error checking.
> 
> Signed-off-by: Bruce Richardson 
> Reviewed-by: Yuanhan Liu 
> Acked-by: Olivier Matz 
> ---
>  doc/guides/rel_notes/release_17_05.rst |  11 +++
>  doc/guides/sample_app_ug/server_node_efd.rst   |   2 +-
>  examples/load_balancer/runtime.c   |  16 ++-
>  .../client_server_mp/mp_client/client.c|   8 +-
>  .../client_server_mp/mp_server/main.c  |   2 +-
>  examples/qos_sched/app_thread.c|   8 +-
>  examples/server_node_efd/node/node.c   |   2 +-
>  examples/server_node_efd/server/main.c |   2 +-
>  lib/librte_mempool/rte_mempool_ring.c  |  12 ++-
>  lib/librte_ring/rte_ring.h | 109 
> +++--
>  test/test-pipeline/pipeline_hash.c |   2 +-
>  test/test-pipeline/runtime.c   |   8 +-
>  test/test/test_ring.c  |  46 +
>  test/test/test_ring_perf.c |   8 +-
>  14 files changed, 106 insertions(+), 130 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/release_17_05.rst
> b/doc/guides/rel_notes/release_17_05.rst
> index 084b359..6da2612 100644
> --- a/doc/guides/rel_notes/release_17_05.rst
> +++ b/doc/guides/rel_notes/release_17_05.rst
> @@ -137,6 +137,17 @@ API Changes
>* removed the build-time setting
> ``CONFIG_RTE_RING_PAUSE_REP_COUNT``
>* removed the function ``rte_ring_set_water_mark`` as part of a general
>  removal of watermarks support in the library.
> +  * changed the return value of the enqueue and dequeue bulk functions to
> +match that of the burst equivalents. In all cases, ring functions which
> +operate on multiple packets now return the number of elements
> enqueued
> +or dequeued, as appropriate. The updated functions are:
> +
> +- ``rte_ring_mp_enqueue_bulk``
> +- ``rte_ring_sp_enqueue_bulk``
> +- ``rte_ring_enqueue_bulk``
> +- ``rte_ring_mc_dequeue_bulk``
> +- ``rte_ring_sc_dequeue_bulk``
> +- ``rte_ring_dequeue_bulk``
> 
>  ABI Changes
>  ---
> diff --git a/doc/guides/sample_app_ug/server_node_efd.rst
> b/doc/guides/sample_app_ug/server_node_efd.rst
> index 9b69cfe..e3a63c8 100644
> --- a/doc/guides/sample_app_ug/server_node_efd.rst
> +++ b/doc/guides/sample_app_ug/server_node_efd.rst
> @@ -286,7 +286,7 @@ repeated infinitely.
> 
>  cl = &nodes[node];
>  if (rte_ring_enqueue_bulk(cl->rx_q, (void **)cl_rx_buf[node].buffer,
> -cl_rx_buf[node].count) != 0){
> +cl_rx_buf[node].count) != cl_rx_buf[node].count){
>  for (j = 0; j < cl_rx_buf[node].count; j++)
>  rte_pktmbuf_free(cl_rx_buf[node].buffer[j]);
>  cl->stats.rx_drop += cl_rx_buf[node].count;
> diff --git a/examples/load_balancer/runtime.c
> b/examples/load_balancer/runtime.c
> index 6944325..82b10bc 100644
> --- a/examples/load_balancer/runtime.c
> +++ b/examples/load_balancer/runtime.c
> @@ -146,7 +146,7 @@ app_lcore_io_rx_buffer_to_send (
>   (void **) lp->rx.mbuf_out[worker].array,
>   bsz);
> 
> - if (unlikely(ret == -ENOBUFS)) {
> + if (unlikely(ret == 0)) {
>   uint32_t k;
>   for (k = 0; k < bsz; k ++) {
>   struct rte_mbuf *m = lp-
> >rx.mbuf_out[worker].array[k];
> @@ -312,7 +312,7 @@ app_lcore_io_rx_flush(struct app_lcore_params_io
> *lp, uint32_t n_workers)
>   (void **) lp->rx.mbuf_out[worker].array,
>   lp->rx.mbuf_out[worker].n_mbufs);
> 
> - if (unlikely(ret < 0)) {
> + if (unlikely(ret == 0)) {
>   uint32_t k;
>   for (k = 0; k < lp->rx.mbuf_out[worker].n_mbufs; k
> ++) {
>   struct rte_mbuf *pkt_to_free = lp-
> >rx.mbuf_out[worker].array[k];
> @@ -349,9 +349,8 @@ app_lcore_io_tx(
>   (void **) &lp-
> >tx.mbuf_out[port].array[n_mbufs],
>  

Re: [dpdk-dev] [PATCH] config: make AVX and AVX512 configurable

2017-04-27 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:tho...@monjalon.net]
> Sent: Thursday, April 27, 2017 5:08 PM
> To: Wang, Zhihong ; Richardson, Bruce
> 
> Cc: dev@dpdk.org; yuanhan@linux.intel.com
> Subject: Re: [dpdk-dev] [PATCH] config: make AVX and AVX512 configurable
> 
> 27/04/2017 18:34, Zhihong Wang:
> > Making AVX and AVX512 configurable is useful for performance and power
> > testing.
> >
> > The similar kernel patch at https://patchwork.kernel.org/patch/9618883/.
> [...]
> > +#
> > +# Recognize/ignore the AVX/AVX512 CPU flags for performance/power
> testing
> > +#
> > +CONFIG_RTE_ENABLE_AVX=y
> > +CONFIG_RTE_ENABLE_AVX512=n
> 
> It is disabling AVX512 in default configuration.
> Please explain this behaviour change.

Though AVX512 rte_memcpy has been in DPDK for quite a while, it's still
unproven in hardware with rich use cases. Mark it as experimental for
now; users can enable it for their own testing.
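
The knob would then gate the AVX512 path at build time, roughly as below
(illustrative only -- the actual guard macro in rte_memcpy.h may be named
differently):

/* In rte_memcpy.h: compile the AVX512 copy path only when the CPU flag
 * is present and the experimental config option is turned on. */
#if defined(RTE_MACHINE_CPUFLAG_AVX512F) && defined(RTE_ENABLE_AVX512)
#define RTE_MEMCPY_AVX512
#endif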

Will enable it with enough field tests and possible optimization.

Should I add the explanation in commit log, or comments in the source,
or both?


Re: [dpdk-dev] [PATCH] config: make AVX and AVX512 configurable

2017-04-27 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:tho...@monjalon.net]
> Sent: Thursday, April 27, 2017 5:20 PM
> To: Wang, Zhihong 
> Cc: Richardson, Bruce ; dev@dpdk.org;
> yuanhan@linux.intel.com
> Subject: Re: [dpdk-dev] [PATCH] config: make AVX and AVX512 configurable
> 
> 27/04/2017 11:18, Wang, Zhihong:
> > From: Thomas Monjalon [mailto:tho...@monjalon.net]
> > > 27/04/2017 18:34, Zhihong Wang:
> > > > Making AVX and AVX512 configurable is useful for performance and
> power
> > > > testing.
> > > >
> > > > The similar kernel patch at
> https://patchwork.kernel.org/patch/9618883/.
> > > [...]
> > > > +#
> > > > +# Recognize/ignore the AVX/AVX512 CPU flags for
> performance/power
> > > testing
> > > > +#
> > > > +CONFIG_RTE_ENABLE_AVX=y
> > > > +CONFIG_RTE_ENABLE_AVX512=n
> > >
> > > It is disabling AVX512 in default configuration.
> > > Please explain this behaviour change.
> >
> > Though AVX512 rte_memcpy has been in DPDK for quite a while it's still
> > unproven in hardware with rich use cases. Mark it as experimental for
> > now, user can enable it for their own testing.
> >
> > Will enable it with enough field tests and possible optimization.
> >
> > Should I add the explanation in commit log, or comments in the source,
> > or both?
> 
> Yes please, add the explanation in the commit log and experimental
> comment
> in the config.

Thanks a lot! It's included in v2.


Re: [dpdk-dev] [PATCH] vhost: support rx_queue_count

2017-05-23 Thread Wang, Zhihong


> -Original Message-
> From: Jens Freimann [mailto:jfrei...@redhat.com]
> Sent: Tuesday, May 23, 2017 7:54 PM
> To: Wang, Zhihong 
> Cc: dev@dpdk.org; yuanhan@linux.intel.com
> Subject: Re: [dpdk-dev] [PATCH] vhost: support rx_queue_count
> 
> On Mon, May 22, 2017 at 04:01:08PM -0400, Zhihong Wang wrote:
> > This patch implements the ops rx_queue_count for vhost PMD by adding
> > a helper function rte_vhost_rx_queue_count in vhost lib.
> >
> > The ops ops rx_queue_count gets vhost RX queue avail count and helps
> 
> s/ops ops/ops/ ?

Thanks a lot!

> 
> > to understand the queue fill level.
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> >  drivers/net/vhost/rte_eth_vhost.c  | 13 +
> >  lib/librte_vhost/rte_vhost.h   | 12 
> >  lib/librte_vhost/rte_vhost_version.map |  7 +++
> >  lib/librte_vhost/vhost.c   | 23 +++
> >  4 files changed, 55 insertions(+)
> >
> > diff --git a/drivers/net/vhost/rte_eth_vhost.c
> b/drivers/net/vhost/rte_eth_vhost.c
> > index 257bf6d..e3a3fe0 100644
> > --- a/drivers/net/vhost/rte_eth_vhost.c
> > +++ b/drivers/net/vhost/rte_eth_vhost.c
> > @@ -973,6 +973,18 @@ eth_link_update(struct rte_eth_dev *dev
> __rte_unused,
> > return 0;
> >  }
> >
> > +static uint32_t
> > +eth_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
> > +{
> > +   struct vhost_queue *vq;
> > +
> > +   vq = dev->data->rx_queues[rx_queue_id];
> > +   if (!vq)
> 
> nitpick, but it should be "if (vq == NULL" according to the coding style guide
> 
> > +   return 0;
> > +
> > +   return rte_vhost_rx_queue_count(vq->vid, vq->virtqueue_id);
> > +}
> > +
> >  static const struct eth_dev_ops ops = {
> > .dev_start = eth_dev_start,
> > .dev_stop = eth_dev_stop,
> > @@ -984,6 +996,7 @@ static const struct eth_dev_ops ops = {
> > .rx_queue_release = eth_queue_release,
> > .tx_queue_release = eth_queue_release,
> > .tx_done_cleanup = eth_tx_done_cleanup,
> > +   .rx_queue_count = eth_rx_queue_count,
> > .link_update = eth_link_update,
> > .stats_get = eth_stats_get,
> > .stats_reset = eth_stats_reset,
> > diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> > index 605e47c..f64ed20 100644
> > --- a/lib/librte_vhost/rte_vhost.h
> > +++ b/lib/librte_vhost/rte_vhost.h
> > @@ -432,6 +432,18 @@ int rte_vhost_get_mem_table(int vid, struct
> rte_vhost_memory **mem);
> >  int rte_vhost_get_vhost_vring(int vid, uint16_t vring_idx,
> >   struct rte_vhost_vring *vring);
> >
> > +/**
> > + * Get vhost RX queue avail count.
> > + *
> > + * @param vid
> > + *  vhost device ID
> > + * @param qid
> > + *  virtio queue index in mq case
> > + * @return
> > + *  num of desc available
> > + */
> > +uint32_t rte_vhost_rx_queue_count(int vid, uint16_t qid);
> > +
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/librte_vhost/rte_vhost_version.map
> b/lib/librte_vhost/rte_vhost_version.map
> > index 0785873..1e70495 100644
> > --- a/lib/librte_vhost/rte_vhost_version.map
> > +++ b/lib/librte_vhost/rte_vhost_version.map
> > @@ -45,3 +45,10 @@ DPDK_17.05 {
> > rte_vhost_log_write;
> >
> >  } DPDK_16.07;
> > +
> > +DPDK_17.08 {
> > +   global:
> > +
> > +   rte_vhost_rx_queue_count;
> > +
> > +} DPDK_17.05;
> > diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> > index 0b19d2e..140d2ae 100644
> > --- a/lib/librte_vhost/vhost.c
> > +++ b/lib/librte_vhost/vhost.c
> > @@ -475,3 +475,26 @@ rte_vhost_log_used_vring(int vid, uint16_t
> vring_idx,
> >
> > vhost_log_used_vring(dev, vq, offset, len);
> >  }
> > +
> > +uint32_t
> > +rte_vhost_rx_queue_count(int vid, uint16_t qid)
> > +{
> > +   struct virtio_net *dev;
> > +   struct vhost_virtqueue *vq;
> > +
> > +   dev = get_device(vid);
> > +   if (!dev)
> 
> same here
> 
> > +   return 0;
> > +
> > +   if (unlikely(qid >= dev->nr_vring || (qid & 1) == 0)) {
> > +   RTE_LOG(ERR, VHOST_DATA, "(%d) %s: invalid virtqueue
> idx %d.\n",
> > +   dev->vid, __func__, qid);
> > +   return 0;
> > +   }
> > +
> > +   vq = dev->virtqueue[qid];
> 
> check for vq == NULL?
> 
> regards,
> Jens



Re: [dpdk-dev] [PATCH] vhost: support rx_queue_count

2017-05-25 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan@linux.intel.com]
> Sent: Wednesday, May 24, 2017 4:43 PM
> To: Jens Freimann 
> Cc: Wang, Zhihong ; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] vhost: support rx_queue_count
> 
> On Wed, May 24, 2017 at 10:36:01AM +0200, Jens Freimann wrote:
> > On Wed, May 24, 2017 at 04:14:19PM +0800, Yuanhan Liu wrote:
> > > On Tue, May 23, 2017 at 12:51:56PM +, Wang, Zhihong wrote:
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Jens Freimann [mailto:jfrei...@redhat.com]
> > > > > Sent: Tuesday, May 23, 2017 7:54 PM
> > > > > To: Wang, Zhihong 
> > > > > Cc: dev@dpdk.org; yuanhan@linux.intel.com
> > > > > Subject: Re: [dpdk-dev] [PATCH] vhost: support rx_queue_count
> > > > >
> > > > > On Mon, May 22, 2017 at 04:01:08PM -0400, Zhihong Wang wrote:
> > > > > > This patch implements the ops rx_queue_count for vhost PMD by
> adding
> > > > > > a helper function rte_vhost_rx_queue_count in vhost lib.
> > > > > >
> > > > > > The ops ops rx_queue_count gets vhost RX queue avail count and
> helps
> > > > >
> > > > > s/ops ops/ops/ ?
> > > >
> > > > Thanks a lot!
> > >
> > > Seems you overlooked other comments, such as:
> > >
> > > > > > +   vq = dev->virtqueue[qid];
> > > > >
> > > > > check for vq == NULL?
> > > > >
> > > > > regards,
> > > > > Jens
> > >
> > > Besides, Jens, I'm not a big fan of "dev == NULL" over "!dev". I accept
> > > both :)
> >
> > Personally I'm with you on this, but it says different in
> http://dpdk.org/doc/guides/contributing/coding_style.html
> > Anyway, up to you :)
> 
> Yes, I know that. To make every body happy, I would suggset you to
> follow the coding style. But if you have already choosen another way,
> I won't bother to ask you do the change. Unless, it becomes a must
> in future.

Thanks Yuanhan and Jens! Will address these comments in v3.
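
For reference, a minimal sketch of how the helper could look with the NULL
checks folded in (hypothetical -- the actual v3 may differ, and the
avail-count expression is my assumption of how the fill level is derived):

uint32_t
rte_vhost_rx_queue_count(int vid, uint16_t qid)
{
	struct virtio_net *dev;
	struct vhost_virtqueue *vq;

	dev = get_device(vid);
	if (dev == NULL)
		return 0;

	if (unlikely(qid >= dev->nr_vring || (qid & 1) == 0)) {
		RTE_LOG(ERR, VHOST_DATA, "(%d) %s: invalid virtqueue idx %d.\n",
			dev->vid, __func__, qid);
		return 0;
	}

	vq = dev->virtqueue[qid];
	if (vq == NULL || vq->avail == NULL)
		return 0;

	/* Entries the guest made available that vhost hasn't consumed yet. */
	return *((volatile uint16_t *)&vq->avail->idx) - vq->last_avail_idx;
}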

> 
>   --yliu


[dpdk-dev] [PATCH] Unlink existing unused sockets at start up

2015-12-18 Thread Wang, Zhihong
> On 17.12.2015 07:21, Zhihong Wang wrote:
> > This patch unlinks existing unused sockets (which cause new bindings to 
> > fail, e.g.
> vHost PMD) to ensure smooth startup.
> > In a lot of cases DPDK applications are terminated abnormally without proper
> resource release.
> 
> Original OVS related problem discussed previously here
> ( http://dpdk.org/ml/archives/dev/2015-December/030326.html ) fixed in OVS
> by
> 
> commit 9b5422a98f817b9f2a1f8224cab7e1a8d0bbba1f
> Author: Ilya Maximets 
> Date:   Wed Dec 16 15:32:21 2015 +0300
> 
> ovs-lib: Try to call exit before killing.
> 
> While killing OVS may not free all allocated resources.
> 
> Example:
> Socket for vhost-user port will stay in a system
> after 'systemctl stop openvswitch' and opening
> that port after restart will fail.
> 
> 
> So, the crash of application is the last point of discussion.
> 
> > Therefore, DPDK libs should be able to deal with unclean boot environment.
> 
> Why are you think that recovery after crash of application is a problem of
> underneath library?

Thanks for the information!

Yes, ideally the underlying lib shouldn't meddle with the recovery logic.
But I do think we should at least put a warning in the lib function's comment
saying the app should make the path available. This is another topic though :-)
Like we did in memcpy:
/**
 * Copy 16 bytes from one location to another,
 * locations should not overlap.
 */
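
For the vhost socket case, the warning I have in mind would read something
like this (wording illustrative only):

/**
 * Register a vhost-user socket path.
 *
 * The path must be free for binding: the application is responsible
 * for removing any stale socket file left behind by a previous run
 * before calling this function.
 */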


> 
> Best regards, Ilya Maximets.
> 



[dpdk-dev] [PATCH] Unlink existing unused sockets at start up

2015-12-21 Thread Wang, Zhihong


> -Original Message-
> From: Ilya Maximets [mailto:i.maximets at samsung.com]
> Sent: Friday, December 18, 2015 2:18 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: p.fedin at samsung.com; yuanhan.liu at linux.intel.com; s.dyasly at 
> samsung.com;
> Xie, Huawei 
> Subject: Re: [PATCH] Unlink existing unused sockets at start up
> 
> On 18.12.2015 05:39, Wang, Zhihong wrote:
> 
> > Yes ideally the underneath lib shouldn't meddle with the recovery logic.
> > But I do think we should at least put a warning in the lib function
> > said the app should make the path available. This is another topic though 
> > :-)
> Like we did in memcpy:
> > /**
> >  * Copy 16 bytes from one location to another,
> >  * locations should not overlap.
> >  */
> >
> 
> Isn't it enough to have an error in the log?

Function comments and function code are different things, and both are necessary.
Also, why wait until an error occurs when a comment can warn the developer up front?

> 
> lib/librte_vhost/vhost_user/vhost-net-user.c:130:
> RTE_LOG(ERR, VHOST_CONFIG, "fail to bind fd:%d, remove file:%s and try
> again.\n",
> 
> Best regards, Ilya Maximets.


[dpdk-dev] [PATCH 1/3] app/test-pmd: Handle SIGINT and SIGTERM in testpmd

2015-12-24 Thread Wang, Zhihong
> > +/* When we receive a INT signal, close all ports */ static void
> > +sigint_handler(__rte_unused int signum) {
> > +   unsigned portid;
> > +
> > +   printf("Preparing to exit...\n");
> 
> Better to notice user "Signal xxx received, reparing to exit... "

Can do that.

> 
> > +   FOREACH_PORT(portid, ports) {
> > +   if (port_id_is_invalid(portid, ENABLED_WARN))
> > +   continue;
> > +   printf("Stopping port %d...", portid);
> > +   rte_eth_dev_stop(portid);
> > +   rte_eth_dev_close(portid);
> > +   printf(" Done\n");
> > +   }
> > +   printf("Bye...\n");
> 
> Here why don't call pmd_test_exit()? Any issue with that func?

Yes should just call this one :)

> 
> Thanks,
> Michael
> > +   exit(0);
> > +}
> > +
> >  int
> >  main(int argc, char** argv)
> >  {
> > int  diag;
> > uint8_t port_id;
> >
> > +   signal(SIGINT, sigint_handler);
> > +   signal(SIGTERM, sigint_handler);
> > +
> > diag = rte_eal_init(argc, argv);
> > if (diag < 0)
> > rte_panic("Cannot init EAL\n");



[dpdk-dev] [PATCH 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd

2015-12-25 Thread Wang, Zhihong
> > +/* When we receive a INT signal, close all ports */ static void
> > +sigint_handler(__rte_unused int signum) {
> > +   unsigned portid, nb_ports;
> > +
> > +   printf("Preparing to exit...\n");
> > +   nb_ports = rte_eth_dev_count();
> > +   for (portid = 0; portid < nb_ports; portid++) {
> > +   if ((enabled_port_mask & (1 << portid)) == 0) {
> > +   continue;
> > +   }
> > +   printf("Stopping port %d...", portid);
> > +   rte_eth_dev_stop(portid);
> > +   rte_eth_dev_close(portid);
> 
> Hmm, so your interrupt thread invokes dev_stop, while IO lcores keep calling
> rx_burst/tx_burst?
> For graceful shutdown on SIGINT, I suppose you first have to stop your IO 
> lcores
> first.
> Let say have a global var: 'stop' that every lcore has to check from time to 
> time (or
> something similar).

Thanks for the advice! This works once the program enters the forwarding phase.
We have to go the other way if it's still in the initialization phase, which
can take quite some time.

/Zhihong

> Konstantin
> 
> > +   printf(" Done\n");
> > +   }
> > +   printf("Bye...\n");
> > +   exit(0);
> > +}
> > +
> >  int
> >  main(int argc, char **argv)
> >  {
> > @@ -2572,6 +2594,9 @@ main(int argc, char **argv)
> > uint32_t n_tx_queue, nb_lcores;
> > uint8_t portid, nb_rx_queue, queue, socketid;
> >
> > +   signal(SIGINT, sigint_handler);
> > +   signal(SIGTERM, sigint_handler);
> > +
> > /* init EAL */
> > ret = rte_eal_init(argc, argv);
> > if (ret < 0)
> > --
> > 2.5.0



[dpdk-dev] [PATCH 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd

2015-12-25 Thread Wang, Zhihong
> On Wed, 23 Dec 2015 15:03:15 -0500
> Zhihong Wang  wrote:
> 
> > +/* When we receive a INT signal, close all ports */ static void
> > +sigint_handler(__rte_unused int signum) {
> > +   unsigned portid, nb_ports;
> > +
> > +   printf("Preparing to exit...\n");
> > +   nb_ports = rte_eth_dev_count();
> > +   for (portid = 0; portid < nb_ports; portid++) {
> > +   if ((enabled_port_mask & (1 << portid)) == 0) {
> > +   continue;
> > +   }
> > +   printf("Stopping port %d...", portid);
> > +   rte_eth_dev_stop(portid);
> > +   rte_eth_dev_close(portid);
> > +   printf(" Done\n");
> > +   }
> > +   printf("Bye...\n");
> > +   exit(0);
> > +}
> 
> Signal handlers should only set a flag, which is then checked by thread loops.
> Calling functions in DPDK from signal handlers is not safe.

I'll make changes in v2 to address this issue. Thanks for pointing it out :)
In some cases the signal handler has to do the exit itself though, like when
the program is still doing memory initialization, which can take some time.
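
The flag-based pattern being suggested is roughly the following (a sketch of
the direction for v2, not the final code):

#include <signal.h>
#include <stdbool.h>

static volatile bool force_quit;

/* Only set a flag here -- no DPDK calls from signal context. */
static void
signal_handler(int signum)
{
	if (signum == SIGINT || signum == SIGTERM)
		force_quit = true;
}

/* Each lcore polls the flag in its main loop and returns when it is set;
 * the master lcore then calls rte_eth_dev_stop()/rte_eth_dev_close(). */
static int
lcore_main_loop(void *arg __rte_unused)
{
	while (!force_quit) {
		/* rx burst / process / tx burst */
	}
	return 0;
}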


[dpdk-dev] [PATCH v2 2/3] examples/l2fwd: Handle SIGINT and SIGTERM in l2fwd

2015-12-28 Thread Wang, Zhihong
Hi Stephen,

Really appreciate the detailed review!
Please see comments below.


> > +static int force_quit = -1;
> > +static int signo_quit = -1;
> 
> These need to be volatile otherwise you risk compiler optimizing away your
> checks.

Yes. Don't wanna take chances here.

> 
> Also, don't use -1/0 just use 0/1 for boolean or better yet the definition in
>  of bool and true/false.
> That way the code can read much nicer.

It's -1 when forwarding hasn't started yet.
I can add a "static bool fwd_started;" to represent this and make it clearer.

> 
> >  #define RTE_LOGTYPE_L2FWD RTE_LOGTYPE_USER1
> >
> >  #define NB_MBUF   8192
> > @@ -284,6 +289,8 @@ l2fwd_main_loop(void)
> > }
> >
> > while (1) {
> > +   if (unlikely(force_quit != 0))
> > +   break;
> 
> Please maske this a proper while loop instead.

Exactly.

> 
> while (!force_quit) {
> 
> >
> > cur_tsc = rte_rdtsc();
> >
> > @@ -534,6 +541,45 @@ check_all_ports_link_status(uint8_t port_num,
> uint32_t port_mask)
> > }
> >  }
> >
> > +static void
> > +stop_ports(void)
> > +{
> > +   unsigned portid, nb_ports;
> > +
> > +   nb_ports = rte_eth_dev_count();
> > +   for (portid = 0; portid < nb_ports; portid++) {
> > +   if ((l2fwd_enabled_port_mask & (1 << portid)) == 0) {
> > +   continue;
> > +   }
> 
> No need for {} here.
> 
> > +   printf("Stopping port %d...", portid);
> > +   rte_eth_dev_stop(portid);
> > +   rte_eth_dev_close(portid);
> > +   printf(" Done\n");
> > +   }
> > +}
> > +
> > +static void
> > +signal_handler(__rte_unused int signum) {
> > +   if (signum == SIGINT || signum == SIGTERM) {
> 
> signum is used, dont give __rte_unused attribute.
> 
> >
> > /* launch per-lcore init on every lcore */
> > +   force_quit = 0;
> 
> What is gained by having tri-value here. Just initialize it as false.

As stated above.

> 
> 
> > rte_eal_mp_remote_launch(l2fwd_launch_one_lcore, NULL,
> CALL_MASTER);
> > RTE_LCORE_FOREACH_SLAVE(lcore_id) {
> > if (rte_eal_wait_lcore(lcore_id) < 0)
> > return -1;
> > }
> >
> > +   printf("Stopping forwarding... Done\n");
> > +   /* stop ports */
> > +   stop_ports();
> > +   printf("Bye...\n");
> > +   /* inform if there's a caller */
> > +   if (force_quit != 0) {
> > +   signal(signo_quit, SIG_DFL);
> > +   kill(getpid(), signo_quit);
> 
> The kill should not be needed.

The purpose is to make the program exit with the killed status.

> 
> It would be good if examples cleaned up allocations, that way they could be 
> used
> with valgrind for validation of drivers, etc.



[dpdk-dev] [PATCH v2 1/3] app/test-pmd: Handle SIGINT and SIGTERM in testpmd

2015-12-28 Thread Wang, Zhihong
> > -   cl = cmdline_stdin_new(main_ctx, "testpmd> ");
> > -   if (cl == NULL) {
> > +   testpmd_cl = cmdline_stdin_new(main_ctx, "testpmd> ");
> > +   if (testpmd_cl == NULL) {
> > return;
> > }
> 
> Style nit: don't need {} around single statement.
> 
> > +static void
> > +sigint_handler(__rte_unused int signum) {
> > +   if (signum == SIGINT || signum == SIGTERM) {
> 
> signmum is used, so don't want __rte_unused
> 

Thanks :) Will fix these in the next version.



[dpdk-dev] [PATCH v2 0/3] Handle SIGINT and SIGTERM in DPDK examples

2015-12-28 Thread Wang, Zhihong


> -Original Message-
> From: Qiu, Michael
> Sent: Monday, December 28, 2015 12:18 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: Ananyev, Konstantin ;
> stephen at networkplumber.org
> Subject: Re: [PATCH v2 0/3] Handle SIGINT and SIGTERM in DPDK examples
> 
> On 2015/12/25 17:40, Wang, Zhihong wrote:
> > This patch handles SIGINT and SIGTERM in testpmd, l2fwd, and l3fwd, make
> sure all ports are properly stopped and closed.
> > For virtual ports, the stop and close function may deal with resource 
> > cleanup,
> such as socket files unlinking.
> >
> > --
> > Changes in v2:
> >
> > 1. Make sure graceful exit for all running phases
> >
> > 2. Make sure program exits with the right status
> >
> > Zhihong Wang (3):
> >   app/test-pmd: Handle SIGINT and SIGTERM in testpmd
> >   examples/l2fwd: Handle SIGINT and SIGTERM in l2fwd
> >   examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd
> >
> >  app/test-pmd/cmdline.c |  19 ++---
> >  app/test-pmd/testpmd.c |  38 ++---
> >  app/test-pmd/testpmd.h |   1 +
> >  examples/l2fwd/main.c  |  60 +++
> >  examples/l3fwd/main.c  | 110
> -
> >  5 files changed, 196 insertions(+), 32 deletions(-)
> >
> 
> Next time, you'd better not to top post for V2 :)

Gotcha :)

> 
> Acked-by: Michael Qiu 


[dpdk-dev] [PATCH v3 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd

2015-12-30 Thread Wang, Zhihong
> > +static uint8_t
> > +start_ports(void)
> > +{
> > +   unsigned portid, nb_ports, avail_ports;
> > +   int ret;
> > +
> > +   nb_ports = rte_eth_dev_count();
> > +   avail_ports = 0;
> > +   for (portid = 0; portid < nb_ports; portid++) {
> > +   if ((enabled_port_mask & (1 << portid)) == 0)
> > +   continue;
> > +   avail_ports++;
> > +   port_started = true;
> 
> Why do you need it at each iteration?

It only becomes true when the first enabled port is about to be started, in
case there's no port enabled at all.
In my opinion there's no need to optimize since it's not performance sensitive
and the logic is correct :)


> 
> > +   printf("Starting port %d...", portid);
> > +   ret = rte_eth_dev_start(portid);
> > +   if (ret < 0)
> > +   rte_exit(EXIT_FAILURE,
> > +   "rte_eth_dev_start: err=%d, port=%d\n",
> > +   ret, portid);
> > +   /*
> > +* If enabled, put device in promiscuous mode.
> > +* This allows IO forwarding mode to forward packets
> > +* to itself through 2 cross-connected  ports of the
> > +* target machine.
> > +*/
> > +   if (promiscuous_on)
> > +   rte_eth_promiscuous_enable(portid);
> > +   printf(" Done\n");
> > +   }
> > +
> > +   return avail_ports;
> > +}

[...]

> > +static void
> > +signal_handler(int signum)
> > +{
> > +   if (signum == SIGINT || signum == SIGTERM) {
> > +   printf("\nSignal %d received, preparing to exit...\n",
> > +   signum);
> > +   if (port_started) {
> > +   printf("Ports started already...\n");
> > +   signo_quit = signum;
> > +   force_quit = true;
> > +   } else {
> 
> 
> Hmm, and what if signal_handler() would be executed not in the context of
> master lcore?
> Then there could be a raise condition, and you could end up here, while master
> lcore would be in the middle of start_ports()->rte_eth_dev_start().

Good point! Then we need rte_atomic16_cmpset() to avoid the race condition.
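
Something along these lines is what I mean (a rough sketch; state values and
names are illustrative):

#include <signal.h>
#include <unistd.h>
#include <rte_atomic.h>

/* 0 = ports not started, 1 = ports started, 2 = quit requested */
static volatile uint16_t port_state;

static void
signal_handler(int signum)
{
	if (signum != SIGINT && signum != SIGTERM)
		return;
	/* Atomically move started -> quit; if ports were never started,
	 * fall back to the default action so we don't wait on init. */
	if (rte_atomic16_cmpset(&port_state, 1, 2) == 0) {
		signal(signum, SIG_DFL);
		kill(getpid(), signum);
	}
}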


> Probably not a big deal, but why do you need this  if (port_started) {...} 
> else {...}
> at all?
> Why not just:

If no port has been started, the handler just kills the process itself.
This is for cases where you just started the app and then want to shut it down;
otherwise it would wait a long time for initialization (memory, etc.) before
the force_quit signal takes effect.


> 
> signal_handler(int signum)
> {
>   signo_quit = signum;
>   force_quit = true;
> }
> ?
> 
> Konstantin
> 
> > +   printf("Ports not started yet...\n");
> > +   printf("Bye...\n");
> > +   /* exit with the expected status */
> > +   signal(signum, SIG_DFL);
> > +   kill(getpid(), signum);
> > +   }
> > +   }
> > +}
> > +



[dpdk-dev] [PATCH v4 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd

2015-12-31 Thread Wang, Zhihong
> > +#define PORT_IDLE 0
> > +#define PORT_INIT 1
> > +#define PORT_WORK 2
> > +#define PORT_STOP 3
> > +#define PORT_QUIT 4
> 
> Seems ok, but over-complicated.
> I think all you need is just IDLE, INIT, QUIT.

Yes, for l2/l3fwd 3 states are enough.
I implemented a full state machine so it can also serve as an example of how
to do this in other cases, like when stop might be called before or during init.

> Konstantin




[dpdk-dev] [PATCH v3 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd

2015-12-31 Thread Wang, Zhihong


> -Original Message-
> From: Ananyev, Konstantin
> Sent: Wednesday, December 30, 2015 7:30 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: stephen at networkplumber.org; Qiu, Michael 
> Subject: RE: [PATCH v3 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in
> l3fwd
> 
> 
> 
> > -Original Message-
> > From: Wang, Zhihong
> > Sent: Wednesday, December 30, 2015 3:15 AM
> > To: Ananyev, Konstantin; dev at dpdk.org
> > Cc: stephen at networkplumber.org; Qiu, Michael
> > Subject: RE: [PATCH v3 3/3] examples/l3fwd: Handle SIGINT and SIGTERM
> > in l3fwd
> >
> > > > +static uint8_t
> > > > +start_ports(void)
> > > > +{
> > > > +   unsigned portid, nb_ports, avail_ports;
> > > > +   int ret;
> > > > +
> > > > +   nb_ports = rte_eth_dev_count();
> > > > +   avail_ports = 0;
> > > > +   for (portid = 0; portid < nb_ports; portid++) {
> > > > +   if ((enabled_port_mask & (1 << portid)) == 0)
> > > > +   continue;
> > > > +   avail_ports++;
> > > > +   port_started = true;
> > >
> > > Why do you need it at each iteration?
> >
> > Only become true when the first enabled port about to started. In case 
> > there's
> no port enabled at all.
> > In my opinion no need to optimize since it's not performance sensitive
> > and the logic is correct :)
> >
> >
> > >
> > > > +   printf("Starting port %d...", portid);
> > > > +   ret = rte_eth_dev_start(portid);
> > > > +   if (ret < 0)
> > > > +   rte_exit(EXIT_FAILURE,
> > > > +   "rte_eth_dev_start: err=%d, 
> > > > port=%d\n",
> > > > +   ret, portid);
> > > > +   /*
> > > > +* If enabled, put device in promiscuous mode.
> > > > +* This allows IO forwarding mode to forward packets
> > > > +* to itself through 2 cross-connected  ports of the
> > > > +* target machine.
> > > > +*/
> > > > +   if (promiscuous_on)
> > > > +   rte_eth_promiscuous_enable(portid);
> > > > +   printf(" Done\n");
> > > > +   }
> > > > +
> > > > +   return avail_ports;
> > > > +}
> >
> > [...]
> >
> > > > +static void
> > > > +signal_handler(int signum)
> > > > +{
> > > > +   if (signum == SIGINT || signum == SIGTERM) {
> > > > +   printf("\nSignal %d received, preparing to exit...\n",
> > > > +   signum);
> > > > +   if (port_started) {
> > > > +   printf("Ports started already...\n");
> > > > +   signo_quit = signum;
> > > > +   force_quit = true;
> > > > +   } else {
> > >
> > >
> > > Hmm, and what if signal_handler() would be executed not in the
> > > context of master lcore?
> > > Then there could be a raise condition, and you could end up here,
> > > while master lcore would be in the middle of
> start_ports()->rte_eth_dev_start().
> >
> > Good point! Then we need rte_atomic16_cmpset() to avoid the race condition.
> >
> >
> > > Probably not a big deal, but why do you need this  if (port_started)
> > > {...} else {...} at all?
> > > Why not just:
> >
> > If no port has been started, then just kill itself.
> > This is for cases like when you just started it and then want to shut
> > it down, it'll wait a long time for initialization (memory, etc.) before the
> force_quit signal take effect.
> 
> Do you mean rte_eal_init()?
> Then why not to install non-default signal handlers after rte_eal_init()?
> Konstantin

Yes, that does sound better :)
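
i.e. something like this (sketch):

/* Install the handlers only after rte_eal_init(), so an early Ctrl-C
 * during the lengthy memory initialization keeps the default action. */
ret = rte_eal_init(argc, argv);
if (ret < 0)
	rte_exit(EXIT_FAILURE, "Invalid EAL parameters\n");

force_quit = false;
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);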



> 
> >
> >
> > >
> > > signal_handler(int signum)
> > > {
> > >   signo_quit = signum;
> > >   force_quit = true;
> > > }
> > > ?
> > >
> > > Konstantin
> > >
> > > > +   printf("Ports not started yet...\n");
> > > > +   printf("Bye...\n");
> > > > +   /* exit with the expected status */
> > > > +   signal(signum, SIG_DFL);
> > > > +   kill(getpid(), signum);
> > > > +   }
> > > > +   }
> > > > +}
> > > > +



[dpdk-dev] [PATCH v4 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd

2015-12-31 Thread Wang, Zhihong


> -Original Message-
> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Thursday, December 31, 2015 10:09 AM
> To: Wang, Zhihong 
> Cc: Ananyev, Konstantin ; dev at dpdk.org; 
> Qiu,
> Michael 
> Subject: Re: [PATCH v4 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in
> l3fwd
> 
> On Thu, 31 Dec 2015 01:44:20 +
> "Wang, Zhihong"  wrote:
> 
> > > > +#define PORT_IDLE 0
> > > > +#define PORT_INIT 1
> > > > +#define PORT_WORK 2
> > > > +#define PORT_STOP 3
> > > > +#define PORT_QUIT 4
> > >
> > > Seems ok, but over-complicated.
> > > I think all you need is just IDLE, INIT, QUIT.
> >
> > Yes for l2/l3fwd 3 states are enough.
> > I implement a full state machine so it can also serve as an example on how 
> > to
> do this in other cases, like where stop might be called before or during init.
> 
> These are examples, it is better to have as little code as necessary to get 
> the job
> done. That makes the example clearer.  Adding extra unnecessary complexity
> just makes it harder to understand.


Thanks for the suggestions!
I'll send the v5 combining your comments and Konstantin's together to make it 
simpler.


Re: [dpdk-dev] [PATCH 0/2] maintainers: updates for Vhost and Virtio

2018-06-12 Thread Wang, Zhihong
Hi Maxime,

> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Tuesday, June 12, 2018 4:01 PM
> To: mtetsu...@gmail.com; Bie, Tiwei ; Wang, Zhihong
> ; dev@dpdk.org
> Cc: tho...@monjalon.net; Yigit, Ferruh ; Maxime
> Coquelin 
> Subject: [PATCH 0/2] maintainers: updates for Vhost and Virtio
> 
> Hi,
> 
> Since Jianfeng & Yuanhan resignation, I was the only active
> maintainer for Vhost lib and PMD, and I had no backup for
> managing the next-virtio tree.
> 
> Contacted offline, Tetsuya has kindly accepted to remove
> himself from the Vhost PMD maintainers as he didn't had time
> to be active recently. Tetsuya asked me to send the patch
> and add his Acked-by. I'd like to thank him for his
> contributions, and wish him the best for his current and
> next adventures!
> 
> I propose Tiwei and Zhihong to officially co-maintain both
> Vhost and Virtio components. They have been very helpful with
> their reviews in the last months, and know very well both the
> code and the spec. Being 3 co-maintainers would ensure better
> reviews while letting us time to develop new features.

Thanks for the proposal! I'm really glad to help co-maintain the
Virtio and Vhost components. ;)

Regards
-Zhihong

> 
> Also, I propose Tiwei to co-manage the next-virtio tree.
> 
> Thanks,
> Maxime
> 
> Maxime Coquelin (2):
>   maintainers: update Vhost PMD maintainership
>   maintainers: add Vhost and Virtio co-maintainers
> 
>  MAINTAINERS | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> --
> 2.14.3



Re: [dpdk-dev] [PATCH 2/2] maintainers: add Vhost and Virtio co-maintainers

2018-06-13 Thread Wang, Zhihong



> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Tuesday, June 12, 2018 4:01 PM
> To: mtetsu...@gmail.com; Bie, Tiwei ; Wang, Zhihong
> ; dev@dpdk.org
> Cc: tho...@monjalon.net; Yigit, Ferruh ; Maxime
> Coquelin 
> Subject: [PATCH 2/2] maintainers: add Vhost and Virtio co-maintainers
> 
> Add Tiwei and Zhihong as co-maintainers for the Vhost and
> Virtio components. They have done great contributions recently,
> and been very helpfull in helping to review Vhost and Virtio
> series.
> 
> Also, add Tiwei as backup for the Next-virtio tree.
> 
> Signed-off-by: Maxime Coquelin 

Thanks for the proposal! I'm really glad to help co-maintain the
Virtio and Vhost components.

Acked-by: Zhihong Wang 

> ---
>  MAINTAINERS | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1c28f6d38..14939f10a 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -41,6 +41,7 @@ T: git://dpdk.org/next/dpdk-next-net-mlx
> 
>  Next-virtio Tree
>  M: Maxime Coquelin 
> +M: Tiwei Bie 
>  T: git://dpdk.org/next/dpdk-next-virtio
> 
>  Next-crypto Tree
> @@ -654,6 +655,8 @@ F: doc/guides/nics/features/vmxnet3.ini
> 
>  Vhost-user
>  M: Maxime Coquelin 
> +M: Tiwei Bie 
> +M: Zhihong Wang 
>  T: git://dpdk.org/next/dpdk-next-virtio
>  F: lib/librte_vhost/
>  F: doc/guides/prog_guide/vhost_lib.rst
> @@ -665,6 +668,8 @@ F: examples/vhost_crypto/
> 
>  Vhost PMD
>  M: Maxime Coquelin 
> +M: Tiwei Bie 
> +M: Zhihong Wang 
>  T: git://dpdk.org/next/dpdk-next-virtio
>  F: drivers/net/vhost/
>  F: doc/guides/nics/vhost.rst
> @@ -673,6 +678,7 @@ F: doc/guides/nics/features/vhost.ini
>  Virtio PMD
>  M: Maxime Coquelin 
>  M: Tiwei Bie 
> +M: Zhihong Wang 
>  T: git://dpdk.org/next/dpdk-next-virtio
>  F: drivers/net/virtio/
>  F: doc/guides/nics/virtio.rst
> --
> 2.14.3



Re: [dpdk-dev] [PATCH] vhost: fix buffer length calculation

2018-07-18 Thread Wang, Zhihong



> -Original Message-
> From: Bie, Tiwei
> Sent: Tuesday, July 17, 2018 9:11 PM
> To: maxime.coque...@redhat.com; Wang, Zhihong
> ; dev@dpdk.org
> Cc: Wang, Yinan ; Yao, Lei A 
> Subject: [PATCH] vhost: fix buffer length calculation
> 
> Fixes: fd68b4739d2c ("vhost: use buffer vectors in dequeue path")
> 
> Reported-by: Yinan Wang 
> Signed-off-by: Tiwei Bie 
> ---
>  lib/librte_vhost/virtio_net.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 2b7ffcf92..07cc0c845 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -720,7 +720,8 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>   uint16_t hdr_vec_idx = 0;
> 
>   while (remain) {
> - len = remain;
> + len = RTE_MIN(remain,
> +
>   buf_vec[hdr_vec_idx].buf_len);
>   dst = buf_vec[hdr_vec_idx].buf_addr;
>   rte_memcpy((void *)(uintptr_t)dst,
>   (void *)(uintptr_t)src,
> @@ -747,7 +748,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>   hdr_addr = 0;
>   }
> 
> - cpy_len = RTE_MIN(buf_len, mbuf_avail);
> + cpy_len = RTE_MIN(buf_avail, mbuf_avail);
> 
>   if (likely(cpy_len > MAX_BATCH_LEN ||
>   vq->batch_copy_nb_elems >= vq-
> >size)) {
> @@ -1112,7 +1113,8 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>* in a contiguous virtual area.
>*/
>   while (remain) {
> - len = remain;
> + len = RTE_MIN(remain,
> + buf_vec[hdr_vec_idx].buf_len);
>   src = buf_vec[hdr_vec_idx].buf_addr;
>   rte_memcpy((void *)(uintptr_t)dst,
>  (void *)(uintptr_t)src, len);
> --
> 2.18.0

Acked-by: Zhihong Wang 

Thanks



Re: [dpdk-dev] [PATCH] doc: add deprecation notice on external memory support

2018-08-01 Thread Wang, Zhihong



> -Original Message-
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Anatoly Burakov
> Sent: Wednesday, August 1, 2018 8:07 PM
> To: dev@dpdk.org
> Cc: Neil Horman ; Mcnamara, John
> ; Kovacevic, Marko
> ; tho...@monjalon.net; Wiles, Keith
> 
> Subject: [dpdk-dev] [PATCH] doc: add deprecation notice on external
> memory support
> 
> Due to the upcoming external memory support [1], some API and ABI
> changes will be required. In addition, although the changes called
> out in the deprecation notice are not yet present in form of code
> in the published RFC itself, they are based on consensus on the
> mailing list [2] on how to best implement this feature.
> 
> [1] http://patches.dpdk.org/project/dpdk/list/?series=453&state=*
> [2] https://mails.dpdk.org/archives/dev/2018-July/108002.html
> 
> Signed-off-by: Anatoly Burakov 
> ---
>  doc/guides/rel_notes/deprecation.rst | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst
> b/doc/guides/rel_notes/deprecation.rst
> index 14714fe94..629154711 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -8,6 +8,21 @@ API and ABI deprecation notices are to be posted here.
>  Deprecation Notices
>  ---
> 
> +* eal: certain structures will change in EAL on account of upcoming external
> +  memory support. Aside from internal changes leading to an ABI break, the
> +  following externally visible changes will also be implemented:
> +
> +  - ``rte_memseg_list`` will change to include a boolean flag indicating
> +whether a particular memseg list is externally allocated. This will have
> +implications for any users of memseg-walk-related functions, as they will
> +now have to skip externally allocated segments in most cases if the 
> intent
> +is to only iterate over internal DPDK memory.
> +  - ``socket_id`` parameter across the entire DPDK will gain additional
> meaning,
> +as some socket ID's will now be representing externally allocated memory.
> No
> +changes will be required for existing code as backwards compatibility 
> will
> +be kept, and those who do not use this feature will not see these extra
> +socket ID's.
> +
>  * eal: both declaring and identifying devices will be streamlined in v18.08.
>New functions will appear to query a specific port from buses, classes of
>device and device drivers. Device declaration will be made coherent with
> the
> --
> 2.17.1

Acked-by: Wang, Zhihong 

Thanks


Re: [dpdk-dev] [PATCH] vhost: fix vDPA set features

2018-04-25 Thread Wang, Zhihong


> -Original Message-
> From: Wang, Xiao W
> Sent: Wednesday, April 25, 2018 10:18 AM
> To: maxime.coque...@redhat.com
> Cc: Tan, Jianfeng ; Wang, Zhihong
> ; dev@dpdk.org; Wang, Xiao W
> 
> Subject: [PATCH] vhost: fix vDPA set features
> 
> We should call set_features callback after setting features in virtio_net
> structure, otherwise vDPA driver cannot get the right features.
> 
> Fixes: 07718b4f87aa ("vhost: adapt library for selective datapath")
> 
> Signed-off-by: Xiao Wang 

Acked-by: Zhihong Wang 

Thanks for fixing it!


Re: [dpdk-dev] KNI performance is not what is claimed

2018-09-21 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Jay Rolette
> Sent: Friday, September 21, 2018 5:39 AM
> To: Stephen Hemminger 
> Cc: DPDK 
> Subject: Re: [dpdk-dev] KNI performance is not what is claimed
> 
> On Thu, Sep 20, 2018 at 3:16 PM Stephen Hemminger <
> step...@networkplumber.org> wrote:
> 
> > On Thu, 20 Sep 2018 15:02:53 -0500
> > Jay Rolette  wrote:
> >
> > > On Thu, Sep 20, 2018 at 1:11 PM Stephen Hemminger <
> > > step...@networkplumber.org> wrote:
> > >
> > > > I wonder if KNI is claiming performance that was never measured on
> > current
> > > > CPU, OS, DPDK.
> > > >
> > > > With single stream and TCP testing on IXGBE (DPDK), I see lowest
> > > > performance with KNI.
> > > >
> > > > Rx  Tx
> > > > KNI 3.2 Gbit/sec1.3 Gbit/sec
> > > > TAP 4.9 4.7
> > > > Virtio  5.6 8.6
> > > >
> > > > Perhaps for 18.11 we should change documentation to remove
> language
> > > > claiming
> > > > better performance with KNI, and then plan for future deprecation?
> > > >
> > >
> > > Do TAP and Virtio provide equivalent function to KNI? I can't speak for
> > any
> > > other products, but ours is dependent on KNI. The ability for control
> > plane
> > > applications to use normal Linux sockets with DPDK is key even if it
> > isn't
> > > performant.
> > >
> > > Hopefully the answer is "yes", in which case I'll happily port over to
> > > using one of the faster mechanisms.
> > >
> > > Thanks,
> > > Jay
> >
> > See:
> >
> > https://doc.dpdk.org/guides-
> 17.11/howto/virtio_user_as_exceptional_path.html
> 
> 
> Thanks. Looks like it's time to run some experiments again.

To do the test with the latest DPDK 17.11 LTS, you'll need the below
one-line fix (which was missed during backporting) to enable the
offloads:
https://git.dpdk.org/dpdk/commit/?id=bce7e9050f9b

You can also refer to this paper for more details:
https://dl.acm.org/citation.cfm?id=3098586

-Zhihong

> 
> Jay


Re: [dpdk-dev] [PATCH v2 0/2] Vhost: unitfy receive paths

2018-05-31 Thread Wang, Zhihong



> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Tuesday, May 29, 2018 5:45 PM
> To: dev@dpdk.org; Bie, Tiwei ; Wang, Zhihong
> 
> Cc: Maxime Coquelin 
> Subject: [PATCH v2 0/2] Vhost: unitfy receive paths
> 
> Hi,
> 
> This second version fixes the feature bit check in
> rxvq_is_mergeable(), and remove "mergeable" from rx funcs
> names. No difference is seen in the benchmarks
> 
> This series is preliminary work to ease the integration of
> packed ring layout support. But even without packed ring
> layout, the result is positive.
> 
> First patch unify both paths, and second one is a small
> optimization to avoid copying batch_copy_nb_elems VQ field
> to/from the stack.
> 
> With the series applied, I get modest performance gain for
> both mergeable and non-mergeable cases, and the gain of
> about 300 LoC is non negligible maintenance-wise.
> 
> Rx-mrg=off benchmarks:
> 
> ++---+-+-+--+
> |Run |  PVP  | Guest->Host | Host->Guest | Loopback |
> ++---+-+-+--+
> | v18.05-rc5 | 14.47 |   16.64 |   17.57 |13.15 |
> | + series   | 14.87 |   16.86 |   17.70 |13.30 |
> ++---+-+-+--+
> 
> Rx-mrg=on benchmarks:
> 
> ++--+-+-+--+
> |Run | PVP  | Guest->Host | Host->Guest | Loopback |
> ++--+-+-+--+
> | v18.05-rc5 | 9.38 |   13.78 |   16.70 |12.79 |
> | + series   | 9.38 |   13.80 |   17.49 |13.36 |
> ++--+-+-+--+
> 
> Note: Even without my series, the guest->host benchmark with
> mergeable buffers enabled looks suspicious as it should in
> theory be alsmost identical as when Rx mergeable buffers are
> disabled. To be investigated...
> 
> Maxime Coquelin (2):
>   vhost: unify Rx mergeable and non-mergeable paths
>   vhost: improve batched copies performance
> 
>  lib/librte_vhost/virtio_net.c | 376 
> +-
>  1 file changed, 37 insertions(+), 339 deletions(-)
> 

Acked-by: Zhihong Wang 

Thanks Maxime! This is really great to see. ;) We probably need the
same improvement for Virtio-pmd.

One comment on Virtio/Vhost performance analysis: no matter what type
of traffic is used (PVP, Txonly-Rxonly, Loopback...), we need to be
clear on which side we're testing, and give the other side excess CPU
resources; otherwise we'll be measuring whichever side is the slowest.

Since this patch is for Vhost, I suggest running N (e.g. N = 4) Virtio
threads on N cores, and the corresponding N Vhost threads on a single
core, for the performance comparison. Do you think this makes sense?

For Guest -> Host, in my test I see Rx-mrg=on has a negative impact on
the Virtio side, probably because Virtio touches something that's not
touched when Rx-mrg=off.

Thanks
-Zhihong


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-20 Thread Wang, Zhihong


> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Monday, January 19, 2015 9:02 PM
> To: Wang, Zhihong
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com wrote:
> > This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> > It also extends memcpy test coverage with unaligned cases and more test
> points.
> >
> > Optimization techniques are summarized below:
> >
> > 1. Utilize full cache bandwidth
> >
> > 2. Enforce aligned stores
> >
> > 3. Apply load address alignment based on architecture features
> >
> > 4. Make load/store address available as early as possible
> >
> > 5. General optimization techniques like inlining, branch reducing,
> > prefetch pattern access
> >
> > Zhihong Wang (4):
> >   Disabled VTA for memcpy test in app/test/Makefile
> >   Removed unnecessary test cases in test_memcpy.c
> >   Extended test coverage in test_memcpy_perf.c
> >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> > platforms
> >
> >  app/test/Makefile  |   6 +
> >  app/test/test_memcpy.c |  52 +-
> >  app/test/test_memcpy_perf.c| 238 +---
> >  .../common/include/arch/x86/rte_memcpy.h   | 664
> +++--
> >  4 files changed, 656 insertions(+), 304 deletions(-)
> >
> > --
> > 1.9.3
> >
> >
> Are you able to compile this with gcc 4.9.2?  The compilation of
> test_memcpy_perf is taking forever for me.  It appears hung.
> Neil


Neil,

Thanks for reporting this!
It should compile, but will take quite some time if the CPU doesn't support
AVX2. The reasons are:
1. The SSE & AVX memcpy implementation is more complicated than the AVX2
version, so the compiler takes more time to compile and optimize it.
2. The new test_memcpy_perf.c contains 126 constant memcpy calls for better
test case coverage, which is quite a lot.

I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
1. The whole compile process takes 9'41" with the original test_memcpy_perf.c
(63 + 63 = 126 constant memcpy calls)
2. It takes only 2'41" after reducing the number of constant memcpy calls to
12 + 12 = 24

I'll reduce the number of memcpy calls in the next version of the patch.

Zhihong (John)


[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-21 Thread Wang, Zhihong


> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Wednesday, January 21, 2015 3:16 AM
> To: Stephen Hemminger
> Cc: Wang, Zhihong; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> On Tue, Jan 20, 2015 at 09:15:38AM -0800, Stephen Hemminger wrote:
> > On Mon, 19 Jan 2015 09:53:34 +0800
> > zhihong.wang at intel.com wrote:
> >
> > > Main code changes:
> > >
> > > 1. Differentiate architectural features based on CPU flags
> > >
> > > a. Implement separated move functions for SSE/AVX/AVX2 to make
> > > full utilization of cache bandwidth
> > >
> > > b. Implement separated copy flow specifically optimized for
> > > target architecture
> > >
> > > 2. Rewrite the memcpy function "rte_memcpy"
> > >
> > > a. Add store aligning
> > >
> > > b. Add load aligning based on architectural features
> > >
> > > c. Put block copy loop into inline move functions for better
> > > control of instruction order
> > >
> > > d. Eliminate unnecessary MOVs
> > >
> > > 3. Rewrite the inline move functions
> > >
> > > a. Add move functions for unaligned load cases
> > >
> > > b. Change instruction order in copy loops for better pipeline
> > > utilization
> > >
> > > c. Use intrinsics instead of assembly code
> > >
> > > 4. Remove slow glibc call for constant copies
> > >
> > > Signed-off-by: Zhihong Wang 
> >
> > Dumb question: why not fix glibc memcpy instead?
> > What is special about rte_memcpy?
> >
> >
> Fair point.  Though, does glibc implement optimized memcpys per arch?  Or
> do they just rely on the __builtin's from gcc to get optimized variants?
> 
> Neil

Neil, Stephen,

Glibc has per-arch implementations but targets the general case, while
rte_memcpy is aimed at small-size, in-cache memcpy, which is the DPDK case.
This leads to different trade-offs and optimization techniques.
Also, glibc's updates from version to version are based on general judgments.
Roughly speaking, glibc 2.18 targets Ivy Bridge and 2.20 targets Haswell,
though that's not fully accurate. But we need an implementation that works
well on both Sandy Bridge and Haswell.

For instance, glibc 2.18 has a load-aligning optimization for unaligned memcpy
but doesn't support 256-bit moves, while glibc 2.20 adds support for 256-bit
moves but removes the load-aligning optimization. This hurts unaligned memcpy
performance a lot on architectures like Ivy Bridge. Glibc's reasoning is that
load aligning doesn't help when src/dst isn't in cache, which may be the
general case, but it is not the DPDK case.
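
To make the "enforce aligned stores" idea discussed in this series concrete,
here is a minimal, hedged sketch (not the actual rte_memcpy code; the function
name is illustrative only): do one unaligned 16-byte store to cover the head,
advance dst to a 16-byte boundary, and let the bulk loop issue only aligned
stores while loads may stay unaligned.

#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

static inline void
copy_with_aligned_stores(uint8_t *dst, const uint8_t *src, size_t n)
{
	if (n < 16) {  /* tiny copies: just fall back */
		memcpy(dst, src, n);
		return;
	}
	/* head: one unaligned 16-byte store */
	_mm_storeu_si128((__m128i *)dst,
			 _mm_loadu_si128((const __m128i *)src));
	/* advance so that dst is 16-byte aligned */
	size_t off = 16 - ((uintptr_t)dst & 15);
	dst += off;
	src += off;
	n -= off;
	/* bulk: aligned stores; loads may still be unaligned */
	while (n >= 16) {
		_mm_store_si128((__m128i *)dst,
				_mm_loadu_si128((const __m128i *)src));
		dst += 16;
		src += 16;
		n -= 16;
	}
	if (n > 0)  /* tail */
		memcpy(dst, src, n);
}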

Zhihong (John)


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-21 Thread Wang, Zhihong


> -Original Message-
> From: Richardson, Bruce
> Sent: Wednesday, January 21, 2015 12:15 AM
> To: Neil Horman
> Cc: Wang, Zhihong; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > Sent: Monday, January 19, 2015 9:02 PM
> > > > To: Wang, Zhihong
> > > > Cc: dev at dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > > On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com
> wrote:
> > > > > This patch set optimizes memcpy for DPDK for both SSE and AVX
> platforms.
> > > > > It also extends memcpy test coverage with unaligned cases and
> > > > > more test
> > > > points.
> > > > >
> > > > > Optimization techniques are summarized below:
> > > > >
> > > > > 1. Utilize full cache bandwidth
> > > > >
> > > > > 2. Enforce aligned stores
> > > > >
> > > > > 3. Apply load address alignment based on architecture features
> > > > >
> > > > > 4. Make load/store address available as early as possible
> > > > >
> > > > > 5. General optimization techniques like inlining, branch
> > > > > reducing, prefetch pattern access
> > > > >
> > > > > Zhihong Wang (4):
> > > > >   Disabled VTA for memcpy test in app/test/Makefile
> > > > >   Removed unnecessary test cases in test_memcpy.c
> > > > >   Extended test coverage in test_memcpy_perf.c
> > > > >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> > > > > platforms
> > > > >
> > > > >  app/test/Makefile  |   6 +
> > > > >  app/test/test_memcpy.c |  52 +-
> > > > >  app/test/test_memcpy_perf.c| 238 +---
> > > > >  .../common/include/arch/x86/rte_memcpy.h   | 664
> > > > +++--
> > > > >  4 files changed, 656 insertions(+), 304 deletions(-)
> > > > >
> > > > > --
> > > > > 1.9.3
> > > > >
> > > > >
> > > > Are you able to compile this with gcc 4.9.2?  The compilation of
> > > > test_memcpy_perf is taking forever for me.  It appears hung.
> > > > Neil
> > >
> > >
> > > Neil,
> > >
> > > Thanks for reporting this!
> > > It should compile but will take quite some time if the CPU doesn't support
> AVX2, the reason is that:
> > > 1. The SSE & AVX memcpy implementation is more complicated than
> AVX2
> > > version thus the compiler takes more time to compile and optimize 2.
> > > The new test_memcpy_perf.c contains 126 constants memcpy calls for
> > > better test case coverage, that's quite a lot
> > >
> > > I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> > > 1. The whole compile process takes 9'41" with the original
> > > test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2. It takes
> > > only 2'41" after I reduce the constant memcpy call number to 12 + 12
> > > = 24
> > >
> > > I'll reduce memcpy call in the next version of patch.
> > >
> > ok, thank you.  I'm all for optimzation, but I think a compile that
> > takes almost
> > 10 minutes for a single file is going to generate some raised eyebrows
> > when end users start tinkering with it
> >
> > Neil
> >
> > > Zhihong (John)
> > >
> Even two minutes is a very long time to compile, IMHO. The whole of DPDK
> doesn't take that long to compile right now, and that's with a couple of huge
> header files with routing tables in it. Any chance you could cut compile time
> down to a few seconds while still having reasonable tests?
> Also, when there is AVX2 present on the system, what is the compile time
> like for that code?
> 
>   /Bruce

Neil, Bruce,

Some data first.

Sandy Bridge without AVX2:
1. original w/ 10 constant memcpy: 2'25" 
2. patch w/ 12 constant memcpy: 2'41" 
3. patch w/ 63 constant memcpy: 9'41" 

Haswell with AVX2:
1. original w/ 10 constant memcpy: 1'57" 
2. patch w/ 12 constant memcpy: 1'56" 
3. patch w/ 63 constant memcpy: 3'16" 

Also, to address Bruce's question, we have to reduce the test cases to cut
down compile time, because we use:
1. intrinsics instead of assembly, for better flexibility and to let the
compiler optimize more
2. a complex function body, for better performance
3. inlining
All of this increases compile time.
But I think it'd be okay to do that as long as we can select a fair set of test 
points.

It'd be great if you could give some suggestions, say, 12 points.

Zhihong (John)






[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-23 Thread Wang, Zhihong


> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Wednesday, January 21, 2015 8:38 PM
> To: Ananyev, Konstantin
> Cc: Wang, Zhihong; Richardson, Bruce; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Wed, Jan 21, 2015 at 12:02:57PM +, Ananyev, Konstantin wrote:
> >
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> > > Sent: Wednesday, January 21, 2015 3:44 AM
> > > To: Richardson, Bruce; Neil Horman
> > > Cc: dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > >
> > >
> > > > -Original Message-----
> > > > From: Richardson, Bruce
> > > > Sent: Wednesday, January 21, 2015 12:15 AM
> > > > To: Neil Horman
> > > > Cc: Wang, Zhihong; dev at dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > > On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > > On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
> > > > > >
> > > > > >
> > > > > > > -Original Message-
> > > > > > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > > > Sent: Monday, January 19, 2015 9:02 PM
> > > > > > > To: Wang, Zhihong
> > > > > > > Cc: dev at dpdk.org
> > > > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > > >
> > > > > > > On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > > > > zhihong.wang at intel.com
> > > > wrote:
> > > > > > > > This patch set optimizes memcpy for DPDK for both SSE and
> > > > > > > > AVX
> > > > platforms.
> > > > > > > > It also extends memcpy test coverage with unaligned cases
> > > > > > > > and more test
> > > > > > > points.
> > > > > > > >
> > > > > > > > Optimization techniques are summarized below:
> > > > > > > >
> > > > > > > > 1. Utilize full cache bandwidth
> > > > > > > >
> > > > > > > > 2. Enforce aligned stores
> > > > > > > >
> > > > > > > > 3. Apply load address alignment based on architecture
> > > > > > > > features
> > > > > > > >
> > > > > > > > 4. Make load/store address available as early as possible
> > > > > > > >
> > > > > > > > 5. General optimization techniques like inlining, branch
> > > > > > > > reducing, prefetch pattern access
> > > > > > > >
> > > > > > > > Zhihong Wang (4):
> > > > > > > >   Disabled VTA for memcpy test in app/test/Makefile
> > > > > > > >   Removed unnecessary test cases in test_memcpy.c
> > > > > > > >   Extended test coverage in test_memcpy_perf.c
> > > > > > > >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> and AVX
> > > > > > > > platforms
> > > > > > > >
> > > > > > > >  app/test/Makefile  |   6 +
> > > > > > > >  app/test/test_memcpy.c |  52 +-
> > > > > > > >  app/test/test_memcpy_perf.c| 238 
> > > > > > > > +---
> > > > > > > >  .../common/include/arch/x86/rte_memcpy.h   | 664
> > > > > > > +++--
> > > > > > > >  4 files changed, 656 insertions(+), 304 deletions(-)
> > > > > > > >
> > > > > > > > --
> > > > > > > > 1.9.3
> > > > > > > >
> > > > > > > >
> > > > > > > Are you able to compile this with gcc 4.9.2?  The
> > > > > > > compilation of test_memcpy_perf is taking forever for me.  It
> appears hung.
> > > > > > > Neil
> > > > > >
> > > > > >
> > > > > > Neil,
> > > > > >
> > > > >

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-23 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> Sent: Wednesday, January 21, 2015 9:26 PM
> To: Marc Sune
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> >
> > On 21/01/15 14:02, Bruce Richardson wrote:
> > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > >>>>-Original Message-
> > >>>>From: Richardson, Bruce
> > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > >>>>To: Neil Horman
> > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >>>>
> > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > >>>>>>>-Original Message-
> > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > >>>>>>>To: Wang, Zhihong
> > >>>>>>>Cc: dev at dpdk.org
> > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >>>>>>>
> > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > >>>>>>>zhihong.wang at intel.com
> > >>>>wrote:
> > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and AVX
> > >>>>platforms.
> > >>>>>>>>It also extends memcpy test coverage with unaligned cases and
> > >>>>>>>>more test
> > >>>>>>>points.
> > >>>>>>>>Optimization techniques are summarized below:
> > >>>>>>>>
> > >>>>>>>>1. Utilize full cache bandwidth
> > >>>>>>>>
> > >>>>>>>>2. Enforce aligned stores
> > >>>>>>>>
> > >>>>>>>>3. Apply load address alignment based on architecture features
> > >>>>>>>>
> > >>>>>>>>4. Make load/store address available as early as possible
> > >>>>>>>>
> > >>>>>>>>5. General optimization techniques like inlining, branch
> > >>>>>>>>reducing, prefetch pattern access
> > >>>>>>>>
> > >>>>>>>>Zhihong Wang (4):
> > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> and AVX
> > >>>>>>>> platforms
> > >>>>>>>>
> > >>>>>>>>  app/test/Makefile  |   6 +
> > >>>>>>>>  app/test/test_memcpy.c |  52 +-
> > >>>>>>>>  app/test/test_memcpy_perf.c| 238 +---
> > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h   | 664
> > >>>>>>>+++--
> > >>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> > >>>>>>>>
> > >>>>>>>>--
> > >>>>>>>>1.9.3
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>Are you able to compile this with gcc 4.9.2?  The compilation
> > >>>>>>>of test_memcpy_perf is taking forever for me.  It appears hung.
> > >>>>>>>Neil
> > >>>>>>Neil,
> > >>>>>>
> > >>>>>>Thanks for reporting this!
> > >>>>>>It should compile but will take quite some time if the CPU
> > >>>>>>doesn't support
> > >>>>AVX2, the reason is that:
> > >>>>>>1. The SSE & AVX memcpy implementation is more comp

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-26 Thread Wang, Zhihong
Hi Luke,

I'm very glad that you're interested in this work. :)

I never published any performance data, and haven't run cachebench.
We mainly use test_memcpy_perf.c in DPDK to do the test, because it's the
environment DPDK runs in. You can also find the performance comparison with
glibc there.
It can be launched in /app/test: memcpy_perf_autotest.

Finally, in practice inlining can bring real benefits (constant-size
unrolling, for example), and for DPDK we need every possible optimization.
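
As a hedged illustration of the constant-size unrolling point (this is not
the DPDK implementation; the struct and function names here are made up):
when the copy length is a compile-time constant at an inlined call site, the
compiler can drop all size checks and emit a fixed pair of moves.

#include <stdint.h>
#include <string.h>

struct ether_addr_sketch {
	uint8_t bytes[6];
};

static inline void
copy_eth_addr(struct ether_addr_sketch *dst,
	      const struct ether_addr_sketch *src)
{
	/* With a constant n == 6, an inlined copy typically compiles down to
	 * a 4-byte move plus a 2-byte move: no call, no branches on the size. */
	memcpy(dst, src, sizeof(*dst));
}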


Thanks
John


From: lukego at gmail.com [mailto:luk...@gmail.com] On Behalf Of Luke Gorrie
Sent: Sunday, January 25, 2015 10:50 PM
To: Wang, Zhihong
Cc: dev at dpdk.org; snabb-devel at googlegroups.com
Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

Hi John,

On 19 January 2015 at 02:53, <zhihong.wang at intel.com> wrote:
This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
It also extends memcpy test coverage with unaligned cases and more test points.

I am really interested in this work you are doing on memory copies optimized 
for packet data. I would like to understand it in more depth. I have a lot of 
questions and ideas but let me try to keep it simple for now :-)

How do you benchmark? where does the "factor of 2-8" cited elsewhere in the 
thread come from? how can I reproduce? what results are you seeing compared 
with libc?

I did a quick benchmark this weekend based on cachebench
(http://icl.cs.utk.edu/projects/llcbench/cachebench.html). This seems
like a fairly weak benchmark (always L1 cache, always same alignment, always
predictable branches). Do you think this is relevant? How does this compare
with your results?

I compared:
  rte_memcpy (the new optimized one compiled with gcc-4.9 and -march=native and 
-O3)
  memcpy from glibc 2.19 (ubuntu 14.04)
  memcpy from glibc 2.20 (arch linux)

on hardware:
  E5-2620v3 (Haswell)
  E5-2650 (Sandy Bridge)

running cachebench like this:

./cachebench -p -e1 -x1 -m14

rte_memcpy.h on Haswell:

Memory Copy Library Cache Test

C Size  Nanosec  MB/sec     % Chnge
------  -------  ---------  -------
256     0.01     89191.88   1.00
384     0.01     96505.43   0.92
512     0.01     96509.19   1.00
768     0.01     91475.72   1.06
1024    0.01     96293.82   0.95
1536    0.01     96521.66   1.00
2048    0.01     96522.87   1.00
3072    0.01     96525.53   1.00
4096    0.01     96522.79   1.00
6144    0.01     96507.71   1.00
8192    0.01     94584.41   1.02
12288   0.01     95062.80   0.99
16384   0.01     80493.46   1.18

libc 2.20 on Haswell:

Memory Copy Library Cache Test

C Size  Nanosec  MB/sec      % Chnge
------  -------  ----------  -------
256     0.01     65978.64    1.00
384     0.01     100249.01   0.66
512     0.01     123476.55   0.81
768     0.01     144699.86   0.85
1024    0.01     159459.88   0.91
1536    0.01     168001.92   0.95
2048    0.01     80738.31    2.08
3072    0.01     80270.02    1.01
4096    0.01     84239.84    0.95
6144    0.01     90600.13    0.93
8192    0.01     89767.94    1.01
12288   0.01     92085.98    0.97
16384   0.01     92719.95    0.99

libc 2.19 on Haswell:

Memory Copy Library Cache Test

C Size  Nanosec  MB/sec     % Chnge
------  -------  ---------  -------
256     0.02     59871.69   1.00
384     0.01     68545.94   0.87
512     0.01     72674.23   0.94
768     0.01     79257.47   0.92
1024    0.01     79740.43   0.99
1536    0.01     85483.67   0.93
2048    0.01     87703.68   0.97
3072    0.01     86685.71   1.01
4096    0.01     87147.84   0.99
6144    0.01     68622.96   1.27
8192    0.01     70591.25   0.97
12288   0.01     72621.28   0.97
16384   0.01     67713.63   1.07

rte_memcpy on Sandy Bridge:

 Memory Copy Library Cache Test

C Size  Nanosec  MB/sec     % Chnge
------  -------  ---------  -------
256     0.02     62158.19   1.00
384     0.01     73256.41   0.85
512     0.01     82032.16   0.89
768 0.01739

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-27 Thread Wang, Zhihong


> -Original Message-
> From: Ananyev, Konstantin
> Sent: Tuesday, January 27, 2015 2:29 AM
> To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> Cc: dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> Hi Zhihong,
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> > Sent: Friday, January 23, 2015 6:52 AM
> > To: Richardson, Bruce; Marc Sune
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> >
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > Richardson
> > > Sent: Wednesday, January 21, 2015 9:26 PM
> > > To: Marc Sune
> > > Cc: dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > > >
> > > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > > >>>>-Original Message-
> > > > >>>>From: Richardson, Bruce
> > > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > > >>>>To: Neil Horman
> > > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > >>>>
> > > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > > > >>>>>>>-Original Message-
> > > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > > >>>>>>>To: Wang, Zhihong
> > > > >>>>>>>Cc: dev at dpdk.org
> > > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > >>>>>>>optimization
> > > > >>>>>>>
> > > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > >>>>>>>zhihong.wang at intel.com
> > > > >>>>wrote:
> > > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
> > > > >>>>>>>>AVX
> > > > >>>>platforms.
> > > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
> > > > >>>>>>>>and more test
> > > > >>>>>>>points.
> > > > >>>>>>>>Optimization techniques are summarized below:
> > > > >>>>>>>>
> > > > >>>>>>>>1. Utilize full cache bandwidth
> > > > >>>>>>>>
> > > > >>>>>>>>2. Enforce aligned stores
> > > > >>>>>>>>
> > > > >>>>>>>>3. Apply load address alignment based on architecture
> > > > >>>>>>>>features
> > > > >>>>>>>>
> > > > >>>>>>>>4. Make load/store address available as early as possible
> > > > >>>>>>>>
> > > > >>>>>>>>5. General optimization techniques like inlining, branch
> > > > >>>>>>>>reducing, prefetch pattern access
> > > > >>>>>>>>
> > > > >>>>>>>>Zhihong Wang (4):
> > > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> > > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> > > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> > > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> > > and AVX
> > > > >>>>>>>> platforms
> > > > >>>>>>>>
> > > > >>>>>>>>  app/test/Makefile  |   6 +

[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-27 Thread Wang, Zhihong


> -Original Message-
> From: Wodkowski, PawelX
> Sent: Monday, January 26, 2015 10:43 PM
> To: Wang, Zhihong; dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> Hi,
> 
> I must say: greate work.
> 
> I have some small comments:
> 
> > +/**
> > + * Macro for copying unaligned block from one location to another,
> > + * 47 bytes leftover maximum,
> > + * locations should not overlap.
> > + * Requirements:
> > + * - Store is aligned
> > + * - Load offset is , which must be immediate value within [1, 15]
> > + * - For , make sure  bit backwards & <16 - offset> bit
> forwards
> > are available for loading
> > + * - , ,  must be variables
> > + * - __m128i  ~  must be pre-defined
> > + */
> > +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)
> > \
> > +{  
> >  \
> ...
> > +}
> 
> Why not do { ... } while(0) or ({ ... }) ? This could have unpredictable side
> effects.
> 
> Second:
> Why you completely substitute
> #define rte_memcpy(dst, src, n)  \
>   ({ (__builtin_constant_p(n)) ?   \
>   memcpy((dst), (src), (n)) :  \
>   rte_memcpy_func((dst), (src), (n)); })
> 
> with inline rte_memcpy()? This construction  can help compiler to deduce
> which version to use (static?) inline implementation or call external
> function.
> 
> Did you try 'extern inline' type? It could help reducing compilation time.

Hi Pawel,

Good call on "MOVEUNALIGNED_LEFT47". Thanks!

I removed the conditional __builtin_constant_p(n) because it calls glibc
memcpy when the length parameter is constant, while rte_memcpy performs
better in that case.
The current long compile time is caused by too many function calls; I'll fix
that in the next version.
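
For reference, a small hedged illustration of the macro-hygiene point Pawel
raised (not DPDK code; the macros below are made up): a multi-statement macro
that is not wrapped in do { ... } while (0) silently breaks when used as the
body of an if without braces.

#define COPY_TWO_BAD(dst, src)        \
	*(dst) = *(src);              \
	*((dst) + 1) = *((src) + 1);

#define COPY_TWO_OK(dst, src)                 \
do {                                          \
	*(dst) = *(src);                      \
	*((dst) + 1) = *((src) + 1);          \
} while (0)

void
copy_two_example(int cond, char *dst, const char *src)
{
	if (cond)
		COPY_TWO_BAD(dst, src);  /* BUG: only the first assignment is
					  * guarded; the second always runs */

	if (cond)
		COPY_TWO_OK(dst, src);   /* expands to a single statement, so
					  * both assignments are guarded */
	else
		dst[0] = 0;              /* and a following else still parses */
}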

Zhihong (John)



[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-27 Thread Wang, Zhihong
Hey Luke,

Thanks for the excellent questions!

The following script will launch the memcpy test in DPDK:
echo -e 'memcpy_autotest\nmemcpy_perf_autotest\nquit\n' | 
./x86_64-native-linuxapp-gcc/app/test -c 4 -n 4 -- -i

Thanks for sharing the object code, I think it's the Sandy Bridge version
though.
The rte_memcpy for Haswell is quite simple too; this is a decision based on
arch differences: Haswell has significant improvements in its memory
hierarchy.
The Sandy Bridge unaligned memcpy is large in size but has better
performance, because converting unaligned loads into aligned ones is crucial
for in-cache memcpy on Sandy Bridge.

The rep instruction is still not fast enough, but I can't say much about it
since I haven't investigated it thoroughly.

To my understanding, memcpy optimization is all about trade-offs for a given
use case, and this one is for the DPDK scenario (small size, in cache: you
may find quite a few copies of only 6 bytes or so); you can refer to the RFC
for this patch.
It's not likely that one could make a version that's optimal for all
scenarios.

But I agree with the author of glibc memcpy on this: a program with too many
memcpys is a program with a design flaw.


Thanks
Zhihong (John)

From: lukego at gmail.com [mailto:luk...@gmail.com] On Behalf Of Luke Gorrie
Sent: Monday, January 26, 2015 4:03 PM
To: Wang, Zhihong
Cc: dev at dpdk.org; snabb-devel at googlegroups.com
Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

On 26 January 2015 at 02:30, Wang, Zhihong <zhihong.wang at intel.com> wrote:
Hi Luke,

I'm very glad that you're interested in this work. :)

Great :).

I never published any performance data, and haven't run cachebench.
We mainly use test_memcpy_perf.c in DPDK to do the test, because it's the
environment DPDK runs in. You can also find the performance comparison with
glibc there.
It can be launched in /app/test: memcpy_perf_autotest.

Could you give me a command-line example to run this please? (Sorry if this 
should be obvious.)

Finally, in practice inlining can bring real benefits (constant-size
unrolling, for example), and for DPDK we need every possible optimization.

Do we need to think about code size and potential instruction cache thrashing?

For me one call to rte_memcpy compiles to 3520 instructions
(https://gist.github.com/lukego/8b17a07246d999331b04) in 20KB of
object code. That's more than half the size of the Haswell instruction cache
(32KB) per call.

glibc 2.20's memcpy_avx_unaligned
(https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S;h=9f033f54568c3e5b6d9de9b3ba75f5be41070b92;hb=HEAD)
is only 909 bytes shared/total and also seems to have basically excellent
performance on Haswell.

So I am concerned about the code size of rte_memcpy, especially when inlined, 
and meta-concerned about the nonlinear impact of nested inlined functions on 
both compile time and object code size.


There is another issue that I am concerned about:

The Intel Optimization Guide suggests that rep movs is very efficient starting 
in Ivy Bridge. In practice though it seems to be much slower than using vector 
instructions, even though it is faster than it used to be in Sandy Bridge. Is 
that true?

This could have a substantial impact on off-the-shelf memcpy. glibc 2.20's 
memcpy uses movs for sizes >= 2048 and that is where performance takes a dive 
for me (in microbenchmarks). GCC will also emit inline string move instructions 
for certain constant-size memcpy calls at certain optimization levels.


So I feel like I haven't yet found the right memcpy for me. and we haven't even 
started to look at the interesting parts like cache-coherence behaviour when 
sharing data between cores (vhost) and whether streaming load/store can be used 
to defend the state of cache lines between cores.


Do I make any sense? What do I miss?


Cheers,
-Luke




[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-27 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of EDMISON, Kelvin
> (Kelvin)
> Sent: Friday, January 23, 2015 2:22 AM
> To: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> On 2015-01-21, 3:54 PM, "Neil Horman"  wrote:
> 
> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> >> On Wed, 21 Jan 2015 13:26:20 +
> >> Bruce Richardson  wrote:
> >>
> >> > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> >> > >
> >> > > On 21/01/15 14:02, Bruce Richardson wrote:
> >> > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> >> > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> >> > > >>>>-Original Message-
> >> > > >>>>From: Richardson, Bruce
> >> > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> >> > > >>>>To: Neil Horman
> >> > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> >> > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >> > > >>>>
> >> > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> >> > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong
> wrote:
> >> > > >>>>>>>-Original Message-
> >> > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> >> > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> >> > > >>>>>>>To: Wang, Zhihong
> >> > > >>>>>>>Cc: dev at dpdk.org
> >> > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> optimization
> >> > > >>>>>>>
> >> > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> >>zhihong.wang at intel.com
> >> > > >>>>wrote:
> >> > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
> >>AVX
> >> > > >>>>platforms.
> >> > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
> >>and
> >> > > >>>>>>>>more test
> >> > > >>>>>>>points.
> >> > > >>>>>>>>Optimization techniques are summarized below:
> >> > > >>>>>>>>
> >> > > >>>>>>>>1. Utilize full cache bandwidth
> >> > > >>>>>>>>
> >> > > >>>>>>>>2. Enforce aligned stores
> >> > > >>>>>>>>
> >> > > >>>>>>>>3. Apply load address alignment based on architecture
> >>features
> >> > > >>>>>>>>
> >> > > >>>>>>>>4. Make load/store address available as early as possible
> >> > > >>>>>>>>
> >> > > >>>>>>>>5. General optimization techniques like inlining, branch
> >> > > >>>>>>>>reducing, prefetch pattern access
> >> > > >>>>>>>>
> >> > > >>>>>>>>Zhihong Wang (4):
> >> > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> >> > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> >> > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> >> > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both
> SSE
> >>and AVX
> >> > > >>>>>>>> platforms
> >> > > >>>>>>>>
> >> > > >>>>>>>>  app/test/Makefile  |   6 +
> >> > > >>>>>>>>  app/test/test_memcpy.c |  52
> >>+-
> >> > > >>>>>>>>  app/test/test_memcpy_perf.c| 238
> >>+---
> >> > > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h   | 664
> >&

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-28 Thread Wang, Zhihong


> -Original Message-
> From: Ananyev, Konstantin
> Sent: Tuesday, January 27, 2015 8:20 PM
> To: Wang, Zhihong; Richardson, Bruce; 'Marc Sune'
> Cc: 'dev at dpdk.org'
> Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> > -Original Message-
> > From: Ananyev, Konstantin
> > Sent: Tuesday, January 27, 2015 11:30 AM
> > To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> > Cc: dev at dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> >
> >
> > > -Original Message-
> > > From: Wang, Zhihong
> > > Sent: Tuesday, January 27, 2015 1:42 AM
> > > To: Ananyev, Konstantin; Richardson, Bruce; Marc Sune
> > > Cc: dev at dpdk.org
> > > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Ananyev, Konstantin
> > > > Sent: Tuesday, January 27, 2015 2:29 AM
> > > > To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> > > > Cc: dev at dpdk.org
> > > > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > > Hi Zhihong,
> > > >
> > > > > -Original Message-
> > > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang,
> > > > > Zhihong
> > > > > Sent: Friday, January 23, 2015 6:52 AM
> > > > > To: Richardson, Bruce; Marc Sune
> > > > > Cc: dev at dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > >
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > > > > Richardson
> > > > > > Sent: Wednesday, January 21, 2015 9:26 PM
> > > > > > To: Marc Sune
> > > > > > Cc: dev at dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > >
> > > > > > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > > > > > >
> > > > > > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > > > > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > > > > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > > > > > >>>>-Original Message-
> > > > > > > >>>>From: Richardson, Bruce
> > > > > > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > > > > > >>>>To: Neil Horman
> > > > > > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > > > > > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > > > > >>>>optimization
> > > > > > > >>>>
> > > > > > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > > > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong
> wrote:
> > > > > > > >>>>>>>-Original Message-
> > > > > > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > > > > > >>>>>>>To: Wang, Zhihong
> > > > > > > >>>>>>>Cc: dev at dpdk.org
> > > > > > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > > > > >>>>>>>optimization
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > > > > >>>>>>>zhihong.wang at intel.com
> > > > > > > >>>>wrote:
> > > > > > > >>>>>>>>This patch set optimizes memcpy for DPDK for both
> > > > > > > >>>>>>>>SSE and AVX
> > > > > > > >>>>platforms.
> > > > > > > >>>>>>>>It also extends memcpy test coverage with unaligned
> > > > > > > >&

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-29 Thread Wang, Zhihong


> -Original Message-
> From: EDMISON, Kelvin (Kelvin) [mailto:kelvin.edmison at alcatel-lucent.com]
> Sent: Thursday, January 29, 2015 5:48 AM
> To: Wang, Zhihong; Stephen Hemminger; Neil Horman
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> On 2015-01-27, 3:22 AM, "Wang, Zhihong"  wrote:
> 
> >
> >
> >> -Original Message-
> >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of EDMISON,
> Kelvin
> >> (Kelvin)
> >> Sent: Friday, January 23, 2015 2:22 AM
> >> To: dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>
> >>
> >>
> >> On 2015-01-21, 3:54 PM, "Neil Horman" 
> wrote:
> >>
> >> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> >> >> On Wed, 21 Jan 2015 13:26:20 + Bruce Richardson
> >> >>  wrote:
> >> >>
> [..trim...]
> >> >> One issue I have is that as a vendor we need to ship on binary,
> >> >>not different distributions  for each Intel chip variant. There is
> >> >>some support for multi-chip version functions  but only in latest
> >> >>Gcc which isn't in Debian stable. And the
> >>multi-chip
> >> >>version
> >> >> of functions is going to be more expensive than inlining. For some
> >> >>cases, I have  seen that the overhead of fancy instructions looks
> >> >>good but have
> >>nasty
> >> >>side effects
> >> >> like CPU stall and/or increased power consumption which turns of
> >>turbo
> >> >>boost.
> >> >>
> >> >>
> >> >> Distro's in general have the same problem with special case
> >> >>optimizations.
> >> >>
> >> >What we really need is to do something like borrow the alternatives
> >> >mechanism from the kernel so that we can dynamically replace
> >> >instructions at run time based on cpu flags.  That way we could make
> >> >the choice at run time, and wouldn't have to do alot of special case
> >> >jumping about.
> >> >Neil
> >>
> >> +1.
> >>
> >> I think it should be an anti-requirement that the build machine be
> >> the exact same chip as the deployment platform.
> >>
> >> I like the cpu flag inspection approach.  It would help in the case
> >>where  DPDK is in a VM and an odd set of CPU flags have been exposed.
> >>
> >> If that approach doesn't work though, then perhaps DPDK memcpy could
> >>go  through a benchmarking at app startup time and select the most
> >>performant  option out of a set, like mdraid's raid6 implementation
> >>does.  To give an  example, this is what my systems print out at boot
> >>time re: raid6  algorithm selection.
> >> raid6: sse2x1   3171 MB/s
> >> raid6: sse2x2   3925 MB/s
> >> raid6: sse2x4   4523 MB/s
> >> raid6: using algorithm sse2x4 (4523 MB/s)
> >>
> >> Regards,
> >>Kelvin
> >>
> >
> >Thanks for the proposal!
> >
> >For DPDK, performance is always the most important concern. We need to
> >utilize new architecture features to achieve that, so solution per arch
> >is necessary.
> >Even a few extra cycles can lead to bad performance if they're in a hot
> >loop.
> >For instance, let's assume DPDK takes 60 cycles to process a packet on
> >average, then 3 more cycles here means 5% performance drop.
> >
> >The dynamic solution is doable but with performance penalties, even if
> >it could be small. Also it may bring extra complexity, which can lead
> >to unpredictable behaviors and side effects.
> >For example, the dynamic solution won't have inline unrolling, which
> >can bring significant performance benefit for small copies with
> >constant length, like eth_addr.
> >
> >We can investigate the VM scenario more.
> >
> >Zhihong (John)
> 
> John,
> 
>   Thanks for taking the time to answer my newbie question. I deeply
> appreciate the attention paid to performance in DPDK. I have a follow-up
> though.
> 
> I'm trying to figure out what requirements this approach creates for the
> software build environment.  If we want to build optimized versions for
> Haswell, Ivy Bridge, Sandy Bridge, etc, does this mean that we must have one
> of each micro-architecture available for running the builds, or is there a way
> of cross-compiling for all micro-architectures from just one build
> environment?
> 
> Thanks,
>   Kelvin
> 

I'm not an expert in this, just some facts based on my tests: the compile
process depends on the compiler and the library version.
So even on a machine that doesn't support the necessary ISA, it should still
compile as long as gcc, glibc, etc. have the support; you'll just get
"Illegal instruction" when trying to launch the compiled binary.

Therefore, if there's a way (worst case: change the flags manually) to make
the DPDK build process think it's on a Haswell machine, it will produce
Haswell binaries.

Zhihong (John)


[dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-30 Thread Wang, Zhihong
Hey Konstantin,

This method does reduce code size but leads to a significant performance drop.
I think we need to keep the original code.


Thanks
Zhihong (John)


> -Original Message-
> From: Ananyev, Konstantin
> Sent: Thursday, January 29, 2015 11:18 PM
> To: Wang, Zhihong; dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> Hi Zhihong,
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zhihong Wang
> > Sent: Thursday, January 29, 2015 2:39 AM
> > To: dev at dpdk.org
> > Subject: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in
> > arch/x86/rte_memcpy.h for both SSE and AVX platforms
> >
> > Main code changes:
> >
> > 1. Differentiate architectural features based on CPU flags
> >
> > a. Implement separated move functions for SSE/AVX/AVX2 to make
> > full utilization of cache bandwidth
> >
> > b. Implement separated copy flow specifically optimized for target
> > architecture
> >
> > 2. Rewrite the memcpy function "rte_memcpy"
> >
> > a. Add store aligning
> >
> > b. Add load aligning based on architectural features
> >
> > c. Put block copy loop into inline move functions for better
> > control of instruction order
> >
> > d. Eliminate unnecessary MOVs
> >
> > 3. Rewrite the inline move functions
> >
> > a. Add move functions for unaligned load cases
> >
> > b. Change instruction order in copy loops for better pipeline
> > utilization
> >
> > c. Use intrinsics instead of assembly code
> >
> > 4. Remove slow glibc call for constant copies
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> >  .../common/include/arch/x86/rte_memcpy.h   | 680
> +++--
> >  1 file changed, 509 insertions(+), 171 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > index fb9eba8..7b2d382 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > @@ -34,166 +34,189 @@
> >  #ifndef _RTE_MEMCPY_X86_64_H_
> >  #define _RTE_MEMCPY_X86_64_H_
> >
> > +/**
> > + * @file
> > + *
> > + * Functions for SSE/AVX/AVX2 implementation of memcpy().
> > + */
> > +
> > +#include 
> >  #include 
> >  #include 
> > -#include 
> > +#include 
> >
> >  #ifdef __cplusplus
> >  extern "C" {
> >  #endif
> >
> > -#include "generic/rte_memcpy.h"
> > +/**
> > + * Copy bytes from one location to another. The locations must not
> overlap.
> > + *
> > + * @note This is implemented as a macro, so it's address should not
> > +be taken
> > + * and care is needed as parameter expressions may be evaluated
> multiple times.
> > + *
> > + * @param dst
> > + *   Pointer to the destination of the data.
> > + * @param src
> > + *   Pointer to the source data.
> > + * @param n
> > + *   Number of bytes to copy.
> > + * @return
> > + *   Pointer to the destination data.
> > + */
> > +static inline void *
> > +rte_memcpy(void *dst, const void *src, size_t n)
> > +__attribute__((always_inline));
> >
> > -#ifdef __INTEL_COMPILER
> > -#pragma warning(disable:593) /* Stop unused variable warning (reg_a
> > etc). */ -#endif
> > +#ifdef RTE_MACHINE_CPUFLAG_AVX2
> >
> > +/**
> > + * AVX2 implementation below
> > + */
> > +
> > +/**
> > + * Copy 16 bytes from one location to another,
> > + * locations should not overlap.
> > + */
> >  static inline void
> >  rte_mov16(uint8_t *dst, const uint8_t *src)  {
> > -   __m128i reg_a;
> > -   asm volatile (
> > -   "movdqu (%[src]), %[reg_a]\n\t"
> > -   "movdqu %[reg_a], (%[dst])\n\t"
> > -   : [reg_a] "=x" (reg_a)
> > -   : [src] "r" (src),
> > - [dst] "r"(dst)
> > -   : "memory"
> > -   );
> > +   __m128i xmm0;
> > +
> > +   xmm0 = _mm_loadu_si128((const __m128i *)src);
> > +   _mm_storeu_si128((__m128i *)dst, xmm0);
> >  }
> >
> > +/**
> > + * Copy 32 bytes from one location to another,
> > + * locations should not overlap.
> > + *

Re: [dpdk-dev] [PATCH v5 2/5] vhost: support selective datapath

2018-04-03 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Tuesday, April 3, 2018 4:19 PM
> To: Wang, Zhihong ; dev@dpdk.org
> Cc: Tan, Jianfeng ; Bie, Tiwei ;
> y...@fridaylinux.org; Liang, Cunming ; Wang, Xiao
> W ; Daly, Dan 
> Subject: Re: [PATCH v5 2/5] vhost: support selective datapath
> 
> 
> 
> On 04/02/2018 01:46 PM, Zhihong Wang wrote:
> > +int
> > +rte_vdpa_register_device(struct rte_vdpa_dev_addr *addr,
> > +   struct rte_vdpa_dev_ops *ops)
> > +{
> > +   struct rte_vdpa_device *dev;
> > +   char device_name[MAX_VDPA_NAME_LEN];
> > +   int i;
> > +
> > +   if (vdpa_device_num >= MAX_VHOST_DEVICE)
> > +   return -1;
> > +
> > +   for (i = 0; i < MAX_VHOST_DEVICE; i++) {
> > +   if (vdpa_devices[i] && is_same_vdpa_device(addr,
> > +   &vdpa_devices[i]->addr))
> > +   return -1;
> > +   }
> 
> For consistency, I changed above check to look like same one in
> _find_device_id:

That's better. Thanks.

> 
>   for (i = 0; i < MAX_VHOST_DEVICE; i++) {
>   dev = vdpa_devices[i];
>   if (dev && is_same_vdpa_device(&dev->addr, addr))
>   return -1;
>   }
> 
> > +
> > +   for (i = 0; i < MAX_VHOST_DEVICE; i++) {
> > +   if (vdpa_devices[i] == NULL)
> > +   break;
> > +   }
> > +
> > +   sprintf(device_name, "vdpa-dev-%d", i);
> > +   dev = rte_zmalloc(device_name, sizeof(struct rte_vdpa_device),
> > +   RTE_CACHE_LINE_SIZE);
> > +   if (!dev)
> > +   return -1;
> > +
> > +   memcpy(&dev->addr, addr, sizeof(struct rte_vdpa_dev_addr));
> > +   dev->ops = ops;
> > +   vdpa_devices[i] = dev;
> > +   vdpa_device_num++;
> > +
> > +   return i;
> > +}
> > +
> > +int
> > +rte_vdpa_unregister_device(int did)
> > +{
> > +   if (did < 0 || did >= MAX_VHOST_DEVICE || vdpa_devices[did] == NULL)
> > +   return -1;
> > +
> > +   rte_free(vdpa_devices[did]);
> > +   vdpa_devices[did] = NULL;
> > +   vdpa_device_num--;
> > +
> > +   return did;
> > +}
> > +
> > +int
> > +rte_vdpa_find_device_id(struct rte_vdpa_dev_addr *addr)
> > +{
> > +   struct rte_vdpa_device *dev;
> > +   int i;
> > +
> > +   for (i = 0; i < MAX_VHOST_DEVICE; ++i) {
> > +   dev = vdpa_devices[i];
> > +   if (dev && is_same_vdpa_device(&dev->addr, addr) == 0)
> > +   return i;
> > +   }
> > +
> > +   return -1;
> > +}
> > +


Re: [dpdk-dev] [PATCH v2 6/6] vhost: export new apis

2018-03-15 Thread Wang, Zhihong


> -Original Message-
> From: Tan, Jianfeng
> Sent: Tuesday, March 6, 2018 5:52 PM
> To: Wang, Zhihong ; dev@dpdk.org
> Cc: Bie, Tiwei ; maxime.coque...@redhat.com;
> y...@fridaylinux.org; Liang, Cunming ; Wang, Xiao
> W ; Daly, Dan 
> Subject: RE: [PATCH v2 6/6] vhost: export new apis
> 
> 
> 
> > -----Original Message-
> > From: Wang, Zhihong
> > Sent: Tuesday, February 13, 2018 5:21 PM
> > To: dev@dpdk.org
> > Cc: Tan, Jianfeng; Bie, Tiwei; maxime.coque...@redhat.com;
> > y...@fridaylinux.org; Liang, Cunming; Wang, Xiao W; Daly, Dan; Wang,
> > Zhihong
> > Subject: [PATCH v2 6/6] vhost: export new apis
> >
> > This patch exports new APIs as experimental.
> 
> How about squeezing this patch with patch 2 where the APIs are introduced,
> as well as the related doc update?

Ok, will do that.

Thanks
-Zhihong

> 
> Thanks,
> Jianfeng
> 
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> >  lib/librte_vhost/rte_vdpa.h| 16 +++-
> >  lib/librte_vhost/rte_vhost.h   | 33 ++-
> --
> >  lib/librte_vhost/rte_vhost_version.map | 19 +++
> >  3 files changed, 52 insertions(+), 16 deletions(-)
> >
> > diff --git a/lib/librte_vhost/rte_vdpa.h b/lib/librte_vhost/rte_vdpa.h
> > index 1bde36f7f..23fb471be 100644
> > --- a/lib/librte_vhost/rte_vdpa.h
> > +++ b/lib/librte_vhost/rte_vdpa.h
> > @@ -100,15 +100,21 @@ extern struct rte_vdpa_engine *vdpa_engines[];
> >  extern uint32_t vdpa_engine_num;
> >
> >  /* engine management */
> > -int rte_vdpa_register_engine(const char *name, struct
> rte_vdpa_eng_addr
> > *addr);
> > -int rte_vdpa_unregister_engine(int eid);
> > +int __rte_experimental
> > +rte_vdpa_register_engine(const char *name, struct rte_vdpa_eng_addr
> > *addr);
> >
> > -int rte_vdpa_find_engine_id(struct rte_vdpa_eng_addr *addr);
> > +int __rte_experimental
> > +rte_vdpa_unregister_engine(int eid);
> >
> > -int rte_vdpa_info_query(int eid, struct rte_vdpa_eng_attr *attr);
> > +int __rte_experimental
> > +rte_vdpa_find_engine_id(struct rte_vdpa_eng_addr *addr);
> > +
> > +int __rte_experimental
> > +rte_vdpa_info_query(int eid, struct rte_vdpa_eng_attr *attr);
> >
> >  /* driver register api */
> > -void rte_vdpa_register_driver(struct rte_vdpa_eng_driver *drv);
> > +void __rte_experimental
> > +rte_vdpa_register_driver(struct rte_vdpa_eng_driver *drv);
> >
> >  #define RTE_VDPA_REGISTER_DRIVER(nm, drv) \
> >  RTE_INIT(vdpainitfn_ ##nm); \
> > diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> > index 48005d9ff..d5589c543 100644
> > --- a/lib/librte_vhost/rte_vhost.h
> > +++ b/lib/librte_vhost/rte_vhost.h
> > @@ -187,7 +187,8 @@ int rte_vhost_driver_unregister(const char *path);
> >   * @return
> >   *  0 on success, -1 on failure
> >   */
> > -int rte_vhost_driver_set_vdpa_eid(const char *path, int eid);
> > +int __rte_experimental
> > +rte_vhost_driver_set_vdpa_eid(const char *path, int eid);
> >
> >  /**
> >   * Set the device id, enforce single connection per socket
> > @@ -199,7 +200,8 @@ int rte_vhost_driver_set_vdpa_eid(const char
> *path,
> > int eid);
> >   * @return
> >   *  0 on success, -1 on failure
> >   */
> > -int rte_vhost_driver_set_vdpa_did(const char *path, int did);
> > +int __rte_experimental
> > +rte_vhost_driver_set_vdpa_did(const char *path, int did);
> >
> >  /**
> >   * Get the engine id
> > @@ -209,7 +211,8 @@ int rte_vhost_driver_set_vdpa_did(const char
> *path,
> > int did);
> >   * @return
> >   *  Engine id, -1 on failure
> >   */
> > -int rte_vhost_driver_get_vdpa_eid(const char *path);
> > +int __rte_experimental
> > +rte_vhost_driver_get_vdpa_eid(const char *path);
> >
> >  /**
> >   * Get the device id
> > @@ -219,7 +222,8 @@ int rte_vhost_driver_get_vdpa_eid(const char
> *path);
> >   * @return
> >   *  Device id, -1 on failure
> >   */
> > -int rte_vhost_driver_get_vdpa_did(const char *path);
> > +int __rte_experimental
> > +rte_vhost_driver_get_vdpa_did(const char *path);
> >
> >  /**
> >   * Set the feature bits the vhost-user driver supports.
> > @@ -286,7 +290,8 @@ int rte_vhost_driver_get_features(const char *path,
> > uint64_t *features);
> >   * @return
> >   *  0 on success, -1 on failure
> >   */
> > -int rte_vhost_driver_get_protocol_features(const char *path,
> > +int __rte_ex

Re: [dpdk-dev] [PATCH v2 1/6] vhost: export vhost feature definitions

2018-03-15 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Tuesday, March 6, 2018 10:03 PM
> To: Tan, Jianfeng ; Wang, Zhihong
> ; dev@dpdk.org
> Cc: Bie, Tiwei ; y...@fridaylinux.org; Liang, Cunming
> ; Wang, Xiao W ; Daly,
> Dan 
> Subject: Re: [PATCH v2 1/6] vhost: export vhost feature definitions
> 
> 
> 
> On 03/06/2018 10:37 AM, Tan, Jianfeng wrote:
> >
> >
> >> -Original Message-
> >> From: Wang, Zhihong
> >> Sent: Tuesday, February 13, 2018 5:21 PM
> >> To: dev@dpdk.org
> >> Cc: Tan, Jianfeng; Bie, Tiwei; maxime.coque...@redhat.com;
> >> y...@fridaylinux.org; Liang, Cunming; Wang, Xiao W; Daly, Dan; Wang,
> >> Zhihong
> >> Subject: [PATCH v2 1/6] vhost: export vhost feature definitions
> >>
> >> This patch exports vhost-user protocol features to support device driver
> >> development.
> >>
> >> Signed-off-by: Zhihong Wang 
> >> ---
> >>   lib/librte_vhost/rte_vhost.h  |  8 
> >>   lib/librte_vhost/vhost.h  |  4 +---
> >>   lib/librte_vhost/vhost_user.c |  9 +
> >>   lib/librte_vhost/vhost_user.h | 20 +++-
> >>   4 files changed, 21 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> >> index d33206997..b05162366 100644
> >> --- a/lib/librte_vhost/rte_vhost.h
> >> +++ b/lib/librte_vhost/rte_vhost.h
> >> @@ -29,6 +29,14 @@ extern "C" {
> >>   #define RTE_VHOST_USER_DEQUEUE_ZERO_COPY (1ULL << 2)
> >>   #define RTE_VHOST_USER_IOMMU_SUPPORT (1ULL << 3)
> >>
> >> +#define RTE_VHOST_USER_PROTOCOL_F_MQ  0
> >
> > Instead of adding a "RTE_" prefix. I prefer to define it like this:
> > #ifndef VHOST_USER_PROTOCOL_F_MQ
> > #define VHOST_USER_PROTOCOL_F_MQ   0
> > #endif
> >
> > Similar to other macros.
> 
> I agree, it is better to keep same naming as in the spec IMHO.

Ok. Thanks Jianfeng and Maxime.

> 
> >> +#define RTE_VHOST_USER_PROTOCOL_F_LOG_SHMFD   1
> >> +#define RTE_VHOST_USER_PROTOCOL_F_RARP2
> >> +#define RTE_VHOST_USER_PROTOCOL_F_REPLY_ACK   3
> >> +#define RTE_VHOST_USER_PROTOCOL_F_NET_MTU 4
> >> +#define RTE_VHOST_USER_PROTOCOL_F_SLAVE_REQ   5
> >> +#define RTE_VHOST_USER_F_PROTOCOL_FEATURES30
> 
> Please put the above declaration separately, it could be misleading,
> making to think it is a vhost-user protocol feature whereas it is a
> Virtio feature.

Good point. Will change it.

-Zhihong

> 
> >> +
> >>   /**
> >>* Information relating to memory regions including offsets to
> >>* addresses in QEMUs memory file.
> >> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> >> index 58aec2e0d..a0b0520e2 100644
> >> --- a/lib/librte_vhost/vhost.h
> >> +++ b/lib/librte_vhost/vhost.h
> >> @@ -174,8 +174,6 @@ struct vhost_msg {
> >>#define VIRTIO_F_VERSION_1 32
> >>   #endif
> >>
> >> -#define VHOST_USER_F_PROTOCOL_FEATURES30
> >> -
> >>   /* Features supported by this builtin vhost-user net driver. */
> >>   #define VIRTIO_NET_SUPPORTED_FEATURES ((1ULL <<
> >> VIRTIO_NET_F_MRG_RXBUF) | \
> >>(1ULL << VIRTIO_F_ANY_LAYOUT) | \
> >> @@ -185,7 +183,7 @@ struct vhost_msg {
> >>(1ULL << VIRTIO_NET_F_MQ)  | \
> >>(1ULL << VIRTIO_F_VERSION_1)   | \
> >>(1ULL << VHOST_F_LOG_ALL)  | \
> >> -  (1ULL <<
> >> VHOST_USER_F_PROTOCOL_FEATURES) | \
> >> +  (1ULL <<
> >> RTE_VHOST_USER_F_PROTOCOL_FEATURES) | \
> >>(1ULL << VIRTIO_NET_F_GSO) | \
> >>(1ULL << VIRTIO_NET_F_HOST_TSO4) | \
> >>(1ULL << VIRTIO_NET_F_HOST_TSO6) | \
> >> diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> >> index 5c5361066..c93e48e4d 100644
> >> --- a/lib/librte_vhost/vhost_user.c
> >> +++ b/lib/librte_vhost/vhost_user.c
> >> @@ -527,7 +527,7 @@ vhost_user_set_vring_addr(struct virtio_net
> **pdev,
> >> VhostUserMsg *msg)
> >>vring_in

Re: [dpdk-dev] [PATCH v3 2/5] vhost: support selective datapath

2018-03-22 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Thursday, March 22, 2018 5:06 AM
> To: Wang, Zhihong ; dev@dpdk.org
> Cc: Tan, Jianfeng ; Bie, Tiwei
> ; y...@fridaylinux.org; Liang, Cunming
> ; Wang, Xiao W ; Daly,
> Dan 
> Subject: Re: [PATCH v3 2/5] vhost: support selective datapath
> 
> 
> 
> On 02/27/2018 11:13 AM, Zhihong Wang wrote:
> > This patch introduces support for selective datapath in DPDK vhost-user lib
> > to enable various types of virtio-compatible devices to do data transfer
> > with virtio driver directly to enable acceleration. The default datapath is
> > the existing software implementation, more options will be available when
> > new engines are registered.
> >
> > An engine is a group of virtio-compatible devices under a single address.
> > The engine driver includes:
> >
> >   1. A set of engine ops is defined in rte_vdpa_eng_ops to perform engine
> >  init, uninit, and attributes reporting.
> >
> >   2. A set of device ops is defined in rte_vdpa_dev_ops for virtio devices
> >  in the engine to do device specific operations:
> >
> >   a. dev_conf: Called to configure the actual device when the virtio
> >  device becomes ready.
> >
> >   b. dev_close: Called to close the actual device when the virtio device
> >  is stopped.
> >
> >   c. vring_state_set: Called to change the state of the vring in the
> >  actual device when vring state changes.
> >
> >   d. feature_set: Called to set the negotiated features to device.
> >
> >   e. migration_done: Called to allow the device to response to RARP
> >  sending.
> >
> >   f. get_vfio_group_fd: Called to get the VFIO group fd of the device.
> >
> >   g. get_vfio_device_fd: Called to get the VFIO device fd of the device.
> >
> >   h. get_notify_area: Called to get the notify area info of the queue.
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> > Changes in v2:
> >
> >   1. Add VFIO related vDPA device ops.
> >
> >   lib/librte_vhost/Makefile  |   4 +-
> >   lib/librte_vhost/rte_vdpa.h| 126
> +
> >   lib/librte_vhost/rte_vhost_version.map |   8 +++
> >   lib/librte_vhost/vdpa.c| 124
> 
> >   4 files changed, 260 insertions(+), 2 deletions(-)
> >   create mode 100644 lib/librte_vhost/rte_vdpa.h
> >   create mode 100644 lib/librte_vhost/vdpa.c
> >
> > diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> > index 5d6c6abae..37044ac03 100644
> > --- a/lib/librte_vhost/Makefile
> > +++ b/lib/librte_vhost/Makefile
> > @@ -22,9 +22,9 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -
> lrte_ethdev -lrte_net
> >
> >   # all source are stored in SRCS-y
> >   SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c
> \
> > -   vhost_user.c virtio_net.c
> > +   vhost_user.c virtio_net.c vdpa.c
> >
> >   # install includes
> > -SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
> > +SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
> rte_vdpa.h
> >
> >   include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_vhost/rte_vdpa.h b/lib/librte_vhost/rte_vdpa.h
> > new file mode 100644
> > index 0..23fb471be
> > --- /dev/null
> > +++ b/lib/librte_vhost/rte_vdpa.h
> > @@ -0,0 +1,126 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2018 Intel Corporation
> > + */
> > +
> > +#ifndef _RTE_VDPA_H_
> > +#define _RTE_VDPA_H_
> > +
> > +/**
> > + * @file
> > + *
> > + * Device specific vhost lib
> > + */
> > +
> > +#include 
> > +#include "rte_vhost.h"
> > +
> > +#define MAX_VDPA_ENGINE_NUM 128
> > +#define MAX_VDPA_NAME_LEN 128
> > +
> > +struct rte_vdpa_eng_addr {
> > +   union {
> > +   uint8_t __dummy[64];
> > +   struct rte_pci_addr pci_addr;
> I think we should not only support PCI, but any type of buses.
> At least in the API.

Exactly, that's why we defined a 64-byte union, so that any bus type can be
added without breaking the ABI.

One place that may be impacted, though, is the is_same_eng() function.
Maybe comparing all the bytes in __dummy[64] is a better way. What do you
think?
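
For reference, a minimal sketch of what that comparison could look like
(an illustration only; it assumes the struct rte_vdpa_eng_addr definition
from the patch above and that unused bytes of the union are zero-initialized):

    #include <string.h>
    #include "rte_vdpa.h"

    static int
    is_same_eng(const struct rte_vdpa_eng_addr *a,
            const struct rte_vdpa_eng_addr *b)
    {
        /* Compare the full 64-byte storage instead of only pci_addr,
         * so bus types added to the union later are covered too.
         */
        return memcmp(a->__dummy, b->__dummy, sizeof(a->__dummy)) == 0;
    }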


Re: [dpdk-dev] [PATCH v3 3/5] vhost: add apis for datapath configuration

2018-03-22 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Thursday, March 22, 2018 5:08 AM
> To: Wang, Zhihong ; dev@dpdk.org
> Cc: Tan, Jianfeng ; Bie, Tiwei
> ; y...@fridaylinux.org; Liang, Cunming
> ; Wang, Xiao W ; Daly,
> Dan 
> Subject: Re: [PATCH v3 3/5] vhost: add apis for datapath configuration
> 
> 
> 
> On 02/27/2018 11:13 AM, Zhihong Wang wrote:
> > This patch adds APIs for datapath configuration. The eid and did of the
> > vhost-user socket can be configured to identify the actual device.
> >
> > When the default software datapath is used, eid and did are set to -1.
> > When alternative datapath is used, eid and did are set by app to specify
> > which device to use. Each vhost-user socket can have only 1 connection in
> > this case.
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> >   lib/librte_vhost/rte_vhost.h   | 70
> ++
> >   lib/librte_vhost/rte_vhost_version.map |  6 +++
> >   lib/librte_vhost/socket.c  | 65
> +++
> >   lib/librte_vhost/vhost.c   | 50 
> >   lib/librte_vhost/vhost.h   | 10 +
> >   5 files changed, 201 insertions(+)
> >
> 
> Isn't the notion of EID & DID Intel specifics?
> At vhost API level, shouldn't we only care of the offload device ID?

It's not vendor specific: the engine id refers to an engine, which is a device
on a bus; the engine could have multiple queue pairs or virtual functions.
The driver can manage them to present multiple vhost ports with vDPA to the
application, so logically the concept of a device id exists.

In a lot of acceleration cases the application needs to be able to choose the
exact port to use instead of letting the driver decide (because it does
make a difference), therefore it's necessary to expose the device id here.


Re: [dpdk-dev] [PATCH v3 3/5] vhost: add apis for datapath configuration

2018-03-23 Thread Wang, Zhihong
Hi Maxime,

> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Thursday, March 22, 2018 10:19 PM
> To: Wang, Zhihong ; dev@dpdk.org
> Cc: Tan, Jianfeng ; Bie, Tiwei
> ; y...@fridaylinux.org; Liang, Cunming
> ; Wang, Xiao W ; Daly,
> Dan 
> Subject: Re: [PATCH v3 3/5] vhost: add apis for datapath configuration
> 
> Hi,
> 
> On 03/22/2018 09:22 AM, Wang, Zhihong wrote:
> >
> >
> >> -Original Message-
> >> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> >> Sent: Thursday, March 22, 2018 5:08 AM
> >> To: Wang, Zhihong ; dev@dpdk.org
> >> Cc: Tan, Jianfeng ; Bie, Tiwei
> >> ; y...@fridaylinux.org; Liang, Cunming
> >> ; Wang, Xiao W ;
> Daly,
> >> Dan 
> >> Subject: Re: [PATCH v3 3/5] vhost: add apis for datapath configuration
> >>
> >>
> >>
> >> On 02/27/2018 11:13 AM, Zhihong Wang wrote:
> >>> This patch adds APIs for datapath configuration. The eid and did of the
> >>> vhost-user socket can be configured to identify the actual device.
> >>>
> >>> When the default software datapath is used, eid and did are set to -1.
> >>> When alternative datapath is used, eid and did are set by app to specify
> >>> which device to use. Each vhost-user socket can have only 1 connection
> in
> >>> this case.
> >>>
> >>> Signed-off-by: Zhihong Wang 
> >>> ---
> >>>lib/librte_vhost/rte_vhost.h   | 70
> >> ++
> >>>lib/librte_vhost/rte_vhost_version.map |  6 +++
> >>>lib/librte_vhost/socket.c  | 65
> >> +++
> >>>lib/librte_vhost/vhost.c   | 50 
> >>>lib/librte_vhost/vhost.h   | 10 +
> >>>5 files changed, 201 insertions(+)
> >>>
> >>
> >> Isn't the notion of EID & DID Intel specifics?
> >> At vhost API level, shouldn't we only care of the offload device ID?
> >
> > It's not vendor specific: Engine id refers to an engine which is a device
> > on a bus, the engine could have multiple queue pairs or virtual functions.
> > The driver can manage them to present multiple vhost ports with vDPA to
> > application, so logically the concept of device id exists.
> >
> > In a lot of acceleration cases, application needs to be able to choose the
> > exact port to use instead of letting the driver to decide (because it does
> > make a difference), therefore it's necessary to expose the device id here.
> 
> Yes, but if I understood correctly with the IFCVF driver, we could pass
> directly the virtual function to the vhost-user lib, no need to specify
> the engine. We would just need to register one device per VF, but that
> looks like the right think to do looking at how IFCVF manages MAC
> addresses and link status for example.

The lib is meant for generic designs. An engine could also be an AFU device [1]
that has multiple virtio-ring-compatible queue pairs serving different VMs
independently, instead of multiple virtual functions. In that case we need
the eid to index the AFU device, and the did to index the queue pair(s).

[1] http://dpdk.org/ml/archives/dev/2018-March/093343.html (struct 
rte_afu_device)

Thanks
-Zhihong

> 
> Thanks,
> Maxime


Re: [dpdk-dev] [PATCH v3 0/5] vhost: support selective datapath

2018-03-30 Thread Wang, Zhihong
Pawel,

> >  3. To make vhost aware of its own type, an engine id (eid) and a device
> > id (did) are added into the vhost data structure to identify the actual
> > device. APIs are introduced to let app configure them. When the default
> > software datapath is used, eid and did are set to -1. When alternative
> > datapath is used, eid and did are set by app to specify which device to
> > use. Each vhost-user socket can have only 1 connection in this case.
> 
> Why only one connection is possible? We are already working on multiple
> simultaneous connections in SPDK. So this will be some kind of step
> backward.

Nothing changes for existing use cases. This restriction only applies when an
alternative vDPA backend (most likely HW) is used, because in that case
per-vhost-port resource locking is required for provisioning considerations.

> 
> >
> > Working process:
> > 
> >
> >  1. Register driver during DPDK initialization.
> >
> >  2. Register engine with driver name and address.
> >
> >  3. Get engine attributes.
> >
> >  4. For vhost device creation:
> >
> >   a. Register vhost-user socket.
> >
> >   b. Set eid and did of the vhost-user socket.
> >
> >   c. Register vhost-user callbacks.
> >
> >   d. Start to wait for connection.
> >
> >  4. When connection comes and virtio device data structure is negotiated,
> > the device will be configured with all needed info.
> >
> 
> Can you please provide new or modify existing example to show how to use
> this new API?
> It would be easier to find any possible gaps if we can see real use case.

This patch set contains only the lib change; a driver is also in progress at:
http://dpdk.org/ml/archives/dev/2018-March/093305.html

You can learn how it works from an old RFC patch at:
http://dpdk.org/ml/archives/dev/2017-December/085044.html
A lot of details have changed, but the idea is the same.
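
For a quick feel of the working process quoted above, here is a rough usage
sketch. It assumes the v4 API names that appear later in this thread
(rte_vdpa_register_device() returning the did, rte_vhost_driver_set_vdpa_did());
my_vdpa_dev_ops, vdpa_notify_ops and the socket path are placeholders:

    /* vDPA driver side, at init time: register the device and its ops. */
    struct rte_vdpa_dev_addr addr = {
        .type = PCI_ADDR, /* .pci_addr = ... */
    };
    int did = rte_vdpa_register_device(&addr, &my_vdpa_dev_ops);

    /* Application side, per vhost-user socket. */
    rte_vhost_driver_register("/tmp/vhost-user0", 0);
    rte_vhost_driver_set_vdpa_did("/tmp/vhost-user0", did);
    rte_vhost_driver_callback_register("/tmp/vhost-user0", &vdpa_notify_ops);
    rte_vhost_driver_start("/tmp/vhost-user0");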

-Zhihong


Re: [dpdk-dev] [PATCH v4 2/5] vhost: support selective datapath

2018-04-01 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Saturday, March 31, 2018 2:10 PM
> To: Wang, Zhihong ; dev@dpdk.org
> Cc: Tan, Jianfeng ; Bie, Tiwei
> ; y...@fridaylinux.org; Liang, Cunming
> ; Wang, Xiao W ; Daly,
> Dan 
> Subject: Re: [PATCH v4 2/5] vhost: support selective datapath
> 
> 
> 
> On 03/10/2018 11:01 AM, Zhihong Wang wrote:
> > This patch set introduces support for selective datapath in DPDK vhost-user
> > lib. vDPA stands for vhost Data Path Acceleration. The idea is to support
> > virtio ring compatible devices to serve virtio driver directly to enable
> > datapath acceleration.
> >
> > A set of device ops is defined for device specific operations:
> >
> >   a. queue_num_get: Called to get supported queue number of the
> device.
> >
> >   b. feature_get: Called to get supported features of the device.
> >
> >   c. protocol_feature_get: Called to get supported protocol features of
> >  the device.
> >
> >   d. dev_conf: Called to configure the actual device when the virtio
> >  device becomes ready.
> >
> >   e. dev_close: Called to close the actual device when the virtio device
> >  is stopped.
> >
> >   f. vring_state_set: Called to change the state of the vring in the
> >  actual device when vring state changes.
> >
> >   g. feature_set: Called to set the negotiated features to device.
> >
> >   h. migration_done: Called to allow the device to response to RARP
> >  sending.
> >
> >   i. get_vfio_group_fd: Called to get the VFIO group fd of the device.
> >
> >   j. get_vfio_device_fd: Called to get the VFIO device fd of the device.
> >
> >   k. get_notify_area: Called to get the notify area info of the queue.
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> > Changes in v4:
> >
> >   1. Remove the "engine" concept in the lib.
> >
> > ---
> > Changes in v2:
> >
> >   1. Add VFIO related vDPA device ops.
> >
> >   lib/librte_vhost/Makefile  |  4 +-
> >   lib/librte_vhost/rte_vdpa.h| 94
> +
> >   lib/librte_vhost/rte_vhost_version.map |  6 +++
> >   lib/librte_vhost/vdpa.c| 96
> ++
> >   4 files changed, 198 insertions(+), 2 deletions(-)
> >   create mode 100644 lib/librte_vhost/rte_vdpa.h
> >   create mode 100644 lib/librte_vhost/vdpa.c
> >
> > diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> > index 5d6c6abae..37044ac03 100644
> > --- a/lib/librte_vhost/Makefile
> > +++ b/lib/librte_vhost/Makefile
> > @@ -22,9 +22,9 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -
> lrte_ethdev -lrte_net
> >
> >   # all source are stored in SRCS-y
> >   SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c
> \
> > -   vhost_user.c virtio_net.c
> > +   vhost_user.c virtio_net.c vdpa.c
> >
> >   # install includes
> > -SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
> > +SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
> rte_vdpa.h
> >
> >   include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_vhost/rte_vdpa.h b/lib/librte_vhost/rte_vdpa.h
> > new file mode 100644
> > index 0..a4bbbd93d
> > --- /dev/null
> > +++ b/lib/librte_vhost/rte_vdpa.h
> > @@ -0,0 +1,94 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2018 Intel Corporation
> > + */
> > +
> > +#ifndef _RTE_VDPA_H_
> > +#define _RTE_VDPA_H_
> > +
> > +/**
> > + * @file
> > + *
> > + * Device specific vhost lib
> > + */
> > +
> > +#include 
> > +#include "rte_vhost.h"
> > +
> > +#define MAX_VDPA_NAME_LEN 128
> > +
> > +enum vdpa_addr_type {
> > +   PCI_ADDR,
> > +   VDPA_ADDR_MAX
> > +};
> > +
> > +struct rte_vdpa_dev_addr {
> > +   enum vdpa_addr_type type;
> > +   union {
> > +   uint8_t __dummy[64];
> > +   struct rte_pci_addr pci_addr;
> > +   };
> > +};
> > +
> > +/* Get capabilities of this device */
> > +typedef int (*vdpa_dev_queue_num_get_t)(int did, uint32_t
> *queue_num);
> > +typedef int (*vdpa_dev_feature_get_t)(int did, uint64_t *features);
> > +
> > +/* Driver configure/close the d

Re: [dpdk-dev] [PATCH v4 3/5] vhost: add apis for datapath configuration

2018-04-01 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Saturday, March 31, 2018 3:04 PM
> To: Wang, Zhihong ; dev@dpdk.org
> Cc: Tan, Jianfeng ; Bie, Tiwei
> ; y...@fridaylinux.org; Liang, Cunming
> ; Wang, Xiao W ; Daly,
> Dan 
> Subject: Re: [PATCH v4 3/5] vhost: add apis for datapath configuration
> 
> 
> 
> On 03/10/2018 11:01 AM, Zhihong Wang wrote:
> > This patch adds APIs for datapath configuration.
> >
> > The did of the vhost-user socket can be set to identify the backend device,
> > in this case each vhost-user socket can have only 1 connection. The did is
> > set to -1 by default when the software datapath is used.
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> > Changes in v4:
> >
> >   1. Remove the "engine" concept in the lib.
> >
> >   lib/librte_vhost/rte_vhost.h   | 35
> +
> >   lib/librte_vhost/rte_vhost_version.map |  3 +++
> >   lib/librte_vhost/socket.c  | 36
> ++
> >   lib/librte_vhost/vhost.c   | 25 +++
> >   lib/librte_vhost/vhost.h   |  9 +
> >   5 files changed, 108 insertions(+)
> >
> > diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> > index 671ea5053..d50f4c67d 100644
> > --- a/lib/librte_vhost/rte_vhost.h
> > +++ b/lib/librte_vhost/rte_vhost.h
> > @@ -200,6 +200,30 @@ int rte_vhost_driver_register(const char *path,
> uint64_t flags);
> >   int rte_vhost_driver_unregister(const char *path);
> >
> >   /**
> > + * Set the device id, enforce single connection per socket
> > + *
> > + * @param path
> > + *  The vhost-user socket file path
> > + * @param did
> > + *  Device id
> > + * @return
> > + *  0 on success, -1 on failure
> > + */
> > +int __rte_experimental
> > +rte_vhost_driver_set_vdpa_did(const char *path, int did);
> 
> IIUC, we call this to attach a VDPA device to a Vhost-user port?
> 
> What about having this named explicitly? Something like:
> 
> rte_vhost_driver_attach_vdpa_did(const char *path, int did)
> rte_vhost_driver_detach_vdpa_did(const char *path)
> 
> The later would set to did -1

Great. This does make things clearer.
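
For instance, the renamed pair could be thin wrappers around the current call,
roughly like this (hypothetical, not part of the posted patch):

    int
    rte_vhost_driver_attach_vdpa_did(const char *path, int did)
    {
        return rte_vhost_driver_set_vdpa_did(path, did);
    }

    int
    rte_vhost_driver_detach_vdpa_did(const char *path)
    {
        /* Detaching resets the id to -1, i.e. back to the SW datapath. */
        return rte_vhost_driver_set_vdpa_did(path, -1);
    }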

> 
> This is not mandatory though
> 
> > +
> > +/**
> > + * Get the device id
> > + *
> > + * @param path
> > + *  The vhost-user socket file path
> > + * @return
> > + *  Device id, -1 on failure
> > + */
> > +int __rte_experimental
> > +rte_vhost_driver_get_vdpa_did(const char *path);
> > +
> > +/**
> >* Set the feature bits the vhost-user driver supports.
> >*
> >* @param path
> > @@ -464,6 +488,17 @@ int rte_vhost_vring_call(int vid, uint16_t vring_idx);
> >*/
> >   uint32_t rte_vhost_rx_queue_count(int vid, uint16_t qid);
> >
> > +/**
> > + * Get vdpa device id for vhost device.
> > + *
> > + * @param vid
> > + *  vhost device ID
> > + * @return
> > + *  device id
> > + */
> > +int __rte_experimental
> > +rte_vhost_get_vdpa_did(int vid);
> > +
> >   #ifdef __cplusplus
> >   }
> >   #endif
> > diff --git a/lib/librte_vhost/rte_vhost_version.map
> b/lib/librte_vhost/rte_vhost_version.map
> > index 7bcffb490..6e2d5364a 100644
> > --- a/lib/librte_vhost/rte_vhost_version.map
> > +++ b/lib/librte_vhost/rte_vhost_version.map
> > @@ -64,4 +64,7 @@ EXPERIMENTAL {
> > rte_vdpa_register_device;
> > rte_vdpa_unregister_device;
> > rte_vdpa_find_device_id;
> > +   rte_vhost_driver_set_vdpa_did;
> > +   rte_vhost_driver_get_vdpa_did;
> > +   rte_vhost_get_vdpa_did;
> >   } DPDK_18.02;
> > diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
> > index cfc31e179..3d58da94e 100644
> > --- a/lib/librte_vhost/socket.c
> > +++ b/lib/librte_vhost/socket.c
> > @@ -52,6 +52,13 @@ struct vhost_user_socket {
> > uint64_t supported_features;
> > uint64_t features;
> >
> > +   /*
> > +* Device id to identify a specific backend device.
> > +* It's set to -1 for the default software implementation.
> > +* If valid, one socket can have 1 connection only.
> > +*/
> > +   int did;
> I would rename it to something like vdpa_did or even better,
> vdpa_dev_id. At least prefix it with vdpa not to confuse the user.

Good suggestion.

> 
> > +
> > struct vhost_device_ops const *notify_ops;
>

Re: [dpdk-dev] [PATCH v4 2/5] vhost: support selective datapath

2018-04-01 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Saturday, March 31, 2018 3:38 PM
> To: Wang, Zhihong ; dev@dpdk.org
> Cc: Tan, Jianfeng ; Bie, Tiwei
> ; y...@fridaylinux.org; Liang, Cunming
> ; Wang, Xiao W ; Daly,
> Dan 
> Subject: Re: [PATCH v4 2/5] vhost: support selective datapath
> 
> 
> 
> On 03/10/2018 11:01 AM, Zhihong Wang wrote:
> > +   uint64_t *size);
> > +/* Device ops */
> > +struct rte_vdpa_dev_ops {
> > +   vdpa_dev_queue_num_get_t  queue_num_get;
> > +   vdpa_dev_feature_get_t    feature_get;
> > +   vdpa_dev_feature_get_t    protocol_feature_get;
> 
> I would prefer them to be named as in Vhost-user spec:
> 
> get_queue_num
> get_features
> get_protocol_features

Ok. Will change them.
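
So the capability getters would become something like the sketch below
(only the renamed members are shown; the remaining ops from this patch,
such as dev_conf, dev_close and the VFIO/notify helpers, stay as they are):

    struct rte_vdpa_dev_ops {
        vdpa_dev_queue_num_get_t  get_queue_num;
        vdpa_dev_feature_get_t    get_features;
        vdpa_dev_feature_get_t    get_protocol_features;
        /* ... remaining device ops unchanged ... */
    };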

> 
> Thanks,
> Maxime


Re: [dpdk-dev] [PATCH v4 4/5] vhost: adapt vhost lib for selective datapath

2018-04-02 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Saturday, March 31, 2018 3:36 PM
> To: Wang, Zhihong ; dev@dpdk.org
> Cc: Tan, Jianfeng ; Bie, Tiwei
> ; y...@fridaylinux.org; Liang, Cunming
> ; Wang, Xiao W ; Daly,
> Dan 
> Subject: Re: [PATCH v4 4/5] vhost: adapt vhost lib for selective datapath
> 
> 
> 
> On 03/10/2018 11:01 AM, Zhihong Wang wrote:
> > This patch adapts vhost lib for selective datapath by calling device ops
> > at the corresponding stage.
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> > Changes in v4:
> >
> >   1. Remove the "engine" concept in the lib.
> >
> > ---
> > Changes in v2:
> >
> >   1. Ensure negotiated capabilities are supported in vhost-user lib.
> >
> >   2. Configure the data path at the right time.
> >
> >   lib/librte_vhost/rte_vhost.h   | 27 ++
> >   lib/librte_vhost/rte_vhost_version.map |  2 +
> >   lib/librte_vhost/socket.c  | 94
> --
> >   lib/librte_vhost/vhost.c   |  3 ++
> >   lib/librte_vhost/vhost.h   |  2 +
> >   lib/librte_vhost/vhost_user.c  | 54 +--
> >   6 files changed, 172 insertions(+), 10 deletions(-)
> >
> > diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> > index d50f4c67d..3c3334d3e 100644
> > --- a/lib/librte_vhost/rte_vhost.h
> > +++ b/lib/librte_vhost/rte_vhost.h
> > @@ -279,6 +279,33 @@ int rte_vhost_driver_disable_features(const char
> *path, uint64_t features);
> >   int rte_vhost_driver_get_features(const char *path, uint64_t *features);
> >
> >   /**
> > + * Get the protocol feature bits before feature negotiation.
> > + *
> > + * @param path
> > + *  The vhost-user socket file path
> > + * @param protocol_features
> > + *  A pointer to store the queried protocol feature bits
> > + * @return
> > + *  0 on success, -1 on failure
> > + */
> > +int __rte_experimental
> > +rte_vhost_driver_get_protocol_features(const char *path,
> > +   uint64_t *protocol_features);
> > +
> > +/**
> > + * Get the queue number bits before feature negotiation.
> > + *
> > + * @param path
> > + *  The vhost-user socket file path
> > + * @param queue_num
> > + *  A pointer to store the queried queue number bits
> > + * @return
> > + *  0 on success, -1 on failure
> > + */
> > +int __rte_experimental
> > +rte_vhost_driver_get_queue_num(const char *path, uint32_t
> *queue_num);
> > +
> > +/**
> >* Get the feature bits after negotiation
> >*
> >* @param vid
> > diff --git a/lib/librte_vhost/rte_vhost_version.map
> b/lib/librte_vhost/rte_vhost_version.map
> > index 6e2d5364a..812ccd72b 100644
> > --- a/lib/librte_vhost/rte_vhost_version.map
> > +++ b/lib/librte_vhost/rte_vhost_version.map
> > @@ -67,4 +67,6 @@ EXPERIMENTAL {
> > rte_vhost_driver_set_vdpa_did;
> > rte_vhost_driver_get_vdpa_did;
> > rte_vhost_get_vdpa_did;
> > +   rte_vhost_driver_get_protocol_features;
> > +   rte_vhost_driver_get_queue_num;
> >   } DPDK_18.02;
> > diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
> > index 3d58da94e..ba7b422a0 100644
> > --- a/lib/librte_vhost/socket.c
> > +++ b/lib/librte_vhost/socket.c
> > @@ -216,6 +216,8 @@ vhost_user_add_connection(int fd, struct
> vhost_user_socket *vsocket)
> >
> > vhost_set_builtin_virtio_net(vid, vsocket->use_builtin_virtio_net);
> >
> > +   vhost_set_vdpa_did(vid, vsocket->did);
> > +
> > if (vsocket->dequeue_zero_copy)
> > vhost_enable_dequeue_zero_copy(vid);
> >
> > @@ -648,20 +650,102 @@ int
> >   rte_vhost_driver_get_features(const char *path, uint64_t *features)
> >   {
> > struct vhost_user_socket *vsocket;
> > +   uint64_t vdpa_features;
> > +   int did = -1;
> > +   int ret = 0;
> >
> > pthread_mutex_lock(&vhost_user.mutex);
> > vsocket = find_vhost_user_socket(path);
> > -   if (vsocket)
> > -   *features = vsocket->features;
> > +   if (vsocket) {
> > +   did = vsocket->did;
> > +   if (did < 0 || vdpa_devices[did]->ops->feature_get == NULL)
> > +   *features = vsocket->features;
> > +   else if (vdpa_devices[did]->ops->feature_get(did,
> > +   &vdpa_features) < 0) {
> >

Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms

2017-09-17 Thread Wang, Zhihong
> Hi Zhihong Wang
> 
> I test avx512 rte_memcpy found the performanc for ovs dpdk is lower than
> avx2 rte_memcpy.

Hi Haifeng,

AVX512 memcpy is marked as experimental and is disabled by default; its
benefit varies from case to case. So enable it only when your case
(SW + HW setup with the expected data pattern) has been verified.

BTW, it's not recommended to use micro-benchmarks like test_memcpy_perf
for memcpy performance reports, as they are unlikely to reflect the
performance of real-world applications. Please find more details at
https://software.intel.com/en-us/articles/performance-optimization-of-memcpy-in-dpdk


Thanks
Zhihong

> 
> The vm loop test for ovs dpdk results:
> avx512 is *15*Gbps
> perf data:
>   0.52 │  vmovdq (%r8,%r10,1),%zmm0
>  95.33 │  sub$0x40,%r9
>   0.45 │  add$0x40,%r8
>   0.60 │  vmovdq %zmm0,-0x40(%r8)
>   1.84 │  cmp$0x3f,%r9
>│↓ ja f20
>│  lea-0x40(%rsi),%r8
>   0.15 │  or $0xffc0,%rsi
>   0.21 │  and$0xffc0,%r8
>   0.00 │  lea0x40(%rsi,%r8,1),%rsi
>   0.00 │  vmovdq (%rcx,%rsi,1),%zmm0
>   0.22 │  vmovdq %zmm0,(%rdx,%rsi,1)
>   0.67 │↓ jmpq   c78
>│  mov-0x128(%rbp),%rdi
>│  rex.R
>│  .byte  0x89
>│  popfq
> 
> avx2 is *18.8*Gbps
> perf data:
>   0.96 │  add%r9,%r13
>  66.04 │  vmovdq (%rdx),%ymm0
>   1.20 │  sub$0x40,%rdi
>   1.53 │  add$0x40,%rdx
>  10.83 │  vmovdq %ymm0,-0x40(%rdx,%r15,1)
>   8.64 │  vmovdq -0x20(%rdx),%ymm0
>   7.58 │  vmovdq %ymm0,-0x40(%rdx,%r13,1)
> 
> 
> dpdk version: v17.05
> ovs version: 2.8.90
> qemu version: QEMU emulator version 2.9.94 (v2.10.0-rc4-dirty)
> 
> gcc version: gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
> kernal version: 3.10.0
> 
> 
> compile dpdk:
> CONFIG_RTE_ENABLE_AVX512=y
> export DPDK_DIR=$PWD
> export DPDK_TARGET=x86_64-native-linuxapp-gcc
> export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
> make install T=$DPDK_TARGET DESTDIR=install
> 
> compile ovs:
> sh boot.sh
> ./configure  CFLAGS="-g -O2" --with-dpdk=$DPDK_BUILD --prefix=/usr --
> localstatedir=/var --sysconfdir=/etc
> make -j
> make install
> 
> The test for dpdk test_memcpy_perf:
> avx2:
> ** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
> === == == ==
> ==
>Size Cache to cache   Cache to mem   Mem to cache Mem to mem
> (bytes)(ticks)(ticks)(ticks)(ticks)
> --- -- -- -- --
> == 32B aligned
> 
>  64   6 -   10  27 -   52  30 -   39  56 -   97
> 512  24 -   44 251 -  271 145 -  217 396 -  447
>1024  35 -   78 394 -  433 252 -  319 609 -  670
> --- -- -- -- --
> C64   3 -9  28 -   31  29 -   40  55 -   66
> C   512  25 -   55 253 -  268 139 -  268 397 -  410
> C  1024  32 -   83 394 -  416 250 -  396 612 -  687
> === Unaligned
> =
>  64   8 -9  85 -   71  45 -   45 125 -  121
> 512  33 -   49 282 -  305 153 -  252 420 -  478
>1024  42 -   83 409 -  491 259 -  389 640 -  748
> --- -- -- -- --
> C64   4 -9  42 -   46  39 -   46  76 -   90
> C   512  33 -   55 280 -  272 153 -  281 421 -  415
> C  1024  41 -   83 407 -  427 258 -  405 578 -  701
> === == == ==
> ==
> 
> avx512:
> ** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
> === == == ==
> ==
>Size Cache to cache   Cache to mem   Mem to cache Mem to mem
> (bytes)(ticks)(ticks)(ticks)(ticks)
> --- -- -- -- --
> == 64B aligned
> 
>  64   6 -9  18 -   33  24 -   38  40 -   65
> 512  18 -   44 178 -  262 138 -  218 309 -  429
>1024  27 -   79 338 -  430 250 -  322 560 -  674
> --- -- -- -- --
> C64   3 -9  18 -   20  23 -   41  39 -   50
> C   512  15 -   54 205 -  270 134 -  268 304 -  409
> C  1024  24 -   83 371 -  414 242 -  400 550 -  692
> === Unaligned
> =
>  64   8 -9  87 -   74  45 -   48 125 -  118
> 512  23 -   49 298 -  311 150 -  250 437 -  482
>1024  36 -   83 427 -  505 259 - 

Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy

2017-11-02 Thread Wang, Zhihong
> I don't know what is creating this drop exactly.
> When doing different tests on different environments, we do not see this
> drop.
> If nobody else can see such issue, I guess we can ignore it.

Hi Thomas, Xiaoyun,

With this patch (commit 84cc318424d49372dd2a5fbf3cf84426bf95acce) I see a
performance drop of more than 20% in the vhost loopback test with testpmd
macswap for 256-byte packets, which means it impacts actual vSwitching
performance.

I suggest we fix it or revert it for this release.

Thanks
Zhihong


Re: [dpdk-dev] [PATCH RFC 1/2] vhost: make capabilities configurable

2018-01-30 Thread Wang, Zhihong
Hi Maxime,

> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Tuesday, January 30, 2018 11:00 PM
> To: Wang, Zhihong ; dev@dpdk.org
> Cc: Tan, Jianfeng ; Bie, Tiwei
> ; y...@fridaylinux.org; Liang, Cunming
> ; Wang, Xiao W ; Daly,
> Dan 
> Subject: Re: [PATCH RFC 1/2] vhost: make capabilities configurable
> 
> Hi Zhihong
> 
> On 12/23/2017 04:36 AM, Zhihong Wang wrote:
> > This patch makes vhost device capabilities configurable to adopt various
> > engines. Such capabilities include supported features, protocol features,
> > queue number. APIs are introduced to let app configure these capabilities.
> 
> Why does the vDPA driver need to mask protocol features?

Different vhost devices may support different combinations of protocol
features, e.g. one may not support MQ or RARP, so the supported set should
be reported by the device itself.
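
As a rough illustration (using the device-op style this series moves toward;
the op name follows the later revisions, and VDPA_SUPPORTED_PROTOCOL_FEATURES
is a made-up driver macro), a device without multiqueue support could report:

    static int
    my_vdpa_get_protocol_features(int did, uint64_t *features)
    {
        RTE_SET_USED(did);
        *features = VDPA_SUPPORTED_PROTOCOL_FEATURES &
                ~(1ULL << VHOST_USER_PROTOCOL_F_MQ);
        return 0;
    }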

Thanks
-Zhihong

> 
> Maxime


Re: [dpdk-dev] [PATCH 1/7] vhost: make capabilities configurable

2018-02-07 Thread Wang, Zhihong
Hi Maxime,

> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Tuesday, February 6, 2018 6:19 PM
> To: Wang, Zhihong ; dev@dpdk.org
> Cc: Tan, Jianfeng ; Bie, Tiwei
> ; y...@fridaylinux.org; Liang, Cunming
> ; Wang, Xiao W ; Daly,
> Dan 
> Subject: Re: [PATCH 1/7] vhost: make capabilities configurable
> 
> Hi Zhihong,
> 
...
> > +int rte_vhost_driver_set_queue_num(const char *path, uint16_t
> queue_num)
> > +{
> > +   struct vhost_user_socket *vsocket;
> > +
> > +   pthread_mutex_lock(&vhost_user.mutex);
> > +   vsocket = find_vhost_user_socket(path);
> > +   if (vsocket)
> > +   vsocket->queue_num = queue_num;
> 
> Shouldn't be MIN(queue_num, VHOST_MAX_QUEUE_PAIRS) to be sure you
> can
> switch from HW offload to SW processing?

Yes, the check is necessary.
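
I.e. the setter above with the clamp added, roughly (sketch only):

    int rte_vhost_driver_set_queue_num(const char *path, uint16_t queue_num)
    {
        struct vhost_user_socket *vsocket;

        pthread_mutex_lock(&vhost_user.mutex);
        vsocket = find_vhost_user_socket(path);
        if (vsocket)
            vsocket->queue_num = RTE_MIN(queue_num,
                    VHOST_MAX_QUEUE_PAIRS);
        pthread_mutex_unlock(&vhost_user.mutex);

        return vsocket ? 0 : -1;
    }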

> 
> > +   pthread_mutex_unlock(&vhost_user.mutex);
> > +
> > +   return vsocket ? 0 : -1;
> > +}
...
> > -static void
> > +static int
> >   vhost_user_set_protocol_features(struct virtio_net *dev,
> >  uint64_t protocol_features)
> >   {
> > -   if (protocol_features & ~VHOST_USER_PROTOCOL_FEATURES)
> 
> I think the above check is still necessary, or it should be checked
> in rte_vhost_driver_set_protocol_features().

Thanks. Yes, I think any capabilities set by the app should be contained
within the capabilities supported by the vhost-user lib itself.

Will update rte_vhost_driver_set_*().
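
For example, the protocol-features setter could keep the check like this
(a sketch; it assumes a vsocket->protocol_features field as introduced in
this series):

    int
    rte_vhost_driver_set_protocol_features(const char *path,
            uint64_t protocol_features)
    {
        struct vhost_user_socket *vsocket;

        /* Reject bits the vhost-user lib itself cannot support. */
        if (protocol_features & ~VHOST_USER_PROTOCOL_FEATURES)
            return -1;

        pthread_mutex_lock(&vhost_user.mutex);
        vsocket = find_vhost_user_socket(path);
        if (vsocket)
            vsocket->protocol_features = protocol_features;
        pthread_mutex_unlock(&vhost_user.mutex);

        return vsocket ? 0 : -1;
    }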

-Zhihong

> 
> Indeed, the application shouldn't set a protocol feature bit that isn't
> supported by the libvhost-user library.
> 
> > -   return;
> > +   uint64_t vhost_protocol_features = 0;
> > +
> > +   rte_vhost_driver_get_protocol_features(dev->ifname,
> > +   &vhost_protocol_features);
> > +   if (protocol_features & ~vhost_protocol_features) {
> > +   RTE_LOG(ERR, VHOST_CONFIG,
> > +   "(%d) received invalid negotiated
> protocol_features.\n",
> > +   dev->vid);
> > +   return -1;
> > +   }
> >
> > dev->protocol_features = protocol_features;
> > +
> > +   return 0;
> >   }
> >
> >   static int
> > @@ -1391,7 +1416,8 @@ vhost_user_msg_handler(int vid, int fd)
> > break;
> >
> > case VHOST_USER_GET_PROTOCOL_FEATURES:
> > -   vhost_user_get_protocol_features(dev, &msg);
> > +   msg.payload.u64 = vhost_user_get_protocol_features(dev);
> > +   msg.size = sizeof(msg.payload.u64);
> > send_vhost_reply(fd, &msg);
> > break;
> > case VHOST_USER_SET_PROTOCOL_FEATURES:
> > @@ -1451,7 +1477,7 @@ vhost_user_msg_handler(int vid, int fd)
> > break;
> >
> > case VHOST_USER_GET_QUEUE_NUM:
> > -   msg.payload.u64 = VHOST_MAX_QUEUE_PAIRS;
> > +   msg.payload.u64 = vhost_user_get_queue_num(dev);
> > msg.size = sizeof(msg.payload.u64);
> > send_vhost_reply(fd, &msg);
> > break;
> >
> 
> Maxime


Re: [dpdk-dev] [PATCH v2 06/10] net/virtio: fix queue setup consistency

2018-02-08 Thread Wang, Zhihong
Hi Olivier,

Since the vector path can now be selected silently once its conditions are
met, this issue theoretically impacts the whole virtio PMD. If you plan to
fix it in the next release, do you want to add a temporary workaround to
disable vector path selection until then?

Thanks
-Zhihong

> -Original Message-
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Olivier Matz
> Sent: Thursday, February 8, 2018 6:01 AM
> To: Xu, Qian Q 
> Cc: Yao, Lei A ; dev@dpdk.org; y...@fridaylinux.org;
> maxime.coque...@redhat.com; Thomas Monjalon ;
> sta...@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 06/10] net/virtio: fix queue setup
> consistency
> 
> Hi,
> 
> It's in my short plans, but unfortunately some other high priority tasks
> were inserted before. Honnestly, I'm not sure I'll be able to make it
> for the release, but I'll do my best.
> 
> Olivier
> 
> 
> 
> On Wed, Feb 07, 2018 at 08:31:07AM +, Xu, Qian Q wrote:
> > Any update, Olivier?
> > We are near to release, and the bug-fix is important for the virtio vector
> path usage. Thanks.
> >
> > > -Original Message-
> > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Olivier Matz
> > > Sent: Thursday, February 1, 2018 4:28 PM
> > > To: Yao, Lei A 
> > > Cc: dev@dpdk.org; y...@fridaylinux.org; maxime.coque...@redhat.com;
> > > Thomas Monjalon ; sta...@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2 06/10] net/virtio: fix queue setup
> consistency
> > >
> > > Hi Lei,
> > >
> > > It's on my todo list, I'll check this as soon as possible.
> > >
> > > Olivier
> > >
> > >
> > > On Thu, Feb 01, 2018 at 03:14:15AM +, Yao, Lei A wrote:
> > > > Hi, Olivier
> > > >
> > > > This is Lei from DPDK validation team in Intel. During our DPDK
> > > > 18.02-rc1 test, I find the following patch will cause one serious issue
> with virtio
> > > vector path:
> > > > the traffic can't resume after stop/start the virtio device.
> > > >
> > > > The step like following:
> > > > 1. Launch vhost-user port using testpmd at Host 2. Launch VM with
> > > > virtio device, mergeable is off 3. Bind the virtio device to pmd
> > > > driver, launch testpmd, let the tx/rx use vector path
> > > > virtio_xmit_pkts_simple
> > > > virtio_recv_pkts_vec
> > > > 4. Send traffic to virtio device from vhost side, then stop the virtio
> > > > device 5. Start the virtio device again After step 5, the traffic
> > > > can't resume.
> > > >
> > > > Could you help check this and give a fix? This issue will impact the
> > > > virtio pmd user experience heavily. By the way, this patch is already
> > > > included into V17.11. Looks like we need give a patch to this LTS 
> > > > version.
> > > Thanks a lot!
> > > >
> > > > BRs
> > > > Lei
> > > > > -Original Message-
> > > > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Olivier Matz
> > > > > Sent: Thursday, September 7, 2017 8:14 PM
> > > > > To: dev@dpdk.org; y...@fridaylinux.org;
> maxime.coque...@redhat.com
> > > > > Cc: step...@networkplumber.org; sta...@dpdk.org
> > > > > Subject: [dpdk-dev] [PATCH v2 06/10] net/virtio: fix queue setup
> > > > > consistency
> > > > >
> > > > > In rx/tx queue setup functions, some code is executed only if
> > > > > use_simple_rxtx == 1. The value of this variable can change
> > > > > depending on the offload flags or sse support. If Rx queue setup is
> > > > > called before Tx queue setup, it can result in an invalid 
> > > > > configuration:
> > > > >
> > > > > - dev_configure is called: use_simple_rxtx is initialized to 0
> > > > > - rx queue setup is called: queues are initialized without simple path
> > > > >   support
> > > > > - tx queue setup is called: use_simple_rxtx switch to 1, and simple
> > > > >   Rx/Tx handlers are selected
> > > > >
> > > > > Fix this by postponing a part of Rx/Tx queue initialization in
> > > > > dev_start(), as it was the case in the initial implementation.
> > > > >
> > > > > Fixes: 48cec290a3d2 ("net/virtio: move queue configure code to
> > > > > proper
> > > > > place")
> > > > > Cc: sta...@dpdk.org
> > > > >
> > > > > Signed-off-by: Olivier Matz 
> > > > > ---
> > > > >  drivers/net/virtio/virtio_ethdev.c | 13 +
> > > > > drivers/net/virtio/virtio_ethdev.h |  6 ++
> > > > >  drivers/net/virtio/virtio_rxtx.c   | 40
> ++-
> > > > > ---
> > > > >  3 files changed, 51 insertions(+), 8 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/virtio/virtio_ethdev.c
> > > > > b/drivers/net/virtio/virtio_ethdev.c
> > > > > index 8eee3ff80..c7888f103 100644
> > > > > --- a/drivers/net/virtio/virtio_ethdev.c
> > > > > +++ b/drivers/net/virtio/virtio_ethdev.c
> > > > > @@ -1737,6 +1737,19 @@ virtio_dev_start(struct rte_eth_dev *dev)
> > > > >   struct virtnet_rx *rxvq;
> > > > >   struct virtnet_tx *txvq __rte_unused;
> > > > >   struct virtio_hw *hw = dev->data->dev_private;
> > > > > + int ret;
> > > > > +
> > > > > + /* Finish the initialization of

[dpdk-dev] [dpdk-dev, v3] Implement memcmp using Intel SIMD instrinsics.

2016-02-23 Thread Wang, Zhihong
> > It'd be great if you could format this patch into a patch set with several
> > little ones. :-)
> > Also, the kernel checkpatch is very helpful.
> > Good coding style and patch organization make it easy for in-depth reviews.
> > 
> Combination of scalar and vector (32/64/128) was done to get optimal 
> performance numbers. If there is enough interest in this I can work on it and 
> provide an updated patch set.

That'll be very helpful! Looking forward to your patch :)
BTW, have you tested real-world application performance with your patch?


[dpdk-dev] [dpdk-dev,v2] Clean up rte_memcpy.h file

2016-02-29 Thread Wang, Zhihong


> -Original Message-
> From: Ravi Kerur [mailto:rkerur at gmail.com]
> Sent: Saturday, February 27, 2016 10:06 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev,v2] Clean up rte_memcpy.h file
> 
> 
> 
> On Wed, Jan 27, 2016 at 8:18 PM, Zhihong Wang 
> wrote:
> > Remove unnecessary type casting in functions.
> >
> > Tested on Ubuntu (14.04 x86_64) with "make test".
> > "make test" results match the results with baseline.
> > "Memcpy perf" results match the results with baseline.
> >
> > Signed-off-by: Ravi Kerur 
> > Acked-by: Stephen Hemminger 
> >
> > ---
> > .../common/include/arch/x86/rte_memcpy.h   | 340 +++---
> ---
> >  1 file changed, 175 insertions(+), 165 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > index 6a57426..839d4ec 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> 
> [...]
> 
> >  /**
> > @@ -150,13 +150,16 @@ rte_mov64blocks(uint8_t *dst, const uint8_t *src,
> size_t n)
> >      __m256i ymm0, ymm1;
> >
> >      while (n >= 64) {
> > -             ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t
> *)src + 0 * 32));
> > +
> > +             ymm0 = _mm256_loadu_si256((const __m256i *)(src + 0 * 32));
> > +             ymm1 = _mm256_loadu_si256((const __m256i *)(src + 1 * 32));
> > +
> > +             _mm256_storeu_si256((__m256i *)(dst + 0 * 32), ymm0);
> > +             _mm256_storeu_si256((__m256i *)(dst + 1 * 32), ymm1);
> > +
> 
> Any particular reason to change the order of the statements here? :)
> Overall this patch looks good.
> 
> I checked the code changes, initial code had moving addresses (src and dst) 
> and
> decrement counter scattered between store and load instructions. I changed it 
> to
> loads, followed by stores and handle address/counters increment/decrement
> without changing functionality.
> 

It's definitely okay to do this. Actually, changing it or not won't affect
the final output at all, since gcc will optimize it while generating code.
It's C code we're writing, after all.

But personally I prefer to keep the original order as a reminder that what's
needed later should be computed as early as possible, and that different
kinds of instructions (hitting different CPU ports) should be mixed
together. :)

Could you please rebase this patch, since there have been some changes
already?

> >              n -= 64;
> > -             ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t
> *)src + 1 * 32));
> > -             src = (const uint8_t *)src + 64;
> > -             _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32),
> ymm0);
> > -             _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32),
> ymm1);
> > -             dst = (uint8_t *)dst + 64;
> > +             src = src + 64;
> > +             dst = dst + 64;
> >      }
> >  }
> >



[dpdk-dev] [PATCH v5 2/3] examples/l2fwd: Handle SIGINT and SIGTERM in l2fwd

2016-01-04 Thread Wang, Zhihong


> -Original Message-
> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Friday, January 1, 2016 1:02 AM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ; 
> Qiu,
> Michael 
> Subject: Re: [PATCH v5 2/3] examples/l2fwd: Handle SIGINT and SIGTERM in
> l2fwd
> 
> On Wed, 30 Dec 2015 16:59:50 -0500
> Zhihong Wang  wrote:
> 
> > +static void
> > +signal_handler(int signum)
> > +{
> > +   if (signum == SIGINT || signum == SIGTERM) {
> > +   printf("\n\nSignal %d received, preparing to exit...\n",
> > +   signum);
> > +   force_quit = true;
> 
> Actually, the if () is redundant since you only registered SIGINT, and SIGTERM
> those are the only signals you could possibly receive.

Yes, it's kind of an obsession I guess; I just want to make the code crystal
clear :)

> 
> Acked-by: Stephen Hemminger 


[dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms

2016-01-15 Thread Wang, Zhihong


> -Original Message-
> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Friday, January 15, 2016 12:49 AM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; Xie, Huawei
> 
> Subject: Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
> 
> On Thu, 14 Jan 2016 01:13:18 -0500
> Zhihong Wang  wrote:
> 
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
> >
> > In current DPDK, memcpy holds a large proportion of execution time in
> > libs like Vhost, especially for large packets, and this patch can bring
> > considerable benefits.
> >
> > The implementation is based on the current DPDK memcpy framework, some
> > background introduction can be found in these threads:
> > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> >
> > Code changes are:
> >
> >   1. Read CPUID to check if AVX512 is supported by CPU
> >
> >   2. Predefine AVX512 macro if AVX512 is enabled by compiler
> >
> >   3. Implement AVX512 memcpy and choose the right implementation based
> on
> >  predefined macros
> >
> >   4. Decide alignment unit for memcpy perf test based on predefined macros
> >
> > Zhihong Wang (4):
> >   lib/librte_eal: Identify AVX512 CPU flag
> >   mk: Predefine AVX512 macro for compiler
> >   lib/librte_eal: Optimize memcpy for AVX512 platforms
> >   app/test: Adjust alignment unit for memcpy perf test
> >
> >  app/test/test_memcpy_perf.c|   6 +
> >  .../common/include/arch/x86/rte_cpuflags.h |   2 +
> >  .../common/include/arch/x86/rte_memcpy.h   | 247
> -
> >  mk/rte.cpuflags.mk |   4 +
> >  4 files changed, 255 insertions(+), 4 deletions(-)
> >
> 
> This really looks like code that could benefit from Gcc
> function multiversioning. The current cpuflags model is useless/flawed
> in real product deployment


I've tried gcc function multiversioning with a simple add() function that
returns a + b, and a loop calling it millions of times. It turned out this
mechanism adds 17% extra execution time, which is a lot of extra overhead
overall.

To quote the gcc wiki: "GCC takes care of doing the dispatching to call
the right version at runtime". So it loses inlining and adds extra
dispatching overhead.

Also, this mechanism works only for C++, right?

I think using predefined macros at compile time is more efficient and
suits DPDK better.

Could you please give an example of when the current CPU flags model
stops working, so I can fix it?
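
For reference, the kind of micro-test described above looks roughly like the
sketch below. This is an illustration, not the exact code used; it uses the
target_clones form, which works in C with gcc 6+ (the original target-attribute
form of function multiversioning is C++ only):

    /* build: gcc -O2 fmv_test.c -o fmv_test */
    #include <stdio.h>

    __attribute__((target_clones("avx2", "default")))
    int add(int a, int b)
    {
        /* Dispatched through an ifunc resolver at runtime, so the call
         * cannot be inlined into the loop below.
         */
        return a + b;
    }

    int main(void)
    {
        long long sum = 0;
        int i;

        for (i = 0; i < 100000000; i++)
            sum += add(i, 1);
        printf("%lld\n", sum);
        return 0;
    }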



[dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms

2016-01-19 Thread Wang, Zhihong
> -Original Message-
> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Tuesday, January 19, 2016 4:06 AM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; Xie, Huawei
> 
> Subject: Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
> 
> On Sun, 17 Jan 2016 22:05:09 -0500
> Zhihong Wang  wrote:
> 
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
> >
> > In current DPDK, memcpy holds a large proportion of execution time in
> > libs like Vhost, especially for large packets, and this patch can bring
> > considerable benefits.
> >
> > The implementation is based on the current DPDK memcpy framework, some
> > background introduction can be found in these threads:
> > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> >
> > Code changes are:
> >
> >   1. Read CPUID to check if AVX512 is supported by CPU
> >
> >   2. Predefine AVX512 macro if AVX512 is enabled by compiler
> >
> >   3. Implement AVX512 memcpy and choose the right implementation based
> on
> >  predefined macros
> >
> >   4. Decide alignment unit for memcpy perf test based on predefined macros
> 
> Cool, I like it. How much impact does this have on VHOST?

The impact is significant, especially for enqueue (detailed numbers might not
be appropriate here due to policy :-), so I can only describe how I test it),
because vhost actually spends a lot of time doing memcpy. Simply measure the
1024B RX/TX time cost, compare it with 64B's, and you'll get a sense of it,
although not a precise one.

My test cases include NIC2VM2NIC and VM2VM scenarios, which are the main
use cases currently, and use both throughput and RX/TX cycles for evaluation.



[dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms

2016-01-28 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, January 27, 2016 11:24 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ravi Kerur 
> Subject: Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
> 
> 2016-01-17 22:05, Zhihong Wang:
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
> 
> On a related note, your expertise would be very valuable to review
> these patches please:
> (memcpy) http://dpdk.org/dev/patchwork/patch/4396/
> (memcmp) http://dpdk.org/dev/patchwork/patch/4788/

Will do, thanks.

> 
> Thanks


[dpdk-dev] [PATCH] lib/librte_eal: Fix compile issue with gcc 5.3.1

2016-01-28 Thread Wang, Zhihong
> Subject: [PATCH] lib/librte_eal: Fix compile issue with gcc 5.3.1
> 
> In fedora 22 with GCC version 5.3.1, when compile,
> will result an error:
> 
> include/rte_memcpy.h:309:7: error: "RTE_MACHINE_CPUFLAG_AVX2"
> is not defined [-Werror=undef]
> #elif RTE_MACHINE_CPUFLAG_AVX2
> 
> Fixes: 9484092baad3 ("eal/x86: optimize memcpy for AVX512 platforms")
> 
> Signed-off-by: Michael Qiu 
> ---
>  app/test/test_memcpy_perf.c | 2 +-
>  lib/librte_eal/common/include/arch/x86/rte_memcpy.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)


There's an issue in the original code.

#elif expects an expression that evaluates to true or false:
#elif <expression>

But what was intended here is to test whether the identifier has been defined:
#elif defined <identifier>
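
A minimal illustration, using the cpuflag macros from the related memcpy
series (the exact flag names are taken from that context):

    #ifdef RTE_MACHINE_CPUFLAG_AVX512F
    /* AVX512 implementation */
    #elif defined RTE_MACHINE_CPUFLAG_AVX2  /* OK: tests whether it is defined */
    /* AVX2 implementation */
    #else
    /* SSE implementation */
    #endif

    /* By contrast, "#elif RTE_MACHINE_CPUFLAG_AVX2" evaluates the macro as an
     * expression and triggers -Werror=undef when it is not defined at all.
     */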

Thanks for correcting this!

Acked-by: Wang, Zhihong 


[dpdk-dev] [PATCH] doc: virtio pmd versions

2016-06-15 Thread Wang, Zhihong


> -Original Message-
> From: Mcnamara, John
> Sent: Thursday, June 9, 2016 8:56 PM
> To: Richardson, Bruce ; Wang, Zhihong
> ; dev at dpdk.org
> Cc: Wang, Zhihong 
> Subject: RE: [dpdk-dev] [PATCH] doc: virtio pmd versions
> 
> > -Original Message-
> > From: Richardson, Bruce
> > Sent: Thursday, June 9, 2016 1:53 PM
> > To: Mcnamara, John ; Wang, Zhihong
> > ; dev at dpdk.org
> > Cc: Wang, Zhihong 
> > Subject: RE: [dpdk-dev] [PATCH] doc: virtio pmd versions
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Mcnamara, John
> > 
> > >
> > > > -Original Message-
> > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zhihong Wang
> > 
> > > > +
> > > > +Virtio PMD Versions
> > > > +---
> > > > +
> > > > +Virtio driver has 3 versions of rx functions and 2 versions of tx
> > > > functions.
> > >
> > > In some places RX/TX is used and in some rx/tx. I would suggest the
> > > uppercase versions throughout.
> > >
> >
> > In the commit logs, the only valid contractions allowed by the check-git-
> > log.sh script are Rx and Tx
> >
> > bad=$(echo "$headlines" | grep -E --color=always \
> > -e '\<(rx|tx|RX|TX)\>' \
> >  
> >
> > I would therefore suggest we follow the same rules for the docs for
> > consistency.
> 
> Hi,
> 
> I don't mind what it is once we have consistency, so Rx/Tx is fine. Zhihong,
> please note.

Thank you John and Bruce!
V2 has been sent, please take a look.

> 
> John
> 
> 



[dpdk-dev] [PATCH v2 5/5] testpmd: show topology at forwarding start

2016-06-15 Thread Wang, Zhihong


> -Original Message-
> From: De Lara Guarch, Pablo
> Sent: Tuesday, June 14, 2016 11:13 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: Ananyev, Konstantin ; Richardson, Bruce
> ; thomas.monjalon at 6wind.com
> Subject: RE: [PATCH v2 5/5] testpmd: show topology at forwarding start
> 
> 
> Hi Zhihong,
> 
> > -----Original Message-
> > From: Wang, Zhihong
> > Sent: Wednesday, June 01, 2016 4:28 AM
> > To: dev at dpdk.org
> > Cc: Ananyev, Konstantin; Richardson, Bruce; De Lara Guarch, Pablo;
> > thomas.monjalon at 6wind.com; Wang, Zhihong
> > Subject: [PATCH v2 5/5] testpmd: show topology at forwarding start
> >
> > This patch show topology at forwarding start.
> >
> > "show config fwd" also does this, but showing it directly can reduce the
> > possibility of misconfiguration.
> >
> >
> > Signed-off-by: Zhihong Wang 
> [...]
> 
> > diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> > index 9b1d99c..b946034 100644
> > --- a/app/test-pmd/testpmd.c
> > +++ b/app/test-pmd/testpmd.c
> > @@ -1009,7 +1009,7 @@ start_packet_forwarding(int with_tx_first)
> > if(!no_flush_rx)
> > flush_fwd_rx_queues();
> >
> > -   fwd_config_setup();
> > +   fwd_config_setup_display();
> 
> Bernard has made a patch that separates the display and setup of the
> configuration,
> (http://dpdk.org/dev/patchwork/patch/13650/)
> so fwd_config_display() does not call fwd_config_setup() anymore.
> 
> Could you modify this patch, so you call fwd_config_setup() and
> fwd_config_display()?

Thanks for the info! I've updated this patch with a v3.
Could you please help review?


> 
> Sorry for the confusion,
> Pablo
> 
> > rxtx_config_display();
> >
> > for (i = 0; i < cur_fwd_config.nb_fwd_ports; i++) {



[dpdk-dev] [PATCH v3 4/5] testpmd: handle all rxqs in rss setup

2016-06-28 Thread Wang, Zhihong
Thanks Nelio and Pablo!

> -Original Message-
> From: Nélio Laranjeiro [mailto:nelio.laranjeiro at 6wind.com]
> Sent: Tuesday, June 28, 2016 4:34 PM
> To: De Lara Guarch, Pablo 
> Cc: Wang, Zhihong ; dev at dpdk.org; Ananyev,
> Konstantin ; Richardson, Bruce
> ; thomas.monjalon at 6wind.com
> Subject: Re: [dpdk-dev] [PATCH v3 4/5] testpmd: handle all rxqs in rss setup
> 
> Hi Pablo,
> 
> On Mon, Jun 27, 2016 at 10:36:38PM +, De Lara Guarch, Pablo wrote:
> > Hi Nelio,
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Nélio Laranjeiro
> > > Sent: Monday, June 27, 2016 3:24 PM
> > > To: Wang, Zhihong
> > > Cc: dev at dpdk.org; Ananyev, Konstantin; Richardson, Bruce; De Lara 
> > > Guarch,
> > > Pablo; thomas.monjalon at 6wind.com
> > > Subject: Re: [dpdk-dev] [PATCH v3 4/5] testpmd: handle all rxqs in rss 
> > > setup
> > >
> > > On Tue, Jun 14, 2016 at 07:08:05PM -0400, Zhihong Wang wrote:
> > > > This patch removes constraints in rxq handling when multiqueue is 
> > > > enabled
> > > > to handle all the rxqs.
> > > >
> > > > Current testpmd forces a dedicated core for each rxq, some rxqs may be
> > > > ignored when core number is less than rxq number, and that causes
> > > confusion
> > > > and inconvenience.
> > > >
> > > > One example: One Red Hat engineer was doing multiqueue test, there're 2
> > > > ports in guest each with 4 queues, and testpmd was used as the 
> > > > forwarding
> > > > engine in guest, as usual he used 1 core for forwarding, as a results he
> > > > only saw traffic from port 0 queue 0 to port 1 queue 0, then a lot of
> > > > emails and quite some time are spent to root cause it, and of course 
> > > > it's
> > > > caused by this unreasonable testpmd behavior.
> > > >
> > > > Moreover, even if we understand this behavior, if we want to test the
> > > > above case, we still need 8 cores for a single guest to poll all the
> > > > rxqs, obviously this is too expensive.
> > > >
> > > > We met quite a lot cases like this, one recent example:
> > > > http://openvswitch.org/pipermail/dev/2016-June/072110.html
> > > >
> > > >
> > > > Signed-off-by: Zhihong Wang 
> > > > ---
> > > >  app/test-pmd/config.c | 8 +---
> > > >  1 file changed, 1 insertion(+), 7 deletions(-)
> > > >
> > > > diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
> > > > index ede7c78..4719a08 100644
> > > > --- a/app/test-pmd/config.c
> > > > +++ b/app/test-pmd/config.c
> > > > @@ -1199,19 +1199,13 @@ rss_fwd_config_setup(void)
> > > > cur_fwd_config.nb_fwd_ports = nb_fwd_ports;
> > > > cur_fwd_config.nb_fwd_streams =
> > > > (streamid_t) (nb_q * cur_fwd_config.nb_fwd_ports);
> > > > -   if (cur_fwd_config.nb_fwd_streams > 
> > > > cur_fwd_config.nb_fwd_lcores)
> > > > -   cur_fwd_config.nb_fwd_streams =
> > > > -   (streamid_t)cur_fwd_config.nb_fwd_lcores;
> > > > -   else
> > > > -   cur_fwd_config.nb_fwd_lcores =
> > > > -   (lcoreid_t)cur_fwd_config.nb_fwd_streams;
> > > >
> > > > /* reinitialize forwarding streams */
> > > > init_fwd_streams();
> > > >
> > > > setup_fwd_config_of_each_lcore(&cur_fwd_config);
> > > > rxp = 0; rxq = 0;
> > > > -   for (lc_id = 0; lc_id < cur_fwd_config.nb_fwd_lcores; lc_id++) {
> > > > +   for (lc_id = 0; lc_id < cur_fwd_config.nb_fwd_streams; lc_id++) 
> > > > {
> > > > struct fwd_stream *fs;
> > > >
> > > > fs = fwd_streams[lc_id];
> > > > --
> > > > 2.5.0
> > >
> > > Hi Zhihong,
> > >
> > > It seems this commits introduce a bug in pkt_burst_transmit(), this only
> > > occurs when the number of cores present in the coremask is greater than
> > > the number of queues i.e. coremask=0xffe --txq=4 --rxq=4.
> > >
> > >   Port 0 Link Up - speed 4 Mbps - full-duplex
> > >   Port 1 Link Up - speed 4 Mbps - full-duplex
> > >   Done
> > >   testpmd> start tx_first

[dpdk-dev] [PATCH] A fix to work around strict-aliasing rules breaking

2015-03-04 Thread Wang, Zhihong


> -Original Message-
> From: Richardson, Bruce
> Sent: Monday, March 02, 2015 6:32 PM
> To: Wang, Zhihong
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] A fix to work around strict-aliasing rules
> breaking
> 
> On Mon, Mar 02, 2015 at 05:03:50PM +0800, zhihong.wang at intel.com wrote:
> > Fixed strict-aliasing rules breaking errors for some GCC version.
> >
> 
> This looks messy. Also, I believe the definition of memcpy should include the
> "restrict" keyword to indicate that source and dest can't overlap. Might that
> help fix the issue?

It's actually caused by casting void * to multiple other pointer types.

> 
> /Bruce
> 
> > Signed-off-by: Zhihong Wang 
> > ---
> >  .../common/include/arch/x86/rte_memcpy.h   | 44 
> --
> >  1 file changed, 24 insertions(+), 20 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > index 69a5c6f..f412099 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > @@ -195,6 +195,8 @@ rte_mov256blocks(uint8_t *dst, const uint8_t *src,
> > size_t n)  static inline void *  rte_memcpy(void *dst, const void
> > *src, size_t n)  {
> > +   uintptr_t dstu = (uintptr_t)dst;
> > +   uintptr_t srcu = (uintptr_t)src;
> > void *ret = dst;
> > int dstofss;
> > int bits;
> > @@ -204,22 +206,22 @@ rte_memcpy(void *dst, const void *src, size_t n)
> >  */
> > if (n < 16) {
> > if (n & 0x01) {
> > -   *(uint8_t *)dst = *(const uint8_t *)src;
> > -   src = (const uint8_t *)src + 1;
> > -   dst = (uint8_t *)dst + 1;
> > +   *(uint8_t *)dstu = *(const uint8_t *)srcu;
> > +   srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> > +   dstu = (uintptr_t)((uint8_t *)dstu + 1);
> > }
> > if (n & 0x02) {
> > -   *(uint16_t *)dst = *(const uint16_t *)src;
> > -   src = (const uint16_t *)src + 1;
> > -   dst = (uint16_t *)dst + 1;
> > +   *(uint16_t *)dstu = *(const uint16_t *)srcu;
> > +   srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> > +   dstu = (uintptr_t)((uint16_t *)dstu + 1);
> > }
> > if (n & 0x04) {
> > -   *(uint32_t *)dst = *(const uint32_t *)src;
> > -   src = (const uint32_t *)src + 1;
> > -   dst = (uint32_t *)dst + 1;
> > +   *(uint32_t *)dstu = *(const uint32_t *)srcu;
> > +   srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> > +   dstu = (uintptr_t)((uint32_t *)dstu + 1);
> > }
> > if (n & 0x08) {
> > -   *(uint64_t *)dst = *(const uint64_t *)src;
> > +   *(uint64_t *)dstu = *(const uint64_t *)srcu;
> > }
> > return ret;
> > }
> > @@ -458,6 +460,8 @@ static inline void *  rte_memcpy(void *dst, const
> > void *src, size_t n)  {
> > __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7,
> xmm8;
> > +   uintptr_t dstu = (uintptr_t)dst;
> > +   uintptr_t srcu = (uintptr_t)src;
> > void *ret = dst;
> > int dstofss;
> > int srcofs;
> > @@ -467,22 +471,22 @@ rte_memcpy(void *dst, const void *src, size_t n)
> >  */
> > if (n < 16) {
> > if (n & 0x01) {
> > -   *(uint8_t *)dst = *(const uint8_t *)src;
> > -   src = (const uint8_t *)src + 1;
> > -   dst = (uint8_t *)dst + 1;
> > +   *(uint8_t *)dstu = *(const uint8_t *)srcu;
> > +   srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> > +   dstu = (uintptr_t)((uint8_t *)dstu + 1);
> > }
> > if (n & 0x02) {
> > -   *(uint16_t *)dst = *(const uint16_t *)src;
> > -   src = (const uint16_t *)src + 1;
> > -   dst = (uint16_t *)dst + 1;
> > +   *(uint16_t *)dstu = *(const uint16_t *)srcu;
> > +   srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> > +   dstu = (uintptr_t)((uint16_t *)dstu + 1);
> > }
> > if (n & 0x04) {
> > -   

[dpdk-dev] [PATCH] A fix to work around strict-aliasing rules breaking

2015-03-04 Thread Wang, Zhihong


> -Original Message-
> From: Wodkowski, PawelX
> Sent: Monday, March 02, 2015 8:32 PM
> To: Richardson, Bruce; Wang, Zhihong
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] A fix to work around strict-aliasing rules
> breaking
> 
> On 2015-03-02 11:32, Bruce Richardson wrote:
> > On Mon, Mar 02, 2015 at 05:03:50PM +0800, zhihong.wang at intel.com
> wrote:
> >> Fixed strict-aliasing rules breaking errors for some GCC version.
> >>
> >
> > This looks messy. Also, I believe the definition of memcpy should
> > include the "restrict" keyword to indicate that source and dest can't
> > overlap. Might that help fix the issue?
> >
> 
> Is this error related to overlapping, or to casting 'void *' to 'uintXX_t *',
> which makes the compiler report the aliasing rule breakage?
> 
> >
> >> Signed-off-by: Zhihong Wang 
> >> ---
> >>   .../common/include/arch/x86/rte_memcpy.h   | 44 --
> 
> >>   1 file changed, 24 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> >> b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> >> index 69a5c6f..f412099 100644
> >> --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> >> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> >> @@ -195,6 +195,8 @@ rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n)
> >>   static inline void *
> >>   rte_memcpy(void *dst, const void *src, size_t n)
> >>   {
> >> +  uintptr_t dstu = (uintptr_t)dst;
> >> +  uintptr_t srcu = (uintptr_t)src;
> 
> If so maybe using union here would be good solution or 'char *'.

Pawel,

Thanks for the suggestion! But I don't think a union can work around this:
I already tried it on CentOS release 6.5.
Anyway, this is only about keeping the compiler happy; the assembly code
generated is the same no matter which method is used.

Zhihong (John)

> 
> --
> Pawel
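
For context, a minimal sketch of the approach the patch takes (illustrative
only, not the patched rte_memcpy itself): keep the pointer advances in
uintptr_t so that the chained 'void *' casts no longer trip -Wstrict-aliasing
on older gcc.

    #include <stdint.h>

    /* Copy 3 bytes: a 16-bit store followed by an 8-bit store.
     * Advancing uintptr_t values instead of re-casting the void *
     * pointers avoids the strict-aliasing warning some gcc versions
     * emit for the chained pointer casts; the generated assembly is
     * the same either way. */
    static inline void
    copy3(void *dst, const void *src)
    {
        uintptr_t d = (uintptr_t)dst;
        uintptr_t s = (uintptr_t)src;

        *(uint16_t *)d = *(const uint16_t *)s;
        d += sizeof(uint16_t);
        s += sizeof(uint16_t);
        *(uint8_t *)d = *(const uint8_t *)s;
    }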


[dpdk-dev] [PATCH] librte_eal/common: Fix cast from pointer to integer of different size

2015-03-09 Thread Wang, Zhihong


> -Original Message-
> From: Qiu, Michael
> Sent: Friday, March 06, 2015 11:13 AM
> To: dev at dpdk.org
> Cc: Qiu, Michael; Wang, Zhihong
> Subject: [PATCH] librte_eal/common: Fix cast from pointer to integer of
> different size
> 
> ./i686-native-linuxapp-gcc/include/rte_memcpy.h:592:23: error:
> cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
> 
>   dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;
> 
> Type 'long long' is 64-bit on the i686 platform while 'void *'
> is 32-bit.
> 
> Signed-off-by: Michael Qiu 
> Signed-off-by: Zhihong Wang 
> ---
> v4 --> v3:
>   fix dstofss/bits to size_t in rte_memcpy()
> v3 --> v2:
> make dstofss and srcofs to be type size_t
> casting type use uintptr_t
> 
> v2 --> v1:
> Remove unnecessary casting (void *)
> 
>  lib/librte_eal/common/include/arch/x86/rte_memcpy.h | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> index 7b2d382..6ec4434 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> @@ -196,8 +196,8 @@ static inline void *
>  rte_memcpy(void *dst, const void *src, size_t n)
>  {
>   void *ret = dst;
> - int dstofss;
> - int bits;
> + size_t dstofss;
> + size_t bits;
> 
>   /**
>* Copy less than 16 bytes
> @@ -271,7 +271,7 @@ COPY_BLOCK_64_BACK31:
>   /**
>* Make store aligned when copy size exceeds 512 bytes
>*/
> - dstofss = 32 - (int)((long long)(void *)dst & 0x1F);
> + dstofss = 32 - ((uintptr_t)dst & 0x1F);
>   n -= dstofss;
>   rte_mov32((uint8_t *)dst, (const uint8_t *)src);
>   src = (const uint8_t *)src + dstofss;
> @@ -493,8 +493,8 @@ rte_memcpy(void *dst, const void *src, size_t n)
>  {
>   __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
>   void *ret = dst;
> - int dstofss;
> - int srcofs;
> + size_t dstofss;
> + size_t srcofs;
> 
>   /**
>* Copy less than 16 bytes
> @@ -589,12 +589,12 @@ COPY_BLOCK_64_BACK15:
>* unaligned copy functions require up to 15 bytes
>* backwards access.
>*/
> - dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;
> + dstofss = 16 - ((uintptr_t)dst & 0x0F) + 16;
>   n -= dstofss;
>   rte_mov32((uint8_t *)dst, (const uint8_t *)src);
>   src = (const uint8_t *)src + dstofss;
>   dst = (uint8_t *)dst + dstofss;
> - srcofs = (int)((long long)(const void *)src & 0x0F);
> + srcofs = ((uintptr_t)src & 0x0F);
> 
>   /**
>* For aligned copy
> --
> 1.9.3

Acked-by:  Wang, Zhihong 
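
As a side note, a minimal sketch of the portable offset computation used in
the patch above (the function name here is mine, for illustration only):

    #include <stdint.h>

    /* Offset applied before the bulk copy so that subsequent stores are
     * 32-byte aligned (32 when dst is already aligned), mirroring the
     * dstofss computation above. uintptr_t matches the pointer width on
     * both i686 and x86_64, unlike the old
     * (int)((long long)(void *)dst & 0x1F) cast. */
    static inline size_t
    align32_offset(const void *dst)
    {
        return 32 - ((uintptr_t)dst & 0x1F);
    }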


[dpdk-dev] rte_memcpy.h: additional cflags required with OVS

2015-03-11 Thread Wang, Zhihong

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Kavanagh, Mark B  
> Sent: Tuesday, March 10, 2015 6:04 PM
> To: Mcnamara, John; Qiu, Michael; dev at dpdk.org; Panu Matilainen
> Subject: Re: [dpdk-dev] rte_memcpy.h: additional cflags required with OVS
> 
> 
> 
> >-Original Message-
> >From: Mcnamara, John
> >Sent: Tuesday, March 10, 2015 8:27 AM
> >To: Qiu, Michael; Kavanagh, Mark B; dev at dpdk.org; Panu Matilainen
> >Subject: RE: [dpdk-dev] rte_memcpy.h: additional cflags required with
> >OVS
> >
> >> -Original Message-
> >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Qiu, Michael
> >> Sent: Tuesday, March 10, 2015 3:05 AM
> >> To: Kavanagh, Mark B; dev at dpdk.org
> >> Subject: Re: [dpdk-dev] rte_memcpy.h: additional cflags required with
> >> OVS
> >>
> >
> >> What's your gcc version? this should be an issue with old version
> >> gcc, and I'm working on this to solve this issue now.
> >
> >
> >Hi Michael,
> >
> >I see the issue with gcc 4.7.2 but not with 4.9.2.
> 
> I'm using gcc v4.8.3.
> 
> Just to clarify my initial post, there are two issues related to gcc intrinsic
> headers emmintrin.h, and tmmintrin.h:
>   - in former, a difference in parameter types for _mm_storeu_si128 is
> the issue. This is the primary issue observed.
>   - in tmmintrin.h, when __OPTIMIZE__ is not defined, function
> _mm_alignr_epi8 is also not defined, leading to an 'implicit definition of
> function' error.

Adding the "-mssse3" flag should solve the 'implicit definition of
function' error.
BTW, current DPDK should compile with gcc 4.7.2; has anything changed there
that makes this flag mandatory?

Zhihong (John)

> I've only noticed this intermittently (even though I compile OVS with
> -O2 CFLAGS)
> 
> >
> >John
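
For reference, a minimal reproducer for the tmmintrin.h issue described above
(file and function names are illustrative): _mm_alignr_epi8() is an SSSE3
intrinsic, so a file like this needs -mssse3 (or a wider -march setting) to
build, e.g. gcc -O2 -mssse3 -c alignr_check.c.

    #include <tmmintrin.h>

    /* Concatenate hi:lo (hi in the upper half), shift the 256-bit value
     * right by 4 bytes and return the low 128 bits. Without SSSE3
     * enabled, gcc reports _mm_alignr_epi8 as implicitly declared. */
    __m128i
    shift_concat(__m128i hi, __m128i lo)
    {
        return _mm_alignr_epi8(hi, lo, 4);
    }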


[dpdk-dev] [PATCH] common/rte_memcpy: Fix x86intrin.h missed

2015-03-13 Thread Wang, Zhihong


> -Original Message-
> From: Qiu, Michael
> Sent: Friday, March 13, 2015 3:03 PM
> To: dev at dpdk.org
> Cc: Wang, Zhihong; Qiu, Michael
> Subject: [PATCH] common/rte_memcpy: Fix x86intrin.h missed
> 
> rte_memcpy.h(46): catastrophic error: cannot open source file "x86intrin.h"
> 
> For icc and old gcc, this header is not included.
> 
> Signed-off-by: Michael Qiu 
> ---
>  lib/librte_eal/common/include/arch/x86/rte_memcpy.h | 20
> 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> index ac72069..bd10d36 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> @@ -43,7 +43,27 @@
>  #include 
>  #include 
>  #include 
> +#if (defined(__ICC) || (__GNUC__ == 4 &&  __GNUC_MINOR__ < 4))
> +
> +#ifdef __SSE__
> +#include <xmmintrin.h>
> +#endif
> +
> +#ifdef __SSE2__
> +#include <emmintrin.h>
> +#endif
> +
> +#if defined(__SSE4_2__) || defined(__SSE4_1__)
> +#include <smmintrin.h>
> +#endif
> +
> +#if defined(__AVX__)
> +#include <immintrin.h>
> +#endif
> +
> +#else
>  #include <x86intrin.h>
> +#endif
> 
>  #ifdef __cplusplus
>  extern "C" {
> --
> 1.9.3

Acked-by:  Wang, Zhihong 



[dpdk-dev] [PATCH RFC] Memcpy optimization

2014-11-14 Thread Wang, Zhihong
   * Use intrinsics instead of assembly code
* Remove slow glibc call for constant copies

Current memcpy performance test is in "test_memcpy_perf.c", which will also be 
updated with unaligned test cases.

4. Glibc memcpy analysis

Glibc 2.16 (Fedora 20) and 2.20 (Currently the latest, released on Sep 07, 
2014) are analyzed.

Glibc 2.16 issues:
* No support for 256-bit load/store
* Significant slowdown for unaligned constant cases due to split loads and 4k 
aliasing

Glibc 2.20 issue:
* Removed load address alignment, which can lead to significant slowdown for 
unaligned cases in former architectures like Sandy Bridge

Also, calls to glibc can't be optimized by gcc at compile time.
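
To illustrate the intrinsics-based approach with 256-bit load/store (which
glibc 2.16 lacks), a 64-byte copy block might look like the following. This is
a sketch, not the actual rte_memcpy code, and it needs AVX enabled (e.g. -mavx).

    #include <stdint.h>
    #include <immintrin.h>

    /* Copy one 64-byte block with two 256-bit unaligned loads/stores. */
    static inline void
    copy64(uint8_t *dst, const uint8_t *src)
    {
        __m256i a = _mm256_loadu_si256((const __m256i *)(const void *)src);
        __m256i b = _mm256_loadu_si256((const __m256i *)(const void *)(src + 32));

        _mm256_storeu_si256((__m256i *)(void *)dst, a);
        _mm256_storeu_si256((__m256i *)(void *)(dst + 32), b);
    }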

Acknowledgements

Valuable suggestions from: Liang Cunming, Zhu Heqing, Bruce Richardson, and 
Chen Wenjun.

Author's Address

Wang Zhihong (John)
Email: zhihong.wang at intel.com



[dpdk-dev] [PATCH v3 5/7] virtio: virtio vec rx

2015-10-22 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Huawei Xie
> Sent: Tuesday, October 20, 2015 11:30 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH v3 5/7] virtio: virtio vec rx
> 
> With fixed avail ring, we don't need to get desc idx from avail ring.
> virtio driver only has to deal with desc ring.
> This patch uses vector instruction to accelerate processing desc ring.
> 
> Signed-off-by: Huawei Xie 
> ---
>  drivers/net/virtio/virtio_ethdev.h  |   2 +
>  drivers/net/virtio/virtio_rxtx.c|   3 +
>  drivers/net/virtio/virtio_rxtx.h|   2 +
>  drivers/net/virtio/virtio_rxtx_simple.c | 224
> 
>  drivers/net/virtio/virtqueue.h  |   1 +
>  5 files changed, 232 insertions(+)
> 
> diff --git a/drivers/net/virtio/virtio_ethdev.h 
> b/drivers/net/virtio/virtio_ethdev.h
> index 9026d42..d7797ab 100644
> --- a/drivers/net/virtio/virtio_ethdev.h
> +++ b/drivers/net/virtio/virtio_ethdev.h
> @@ -108,6 +108,8 @@ uint16_t virtio_recv_mergeable_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>  uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>   uint16_t nb_pkts);
> 
> +uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> + uint16_t nb_pkts);
> 
>  /*
>   * The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
> diff --git a/drivers/net/virtio/virtio_rxtx.c 
> b/drivers/net/virtio/virtio_rxtx.c
> index 5162ce6..947fc46 100644
> --- a/drivers/net/virtio/virtio_rxtx.c
> +++ b/drivers/net/virtio/virtio_rxtx.c
> @@ -432,6 +432,9 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
>   vq->mpool = mp;
> 
>   dev->data->rx_queues[queue_idx] = vq;
> +
> + virtio_rxq_vec_setup(vq);
> +
>   return 0;
>  }
> 
> diff --git a/drivers/net/virtio/virtio_rxtx.h 
> b/drivers/net/virtio/virtio_rxtx.h
> index 7d2d8fe..831e492 100644
> --- a/drivers/net/virtio/virtio_rxtx.h
> +++ b/drivers/net/virtio/virtio_rxtx.h
> @@ -33,5 +33,7 @@
> 
>  #define RTE_PMD_VIRTIO_RX_MAX_BURST 64
> 
> +int virtio_rxq_vec_setup(struct virtqueue *rxq);
> +
>  int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
>   struct rte_mbuf *m);
> diff --git a/drivers/net/virtio/virtio_rxtx_simple.c
> b/drivers/net/virtio/virtio_rxtx_simple.c
> index cac5b9f..ef17562 100644
> --- a/drivers/net/virtio/virtio_rxtx_simple.c
> +++ b/drivers/net/virtio/virtio_rxtx_simple.c
> @@ -58,6 +58,10 @@
>  #include "virtqueue.h"
>  #include "virtio_rxtx.h"
> 
> +#define RTE_VIRTIO_VPMD_RX_BURST 32
> +#define RTE_VIRTIO_DESC_PER_LOOP 8
> +#define RTE_VIRTIO_VPMD_RX_REARM_THRESH RTE_VIRTIO_VPMD_RX_BURST
> +
>  int __attribute__((cold))
>  virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
>   struct rte_mbuf *cookie)
> @@ -82,3 +86,223 @@ virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
> 
>   return 0;
>  }
> +
> +static inline void
> +virtio_rxq_rearm_vec(struct virtqueue *rxvq)
> +{
> + int i;
> + uint16_t desc_idx;
> + struct rte_mbuf **sw_ring;
> + struct vring_desc *start_dp;
> + int ret;
> +
> + desc_idx = rxvq->vq_avail_idx & (rxvq->vq_nentries - 1);
> + sw_ring = &rxvq->sw_ring[desc_idx];
> + start_dp = &rxvq->vq_ring.desc[desc_idx];
> +
> + ret = rte_mempool_get_bulk(rxvq->mpool, (void **)sw_ring,
> + RTE_VIRTIO_VPMD_RX_REARM_THRESH);
> + if (unlikely(ret)) {
> + rte_eth_devices[rxvq->port_id].data->rx_mbuf_alloc_failed +=
> + RTE_VIRTIO_VPMD_RX_REARM_THRESH;
> + return;
> + }
> +
> + for (i = 0; i < RTE_VIRTIO_VPMD_RX_REARM_THRESH; i++) {
> + uintptr_t p;
> +
> + p = (uintptr_t)&sw_ring[i]->rearm_data;
> + *(uint64_t *)p = rxvq->mbuf_initializer;
> +
> + start_dp[i].addr =
> + (uint64_t)((uintptr_t)sw_ring[i]->buf_physaddr +
> + RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
> + start_dp[i].len = sw_ring[i]->buf_len -
> + RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
> + }
> +
> + rxvq->vq_avail_idx += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
> + rxvq->vq_free_cnt -= RTE_VIRTIO_VPMD_RX_REARM_THRESH;
> + vq_update_avail_idx(rxvq);
> +}
> +
> +/* virtio vPMD receive routine, only accept(nb_pkts >= RTE_VIRTIO_DESC_PER_LOOP)
> + *
> + * This routine is for non-mergable RX, one desc for each guest buffer.
> + * This routine is based on the RX ring layout optimization. Each entry in 
> the
> + * avail ring points to the desc with the same index in the desc ring and 
> this
> + * will never be changed in the driver.
> + *
> + * - nb_pkts < RTE_VIRTIO_DESC_PER_LOOP, just return no packet
> + */
> +uint16_t
> +virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> + uint16_t nb_pkts)
> +{
> + struct virtqueue *rxvq = rx_queue;
> + uint16_t nb_used;
> + uint16_t desc_idx

[dpdk-dev] [PATCH v5 5/7] virtio: virtio vec rx

2015-10-26 Thread Wang, Zhihong
> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Huawei Xie
> Sent: Sunday, October 25, 2015 11:35 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH v5 5/7] virtio: virtio vec rx
> 
> With fixed avail ring, we don't need to get desc idx from avail ring.
> virtio driver only has to deal with desc ring.
> This patch uses vector instruction to accelerate processing desc ring.
> 
> Signed-off-by: Huawei Xie 

Acked-by: Wang, Zhihong 



[dpdk-dev] [PATCH v3] doc: virtio PMD Rx/Tx callbacks

2016-07-04 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Monday, July 4, 2016 10:57 AM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Richardson, Bruce ; 
> Mcnamara,
> John 
> Subject: Re: [PATCH v3] doc: virtio PMD Rx/Tx callbacks
> 
> On Thu, Jun 30, 2016 at 11:28:01PM -0400, Zhihong Wang wrote:
> > This patch explains current virtio PMD Rx/Tx callbacks, to help understand
> > what's the difference, and how to enable the right ones.
> >
> > Signed-off-by: Zhihong Wang 
> > Acked-by: John McNamara 
> 
> Applied to dpdk-next-virtio.
> 
> > --
> > Changes in v3:
> 
> However, this is still wrong. It must start with 3 dashes, ---.
> 
> $ man git am
> 
>  The patch is expected to be inline, directly following the
>  message. Any line that is of the form:
> 
>  *   three-dashes and end-of-line, or
> 
>  *   a line that begins with "diff -", or
> 
>  *   a line that begins with "Index: "
> 
>  is taken as the beginning of a patch, and the commit log message
>  is terminated before the first occurrence of such a line.
> 
> 
> This is just a note; I have fixed it while apply.

Noted. Thanks!

> 
>   --yliu
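
For reference, the layout git am expects: the changelog goes after the
three-dash separator so it is stripped from the commit message. A sketch,
with the commit body and diff abbreviated:

    doc: virtio PMD Rx/Tx callbacks

    <commit message body>

    Signed-off-by: ...
    ---
    Changes in v3: ...
    Changes in v2: ...

    diff --git a/... b/...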


[dpdk-dev] [PATCH 4/6] testpmd: handle all rxqs in rss setup

2016-06-03 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> Sent: Thursday, May 26, 2016 10:55 AM
> To: Thomas Monjalon 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; De Lara Guarch, Pablo
> 
> Subject: Re: [dpdk-dev] [PATCH 4/6] testpmd: handle all rxqs in rss setup
> 
> 
> 
> > -Original Message-
> > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > Sent: Wednesday, May 25, 2016 5:42 PM
> > To: Wang, Zhihong 
> > Cc: dev at dpdk.org; Ananyev, Konstantin ;
> > Richardson, Bruce ; De Lara Guarch, Pablo
> > 
> > Subject: Re: [PATCH 4/6] testpmd: handle all rxqs in rss setup
> >
> > 2016-05-05 18:46, Zhihong Wang:
> > > This patch removes constraints in rxq handling when multiqueue is enabled
> > > to handle all the rxqs.
> > >
> > > Current testpmd forces a dedicated core for each rxq, some rxqs may be
> > > ignored when core number is less than rxq number, and that causes
> confusion
> > > and inconvenience.
> >
> > I have the feeling that "constraints", "confusion" and "inconvenience"
> > should be more explained.
> > Please give some examples with not enough and too many cores. Thanks
> 
> Sure, will add detailed description in v2  ;)

V2 has been sent.
We keep seeing people asking for help because of this "confusion";
one recent example:
http://openvswitch.org/pipermail/dev/2016-June/072110.html




[dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions

2016-06-07 Thread Wang, Zhihong


> -Original Message-
> From: Ravi Kerur [mailto:rkerur at gmail.com]
> Sent: Tuesday, June 7, 2016 2:32 AM
> To: Wang, Zhihong ; Thomas Monjalon
> 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
> 
> Zhilong, Thomas,
> 
> If there is enough interest within DPDK community I can work on adding support
> for 'unaligned access' and 'test cases' for it. Please let me know either way.
> 


Hi Ravi,

This rte_memcmp is proved with better performance than glibc's in aligned
cases, I think it has good value to DPDK lib.

Though we don't have memcmp in critical pmd data path, it offers a better
choice for applications who do.


Thanks
Zhihong


> Thanks,
> Ravi
> 
> 
> On Thu, May 26, 2016 at 2:05 AM, Wang, Zhihong 
> wrote:
> 
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ravi Kerur
> > Sent: Tuesday, March 8, 2016 7:01 AM
> > To: dev at dpdk.org
> > Subject: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
> >
> > v1:
> >         This patch adds test cases for rte_memcmp functions.
> >         New rte_memcmp functions can be tested via 'make test'
> >         and 'testpmd' utility.
> >
> >         Compiled and tested on Ubuntu 14.04 (non-NUMA) and
> >         15.10 (NUMA) systems.
> [...]
> 
> > +/***
> > + * Memcmp function performance test configuration section. Each
> > + * performance test will be performed MEMCMP_ITERATIONS times.
> > + *
> > + * The five arrays below control what tests are performed. Every
> > + * combination from the array entries is tested.
> > + */
> > +#define MEMCMP_ITERATIONS (500 * 500 * 500)
> 
> 
Maybe fewer iterations would make the test faster without compromising precision?
> 
> 
> > +
> > +static size_t memcmp_sizes[] = {
> > +	2, 5, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
> > +	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384,
> > +	385, 447, 448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024,
> > +	1025, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608,
> > +	5632, 6144, 6656, 7168, 7680, 8192, 16834
> > +};
> > +
> [...]
> > +/*
> > + * Do all performance tests.
> > + */
> > +static int
> > +test_memcmp_perf(void)
> > +{
> > +	if (run_all_memcmp_eq_perf_tests() != 0)
> > +		return -1;
> > +
> > +	if (run_all_memcmp_gt_perf_tests() != 0)
> > +		return -1;
> > +
> > +	if (run_all_memcmp_lt_perf_tests() != 0)
> > +		return -1;
> > +
> 
> 
> Perhaps unaligned test cases are needed here.
> What do you think?
> 
> 
> > +
> > +	return 0;
> > +}
> > +
> > +static struct test_command memcmp_perf_cmd = {
> > +	.command = "memcmp_perf_autotest",
> > +	.callback = test_memcmp_perf,
> > +};
> > +REGISTER_TEST_COMMAND(memcmp_perf_cmd);
> > --
> > 1.9.1
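
Regarding the unaligned cases suggested above, a minimal sketch of what such a
perf case could exercise (names are illustrative; plain memcmp() stands in
here, since the rte_memcmp() prototype comes from patch 1/2 of the series):

    #include <stdint.h>
    #include <string.h>

    /* Compare two buffers at deliberately misaligned addresses so that
     * split loads are exercised; offsetting by 1 byte breaks any 16- or
     * 32-byte alignment of the underlying allocations. */
    static int
    cmp_unaligned(const uint8_t *a, const uint8_t *b, size_t len)
    {
        return memcmp(a + 1, b + 1, len - 1);
    }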


