[dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2015-12-22 Thread Xie, Huawei
On 12/22/2015 5:32 AM, Thomas Monjalon wrote:
> 2015-12-21 17:20, Wiles, Keith:
>> On 12/21/15, 9:21 AM, "Xie, Huawei"  wrote:
>>> On 12/19/2015 3:27 AM, Wiles, Keith wrote:
 On 12/18/15, 11:32 AM, "dev on behalf of Stephen Hemminger" >>> at dpdk.org on behalf of stephen at networkplumber.org> wrote:
> On Fri, 18 Dec 2015 10:44:02 +
> "Ananyev, Konstantin"  wrote:
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen Hemminger
>>> On Mon, 14 Dec 2015 09:14:41 +0800
>>> Huawei Xie  wrote:
 +  switch (count % 4) {
 +  while (idx != count) {
 +  case 0:
 +  
 RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
 +  rte_mbuf_refcnt_set(mbufs[idx], 1);
 +  rte_pktmbuf_reset(mbufs[idx]);
 +  idx++;
 +  case 3:
 +  
 RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
 +  rte_mbuf_refcnt_set(mbufs[idx], 1);
 +  rte_pktmbuf_reset(mbufs[idx]);
 +  idx++;
 +  case 2:
 +  
 RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
 +  rte_mbuf_refcnt_set(mbufs[idx], 1);
 +  rte_pktmbuf_reset(mbufs[idx]);
 +  idx++;
 +  case 1:
 +  
 RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
 +  rte_mbuf_refcnt_set(mbufs[idx], 1);
 +  rte_pktmbuf_reset(mbufs[idx]);
 +  idx++;
 +  }
 +  }
 +  return 0;
 +}
>>> This is weird. Why not just use Duff's device in a more normal manner.
>> But it is a sort of Duff's method.
>> Not sure what looks weird to you here?
>> while () {} instead of do {} while();?
>> Konstantin
>>
>>
>>
> It is unusual to have cases not associated with block of the switch.
> Unusual to me means, "not used commonly in most code".
>
> Since you are jumping into the loop, might make more sense as a do { } 
> while()
 I find this a very odd coding practice and I would suggest we not do this, 
 unless it gives us some great performance gain.

 Keith
>>> The loop unwinding could give performance gain. The only problem is the
>>> switch/loop combination makes people feel weird at the first glance but
>>> soon they will grasp this style. Since this is inherited from old famous
>>> duff's device, i prefer to keep this style which saves lines of code.
>> Please add a comment to the code to reflex where this style came from and 
>> why you are using it, would be very handy here.
> +1
> At least the words "loop" and "unwinding" may be helpful to some readers.
OK. Will add more context. Probably the wiki page for duff's device
should be updated on how to handle the case count is zero, using while()
or add one line to check.

> Thanks
>



[dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request

2015-12-22 Thread Yuanhan Liu
On Mon, Dec 21, 2015 at 03:32:53PM +, Xie, Huawei wrote:
> > +
> > +   /*
> > +* mmap from 0 to workaround a hugepage mmap bug: mmap will be
> > +* failed when offset is not page size aligned.
> > +*/
> s /will be failed/will fail/
> mmap will fail when offset is not zero.
> Also we only know this workaround is for hugetlbfs. Not sure of other
> tmpfs, so mention hugetlbfs here.

I have already mentioned "to workaround a __hugepage__ mmap bug"; it's
not enough?

> > +   addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> > +   if (addr == MAP_FAILED) {
> > +   RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> > +   return -1;
> > +   }
> > +
> > +   /* TODO: unmap on stop */
> > +   dev->log_base = (uint64_t)(uintptr_t)addr + off;
> (uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?

No, addr is of (void *) type, we should cast it to uint64_t type first,
before adding it with "off".

--yliu

> > +   dev->log_size = size;
> > +
> > +   return 0;
> > +}
> > diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h 
> > b/lib/librte_vhost/vhost_user/virtio-net-user.h
> > index b82108d..013cf38 100644
> > --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> > +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> > @@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct 
> > VhostUserMsg *);
> >  
> >  void user_set_protocol_features(struct vhost_device_ctx ctx,
> > uint64_t protocol_features);
> > +int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
> >  
> >  int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state 
> > *);
> >  
> 


[dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write

2015-12-22 Thread Yuanhan Liu
On Mon, Dec 21, 2015 at 03:06:43PM +, Xie, Huawei wrote:
> On 12/17/2015 11:11 AM, Yuanhan Liu wrote:
> > Introduce vhost_log_write() helper function to log the dirty pages we
> > touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
> > log is presented by 1 bit.
> >
> > Therefore, vhost_log_write() simply finds the right bit for related
> > page we are gonna change, and set it to 1. dev->log_base denotes the
> > start of the dirty page bitmap.
> >
> > Signed-off-by: Yuanhan Liu 
> > Signed-off-by: Victor Kaplansky  > ---
> >  lib/librte_vhost/rte_virtio_net.h | 29 +
> >  1 file changed, 29 insertions(+)
> >
> > diff --git a/lib/librte_vhost/rte_virtio_net.h 
> > b/lib/librte_vhost/rte_virtio_net.h
> > index 8acee02..5726683 100644
> > --- a/lib/librte_vhost/rte_virtio_net.h
> > +++ b/lib/librte_vhost/rte_virtio_net.h
> > @@ -40,6 +40,7 @@
> >   */
> >  
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -59,6 +60,8 @@ struct rte_mbuf;
> >  /* Backend value set by guest. */
> >  #define VIRTIO_DEV_STOPPED -1
> >  
> > +#define VHOST_LOG_PAGE 4096
> > +
> >  
> >  /* Enum for virtqueue management. */
> >  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> > @@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
> > return vhost_va;
> >  }
> >  
> > +static inline void __attribute__((always_inline))
> > +vhost_log_page(uint8_t *log_base, uint64_t page)
> > +{
> > +   log_base[page / 8] |= 1 << (page % 8);
> > +}
> > +
> Those logging functions are not supposed to be API. Could we move them
> into an internal header file?

Agreed. I should have put them into vhost_rxtx.c

> > +static inline void __attribute__((always_inline))
> > +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> > +{
> > +   uint64_t page;
> > +
> Before we log, we need memory barrier to make sure updates are in place.
> > +   if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > +  !dev->log_base || !len))
> > +   return;

Put a memory barrier inside set_features()?

I see no var dependence here, why putting a barrier then? We are
accessing and modifying same var, doesn't the cache MESI protocol
will get rid of your concerns?

> > +
> > +   if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> > +   return;
> > +
> > +   page = addr / VHOST_LOG_PAGE;
> > +   while (page * VHOST_LOG_PAGE < addr + len) {
> Let us have a page_end var to make the code simpler?

Could do that.


> > +   vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
> > +   page += VHOST_LOG_PAGE;
> page += 1?

Oops, right.

--yliu

> > +   }
> > +}
> > +
> > +
> >  /**
> >   *  Disable features in feature_mask. Returns 0 on success.
> >   */
> 


[dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request

2015-12-22 Thread Xie, Huawei


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Tuesday, December 22, 2015 10:26 AM
> To: Xie, Huawei
> Cc: dev at dpdk.org; Michael S. Tsirkin; Victor Kaplansky; Iremonger,
> Bernard; Pavel Fedin; Peter Xu
> Subject: Re: [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE
> request
> 
> On Mon, Dec 21, 2015 at 03:32:53PM +, Xie, Huawei wrote:
> > > +
> > > + /*
> > > +  * mmap from 0 to workaround a hugepage mmap bug: mmap will be
> > > +  * failed when offset is not page size aligned.
> > > +  */
> > s /will be failed/will fail/
> > mmap will fail when offset is not zero.
I mistake for 4KB page size. Please check if huge page size align is enough.
> > Also we only know this workaround is for hugetlbfs. Not sure of
> other
> > tmpfs, so mention hugetlbfs here.
> 
> I have already mentioned "to workaround a __hugepage__ mmap bug"; it's
> not enough?
Yes.
> 
> > > + addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
> 0);
> > > + if (addr == MAP_FAILED) {
> > > + RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> > > + return -1;
> > > + }
> > > +
> > > + /* TODO: unmap on stop */
> > > + dev->log_base = (uint64_t)(uintptr_t)addr + off;
> > (uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?
> 
> No, addr is of (void *) type, we should cast it to uint64_t type first,
> before adding it with "off".
> 
>   --yliu
RTE_PTR_ADD is the DPDK interface for pointer arithmetic operation.
> 
> > > + dev->log_size = size;
> > > +
> > > + return 0;
> > > +}
> > > diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h
> b/lib/librte_vhost/vhost_user/virtio-net-user.h
> > > index b82108d..013cf38 100644
> > > --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> > > +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> > > @@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx,
> struct VhostUserMsg *);
> > >
> > >  void user_set_protocol_features(struct vhost_device_ctx ctx,
> > >   uint64_t protocol_features);
> > > +int user_set_log_base(struct vhost_device_ctx ctx, struct
> VhostUserMsg *);
> > >
> > >  int user_get_vring_base(struct vhost_device_ctx, struct
> vhost_vring_state *);
> > >
> >


[dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write

2015-12-22 Thread Xie, Huawei
On 12/22/2015 10:40 AM, Yuanhan Liu wrote:
> On Mon, Dec 21, 2015 at 03:06:43PM +, Xie, Huawei wrote:
>> On 12/17/2015 11:11 AM, Yuanhan Liu wrote:
>>> Introduce vhost_log_write() helper function to log the dirty pages we
>>> touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
>>> log is presented by 1 bit.
>>>
>>> Therefore, vhost_log_write() simply finds the right bit for related
>>> page we are gonna change, and set it to 1. dev->log_base denotes the
>>> start of the dirty page bitmap.
>>>
>>> Signed-off-by: Yuanhan Liu 
>>> Signed-off-by: Victor Kaplansky >> ---
>>>  lib/librte_vhost/rte_virtio_net.h | 29 +
>>>  1 file changed, 29 insertions(+)
>>>
>>> diff --git a/lib/librte_vhost/rte_virtio_net.h 
>>> b/lib/librte_vhost/rte_virtio_net.h
>>> index 8acee02..5726683 100644
>>> --- a/lib/librte_vhost/rte_virtio_net.h
>>> +++ b/lib/librte_vhost/rte_virtio_net.h
>>> @@ -40,6 +40,7 @@
>>>   */
>>>  
>>>  #include 
>>> +#include 
>>>  #include 
>>>  #include 
>>>  #include 
>>> @@ -59,6 +60,8 @@ struct rte_mbuf;
>>>  /* Backend value set by guest. */
>>>  #define VIRTIO_DEV_STOPPED -1
>>>  
>>> +#define VHOST_LOG_PAGE 4096
>>> +
>>>  
>>>  /* Enum for virtqueue management. */
>>>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
>>> @@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
>>> return vhost_va;
>>>  }
>>>  
>>> +static inline void __attribute__((always_inline))
>>> +vhost_log_page(uint8_t *log_base, uint64_t page)
>>> +{
>>> +   log_base[page / 8] |= 1 << (page % 8);
>>> +}
>>> +
>> Those logging functions are not supposed to be API. Could we move them
>> into an internal header file?
> Agreed. I should have put them into vhost_rxtx.c
>
>>> +static inline void __attribute__((always_inline))
>>> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
>>> +{
>>> +   uint64_t page;
>>> +
>> Before we log, we need memory barrier to make sure updates are in place.
>>> +   if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>>> +  !dev->log_base || !len))
>>> +   return;
> Put a memory barrier inside set_features()?
>
> I see no var dependence here, why putting a barrier then? We are
> accessing and modifying same var, doesn't the cache MESI protocol
> will get rid of your concerns?
This fence isn't about feature var. It is to ensure that updates to the
guest buffer are committed before the logging.
For IA strong memory model, compiler barrier is enough. For other weak
memory model, fence is required.
>>> +
>>> +   if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
>>> +   return;
>>> +
>>> +   page = addr / VHOST_LOG_PAGE;
>>> +   while (page * VHOST_LOG_PAGE < addr + len) {
>> Let us have a page_end var to make the code simpler?
> Could do that.
>
>
>>> +   vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
>>> +   page += VHOST_LOG_PAGE;
>> page += 1?
> Oops, right.
>
>   --yliu
>
>>> +   }
>>> +}
>>> +
>>> +
>>>  /**
>>>   *  Disable features in feature_mask. Returns 0 on success.
>>>   */



[dpdk-dev] building LIBRTE_PMD_XENVIRT in 32bit triggers some errors

2015-12-22 Thread Xie, Huawei
On 12/22/2015 2:04 AM, Martinx - ? wrote:
> On 10 December 2015 at 02:45, Xie, Huawei  wrote:
>> On 12/10/2015 6:49 AM, Martinx - ? wrote:
>>> On 9 December 2015 at 18:05, Thomas Monjalon  
>>> wrote:
 2015-12-09 15:54, Martinx - ?:
>  Sorry to insist on this subject but, the time for releasing DPDK 2.2
> is near and DPDK build with Xen 32-bit is broken.
>
>  If DPDK doesn't fix this, there will be no way to enable XenVirt
> support for next Ubuntu LTS 16.04, which is a shame...
>
>  I'm planning to use DPDK on Xen domUs (PVM, HVM, XenServer and on
> Amazon EC2) powered exclusively by a supported version of Ubuntu but,
> it is broken now...
>
>  So, please, can someone take a look into this?:-P
>
>  Thanks in advance!
 Sorry, this area has no maintainer:
 http://dpdk.org/browse/dpdk/tree/MAINTAINERS#n169

 In such case, it may be logic to remove the dead code.
 If someone wants to make it alive, he's welcome!
>>> Hi Thomas,
>>>
>>>  Listen, if DPDK on Xen has no maintainer, where can I find the
>>> current state of DPDK on Xen?
>>>
>>>  I mean, I'm planning to use DPDK with Xen on the following environments:
>>>
>>>  * Amazon EC2 - HVM Enhanced Networking - *priority*
>>>  * XenServer
>>>  * Open Source Xen on Debian / Ubuntu (both PVM / HVM)
>>>
>>>  But, if Xen support on DPDK has no maintainer, how to you guys are
>>> running DPDK on top of Xen (like for example, within Amazon EC2)?
>>>
>>>  If I google for "DPDK Xen", I can find lots of good information but,
>>> I can't find recommended setup / drivers...
>>>
>>>  Do you have any recommendation?
>> Thiago:
>> This xen PMD is based on grant table mechanism and virtio interface.
>> Worth to note is it needs customized backend, which now resides in
>> examples/vhost_xen.
>> Another approach is netfront based PMD, which has kernel netback backend
>> in place, but i guess it couldn't achieve best performance as we need
>> map each grant page in backend. Stephen submitted the patch for netfront
>> PMD http://dpdk.org/dev/patchwork/patch/3330/. Thomas, do you know its
>> status?
>> Anyway i will try to create the XEN environment, and check the issues.
>>
>>>  Thank you!
>>>
>>> Best,
>>> Thiago
>>>
> Hello Xie,
>
> Thank you for your help, I really appreciated it!
>
> Basically, what I would like to understand is:
>
>
> - What is the BEST way of running DPDK inside a Xen domU guest?
>
>
> I'm seeing that there are too many options and not enough
> documentation about each, for example...
>
> * Does DPDK XENVIRT option, depends on XENDOM0 option? However, you
> said that it isn't fast / can't achieve best performance...
XENDOM0 option is for running DPDK in the "DOM0" domain. For running
DPDK PMD in guest, you don't need to enable it.
Every para virtualization has a driver(in the guest) and the device
implementation(in the host domain).
For KVM, the driver is the virtio-net driver(kernel implementation) or
DPDK virtio PMD(user space implementation), and the device is
vhost-net(kernel acceleration implementation) or vhost-user(user space
acceleration implementation).
For XEN, the driver is the netfront, and the device is netback. For
netfront, Stephen has a DPDK netfront PMD, which isn't merged. For
netback, currently there is no DPDK implementation. The possible
performance problem with netfront/netback is for each guest buffer,
netfront in guest has to dynamically allocate grant id from the
hypervisor to establish the mapping, and netback in DOM0 has to
dynamically translate the buffer through grant id. I haven't followed
the thread using permanent mapping to solve this issue. I recall it adds
extra memory copy.

If your purpose is to run DPDK pmd in the guest, and you have no control
of DOM0 domain, you could start with Stephen's netfront.
How about VF PMD in your domU guest?
>
> * Apparently, Xen supports VirtIO (if I'm not wrong), but, I honestly
> don't know for sure, where/when it is available (XenServer? Amazon
> high-perf Net Instance? HVM? PVM?)
For XEN, we provide customized DPDK virtio PMD and DPDK vhost, which
means if you want to run DPDK virtio in XEN guest, you have to run our
dpdk vhost in DOM0 as well.
AFAIK, XEN doesn't support VirtIO. There is some Google program on this.
No idea why it isn't up streamed.
>
> * If Xen supports VirtIO (especially on Amazon / XenServer), isn't
> this the BEST way of running DPDK Apps on top of this kind of
> hypervisor (i.e., by not using XENVIRT at all)?
Yes.
>
> Thanks again!
> Thiago
>



[dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request

2015-12-22 Thread Yuanhan Liu
On Tue, Dec 22, 2015 at 02:41:43AM +, Xie, Huawei wrote:
> 
> 
> > -Original Message-
> > From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> > Sent: Tuesday, December 22, 2015 10:26 AM
> > To: Xie, Huawei
> > Cc: dev at dpdk.org; Michael S. Tsirkin; Victor Kaplansky; Iremonger,
> > Bernard; Pavel Fedin; Peter Xu
> > Subject: Re: [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE
> > request
> > 
> > On Mon, Dec 21, 2015 at 03:32:53PM +, Xie, Huawei wrote:
> > > > +
> > > > +   /*
> > > > +* mmap from 0 to workaround a hugepage mmap bug: mmap will be
> > > > +* failed when offset is not page size aligned.
> > > > +*/
> > > s /will be failed/will fail/
> > > mmap will fail when offset is not zero.
> I mistake for 4KB page size.

Didn't follow you.

> Please check if huge page size align is enough.

It should be. However, I don't think we need bother to do that:
first of all, it happened on few specific old kernels. And, "off"
here is kind of guaranteed to be 0. Last, even it's not, mmaping
it from 0 will resolve that.

> > > Also we only know this workaround is for hugetlbfs. Not sure of
> > other
> > > tmpfs, so mention hugetlbfs here.
> > 
> > I have already mentioned "to workaround a __hugepage__ mmap bug"; it's
> > not enough?
> Yes.
> > 
> > > > +   addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, 
> > > > fd,
> > 0);
> > > > +   if (addr == MAP_FAILED) {
> > > > +   RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> > > > +   return -1;
> > > > +   }
> > > > +
> > > > +   /* TODO: unmap on stop */
> > > > +   dev->log_base = (uint64_t)(uintptr_t)addr + off;
> > > (uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?
> > 
> > No, addr is of (void *) type, we should cast it to uint64_t type first,
> > before adding it with "off".
> > 
> > --yliu
> RTE_PTR_ADD is the DPDK interface for pointer arithmetic operation.

log_base is with "uint64_t" type, RTE_PTR_ADD() returns (void*), so it
won't work here.

--yliu


[dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write

2015-12-22 Thread Yuanhan Liu
On Tue, Dec 22, 2015 at 02:45:52AM +, Xie, Huawei wrote:
> >>> +static inline void __attribute__((always_inline))
> >>> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> >>> +{
> >>> + uint64_t page;
> >>> +
> >> Before we log, we need memory barrier to make sure updates are in place.
> >>> + if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> >>> +!dev->log_base || !len))
> >>> + return;
> > Put a memory barrier inside set_features()?
> >
> > I see no var dependence here, why putting a barrier then? We are
> > accessing and modifying same var, doesn't the cache MESI protocol
> > will get rid of your concerns?
> This fence isn't about feature var. It is to ensure that updates to the
> guest buffer are committed before the logging.

Oh.., I was thinking you were talking about the "dev->features" field
concurrent access and modify you mentioned from V1.

> For IA strong memory model, compiler barrier is enough. For other weak
> memory model, fence is required.
> >>> +
> >>> + if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> >>> + return;

So that I should put a "rte_mb()" __here__?

--yliu
> >>> +
> >>> + page = addr / VHOST_LOG_PAGE;
> >>> + while (page * VHOST_LOG_PAGE < addr + len) {
> >> Let us have a page_end var to make the code simpler?
> > Could do that.
> >
> >
> >>> + vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
> >>> + page += VHOST_LOG_PAGE;
> >> page += 1?
> > Oops, right.
> >
> > --yliu
> >
> >>> + }
> >>> +}
> >>> +
> >>> +
> >>>  /**
> >>>   *  Disable features in feature_mask. Returns 0 on success.
> >>>   */
> 


[dpdk-dev] [PATCH 1/3] vhost: get rid of linked list dev

2015-12-22 Thread Xie, Huawei
On 12/18/2015 3:03 PM, Yuanhan Liu wrote:
> While we use a single linked list to maintain all devices, we could
> use a static array to achieve the same goal, just like what we did
> to maintain the eth devices with rte_eth_devices array. This could
> simplifies the code a bit.
>
> Signed-off-by: Yuanhan Liu 

This does simplify the implementation. Cced Tommy.
[...]



[dpdk-dev] [PATCH v5 1/3] vhost: Add callback and private data for vhost PMD

2015-12-22 Thread Yuanhan Liu
On Fri, Dec 18, 2015 at 10:01:25AM -0800, Rich Lane wrote:
> I'm using the vhost callbacks and struct virtio_net with the vhost PMD in a 
> few
> ways:

Rich, thanks for the info!

> 
> 1. new_device/destroy_device: Link state change (will be covered by the link
> status interrupt).
> 2. new_device: Add first queue to datapath.

I'm wondering why vring_state_changed() is not used, as it will also be
triggered at the beginning, when the default queue (the first queue) is
enabled.

> 3. vring_state_changed: Add/remove queue to datapath.
> 4. destroy_device: Remove all queues (vring_state_changed is not called when
> qemu is killed).

I had a plan to invoke vring_state_changed() to disable all vrings
when destroy_device() is called.

> 5. new_device and struct virtio_net: Determine NUMA node of the VM.

You can get the 'struct virtio_net' dev from all above callbacks.

> 
> The vring_state_changed callback is necessary because the VM might not be 
> using
> the maximum number of RX queues. If I boot Linux in the VM it will start out
> using one RX queue, which can be changed with ethtool. The DPDK app in the 
> host
> needs to be notified that it can start sending traffic to the new queue.
> 
> The vring_state_changed callback is also useful for guest TX queues to avoid
> reading from an inactive queue.
> 
> API I'd like to have:
> 
> 1. Link status interrupt.

To vhost pmd, new_device()/destroy_device() equals to the link status
interrupt, where new_device() is a link up, and destroy_device() is link
down().


> 2. New queue_state_changed callback. Unlike vring_state_changed this should
> cover the first queue at new_device and removal of all queues at
> destroy_device.

As stated above, vring_state_changed() should be able to do that, except
the one on destroy_device(), which is not done yet.

> 3. Per-queue or per-device NUMA node info.

You can query the NUMA node info implicitly by get_mempolicy(); check
numa_realloc() at lib/librte_vhost/virtio-net.c for reference.

--yliu
> 
> On Thu, Dec 17, 2015 at 8:28 PM, Tetsuya Mukawa  wrote:
> 
> On 2015/12/18 13:15, Yuanhan Liu wrote:
> > On Fri, Dec 18, 2015 at 12:15:42PM +0900, Tetsuya Mukawa wrote:
> >> On 2015/12/17 20:42, Yuanhan Liu wrote:
> >>> On Tue, Nov 24, 2015 at 06:00:01PM +0900, Tetsuya Mukawa wrote:
>  The vhost PMD will be a wrapper of vhost library, but some of vhost
>  library APIs cannot be mapped to ethdev library APIs.
>  Becasue of this, in some cases, we still need to use vhost library
> APIs
>  for a port created by the vhost PMD.
> 
>  Currently, when virtio device is created and destroyed, vhost library
>  will call one of callback handlers. The vhost PMD need to use this
>  pair of callback handlers to know which virtio devices are connected
>  actually.
>  Because we can register only one pair of callbacks to vhost library,
> if
>  the PMD use it, DPDK applications cannot have a way to know the
> events.
> 
>  This may break legacy DPDK applications that uses vhost library. To
> prevent
>  it, this patch adds one more pair of callbacks to vhost library
> especially
>  for the vhost PMD.
>  With the patch, legacy applications can use the vhost PMD even if 
> they
> need
>  additional specific handling for virtio device creation and
> destruction.
> 
>  For example, legacy application can call
>  rte_vhost_enable_guest_notification() in callbacks to change setting.
> >>> TBH, I never liked it since the beginning. Introducing two callbacks
> >>> for one event is a bit messy, and therefore error prone.
> >> I agree with you.
> >>
> >>> I have been thinking this occasionally last few weeks, and have came
> >>> up something that we may introduce another layer callback based on
> >>> the vhost pmd itself, by a new API:
> >>>
> >>>? ? ?rte_eth_vhost_register_callback().
> >>>
> >>> And we then call those new callback inside the vhost pmd new_device()
> >>> and vhost pmd destroy_device() implementations.
> >>>
> >>> And we could have same callbacks like vhost have, but I'm thinking
> >>> that new_device() and destroy_device() doesn't sound like a good name
> >>> to a PMD driver. Maybe a name like "link_state_changed" is better?
> >>>
> >>> What do you think of that?
> >> Yes,? "link_state_changed" will be good.
> >>
> >> BTW, I thought it was ok that an DPDK app that used vhost PMD called
> >> vhost library APIs directly.
> >> But probably you may feel strangeness about it. Is this correct?
> > Unluckily, that's true :)
> >
> >> If so, how about implementing legacy status interrupt mechanism to 
> vhost
> >> PMD?
> >> For example, an DPDK app can register callback handler like
> >> "examples/link_status_interrupt".
>

[dpdk-dev] [Question] How pmd virtio works without UIO?

2015-12-22 Thread Peter Xu
Hi,

I got a question related to how virtio pmd driver work without
UIO layer.

I see that in virtio PMD driver, DPDK will first try to init the
device using UIO interfaces. If it fails, it will try to init by
manipulating IO ports directly (see virtio_resource_init()).

For the ioport case, is it okay to do it like this? E.g., in
eth_virtio_dev_init(), we are resetting the virtio device, however,
this device should still be owned by virtio-pci driver in the
kernel.

How is that working? Did I miss anything?

Thanks in advance.
Peter


[dpdk-dev] [PATCH v5 1/3] vhost: Add callback and private data for vhost PMD

2015-12-22 Thread Yuanhan Liu
On Mon, Dec 21, 2015 at 11:10:10AM +0900, Tetsuya Mukawa wrote:
> nes: 168
> 
> On 2015/12/19 3:01, Rich Lane wrote:
> > I'm using the vhost callbacks and struct virtio_net with the vhost PMD in a
> > few ways:
> >
> > 1. new_device/destroy_device: Link state change (will be covered by the
> > link status interrupt).
> > 2. new_device: Add first queue to datapath.
> > 3. vring_state_changed: Add/remove queue to datapath.
> > 4. destroy_device: Remove all queues (vring_state_changed is not called
> > when qemu is killed).
> > 5. new_device and struct virtio_net: Determine NUMA node of the VM.
> >
> > The vring_state_changed callback is necessary because the VM might not be
> > using the maximum number of RX queues. If I boot Linux in the VM it will
> > start out using one RX queue, which can be changed with ethtool. The DPDK
> > app in the host needs to be notified that it can start sending traffic to
> > the new queue.
> >
> > The vring_state_changed callback is also useful for guest TX queues to
> > avoid reading from an inactive queue.
> >
> > API I'd like to have:
> >
> > 1. Link status interrupt.
> > 2. New queue_state_changed callback. Unlike vring_state_changed this should
> > cover the first queue at new_device and removal of all queues at
> > destroy_device.
> > 3. Per-queue or per-device NUMA node info.
> 
> Hi Rich and Yuanhan,
> 
> As Rich described, some users needs more information when the interrupts
> comes.
> And the virtio_net structure contains the information.
> 
> I guess it's very similar to interrupt handling of normal hardware.
> First, a interrupt comes, then an interrupt handler checks status
> register of the device to know actually what was happened.
> In vhost PMD case, reading status register equals reading virtio_net
> structure.
> 
> So how about below specification?
> 
> 1. The link status interrupt of vhost PMD will occurs when new_device,
> destroy_device and vring_state_changed events are happened.
> 2. Vhost PMD provides a function to let the users know virtio_net
> structure of the interrupted port.
>(Probably almost same as "rte_eth_vhost_portid2vdev" that I described
> in "[PATCH v5 3/3] vhost: Add helper function to convert port id to
> virtio device pointer")

That is one option, to wrapper everything into the link status interrupt
handler, and let it to query the virtio_net structure to know what exactly
happened, and then take the proper action.

With that, we could totally get rid of the "two sets of vhost callbacks".
The interface is a also clean. However, there is a drawback: it's not
that extensible: what if vhost introduces a new vhost callback, and it
does not literally belong to a link status interrupt event?

The another options is to introduce a new set of callbacks (not based on
vhost, but based on vhost-pmd as I suggested before). Here we could
rename the callback to make it looks more reasonable to a pmd driver,
say, remove the new_device()/destroy_device() and combine them as one
callback: link_status_changed.

The good thing about that is it's extensible; we could easily add a
new callback when there is a new one at vhost. However, it's not that
clean as the first one. Besides that, two sets of callback for vhost
is always weird to me.

Thoughts, comments?

--yliu


[dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write

2015-12-22 Thread Peter Xu
On Thu, Dec 17, 2015 at 11:11:57AM +0800, Yuanhan Liu wrote:
> +static inline void __attribute__((always_inline))
> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> +{
> + uint64_t page;
> +
> + if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> +!dev->log_base || !len))
> + return;
> +
> + if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))

Should it be "<="?

Peter


[dpdk-dev] [PATCH v5 1/3] vhost: Add callback and private data for vhost PMD

2015-12-22 Thread Yuanhan Liu
On Mon, Dec 21, 2015 at 08:47:28PM -0800, Rich Lane wrote:
> On Mon, Dec 21, 2015 at 7:41 PM, Yuanhan Liu 
> wrote:
> 
> On Fri, Dec 18, 2015 at 10:01:25AM -0800, Rich Lane wrote:
> > I'm using the vhost callbacks and struct virtio_net with the vhost PMD 
> in
> a few
> > ways:
> 
> Rich, thanks for the info!
>
> >
> > 1. new_device/destroy_device: Link state change (will be covered by the
> link
> > status interrupt).
> > 2. new_device: Add first queue to datapath.
> 
> I'm wondering why vring_state_changed() is not used, as it will also be
> triggered at the beginning, when the default queue (the first queue) is
> enabled.
> 
> 
> Turns out I'd misread the code and it's already using the vring_state_changed
> callback for the
> first queue. Not sure if this is intentional but vring_state_changed is called
> for the first queue
> before new_device.

Yeah, you were right, we can't count on this vring_state_changed(), for
it's sent earlier than vring has been initialized on vhost side. Maybe
we should invoke vring_state_changed() callback at new_device() as well.

> ?
> 
> > 3. vring_state_changed: Add/remove queue to datapath.
> > 4. destroy_device: Remove all queues (vring_state_changed is not called
> when
> > qemu is killed).
> 
> I had a plan to invoke vring_state_changed() to disable all vrings
> when destroy_device() is called.
> 
> 
> That would be good.
> ?
> 
> > 5. new_device and struct virtio_net: Determine NUMA node of the VM.
> 
> You can get the 'struct virtio_net' dev from all above callbacks.
> 
> ?
> 
> > 1. Link status interrupt.
> 
> To vhost pmd, new_device()/destroy_device() equals to the link status
> interrupt, where new_device() is a link up, and destroy_device() is link
> down().
>
> 
> > 2. New queue_state_changed callback. Unlike vring_state_changed this
> should
> > cover the first queue at new_device and removal of all queues at
> > destroy_device.
> 
> As stated above, vring_state_changed() should be able to do that, except
> the one on destroy_device(), which is not done yet.
>
> > 3. Per-queue or per-device NUMA node info.
> 
> You can query the NUMA node info implicitly by get_mempolicy(); check
> numa_realloc() at lib/librte_vhost/virtio-net.c for reference.
> 
> 
> Your suggestions are exactly how my application is already working. I was
> commenting on the
> proposed changes to the vhost PMD API. I would prefer to
> use?RTE_ETH_EVENT_INTR_LSC
> and?rte_eth_dev_socket_id for consistency with other NIC drivers, instead of
> these vhost-specific
> hacks.

That's a good suggestion.

> The queue state change callback is the one new API that needs to be
> added because
> normal NICs don't have this behavior.

Again I'd ask, will vring_state_changed() be enough, when above issues
are resolved: vring_state_changed() will be invoked at new_device()/
destroy_device(), and of course, ethtool change?

--yliu

> 
> You could add another?rte_eth_event_type for the queue state change callback,
> and pass the
> queue ID, RX/TX direction, and enable bit through cb_arg. The application 
> would
> never need
> to touch struct virtio_net.


[dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write

2015-12-22 Thread Yuanhan Liu
On Tue, Dec 22, 2015 at 01:11:02PM +0800, Peter Xu wrote:
> On Thu, Dec 17, 2015 at 11:11:57AM +0800, Yuanhan Liu wrote:
> > +static inline void __attribute__((always_inline))
> > +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> > +{
> > +   uint64_t page;
> > +
> > +   if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > +  !dev->log_base || !len))
> > +   return;
> > +
> > +   if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> 
> Should it be "<="?

Right, thanks for catching it.

--yliu


[dpdk-dev] [PATCH 1/3] vhost: get rid of linked list dev

2015-12-22 Thread Xie, Huawei
On 12/18/2015 3:03 PM, Yuanhan Liu wrote:
> While we use a single linked list to maintain all devices, we could
> use a static array to achieve the same goal, just like what we did
> to maintain the eth devices with rte_eth_devices array. This could
> simplifies the code a bit.
>
> Signed-off-by: Yuanhan Liu 

Acked-by: Huawei Xie 
[...]




[dpdk-dev] [ [PATCH v2] 01/13] virtio: Introduce config RTE_VIRTIO_INC_VECTOR

2015-12-22 Thread Yuanhan Liu
On Fri, Dec 18, 2015 at 06:16:36PM +0530, Santosh Shukla wrote:
> On Fri, Dec 18, 2015 at 4:54 AM, Stephen Hemminger
>  wrote:
> > On Thu, 17 Dec 2015 17:32:38 +0530
> > Santosh Shukla  wrote:
> >
> >> On Mon, Dec 14, 2015 at 6:30 PM, Santosh Shukla  
> >> wrote:
> >> > virtio_recv_pkts_vec and other virtio vector friend apis are written for 
> >> > sse/avx
> >> > instructions. For arm64 in particular, virtio vector implementation does 
> >> > not
> >> > exist(todo).
> >> >
> >> > So virtio pmd driver wont build for targets like i686, arm64.  By making
> >> > RTE_VIRTIO_INC_VECTOR=n, Driver can build for non-sse/avx targets and 
> >> > will work
> >> > in non-vectored virtio mode.
> >> >
> >> > Signed-off-by: Santosh Shukla 
> >> > ---
> >>
> >> Ping?
> >>
> >> any review  / comment on this patch much appreciated. Thanks
> >
> > The patches I posted (and were ignored by Intel) to support indirect
> > and any layout should have much bigger performance gain than all this
> > low level SSE bit twiddling.
> >
> 
> I little confused - do we care for this patch?

Santosh,

As a reviewer that still have a lot of work to do, I don't have the
bandwidth to review _all_ your patches carefully __once__. That is
to say, I will only comment when I find something should be commented,
from time to time when I put more thoughts there. For other patches
I've no comment, it could mean that it's okay to me so far, or I'm
not quite sure it's okay but I don't find anything obvious wrong.
Hence, I put no comments so far. But later, when get time, I will
revisit them, think more, and either ACK it, or comment it.

So, you could simply keep those patches unchanged if they received
no comments, and fix other comments, and send out a new version at
anytime that is proper to you.

--yliu


[dpdk-dev] [PATCH 2/3] vhost: simplify numa_realloc

2015-12-22 Thread Xie, Huawei
On 12/18/2015 3:03 PM, Yuanhan Liu wrote:
> We could first check if we need realloc vq or not, if so,
> reallocate it. We then do similar to vhost dev realloc.
>
> This could get rid of the tons of repeated "if (realloc_dev)"
> and "if (realloc_vq)" statements, therefore, makes code
> a bit more readable.
>
> Signed-off-by: Yuanhan Liu 
> ---
>  lib/librte_vhost/virtio-net.c | 77 
> ---
>  1 file changed, 36 insertions(+), 41 deletions(-)
>
> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> index 2f83438..31ca4f7 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -441,64 +441,59 @@ static struct virtio_net*
>  numa_realloc(struct virtio_net *dev, int index)
>  {
>   int oldnode, newnode;
> - struct virtio_net *old_dev, *new_dev = NULL;
> - struct vhost_virtqueue *old_vq, *new_vq = NULL;
> + struct virtio_net *old_dev;
> + struct vhost_virtqueue *old_vq, *vq;
>   int ret;
> - int realloc_dev = 0, realloc_vq = 0;
>  
>   old_dev = dev;
> - old_vq  = dev->virtqueue[index];
> + vq = old_vq = dev->virtqueue[index];
>  
> - ret  = get_mempolicy(&newnode, NULL, 0, old_vq->desc,
> - MPOL_F_NODE | MPOL_F_ADDR);
> - ret = ret | get_mempolicy(&oldnode, NULL, 0, old_dev,
> + ret = get_mempolicy(&newnode, NULL, 0, old_vq->desc,
>   MPOL_F_NODE | MPOL_F_ADDR);
> +
> + /* check if we need to reallocate vq */
> + ret = get_mempolicy(&oldnode, NULL, 0, old_vq, MPOL_F_NODE | 
> MPOL_F_ADDR);

Why remove the ret = ret | ? Both get_mempolicy could fail.

>   if (ret) {
>   RTE_LOG(ERR, VHOST_CONFIG,
> - "Unable to get vring desc or dev numa information.\n");
> + "Unable to get vq numa information.\n");
>   return dev;
>   }
> - if (oldnode != newnode)
> - realloc_dev = 1;
> + if (oldnode != newnode) {
> + RTE_LOG(INFO, VHOST_CONFIG,
> + "reallocate vq from %d to %d node\n", oldnode, newnode);
> + vq = rte_malloc_socket(NULL, sizeof(*vq), 0, newnode);
> + if (!vq)
> + return dev;
> +
> + memcpy(vq, old_vq, sizeof(*vq));
> + rte_free(old_vq);
> + }
>  
> - ret = get_mempolicy(&oldnode, NULL, 0, old_vq,
> - MPOL_F_NODE | MPOL_F_ADDR);
> + /* check if we need to reallocate dev */
> + ret = get_mempolicy(&oldnode, NULL, 0, old_dev, MPOL_F_NODE | 
> MPOL_F_ADDR);
>   if (ret) {
>   RTE_LOG(ERR, VHOST_CONFIG,
> - "Unable to get vq numa information.\n");
> - return dev;
> + "Unable to get vring desc or dev numa information.\n");
> + goto out;
>   }

Why vring desc in the err message?
-huawei
[...]


[dpdk-dev] [PATCH v4 1/6] fm10k: implement rx_descriptor_done function

2015-12-22 Thread Qiu, Michael
On 12/21/2015 6:20 PM, Shaopeng He wrote:
> rx_descriptor_done is used by interrupt mode example application
> (l3fwd-power) to check rxd DD bit to decide the RX trend,
> then l3fwd-power will adjust the cpu frequency according to
> the result.
>
> Signed-off-by: Shaopeng He 
> Acked-by: Jing Chen 
> ---
>  drivers/net/fm10k/fm10k.h|  3 +++
>  drivers/net/fm10k/fm10k_ethdev.c |  1 +
>  drivers/net/fm10k/fm10k_rxtx.c   | 25 +
>  3 files changed, 29 insertions(+)
>
> diff --git a/drivers/net/fm10k/fm10k.h b/drivers/net/fm10k/fm10k.h
> index cd38af2..e2f677a 100644
> --- a/drivers/net/fm10k/fm10k.h
> +++ b/drivers/net/fm10k/fm10k.h
> @@ -345,6 +345,9 @@ uint16_t fm10k_recv_pkts(void *rx_queue, struct rte_mbuf 
> **rx_pkts,
>  uint16_t fm10k_recv_scattered_pkts(void *rx_queue,
>   struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>  
> +int
> +fm10k_dev_rx_descriptor_done(void *rx_queue, uint16_t offset);
> +
>  uint16_t fm10k_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>   uint16_t nb_pkts);
>  
> diff --git a/drivers/net/fm10k/fm10k_ethdev.c 
> b/drivers/net/fm10k/fm10k_ethdev.c
> index e4aed94..d39c33b 100644
> --- a/drivers/net/fm10k/fm10k_ethdev.c
> +++ b/drivers/net/fm10k/fm10k_ethdev.c
> @@ -2435,6 +2435,7 @@ static const struct eth_dev_ops fm10k_eth_dev_ops = {
>   .rx_queue_release   = fm10k_rx_queue_release,
>   .tx_queue_setup = fm10k_tx_queue_setup,
>   .tx_queue_release   = fm10k_tx_queue_release,
> + .rx_descriptor_done = fm10k_dev_rx_descriptor_done,
>   .reta_update= fm10k_reta_update,
>   .reta_query = fm10k_reta_query,
>   .rss_hash_update= fm10k_rss_hash_update,
> diff --git a/drivers/net/fm10k/fm10k_rxtx.c b/drivers/net/fm10k/fm10k_rxtx.c
> index e958865..36d3002 100644
> --- a/drivers/net/fm10k/fm10k_rxtx.c
> +++ b/drivers/net/fm10k/fm10k_rxtx.c
> @@ -369,6 +369,31 @@ fm10k_recv_scattered_pkts(void *rx_queue, struct 
> rte_mbuf **rx_pkts,
>   return nb_rcv;
>  }
>  
> +int
> +fm10k_dev_rx_descriptor_done(void *rx_queue, uint16_t offset)
> +{
> + volatile union fm10k_rx_desc *rxdp;
> + struct fm10k_rx_queue *rxq = rx_queue;
> + uint16_t desc;
> + int ret;
> +
> + if (unlikely(offset >= rxq->nb_desc)) {
> + PMD_DRV_LOG(ERR, "Invalid RX queue id %u", offset);

Sorry, here makes my confuse: offset for RX queue id?

> + return 0;
> + }
> +
> + desc = rxq->next_dd + offset;
> + if (desc >= rxq->nb_desc)
> + desc -= rxq->nb_desc;
> +
> + rxdp = &rxq->hw_ring[desc];
> +
> + ret = !!(rxdp->w.status &
> + rte_cpu_to_le_16(FM10K_RXD_STATUS_DD));
> +
> + return ret;
> +}
> +
>  static inline void tx_free_descriptors(struct fm10k_tx_queue *q)
>  {
>   uint16_t next_rs, count = 0;



[dpdk-dev] [PATCH 2/3] vhost: simplify numa_realloc

2015-12-22 Thread Yuanhan Liu
On Tue, Dec 22, 2015 at 06:46:32AM +, Xie, Huawei wrote:
> On 12/18/2015 3:03 PM, Yuanhan Liu wrote:
> > We could first check if we need realloc vq or not, if so,
> > reallocate it. We then do similar to vhost dev realloc.
> >
> > This could get rid of the tons of repeated "if (realloc_dev)"
> > and "if (realloc_vq)" statements, therefore, makes code
> > a bit more readable.
> >
> > Signed-off-by: Yuanhan Liu 
> > ---
> >  lib/librte_vhost/virtio-net.c | 77 
> > ---
> >  1 file changed, 36 insertions(+), 41 deletions(-)
> >
> > diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> > index 2f83438..31ca4f7 100644
> > --- a/lib/librte_vhost/virtio-net.c
> > +++ b/lib/librte_vhost/virtio-net.c
> > @@ -441,64 +441,59 @@ static struct virtio_net*
> >  numa_realloc(struct virtio_net *dev, int index)
> >  {
> > int oldnode, newnode;
> > -   struct virtio_net *old_dev, *new_dev = NULL;
> > -   struct vhost_virtqueue *old_vq, *new_vq = NULL;
> > +   struct virtio_net *old_dev;
> > +   struct vhost_virtqueue *old_vq, *vq;
> > int ret;
> > -   int realloc_dev = 0, realloc_vq = 0;
> >  
> > old_dev = dev;
> > -   old_vq  = dev->virtqueue[index];
> > +   vq = old_vq = dev->virtqueue[index];
> >  
> > -   ret  = get_mempolicy(&newnode, NULL, 0, old_vq->desc,
> > -   MPOL_F_NODE | MPOL_F_ADDR);
> > -   ret = ret | get_mempolicy(&oldnode, NULL, 0, old_dev,
> > +   ret = get_mempolicy(&newnode, NULL, 0, old_vq->desc,
> > MPOL_F_NODE | MPOL_F_ADDR);
> > +
> > +   /* check if we need to reallocate vq */
> > +   ret = get_mempolicy(&oldnode, NULL, 0, old_vq, MPOL_F_NODE | 
> > MPOL_F_ADDR);
> 
> Why remove the ret = ret | ? Both get_mempolicy could fail.

Right, will fix it.

> 
> > if (ret) {
> > RTE_LOG(ERR, VHOST_CONFIG,
> > -   "Unable to get vring desc or dev numa information.\n");
> > +   "Unable to get vq numa information.\n");
> > return dev;
> > }
> > -   if (oldnode != newnode)
> > -   realloc_dev = 1;
> > +   if (oldnode != newnode) {
> > +   RTE_LOG(INFO, VHOST_CONFIG,
> > +   "reallocate vq from %d to %d node\n", oldnode, newnode);
> > +   vq = rte_malloc_socket(NULL, sizeof(*vq), 0, newnode);
> > +   if (!vq)
> > +   return dev;
> > +
> > +   memcpy(vq, old_vq, sizeof(*vq));
> > +   rte_free(old_vq);
> > +   }
> >  
> > -   ret = get_mempolicy(&oldnode, NULL, 0, old_vq,
> > -   MPOL_F_NODE | MPOL_F_ADDR);
> > +   /* check if we need to reallocate dev */
> > +   ret = get_mempolicy(&oldnode, NULL, 0, old_dev, MPOL_F_NODE | 
> > MPOL_F_ADDR);
> > if (ret) {
> > RTE_LOG(ERR, VHOST_CONFIG,
> > -   "Unable to get vq numa information.\n");
> > -   return dev;
> > +   "Unable to get vring desc or dev numa information.\n");
> > +   goto out;
> > }
> 
> Why vring desc in the err message?

Oops, no idea why I did that. Will fix it.

--yliu


[dpdk-dev] [PATCH v4 3/6] fm10k: remove rx queue interrupts when dev stops

2015-12-22 Thread Qiu, Michael
On 12/21/2015 6:20 PM, Shaopeng He wrote:
> Previous dev_stop function stops the rx/tx queues. This patch adds logic
> to disable rx queue interrupt, clean the datapath event and queue/vec map.
>
> Signed-off-by: Shaopeng He 
> Acked-by: Jing Chen 
> ---
>  drivers/net/fm10k/fm10k_ethdev.c | 22 ++
>  1 file changed, 22 insertions(+)
>
> diff --git a/drivers/net/fm10k/fm10k_ethdev.c 
> b/drivers/net/fm10k/fm10k_ethdev.c
> index a34c5e2..b5b809c 100644
> --- a/drivers/net/fm10k/fm10k_ethdev.c
> +++ b/drivers/net/fm10k/fm10k_ethdev.c
> @@ -1125,6 +1125,8 @@ fm10k_dev_start(struct rte_eth_dev *dev)
>  static void
>  fm10k_dev_stop(struct rte_eth_dev *dev)
>  {
> + struct fm10k_hw *hw = FM10K_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> + struct rte_intr_handle *intr_handle = &dev->pci_dev->intr_handle;
>   int i;
>  
>   PMD_INIT_FUNC_TRACE();
> @@ -1136,6 +1138,26 @@ fm10k_dev_stop(struct rte_eth_dev *dev)
>   if (dev->data->rx_queues)
>   for (i = 0; i < dev->data->nb_rx_queues; i++)
>   fm10k_dev_rx_queue_stop(dev, i);
> +
> + /* Disable datapath event */
> + if (rte_intr_dp_is_en(intr_handle)) {
> + for (i = 0; i < dev->data->nb_rx_queues; i++) {
> + FM10K_WRITE_REG(hw, FM10K_RXINT(i),
> + 3 << FM10K_RXINT_TIMER_SHIFT);
> + if (hw->mac.type == fm10k_mac_pf)
> + FM10K_WRITE_REG(hw, FM10K_ITR(Q2V(dev, i)),
> + FM10K_ITR_MASK_SET);
> + else
> + FM10K_WRITE_REG(hw, FM10K_VFITR(Q2V(dev, i)),
> + FM10K_ITR_MASK_SET);
> + }
> + }
> + /* Clean datapath event and queue/vec mapping */
> + rte_intr_efd_disable(intr_handle);
> + if (intr_handle->intr_vec != NULL) {

This line could be removed, because rte_free already do the check, see
below:
void rte_free(void *addr)
{
if (addr == NULL) return;
if (malloc_elem_free(malloc_elem_from_data(addr)) < 0)
rte_panic("Fatal error: Invalid memory\n");
}

> + rte_free(intr_handle->intr_vec);
> + intr_handle->intr_vec = NULL;
> + }
>  }
>  
>  static void



[dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes

2015-12-22 Thread Peter Xu
On Thu, Dec 17, 2015 at 11:11:58AM +0800, Yuanhan Liu wrote:
> +static inline void __attribute__((always_inline))
> +vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +  uint64_t offset, uint64_t len)
> +{

One thing optional: I feel it a little bit confusing regarding to
the helper function name here, for the reasons:

1. It more sounds like "logging all the vrings we used", however,
   what I understand is that, here we are logging dirty pages for
   guest memory. Or say, there is merely nothing to do directly with
   vring (although many vring ops might call this function, we are
   only passing [buf, len] pairs).

2. It may also let people think of "vring_used", which is part of
   virtio protocol. However, it does not mean it too.

I would suggest a better name without confusion, like
vhost_log_dirty_range() or anything else to avoid those keywords.

> + uint64_t addr;
> +
> + addr = vq->log_guest_addr + offset;
> + vhost_log_write(dev, addr, len);

Optional too: since addr is only used once, would it cleaner using
one line? Like:

vhost_log_write(dev, vq->log_guest_addr + offset, len);

> +}
> +
>  /**
>   * This function adds buffers to the virtio devices RX virtqueue. Buffers can
>   * be received from the physical port or from another virtio device. A packet
> @@ -129,6 +139,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>   uint32_t offset = 0, vb_offset = 0;
>   uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
>   uint8_t hdr = 0, uncompleted_pkt = 0;
> + uint16_t idx;
>  
>   /* Get descriptor from available ring */
>   desc = &vq->desc[head[packet_success]];
> @@ -200,16 +211,18 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>   }
>  
>   /* Update used ring with desc information */
> - vq->used->ring[res_cur_idx & (vq->size - 1)].id =
> - head[packet_success];
> + idx = res_cur_idx & (vq->size - 1);
> + vq->used->ring[idx].id = head[packet_success];
>  
>   /* Drop the packet if it is uncompleted */
>   if (unlikely(uncompleted_pkt == 1))
> - vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> - vq->vhost_hlen;
> + vq->used->ring[idx].len = vq->vhost_hlen;
>   else
> - vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> - pkt_len + 
> vq->vhost_hlen;
> + vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
> +
> + vhost_log_used_vring(dev, vq,
> + offsetof(struct vring_used, ring[idx]),
> + sizeof(vq->used->ring[idx]));

Got a question here:

I see that we are logging down changes when we are marking
used_vring. Do we need to log down buffer copy in rte_memcpy() too?
I am not sure whether I understand it correctly, it seems that this
is part of DPDK API ops to deliver data to the guest (from, e.g.,
OVS?), when we do rte_memcpy(), we seems to be modifying guest
memory too. Am I wrong?

Peter

>  
>   res_cur_idx++;
>   packet_success++;
> @@ -236,6 +249,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  
>   *(volatile uint16_t *)&vq->used->idx += count;
>   vq->last_used_idx = res_end_idx;
> + vhost_log_used_vring(dev, vq,
> + offsetof(struct vring_used, idx),
> + sizeof(vq->used->idx));
>  
>   /* flush used->idx update before we read avail->flags. */
>   rte_mb();
> @@ -265,6 +281,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t 
> queue_id,
>   uint32_t seg_avail;
>   uint32_t vb_avail;
>   uint32_t cpy_len, entry_len;
> + uint16_t idx;
>  
>   if (pkt == NULL)
>   return 0;
> @@ -302,16 +319,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, 
> uint32_t queue_id,
>   entry_len = vq->vhost_hlen;
>  
>   if (vb_avail == 0) {
> - uint32_t desc_idx =
> - vq->buf_vec[vec_idx].desc_idx;
> + uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
> +
> + if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
> + idx = cur_idx & (vq->size - 1);
>  
> - if ((vq->desc[desc_idx].flags
> - & VRING_DESC_F_NEXT) == 0) {
>   /* Update used ring with desc information */
> - vq->used->ring[cur_idx & (vq->size - 1)].id
> - = vq->buf_vec[vec_idx].desc_idx;
> - vq->used->ring[cur_idx & (vq->size - 1)].len
> - = entry_len;
> + vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
> +  

[dpdk-dev] [PATCH 3/3] vhost: fix vq realloc at numa_realloc

2015-12-22 Thread Xie, Huawei
On 12/18/2015 3:03 PM, Yuanhan Liu wrote:
> vq is allocated on pairs, hence we should do pair reallocation
> at numa_realloc() as well, otherwise an error like following
> occurs while do numa reallocation:
>
> VHOST_CONFIG: reallocate vq from 0 to 1 node
> PANIC in rte_free():
> Fatal error: Invalid memory
>
> The reason we don't catch it is because numa_realloc() will
looks to me, but
s /because/that/
s /on pairs/in pairs/ ? :).
> not take effect when RTE_LIBRTE_VHOST_NUMA is not enabled,
> which is the default case.
>
> Fixes: e049ca6d10e0 ("vhost-user: prepare multiple queue setup")
>
> Signed-off-by: Yuanhan Liu 

Acked-by: Huawei Xie 




[dpdk-dev] [Question] How pmd virtio works without UIO?

2015-12-22 Thread Yuanhan Liu
On Tue, Dec 22, 2015 at 11:50:41AM +0800, Peter Xu wrote:
> Hi,
> 
> I got a question related to how virtio pmd driver work without
> UIO layer.
> 
> I see that in virtio PMD driver, DPDK will first try to init the
> device using UIO interfaces. If it fails, it will try to init by
> manipulating IO ports directly (see virtio_resource_init()).
> 
> For the ioport case, is it okay to do it like this? E.g., in
> eth_virtio_dev_init(), we are resetting the virtio device, however,
> this device should still be owned by virtio-pci driver in the
> kernel.
> 
> How is that working? Did I miss anything?

That's for configuration part: as far as we can read/write the right
PCI port, virtio pmd configuration will work. Note that on this case,
uio just provides another way to tell you where the port is.

--yliu


[dpdk-dev] [PATCH v2 7/7] doc: release note update for fm10k intr mode

2015-12-22 Thread Qiu, Michael
On 10/26/2015 11:48 AM, He, Shaopeng wrote:
> Signed-off-by: Shaopeng He 
> ---
>  doc/guides/rel_notes/release_2_2.rst | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/doc/guides/rel_notes/release_2_2.rst 
> b/doc/guides/rel_notes/release_2_2.rst
> index 73dba47..44b3aea 100644
> --- a/doc/guides/rel_notes/release_2_2.rst
> +++ b/doc/guides/rel_notes/release_2_2.rst
> @@ -5,11 +5,13 @@ New Features
>  
>  
>  * Support interrupt mode on i40e
> +* Support interrupt mode on fm10k
>  

Do we need to declare the limitation like only VFIO support for RRC, and
other limitations like could not work on the management port for
BR(maybe I'm wrong, but need BAR4 to run Testpoint)

>  Resolved Issues
>  ---
>  
>  * Fix ixgbe/igb rx interrupt compatible issue with mbox
> +* Fix l3fwd-power memory leak for non-ip packet
>  
>  Known Issues
>  



[dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write

2015-12-22 Thread Xie, Huawei
On 12/22/2015 11:03 AM, Yuanhan Liu wrote:
> On Tue, Dec 22, 2015 at 02:45:52AM +, Xie, Huawei wrote:
> +static inline void __attribute__((always_inline))
> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> +{
> + uint64_t page;
> +
 Before we log, we need memory barrier to make sure updates are in place.
> + if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> +!dev->log_base || !len))
> + return;
>>> Put a memory barrier inside set_features()?
>>>
>>> I see no var dependence here, why putting a barrier then? We are
>>> accessing and modifying same var, doesn't the cache MESI protocol
>>> will get rid of your concerns?
>> This fence isn't about feature var. It is to ensure that updates to the
>> guest buffer are committed before the logging.
> Oh.., I was thinking you were talking about the "dev->features" field
> concurrent access and modify you mentioned from V1.
>
>> For IA strong memory model, compiler barrier is enough. For other weak
>> memory model, fence is required.
> +
> + if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> + return;
> So that I should put a "rte_mb()" __here__?
>
>   --yliu

I find that we already have the arch dependent version of rte_smp_wmb()
--huawei
> +
> + page = addr / VHOST_LOG_PAGE;
> + while (page * VHOST_LOG_PAGE < addr + len) {
 Let us have a page_end var to make the code simpler?
>>> Could do that.
>>>
>>>
> + vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
> + page += VHOST_LOG_PAGE;
 page += 1?
>>> Oops, right.
>>>
>>> --yliu
>>>
> + }
> +}
> +
> +
>  /**
>   *  Disable features in feature_mask. Returns 0 on success.
>   */



[dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes

2015-12-22 Thread Xie, Huawei
On 12/22/2015 2:56 PM, Peter Xu wrote:
> On Thu, Dec 17, 2015 at 11:11:58AM +0800, Yuanhan Liu wrote:
>> +static inline void __attribute__((always_inline))
>> +vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
>> + uint64_t offset, uint64_t len)
>> +{
[...]
> Got a question here:
>
> I see that we are logging down changes when we are marking
> used_vring. Do we need to log down buffer copy in rte_memcpy() too?
> I am not sure whether I understand it correctly, it seems that this
> is part of DPDK API ops to deliver data to the guest (from, e.g.,
> OVS?), when we do rte_memcpy(), we seems to be modifying guest
> memory too. Am I wrong?
>
> Peter

desc buffer logging isn't included in v1, but in the patch 4 of this
patch set, and actually it is the major work in vhost live migration.
--huawei

[...]



[dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes

2015-12-22 Thread Yuanhan Liu
On Tue, Dec 22, 2015 at 02:55:52PM +0800, Peter Xu wrote:
> On Thu, Dec 17, 2015 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > +static inline void __attribute__((always_inline))
> > +vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +uint64_t offset, uint64_t len)
> > +{
> 
> One thing optional: I feel it a little bit confusing regarding to
> the helper function name here, for the reasons:
> 
> 1. It more sounds like "logging all the vrings we used", however,
>what I understand is that, here we are logging dirty pages for
>guest memory. Or say, there is merely nothing to do directly with
>vring (although many vring ops might call this function, we are
>only passing [buf, len] pairs).
> 
> 2. It may also let people think of "vring_used", which is part of
>virtio protocol. However, it does not mean it too.

Yes, it does. Here log_guest_addr is set to physical address of
vring_used. (check code at vhost_virtqueue_set_addr() of qemu).

 627 static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
 628 struct vhost_virtqueue *vq,
 629 unsigned idx, bool enable_log)
 630 {
 631 struct vhost_vring_addr addr = {
 632 .index = idx,
 633 .desc_user_addr = (uint64_t)(unsigned long)vq->desc,
 634 .avail_user_addr = (uint64_t)(unsigned long)vq->avail,
 635 .used_user_addr = (uint64_t)(unsigned long)vq->used,
==>  636 .log_guest_addr = vq->used_phys,
 637 .flags = enable_log ? (1 << VHOST_VRING_F_LOG) : 0,
 638 };

So, this function does log changes to used vring.

> 
> I would suggest a better name without confusion, like
> vhost_log_dirty_range() or anything else to avoid those keywords.
> 
> > +   uint64_t addr;
> > +
> > +   addr = vq->log_guest_addr + offset;
> > +   vhost_log_write(dev, addr, len);
> 
> Optional too: since addr is only used once, would it cleaner using
> one line? Like:

Yes, it is. Will fix it.

> 
> vhost_log_write(dev, vq->log_guest_addr + offset, len);
> 
> > +}
> > +
> >  /**
> >   * This function adds buffers to the virtio devices RX virtqueue. Buffers 
> > can
> >   * be received from the physical port or from another virtio device. A 
> > packet
> > @@ -129,6 +139,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
> > uint32_t offset = 0, vb_offset = 0;
> > uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
> > uint8_t hdr = 0, uncompleted_pkt = 0;
> > +   uint16_t idx;
> >  
> > /* Get descriptor from available ring */
> > desc = &vq->desc[head[packet_success]];
> > @@ -200,16 +211,18 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t 
> > queue_id,
> > }
> >  
> > /* Update used ring with desc information */
> > -   vq->used->ring[res_cur_idx & (vq->size - 1)].id =
> > -   head[packet_success];
> > +   idx = res_cur_idx & (vq->size - 1);
> > +   vq->used->ring[idx].id = head[packet_success];
> >  
> > /* Drop the packet if it is uncompleted */
> > if (unlikely(uncompleted_pkt == 1))
> > -   vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> > -   vq->vhost_hlen;
> > +   vq->used->ring[idx].len = vq->vhost_hlen;
> > else
> > -   vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> > -   pkt_len + 
> > vq->vhost_hlen;
> > +   vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
> > +
> > +   vhost_log_used_vring(dev, vq,
> > +   offsetof(struct vring_used, ring[idx]),
> > +   sizeof(vq->used->ring[idx]));
> 
> Got a question here:
> 
> I see that we are logging down changes when we are marking
> used_vring. Do we need to log down buffer copy in rte_memcpy() too?
> I am not sure whether I understand it correctly, it seems that this
> is part of DPDK API ops to deliver data to the guest (from, e.g.,
> OVS?), when we do rte_memcpy(), we seems to be modifying guest
> memory too. Am I wrong?

It's done in the next patch.

--yliu


[dpdk-dev] [PATCH v2 1/3] vhost: get rid of linked list dev

2015-12-22 Thread Yuanhan Liu
While we use a single linked list to maintain all devices, we could
use a static array to achieve the same goal, just like what we did
to maintain the eth devices with rte_eth_devices array. This could
simplifies the code a bit.

Signed-off-by: Yuanhan Liu 
Acked-by: Huawei Xie 
---

Note that there is a slight functional change to the old code: this
patch limits the vhost devices to 1024. We could either make it
configurable to increase the limit, or dynamically re-allocate a
bigger array when necessary to totally get rid of the limit.

Need comments on that.
---
 lib/librte_vhost/virtio-net.c | 209 ++
 1 file changed, 50 insertions(+), 159 deletions(-)

diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index de78a0f..2f83438 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -55,18 +55,11 @@
 #include "vhost-net.h"
 #include "virtio-net.h"

-/*
- * Device linked list structure for configuration.
- */
-struct virtio_net_config_ll {
-   struct virtio_net dev;  /* Virtio device.*/
-   struct virtio_net_config_ll *next;  /* Next dev on linked list.*/
-};
+#define MAX_VHOST_DEVICE   1024
+static struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];

 /* device ops to add/remove device to/from data core. */
 struct virtio_net_device_ops const *notify_ops;
-/* root address of the linked list of managed virtio devices */
-static struct virtio_net_config_ll *ll_root;

 #define VHOST_USER_F_PROTOCOL_FEATURES 30

@@ -108,77 +101,17 @@ qva_to_vva(struct virtio_net *dev, uint64_t qemu_va)
 }


-/*
- * Retrieves an entry from the devices configuration linked list.
- */
-static struct virtio_net_config_ll *
-get_config_ll_entry(struct vhost_device_ctx ctx)
-{
-   struct virtio_net_config_ll *ll_dev = ll_root;
-
-   /* Loop through linked list until the device_fh is found. */
-   while (ll_dev != NULL) {
-   if (ll_dev->dev.device_fh == ctx.fh)
-   return ll_dev;
-   ll_dev = ll_dev->next;
-   }
-
-   return NULL;
-}
-
-/*
- * Searches the configuration core linked list and
- * retrieves the device if it exists.
- */
 struct virtio_net *
 get_device(struct vhost_device_ctx ctx)
 {
-   struct virtio_net_config_ll *ll_dev;
-
-   ll_dev = get_config_ll_entry(ctx);
-
-   if (ll_dev)
-   return &ll_dev->dev;
+   struct virtio_net *dev = vhost_devices[ctx.fh];

-   RTE_LOG(ERR, VHOST_CONFIG,
-   "(%"PRIu64") Device not found in linked list.\n", ctx.fh);
-   return NULL;
-}
-
-/*
- * Add entry containing a device to the device configuration linked list.
- */
-static void
-add_config_ll_entry(struct virtio_net_config_ll *new_ll_dev)
-{
-   struct virtio_net_config_ll *ll_dev = ll_root;
-
-   /* If ll_dev == NULL then this is the first device so go to else */
-   if (ll_dev) {
-   /* If the 1st device_fh != 0 then we insert our device here. */
-   if (ll_dev->dev.device_fh != 0) {
-   new_ll_dev->dev.device_fh = 0;
-   new_ll_dev->next = ll_dev;
-   ll_root = new_ll_dev;
-   } else {
-   /*
-* Increment through the ll until we find un unused
-* device_fh. Insert the device at that entry.
-*/
-   while ((ll_dev->next != NULL) &&
-   (ll_dev->dev.device_fh ==
-   (ll_dev->next->dev.device_fh - 1)))
-   ll_dev = ll_dev->next;
-
-   new_ll_dev->dev.device_fh = ll_dev->dev.device_fh + 1;
-   new_ll_dev->next = ll_dev->next;
-   ll_dev->next = new_ll_dev;
-   }
-   } else {
-   ll_root = new_ll_dev;
-   ll_root->dev.device_fh = 0;
+   if (unlikely(!dev)) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "(%"PRIu64") device not found.\n", ctx.fh);
}

+   return dev;
 }

 static void
@@ -217,43 +150,14 @@ cleanup_device(struct virtio_net *dev, int destroy)
  * Release virtqueues and device memory.
  */
 static void
-free_device(struct virtio_net_config_ll *ll_dev)
+free_device(struct virtio_net *dev)
 {
uint32_t i;

-   for (i = 0; i < ll_dev->dev.virt_qp_nb; i++)
-   rte_free(ll_dev->dev.virtqueue[i * VIRTIO_QNUM]);
-
-   rte_free(ll_dev);
-}
+   for (i = 0; i < dev->virt_qp_nb; i++)
+   rte_free(dev->virtqueue[i * VIRTIO_QNUM]);

-/*
- * Remove an entry from the device configuration linked list.
- */
-static struct virtio_net_config_ll *
-rm_config_ll_entry(struct virtio_net_config_ll *ll_dev,
-   struct virtio_net_config_ll *ll_dev_last)
-{
-   /* First remove the device and then clean it up. */
-   if (ll_

[dpdk-dev] [PATCH v2 2/3] vhost: simplify numa_realloc

2015-12-22 Thread Yuanhan Liu
We could first check if we need realloc vq or not, if so,
reallocate it. We then do similar to vhost dev realloc.

This could get rid of the tons of repeated "if (realloc_dev)"
and "if (realloc_vq)" statements, therefore, makes code
a bit more readable.

Signed-off-by: Yuanhan Liu 
---

v2: fix debug message and check return value of all get_mempolicy()
---
 lib/librte_vhost/virtio-net.c | 77 ---
 1 file changed, 36 insertions(+), 41 deletions(-)

diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index 2f83438..1566c93 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -441,64 +441,59 @@ static struct virtio_net*
 numa_realloc(struct virtio_net *dev, int index)
 {
int oldnode, newnode;
-   struct virtio_net *old_dev, *new_dev = NULL;
-   struct vhost_virtqueue *old_vq, *new_vq = NULL;
+   struct virtio_net *old_dev;
+   struct vhost_virtqueue *old_vq, *vq;
int ret;
-   int realloc_dev = 0, realloc_vq = 0;

old_dev = dev;
-   old_vq  = dev->virtqueue[index];
+   vq = old_vq = dev->virtqueue[index];

-   ret  = get_mempolicy(&newnode, NULL, 0, old_vq->desc,
-   MPOL_F_NODE | MPOL_F_ADDR);
-   ret = ret | get_mempolicy(&oldnode, NULL, 0, old_dev,
+   ret = get_mempolicy(&newnode, NULL, 0, old_vq->desc,
MPOL_F_NODE | MPOL_F_ADDR);
+
+   /* check if we need to reallocate vq */
+   ret |= get_mempolicy(&oldnode, NULL, 0, old_vq, MPOL_F_NODE | 
MPOL_F_ADDR);
if (ret) {
RTE_LOG(ERR, VHOST_CONFIG,
-   "Unable to get vring desc or dev numa information.\n");
+   "Unable to get vq numa information.\n");
return dev;
}
-   if (oldnode != newnode)
-   realloc_dev = 1;
+   if (oldnode != newnode) {
+   RTE_LOG(INFO, VHOST_CONFIG,
+   "reallocate vq from %d to %d node\n", oldnode, newnode);
+   vq = rte_malloc_socket(NULL, sizeof(*vq), 0, newnode);
+   if (!vq)
+   return dev;
+
+   memcpy(vq, old_vq, sizeof(*vq));
+   rte_free(old_vq);
+   }

-   ret = get_mempolicy(&oldnode, NULL, 0, old_vq,
-   MPOL_F_NODE | MPOL_F_ADDR);
+   /* check if we need to reallocate dev */
+   ret = get_mempolicy(&oldnode, NULL, 0, old_dev, MPOL_F_NODE | 
MPOL_F_ADDR);
if (ret) {
RTE_LOG(ERR, VHOST_CONFIG,
-   "Unable to get vq numa information.\n");
-   return dev;
+   "Unable to get dev numa information.\n");
+   goto out;
}
-   if (oldnode != newnode)
-   realloc_vq = 1;
-
-   if (realloc_dev == 0 && realloc_vq == 0)
-   return dev;
-
-   if (realloc_dev)
-   new_dev = rte_malloc_socket(NULL,
-   sizeof(struct virtio_net), 0, newnode);
-   if (realloc_vq)
-   new_vq = rte_malloc_socket(NULL,
-   sizeof(struct vhost_virtqueue), 0, newnode);
-   if (!new_dev && !new_vq)
-   return dev;
-
-   if (realloc_vq)
-   memcpy(new_vq, old_vq, sizeof(*new_vq));
-   if (realloc_dev)
-   memcpy(new_dev, old_dev, sizeof(*new_dev));
+   if (oldnode != newnode) {
+   RTE_LOG(INFO, VHOST_CONFIG,
+   "reallocate dev from %d to %d node\n", oldnode, 
newnode);
+   dev = rte_malloc_socket(NULL, sizeof(*dev), 0, newnode);
+   if (!dev) {
+   dev = old_dev;
+   goto out;
+   }

-   (new_dev ? new_dev : old_dev)->virtqueue[index] =
-   new_vq ? new_vq : old_vq;
-   if (realloc_vq)
-   rte_free(old_vq);
-   if (realloc_dev) {
+   memcpy(dev, old_dev, sizeof(*dev));
rte_free(old_dev);
-
-   vhost_devices[new_dev->device_fh] = new_dev;
}

-   return realloc_dev ? new_dev: dev;
+out:
+   dev->virtqueue[index] = vq;
+   vhost_devices[dev->device_fh] = dev;
+
+   return dev;
 }
 #else
 static struct virtio_net*
-- 
1.9.0



[dpdk-dev] [PATCH v2 3/3] vhost: fix vq realloc at numa_realloc

2015-12-22 Thread Yuanhan Liu
vq is allocated on pairs, hence we should do pair reallocation
at numa_realloc() as well, otherwise an error like following
occurs while do numa reallocation:

VHOST_CONFIG: reallocate vq from 0 to 1 node
PANIC in rte_free():
Fatal error: Invalid memory

The reason we don't catch it is because numa_realloc() will
not take effect when RTE_LIBRTE_VHOST_NUMA is not enabled,
which is the default case.

Fixes: e049ca6d10e0 ("vhost-user: prepare multiple queue setup")

Signed-off-by: Yuanhan Liu 
Acked-by: Huawei Xie 
---
 lib/librte_vhost/virtio-net.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index 1566c93..7469312 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -445,6 +445,13 @@ numa_realloc(struct virtio_net *dev, int index)
struct vhost_virtqueue *old_vq, *vq;
int ret;

+   /*
+* vq is allocated on pairs, we should try to do realloc
+* on first queue of one queue pair only.
+*/
+   if (index % VIRTIO_QNUM != 0)
+   return dev;
+
old_dev = dev;
vq = old_vq = dev->virtqueue[index];

@@ -461,11 +468,12 @@ numa_realloc(struct virtio_net *dev, int index)
if (oldnode != newnode) {
RTE_LOG(INFO, VHOST_CONFIG,
"reallocate vq from %d to %d node\n", oldnode, newnode);
-   vq = rte_malloc_socket(NULL, sizeof(*vq), 0, newnode);
+   vq = rte_malloc_socket(NULL, sizeof(*vq) * VIRTIO_QNUM, 0,
+  newnode);
if (!vq)
return dev;

-   memcpy(vq, old_vq, sizeof(*vq));
+   memcpy(vq, old_vq, sizeof(*vq) * VIRTIO_QNUM);
rte_free(old_vq);
}

@@ -491,6 +499,7 @@ numa_realloc(struct virtio_net *dev, int index)

 out:
dev->virtqueue[index] = vq;
+   dev->virtqueue[index + 1] = vq + 1;
vhost_devices[dev->device_fh] = dev;

return dev;
-- 
1.9.0



[dpdk-dev] [PATCH v4 2/6] fm10k: setup rx queue interrupts for PF and VF

2015-12-22 Thread Qiu, Michael
On 12/21/2015 6:20 PM, Shaopeng He wrote:
> In interrupt mode, each rx queue can have one interrupt to notify the up
> layer application when packets are available in that queue. Some queues
> also can share one interrupt.
> Currently, fm10k needs one separate interrupt for mailbox. So, only those
> drivers which support multiple interrupt vectors e.g. vfio-pci can work
> in fm10k interrupt mode.
> This patch uses the RXINT/INT_MAP registers to map interrupt causes
> (rx queue and other events) to vectors, and enable these interrupts
> through kernel drivers like vfio-pci.
>
> Signed-off-by: Shaopeng He 
> Acked-by: Jing Chen 
> ---
>  doc/guides/rel_notes/release_2_3.rst |   2 +
>  drivers/net/fm10k/fm10k.h|   3 ++
>  drivers/net/fm10k/fm10k_ethdev.c | 101 
> +++
>  3 files changed, 95 insertions(+), 11 deletions(-)
>
> diff --git a/doc/guides/rel_notes/release_2_3.rst 
> b/doc/guides/rel_notes/release_2_3.rst
> index 99de186..2cb5ebd 100644
> --- a/doc/guides/rel_notes/release_2_3.rst
> +++ b/doc/guides/rel_notes/release_2_3.rst
> @@ -4,6 +4,8 @@ DPDK Release 2.3
>  New Features
>  
>  
> +* **Added fm10k Rx interrupt support.**
> +
>  
>  Resolved Issues
>  ---
> diff --git a/drivers/net/fm10k/fm10k.h b/drivers/net/fm10k/fm10k.h
> index e2f677a..770d6ba 100644
> --- a/drivers/net/fm10k/fm10k.h
> +++ b/drivers/net/fm10k/fm10k.h
> @@ -129,6 +129,9 @@
>  #define RTE_FM10K_TX_MAX_FREE_BUF_SZ64
>  #define RTE_FM10K_DESCS_PER_LOOP4
>  
> +#define FM10K_MISC_VEC_ID   RTE_INTR_VEC_ZERO_OFFSET
> +#define FM10K_RX_VEC_START  RTE_INTR_VEC_RXTX_OFFSET
> +
>  #define FM10K_SIMPLE_TX_FLAG ((uint32_t)ETH_TXQ_FLAGS_NOMULTSEGS | \
>   ETH_TXQ_FLAGS_NOOFFLOADS)
>  
> diff --git a/drivers/net/fm10k/fm10k_ethdev.c 
> b/drivers/net/fm10k/fm10k_ethdev.c
> index d39c33b..a34c5e2 100644
> --- a/drivers/net/fm10k/fm10k_ethdev.c
> +++ b/drivers/net/fm10k/fm10k_ethdev.c
> @@ -54,6 +54,8 @@
>  /* Number of chars per uint32 type */
>  #define CHARS_PER_UINT32 (sizeof(uint32_t))
>  #define BIT_MASK_PER_UINT32 ((1 << CHARS_PER_UINT32) - 1)
> +/* default 1:1 map from queue ID to interrupt vector ID */
> +#define Q2V(dev, queue_id) (dev->pci_dev->intr_handle.intr_vec[queue_id])
>  
>  static void fm10k_close_mbx_service(struct fm10k_hw *hw);
>  static void fm10k_dev_promiscuous_enable(struct rte_eth_dev *dev);
> @@ -109,6 +111,8 @@ struct fm10k_xstats_name_off 
> fm10k_hw_stats_tx_q_strings[] = {
>  
>  #define FM10K_NB_XSTATS (FM10K_NB_HW_XSTATS + FM10K_MAX_QUEUES_PF * \
>   (FM10K_NB_RX_Q_XSTATS + FM10K_NB_TX_Q_XSTATS))
> +static int
> +fm10k_dev_rxq_interrupt_setup(struct rte_eth_dev *dev);
>  
>  static void
>  fm10k_mbx_initlock(struct fm10k_hw *hw)
> @@ -687,6 +691,7 @@ static int
>  fm10k_dev_rx_init(struct rte_eth_dev *dev)
>  {
>   struct fm10k_hw *hw = FM10K_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> + struct rte_intr_handle *intr_handle = &dev->pci_dev->intr_handle;
>   int i, ret;
>   struct fm10k_rx_queue *rxq;
>   uint64_t base_addr;
> @@ -694,10 +699,23 @@ fm10k_dev_rx_init(struct rte_eth_dev *dev)
>   uint32_t rxdctl = FM10K_RXDCTL_WRITE_BACK_MIN_DELAY;
>   uint16_t buf_size;
>  
> - /* Disable RXINT to avoid possible interrupt */
> - for (i = 0; i < hw->mac.max_queues; i++)
> + /* enable RXINT for interrupt mode */
> + i = 0;
> + if (rte_intr_dp_is_en(intr_handle)) {
> + for (; i < dev->data->nb_rx_queues; i++) {
> + FM10K_WRITE_REG(hw, FM10K_RXINT(i), Q2V(dev, i));
> + if (hw->mac.type == fm10k_mac_pf)
> + FM10K_WRITE_REG(hw, FM10K_ITR(Q2V(dev, i)),
> + FM10K_ITR_AUTOMASK | 
> FM10K_ITR_MASK_CLEAR);
> + else
> + FM10K_WRITE_REG(hw, FM10K_VFITR(Q2V(dev, i)),
> + FM10K_ITR_AUTOMASK | 
> FM10K_ITR_MASK_CLEAR);
> + }
> + }
> + /* Disable other RXINT to avoid possible interrupt */
> + for (; i < hw->mac.max_queues; i++)
>   FM10K_WRITE_REG(hw, FM10K_RXINT(i),
> - 3 << FM10K_RXINT_TIMER_SHIFT);
> + 3 << FM10K_RXINT_TIMER_SHIFT);
>  
>   /* Setup RX queues */
>   for (i = 0; i < dev->data->nb_rx_queues; ++i) {
> @@ -1053,6 +1071,9 @@ fm10k_dev_start(struct rte_eth_dev *dev)
>   return diag;
>   }
>  
> + if (fm10k_dev_rxq_interrupt_setup(dev))
> + return -EIO;
> +
>   diag = fm10k_dev_rx_init(dev);
>   if (diag) {
>   PMD_INIT_LOG(ERR, "RX init failed: %d", diag);
> @@ -2072,7 +2093,7 @@ fm10k_dev_enable_intr_pf(struct rte_eth_dev *dev)
>   uint32_t int_map = FM10K_INT_MAP_IMMEDIATE;
>  
>   /* Bind all local non-queue interrupt to vector 0 */
> - int_map |= 0;
> + int_map |= FM10K

[dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes

2015-12-22 Thread Peter Xu
On Tue, Dec 22, 2015 at 07:07:25AM +, Xie, Huawei wrote:
> On 12/22/2015 2:56 PM, Peter Xu wrote:
> > Got a question here:
> >
> > I see that we are logging down changes when we are marking
> > used_vring. Do we need to log down buffer copy in rte_memcpy() too?
> > I am not sure whether I understand it correctly, it seems that this
> > is part of DPDK API ops to deliver data to the guest (from, e.g.,
> > OVS?), when we do rte_memcpy(), we seems to be modifying guest
> > memory too. Am I wrong?
> >
> > Peter
> 
> desc buffer logging isn't included in v1, but in the patch 4 of this
> patch set, and actually it is the major work in vhost live migration.

Yes, it is. Thanks to point out.

Peter

> --huawei
> 
> [...]
> 


[dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes

2015-12-22 Thread Peter Xu
On Tue, Dec 22, 2015 at 03:13:49PM +0800, Yuanhan Liu wrote:
> On Tue, Dec 22, 2015 at 02:55:52PM +0800, Peter Xu wrote:
> > On Thu, Dec 17, 2015 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > > +static inline void __attribute__((always_inline))
> > > +vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > > +  uint64_t offset, uint64_t len)
> > > +{
> > 
> > One thing optional: I feel it a little bit confusing regarding to
> > the helper function name here, for the reasons:
> > 
> > 1. It more sounds like "logging all the vrings we used", however,
> >what I understand is that, here we are logging dirty pages for
> >guest memory. Or say, there is merely nothing to do directly with
> >vring (although many vring ops might call this function, we are
> >only passing [buf, len] pairs).
> > 
> > 2. It may also let people think of "vring_used", which is part of
> >virtio protocol. However, it does not mean it too.
> 
> Yes, it does. Here log_guest_addr is set to physical address of
> vring_used. (check code at vhost_virtqueue_set_addr() of qemu).
> 
>  627 static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
>  628 struct vhost_virtqueue *vq,
>  629 unsigned idx, bool enable_log)
>  630 {
>  631 struct vhost_vring_addr addr = {
>  632 .index = idx,
>  633 .desc_user_addr = (uint64_t)(unsigned long)vq->desc,
>  634 .avail_user_addr = (uint64_t)(unsigned long)vq->avail,
>  635 .used_user_addr = (uint64_t)(unsigned long)vq->used,
> ==>  636 .log_guest_addr = vq->used_phys,
>  637 .flags = enable_log ? (1 << VHOST_VRING_F_LOG) : 0,
>  638 };
> 
> So, this function does log changes to used vring.

Yes. I have made a mistake.

Thanks!
Peter


[dpdk-dev] [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature

2015-12-22 Thread Peter Xu
On Thu, Dec 17, 2015 at 11:12:00AM +0800, Yuanhan Liu wrote:
> It's actually a feature already enabled in Linux kernel. What we need to
> do is simply to claim that we support such feature, and nothing else.
> 
> With that, the guest will send GARP messages after live migration.
> 
> Signed-off-by: Yuanhan Liu 
> ---
>  lib/librte_vhost/virtio-net.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> index 03044f6..0ba5045 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
>  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
>   (1ULL << VIRTIO_NET_F_CTRL_VQ) | \
>   (1ULL << VIRTIO_NET_F_CTRL_RX) | \
> + (1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \

Do we really need this? I can understand when guest declare with
this VIRTIO_NET_F_GUEST_ANNOUNCE flag. With that, guest itself will
handle the announcement after migration. However, how could I
understand if it's declared by a vhost-user backend? What does it
mean?

In vhost-user.txt (in QEMU repo docs/specs/), the only place that
mentioned this is SEND_RARP:

 * VHOST_USER_SEND_RARP

  Id: 19
  Equivalent ioctl: N/A
  Master payload: u64

  Ask vhost user backend to broadcast a fake RARP to notify the migration
  is terminated for guest that does not support GUEST_ANNOUNCE.
  ...

Here, it mention the GUEST_ANNOUNCE since when guest support this,
we do not need to send SEND_RARP to vhost-user backend again. It
never explain what does it mean when vhost-user declaring to have
this flag...

Thanks.
Peter

>   (VHOST_SUPPORTS_MQ)| \
>   (1ULL << VIRTIO_F_VERSION_1)   | \
>   (1ULL << VHOST_F_LOG_ALL)  | \
> -- 
> 1.9.0
> 


[dpdk-dev] [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature

2015-12-22 Thread Yuanhan Liu
On Tue, Dec 22, 2015 at 04:11:08PM +0800, Peter Xu wrote:
> On Thu, Dec 17, 2015 at 11:12:00AM +0800, Yuanhan Liu wrote:
> > It's actually a feature already enabled in Linux kernel. What we need to
> > do is simply to claim that we support such feature, and nothing else.
> > 
> > With that, the guest will send GARP messages after live migration.
> > 
> > Signed-off-by: Yuanhan Liu 
> > ---
> >  lib/librte_vhost/virtio-net.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> > index 03044f6..0ba5045 100644
> > --- a/lib/librte_vhost/virtio-net.c
> > +++ b/lib/librte_vhost/virtio-net.c
> > @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
> >  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
> > (1ULL << VIRTIO_NET_F_CTRL_VQ) | \
> > (1ULL << VIRTIO_NET_F_CTRL_RX) | \
> > +   (1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
> 
> Do we really need this? I can understand when guest declare with
> this VIRTIO_NET_F_GUEST_ANNOUNCE flag. With that, guest itself will
> handle the announcement after migration. However, how could I
> understand if it's declared by a vhost-user backend? What does it
> mean?
> 
> In vhost-user.txt (in QEMU repo docs/specs/), the only place that
> mentioned this is SEND_RARP:
> 
>  * VHOST_USER_SEND_RARP
> 
>   Id: 19
>   Equivalent ioctl: N/A
>   Master payload: u64
> 
>   Ask vhost user backend to broadcast a fake RARP to notify the migration
>   is terminated for guest that does not support GUEST_ANNOUNCE.
> ...
> 
> Here, it mention the GUEST_ANNOUNCE since when guest support this,
> we do not need to send SEND_RARP to vhost-user backend again. It
> never explain what does it mean when vhost-user declaring to have
> this flag...

The actually work is done in QEMU and guest kernel. QEMU sends a config
interrupt to guest kernel when such flag is enabled after live migration
(see QEMU code virtio_net_load()). Guest kernel generates the GARP
once it receives the interrupt (see the kernel code 
virtnet_config_changed_work()).

Therefore, we could simply claim that we support VIRTIO_NET_F_GUEST_ANNOUNCE
feature and do nothing here.

--yliu


[dpdk-dev] [Question] How pmd virtio works without UIO?

2015-12-22 Thread Peter Xu
On Tue, Dec 22, 2015 at 03:00:29PM +0800, Yuanhan Liu wrote:
> On Tue, Dec 22, 2015 at 11:50:41AM +0800, Peter Xu wrote:
> > Hi,
> > 
> > I got a question related to how virtio pmd driver work without
> > UIO layer.
> > 
> > I see that in virtio PMD driver, DPDK will first try to init the
> > device using UIO interfaces. If it fails, it will try to init by
> > manipulating IO ports directly (see virtio_resource_init()).
> > 
> > For the ioport case, is it okay to do it like this? E.g., in
> > eth_virtio_dev_init(), we are resetting the virtio device, however,
> > this device should still be owned by virtio-pci driver in the
> > kernel.
> > 
> > How is that working? Did I miss anything?
> 
> That's for configuration part: as far as we can read/write the right
> PCI port, virtio pmd configuration will work. Note that on this case,
> uio just provides another way to tell you where the port is.

Will there be any conflict? For example, when we start testpmd in
the guest without UIO (now, guest virtio net device is using
virtio-pci driver), we directly do read/write to IO ports of the
virtio device, reset it, and configure it. During the time,
virtio-pci driver of the virtio device should be still working (we
could see it by doing lspci -k), never knowing that the device is
reset. So... It seems like that both DPDK and virtio-pci are
manipulating the device. After that, for example, when host trigger
interrupt for guest vring, who (DPDK or virtio-pci) will handle it?

I would understand if we unbind the virtio-pci driver before taking
over the device. But it seems not.

Peter

> 
>   --yliu


[dpdk-dev] [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature

2015-12-22 Thread Pavel Fedin
 Hello!

> > diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> > index 03044f6..0ba5045 100644
> > --- a/lib/librte_vhost/virtio-net.c
> > +++ b/lib/librte_vhost/virtio-net.c
> > @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
> >  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
> > (1ULL << VIRTIO_NET_F_CTRL_VQ) | \
> > (1ULL << VIRTIO_NET_F_CTRL_RX) | \
> > +   (1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
> 
> Do we really need this? I can understand when guest declare with
> this VIRTIO_NET_F_GUEST_ANNOUNCE flag. With that, guest itself will
> handle the announcement after migration. However, how could I
> understand if it's declared by a vhost-user backend?

 I guess the documentation is unclear. This is due to way how qemu works. It 
queries vhost-user backend for the features, then offers them to the guest. The 
guest then responds with features FROM THE SUGGESTED SET, which it supports. 
So, if the backend does not claim to support this feature, qemu will not offer 
it to the guest, therefore the guest will not try to activate it.
 I think this is done because this feature is only useful for migration. If 
vhost-user backend does not support migration, it needs neither 
VHOST_USER_SEND_RARP nor guest-side announce.
 Actually, i was thinking about patching qemu once, but... The changeset seemed 
too complicated, and i imagined the situation described in the above sentence, 
so decided to abandon it.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




[dpdk-dev] [Question] How pmd virtio works without UIO?

2015-12-22 Thread Yuanhan Liu
On Tue, Dec 22, 2015 at 04:23:38PM +0800, Peter Xu wrote:
> On Tue, Dec 22, 2015 at 03:00:29PM +0800, Yuanhan Liu wrote:
> > On Tue, Dec 22, 2015 at 11:50:41AM +0800, Peter Xu wrote:
> > > Hi,
> > > 
> > > I got a question related to how virtio pmd driver work without
> > > UIO layer.
> > > 
> > > I see that in virtio PMD driver, DPDK will first try to init the
> > > device using UIO interfaces. If it fails, it will try to init by
> > > manipulating IO ports directly (see virtio_resource_init()).
> > > 
> > > For the ioport case, is it okay to do it like this? E.g., in
> > > eth_virtio_dev_init(), we are resetting the virtio device, however,
> > > this device should still be owned by virtio-pci driver in the
> > > kernel.
> > > 
> > > How is that working? Did I miss anything?
> > 
> > That's for configuration part: as far as we can read/write the right
> > PCI port, virtio pmd configuration will work. Note that on this case,
> > uio just provides another way to tell you where the port is.
> 
> Will there be any conflict? For example, when we start testpmd in
> the guest without UIO (now, guest virtio net device is using
> virtio-pci driver), we directly do read/write to IO ports of the
> virtio device, reset it, and configure it. During the time,
> virtio-pci driver of the virtio device should be still working (we
> could see it by doing lspci -k), never knowing that the device is
> reset. So... It seems like that both DPDK and virtio-pci are
> manipulating the device. After that, for example, when host trigger
> interrupt for guest vring, who (DPDK or virtio-pci) will handle it?
> 
> I would understand if we unbind the virtio-pci driver before taking
> over the device. But it seems not.

Actually, you are right. I mentioned in the last email that this is
for configuration part. To answer your question in this email, you
will not be able to go that further (say initiating virtio pmd) if
you don't unbind the origin virtio-net driver, and bind it to igb_uio
(or something similar).

The start point is from rte_eal_pci_scan, where the sub-function
pci_san_one just initates a DPDK bond driver.

--yliu


[dpdk-dev] [PATCH] hash: fix CRC32c computation

2015-12-22 Thread Didier Pallard
As demonstrated by the following code, CRC32c computation is not valid
when buffer length is not a multiple of 4 bytes:
(Output obtained by code below)

CRC of 1 NULL bytes expected: 0x527d5351
soft: 527d5351
rte accelerated: 48674bc7
rte soft: 48674bc7
CRC of 2 NULL bytes expected: 0xf16177d2
soft: f16177d2
rte accelerated: 48674bc7
rte soft: 48674bc7
CRC of 2x1 NULL bytes expected: 0xf16177d2
soft: f16177d2
rte accelerated: 8c28b28a
rte soft: 8c28b28a
CRC of 3 NULL bytes expected: 0x6064a37a
soft: 6064a37a
rte accelerated: 48674bc7
rte soft: 48674bc7
CRC of 4 NULL bytes expected: 0x48674bc7
soft: 48674bc7
rte accelerated: 48674bc7
rte soft: 48674bc7

Values returned by rte_hash_crc functions does not match the one
computed by a trivial crc32c implementation.

ARM code is a guess, it is not tested, neither compiled.

code showing the problem:

uint8_t null_test[32] = {0};

static uint32_t crc32c_trivial(uint8_t *buffer, uint32_t length, uint32_t crc)
{
uint32_t i, j;
for (i = 0; i < length; ++i)
{
crc = crc ^ buffer[i];
for (j = 0; j < 8; j++)
crc = (crc >> 1) ^ 0x8000 ^ ((~crc & 1) * 0x82f63b78);
}
return crc;
}

void hash_test(void);
void hash_test(void)
{
printf("CRC of 1 nul byte expected: 0x527d5351\n");
printf("soft: %08x\n", crc32c_trivial(null_test, 1, 0));
rte_hash_crc_init_alg();
printf("rte accelerated: %08x\n", ~rte_hash_crc(null_test, 1, 
0x));
rte_hash_crc_set_alg(CRC32_SW);
printf("rte soft: %08x\n", ~rte_hash_crc(null_test, 1, 0x));

printf("CRC of 2 nul bytes expected: 0xf16177d2\n");
printf("soft: %08x\n", crc32c_trivial(null_test, 2, 0));
rte_hash_crc_init_alg();
printf("rte accelerated: %08x\n", ~rte_hash_crc(null_test, 2, 
0x));
rte_hash_crc_set_alg(CRC32_SW);
printf("rte soft: %08x\n", ~rte_hash_crc(null_test, 2, 0x));

printf("CRC of 2x1 nul bytes expected: 0xf16177d2\n");
printf("soft: %08x\n", crc32c_trivial(null_test, 1, 
crc32c_trivial(null_test, 1, 0)));
rte_hash_crc_init_alg();
printf("rte accelerated: %08x\n", ~rte_hash_crc(null_test, 1, 
rte_hash_crc(null_test, 1, 0x)));
rte_hash_crc_set_alg(CRC32_SW);
printf("rte soft: %08x\n", ~rte_hash_crc(null_test, 1, 
rte_hash_crc(null_test, 1, 0x)));

printf("CRC of 3 nul bytes expected: 0x6064a37a\n");
printf("soft: %08x\n", crc32c_trivial(null_test, 3, 0));
rte_hash_crc_init_alg();
printf("rte accelerated: %08x\n", ~rte_hash_crc(null_test, 3, 
0x));
rte_hash_crc_set_alg(CRC32_SW);
printf("rte soft: %08x\n", ~rte_hash_crc(null_test, 3, 0x));

printf("CRC of 4 nul bytes expected: 0x48674bc7\n");
printf("soft: %08x\n", crc32c_trivial(null_test, 4, 0));
rte_hash_crc_init_alg();
printf("rte accelerated: %08x\n", ~rte_hash_crc(null_test, 4, 
0x));
rte_hash_crc_set_alg(CRC32_SW);
printf("rte soft: %08x\n", ~rte_hash_crc(null_test, 4, 0x));
}

Signed-off-by: Didier Pallard 
Acked-by: David Marchand 
---
 lib/librte_hash/rte_crc_arm64.h |  64 
 lib/librte_hash/rte_hash_crc.h  | 125 +++-
 2 files changed, 162 insertions(+), 27 deletions(-)

diff --git a/lib/librte_hash/rte_crc_arm64.h b/lib/librte_hash/rte_crc_arm64.h
index 02e26bc..44ef460 100644
--- a/lib/librte_hash/rte_crc_arm64.h
+++ b/lib/librte_hash/rte_crc_arm64.h
@@ -50,6 +50,28 @@ extern "C" {
 #include 

 static inline uint32_t
+crc32c_arm64_u8(uint8_t data, uint32_t init_val)
+{
+   asm(".arch armv8-a+crc");
+   __asm__ volatile(
+   "crc32cb %w[crc], %w[crc], %b[value]"
+   : [crc] "+r" (init_val)
+   : [value] "r" (data));
+   return init_val;
+}
+
+static inline uint32_t
+crc32c_arm64_u16(uint16_t data, uint32_t init_val)
+{
+   asm(".arch armv8-a+crc");
+   __asm__ volatile(
+   "crc32ch %w[crc], %w[crc], %h[value]"
+   : [crc] "+r" (init_val)
+   : [value] "r" (data));
+   return init_val;
+}
+
+static inline uint32_t
 crc32c_arm64_u32(uint32_t data, uint32_t init_val)
 {
asm(".arch armv8-a+crc");
@@ -103,6 +125,48 @@ rte_hash_crc_init_alg(void)
 }

 /**
+ * Use single crc32 instruction to perform a hash on a 1 byte value.
+ * Fall back to software crc32 implementation in case arm64 crc intrinsics is
+ * not supported
+ *
+ * @param data
+ *   Data to perform hash on.
+ * @param init_val
+ *   Value to initialise hash generator.
+ * @return
+ *   32bit calculated hash value.
+ */
+static inline uint32_t
+rte_hash_crc_1byte(uint8_t data, uint32_t init_val)
+{
+   if (likely(crc32_alg & 

[dpdk-dev] [PATCH v5 1/3] vhost: Add callback and private data for vhost PMD

2015-12-22 Thread Rich Lane
On Mon, Dec 21, 2015 at 9:47 PM, Yuanhan Liu 
wrote:

> On Mon, Dec 21, 2015 at 08:47:28PM -0800, Rich Lane wrote:
> > The queue state change callback is the one new API that needs to be
> > added because
> > normal NICs don't have this behavior.
>
> Again I'd ask, will vring_state_changed() be enough, when above issues
> are resolved: vring_state_changed() will be invoked at new_device()/
> destroy_device(), and of course, ethtool change?


It would be sufficient. It is not a great API though, because it requires
the
application to do the conversion from struct virtio_net to a DPDK port
number,
and from a virtqueue index to a DPDK queue id and direction. Also, the
current
implementation often makes this callback when the vring state has not
actually
changed (enabled -> enabled and disabled -> disabled).

If you're asking about using vring_state_changed() _instead_ of the link
status
event and rte_eth_dev_socket_id(), then yes, it still works. I'd only
consider
that a stopgap until the real ethdev APIs are implemented.

I'd suggest to add RTE_ETH_EVENT_QUEUE_STATE_CHANGE rather than
create another callback registration API.

Perhaps we could merge the basic PMD which I think is pretty solid and then
continue the API discussion with patches to it.


[dpdk-dev] [Question] How pmd virtio works without UIO?

2015-12-22 Thread Peter Xu
On Tue, Dec 22, 2015 at 04:32:46PM +0800, Yuanhan Liu wrote:
> Actually, you are right. I mentioned in the last email that this is
> for configuration part. To answer your question in this email, you
> will not be able to go that further (say initiating virtio pmd) if
> you don't unbind the origin virtio-net driver, and bind it to igb_uio
> (or something similar).
> 
> The start point is from rte_eal_pci_scan, where the sub-function
> pci_san_one just initates a DPDK bond driver.

I am not sure whether I do understand your meaning correctly
(regarding "you willl not be able to go that furture"): The problem
is that, we _can_ run testpmd without unbinding the ports and bind
to UIO or something. What we need to do is boot the guest, reserve
huge pages, and run testpmd (keeping its kernel driver as
"virtio-pci"). In pci_scan_one():

if (!ret) {
if (!strcmp(driver, "vfio-pci"))
dev->kdrv = RTE_KDRV_VFIO;
else if (!strcmp(driver, "igb_uio"))
dev->kdrv = RTE_KDRV_IGB_UIO;
else if (!strcmp(driver, "uio_pci_generic"))
dev->kdrv = RTE_KDRV_UIO_GENERIC;
else
dev->kdrv = RTE_KDRV_UNKNOWN;
} else
dev->kdrv = RTE_KDRV_UNKNOWN;

I think it should be going to RTE_KDRV_UNKNOWN
(driver=="virtio-pci") here. I tried to run IO and it could work,
but I am not sure whether it is safe, and how.

Also, I am not sure whether I need to (at least) unbind the
virtio-pci driver, so that there should have no kernel driver
running for the virtio device before DPDK using it.

Thanks
Peter

> 
>   --yliu


[dpdk-dev] [Question] How pmd virtio works without UIO?

2015-12-22 Thread Xie, Huawei
On 12/22/2015 5:57 PM, Peter Xu wrote:
> On Tue, Dec 22, 2015 at 04:32:46PM +0800, Yuanhan Liu wrote:
>> Actually, you are right. I mentioned in the last email that this is
>> for configuration part. To answer your question in this email, you
>> will not be able to go that further (say initiating virtio pmd) if
>> you don't unbind the origin virtio-net driver, and bind it to igb_uio
>> (or something similar).
>>
>> The start point is from rte_eal_pci_scan, where the sub-function
>> pci_san_one just initates a DPDK bond driver.
> I am not sure whether I do understand your meaning correctly
> (regarding "you willl not be able to go that furture"): The problem
> is that, we _can_ run testpmd without unbinding the ports and bind
> to UIO or something. What we need to do is boot the guest, reserve
> huge pages, and run testpmd (keeping its kernel driver as
> "virtio-pci"). In pci_scan_one():
>
>   if (!ret) {
>   if (!strcmp(driver, "vfio-pci"))
>   dev->kdrv = RTE_KDRV_VFIO;
>   else if (!strcmp(driver, "igb_uio"))
>   dev->kdrv = RTE_KDRV_IGB_UIO;
>   else if (!strcmp(driver, "uio_pci_generic"))
>   dev->kdrv = RTE_KDRV_UIO_GENERIC;
>   else
>   dev->kdrv = RTE_KDRV_UNKNOWN;
>   } else
>   dev->kdrv = RTE_KDRV_UNKNOWN;
>
> I think it should be going to RTE_KDRV_UNKNOWN
> (driver=="virtio-pci") here. I tried to run IO and it could work,
> but I am not sure whether it is safe, and how.

Good catch, peter.
Actually recently customers complain that with this feature, DPDK always
tries to take over this virtio-pci device, which is unwanted behavior.
Using blacklist could workaround this issue.
However, the real serious problem is that kernel driver is still
managing this device.

Changchun, Thomas:
I think we should fix this, but firstly i wonder why using port IO to
get PCI resource is more secure.

>
> Also, I am not sure whether I need to (at least) unbind the
> virtio-pci driver, so that there should have no kernel driver
> running for the virtio device before DPDK using it.
If you unbind, you have no entry under /proc/ioports for virtio IO port.
>
> Thanks
> Peter
>
>>  --yliu



[dpdk-dev] [Question] How pmd virtio works without UIO?

2015-12-22 Thread Xie, Huawei
On 12/22/2015 6:48 PM, Xie, Huawei wrote:
> On 12/22/2015 5:57 PM, Peter Xu wrote:
>> On Tue, Dec 22, 2015 at 04:32:46PM +0800, Yuanhan Liu wrote:
>>> Actually, you are right. I mentioned in the last email that this is
>>> for configuration part. To answer your question in this email, you
>>> will not be able to go that further (say initiating virtio pmd) if
>>> you don't unbind the origin virtio-net driver, and bind it to igb_uio
>>> (or something similar).
>>>
>>> The start point is from rte_eal_pci_scan, where the sub-function
>>> pci_san_one just initates a DPDK bond driver.
>> I am not sure whether I do understand your meaning correctly
>> (regarding "you willl not be able to go that furture"): The problem
>> is that, we _can_ run testpmd without unbinding the ports and bind
>> to UIO or something. What we need to do is boot the guest, reserve
>> huge pages, and run testpmd (keeping its kernel driver as
>> "virtio-pci"). In pci_scan_one():
>>
>>  if (!ret) {
>>  if (!strcmp(driver, "vfio-pci"))
>>  dev->kdrv = RTE_KDRV_VFIO;
>>  else if (!strcmp(driver, "igb_uio"))
>>  dev->kdrv = RTE_KDRV_IGB_UIO;
>>  else if (!strcmp(driver, "uio_pci_generic"))
>>  dev->kdrv = RTE_KDRV_UIO_GENERIC;
>>  else
>>  dev->kdrv = RTE_KDRV_UNKNOWN;
>>  } else
>>  dev->kdrv = RTE_KDRV_UNKNOWN;
>>
>> I think it should be going to RTE_KDRV_UNKNOWN
>> (driver=="virtio-pci") here. I tried to run IO and it could work,
>> but I am not sure whether it is safe, and how.
> Good catch, peter.
> Actually recently customers complain that with this feature, DPDK always
> tries to take over this virtio-pci device, which is unwanted behavior.
> Using blacklist could workaround this issue.
> However, the real serious problem is that kernel driver is still
> managing this device.
>
> Changchun, Thomas:
> I think we should fix this, but firstly i wonder why using port IO to
> get PCI resource is more secure.
>
>> Also, I am not sure whether I need to (at least) unbind the
>> virtio-pci driver, so that there should have no kernel driver
>> running for the virtio device before DPDK using it.
> If you unbind, you have no entry under /proc/ioports for virtio IO port.
>> Thanks
>> Peter
>>
>>> --yliu
>



[dpdk-dev] [PATCH v2] testpmd: fix a bug in tx_vlan set command support

2015-12-22 Thread Wang Xiao W
v2:
* Removed the bug fix unrelated code change to make this patch a pure
  bug fix patch.

* Fixed the "PATCH 1/2" mistake in the patch title, rewrote the subject.

v1:
* Initial version for tx_vlan set command support bug fix.

Wang Xiao W (1):
  testpmd: fix a bug in tx_vlan set command support

 app/test-pmd/cmdline.c | 12 
 app/test-pmd/config.c  | 16 
 2 files changed, 16 insertions(+), 12 deletions(-)

-- 
1.9.3



[dpdk-dev] [PATCH v2] testpmd: fix a bug in tx_vlan set command support

2015-12-22 Thread Wang Xiao W
Now in cmd_tx_vlan_set_parsed function, we check the vlan_offload
capability first, if it's an invalid port_id we'll get a strange
prompt saying "Error, as QinQ has been enabled.". We should always
make sure that we get a valid port_id first before we check other
information. It's the same problem for cmd_tx_vlan_set_qinq_parsed.

Signed-off-by: Wang Xiao W 
---
 app/test-pmd/cmdline.c | 12 
 app/test-pmd/config.c  | 16 
 2 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 73298c9..2adf6ca 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -2952,12 +2952,6 @@ cmd_tx_vlan_set_parsed(void *parsed_result,
   __attribute__((unused)) void *data)
 {
struct cmd_tx_vlan_set_result *res = parsed_result;
-   int vlan_offload = rte_eth_dev_get_vlan_offload(res->port_id);
-
-   if (vlan_offload & ETH_VLAN_EXTEND_OFFLOAD) {
-   printf("Error, as QinQ has been enabled.\n");
-   return;
-   }

tx_vlan_set(res->port_id, res->vlan_id);
 }
@@ -3004,12 +2998,6 @@ cmd_tx_vlan_set_qinq_parsed(void *parsed_result,
__attribute__((unused)) void *data)
 {
struct cmd_tx_vlan_set_qinq_result *res = parsed_result;
-   int vlan_offload = rte_eth_dev_get_vlan_offload(res->port_id);
-
-   if (!(vlan_offload & ETH_VLAN_EXTEND_OFFLOAD)) {
-   printf("Error, as QinQ hasn't been enabled.\n");
-   return;
-   }

tx_qinq_set(res->port_id, res->vlan_id, res->vlan_id_outer);
 }
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 7088f6f..956d29c 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -1839,10 +1839,18 @@ vlan_tpid_set(portid_t port_id, uint16_t tp_id)
 void
 tx_vlan_set(portid_t port_id, uint16_t vlan_id)
 {
+   int vlan_offload;
if (port_id_is_invalid(port_id, ENABLED_WARN))
return;
if (vlan_id_is_invalid(vlan_id))
return;
+
+   vlan_offload = rte_eth_dev_get_vlan_offload(port_id);
+   if (vlan_offload & ETH_VLAN_EXTEND_OFFLOAD) {
+   printf("Error, as QinQ has been enabled.\n");
+   return;
+   }
+
tx_vlan_reset(port_id);
ports[port_id].tx_ol_flags |= TESTPMD_TX_OFFLOAD_INSERT_VLAN;
ports[port_id].tx_vlan_id = vlan_id;
@@ -1851,12 +1859,20 @@ tx_vlan_set(portid_t port_id, uint16_t vlan_id)
 void
 tx_qinq_set(portid_t port_id, uint16_t vlan_id, uint16_t vlan_id_outer)
 {
+   int vlan_offload;
if (port_id_is_invalid(port_id, ENABLED_WARN))
return;
if (vlan_id_is_invalid(vlan_id))
return;
if (vlan_id_is_invalid(vlan_id_outer))
return;
+
+   vlan_offload = rte_eth_dev_get_vlan_offload(port_id);
+   if (!(vlan_offload & ETH_VLAN_EXTEND_OFFLOAD)) {
+   printf("Error, as QinQ hasn't been enabled.\n");
+   return;
+   }
+
tx_vlan_reset(port_id);
ports[port_id].tx_ol_flags |= TESTPMD_TX_OFFLOAD_INSERT_QINQ;
ports[port_id].tx_vlan_id = vlan_id;
-- 
1.9.3



[dpdk-dev] [Question] How pmd virtio works without UIO?

2015-12-22 Thread Peter Xu
On Tue, Dec 22, 2015 at 10:47:21AM +, Xie, Huawei wrote:
> On 12/22/2015 5:57 PM, Peter Xu wrote:
> > On Tue, Dec 22, 2015 at 04:32:46PM +0800, Yuanhan Liu wrote:
> >> Actually, you are right. I mentioned in the last email that this is
> >> for configuration part. To answer your question in this email, you
> >> will not be able to go that further (say initiating virtio pmd) if
> >> you don't unbind the origin virtio-net driver, and bind it to igb_uio
> >> (or something similar).
> >>
> >> The start point is from rte_eal_pci_scan, where the sub-function
> >> pci_san_one just initates a DPDK bond driver.
> > I am not sure whether I do understand your meaning correctly
> > (regarding "you willl not be able to go that furture"): The problem
> > is that, we _can_ run testpmd without unbinding the ports and bind
> > to UIO or something. What we need to do is boot the guest, reserve
> > huge pages, and run testpmd (keeping its kernel driver as
> > "virtio-pci"). In pci_scan_one():
> >
> > if (!ret) {
> > if (!strcmp(driver, "vfio-pci"))
> > dev->kdrv = RTE_KDRV_VFIO;
> > else if (!strcmp(driver, "igb_uio"))
> > dev->kdrv = RTE_KDRV_IGB_UIO;
> > else if (!strcmp(driver, "uio_pci_generic"))
> > dev->kdrv = RTE_KDRV_UIO_GENERIC;
> > else
> > dev->kdrv = RTE_KDRV_UNKNOWN;
> > } else
> > dev->kdrv = RTE_KDRV_UNKNOWN;
> >
> > I think it should be going to RTE_KDRV_UNKNOWN
> > (driver=="virtio-pci") here. I tried to run IO and it could work,
> > but I am not sure whether it is safe, and how.
> 
> Good catch, peter.
> Actually recently customers complain that with this feature, DPDK always
> tries to take over this virtio-pci device, which is unwanted behavior.
> Using blacklist could workaround this issue.
> However, the real serious problem is that kernel driver is still
> managing this device.
> 
> Changchun, Thomas:
> I think we should fix this, but firstly i wonder why using port IO to
> get PCI resource is more secure.

I am not familiar with this, however, shouldn't port IO from
userspace the most dangerous way to access PCI resource?

> 
> >
> > Also, I am not sure whether I need to (at least) unbind the
> > virtio-pci driver, so that there should have no kernel driver
> > running for the virtio device before DPDK using it.
> If you unbind, you have no entry under /proc/ioports for virtio IO port.

I tried to unbind one of the virtio net device, I see the PCI entry
still there.

Before unbind:

[root at vm proc]# lspci -k -s 00:03.0
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
Subsystem: Red Hat, Inc Device 0001
Kernel driver in use: virtio-pci
[root at vm proc]# cat /proc/ioports | grep c060-c07f
  c060-c07f : :00:03.0
c060-c07f : virtio-pci

After unbind:

[root at vm proc]# lspci -k -s 00:03.0
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
Subsystem: Red Hat, Inc Device 0001
[root at vm proc]# cat /proc/ioports | grep c060-c07f
  c060-c07f : :00:03.0

So... does this means that it is an alternative to black list
solution?

Peter

> >
> > Thanks
> > Peter
> >
> >>--yliu
> 


[dpdk-dev] bnx2x pmd performance expectations

2015-12-22 Thread Alexander Belyakov
Hi,

just tried to forward a lot of tiny packets with testpmd (dpdk-2.2.0)
running on Broadcom Corporation NetXtreme II BCM57810S 10 Gigabit Ethernet
(rev 10) adapter. I see forwarding performance of 2.6Mpps instead of
expected 14.8Mpps. What should be done to achieve better results?

Thank you,
Alexander Belyakov


[dpdk-dev] [PATCH v4 1/6] fm10k: implement rx_descriptor_done function

2015-12-22 Thread He, Shaopeng

> -Original Message-
> From: Qiu, Michael
> Sent: Tuesday, December 22, 2015 2:51 PM
> To: He, Shaopeng; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v4 1/6] fm10k: implement
> rx_descriptor_done function
> 
> On 12/21/2015 6:20 PM, Shaopeng He wrote:
> > rx_descriptor_done is used by interrupt mode example application
> > (l3fwd-power) to check rxd DD bit to decide the RX trend, then
> > l3fwd-power will adjust the cpu frequency according to the result.
> >
> > Signed-off-by: Shaopeng He 
> > Acked-by: Jing Chen 
> > ---
> >  drivers/net/fm10k/fm10k.h|  3 +++
> >  drivers/net/fm10k/fm10k_ethdev.c |  1 +
> >  drivers/net/fm10k/fm10k_rxtx.c   | 25 +
> >  3 files changed, 29 insertions(+)
> >
> > diff --git a/drivers/net/fm10k/fm10k.h b/drivers/net/fm10k/fm10k.h
> > index cd38af2..e2f677a 100644
> > --- a/drivers/net/fm10k/fm10k.h
> > +++ b/drivers/net/fm10k/fm10k.h
> > @@ -345,6 +345,9 @@ uint16_t fm10k_recv_pkts(void *rx_queue, struct
> > rte_mbuf **rx_pkts,  uint16_t fm10k_recv_scattered_pkts(void
> *rx_queue,
> > struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> >
> > +int
> > +fm10k_dev_rx_descriptor_done(void *rx_queue, uint16_t offset);
> > +
> >  uint16_t fm10k_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
> > uint16_t nb_pkts);
> >
> > diff --git a/drivers/net/fm10k/fm10k_ethdev.c
> > b/drivers/net/fm10k/fm10k_ethdev.c
> > index e4aed94..d39c33b 100644
> > --- a/drivers/net/fm10k/fm10k_ethdev.c
> > +++ b/drivers/net/fm10k/fm10k_ethdev.c
> > @@ -2435,6 +2435,7 @@ static const struct eth_dev_ops
> fm10k_eth_dev_ops = {
> > .rx_queue_release   = fm10k_rx_queue_release,
> > .tx_queue_setup = fm10k_tx_queue_setup,
> > .tx_queue_release   = fm10k_tx_queue_release,
> > +   .rx_descriptor_done = fm10k_dev_rx_descriptor_done,
> > .reta_update= fm10k_reta_update,
> > .reta_query = fm10k_reta_query,
> > .rss_hash_update= fm10k_rss_hash_update,
> > diff --git a/drivers/net/fm10k/fm10k_rxtx.c
> > b/drivers/net/fm10k/fm10k_rxtx.c index e958865..36d3002 100644
> > --- a/drivers/net/fm10k/fm10k_rxtx.c
> > +++ b/drivers/net/fm10k/fm10k_rxtx.c
> > @@ -369,6 +369,31 @@ fm10k_recv_scattered_pkts(void *rx_queue,
> struct rte_mbuf **rx_pkts,
> > return nb_rcv;
> >  }
> >
> > +int
> > +fm10k_dev_rx_descriptor_done(void *rx_queue, uint16_t offset) {
> > +   volatile union fm10k_rx_desc *rxdp;
> > +   struct fm10k_rx_queue *rxq = rx_queue;
> > +   uint16_t desc;
> > +   int ret;
> > +
> > +   if (unlikely(offset >= rxq->nb_desc)) {
> > +   PMD_DRV_LOG(ERR, "Invalid RX queue id %u", offset);
> 
> Sorry, here makes my confuse: offset for RX queue id?

Good catch, will fix it in next version

> 
> > +   return 0;
> > +   }
> > +
> > +   desc = rxq->next_dd + offset;
> > +   if (desc >= rxq->nb_desc)
> > +   desc -= rxq->nb_desc;
> > +
> > +   rxdp = &rxq->hw_ring[desc];
> > +
> > +   ret = !!(rxdp->w.status &
> > +   rte_cpu_to_le_16(FM10K_RXD_STATUS_DD));
> > +
> > +   return ret;
> > +}
> > +
> >  static inline void tx_free_descriptors(struct fm10k_tx_queue *q)  {
> > uint16_t next_rs, count = 0;



[dpdk-dev] [PATCH v4 3/6] fm10k: remove rx queue interrupts when dev stops

2015-12-22 Thread He, Shaopeng
> -Original Message-
> From: Qiu, Michael
> Sent: Tuesday, December 22, 2015 2:55 PM
> To: He, Shaopeng; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v4 3/6] fm10k: remove rx queue interrupts
> when dev stops
> 
> On 12/21/2015 6:20 PM, Shaopeng He wrote:
> > Previous dev_stop function stops the rx/tx queues. This patch adds
> > logic to disable rx queue interrupt, clean the datapath event and
> queue/vec map.
> >
> > Signed-off-by: Shaopeng He 
> > Acked-by: Jing Chen 
> > ---
> >  drivers/net/fm10k/fm10k_ethdev.c | 22 ++
> >  1 file changed, 22 insertions(+)
> >
> > diff --git a/drivers/net/fm10k/fm10k_ethdev.c
> > b/drivers/net/fm10k/fm10k_ethdev.c
> > index a34c5e2..b5b809c 100644
> > --- a/drivers/net/fm10k/fm10k_ethdev.c
> > +++ b/drivers/net/fm10k/fm10k_ethdev.c
> > @@ -1125,6 +1125,8 @@ fm10k_dev_start(struct rte_eth_dev *dev)  static
> > void  fm10k_dev_stop(struct rte_eth_dev *dev)  {
> > +   struct fm10k_hw *hw = FM10K_DEV_PRIVATE_TO_HW(dev->data-
> >dev_private);
> > +   struct rte_intr_handle *intr_handle = &dev->pci_dev->intr_handle;
> > int i;
> >
> > PMD_INIT_FUNC_TRACE();
> > @@ -1136,6 +1138,26 @@ fm10k_dev_stop(struct rte_eth_dev *dev)
> > if (dev->data->rx_queues)
> > for (i = 0; i < dev->data->nb_rx_queues; i++)
> > fm10k_dev_rx_queue_stop(dev, i);
> > +
> > +   /* Disable datapath event */
> > +   if (rte_intr_dp_is_en(intr_handle)) {
> > +   for (i = 0; i < dev->data->nb_rx_queues; i++) {
> > +   FM10K_WRITE_REG(hw, FM10K_RXINT(i),
> > +   3 << FM10K_RXINT_TIMER_SHIFT);
> > +   if (hw->mac.type == fm10k_mac_pf)
> > +   FM10K_WRITE_REG(hw,
> FM10K_ITR(Q2V(dev, i)),
> > +   FM10K_ITR_MASK_SET);
> > +   else
> > +   FM10K_WRITE_REG(hw,
> FM10K_VFITR(Q2V(dev, i)),
> > +   FM10K_ITR_MASK_SET);
> > +   }
> > +   }
> > +   /* Clean datapath event and queue/vec mapping */
> > +   rte_intr_efd_disable(intr_handle);
> > +   if (intr_handle->intr_vec != NULL) {
> 
> This line could be removed, because rte_free already do the check, see
> below:
> void rte_free(void *addr)
> {
> if (addr == NULL) return;
> if (malloc_elem_free(malloc_elem_from_data(addr)) < 0)
> rte_panic("Fatal error: Invalid memory\n"); }

Yes, it could be removed. Thanks.

> 
> > +   rte_free(intr_handle->intr_vec);
> > +   intr_handle->intr_vec = NULL;
> > +   }
> >  }
> >
> >  static void



[dpdk-dev] [PATCH v2 7/7] doc: release note update for fm10k intr mode

2015-12-22 Thread He, Shaopeng

> -Original Message-
> From: Qiu, Michael
> Sent: Tuesday, December 22, 2015 3:00 PM
> To: He, Shaopeng; dev at dpdk.org
> Cc: Chen, Jing D
> Subject: Re: [PATCH v2 7/7] doc: release note update for fm10k intr mode
> 
> On 10/26/2015 11:48 AM, He, Shaopeng wrote:
> > Signed-off-by: Shaopeng He 
> > ---
> >  doc/guides/rel_notes/release_2_2.rst | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/doc/guides/rel_notes/release_2_2.rst
> > b/doc/guides/rel_notes/release_2_2.rst
> > index 73dba47..44b3aea 100644
> > --- a/doc/guides/rel_notes/release_2_2.rst
> > +++ b/doc/guides/rel_notes/release_2_2.rst
> > @@ -5,11 +5,13 @@ New Features
> >  
> >
> >  * Support interrupt mode on i40e
> > +* Support interrupt mode on fm10k
> >
> 
> Do we need to declare the limitation like only VFIO support for RRC, and
> other limitations like could not work on the management port for BR(maybe
> I'm wrong, but need BAR4 to run Testpoint)

As we discussed, we will update limitations together in nics/fm10k.rst.

> 
> >  Resolved Issues
> >  ---
> >
> >  * Fix ixgbe/igb rx interrupt compatible issue with mbox
> > +* Fix l3fwd-power memory leak for non-ip packet
> >
> >  Known Issues
> >  



[dpdk-dev] [PATCH] mk: Fix examples install path

2015-12-22 Thread Christian Ehrhardt
Depending on non-doc targets being built before and the setting of DESTDIR
the examples dir could in some cases not end up in the right target.
Reason is just a typo variable reference in the copy target.

Signed-off-by: Christian Ehrhardt 
---

[diffstat]
 rte.sdkinstall.mk |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

[diff]
diff --git a/mk/rte.sdkinstall.mk b/mk/rte.sdkinstall.mk
index c159bf7..0db15c6 100644
--- a/mk/rte.sdkinstall.mk
+++ b/mk/rte.sdkinstall.mk
@@ -157,4 +157,4 @@ ifneq ($(wildcard $O/doc/*/*/*pdf),)
$(Q)$(call rte_mkdir, $(DESTDIR)$(docdir)/guides)
$(Q)cp -a $O/doc/*/*/*pdf $(DESTDIR)$(docdir)/guides
 endif
-   $(Q)cp -a $(RTE_SDK)/examples $(DESTDIR)$(datadir)
+   $(Q)cp -a $(RTE_SDK)/examples $(DESTDIR)$(docdir)


[dpdk-dev] [Question] How pmd virtio works without UIO?

2015-12-22 Thread Xie, Huawei
On 12/22/2015 7:39 PM, Peter Xu wrote:
> On Tue, Dec 22, 2015 at 10:47:21AM +, Xie, Huawei wrote:
>> On 12/22/2015 5:57 PM, Peter Xu wrote:
>>> On Tue, Dec 22, 2015 at 04:32:46PM +0800, Yuanhan Liu wrote:
 Actually, you are right. I mentioned in the last email that this is
 for configuration part. To answer your question in this email, you
 will not be able to go that further (say initiating virtio pmd) if
 you don't unbind the origin virtio-net driver, and bind it to igb_uio
 (or something similar).

 The start point is from rte_eal_pci_scan, where the sub-function
 pci_san_one just initates a DPDK bond driver.
>>> I am not sure whether I do understand your meaning correctly
>>> (regarding "you willl not be able to go that furture"): The problem
>>> is that, we _can_ run testpmd without unbinding the ports and bind
>>> to UIO or something. What we need to do is boot the guest, reserve
>>> huge pages, and run testpmd (keeping its kernel driver as
>>> "virtio-pci"). In pci_scan_one():
>>>
>>> if (!ret) {
>>> if (!strcmp(driver, "vfio-pci"))
>>> dev->kdrv = RTE_KDRV_VFIO;
>>> else if (!strcmp(driver, "igb_uio"))
>>> dev->kdrv = RTE_KDRV_IGB_UIO;
>>> else if (!strcmp(driver, "uio_pci_generic"))
>>> dev->kdrv = RTE_KDRV_UIO_GENERIC;
>>> else
>>> dev->kdrv = RTE_KDRV_UNKNOWN;
>>> } else
>>> dev->kdrv = RTE_KDRV_UNKNOWN;
>>>
>>> I think it should be going to RTE_KDRV_UNKNOWN
>>> (driver=="virtio-pci") here. I tried to run IO and it could work,
>>> but I am not sure whether it is safe, and how.
>> Good catch, peter.
>> Actually recently customers complain that with this feature, DPDK always
>> tries to take over this virtio-pci device, which is unwanted behavior.
>> Using blacklist could workaround this issue.
>> However, the real serious problem is that kernel driver is still
>> managing this device.
>>
>> Changchun, Thomas:
>> I think we should fix this, but firstly i wonder why using port IO to
>> get PCI resource is more secure.
> I am not familiar with this, however, shouldn't port IO from
> userspace the most dangerous way to access PCI resource?
>
>>> Also, I am not sure whether I need to (at least) unbind the
>>> virtio-pci driver, so that there should have no kernel driver
>>> running for the virtio device before DPDK using it.
>> If you unbind, you have no entry under /proc/ioports for virtio IO port.
> I tried to unbind one of the virtio net device, I see the PCI entry
> still there.
>
> Before unbind:
>
> [root at vm proc]# lspci -k -s 00:03.0
> 00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
> Subsystem: Red Hat, Inc Device 0001
> Kernel driver in use: virtio-pci
> [root at vm proc]# cat /proc/ioports | grep c060-c07f
>   c060-c07f : :00:03.0
> c060-c07f : virtio-pci
>
> After unbind:
>
> [root at vm proc]# lspci -k -s 00:03.0
> 00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
> Subsystem: Red Hat, Inc Device 0001
> [root at vm proc]# cat /proc/ioports | grep c060-c07f
>   c060-c07f : :00:03.0
>
> So... does this means that it is an alternative to black list
> solution?
Users wants to use virtio-net device for normal communication, so they
still want virtio-net driver to manipulate this device. They don't want
DPDK to take over this device blindly.
>
> Peter
>
>>> Thanks
>>> Peter
>>>
--yliu



[dpdk-dev] [PATCH v2 2/3] vhost: simplify numa_realloc

2015-12-22 Thread Xie, Huawei
On 12/22/2015 3:27 PM, Yuanhan Liu wrote:
> We could first check if we need realloc vq or not, if so,
> reallocate it. We then do similar to vhost dev realloc.
>
> This could get rid of the tons of repeated "if (realloc_dev)"
> and "if (realloc_vq)" statements, therefore, makes code
> a bit more readable.
>
> Signed-off-by: Yuanhan Liu 

Acked-by: Huawei Xie 
[...]


[dpdk-dev] [Question] How pmd virtio works without UIO?

2015-12-22 Thread Xie, Huawei
On 12/22/2015 7:39 PM, Peter Xu wrote:
> On Tue, Dec 22, 2015 at 10:47:21AM +, Xie, Huawei wrote:
>> On 12/22/2015 5:57 PM, Peter Xu wrote:
>>> On Tue, Dec 22, 2015 at 04:32:46PM +0800, Yuanhan Liu wrote:
 Actually, you are right. I mentioned in the last email that this is
 for configuration part. To answer your question in this email, you
 will not be able to go that further (say initiating virtio pmd) if
 you don't unbind the origin virtio-net driver, and bind it to igb_uio
 (or something similar).

 The start point is from rte_eal_pci_scan, where the sub-function
 pci_san_one just initates a DPDK bond driver.
>>> I am not sure whether I do understand your meaning correctly
>>> (regarding "you willl not be able to go that furture"): The problem
>>> is that, we _can_ run testpmd without unbinding the ports and bind
>>> to UIO or something. What we need to do is boot the guest, reserve
>>> huge pages, and run testpmd (keeping its kernel driver as
>>> "virtio-pci"). In pci_scan_one():
>>>
>>> if (!ret) {
>>> if (!strcmp(driver, "vfio-pci"))
>>> dev->kdrv = RTE_KDRV_VFIO;
>>> else if (!strcmp(driver, "igb_uio"))
>>> dev->kdrv = RTE_KDRV_IGB_UIO;
>>> else if (!strcmp(driver, "uio_pci_generic"))
>>> dev->kdrv = RTE_KDRV_UIO_GENERIC;
>>> else
>>> dev->kdrv = RTE_KDRV_UNKNOWN;
>>> } else
>>> dev->kdrv = RTE_KDRV_UNKNOWN;
>>>
>>> I think it should be going to RTE_KDRV_UNKNOWN
>>> (driver=="virtio-pci") here. I tried to run IO and it could work,
>>> but I am not sure whether it is safe, and how.
>> Good catch, peter.
>> Actually recently customers complain that with this feature, DPDK always
>> tries to take over this virtio-pci device, which is unwanted behavior.
>> Using blacklist could workaround this issue.
>> However, the real serious problem is that kernel driver is still
>> managing this device.
>>
>> Changchun, Thomas:
>> I think we should fix this, but firstly i wonder why using port IO to
>> get PCI resource is more secure.
> I am not familiar with this, however, shouldn't port IO from
> userspace the most dangerous way to access PCI resource?
>
>>> Also, I am not sure whether I need to (at least) unbind the
>>> virtio-pci driver, so that there should have no kernel driver
>>> running for the virtio device before DPDK using it.
>> If you unbind, you have no entry under /proc/ioports for virtio IO port.
> I tried to unbind one of the virtio net device, I see the PCI entry
> still there.
>
> Before unbind:
>
> [root at vm proc]# lspci -k -s 00:03.0
> 00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
> Subsystem: Red Hat, Inc Device 0001
> Kernel driver in use: virtio-pci
> [root at vm proc]# cat /proc/ioports | grep c060-c07f
>   c060-c07f : :00:03.0
> c060-c07f : virtio-pci
>
> After unbind:
>
> [root at vm proc]# lspci -k -s 00:03.0
> 00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
> Subsystem: Red Hat, Inc Device 0001
> [root at vm proc]# cat /proc/ioports | grep c060-c07f
>   c060-c07f : :00:03.0
>
> So... does this means that it is an alternative to black list
> solution?
Oh, we could firstly check if this port is manipulated by kernel driver
in virtio_resource_init/eth_virtio_dev_init, as long as it is not too late.
>
> Peter
>
>>> Thanks
>>> Peter
>>>
--yliu



[dpdk-dev] [PATCH v3] vfio: Include No-IOMMU mode

2015-12-22 Thread Alex Williamson
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system.  There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines.  However, there are still those users
that want userspace drivers even under those conditions.  The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has.  In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
should make it very clear that this mode is not safe.  Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode.  Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported.  This patch includes no-iommu support for the vfio-pci bus
driver only.

Signed-off-by: Alex Williamson 
Acked-by: Michael S. Tsirkin 
---

v3: Version 2 was dropped from kernel v4.4 due to lack of a user.  We
now have a working DPDK port to this interface, so I'm proposing
it again for v4.5.  The changes since v2 can be found split out
in the dpdk archive here:

http://dpdk.org/ml/archives/dev/2015-December/030561.html

The problem was that the NOIOMMU extension was only advertised
once a group was attached to a container.  While we want the
no-iommu backed to be used exclusively for no-iommu groups, we
should still advertise it when the module option is enabled.
Handling the no-iommu iommu driver less as a special case
accomplishes this.  Also fixed a mismatch in naming between module
parameter and description and tagged a struct as const.


 drivers/vfio/Kconfig|   15 
 drivers/vfio/pci/vfio_pci.c |8 +-
 drivers/vfio/vfio.c |  181 ++-
 include/linux/vfio.h|3 +
 include/uapi/linux/vfio.h   |7 ++
 5 files changed, 207 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 850d86c..da6e2ce 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -31,6 +31,21 @@ menuconfig VFIO

  If you don't know what to do here, say N.

+menuconfig VFIO_NOIOMMU
+   bool "VFIO No-IOMMU support"
+   depends on VFIO
+   help
+ VFIO is built on the ability to isolate devices using the IOMMU.
+ Only with an IOMMU can userspace access to DMA capable devices be
+ considered secure.  VFIO No-IOMMU mode enables IOMMU groups for
+ devices without IOMMU backing for the purpose of re-using the VFIO
+ infrastructure in a non-secure mode.  Use of this mode will result
+ in an unsupportable kernel and will therefore taint the kernel.
+ Device assignment to virtual machines is also not possible with
+ this mode since there is no IOMMU to provide DMA translation.
+
+ If you don't know what to do here, say N.
+
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 56bf6db..2760a7b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -940,13 +940,13 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
return -EINVAL;

-   group = iommu_group_get(&pdev->dev);
+   group = vfio_iommu_group_get(&pdev->dev);
if (!group)
return -EINVAL;

vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
if (!vdev) {
-   iommu_group_put(group);
+   vfio_iommu_group_put(group, &pdev->dev);
return -ENOMEM;
}

@@ -957,7 +957,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)

ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
if (ret) {
-   iommu_group_put(group);
+   vfio_iommu_group_put(group, &pdev->dev);
kfree(vdev);
return ret;
}
@@ -993,7 +993,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
if (!vdev)
return;

-   iommu_group_put(pdev->dev.iommu_group);
+   vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
kfree(vdev);

if (vfio_pci_is_vga(pdev)) {
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6070b79..5c7ebf2 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -62,6 +62,7 @@ struct vfio_container {
struct rw_semaphore 

[dpdk-dev] SIGILL while calling ixgbevf_vlan_filter_set from secondary process

2015-12-22 Thread Shaham Fridenberg
Hey guys,

I'm using dpdk 2.1.0, and get SIGILL|SIGSEGV when calling 
ixgbevf_vlan_filter_set from secondary process.

Is it possible to do that in the first place?

Thanks,
Shaham


[dpdk-dev] [PATCH v4] vfio: Include No-IOMMU mode

2015-12-22 Thread Alex Williamson
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system.  There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines.  However, there are still those users
that want userspace drivers even under those conditions.  The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has.  In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
should make it very clear that this mode is not safe.  Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode.  Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported.  This patch includes no-iommu support for the vfio-pci bus
driver only.

Signed-off-by: Alex Williamson 
Acked-by: Michael S. Tsirkin 
---

v4: Fix build without CONFIG_VFIO_NOIOMMU (oops).  Also avoid local
noiommu variable in vfio_create_group() to avoid scope confusion
with global of the same name.

 drivers/vfio/Kconfig|   15 
 drivers/vfio/pci/vfio_pci.c |8 +-
 drivers/vfio/vfio.c |  184 ++-
 include/linux/vfio.h|3 +
 include/uapi/linux/vfio.h   |7 ++
 5 files changed, 210 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 850d86c..da6e2ce 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -31,6 +31,21 @@ menuconfig VFIO

  If you don't know what to do here, say N.

+menuconfig VFIO_NOIOMMU
+   bool "VFIO No-IOMMU support"
+   depends on VFIO
+   help
+ VFIO is built on the ability to isolate devices using the IOMMU.
+ Only with an IOMMU can userspace access to DMA capable devices be
+ considered secure.  VFIO No-IOMMU mode enables IOMMU groups for
+ devices without IOMMU backing for the purpose of re-using the VFIO
+ infrastructure in a non-secure mode.  Use of this mode will result
+ in an unsupportable kernel and will therefore taint the kernel.
+ Device assignment to virtual machines is also not possible with
+ this mode since there is no IOMMU to provide DMA translation.
+
+ If you don't know what to do here, say N.
+
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 56bf6db..2760a7b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -940,13 +940,13 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
return -EINVAL;

-   group = iommu_group_get(&pdev->dev);
+   group = vfio_iommu_group_get(&pdev->dev);
if (!group)
return -EINVAL;

vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
if (!vdev) {
-   iommu_group_put(group);
+   vfio_iommu_group_put(group, &pdev->dev);
return -ENOMEM;
}

@@ -957,7 +957,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)

ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
if (ret) {
-   iommu_group_put(group);
+   vfio_iommu_group_put(group, &pdev->dev);
kfree(vdev);
return ret;
}
@@ -993,7 +993,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
if (!vdev)
return;

-   iommu_group_put(pdev->dev.iommu_group);
+   vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
kfree(vdev);

if (vfio_pci_is_vga(pdev)) {
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6070b79..82f25cc 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -62,6 +62,7 @@ struct vfio_container {
struct rw_semaphore group_lock;
struct vfio_iommu_driver*iommu_driver;
void*iommu_data;
+   boolnoiommu;
 };

 struct vfio_unbound_dev {
@@ -84,6 +85,7 @@ struct vfio_group {
struct list_headunbound_list;
struct mutexunbound_lock;
atomic_topened;
+   boolnoiommu;
 };

 struct vfio_device {
@@ -95,6 +97,128 @@ struct vfio_device {
void*device_data;
 };

+#ifdef CONFI

[dpdk-dev] [PATCH 0/2] Some enhancement of the tilegx mpipe driver

2015-12-22 Thread Liming Sun
1. Implement rte_vect.h and enable LPM lookup;
2. It also has some code cleanup and fixes for potential crash when quiting pmd
   under high traffic;
3. Changes have dependency on serie (http://dpdk.org/dev/patchwork/patch/9571)
   which is pending in the patchwork queue.

Liming Sun (2):
  driver/net/mpipe: add rte_vect.h and enable CONFIG_RTE_LIBRTE_LPM
  driver/net/mpipe: fix the crash/hung issue when testpmd quits

 config/defconfig_tile-tilegx-linuxapp-gcc  |2 +-
 drivers/net/mpipe/mpipe_tilegx.c   |   64 ++
 .../common/include/arch/tile/rte_rwlock.h  |1 +
 lib/librte_eal/common/include/arch/tile/rte_vect.h |   93 
 4 files changed, 124 insertions(+), 36 deletions(-)
 create mode 100644 lib/librte_eal/common/include/arch/tile/rte_vect.h



[dpdk-dev] [PATCH 1/2] driver/net/mpipe: add rte_vect.h and enable CONFIG_RTE_LIBRTE_LPM

2015-12-22 Thread Liming Sun
rte_vect.h was missing earlier thus LPM was disabled and l3fwd is
not able to compile. This commit implements the vector api and
enable LPM in the tilegx configuration by default. It also includes
a minor optimization to use __insn_fetchadd4() instead of
rte_atomic32_xxx() in mpipe_dp_enter/mpipe_dp_exit to avoid the
unnecessary memory fence.

Signed-off-by: Liming Sun 
---
 config/defconfig_tile-tilegx-linuxapp-gcc  |2 +-
 drivers/net/mpipe/mpipe_tilegx.c   |   18 +++-
 lib/librte_eal/common/include/arch/tile/rte_vect.h |   93 
 3 files changed, 107 insertions(+), 6 deletions(-)
 create mode 100644 lib/librte_eal/common/include/arch/tile/rte_vect.h

diff --git a/config/defconfig_tile-tilegx-linuxapp-gcc 
b/config/defconfig_tile-tilegx-linuxapp-gcc
index fb61bcd..39794f6 100644
--- a/config/defconfig_tile-tilegx-linuxapp-gcc
+++ b/config/defconfig_tile-tilegx-linuxapp-gcc
@@ -64,7 +64,7 @@ CONFIG_RTE_LIBRTE_ENIC_PMD=n

 # This following libraries are not available on the tile architecture.
 # So they're turned off.
-CONFIG_RTE_LIBRTE_LPM=n
+CONFIG_RTE_LIBRTE_LPM=y
 CONFIG_RTE_LIBRTE_ACL=n
 CONFIG_RTE_LIBRTE_SCHED=n
 CONFIG_RTE_LIBRTE_PORT=n
diff --git a/drivers/net/mpipe/mpipe_tilegx.c b/drivers/net/mpipe/mpipe_tilegx.c
index 5845511..8d006fa 100644
--- a/drivers/net/mpipe/mpipe_tilegx.c
+++ b/drivers/net/mpipe/mpipe_tilegx.c
@@ -451,13 +451,13 @@ static inline void
 mpipe_dp_enter(struct mpipe_dev_priv *priv)
 {
__insn_mtspr(SPR_DSTREAM_PF, 0);
-   rte_atomic32_inc(&priv->dp_count);
+   __insn_fetchadd4(&priv->dp_count, 1);
 }

 static inline void
 mpipe_dp_exit(struct mpipe_dev_priv *priv)
 {
-   rte_atomic32_dec(&priv->dp_count);
+   __insn_fetchadd4(&priv->dp_count, -1);
 }

 static inline void
@@ -484,12 +484,20 @@ mpipe_recv_mbuf(struct mpipe_dev_priv *priv, 
gxio_mpipe_idesc_t *idesc,
uint16_t size = gxio_mpipe_idesc_get_xfer_size(idesc);
struct rte_mbuf *mbuf = RTE_PTR_SUB(va, priv->rx_offset);

-   rte_pktmbuf_reset(mbuf);
mbuf->data_off = (uintptr_t)va - (uintptr_t)mbuf->buf_addr;
-   mbuf->port = in_port;
-   mbuf->data_len = size;
+   mbuf->nb_segs = 1;
+   mbuf->port = in_port;
+   mbuf->ol_flags = 0;
+   if (gxio_mpipe_idesc_get_ethertype(idesc) == ETHER_TYPE_IPv4)
+   mbuf->packet_type = RTE_PTYPE_L3_IPV4 | RTE_PTYPE_L2_ETHER;
+   else if (gxio_mpipe_idesc_get_ethertype(idesc) == ETHER_TYPE_IPv6)
+   mbuf->packet_type = RTE_PTYPE_L3_IPV6 | RTE_PTYPE_L2_ETHER;
+   else
+   mbuf->packet_type = RTE_PTYPE_UNKNOWN;
mbuf->pkt_len  = size;
+   mbuf->data_len = size;
mbuf->hash.rss = gxio_mpipe_idesc_get_flow_hash(idesc);
+   mbuf->next = NULL;

PMD_DEBUG_RX("%s: RX mbuf %p, buffer %p, buf_addr %p, size %d\n",
 mpipe_name(priv), mbuf, va, mbuf->buf_addr, size);
diff --git a/lib/librte_eal/common/include/arch/tile/rte_vect.h 
b/lib/librte_eal/common/include/arch/tile/rte_vect.h
new file mode 100644
index 000..32d768a
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/tile/rte_vect.h
@@ -0,0 +1,93 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) EZchip Semiconductor Ltd. 2015.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the
+ *   distribution.
+ * * Neither the name of EZchip Semiconductor nor the names of its
+ *   contributors may be used to endorse or promote products derived
+ *   from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifndef _RTE_VECT_H_
+#define _RTE_VECT_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+typedef __int128 __m128i;
+
+#def

[dpdk-dev] [PATCH 2/2] driver/net/mpipe: fix the crash/hung issue when testpmd quits

2015-12-22 Thread Liming Sun
1. Fixed the compiling issue of the ethtool example on tilegx
   platform.
2. Fixed the hung/crash issue when quitting testpmd under high
   traffic rate. The buffer error bit needs to be checked before
   processing the idesc and releasing the buffer. Code logic is
   also simplified.

Signed-off-by: Liming Sun 
---
 drivers/net/mpipe/mpipe_tilegx.c   |   46 +++-
 .../common/include/arch/tile/rte_rwlock.h  |1 +
 2 files changed, 17 insertions(+), 30 deletions(-)

diff --git a/drivers/net/mpipe/mpipe_tilegx.c b/drivers/net/mpipe/mpipe_tilegx.c
index 8d006fa..4cb54c3 100644
--- a/drivers/net/mpipe/mpipe_tilegx.c
+++ b/drivers/net/mpipe/mpipe_tilegx.c
@@ -134,7 +134,6 @@ struct mpipe_dev_priv {
struct rte_mempool *rx_mpool;   /* mpool used by the rx queues. */
unsigned rx_offset; /* Receive head room. */
unsigned rx_size_code;  /* mPIPE rx buffer size code. */
-   unsigned rx_buffers;/* receive buffers on stack. */
int is_xaui:1,  /* Is this an xgbe or gbe? */
initialized:1,  /* Initialized port? */
running:1;  /* Running port? */
@@ -529,7 +528,6 @@ mpipe_recv_fill_stack(struct mpipe_dev_priv *priv, int 
count)
mpipe_recv_push(priv, mbuf);
}

-   priv->rx_buffers += count;
PMD_DEBUG_RX("%s: Filled %d/%d buffers\n", mpipe_name(priv), i, count);
 }

@@ -539,10 +537,9 @@ mpipe_recv_flush_stack(struct mpipe_dev_priv *priv)
const int offset = priv->rx_offset & ~RTE_MEMPOOL_ALIGN_MASK;
uint8_t in_port = priv->port_id;
struct rte_mbuf *mbuf;
-   unsigned count;
void *va;

-   for (count = 0; count < priv->rx_buffers; count++) {
+   while (1) {
va = gxio_mpipe_pop_buffer(priv->context, priv->stack);
if (!va)
break;
@@ -561,10 +558,6 @@ mpipe_recv_flush_stack(struct mpipe_dev_priv *priv)

__rte_mbuf_raw_free(mbuf);
}
-
-   PMD_DEBUG_RX("%s: Returned %d/%d buffers\n",
-mpipe_name(priv), count, priv->rx_buffers);
-   priv->rx_buffers -= count;
 }

 static void
@@ -1246,31 +1239,23 @@ mpipe_recv_flush(struct mpipe_dev_priv *priv)
gxio_mpipe_iqueue_t *iqueue;
gxio_mpipe_idesc_t idesc;
struct rte_mbuf *mbuf;
-   int retries = 0;
unsigned queue;

-   do {
-   mpipe_recv_flush_stack(priv);
-
-   /* Flush packets sitting in recv queues. */
-   for (queue = 0; queue < priv->nb_rx_queues; queue++) {
-   rx_queue = mpipe_rx_queue(priv, queue);
-   iqueue = &rx_queue->iqueue;
-   while (gxio_mpipe_iqueue_try_get(iqueue, &idesc) >= 0) {
-   mbuf = mpipe_recv_mbuf(priv, &idesc, in_port);
-   rte_pktmbuf_free(mbuf);
-   priv->rx_buffers--;
-   }
-   rte_free(rx_queue->rx_ring_mem);
-   }
-   } while (retries++ < 10 && priv->rx_buffers);
+   /* Release packets on the buffer stack. */
+   mpipe_recv_flush_stack(priv);

-   if (priv->rx_buffers) {
-   RTE_LOG(ERR, PMD, "%s: Leaked %d receive buffers.\n",
-   mpipe_name(priv), priv->rx_buffers);
-   } else {
-   PMD_DEBUG_RX("%s: Returned all receive buffers.\n",
-mpipe_name(priv));
+   /* Flush packets sitting in recv queues. */
+   for (queue = 0; queue < priv->nb_rx_queues; queue++) {
+   rx_queue = mpipe_rx_queue(priv, queue);
+   iqueue = &rx_queue->iqueue;
+   while (gxio_mpipe_iqueue_try_get(iqueue, &idesc) >= 0) {
+   /* Skip idesc with the 'buffer error' bit set. */
+   if (idesc.be)
+   continue;
+   mbuf = mpipe_recv_mbuf(priv, &idesc, in_port);
+   rte_pktmbuf_free(mbuf);
+   }
+   rte_free(rx_queue->rx_ring_mem);
}
 }

@@ -1339,6 +1324,7 @@ mpipe_do_xmit(struct mpipe_tx_queue *tx_queue, struct 
rte_mbuf **tx_pkts,
.xfer_size = rte_pktmbuf_data_len(mbuf),
.bound = next ? 0 : 1,
.stack_idx = mpipe_mbuf_stack_index(priv, mbuf),
+   .size  = priv->rx_size_code,
} };
if (mpipe_local.mbuf_push_debt[port_id] > 0) {
mpipe_local.mbuf_push_debt[port_id]--;
diff --git a/lib/librte_eal/common/include/arch/tile/rte_rwlock.h 
b/lib/librte_eal/common/include/arch/tile/rte_rwlock.h
index 8f67a19..6d609e8 100644
--- a/lib/librte_eal/common/include/arch/tile/rte_rwlock.h
+++ b/lib/li

[dpdk-dev] [PATCH 1/3] vmxnet3: support mult-segment receive

2015-12-22 Thread Yong Wang
On 12/3/15, 5:05 PM, "Stephen Hemminger"  wrote:



>From: Stephen Hemminger 
>
>The vmxnet3 interface specification supports having multiple
>receive rings. The first ring has buffers of BTYPE_HEAD which
>are used for the start of the packet, the second ring has buffers
>of type BTYPE_BODY which are used only if the received packet
>exceeds the available space of the head buffer.  There can
>zero or more body buffers for one packet.
>
>What the driver does is post mbuf's to both rings. If a jumbo
>frame is received; then a multi-segment mbuf is created and this
>is returned. The previous version of the driver was filling both
>rings, but the second ring was never used. It would error out
>if it ever received a multi-segment mbuf.
>
>The logic is very similar to existing methods used in other
>OS's. This patch has been used internally by several companies
>and users for several releases and tested with Jumbo frames.


I am fine with this change but just want to point it out that
the logic here deviates from the other OSes somewhat as they
populate the two rings differently:

The first ring has type (HEAD, BODY, BODY, HEAD, BODY, BODY)
assuming 3 descriptors are needed to hold a pkt of MTU size.  
The 2nd ring is all BODY type.  Only if BODY type from 1st ring
runs out will the 2nd ring be used.

Here the first ring is exclusively of HEAD type and any pkts
that cannot hold in the HEAD descriptor will go to the 2nd
ring.  This will work but then the number of descriptors that
actually are needed from the first ring is smaller.  If we
populates the 1st ring based on MTU as other OSes do, then
we just need 1 ring to handle jumbo frame, which ends up with
smaller working set, etc.

>
>Signed-off-by: Stephen Hemminger 
>---
> drivers/net/vmxnet3/vmxnet3_ring.h |  2 +
> drivers/net/vmxnet3/vmxnet3_rxtx.c | 78 +++---
> 2 files changed, 42 insertions(+), 38 deletions(-)
>
>diff --git a/drivers/net/vmxnet3/vmxnet3_ring.h 
>b/drivers/net/vmxnet3/vmxnet3_ring.h
>index 612487e..55ceadf 100644
>--- a/drivers/net/vmxnet3/vmxnet3_ring.h
>+++ b/drivers/net/vmxnet3/vmxnet3_ring.h
>@@ -171,6 +171,8 @@ typedef struct vmxnet3_rx_queue {
>   uint32_tqid1;
>   uint32_tqid2;
>   Vmxnet3_RxQueueDesc *shared;
>+  struct rte_mbuf *start_seg;
>+  struct rte_mbuf *last_seg;
>   struct vmxnet3_rxq_statsstats;
>   boolstopped;
>   uint16_tqueue_id;  /**< Device RX queue index. 
> */
>diff --git a/drivers/net/vmxnet3/vmxnet3_rxtx.c 
>b/drivers/net/vmxnet3/vmxnet3_rxtx.c
>index 4de5d89..00980a5 100644
>--- a/drivers/net/vmxnet3/vmxnet3_rxtx.c
>+++ b/drivers/net/vmxnet3/vmxnet3_rxtx.c
>@@ -575,35 +575,11 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf 
>**rx_pkts, uint16_t nb_pkts)
>   rxd = (Vmxnet3_RxDesc *)rxq->cmd_ring[ring_idx].base + idx;
>   rbi = rxq->cmd_ring[ring_idx].buf_info + idx;
> 
>-  if (unlikely(rcd->sop != 1 || rcd->eop != 1)) {
>-  rte_pktmbuf_free_seg(rbi->m);
>-  PMD_RX_LOG(DEBUG, "Packet spread across multiple 
>buffers\n)");
>-  goto rcd_done;
>-  }
>-
>   PMD_RX_LOG(DEBUG, "rxd idx: %d ring idx: %d.", idx, ring_idx);
> 
>   VMXNET3_ASSERT(rcd->len <= rxd->len);
>   VMXNET3_ASSERT(rbi->m);
> 
>-  if (unlikely(rcd->len == 0)) {
>-  PMD_RX_LOG(DEBUG, "Rx buf was skipped. 
>rxring[%d][%d]\n)",
>- ring_idx, idx);
>-  VMXNET3_ASSERT(rcd->sop && rcd->eop);
>-  rte_pktmbuf_free_seg(rbi->m);
>-  goto rcd_done;
>-  }
>-
>-  /* Assuming a packet is coming in a single packet buffer */
>-  if (unlikely(rxd->btype != VMXNET3_RXD_BTYPE_HEAD)) {
>-  PMD_RX_LOG(DEBUG,
>- "Alert : Misbehaving device, incorrect "
>- " buffer type used. iPacket dropped.");
>-  rte_pktmbuf_free_seg(rbi->m);
>-  goto rcd_done;
>-  }
>-  VMXNET3_ASSERT(rxd->btype == VMXNET3_RXD_BTYPE_HEAD);
>-
>   /* Get the packet buffer pointer from buf_info */
>   rxm = rbi->m;
> 
>@@ -615,7 +591,7 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf 
>**rx_pkts, uint16_t nb_pkts)
>   rxq->cmd_ring[ring_idx].next2comp = idx;
> 
>   /* For RCD with EOP set, check if there is frame error */
>-  if (unlikely(rcd->err)) {
>+  if (unlikely(rcd->eop && rcd->err)) {
>   rxq->stats.drop_total++;
>   rxq->stats.drop_err++;
> 
>@@ -641,9 +617,45 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf 
>**rx_pkts, uint16_t nb_pkts)
>  

[dpdk-dev] VFIO no-iommu

2015-12-22 Thread Alex Williamson
On Mon, 2015-12-21 at 12:22 -0700, Alex Williamson wrote:
> On Mon, 2015-12-21 at 11:46 +, Yigit, Ferruh wrote:
> > On Fri, Dec 18, 2015 at 02:50:17PM -0700, Alex Williamson wrote:
> > > On Fri, 2015-12-18 at 07:38 -0700, Alex Williamson wrote:
> > > > On Fri, 2015-12-18 at 10:43 +, Yigit, Ferruh wrote:
> > > > > On Thu, Dec 17, 2015 at 09:43:59AM -0700, Alex Williamson
> > > > > wrote:
> > > > > <...>
> > > > > > > > > > > 
> > > > > > > > > > > Also I need to disable VFIO_CHECK_EXTENSION
> > > > > > > > > > > ioctl,
> > > > > > > > > > > because in
> > > > > > > > > > > vfio
> > > > > > > > > > > module,
> > > > > > > > > > > container->noiommu is not set before doing a
> > > > > > > > > > > vfio_group_set_container()
> > > > > > > > > > > and vfio_for_each_iommu_driver selects wrong
> > > > > > > > > > > driver.
> > > > > > > > > > 
> > > > > > > > > > Running CHECK_EXTENSION on a container without the
> > > > > > > > > > group
> > > > > > > > > > attached is
> > > > > > > > > > only going to tell you what extensions vfio is
> > > > > > > > > > capable
> > > > > > > > > > of,
> > > > > > > > > > not
> > > > > > > > > > necessarily what extensions are available to you
> > > > > > > > > > with
> > > > > > > > > > that
> > > > > > > > > > group.
> > > > > > > > > > Is this just a general dpdk- vfio ordering bug?
> > > > > > > > > 
> > > > > > > > > Yes, that is how VFIO was implemented in DPDK. I was
> > > > > > > > > under
> > > > > > > > > the
> > > > > > > > > impression that checking extension before assigning
> > > > > > > > > devices
> > > > > > > > > was
> > > > > > > > > the
> > > > > > > > > correct way to do things, so as to not to try
> > > > > > > > > anything
> > > > > > > > > we
> > > > > > > > > know
> > > > > > > > > would
> > > > > > > > > fail anyway. Does this imply that CHECK_EXTENSION
> > > > > > > > > needs
> > > > > > > > > to
> > > > > > > > > be
> > > > > > > > > called
> > > > > > > > > on both container and groups (or just on groups)?
> > > > > > > > 
> > > > > > > > Hmm, in Documentation/vfio.txt we do give the following
> > > > > > > > algorithm:
> > > > > > > > 
> > > > > > > > if (ioctl(container, VFIO_GET_API_VERSION) !=
> > > > > > > > VFIO_API_VERSION)
> > > > > > > > /* Unknown API version */
> > > > > > > > 
> > > > > > > > if (!ioctl(container, VFIO_CHECK_EXTENSION,
> > > > > > > > VFIO_TYPE1_IOMMU))
> > > > > > > > /* Doesn't support the IOMMU driver we
> > > > > > > > want.
> > > > > > > > */
> > > > > > > > ...
> > > > > > > > 
> > > > > > > > That's just going to query each iommu driver and we
> > > > > > > > can't
> > > > > > > > yet
> > > > > > > > say
> > > > > > > > whether
> > > > > > > > the group the user attaches to the container later will
> > > > > > > > actually
> > > > > > > > support that
> > > > > > > > extension until we try to do it, that would come at
> > > > > > > > VFIO_SET_IOMMU.
> > > > > > > > ?So is
> > > > > > > > it perhaps a vfio bug that we're not advertising no-
> > > > > > > > iommu
> > > > > > > > until
> > > > > > > > the
> > > > > > > > group is
> > > > > > > > attached? ?After all, we are capable of it with just an
> > > > > > > > empty
> > > > > > > > container, just
> > > > > > > > like we are with type1, but we're going to fail
> > > > > > > > SET_IOMMU
> > > > > > > > for
> > > > > > > > the
> > > > > > > > wrong
> > > > > > > > combination.
> > > > > > > > ?This is exactly the sort of thing that makes me glad
> > > > > > > > we
> > > > > > > > reverted
> > > > > > > > it without
> > > > > > > > feedback from a working user driver. ?Thanks,
> > > > > > > 
> > > > > > > Whether it should be considered a "bug" in VFIO or "by
> > > > > > > design"
> > > > > > > is
> > > > > > > up
> > > > > > > to you, of course, but at least according to the VFIO
> > > > > > > documentation,
> > > > > > > we are meant to check for type 1 extension and then
> > > > > > > attach
> > > > > > > devices,
> > > > > > > so it would be expected to get VFIO_NOIOMMU_IOMMU marked
> > > > > > > as
> > > > > > > supported
> > > > > > > even without any devices attached to the container (just
> > > > > > > like
> > > > > > > we
> > > > > > > get
> > > > > > > type 1 as supported without any devices attached). Having
> > > > > > > said
> > > > > > > that,
> > > > > > > if it was meant to attach devices first and then check
> > > > > > > the
> > > > > > > extensions, then perhaps the documentation should also
> > > > > > > point
> > > > > > > out
> > > > > > > that
> > > > > > > fact (or perhaps I missed that detail in my readings of
> > > > > > > the
> > > > > > > docs,
> > > > > > > in
> > > > > > > which case my apologies).
> > > > > > 
> > > > > > Hi Anatoly,
> > > > > > 
> > > > > > Does the below patch make it behave more like you'd expect.
> > > > > > ?This
> > > > > > applies to v4.4-rc4, I'd fold this into the base patch if
> > > > > > we
> > > > > > reincorporate it to a future kernel. ?Than

[dpdk-dev] Vmxnet3 activation of device fails in DPDK1.7

2015-12-22 Thread Yong Wang
I checked with our engineers internally and it turns out that the changes to 
lower the max ring2 size never got checked-in in 6.0. This should explain why 
it works on ESXi6.0.  And for 5.5, the plan is to revert the changes in 55p08 
patch to bring it back to 4096.




On 12/13/15, 11:06 PM, "Dey, Souvik"  wrote:
>Thanks for the update. Yes I tried with both 1024 and 2048 it worked fine. But 
>interestingly 4096 is working fine in ESXi6.0. Are you aware of some 
>configuration in ESXi which can solve it issue or bring back nb_rx_desc to max 
>of 4096 ?. Or any idea of any fixes from vmware which fixes this issue ?
>
>-Original Message-
>From: Yong Wang [mailto:yongwang at vmware.com] 
>Sent: Monday, December 14, 2015 11:57 AM
>To: Dey, Souvik ; dev at dpdk.org
>Subject: Re: [dpdk-dev] Vmxnet3 activation of device fails in DPDK1.7
>
>nb_rx_desc should be less or equal to 2048 in update 3 due to a change in 
>vmxnet3 backend. More details can be found at http://kb.vmware.com/kb/2136932.
>
>
>On 12/13/15, 9:01 PM, "Dey, Souvik"  wrote:
>
>>Not sure about but it definitely worked in ESXi5.5 Update1, and also prior 
>>versions. Also on ESXi6.0 it works fine. 
>>We are using 1rx queue and 8 tx queues per port. The nb_rx_desc  is set to 
>>4096  and the each mbuf size is set to 2048. The rx_conf struct has the 
>>following values during the init time .
>>
>>static const struct rte_eth_rxconf rx_conf = {
>>   .rx_thresh = {
>>  .pthresh = RX_PTHRESH,
>>  .hthresh = RX_HTHRESH,
>>  .wthresh = RX_WTHRESH,
>>   },
>>};
>>
>>Do you suspect anything wrong in this ?
>>
>>-Original Message-
>>From: Yong Wang [mailto:yongwang at vmware.com] 
>>Sent: Friday, December 11, 2015 12:07 AM
>>To: Dey, Souvik ; dev at dpdk.org
>>Subject: Re: [dpdk-dev] Vmxnet3 activation of device fails in DPDK1.7
>>
>>On 12/10/15, 2:22 AM, "dev on behalf of Dey, Souvik" >on behalf of sodey at sonusnet.com> wrote:
>>
>>
>>
>>>Hi,
>>>In DPDK 1.7 , while using the vmxnet3 pmd on vmware Esxi 5.5 
>>> update 3 we are seeing that activation of the device fails.
>>>
>>>status = VMXNET3_READ_BAR1_REG(hw, VMXNET3_REG_CMD); return a non zero 
>>>status. Though the normal vmxnet3.ko works fine in the same system. Any idea 
>>>if anyone has faced this type of issue.
>>>
>>>--
>>>Regards,
>>>Souvik
>>
>>Did it work for ESXi 5.5 update 2 or some earlier version?
>>
>>Can you also post the rx ring size you used to config the rx queue?
>>In update 3, there are some changes to the max allowed rx ring size for 
>>ring1. Even ring1 is not used yet in the pmd, the setup code sets ring1?s 
>>size the same as ring0?s.


[dpdk-dev] [PATCH] The VMXNET3 PMD can't receive packet suddenly after a lot of traffic coming in

2015-12-22 Thread Yong Wang
On 7/23/15, 5:53 PM, "dev on behalf of Marco Lee"  wrote:


>The RX of VMXNET3 PMD will have deadlock when a lot of traffic coming in.
>The root cause is due to mbuf allocation fail in vmxnet3_post_rx_bufs()
>and there is no error handling when it is called from vmxnet3_recv_pkts().
>The RXD will not have "free" mbuf for it but the counter still increment.

Can you describe what counter this refers to?

>Finally, no packet can be received.
>
>This fix is allocate the mbuf first, if the allocation is failed,
>then reuse the old mbuf. If the allocation is success,
>the vmxnet3_post_rx_bufs() will call vmxnet3_renew_desc()
>and RXD will be renew inside.

I didn?t see this part of logic implemented.

>
>Signed-off-by: Marco Lee ruckuswireless.com>
>---
> drivers/net/vmxnet3/vmxnet3_rxtx.c |   37 +++-
> 1 file changed, 36 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/vmxnet3/vmxnet3_rxtx.c 
>b/drivers/net/vmxnet3/vmxnet3_rxtx.c
>index 39ad6ef..cbed438 100644
>--- a/drivers/net/vmxnet3/vmxnet3_rxtx.c
>+++ b/drivers/net/vmxnet3/vmxnet3_rxtx.c
>@@ -421,6 +421,35 @@ vmxnet3_xmit_pkts(void *tx_queue, struct rte_mbuf 
>**tx_pkts,
>   return nb_tx;
> }
> 
>+static inline void
>+vmxnet3_renew_desc(vmxnet3_rx_queue_t *rxq, uint8_t ring_id,
>+  struct rte_mbuf *mbuf)
>+{
>+  uint32_t  val = 0;
>+  struct vmxnet3_cmd_ring *ring = &rxq->cmd_ring[ring_id];
>+

Remove this blank line.

>+  struct Vmxnet3_RxDesc *rxd;
>+  vmxnet3_buf_info_t *buf_info = &ring->buf_info[ring->next2fill];
>+
>+  rxd = (struct Vmxnet3_RxDesc *)(ring->base + ring->next2fill);
>+

nit: this can be merged with the above definition.

>+  if (ring->rid == 0)
>+  val = VMXNET3_RXD_BTYPE_HEAD;
>+  else
>+  val = VMXNET3_RXD_BTYPE_BODY;
>+  

Remove the trailing space here.

>+
>+  buf_info->m = mbuf;
>+  buf_info->len = (uint16_t)(mbuf->buf_len - RTE_PKTMBUF_HEADROOM);
>+  buf_info->bufPA = RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mbuf);
>+
>+  rxd->addr = buf_info->bufPA;
>+  rxd->btype = val;
>+  rxd->len = buf_info->len;
>+  rxd->gen = ring->gen;
>+
>+  vmxnet3_cmd_ring_adv_next2fill(ring);
>+}
> /*
>  *  Allocates mbufs and clusters. Post rx descriptors with buffer details
>  *  so that device can receive packets in those buffers.
>@@ -578,6 +607,8 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf 
>**rx_pkts, uint16_t nb_pkts)
>   if (nb_rx >= nb_pkts)
>   break;
> 
>+  struct rte_mbuf *rep;

What does rep mean? Can you rename it to something easier to understand
(say newm, or m2)?

Also, please move the definition to the top of this block.

>+  rep = rte_rxmbuf_alloc(rxq->mb_pool);
>   idx = rcd->rxdIdx;
>   ring_idx = (uint8_t)((rcd->rqID == rxq->qid1) ? 0 : 1);
>   rxd = (Vmxnet3_RxDesc *)rxq->cmd_ring[ring_idx].base + idx;
>@@ -651,13 +682,17 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf 
>**rx_pkts, uint16_t nb_pkts)
> 
>   vmxnet3_rx_offload(rcd, rxm);
> 
>+  if (unlikely(rep == NULL)) {
>+  rep = rxm;
>+  goto rcd_done;
>+  }

Should this be moved earlier? Also need to update the rx_buf_alloc_failure
counter.

>   rx_pkts[nb_rx++] = rxm;
> rcd_done:
>   rxq->cmd_ring[ring_idx].next2comp = idx;
>   VMXNET3_INC_RING_IDX_ONLY(rxq->cmd_ring[ring_idx].next2comp, 
> rxq->cmd_ring[ring_idx].size);
> 
>   /* It's time to allocate some new buf and renew descriptors */
>-  vmxnet3_post_rx_bufs(rxq, ring_idx);
>+  vmxnet3_renew_desc(rxq, ring_idx, rep);
>   if (unlikely(rxq->shared->ctrl.updateRxProd)) {
>   VMXNET3_WRITE_BAR0_REG(hw, rxprod_reg[ring_idx] + 
> (rxq->queue_id * VMXNET3_REG_ALIGN),
>  
> rxq->cmd_ring[ring_idx].next2fill);
>-- 
>1.7.9.5
>