[dpdk-dev] [PATCH] i40e: fix unintended sign extension

2015-12-16 Thread Jingjing Wu
Coverity reported the following issue:
CID 119268 (#1 of 1): Unintended sign extension
(SIGN_EXTENSION) sign_extension: Suspicious implicit sign extension:
vsi_id with type unsigned short (16 bits, unsigned) is promoted in
vsi_id << 23 to type int (32 bits, signed), then sign-extended to type
unsigned long (64 bits, unsigned). If vsi_id << 23 is greater than
0x7FFFFFFF, the upper bits of the result will all be 1.

Cast vsi_id to uint32_t before shifting so that the shift is done on an
unsigned 32-bit value.

Fixes: 88ebc2b7f976 ("i40e: extend flow director to support VF")
Signed-off-by: Jingjing Wu 
---
 drivers/net/i40e/i40e_fdir.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/i40e/i40e_fdir.c b/drivers/net/i40e/i40e_fdir.c
index 43a39ec..9ad6981 100644
--- a/drivers/net/i40e/i40e_fdir.c
+++ b/drivers/net/i40e/i40e_fdir.c
@@ -1091,7 +1091,7 @@ i40e_fdir_filter_programming(struct i40e_pf *pf,
/* Use LAN VSI Id by default */
vsi_id = pf->main_vsi->vsi_id;
fdirdp->qindex_flex_ptype_vsi |=
-   rte_cpu_to_le_32((vsi_id <<
+   rte_cpu_to_le_32(((uint32_t)vsi_id <<
  I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT) &
  I40E_TXD_FLTR_QW0_DEST_VSI_MASK);

-- 
2.4.0
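
For readers who want to see the effect outside the driver, here is a
minimal sketch of the promotion problem described in the Coverity report.
The function names and the VSI_SHIFT macro are made up for illustration
(the shift of 23 is taken from the report); the first function models the
promoted-int result through an explicit int32_t conversion so that the
example itself does not rely on undefined behaviour.

```c
#include <stdint.h>

/* Assumed shift value, taken from the "vsi_id << 23" in the report. */
#define VSI_SHIFT 23

/*
 * Models the pre-fix behaviour: the shifted value ends up in a signed
 * 32-bit int and is then widened to 64 bits, which sign-extends it.
 * (The uint32_t -> int32_t conversion is implementation-defined, but
 * yields the two's-complement value on common compilers.)
 */
static uint64_t widen_via_int(uint16_t vsi_id)
{
	int32_t as_int = (int32_t)((uint32_t)vsi_id << VSI_SHIFT);

	return (uint64_t)as_int;
}

/* Models the fix: shift as uint32_t, so widening zero-extends. */
static uint64_t widen_via_u32(uint16_t vsi_id)
{
	return (uint64_t)((uint32_t)vsi_id << VSI_SHIFT);
}
```

With vsi_id = 0x100 the shifted value is 0x80000000, so the first helper
returns a value with all upper 32 bits set while the second does not.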



[dpdk-dev] problem vhost-user sockets

2015-12-16 Thread Yuanhan Liu
On Tue, Dec 15, 2015 at 05:21:25PM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > I'm thinking you can't simply unlink a file given by a user inside
> > a library unconditionally. Say, what if a user gives a wrong socket
> > path?
> 
>  Well... We can improve the security by checking that:
> 
> a) The file exists and it's a socket.
> b) Nobody is listening on it.

I don't think that's enough. And the fact of the matter is that a
library should not remove a file that it did not create itself.

> > I normally write a short script to handle it automatically.
> 
>  I know, you can always hack up some kludges, just IMHO it's not a
> production-grade solution. What if you are a cloud administrator, and
> you have 1000 users, each of them using 100 vhost-user interfaces? List all
> of them in some script? Too big a job, I would say.
>  And without it the thing just appears to be too fragile, requiring manual
> maintenance after a single stupid failure.

You need to fix the application then. The file path is constructed there
after all. And if it's an open source project (say ovs), you are free
to fix it, aren't you? ;)

--yliu


[dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support

2015-12-16 Thread Peter Xu
On Tue, Dec 15, 2015 at 04:07:57PM +0100, Thibaut Collet wrote:
> After a migration, to avoid a network outage, all interfaces of the guest
> must send a packet to update the switches' mappings (ideally a GARP).
> As some interfaces do not do it, QEMU does it on behalf of the guest by
> sending a RARP (this RARP is not forged by the guest but by QEMU). This is
> the purpose of qemu_announce_self, which "spoofs" a RARP to all backends of
> the guest's ethernet interfaces. For a vhost-user backend, QEMU cannot do it
> directly and asks the vhost-user backend to do it with the
> VHOST_USER_SEND_RARP request, which contains the MAC address of the guest
> interface.
> 
> Thibaut.

Hi, Thibaut,

Thanks for the explanation.

Two more questions:

1. if the vhost-user backend (or say, DPDK) supports GUEST_ANNOUNCE, and
   sends another RARP (or say, GARP, I will use RARP as example),
   then there will be two RARPs on the line later, right? (since the
   QEMU one is sent unconditionally from qemu_announce_self).

2. if the only thing the vhost-user backend does is send the same RARP
   again when it gets a SEND_RARP request, why bother if QEMU will
   unconditionally send one? (or say, I still do not know why we
   need this SEND_RARP request, if the vhost-user backend is going
   to do the same thing again as QEMU already does)

Thanks in advance.
Peter


[dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support

2015-12-16 Thread Yuanhan Liu
On Wed, Dec 16, 2015 at 10:38:03AM +0800, Peter Xu wrote:
> On Tue, Dec 15, 2015 at 04:07:57PM +0100, Thibaut Collet wrote:
> > After a migration, to avoid a network outage, all interfaces of the guest
> > must send a packet to update the switches' mappings (ideally a GARP).
> > As some interfaces do not do it, QEMU does it on behalf of the guest by
> > sending a RARP (this RARP is not forged by the guest but by QEMU). This is
> > the purpose of qemu_announce_self, which "spoofs" a RARP to all backends of
> > the guest's ethernet interfaces. For a vhost-user backend, QEMU cannot do it
> > directly and asks the vhost-user backend to do it with the
> > VHOST_USER_SEND_RARP request, which contains the MAC address of the guest
> > interface.
> > 
> > Thibaut.
> 
> Hi, Thibaut,
> 
> Thanks for the explanation.
> 
> Two more questions:
> 
> 1. if the vhost-user backend (or say, DPDK) supports GUEST_ANNOUNCE, and
>    sends another RARP (or say, GARP, I will use RARP as example),
>    then there will be two RARPs on the line later, right? (since the
>    QEMU one is sent unconditionally from qemu_announce_self).

The one sent by qemu_announce_self() will be caught by
vhost_user_receive(), which ends up invoking vhost_user_migration_done().
And it will be dropped there when VIRTIO_NET_F_GUEST_ANNOUNCE is
negotiated.

> 2. if the only thing the vhost-user backend does is send the same RARP
>    again when it gets a SEND_RARP request, why bother if QEMU will
>    unconditionally send one? (or say, I still do not know why we
>    need this SEND_RARP request, if the vhost-user backend is going
>    to do the same thing again as QEMU already does)

Because that one is caught by vhost-user, and vhost-user just relays
it to the backend when necessary (say when GUEST_ANNOUNCE is not
supported)?
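
The decision described in this reply boils down to one feature-bit test.
A minimal sketch follows, assuming the behaviour is exactly as stated
above; backend_must_send_rarp() is a made-up name, and the bit number 21
for VIRTIO_NET_F_GUEST_ANNOUNCE is the value from the virtio
specification.

```c
#include <stdint.h>

/* Feature bit number as defined by the virtio specification. */
#define VIRTIO_NET_F_GUEST_ANNOUNCE 21

/*
 * Hypothetical helper: returns 1 when the announcement has to be relayed
 * to the backend (the guest did not negotiate GUEST_ANNOUNCE and so will
 * not announce itself), 0 when the RARP can be dropped.
 */
static int backend_must_send_rarp(uint64_t negotiated_features)
{
	return (negotiated_features &
		(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE)) == 0;
}
```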

--yliu


[dpdk-dev] VFIO no-iommu

2015-12-16 Thread Ferruh Yigit
On Tue, Dec 15, 2015 at 09:53:18AM -0700, Alex Williamson wrote:
> On Tue, 2015-12-15 at 13:43 +, O'Driscoll, Tim wrote:
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Alex
> > > Williamson
> > > Sent: Friday, December 11, 2015 11:03 PM
> > > To: Vincent JARDIN; dev at dpdk.org
> > > Subject: Re: [dpdk-dev] VFIO no-iommu
> > > 
> > > On Fri, 2015-12-11 at 23:12 +0100, Vincent JARDIN wrote:
> > > > Thanks Thomas for putting back this topic.
> > > > 
> > > > Alex,
> > > > 
> > > > I'd like to hear more about the impacts of "unsupported":
> > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=033291eccbdb1b70ffc02641edae19ac825dc75d
> > > > >    Use of this mode, specifically binding a device without a native
> > > > >    IOMMU group to a VFIO bus driver will taint the kernel and should
> > > > >    therefore not be considered supported.
> > > > > 
> > > > > It means that we get rid of uio, so it is a nice code cleanup; but why
> > > > > would VFIO/NO IOMMU be better if the bottom line is "unsupported"?
> > > 
> > > > How supportable do you think the uio method is?  Fundamentally we have
> > > > a userspace driver doing unrestricted DMA; it can access and modify any
> > > > memory in the system.  This is the reason uio won't provide a mechanism
> > > > to enable MSI and if you ask the uio maintainer, they don't support DMA
> > > > at all, it's only intended as a programmed IO interface to the device.
> > > > Unless we can sandbox a user owned device within an IOMMU protected
> > > > container, it's not supportable.  The VFIO no-iommu mode can simply
> > > > provide you that unsupported mode more easily since it leverages code
> > > > from the supported mode, which is IOMMU protected.  Thanks,
> > 
> > Thanks for clarifying.
> > 
> > This does seem like it would be useful for DPDK. We're doing some
> > further investigation to see if it works out of the box with DPDK or
> > if we need to make any changes to support it.
> 
> > The iommu model is different, there's no type1 interface available when
> > using this mode since we have no ability to provide translation.  The
> > no-iommu iommu model really does nothing, which is a possible issue for
> > userspace.  Is it sufficient?  We stopped short of creating a page
> > pinning interface through the no-iommu model because it requires code
> > and adding piles of new code for an interface we claim is unsupported
> > doesn't make a lot of sense.  The device interface should be identical
> > to existing vfio support.
> 
> > Thomas highlighted that your original commit for this had been
> > reverted. What specifically would you need from us in order to re-
> > submit the VFIO No-IOMMU support?
> 
> > No API changes should ever go into the kernel without being validated
> > by a user.  Without that we're risking that the kernel interface is
> > broken and we're stuck supporting it.  In this case I tried to make
> > sure we had a working user before it went in, gambled that it was close
> > enough to put in anyway, then paid the price when development went
> > silent on the user side.  To get it back in, I'm going to need a
> > working user first.  You can re-apply 033291eccbdb or re-revert
> > ae5515d66362 for development of that.  I need to see that it
> > works and that there's some consensus from the dpdk community that it's
> > a worthwhile path forward for cases without an iommu.  There's no point
> > in merging it if it only becomes a userspace proof of concept.  Thanks,
> 
I tested DPDK (HEAD of master) with the patch, with the help of Anatoly,
and DPDK works in a no-iommu environment with a small modification.

Basically the only modification is to adapt to the new group naming
(noiommu-$) and to disable DMA mapping (VFIO_IOMMU_MAP_DMA).

I also needed to disable the VFIO_CHECK_EXTENSION ioctl, because in the vfio
module container->noiommu is not set before vfio_group_set_container() is
called, so vfio_for_each_iommu_driver selects the wrong driver.

For the test I bound two different types of NICs to the VFIO driver and used
testpmd to confirm that traffic passes. The kernel was booted without the
iommu enabled, and the vfio module was inserted with the
"enable_unsafe_noiommu_support" parameter.
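
To illustrate the group naming change mentioned above, here is a small
sketch of how a user-space driver might build the group device path. The
helper name vfio_group_path() and its parameters are made up for this
example; it is not the code used in DPDK.

```c
#include <stdio.h>

/*
 * Hypothetical helper: formats the VFIO group device node for group
 * number 'group_no'. In no-iommu mode the node carries the "noiommu-"
 * prefix, otherwise it is the plain group number.
 * Returns the snprintf() result.
 */
static int vfio_group_path(char *buf, size_t len, int group_no, int noiommu)
{
	if (noiommu)
		return snprintf(buf, len, "/dev/vfio/noiommu-%d", group_no);

	return snprintf(buf, len, "/dev/vfio/%d", group_no);
}
```

A caller that only wants properly isolated devices would never pass
noiommu = 1, and so would never open one of the no-iommu groups by
accident.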

Thanks,
ferruh


[dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support

2015-12-16 Thread Pavel Fedin
 Hello!

> 1. if the vhost-user backend (or say, DPDK) supports GUEST_ANNOUNCE, and
>    sends another RARP (or say, GARP, I will use RARP as example),
>    then there will be two RARPs on the line later, right? (since the
>    QEMU one is sent unconditionally from qemu_announce_self).

 qemu_announce_self() is NOT unconditional. It applies only to emulated 
physical NICs and bypasses virtio/vhost. So it will not send anything at all 
for vhost-user.

> 2. if the only thing the vhost-user backend does is send the same RARP
>    again when it gets a SEND_RARP request, why bother if QEMU will
>    unconditionally send one?

 See above, it won't send one.
 It looks to me like qemu_announce_self() is just a poor man's solution which
doesn't even always work (because a GARP should reassociate an existing IP with
a new MAC, shouldn't it? And qemu doesn't know the IP and just sets both src and
dst to 0.0.0.0).
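
For readers unfamiliar with the packet in question, the sketch below
builds a RARP request of the kind discussed in this thread: broadcast
destination, the guest MAC as sender and target hardware address, and
both protocol addresses left as 0.0.0.0. The layout follows the standard
ARP/RARP format (EtherType 0x8035, opcode 3 per RFC 903); the function
name and the 60-byte padding are choices made for this example, not code
taken from QEMU or DPDK.

```c
#include <stdint.h>
#include <string.h>

#define RARP_FRAME_LEN  60	/* minimum Ethernet frame length without FCS */

/*
 * Hypothetical helper: fills 'buf' (at least RARP_FRAME_LEN bytes) with
 * a broadcast RARP request announcing 'mac'. Returns the frame length.
 */
static size_t build_rarp_frame(uint8_t *buf, const uint8_t mac[6])
{
	uint8_t *p = buf;

	memset(buf, 0, RARP_FRAME_LEN);

	/* Ethernet header */
	memset(p, 0xff, 6);		/* destination: broadcast */
	p += 6;
	memcpy(p, mac, 6);		/* source: guest MAC */
	p += 6;
	*p++ = 0x80;			/* EtherType 0x8035 (RARP) */
	*p++ = 0x35;

	/* RARP payload */
	*p++ = 0x00;			/* hardware type 1: Ethernet */
	*p++ = 0x01;
	*p++ = 0x08;			/* protocol type 0x0800: IPv4 */
	*p++ = 0x00;
	*p++ = 6;			/* hardware address length */
	*p++ = 4;			/* protocol address length */
	*p++ = 0x00;			/* opcode 3: request reverse */
	*p++ = 0x03;
	memcpy(p, mac, 6);		/* sender hardware address */
	p += 6;
	p += 4;				/* sender IP stays 0.0.0.0 */
	memcpy(p, mac, 6);		/* target hardware address */
	p += 6;
	p += 4;				/* target IP stays 0.0.0.0 */

	return RARP_FRAME_LEN;
}
```

Because no IP address is carried, such a frame can refresh the MAC
tables of switches but cannot update ARP caches the way a GARP does.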

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




[dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support

2015-12-16 Thread Yuanhan Liu
On Tue, Dec 15, 2015 at 05:58:28PM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > No idea. Maybe you have changed some other configures (such as of ovs)
> > without notice? Or, the ovs bridge interface resets?
> 
>  I don't touch the ovs at all. Just shut down the guest, rebuild the qemu, 
> reinstall it, run the guest.
> 
> > 
> > BTW, would you please try my v1 patch set with above diff applied to
> > see if the ping loss is still there. You might also want to run tcpdump
> > with the dest host ovs bridge, to see if GARP is actually sent.
> 
>  Retested with wireshark running on the host. I used my qemu patch instead, 
> but it should not matter at all:
> --- cut ---
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 1b6c5ac..5ca2987 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -480,7 +480,12 @@ static int vhost_user_get_u64(struct vhost_dev *dev, int 
> request, uint64_t *u64)
> 
>  static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
>  {
> -return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
> +int ret = vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
> +
> +if (!ret) {
> +virtio_add_feature(features, VIRTIO_NET_F_GUEST_ANNOUNCE);
> +}
> +return ret;
>  }
> 
>  static int vhost_user_set_owner(struct vhost_dev *dev)
> --- cut ---
> 
>  So, here are both wireshark captures on the host side:

Pavel,

I can reproduce your issue on my side with the above patch (and only when
F_GUEST_ANNOUNCE is not set in the DPDK vhost lib). TBH, I don't know
why that happens; the cause could be subtle, and I don't think it's
worthwhile to dig into it, especially as it's not the right way to do it.

So, would you please try setting the F_GUEST_ANNOUNCE flag on the DPDK vhost
lib side, as my earlier diff showed, and run another test?

On the other hand, I failed to find two identical servers; the two closest
I found are E5-2695 and E5-2699. However, the fatal MSI lost bug still
occurred. I'm out of ideas as to what the root cause could be. I'm asking
for help from some KVM gurus; hopefully they can shed some light on it.
Meanwhile, I may need to try to debug it.

Since you don't hit this issue, I hope you can run a test and
tell me how it works :)

Thanks.

--yliu


[dpdk-dev] [ [PATCH v2] 00/13] Add virtio support in arm/arm64

2015-12-16 Thread Santosh Shukla
On Mon, Dec 14, 2015 at 6:30 PM, Santosh Shukla  wrote:

> This patch set adds basic infrastructure to run the virtio-net-pci pmd driver
> for arm64/arm. Tested on the ThunderX platform. Verified for existing dpdk
> test applications like:
> - ovs-dpdk-vhost-user: across the VMs, for use-cases like guest2guest and
>   Host2Guest
> - testpmd application: Tested for the max number of virtio-net-pci interfaces
>   currently supported in the kernel, i.e. 31 interfaces.
>
> Builds successfully for armv7/v8/thunderX and x86_64/i686 platforms. Made sure
> that the patch changes do not break the x86_64 case. Done similar tests for
> x86_64 too.
>
> Patch History:
> v2:
> - Removed ifdef arm.. clutter from igb_uio / virtio_ethdev files
> - Introduced rte_io.h header file in generic/ and arch specifics, i.e. for
>   armv7 --> rte_io_32.h, for armv8 --> rte_io_64.h.
> - Removed RTE_ARCH_X86 ifdef clutter too and added an rte_io.h header which
>   does nothing but wrap sys/io.h for x86_64 and i686
> - Moved all the RTE_ARCH_ARM/64 dependency for the igb_uio case to a separate
>   header file named igbuio_ioport_misc.h. Now igb_uio.c will call only three
>   functions:
>   - igbuio_iomap
>   - igbuio_ioport_register
>   - igbuio_ioport_unregister
> - Moved ARM/64 specific definitions to the include/exec-env/rte_virt_ioport.h
>   header
> - Included virtio_ioport.c/h; has all private and public api required to map
>   the iopci bar for non-x86 archs. Tested on both thunderX and x86_64.
>   Private api includes:
>   - virtio_map_ioport
>   - virtio_set_ioport_addr
>   Public api includes:
>   - virtio_ioport_init
>   - virtio_ioport_unmap
>
> - Last patch is the miscellaneous format specifier fix identified for the
>   64bit case during regression.
>
>
>
Hi Yuanhan, Huawei and Others.

I got arch-specific review comments from the arm maintainers and I am waiting
for your review feedback on the virtio-specific patches. Is the v3 patch and the
approach of mapping the virtio iopci bar to user space ok with all? Thanks.



> v1:
> - First patch adds the RTE_VIRTIO_INC_VECTOR config, much needed for archs
>   like arm/arm64 as they don't support the vectored implementation and also
>   won't be able to build it.
> - Second patch is a general fix for i686.
> - Third patch emulates x86-style {in,out}[b,w,l] api support for armv7/v8,
>   as the virtio-net-pci pmd driver uses those apis for port rd/wr {b,w,l}.
> - Fourth patch enables the VIRTIO_PMD feature in the armv7/v8/thunderX config.
> - Fifth patch disables the iopl syscall, as the arm/arm64 linux kernel doesn't
>   support it.
> - Sixth patch introduces an ioport memdevice called /dev/igb_ioport through
>   which the virtio pmd driver is able to rd/wr PCI_IOBAR.
>   {applicable for arm/arm64 only, tested for arm64 as of now}
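
As a rough illustration of the third item above, x86-style port accessors
can be emulated on an architecture without port I/O by treating the port
number as an offset into a memory-mapped I/O BAR. The sketch below is not
the code from the patch set: the io_* names are hypothetical, io_base
stands for wherever the BAR was mmap'ed, and the memory barriers a real
driver would need are left out.

```c
#include <stdint.h>

/* Hypothetical 8-bit accessors: 'port' is an offset from the mapped base. */
static inline uint8_t io_inb(volatile uint8_t *io_base, unsigned long port)
{
	return *(volatile uint8_t *)(io_base + port);
}

static inline void io_outb(volatile uint8_t *io_base, unsigned long port,
			   uint8_t val)
{
	*(volatile uint8_t *)(io_base + port) = val;
}

/* 32-bit variants; 'port' must be 4-byte aligned relative to the base. */
static inline uint32_t io_inl(volatile uint8_t *io_base, unsigned long port)
{
	return *(volatile uint32_t *)(io_base + port);
}

static inline void io_outl(volatile uint8_t *io_base, unsigned long port,
			   uint32_t val)
{
	*(volatile uint32_t *)(io_base + port) = val;
}
```

The 16-bit {in,out}w pair would follow the same pattern with uint16_t.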
>
>
> Santosh Shukla (13):
>   virtio: Introduce config RTE_VIRTIO_INC_VECTOR
>   config: i686: set RTE_VIRTIO_INC_VECTOR=n
>   rte_io: armv7/v8: Introduce api to emulate x86-style of PCI/ISA
>     ioport access
>   virtio_pci: use rte_io.h for non-x86 arch
>   virtio: change io_base datatype from uint32_t to uint64_type
>   config: armv7/v8: Enable RTE_LIBRTE_VIRTIO_PMD
>   linuxapp: eal: arm: Always return 0 for rte_eal_iopl_init()
>   rte_io: x86: Remove sys/io.h ifdef x86 clutter
>   igb_uio: ioport: map iopci region for armv7/v8
>   include/exec-env: ioport: add rte_virt_ioport header file
>   virtio_ioport: armv7/v8: mmap virtio iopci bar region
>   virtio_ethdev: use virtio_ioport api at device init/close
>   virtio_ethdev : fix format specifier error for 64bit addr case
>


[dpdk-dev] VFIO no-iommu

2015-12-16 Thread Burakov, Anatoly
Hi Alex,

> On Wed, 2015-12-16 at 04:04 +, Ferruh Yigit wrote:
> > On Tue, Dec 15, 2015 at 09:53:18AM -0700, Alex Williamson wrote:
> > I tested the DPDK (HEAD of master) with the patch, with help of
> > Anatoly, and DPDK works in no-iommu environment with a little
> > modification.
> >
> > Basically the only modification is adapt new group naming (noiommu-$)
> > and
> 
> Sorry, forgot to mention that one.  The intention with the modified group
> name is that I want to be very certain that a user intending to only support
> properly iommu isolated devices doesn't accidentally need to deal with these
> no-iommu mode devices.
> 
> > disable dma mapping (VFIO_IOMMU_MAP_DMA)
> >
> > Also I need to disable VFIO_CHECK_EXTENSION ioctl, because in vfio
> > module,
> > container->noiommu is not set before doing a
> > vfio_group_set_container()
> > and vfio_for_each_iommu_driver selects wrong driver.
> 
> Running CHECK_EXTENSION on a container without the group attached is
> only going to tell you what extensions vfio is capable of, not necessarily
> what extensions are available to you with that group.  Is this just a
> general dpdk-vfio ordering bug?

Yes, that is how VFIO was implemented in DPDK. I was under the impression that
checking extensions before assigning devices was the correct way to do things,
so as not to try anything we know would fail anyway. Does this imply that
CHECK_EXTENSION needs to be called on both container and groups (or just on
groups)?
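
To make the ordering question concrete, here is a sketch of querying an
extension only after a group has been attached to the container, which is
the order suggested in the quoted text above. It uses the standard ioctls
from <linux/vfio.h>; the helper name and its error handling are made up
for this example, and it is not the DPDK implementation.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: attaches the group at 'group_path' to a fresh
 * container and only then asks whether 'extension' is available.
 * Returns the VFIO_CHECK_EXTENSION result (0 or positive), or -1 on
 * error. The group is detached again when its fd is closed.
 */
static int check_extension_with_group(const char *group_path,
				      unsigned long extension)
{
	int container, group, ret = -1;

	container = open("/dev/vfio/vfio", O_RDWR);
	if (container < 0)
		return -1;

	group = open(group_path, O_RDWR);
	if (group < 0)
		goto out_container;

	/* attach first, so the answer reflects this particular group */
	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) < 0)
		goto out_group;

	ret = ioctl(container, VFIO_CHECK_EXTENSION, extension);

out_group:
	close(group);
out_container:
	close(container);
	return ret;
}
```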

> 
> > What I test is bind two different type of NICs into VFIO driver, and
> > use testpmd to confirm transfer is working.  Kernel booted without
> > iommu enabled, vfio module inserted with
> > "enable_unsafe_noiommu_support" parameter.
> 
> So it works.  Is it acceptable?  Useful?  Sufficiently complete?  Does it
> imply deprecating the uio interface?  I believe the feature that started this
> discussion was support for MSI/X interrupts so that VFs can support some
> kind of interrupt (uio only supports INTx since it doesn't allow
> DMA).  Implementing that would be the ultimate test of whether this
> provides dpdk with not only a more consistent interface, but the feature
> dpdk wants that's missing in uio. Thanks,

More testing will be needed, especially regarding interrupts, we will keep you 
updated.

Thanks,
Anatoly

> 
> Alex


[dpdk-dev] [PATCH v1 0/2] Virtio-net PMD Extension to work on host

2015-12-16 Thread Tetsuya Mukawa
[Change log]

PATCH v1:
(Just listing functionality changes and important bug fix)
* Support virtio-net interrupt handling.
  (It means virtio-net PMD on host and guest have same virtio-net features)
* Fix memory allocation method to allocate contiguous memory correctly.
* Port Hotplug is supported.
* Rebase on DPDK-2.2.


[Abstract]

Normally, the virtio-net PMD only works in a VM, because there is no
virtio-net device on the host.
This RFC patch extends the virtio-net PMD so that it can work on the host as
a virtual PMD.
But we didn't implement a virtio-net device as a part of the virtio-net PMD.
To prepare a virtio-net device for the PMD, start a QEMU process in the
special QTest mode, then connect to it from the virtio-net PMD through a unix
domain socket.

The virtio-net PMD on the host is fully compatible with the PMD on a guest.
We can use the same functionality, and connect to anywhere a QEMU virtio-net
device can.
For example, the PMD can use the virtio-net multi-queue function. It can also
connect to the vhost-net kernel module and a vhost-user backend application.
Similar to the virtio-net PMD on QEMU, the memory of the application that uses
the virtio-net PMD will be shared with the vhost backend application. But the
vhost backend application's memory will not be shared.

The main target of this PMD is containers like docker, rkt, lxc, etc.
We can isolate the related processes (virtio-net PMD process, QEMU and
vhost-user backend process) by container.
But, to communicate through a unix domain socket, a shared directory will be
needed.


[How to use]

So far, we need a QEMU patch to connect to a vhost-user backend.
See the patch below.
 - http://patchwork.ozlabs.org/patch/552549/
To learn how to use it, check the commit log.


[Detailed Description]

 - virtio-net device implementation
This host mode PMD uses the QEMU virtio-net device. To do that, the QEMU QTest
functionality is used.
QTest is a test framework for QEMU devices. It allows us to implement a device
driver outside of QEMU.
With QTest, we can implement a DPDK application and the virtio-net PMD as a
standalone process on the host.
When QEMU is invoked in QTest mode, no guest code will run.
To know more about QTest, see below.
 - http://wiki.qemu.org/Features/QTest

 - probing devices
QTest provides a unix domain socket. Through this socket, the driver process
can access the I/O ports and memory of the QEMU virtual machine.
The PMD will send I/O port accesses to probe pci devices.
If virtio-net and ivshmem devices are found, the PMD initializes them.
Also, the I/O port accesses of the virtio-net PMD will be sent through the
socket, so the virtio-net PMD can initialize the virtio-net device on QEMU
correctly.
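
To give an idea of what probing over the socket involves, the sketch below
formats the pair of port accesses needed to read one PCI configuration
register through the legacy 0xcf8/0xcfc ports. The address encoding is the
standard PCI configuration mechanism; the "outl"/"inl" text commands are
my recollection of the QTest wire protocol and should be checked against
the QEMU documentation. The function names are made up and this is not
code from qtest.c in the patch.

```c
#include <stdint.h>
#include <stdio.h>

#define PCI_CONFIG_ADDR 0xcf8	/* legacy PCI config address port */
#define PCI_CONFIG_DATA 0xcfc	/* legacy PCI config data port */

/* Encodes bus/device/function/register into a config address value. */
static uint32_t pci_config_address(uint8_t bus, uint8_t dev, uint8_t fn,
				   uint8_t reg)
{
	return 0x80000000u |
	       ((uint32_t)bus << 16) |
	       ((uint32_t)(dev & 0x1f) << 11) |
	       ((uint32_t)(fn & 0x07) << 8) |
	       (uint32_t)(reg & 0xfc);
}

/*
 * Hypothetical helper: writes the two text commands that select a config
 * register and read it back. The reply to the second command would carry
 * the register value. Returns the snprintf() result.
 */
static int qtest_format_config_read(char *buf, size_t len, uint8_t bus,
				    uint8_t dev, uint8_t fn, uint8_t reg)
{
	return snprintf(buf, len, "outl 0x%x 0x%x\ninl 0x%x\n",
			(unsigned)PCI_CONFIG_ADDR,
			(unsigned)pci_config_address(bus, dev, fn, reg),
			(unsigned)PCI_CONFIG_DATA);
}
```

Reading register 0 of every device slot this way yields the vendor and
device IDs, which is enough to recognize the virtio-net and ivshmem
devices.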

 - ivshmem device to share memory
To share the memory that the virtio-net PMD process uses, an ivshmem device
is used.
Because an ivshmem device can only handle one file descriptor, the shared
memory should consist of one file.
To allocate such memory, EAL has a new option called "--contig-mem".
If the option is specified, EAL will open a file and allocate memory from
hugepages.
While initializing the ivshmem device, we can set its BAR (Base Address
Register).
It determines the address at which a QEMU vcpu can access this shared memory.
We will specify the host physical address of the shared memory as this address.
This is very useful because we don't need to patch QEMU to calculate an
address offset.
(For example, if the virtio-net PMD process allocates memory from the shared
memory and then writes its physical address to a virtio-net register, the QEMU
virtio-net device can understand it without calculating an address offset.)


[Known issues]

 - vhost-user
So far, to use vhost-user, we need to apply a patch to QEMU.
This is because QEMU will not send the memory information and file descriptor
of the ivshmem device to the vhost-user backend.
I have submitted the patch to QEMU.
See "http://patchwork.ozlabs.org/patch/552549/".
Also, we may have an issue in the DPDK vhost library with handling kickfd and
callfd.
A patch for this issue is needed. I have a workaround patch, but let me check
it more.
If someone wants to check the vhost-user behavior, I will describe it in more
detail in a later email.




Tetsuya Mukawa (2):
  EAL: Add new EAL "--contig-mem" option
  virtio: Extend virtio-net PMD to support container environment

 config/common_linuxapp                     |    1 +
 drivers/net/virtio/Makefile                |    4 +
 drivers/net/virtio/qtest.c                 | 1107 
 drivers/net/virtio/virtio_ethdev.c         |  341 -
 drivers/net/virtio/virtio_ethdev.h         |   12 +
 drivers/net/virtio/virtio_pci.h            |   25 +
 lib/librte_eal/common/eal_common_options.c |    7 +
 lib/librte_eal/common/eal_internal_cfg.h   |    1 +
 lib/librte_eal/common/eal_options.h        |    2 +
 lib/librte_eal/linuxapp/eal/eal_memory.c   |   77 +-
 10 files changed, 1543 insertions(+), 34 deletions(-)
 create mode 100644 drivers/net/virtio/qtest.c

-- 
2.1.4



[dpdk-dev] [PATCH v1 1/2] EAL: Add new EAL "--contig-mem" option

2015-12-16 Thread Tetsuya Mukawa
This option is for allocating physically contiguous memory for EAL.
EAL will provide only one file descriptor for the memory.
So far, this memory will be used by the virtio-net PMD on a host or in a
container.

DPDK already has the "RTE_EAL_SINGLE_FILE_SEGMENTS" compile option.
It allows us to create one file descriptor for each contiguous memory
region. But with this option alone, DPDK may allocate memory that consists
of multiple contiguous memory regions.

The patch adds the "--contig-mem" option. It is only valid if
"RTE_EAL_SINGLE_FILE_SEGMENTS" is enabled.
If this option is specified, EAL memory will consist of
only one contiguous memory region.

To implement this option, the EAL implementation is changed as follows.
 - In calc_num_pages_per_socket(), EAL checks whether we can allocate
   memory that is large enough and consists of one contiguous region.
 - In unmap_unneeded_hugepages(), EAL unmaps memory regions that are
   not large enough.
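
The check described in the first item can be pictured with the simplified
sketch below. It is not the EAL code: it assumes the hugepages of one
socket are given as an array of physical addresses sorted in ascending
order, and the function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns the length, in pages, of the longest run
 * of physically contiguous pages in 'physaddr', which must be sorted in
 * ascending order.
 */
static size_t longest_contig_run(const uint64_t *physaddr, size_t n,
				 uint64_t page_sz)
{
	size_t best = 0, cur = 0, i;

	for (i = 0; i < n; i++) {
		if (i > 0 && physaddr[i] == physaddr[i - 1] + page_sz)
			cur++;		/* page extends the current run */
		else
			cur = 1;	/* page starts a new run */
		if (cur > best)
			best = cur;
	}
	return best;
}

/* Returns 1 if 'request_bytes' fits into a single contiguous run. */
static int can_satisfy_contig(const uint64_t *physaddr, size_t n,
			      uint64_t page_sz, uint64_t request_bytes)
{
	uint64_t pages_needed = (request_bytes + page_sz - 1) / page_sz;

	return longest_contig_run(physaddr, n, page_sz) >= pages_needed;
}
```

Runs that are too short for the request correspond to the memory that the
second item says gets unmapped.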

Signed-off-by: Tetsuya Mukawa 
---
 lib/librte_eal/common/eal_common_options.c |  7 +++
 lib/librte_eal/common/eal_internal_cfg.h   |  1 +
 lib/librte_eal/common/eal_options.h|  2 +
 lib/librte_eal/linuxapp/eal/eal_memory.c   | 77 --
 4 files changed, 82 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c
index 29942ea..55d537e 100644
--- a/lib/librte_eal/common/eal_common_options.c
+++ b/lib/librte_eal/common/eal_common_options.c
@@ -95,6 +95,7 @@ eal_long_options[] = {
{OPT_VFIO_INTR, 1, NULL, OPT_VFIO_INTR_NUM},
{OPT_VMWARE_TSC_MAP,0, NULL, OPT_VMWARE_TSC_MAP_NUM   },
{OPT_XEN_DOM0,  0, NULL, OPT_XEN_DOM0_NUM },
+   {OPT_CONTIG_MEM,0, NULL, OPT_CONTIG_MEM_NUM   },
{0, 0, NULL, 0}
 };

@@ -854,6 +855,12 @@ eal_parse_common_option(int opt, const char *optarg,
conf->process_type = eal_parse_proc_type(optarg);
break;

+#ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
+   case OPT_CONTIG_MEM_NUM:
+   conf->contig_mem = 1;
+   break;
+#endif
+
case OPT_MASTER_LCORE_NUM:
if (eal_parse_master_lcore(optarg) < 0) {
RTE_LOG(ERR, EAL, "invalid parameter for --"
diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h
index 5f1367e..c02220d 100644
--- a/lib/librte_eal/common/eal_internal_cfg.h
+++ b/lib/librte_eal/common/eal_internal_cfg.h
@@ -66,6 +66,7 @@ struct internal_config {
volatile unsigned no_hugetlbfs;   /**< true to disable hugetlbfs */
unsigned hugepage_unlink; /**< true to unlink backing files */
volatile unsigned xen_dom0_support; /**< support app running on Xen Dom0*/
+   volatile unsigned contig_mem; /**< true to create contiguous eal memory */
volatile unsigned no_pci; /**< true to disable PCI */
volatile unsigned no_hpet;/**< true to disable HPET */
volatile unsigned vmware_tsc_map; /**< true to use VMware TSC mapping
diff --git a/lib/librte_eal/common/eal_options.h b/lib/librte_eal/common/eal_options.h
index a881c62..a58e371 100644
--- a/lib/librte_eal/common/eal_options.h
+++ b/lib/librte_eal/common/eal_options.h
@@ -55,6 +55,8 @@ enum {
OPT_HUGE_DIR_NUM,
 #define OPT_HUGE_UNLINK   "huge-unlink"
OPT_HUGE_UNLINK_NUM,
+#define OPT_CONTIG_MEM"contig-mem"
+   OPT_CONTIG_MEM_NUM,
 #define OPT_LCORES"lcores"
OPT_LCORES_NUM,
 #define OPT_LOG_LEVEL "log-level"
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 846fd31..63e5296 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -851,9 +851,21 @@ unmap_unneeded_hugepages(struct hugepage_file *hugepg_tbl,
/* find a page that matches the criteria */
if ((hp->size == hpi[size].hugepage_sz) &&
(hp->socket_id == (int) socket)) {
+#ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
+   int nr_pg_left = hpi[size].num_pages[socket] - pages_found;

+   /*
+* if contig_mem is enabled and the page doesn't have
+* requested space, unmap it.
+* Also, if we skipped enough pages, unmap the rest.
+*/
+   if ((pages_found == hpi[size].num_pages[socket]) ||
+   ((internal_config.contig_mem) &&
+   (hp->repeated < nr_pg_left))) {
+#else
 

[dpdk-dev] [PATCH v1 2/2] virtio: Extend virtio-net PMD to support container environment

2015-12-16 Thread Tetsuya Mukawa
The patch adds a new virtio-net PMD configuration that allows the PMD to
work on the host as if the PMD were in a VM.
Here is the new configuration option for the virtio-net PMD.
 - CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE
To use this mode, EAL needs physically contiguous memory. To allocate
such memory, enable the option below, and add the "--contig-mem" option to
the application command line.
 - CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS

To prepare a virtio-net device on the host, users need to invoke a QEMU
process in the special qtest mode. This mode is mainly used for testing QEMU
devices from an outside process. In this mode, no guest runs.
Here is the QEMU command line.

 $ qemu-system-x86_64 \
     -machine pc-i440fx-1.4,accel=qtest \
     -display none -qtest-log /dev/null \
     -qtest unix:/tmp/socket,server \
     -netdev type=tap,script=/etc/qemu-ifup,id=net0,queues=1 \
     -device virtio-net-pci,netdev=net0,mq=on \
     -chardev socket,id=chr1,path=/tmp/ivshmem,server \
     -device ivshmem,size=1G,chardev=chr1,vectors=1

* A QEMU process is needed per port.
* In most cases, just using the above command is enough.
* vhost backends like vhost-net and vhost-user can be specified.
* Only the "pc-i440fx-1.4" machine has been checked, but it may work with
  other machines. It depends on whether the machine has a piix3 south bridge.
  If the machine doesn't have one, the virtio-net PMD cannot receive status
  changed interrupts.
* "--enable-kvm" should not be added to the QEMU command line.

After invoking QEMU, the PMD can connect to the QEMU process using unix
domain sockets. Over these sockets, the virtio-net, ivshmem and piix3
devices in QEMU are probed by the PMD.
Here is an example command line.

 $ testpmd -c f -n 1 -m 1024 --contig-mem \
 --vdev="eth_virtio_net0,qtest=/tmp/socket,ivshmem=/tmp/ivshmem" \
 -- --disable-hw-vlan --txqflags=0xf00 -i

Please specify the same unix domain sockets and memory size in both the QEMU
and DPDK command lines, as above.
The shared memory size should be a power of 2, because ivshmem only accepts
such memory sizes.
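
A quick way to sanity-check the two sizes before launching both processes
is sketched below. The helper names are made up for illustration; the
only assumptions are the ones stated above (equal sizes, power of 2) and
that the DPDK "-m" value is given in megabytes, as in the testpmd example.

```c
#include <stdint.h>

/* Returns 1 if 'size' is a non-zero power of two. */
static int is_power_of_two(uint64_t size)
{
	return size != 0 && (size & (size - 1)) == 0;
}

/*
 * Hypothetical helper: checks that the ivshmem size (in bytes) and the
 * DPDK "-m" value (in megabytes) describe the same power-of-two size.
 */
static int shared_mem_sizes_ok(uint64_t ivshmem_bytes, uint64_t dpdk_mbytes)
{
	return is_power_of_two(ivshmem_bytes) &&
	       ivshmem_bytes == (dpdk_mbytes << 20);
}
```

For the command lines above, size=1G and -m 1024 pass this check.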

Also, the "--contig-mem" option is needed for the PMD, as above. This option
allocates contiguous memory and creates one hugepage file on hugetlbfs.
If there is not enough contiguous memory, initialization will fail.

This contiguous memory is used as shared memory between the DPDK application
and the ivshmem device in QEMU.

Signed-off-by: Tetsuya Mukawa 
---
 config/common_linuxapp             |    1 +
 drivers/net/virtio/Makefile        |    4 +
 drivers/net/virtio/qtest.c         | 1107 
 drivers/net/virtio/virtio_ethdev.c |  341 ++-
 drivers/net/virtio/virtio_ethdev.h |   12 +
 drivers/net/virtio/virtio_pci.h    |   25 +
 6 files changed, 1461 insertions(+), 29 deletions(-)
 create mode 100644 drivers/net/virtio/qtest.c

diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74bc515..eaa720c 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -269,6 +269,7 @@ CONFIG_RTE_LIBRTE_PMD_SZEDATA2=n
 # Compile burst-oriented VIRTIO PMD driver
 #
 CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
+CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_INIT=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 43835ba..697e629 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -52,6 +52,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c

+ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE),y)
+   SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += qtest.c
+endif
+
 # this lib depends upon:
 DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
 DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_mempool lib/librte_mbuf
diff --git a/drivers/net/virtio/qtest.c b/drivers/net/virtio/qtest.c
new file mode 100644
index 0000000..4ffdefb
--- /dev/null
+++ b/drivers/net/virtio/qtest.c
@@ -0,0 +1,1107 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2015 IGEL Co., Ltd. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the
+ *   distribution.
+ * * Neither the name of IGEL Co., Ltd. nor the names of its
+ *   contributors may be used to endorse or promote products derived
+ *   from this software without specific prior written permission.
+ *
+ *   THI

[dpdk-dev] [ [PATCH v2] 00/13] Add virtio support in arm/arm64

2015-12-16 Thread David Marchand
Hello Santosh,

On Wed, Dec 16, 2015 at 8:48 AM, Santosh Shukla  wrote:

> Hi Yuanhan, Huawei and Others.
>
> I got arch specific review comment from arm maintainers and I am waiting
> for your review feedback on virtio specific patches? Is v3 patch and virtio
> iopci bar mapping to user-space approach ok with all? Thanks.
>

Please, don't forget to CC me as well.

I did something similar for powerpc, but there was no need to add any
remapping in igb_uio.
Is there something specific to arm that makes it impossible to reuse
resources mmapping from /sys ?
I can send a patch that should do the job for eal.

I am a bit short on time and will be on holidays for two weeks, so I can't
look at these patches before January.


Regards,
-- 
David Marchand


[dpdk-dev] DPDK OVS on Ubuntu 14.04# Issue's Resolved# Successfully setup DPDK OVS with vhostuser

2015-12-16 Thread Abhijeet Karve
Hi Przemek,


We have configured the accelerated data path between a physical interface 
and the VM using openvswitch netdev-dpdk with vhost-user support. I am 
calling a VM created with this special data path and the vhost library a 
DPDK instance. 

If we assign an IP address manually to the newly created Cirros VM instance, 
we are able to make 2 VMs communicate on the same compute node. Otherwise, 
no IP address is assigned through DHCP, even though DHCP is on the compute 
node only.

Yes, it's a compute + controller node setup, and we are using the following 
software platform on the compute node:
_
Openstack: Kilo
Distribution: Ubuntu 14.04
OVS Version: 2.4.0
DPDK 2.0.0
_

We are following the Intel guide 
https://software.intel.com/en-us/blogs/2015/06/09/building-vhost-user-for-ovs-today-using-dpdk-200

Running "ovs-vsctl show" on the compute node shows the output below:
_
ovs-vsctl show
c2ec29a5-992d-4875-8adc-1265c23e0304
    Bridge br-ex
        Port phy-br-ex
            Interface phy-br-ex
                type: patch
                options: {peer=int-br-ex}
        Port br-ex
            Interface br-ex
                type: internal
    Bridge br-tun
        fail_mode: secure
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
    Bridge br-int
        fail_mode: secure
        Port "qvo0ae19a43-b6"
            tag: 2
            Interface "qvo0ae19a43-b6"
        Port br-int
            Interface br-int
                type: internal
        Port "qvo31c89856-a2"
            tag: 1
            Interface "qvo31c89856-a2"
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port int-br-ex
            Interface int-br-ex
                type: patch
                options: {peer=phy-br-ex}
        Port "qvo97fef28a-ec"
            tag: 2
            Interface "qvo97fef28a-ec"
    Bridge br-dpdk
        Port br-dpdk
            Interface br-dpdk
                type: internal
    Bridge "br0"
        Port "br0"
            Interface "br0"
                type: internal
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
        Port "vhost-user-2"
            Interface "vhost-user-2"
                type: dpdkvhostuser
        Port "vhost-user-0"
            Interface "vhost-user-0"
                type: dpdkvhostuser
        Port "vhost-user-1"
            Interface "vhost-user-1"
                type: dpdkvhostuser
    ovs_version: "2.4.0"
root@dpdk:~# 
_

The OpenFlow flows on the bridges in the compute node are as below:
_
root@dpdk:~# ovs-ofctl dump-flows br-tun
NXST_FLOW reply (xid=0x4):
 cookie=0x0, duration=71796.741s, table=0, n_packets=519, n_bytes=33794, 
idle_age=19982, hard_age=65534, priority=1,in_port=1 actions=resubmit(,2)
 cookie=0x0, duration=71796.700s, table=0, n_packets=0, n_bytes=0, 
idle_age=65534, hard_age=65534, priority=0 actions=drop
 cookie=0x0, duration=71796.649s, table=2, n_packets=0, n_bytes=0, 
idle_age=65534, hard_age=65534, 
priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 
actions=resubmit(,20)
 cookie=0x0, duration=71796.610s, table=2, n_packets=519, n_bytes=33794, 
idle_age=19982, hard_age=65534, 
priority=0,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 
actions=resubmit(,22)
 cookie=0x0, duration=71794.631s, table=3, n_packets=0, n_bytes=0, 
idle_age=65534, hard_age=65534, priority=1,tun_id=0x5c 
actions=mod_vlan_vid:2,resubmit(,10)
 cookie=0x0, duration=71794.316s, table=3, n_packets=0, n_bytes=0, 
idle_age=65534, hard_age=65534, priority=1,tun_id=0x57 
actions=mod_vlan_vid:1,resubmit(,10)
 cookie=0x0, duration=71796.565s, table=3, n_packets=0, n_bytes=0, 
idle_age=65534, hard_age=65534, priority=0 actions=drop
 cookie=0x0, duration=71796.522s, table=4, n_packets=0, n_bytes=0, 
idle_age=65534, hard_age=65534, priority=0 actions=drop
 cookie=0x0, duration=71796.481s, table=10, n_packets=0, n_bytes=0, 
idle_age=65534, hard_age=65534, priority=1 
actions=learn(table=20,hard_timeout=300,priority=1,NXM_OF_VLAN_TCI[0..11],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:0->NXM_OF_VLAN_TCI[],load:NXM_NX_TUN_ID[]->NXM_NX_TUN_ID[],output:NXM_OF_IN_PORT[]),output:1
 cookie=0x0, duration=71796.439s, table=20, n_packets=0, n_bytes=0, 
idle_age=65534, hard_age=65534, priority=0 actions=resubmit(,22)
 cookie=0x0, duration=71796.398s, table=22, n_packets=519, n_bytes=33794, 
idle_age=19982, hard_age=65534, priority=0 actions=drop
root@dpdk:~# 
root@dpdk:~# 
root@dpdk:~# 
root@dpdk:~# ovs-ofctl dump-flows br-tun
int NXST_FLOW reply (xid=0x4):
 cookie=0x0, duration=71801.275s, table=0, n_packets=0, n_bytes=0, 
idle_age=65534, hard_age=65534, priority=2,in_port=10 actions=drop
 cookie=0x0, duration=7

[dpdk-dev] [PATCH] doc: fix missing link target

2015-12-16 Thread John McNamara
Fix missing link in the Linux GSG, accidentally removed
in previous merge:

  WARNING: undefined label: linux_gsg_compiling_dpdk
  Fixes: 29c673401c4d ("doc: improve Linux guide layout")

Signed-off-by: John McNamara 
---
 doc/guides/linux_gsg/build_dpdk.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/doc/guides/linux_gsg/build_dpdk.rst b/doc/guides/linux_gsg/build_dpdk.rst
index 1f4c1f7..198c0b6 100644
--- a/doc/guides/linux_gsg/build_dpdk.rst
+++ b/doc/guides/linux_gsg/build_dpdk.rst
@@ -28,6 +28,8 @@
 (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

+.. _linux_gsg_compiling_dpdk:
+
 Compiling the DPDK Target from Source
 =====================================

-- 
2.5.0



[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Bruce Richardson
On Mon, Dec 14, 2015 at 05:36:13PM -0500, Matthew Hall wrote:
> On Mon, Dec 14, 2015 at 04:29:41PM -0500, Kyle Larose wrote:
> > I've seen lots of ideas and options tossed around which would solve
> > some or all of the above items, but nobody actually committing to
> > anything. What can we do to actually agree on a solution to go and
> > implement? I'm relatively new to the community, so I don't really know
> > how this stuff works. Do people typically form a working group where
> > they go off and discuss the problem, and then come back to the main
> > community with a proposal? Or do people just submit RFCs independently
> > with their own ideas?
> > 
> > Thanks,
> > Kyle
> 
> I am getting the impression of a misplaced sense of urgency / panic. I don't 
> think anybody came up with a reason why we have to answer all these questions 
> tremendously quickly. It will take some more time, particularly with the 
> holidays, for the developers to finish the last bug fixes on the current 
> release before they have time to discuss 2.3 features.
> 
> When that happens, someone working on DPDK full time will be identified as 
> the leader for the feature, that will lead the effort on PCAP, and help us 
> formulate the plan. Until then, what we really could use at this point is not 
> necessarily more writings and speculation, but an answer on some key tech 
> questions, particularly from some kernel guys:
> 
> 1) How do we get the pcap filter string and/or BPF opcode vector from libpcap 
> / tcpdump / tshark / wireshark, into the DPDK application? There we can 
> compile it using the user-space bpfjit, so we can filter the packets at very 
> high speeds and not end up breaking everything doing a ton of stupid copies 
> when somebody does a capture of one flow on his i40e device or such. libpcap 
> is crappy about this, as it sends it all over syscalls which are always 
> assuming the kernel is on the other end, which is a bad assumption on their 
> part but many decades old and not so easy to fix.
> 
> 2) How do we get the matched packets back out to the extcap or libpcap? From 
> what I saw extcap is tshark / wireshark only, which are 1) GPL licensed in 
> various ways, 2) not as widely used as libpcap. So using only extcap might be 
> kind of crappy.
> 
> 3) For libpcap to work, maybe it will help if some of our kernel guys can 
> help us find out how to "detect" the kernel put a BPF capture filter onto a 
> TUN / 
> TAP interface, and copy that filter to the DPDK app. Then, take any matched 
> packets and write them back onto the TUN / TAP. This would also be super 
> efficient and work with more off-the-shelf tools besides just tshark / 
> wireshark.
> 
> If we don't find the answers for these items I don't think we have a path to 
> a working solution, forgetting about all the nice-to-have points such as UX 
> issues, troubleshooting, debugging, etc.
> 
> Matthew.

Hi,

we are currently doing some investigation and prototyping for this feature.
Our current thinking is the following:
* to allow dynamic control of the filtering, we are thinking of making use of
  the multi-process infrastructure in DPDK. A secondary process can attach to a
  primary at runtime and provide the packet filtering and dumping capability.
* ideally we want to create a generic packet mirroring callback inside the EAL,
  that can be set up to mirror packets going through Rx/Tx on an ethdev.
* using this, packets being received on the port to be monitored are sent via
  an rte_ring (ring ethdev) to the secondary process which takes those packets
  and does any filtering on them. [This would be where BPF could fit into
  things, but it's not something we have looked at yet.]
* initially we plan to have the secondary process then write packets to a pcap
  file using a pcap PMD, but down the road if we get other PMDs, like a KNI PMD
  or a TAP device PMD, those could be used as targets instead.

This implementation we hope should provide enough hooks to enable the standard
tools to be used for monitoring and capturing packets. We will send out draft
implementation code for various parts of this as soon as we have it.
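
As a rough illustration of the data flow in the bullets above, here is a
plain-C model: an Rx hook mirrors packet pointers into a fixed-size ring, and
a consumer (standing in for the secondary process) drains the ring and applies
a placeholder filter. Everything in it is hypothetical. The names `struct pkt`,
`struct mirror_ring`, `mirror_rx_hook` and `mirror_drain` are not DPDK APIs,
and the ring is a single-threaded model without the memory ordering that a
real rte_ring shared between processes would need.

```c
#include <stddef.h>
#include <stdint.h>

#define MIRROR_RING_SIZE 8u /* assumption: must be a power of two */

/* Hypothetical stand-in for an mbuf: a payload pointer and a length. */
struct pkt {
    const uint8_t *data;
    uint16_t len;
};

/* Hypothetical ring of packet pointers (single-threaded model only). */
struct mirror_ring {
    const struct pkt *slots[MIRROR_RING_SIZE];
    uint32_t head; /* next slot to write (producer side) */
    uint32_t tail; /* next slot to read (consumer side) */
};

/* Enqueue one packet; returns 1 on success, 0 if the ring is full. */
static int mirror_enqueue(struct mirror_ring *r, const struct pkt *p)
{
    if (r->head - r->tail == MIRROR_RING_SIZE)
        return 0; /* full: drop the mirror copy, never stall the Rx path */
    r->slots[r->head & (MIRROR_RING_SIZE - 1)] = p;
    r->head++;
    return 1;
}

/* Dequeue one packet; returns NULL if the ring is empty. */
static const struct pkt *mirror_dequeue(struct mirror_ring *r)
{
    if (r->head == r->tail)
        return NULL;
    return r->slots[r->tail++ & (MIRROR_RING_SIZE - 1)];
}

/*
 * Rx hook: called with a burst of received packets. It mirrors as many as
 * fit into the ring and returns how many were mirrored. The burst itself is
 * left untouched for the application.
 */
static unsigned mirror_rx_hook(struct mirror_ring *r,
                               const struct pkt *burst, unsigned n)
{
    unsigned i, mirrored = 0;

    for (i = 0; i < n; i++)
        mirrored += (unsigned)mirror_enqueue(r, &burst[i]);
    return mirrored;
}

/*
 * Consumer side: drain the ring and count packets of at least min_len bytes.
 * The length check is only a placeholder for a real filter such as BPF.
 */
static unsigned mirror_drain(struct mirror_ring *r, uint16_t min_len)
{
    const struct pkt *p;
    unsigned matched = 0;

    while ((p = mirror_dequeue(r)) != NULL)
        if (p->len >= min_len)
            matched++;
    return matched;
}
```

In this model the Rx path never blocks: when the consumer falls behind, mirror
copies are dropped, which is one possible policy and not necessarily the one
the prototype will use.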

Additional feedback welcome, as always. :-)

Regards,
/Bruce



[dpdk-dev] [PATCH] doc: fix missing link target

2015-12-16 Thread Iremonger, Bernard
> -----Original Message-----
> From: Mcnamara, John
> Sent: Wednesday, December 16, 2015 10:43 AM
> To: dev at dpdk.org
> Cc: Iremonger, Bernard ; Mcnamara, John
> 
> Subject: [PATCH] doc: fix missing link target
> 
> Fix missing link in the Linux GSG, accidentally removed in previous merge:
> 
>   WARNING: undefined label: linux_gsg_compiling_dpdk
>   Fixes: 29c673401c4d ("doc: improve Linux guide layout")
> 
> Signed-off-by: John McNamara 
Acked-by: Bernard Iremonger 


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Arnon Warshavsky
2 points from our experience in saving pcap files from a dpdk 10G fire hose:

1)
Our capture module provides a small "bit-vector" to the code that handles
the packets.
Since our packet processing code is already finding out basic stuff about
the packet traversing it (is it IPv4? v6? is it TCP? is it fragmented?
etc.), it sets the relevant bits ON as it goes, so that the capture module
can later quickly (by masking against the desired filters) decide if a packet
needs to be captured.
The point is: when a capture layer exposes a slim API that lets it utilize
info coming from other modules, it's easier and less expensive to handle
the fire hose.

2)
In many cases we are interested in capturing complete TCP flows, or at
least the first X packets of them.
In this case, a more expensive filter may be applied only on the SYN packet;
when it matches, it turns ON a bit in the TCP flow's application context that
says we want to capture any packet falling under this tuple.
The point is: application filters at different costs are applied on different
packet types, utilizing the mask from the previous bullet.

Such a model should obviously need to be optional on a formal capture layer,
but when dealing with a fire hose - I find it very useful.
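
The two points above can be sketched roughly as follows. This is a minimal
illustration under assumptions, not the actual capture module: the flag names,
`struct capture_filter`, `struct flow_ctx` and the callback type are all made
up for the example, and `example_expensive_filter` is only a placeholder for
whatever costly match an application would run on the SYN packet.

```c
#include <stdint.h>

/* Hypothetical classification bits, set by the packet processing path. */
enum pkt_flag {
    PKT_F_IPV4    = 1u << 0,
    PKT_F_IPV6    = 1u << 1,
    PKT_F_TCP     = 1u << 2,
    PKT_F_FRAG    = 1u << 3,
    PKT_F_TCP_SYN = 1u << 4
};

/* Hypothetical filter: every 'required' bit set, no 'forbidden' bit set. */
struct capture_filter {
    uint32_t required;
    uint32_t forbidden;
};

/* Hypothetical per-flow context holding the "capture this flow" bit. */
struct flow_ctx {
    int capture;       /* non-zero once the flow was selected on its SYN */
    uint32_t captured; /* packets captured so far for this flow */
    uint32_t limit;    /* capture at most this many (the "first X") */
};

/* Expensive match, meant to run on SYN packets only. */
typedef int (*expensive_filter_fn)(const void *pkt);

/* Point 1: cheap per-packet decision, just two mask operations. */
static int filter_match(const struct capture_filter *f, uint32_t pkt_flags)
{
    return (pkt_flags & f->required) == f->required &&
           (pkt_flags & f->forbidden) == 0;
}

/*
 * Point 2: run the expensive filter only on a SYN of a flow that is not yet
 * selected; afterwards every packet of the flow is captured by testing the
 * flow bit, up to the per-flow limit.
 */
static int flow_should_capture(struct flow_ctx *flow, uint32_t pkt_flags,
                               const void *pkt, expensive_filter_fn expensive)
{
    if ((pkt_flags & PKT_F_TCP_SYN) && !flow->capture && expensive(pkt))
        flow->capture = 1;
    if (!flow->capture || flow->captured >= flow->limit)
        return 0;
    flow->captured++;
    return 1;
}

/* Placeholder for an expensive match; it only inspects the first byte. */
static int example_expensive_filter(const void *pkt)
{
    return ((const uint8_t *)pkt)[0] == 0x45;
}
```

The cost split is the point of the design: the per-packet path does a couple
of mask tests, while the expensive callback runs once per flow at most.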

/Arnon

-

On Wed, Dec 16, 2015 at 12:45 PM, Bruce Richardson <
bruce.richardson at intel.com> wrote:

> On Mon, Dec 14, 2015 at 05:36:13PM -0500, Matthew Hall wrote:
> > On Mon, Dec 14, 2015 at 04:29:41PM -0500, Kyle Larose wrote:
> > > I've seen lots of ideas and options tossed around which would solve
> > > some or all of the above items, but nobody actually committing to
> > > anything. What can we do to actually agree on a solution to go and
> > > implement? I'm relatively new to the community, so I don't really know
> > > how this stuff works. Do people typically form a working group where
> > > they go off and discuss the problem, and then come back to the main
> > > community with a proposal? Or do people just submit RFCs independently
> > > with their own ideas?
> > >
> > > Thanks,
> > > Kyle
> >
> > I am getting the impression of a misplaced sense of urgency / panic. I
> don't
> > think anybody came up with a reason why we have to answer all these
> questions
> > tremendously quickly. It will take some more time, particularly with the
> > holidays, for the developers to finish the last bug fixes on the current
> > release before they have time to discuss 2.3 features.
> >
> > When that happens, someone working on DPDK full time will be identified
> as the
> > leader for the feature, that will lead the effort on PCAP, and help us
> > formulate the plan. Until then, what we really could use at this point
> is not
> > necessarily more writings and speculation, but an answer on some key tech
> > questions, particularly from some kernel guys:
> >
> > 1) How do we get the pcap filter string and/or BPF opcode vector from
> libpcap
> > / tcpdump / tshark / wireshark, into the DPDK application? There we can
> > compile it using the user-space bpfjit, so we can filter the packets at
> very
> > high speeds and not end up breaking everything doing a ton of stupid
> copies
> > when somebody does a capture of one flow on his i40e device or such.
> libpcap
> > is crappy about this, as it sends it all over syscalls which are always
> > assuming the kernel is on the other end, which is a bad assumption on
> their
> > part but many decades old and not so easy to fix.
> >
> > 2) How do we get the matched packets back out to the extcap or libpcap?
> From
> > what I saw extcap is tshark / wireshark only, which are 1) GPL licensed
> in
> > various ways, 2) not as widely used as libpcap. So using only extcap
> might be
> > kind of crappy.
> >
> > 3) For libpcap to work, maybe it will help if some of our kernel guys
> can help
> > us find out how to "detect" the kernel put a BPF capture filter onto a
> TUN /
> > TAP interface, and copy that filter to the DPDK app. Then, take any
> matched
> > packets and write them back onto the TUN / TAP. This would also be super
> > efficient and work with more off-the-shelf tools besides just tshark /
> > wireshark.
> >
> > If we don't find the answers for these items I don't think we have a
> path to a
> > working solution, forgetting about all the nice-to-have points such as UX
> > issues, troubleshooting, debugging, etc.
> >
> > Matthew.
>
> Hi,
>
> we are currently doing some investigation and prototyping for this feature.
> Our current thinking is the following:
> * to allow dynamic control of the filtering, we are thinking of making use
> of
>   the multi-process infrastructure in DPDK. A secondary process can attach
> to a
>   primary at runtime and provide the packet filtering and dumping
> capability.
> * ideally we want to create a generic packet mirroring callback inside the
> EAL,
>   that can be set up to mirror packets going through Rx/Tx on an ethdev.
> * using this, packets being received on the port to be monitored are sent
> via
>   an rte_ring (ring ethd

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Morten Brørup
Bruce,

This doesn't really sound like tcpdump to me; it sounds like port mirroring.

Your suggestion is limited to physical ports only, and cannot be attached 
further inside the application, e.g. for mirroring packets related to a 
specific VLAN.

Furthermore, it doesn't sound like the filtering part scales well. Consider a 
fully loaded 40 Gbit/s port. You would need to copy all packets into a single 
rte_ring to the attached filtering process, which would then require its own 
set of lcores to probably discard most of these packets when filtering. I agree 
with Matthew that the filtering needs to happen as close to the source as 
possible, and must be scalable to multiple lcores.

On the positive side, your idea has the advantage that the filter can be any 
application, and is not limited to BPF. However if the purpose is "tcpdump", we 
should probably consider BPF, which is the type of filtering offered by tcpdump.

I would prefer having a BPF library available that the application can use at 
any point, either at the lowest level (when receiving/transmitting Ethernet 
packets) or at a higher level (e.g. when working with packets that go into or 
come out of a tunnel). The BPF library should implement packet length and 
relevant ancillary data, such as SKF_AD_VLAN_TAG etc. based on metadata in the 
mbuf.

Transferring a BPF filter from an outside application could be done by using a 
simple text format, e.g. the output format of "tcpdump -ddd". This also opens 
an easy roadmap for Wireshark integration by simply extending extcap to include 
such a BPF filter format.
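
To make the text format concrete, here is a small C sketch of a parser for
"tcpdump -ddd" style input: a decimal instruction count on the first line,
then one "code jt jf k" line per instruction, all in decimal. The struct and
function names (`struct bpf_insn_txt`, `parse_bpf_ddd`) are made up for this
example; the four fields follow the classic BPF instruction layout. It only
parses the text and does not validate or run the program.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container with the four classic BPF instruction fields. */
struct bpf_insn_txt {
    uint16_t code; /* opcode */
    uint8_t jt;    /* jump offset if true */
    uint8_t jf;    /* jump offset if false */
    uint32_t k;    /* generic operand */
};

/*
 * Parse "tcpdump -ddd" style text into 'out'.
 * Returns the number of instructions parsed, or -1 if the input is
 * malformed, a field is out of range, or the program needs more than
 * 'max' entries.
 */
static int parse_bpf_ddd(const char *text, struct bpf_insn_txt *out, int max)
{
    unsigned count, code, jt, jf, k;
    int used = 0, i;

    if (max < 0)
        return -1;
    /* First line: instruction count. %n records how much was consumed. */
    if (sscanf(text, "%u%n", &count, &used) != 1 || count > (unsigned)max)
        return -1;
    text += used;

    for (i = 0; i < (int)count; i++) {
        if (sscanf(text, "%u %u %u %u%n", &code, &jt, &jf, &k, &used) != 4)
            return -1;
        if (code > UINT16_MAX || jt > UINT8_MAX || jf > UINT8_MAX)
            return -1;
        out[i].code = (uint16_t)code;
        out[i].jt = (uint8_t)jt;
        out[i].jf = (uint8_t)jf;
        out[i].k = (uint32_t)k;
        text += used;
    }
    return (int)count;
}
```

As a usage example, a four-instruction program of the kind tcpdump emits for a
simple filter such as "ip" could be passed as the string
"4\n40 0 0 12\n21 0 1 2048\n6 0 0 262144\n6 0 0 0\n"; the exact numbers depend
on the tcpdump version, link type and snapshot length, so treat them as
illustrative.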


Lots of negativity above. I very much like the idea of attaching the secondary 
process and going through an rte_ring. This allows the secondary process to 
pass the filtered and captured packets on in any format it likes to any 
destination it likes.


Med venlig hilsen / kind regards
- Morten Brørup

-----Original Message-----
From: Bruce Richardson [mailto:bruce.richard...@intel.com] 
Sent: 16. december 2015 11:45

Hi,

we are currently doing some investigation and prototyping for this feature.
Our current thinking is the following:
* to allow dynamic control of the filtering, we are thinking of making use of
  the multi-process infrastructure in DPDK. A secondary process can attach to a
  primary at runtime and provide the packet filtering and dumping capability.
* ideally we want to create a generic packet mirroring callback inside the EAL,
  that can be set up to mirror packets going through Rx/Tx on an ethdev.
* using this, packets being received on the port to be monitored are sent via
  an rte_ring (ring ethdev) to the secondary process which takes those packets
  and does any filtering on them. [This would be where BPF could fit into
  things, but it's not something we have looked at yet.]
* initially we plan to have the secondary process then write packets to a pcap
  file using a pcap PMD, but down the road if we get other PMDs, like a KNI PMD
  or a TAP device PMD, those could be used as targets instead.

This implementation we hope should provide enough hooks to enable the standard 
tools to be used for monitoring and capturing packets. We will send out draft 
implementation code for various parts of this as soon as we have it.

Additional feedback welcome, as always. :-)

Regards,
/Bruce




[dpdk-dev] [ [PATCH v2] 00/13] Add virtio support in arm/arm64

2015-12-16 Thread Santosh Shukla
On Wed, Dec 16, 2015 at 2:17 PM, David Marchand
 wrote:
> Hello Santosh,
>
> On Wed, Dec 16, 2015 at 8:48 AM, Santosh Shukla  wrote:
>>
>> Hi Yuanhan, Huawei and Others.
>>
>> I got arch specific review comment from arm maintainers and I am waiting
>> for your review feedback on virtio specific patches? Is v3 patch and
>> virtio
>> iopci bar mapping to user-space approach ok with all? Thanks.
>
>
> Please, don't forget to CC me as well.
>
> I did something similar for powerpc, but there was no need to add any
> remapping in igb_uio.

Is it for mapping the iopci bar? Does that include virtio?

For a detailed explanation, refer to [1].

[1] http://dpdk.org/dev/patchwork/patch/9365/

> Is there something specific to arm that makes it impossible to reuse
> resources mmapping from /sys ?

sysfs won't map resource0; it could map resource1, i.e. iomem, but the
virtio header resides in the iopci bar region, so the iomem mapping won't be
effective (invalid address). For that, someone has to explicitly map the iopci
region, hence this code/patch.

> I can send a patch that should do the job for eal.
>

Please send it then. My patches have been waiting for review for quite a long
time. It would be good if you could send it now.

> I am a bit short on time and will be on holidays for two weeks, so I can't
> look at these patches before January.
>
>
> Regards,
> --
> David Marchand


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Bruce Richardson
On Wed, Dec 16, 2015 at 12:40:43PM +0100, Morten Brørup wrote:
> Bruce,
> 
> This doesn't really sound like tcpdump to me; it sounds like port mirroring.

It's actually a bit of both, in my opinion. It's designed to allow basic
mirroring of traffic on a port so that the traffic can be sent to a tcpdump
destination. By going with a more generic approach, we hope to enable more
possible use cases than just focusing on TCP.


> 
> Your suggestion is limited to physical ports only, and cannot be attached 
> further inside the application, e.g. for mirroring packets related to a 
> specific VLAN.

Yes, the lack of attachment inside the app is a limitation. There are two types
of scenarios that could be considered for packet capture:
* ones where the application can be modified to do its own filtering and
capturing.
* ones where you want a generic capture mechanism which can be used on any
application without modification.
We have chosen to focus more on the second one, as that is where a generic
solution for DPDK is likely to lie. For the first case, the application writer
himself knows the type of traffic and how best to capture and filter it, so I
don't think a generic one-size-fits-all solution is possible. [Though a couple
of helper libraries may be of use]

As for physical ports, the scheme should work for any ethdev - why do you see
it only being limited to physical ports? What would you want to see monitored
that we are missing?

> 
> Furthermore, it doesn't sound like the filtering part scales well. Consider a 
> fully loaded 40 Gbit/s port. You would need to copy all packets into a single 
> rte_ring to the attached filtering process, which would then require its own 
> set of lcores to probably discard most of these packets when filtering. I 
> agree with Matthew that the filtering needs to happen as close to the source 
> as possible, and must be scalable to multiple lcores.

Without modifying the application itself to do its own filtering, I suspect
scalability is always going to be a problem. That being said, there is no
particular reason why a single rte_ring needs to be used - we could allow one
ring per NIC queue, for instance. The trouble with filtering at the source
itself is that you put extra load on the IO cores. By using a ring, we put the
filtering load on extra cores in a secondary process, which can be scaled by
the user without touching the main app.

> 
> On the positive side, your idea has the advantage that the filter can be any 
> application, and is not limited to BPF. However if the purpose is "tcpdump", 
> we should probably consider BPF, which is the type of filtering offered by 
> tcpdump.

Having this work with any application is one of our primary targets here. The
app author should not have to worry too much about getting basic debug support.
Even if it doesn't work at 40G small packet rates, you can get a lot of benefit
from a scheme that provides functional debugging for an app. Obviously, though,
we aim to make this as scalable as possible, which is why we want to allow
filtering in userspace before sending packets externally to DPDK.

> 
> I would prefer having a BPF library available that the application can use at 
> any point, either at the lowest level (when receiving/transmitting Ethernet 
> packets) or at a higher level (e.g. when working with packets that go into or 
> come out of a tunnel). The BPF library should implement packet length and 
> relevant ancillary data, such as SKF_AD_VLAN_TAG etc. based on metadata in 
> the mbuf.
> 
> Transferring a BPF filter from an outside application could be done by using 
> a simple text format, e.g. the output format of "tcpdump -ddd". This also 
> opens an easy roadmap for Wireshark integration by simply extending extcap to 
> include such a BPF filter format.
> 
> 
> Lots of negativity above. I very much like the idea of attaching the 
> secondary process and going through an rte_ring. This allows the secondary 
> process to pass the filtered and captured packets on in any format it likes 
> to any destination it likes.

Good, so we're not completely off-base here. :-)

/Bruce

> 
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 
> -----Original Message-----
> From: Bruce Richardson [mailto:bruce.richardson at intel.com] 
> Sent: 16. december 2015 11:45
> 
> Hi,
> 
> we are currently doing some investigation and prototyping for this feature.
> Our current thinking is the following:
> * to allow dynamic control of the filtering, we are thinking of making use of
>   the multi-process infrastructure in DPDK. A secondary process can attach to
>   a primary at runtime and provide the packet filtering and dumping capability.
> * ideally we want to create a generic packet mirroring callback inside the
>   EAL, that can be set up to mirror packets going through Rx/Tx on an ethdev.
> * using this, packets being received on the port to be monitored are sent via
>   an rte_ring (ring ethdev) to the secondary proce

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Morten Brørup
Great idea, Arnon. Let's look at existing use cases from the real world.

Our company makes network appliances. They are not running GNU/Linux or 
similar, so they do not offer a BASH prompt or any other BSD/Linux like command 
line interface.

Here's a simplified description of how the user interacts with the packet 
capture feature in our appliances:

Our GUI allows you to input a filter, e.g. a MAC address, an IP address or a 
compiled BPF program as a single hexadecimal string (roughly "tcpdump -ddd" 
output), and start capturing. The captured packets can then be downloaded from 
the GUI in pcap format.

The other packet filters our appliance needs, e.g. DHCP, ARP etc., are not 
provided by the user (or by any other external interaction), but are hardcoded 
in C, just like any other part of our firmware.

Med venlig hilsen / kind regards

Morten Brørup
CTO

SmartShare Systems A/S
Tonsbakken 16-18
DK-2740 Skovlunde
Denmark

Office  +45 70 20 00 93
Direct  +45 89 93 50 22
Mobile  +45 25 40 82 12

mb at smartsharesystems.com
www.smartsharesystems.com

From: Arnon Warshavsky [mailto:ar...@qwilt.com] 
Sent: 16. december 2015 12:37
To: Bruce Richardson
Cc: Matthew Hall; dev at dpdk.org; Morten Brørup
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3



2 points from our experience in saving pcap files from a dpdk 10G fire hose:


1) 
Our capture module provides a small "bit-vector" to the code that handles the 
packets. 
Since our packet processing code is already finding out basic stuff about the 
packet traversing it (is it IPv4? v6?  is it TCP? is it fragmented? ..etc), it 
sets the relevant bits ON as it goes ,so that the capture module can later 
quickly (mask against desired filters) decide if the a packet needs to be 
captured.

Point is - when a capture layer exposes a slim API that lets it utilize info 
coming from other modules , its easier and less expensive to handle the fire 
hose.

2)

In many cases we are interested in capturing complete TCP flows, or at least 
the first X packets of them.

In this case, A more expensive filter may be applied only on the SYN packet and 
when matches, turns ON a bit on the tcp flow applicative context that says we 
want to capture any packet falling under this tuple.

Point is - applicative filters at different costs are applied on different 
packet types utilizing the mask from the previous bullet 



Such a model should obviously need to be optional on a formal capture layer,

but when dealing with a fire hose - I find it very useful.



/Arnon



[dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support

2015-12-16 Thread Pavel Fedin
 Hello!

> I can reproduce your issue on my side with above patch (and only when
> F_GUEST_ANNOUNCE is not set at DPDK vhost lib). TBH, I don't know
> why that happened, the cause could be subtle, and I don't think it's
> worthwhile to dig it, especially it's not the right way to do it.

 Maybe not right, maybe it can be done... Actually, I found what was wrong. 
qemu tries to feed features back to vhost-user via
VHOST_USER_SET_FEATURES, and DPDK barfs on the unknown bit. More tweaking is 
needed for qemu to do the trick correctly.

> So, would you please try to set the F_GUEST_ANNOUNCE flag on DPDK vhost
> lib side, as my early diff showed and have another test?

 Tried it, works fine, thank you.
 I have almost implemented the workaround in qemu... However, now I am starting 
to think that you are right. Theoretically, the application
may want to suppress GUEST_ANNOUNCE for some reason. So, let it stay this way. 
Please include this bit in your v2.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




[dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support

2015-12-16 Thread Yuanhan Liu
On Wed, Dec 16, 2015 at 02:57:15PM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > I can reproduce your issue on my side with above patch (and only when
> > F_GUEST_ANNOUNCE is not set at DPDK vhost lib). TBH, I don't know
> > why that happened, the cause could be subtle, and I don't think it's
> > worthwhile to dig it, especially it's not the right way to do it.
> 
>  May be not right, may be it can be done... Actually, i found what was wrong. 
> qemu tries to feed features back to vhost-user via
> VHOST_USER_SET_FEATURES, and DPDK barfs on the unknown bit. More tweaking is 
> needed for qemu to do the trick correctly.
> 
> > So, would you please try to set the F_GUEST_ANNOUNCE flag on DPDK vhost
> > lib side, as my early diff showed and have another test?
> 
>  Tried it, works fine, thank you.

Thanks for the test.

However, I'm more curious about the ping loss. Did you still see
that? And to be more specific, has wireshark captured the
GARP from the guest? And what's the output of 'grep virtio /proc/interrupts'
inside the guest?

--yliu


>  I have almost implemented the workaround in qemu... However now i start to 
> think that you are right. Theoretically, the application
> may want to suppress GUEST_ANNOUNCE for some reason. So, let it stay this 
> way. Please include this bit into your v2.
> 
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia
> 


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Morten Brørup
Bruce,

Please note that tcpdump is a stupid name for a packet capture application that 
supports much more than just TCP.

I had missed the point about ethdev supporting virtual interfaces, so thank you 
for pointing that out. That covers my concerns about capturing packets inside 
tunnels.

I will gladly admit that you Intel guys are probably much more competent in the 
field of DPDK performance and scalability than I am. So Matthew and I have been 
asking you to kindly ensure that your solution scales well at very high packet 
rates too, and pointing out that filtering before copying is probably cheaper 
than copying before filtering. You mention that it leads to an important choice 
about which lcores get to do the work of filtering the packets, so that might 
be worth some discussion.
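
To make the "filter before copy" argument concrete, here is a small standalone sketch. It does not use any DPDK API: struct pkt, struct capture_ring and capture_filter_first are invented for illustration, and the VLAN match stands in for whatever filter (BPF or otherwise) would really be used. The point is only that non-matching packets never incur the copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 8
#define PKT_BYTES  64

/* Illustrative packet: a VLAN id plus a fixed-size payload. */
struct pkt {
	uint16_t vlan;
	uint8_t data[PKT_BYTES];
};

/* Illustrative capture buffer; not an rte_ring. */
struct capture_ring {
	struct pkt slots[RING_SLOTS];
	size_t count;        /* packets stored so far */
	size_t bytes_copied; /* total copy cost paid so far */
};

typedef int (*pkt_filter)(const struct pkt *p, uint16_t arg);

/* Example filter: match a single VLAN id. */
static int
match_vlan(const struct pkt *p, uint16_t vlan)
{
	return p->vlan == vlan;
}

/*
 * Run the filter on each packet of the burst and copy only the
 * matches into the capture buffer. Returns the number stored.
 */
static size_t
capture_filter_first(struct capture_ring *r, const struct pkt *burst,
		     size_t n, pkt_filter f, uint16_t arg)
{
	size_t i, stored = 0;

	for (i = 0; i < n; i++) {
		if (!f(&burst[i], arg))
			continue;       /* rejected: no copy at all */
		if (r->count == RING_SLOTS)
			break;          /* buffer full: drop the rest */
		memcpy(&r->slots[r->count++], &burst[i], sizeof(burst[i]));
		r->bytes_copied += sizeof(burst[i]);
		stored++;
	}
	return stored;
}
```

The trade-off mentioned in the thread still applies: here the filter runs on whichever core handles the burst, whereas copying everything to a ring moves the filtering cost to other cores.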

:-)

Med venlig hilsen / kind regards
- Morten Brørup


-Original Message-
From: Bruce Richardson [mailto:bruce.richard...@intel.com] 
Sent: 16. december 2015 12:56
To: Morten Brørup
Cc: Matthew Hall; Kyle Larose; dev at dpdk.org
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

On Wed, Dec 16, 2015 at 12:40:43PM +0100, Morten Brørup wrote:
> Bruce,
> 
> This doesn't really sound like tcpdump to me; it sounds like port mirroring.

It's actually a bit of both, in my opinion, it's designed to allow basic 
mirroring of traffic on a port to allow that traffic to be sent to a tcpdump 
destination.
By going with a more generic approach, we hope to enable more possible use 
cases than just focusing on TCP.


> 
> Your suggestion is limited to physical ports only, and cannot be attached 
> further inside the application, e.g. for mirroring packets related to a 
> specific VLAN.

Yes, the lack of attachment inside the app is a limitation. There are two types 
of scenarios that could be considered for packet capture:
* ones where the application can be modified to do its own filtering and 
capturing.
* ones where you want a generic capture mechanism which can be used on any 
application without modification.
We have chosen to focus more on the second one, as that is where a generic 
solution for DPDK is likely to lie. For the first case, the application writer 
himself knows the type of traffic and how best to capture and filter it, so I 
don't think a generic one-size-fits-all solution is possible. [Though a couple 
of helper libraries may be of use]

As for physical ports, the scheme should work for any ethdev - why do you see 
it only being limited to physical ports? What would you want to see monitored 
that we are missing?

> 
> Furthermore, it doesn't sound like the filtering part scales well. Consider a 
> fully loaded 40 Gbit/s port. You would need to copy all packets into a single 
> rte_ring to the attached filtering process, which would then require its own 
> set of lcores to probably discard most of these packets when filtering. I 
> agree with Matthew that the filtering needs to happen as close to the source 
> as possible, and must be scalable to multiple lcores.

Without modifying the application itself to do its own filtering I suspect 
scalability is always going to be a problem. That being said, there is no 
particular reason why a single rte_ring needs to be used - we could allow one 
ring per NIC queue for instance. The trouble with filtering at the source 
itself is that you put extra load on the IO cores. By using a ring, we put the 
filtering load on extra cores in a secondary process which can be scaled by the 
user without touching the main app.

> 
> On the positive side, your idea has the advantage that the filter can be any 
> application, and is not limited to BPF. However if the purpose is "tcpdump", 
> we should probably consider BPF, which is the type of filtering offered by 
> tcpdump.

Having this work with any application is one of our primary targets here. The 
app author should not have to worry too much about getting basic debug support.
Even if it doesn't work at 40G small packet rates, you can get a lot of benefit 
from a scheme that provides functional debugging for an app. Obviously, though, 
we aim to make this as scalable as possible, which is why we want to allow 
filtering in userspace before sending packets externally to DPDK.

> 
> I would prefer having a BPF library available that the application can use at 
> any point, either at the lowest level (when receiving/transmitting Ethernet 
> packets) or at a higher level (e.g. when working with packets that go into or 
> come out of a tunnel). The BPF library should implement packet length and 
> relevant ancillary data, such as SKF_AD_VLAN_TAG etc. based on metadata in 
> the mbuf.
> 
> Transferring a BPF filter from an outside application could be done by using 
> a simple text format, e.g. the output format of "tcpdump -ddd". This also 
> opens an easy roadmap for Wireshark integration by simply extending extcap to 
> include such a BPF filter format.
> 
> 
> Lots of negativity above. I very much like the idea of attachi

[dpdk-dev] [PATCH] eal: map io resources for non x86 architectures

2015-12-16 Thread David Marchand
x86 requires a special set of instructions to access ioports, but other
architectures let you remap io resources.
So let eal remap io resources by accepting IORESOURCE_IO flag for
architectures other than x86.

Signed-off-by: David Marchand 
---
 lib/librte_eal/common/include/rte_pci.h |3 ++-
 lib/librte_eal/linuxapp/eal/eal_pci.c   |   21 +++--
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/lib/librte_eal/common/include/rte_pci.h 
b/lib/librte_eal/common/include/rte_pci.h
index 334c12e..8aaab4a 100644
--- a/lib/librte_eal/common/include/rte_pci.h
+++ b/lib/librte_eal/common/include/rte_pci.h
@@ -105,7 +105,8 @@ extern struct pci_device_list pci_device_list; /**< Global 
list of PCI devices.
 /** Nb. of values in PCI resource format. */
 #define PCI_RESOURCE_FMT_NVAL 3

-/** IO resource type: memory address space */
+/** IO resource type: */
+#define IORESOURCE_IO 0x0100
 #define IORESOURCE_MEM0x0200

 /**
diff --git a/lib/librte_eal/linuxapp/eal/eal_pci.c 
b/lib/librte_eal/linuxapp/eal/eal_pci.c
index bc5b5be..9c4651d 100644
--- a/lib/librte_eal/linuxapp/eal/eal_pci.c
+++ b/lib/librte_eal/linuxapp/eal/eal_pci.c
@@ -236,12 +236,21 @@ pci_parse_sysfs_resource(const char *filename, struct 
rte_pci_device *dev)
goto error;
}

-   if (flags & IORESOURCE_MEM) {
-   dev->mem_resource[i].phys_addr = phys_addr;
-   dev->mem_resource[i].len = end_addr - phys_addr + 1;
-   /* not mapped for now */
-   dev->mem_resource[i].addr = NULL;
-   }
+   /* we only care about IORESOURCE_IO or IORESOURCE_MEM */
+   if (!(flags & IORESOURCE_IO) &&
+   !(flags & IORESOURCE_MEM))
+   continue;
+
+#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_I686)
+   /* x86 can not remap ioports, so skip it, remapping code will
+* look at dev->mem_resource[i].phys_addr == 0 and skip it */
+   if (flags & IORESOURCE_IO)
+   continue;
+#endif
+   dev->mem_resource[i].phys_addr = phys_addr;
+   dev->mem_resource[i].len = end_addr - phys_addr + 1;
+   /* not mapped for now */
+   dev->mem_resource[i].addr = NULL;
}
fclose(f);
return 0;
-- 
1.7.10.4
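
For readers who want to try the classification logic outside of EAL, here is a standalone sketch. It is an illustration, not the code from the patch: parse_resource_line and struct mem_res are made-up names, the arch_is_x86 argument replaces the #if in the patch, and it assumes each line of the sysfs "resource" file holds three hex values (start, end, flags).

```c
#include <stdint.h>
#include <stdio.h>

#define IORESOURCE_IO  0x00000100
#define IORESOURCE_MEM 0x00000200

/* Illustrative stand-in for the fields the patch fills in. */
struct mem_res {
	uint64_t phys_addr;
	uint64_t len;
};

/*
 * Parse one "start end flags" line.
 * Returns 1 if the resource was recorded, 0 if it was skipped,
 * -1 if the line could not be parsed.
 */
static int
parse_resource_line(const char *line, int arch_is_x86, struct mem_res *out)
{
	unsigned long long start, end, flags;

	if (sscanf(line, "%llx %llx %llx", &start, &end, &flags) != 3)
		return -1;

	/* we only care about IORESOURCE_IO or IORESOURCE_MEM */
	if (!(flags & (IORESOURCE_IO | IORESOURCE_MEM)))
		return 0;

	/* x86 cannot remap ioports, so leave the entry untouched */
	if (arch_is_x86 && (flags & IORESOURCE_IO))
		return 0;

	out->phys_addr = start;
	out->len = end - start + 1;
	return 1;
}
```

An I/O port BAR is therefore recorded only when arch_is_x86 is 0, while memory BARs are recorded on every architecture.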



[dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support

2015-12-16 Thread Pavel Fedin
 Hello!

> However, I'm more curious about the ping loss. Did you still see
> that? And to be more specific, has wireshark captured the
> GARP from the guest?

 Yes, everything is fine.

root at nfv_test_x86_64 /var/log/libvirt/qemu # tshark -i ovs-br0
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ovs-br0'
  1   0.00 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
  2   0.24 fe80::5054:ff:fe3b:831a -> ff02::1  ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at 52:54:00:3b:83:1a
  3   0.049490 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
  4   0.049497 fe80::5054:ff:fe3b:831a -> ff02::1  ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at 52:54:00:3b:83:1a
  5   0.199485 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
  6   0.199492 fe80::5054:ff:fe3b:831a -> ff02::1  ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at 52:54:00:3b:83:1a
  7   0.449500 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
  8   0.449508 fe80::5054:ff:fe3b:831a -> ff02::1  ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at 52:54:00:3b:83:1a
  9   0.517229  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=70/17920, ttl=64
 10   0.517277  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=70/17920, ttl=64 (request in 9)
 11   0.799521 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
 12   0.799553 fe80::5054:ff:fe3b:831a -> ff02::1  ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at 52:54:00:3b:83:1a
 13   1.517210  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=71/18176, ttl=64
 14   1.517238  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=71/18176, ttl=64 (request in 13)
 15   2.517219  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=72/18432, ttl=64
 16   2.517256  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=72/18432, ttl=64 (request in 15)
 17   3.517497  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=73/18688, ttl=64
 18   3.517518  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=73/18688, ttl=64 (request in 17)
 19   4.517219  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=74/18944, ttl=64
 20   4.517237  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=74/18944, ttl=64 (request in 19)
 21   5.517222  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=75/19200, ttl=64
 22   5.517242  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=75/19200, ttl=64 (request in 21)
 23   6.517235  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=76/19456, ttl=64
 24   6.517256  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=76/19456, ttl=64 (request in 23)
 25   6.531466 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 Who has 192.168.6.2?  Tell 192.168.6.1
 26   6.531619 RealtekU_3b:83:1a -> be:e1:71:c1:47:4d ARP 42 192.168.6.2 is at 52:54:00:3b:83:1a
 27   7.517212  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=77/19712, ttl=64
 28   7.517229  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=77/19712, ttl=64 (request in 27)

 But there's one important detail here. Any replicated network interfaces 
(LOCAL port in my example) should be fully cloned on both
hosts, including MAC addresses. Otherwise, after the migration the guest 
continues to send packets to the old MAC and, obviously, there's
still ping loss until it redoes the ARP for its ping target.

>  And what's the output of 'grep virtio /proc/interrupts' inside guest?

 11:  0  0  0  0   IO-APIC  11-fasteoi   uhci_hcd:usb1, virtio3
 24:  0  0  0  0   PCI-MSI 114688-edge  virtio2-config
 25:   3544  0  0  0   PCI-MSI 114689-edge  virtio2-req.0
 26: 10  0  0  0   PCI-MSI 49152-edge  virtio0-config
 27:852  0  0  0   PCI-MSI 49153-edge  virtio0-input.0
 28:  3  0  0  0   PCI-MSI 49154-edge  virtio0-output.0
 29: 10  0  0  0   PCI-MSI 65536-edge  virtio1-config
 30:172  0  0  0   PCI-MSI 65537-edge  virtio1-input.0
 31:  1  0  0  0   PCI-MSI 65538-edge  virtio1-output.0

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




[dpdk-dev] [PATCH] eal: map io resources for non x86 architectures

2015-12-16 Thread Yuanhan Liu
On Wed, Dec 16, 2015 at 01:31:04PM +0100, David Marchand wrote:
> x86 requires a special set of instructions to access ioports, but other
> architectures let you remap io resources.
> So let eal remap io resources by accepting IORESOURCE_IO flag for
> architectures other than x86.

One question: could this patch be a replacement for the igbuio_iomap patch
from Santosh? If so, I like it: it's more elegant.

--yliu

> 
> Signed-off-by: David Marchand 
> ---
>  lib/librte_eal/common/include/rte_pci.h |3 ++-
>  lib/librte_eal/linuxapp/eal/eal_pci.c   |   21 +++--
>  2 files changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/librte_eal/common/include/rte_pci.h 
> b/lib/librte_eal/common/include/rte_pci.h
> index 334c12e..8aaab4a 100644
> --- a/lib/librte_eal/common/include/rte_pci.h
> +++ b/lib/librte_eal/common/include/rte_pci.h
> @@ -105,7 +105,8 @@ extern struct pci_device_list pci_device_list; /**< 
> Global list of PCI devices.
>  /** Nb. of values in PCI resource format. */
>  #define PCI_RESOURCE_FMT_NVAL 3
>  
> -/** IO resource type: memory address space */
> +/** IO resource type: */
> +#define IORESOURCE_IO 0x0100
>  #define IORESOURCE_MEM0x0200
>  
>  /**
> diff --git a/lib/librte_eal/linuxapp/eal/eal_pci.c 
> b/lib/librte_eal/linuxapp/eal/eal_pci.c
> index bc5b5be..9c4651d 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_pci.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_pci.c
> @@ -236,12 +236,21 @@ pci_parse_sysfs_resource(const char *filename, struct 
> rte_pci_device *dev)
>   goto error;
>   }
>  
> - if (flags & IORESOURCE_MEM) {
> - dev->mem_resource[i].phys_addr = phys_addr;
> - dev->mem_resource[i].len = end_addr - phys_addr + 1;
> - /* not mapped for now */
> - dev->mem_resource[i].addr = NULL;
> - }
> + /* we only care about IORESOURCE_IO or IORESOURCE_MEM */
> + if (!(flags & IORESOURCE_IO) &&
> + !(flags & IORESOURCE_MEM))
> + continue;
> +
> +#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_I686)
> + /* x86 can not remap ioports, so skip it, remapping code will
> +  * look at dev->mem_resource[i].phys_addr == 0 and skip it */
> + if (flags & IORESOURCE_IO)
> + continue;
> +#endif
> + dev->mem_resource[i].phys_addr = phys_addr;
> + dev->mem_resource[i].len = end_addr - phys_addr + 1;
> + /* not mapped for now */
> + dev->mem_resource[i].addr = NULL;
>   }
>   fclose(f);
>   return 0;
> -- 
> 1.7.10.4


[dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support

2015-12-16 Thread Yuanhan Liu
On Wed, Dec 16, 2015 at 03:43:06PM +0300, Pavel Fedin wrote:
> 
>  Hello!
> 
> > However, I'm more curious about the ping loss. Did you still see
> > that? And to be more specific, has wireshark captured the
> > GARP from the guest?
> 
>  Yes, everything is fine.

Great!

> 
> root at nfv_test_x86_64 /var/log/libvirt/qemu # tshark -i ovs-br0
> Running as user "root" and group "root". This could be dangerous.
> Capturing on 'ovs-br0'
>   1   0.00 RealtekU_3b:83:1a -> BroadcastARP 42 Gratuitous ARP for 
> 192.168.6.2 (Request)
>   2   0.24 fe80::5054:ff:fe3b:831a -> ff02::1  ICMPv6 86 Neighbor 
> Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
> 52:54:00:3b:83:1a
>   3   0.049490 RealtekU_3b:83:1a -> BroadcastARP 42 Gratuitous ARP for 
> 192.168.6.2 (Request)
>   4   0.049497 fe80::5054:ff:fe3b:831a -> ff02::1  ICMPv6 86 Neighbor 
> Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
> 52:54:00:3b:83:1a
>   5   0.199485 RealtekU_3b:83:1a -> BroadcastARP 42 Gratuitous ARP for 
> 192.168.6.2 (Request)
>   6   0.199492 fe80::5054:ff:fe3b:831a -> ff02::1  ICMPv6 86 Neighbor 
> Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
> 52:54:00:3b:83:1a
>   7   0.449500 RealtekU_3b:83:1a -> BroadcastARP 42 Gratuitous ARP for 
> 192.168.6.2 (Request)
>   8   0.449508 fe80::5054:ff:fe3b:831a -> ff02::1  ICMPv6 86 Neighbor 
> Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
> 52:54:00:3b:83:1a
>   9   0.517229  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  
> id=0x04af, seq=70/17920, ttl=64
>  10   0.517277  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply
> id=0x04af, seq=70/17920, ttl=64 (request in 9)
>  11   0.799521 RealtekU_3b:83:1a -> BroadcastARP 42 Gratuitous ARP for 
> 192.168.6.2 (Request)
>  12   0.799553 fe80::5054:ff:fe3b:831a -> ff02::1  ICMPv6 86 Neighbor 
> Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
> 52:54:00:3b:83:1a
>  13   1.517210  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  
> id=0x04af, seq=71/18176, ttl=64
>  14   1.517238  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply
> id=0x04af, seq=71/18176, ttl=64 (request in 13)
>  15   2.517219  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  
> id=0x04af, seq=72/18432, ttl=64
>  16   2.517256  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply
> id=0x04af, seq=72/18432, ttl=64 (request in 15)
>  17   3.517497  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  
> id=0x04af, seq=73/18688, ttl=64
>  18   3.517518  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply
> id=0x04af, seq=73/18688, ttl=64 (request in 17)
>  19   4.517219  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  
> id=0x04af, seq=74/18944, ttl=64
>  20   4.517237  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply
> id=0x04af, seq=74/18944, ttl=64 (request in 19)
>  21   5.517222  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  
> id=0x04af, seq=75/19200, ttl=64
>  22   5.517242  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply
> id=0x04af, seq=75/19200, ttl=64 (request in 21)
>  23   6.517235  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  
> id=0x04af, seq=76/19456, ttl=64
>  24   6.517256  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply
> id=0x04af, seq=76/19456, ttl=64 (request in 23)
>  25   6.531466 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 Who has 
> 192.168.6.2?  Tell 192.168.6.1
>  26   6.531619 RealtekU_3b:83:1a -> be:e1:71:c1:47:4d ARP 42 192.168.6.2 is 
> at 52:54:00:3b:83:1a
>  27   7.517212  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  
> id=0x04af, seq=77/19712, ttl=64
>  28   7.517229  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply
> id=0x04af, seq=77/19712, ttl=64 (request in 27)
> 
>  But there's one important detail here. Any replicated network interfaces 
> (LOCAL port in my example) should be fully cloned on both
> hosts, including MAC addresses. Otherwise after the migration the guest 
> continues to send packets to old MAC, and, obvious, there's
> still ping loss until it redoes the ARP for its ping target.

I see. And here I care more about whether we can get the GARP from the
target guest just after the migration. If you can, everything should
be fine.
> 
> >  And what's the output of 'grep virtio /proc/interrupts' inside guest?
> 
> 11:  0  0  0  0   IO-APIC  11-fasteoi   
> uhci_hcd:usb1, virtio3
>  24:  0  0  0  0   PCI-MSI 114688-edge  
> virtio2-config
>  25:   3544  0  0  0   PCI-MSI 114689-edge  
> virtio2-req.0
>  26: 10  0  0  0   PCI-MSI 49152-edge  
> virtio0-config

The GUEST_ANNOUNCE has indeed been triggered. That's great! I just have

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Bruce Richardson
On Wed, Dec 16, 2015 at 01:26:11PM +0100, Morten Brørup wrote:
> Bruce,
> 
> Please note that tcpdump is a stupid name for a packet capture application 
> that supports much more than just TCP.
> 
> I had missed the point about ethdev supporting virtual interfaces, so thank 
> you for pointing that out. That covers my concerns about capturing packets 
> inside tunnels.
> 
> I will gladly admit that you Intel guys are probably much more competent in 
> the field of DPDK performance and scalability than I am. So Matthew and I 
> have been asking you to kindly ensure that your solution scales well at very 
> high packet rates too, and pointing out that filtering before copying is 
> probably cheaper than copying before filtering. You mention that it leads to 
> an important choice about which lcores get to do the work of filtering the 
> packets, so that might be worth some discussion.
> 
> :-)
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 

Thanks for your support.

We may look at having a certain amount of flexibility in the configuration of
the setup, so as to avoid limiting the use of the functionality.

For scalability at very high packet rates, it's something we'll need you guys to
give us pointers on too - what's acceptable or not inside an app, and what
level of scalability is needed. I'd admit that most of our initial thinking in
this area was for debugging apps at less than line rate, i.e. for functional
testing. For full line rate introspection, we'll have to see when we get some
working code.

/Bruce

> 
> -Original Message-
> From: Bruce Richardson [mailto:bruce.richardson at intel.com] 
> Sent: 16. december 2015 12:56
> To: Morten Brørup
> Cc: Matthew Hall; Kyle Larose; dev at dpdk.org
> Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3
> 
> On Wed, Dec 16, 2015 at 12:40:43PM +0100, Morten Brørup wrote:
> > Bruce,
> > 
> > This doesn't really sound like tcpdump to me; it sounds like port mirroring.
> 
> It's actually a bit of both, in my opinion, it's designed to allow basic 
> mirroring of traffic on a port to allow that traffic to be sent to a tcpdump 
> destination.
> By going with a more generic approach, we hope to enable more possible use 
> cases than just focusing on TCP.
> 
> 
> > 
> > Your suggestion is limited to physical ports only, and cannot be attached 
> > further inside the application, e.g. for mirroring packets related to a 
> > specific VLAN.
> 
> Yes, the lack of attachment inside the app is a limitation. There are two 
> types of scenarios that could be considered for packet capture:
> * ones where the application can be modified to do its own filtering and 
> capturing.
> * ones where you want a generic capture mechanism which can be used on any 
> application without modification.
> We have chosen to focus more on the second one, as that is where a generic 
> solution for DPDK is likely to lie. For the first case, the application 
> writer himself knows the type of traffic and how best to capture and filter 
> it, so I don't think a generic one-size-fits-all solution is possible. 
> [Though a couple of helper libraries may be of use]
> 
> As for physical ports, the scheme should work for any ethdev - why do you see 
> it only being limited to physical ports? What would you want to see monitored 
> that we are missing.
> 
> > 
> > Furthermore, it doesn't sound like the filtering part scales well. Consider 
> > a fully loaded 40 Gbit/s port. You would need to copy all packets into a 
> > single rte_ring to the attached filtering process, which would then require 
> > its own set of lcores to probably discard most of these packets when 
> > filtering. I agree with Matthew that the filtering needs to happen as close 
> > to the source as possible, and must be scalable to multiple lcores.
> 
> Without modifying the application itself to do its own filtering I suspect 
> scalability is always going to be a problem. That being said, there is no 
> particular reason why a single rte_ring needs to be used - we could allow one 
> ring per NIC queue for instance. The trouble with filtering at the source 
> itself is that you put extra load on the IO cores. By using a ring, we put 
> the filtering load on extra cores in a secondary process which can be scaled 
> by the user without touching the main app.
> 
> > 
> > On the positive side, your idea has the advantage that the filter can be 
> > any application, and is not limited to BPF. However if the purpose is 
> > "tcpdump", we should probably consider BPF, which is the type of filtering 
> > offered by tcpdump.
> 
> Having this work with any application is one of our primary targets here. The 
> app author should not have to worry too much about getting basic debug 
> support.
> Even if it doesn't work at 40G small packet rates, you can get a lot of 
> benefit from a scheme that provides functional debugging for an app. 
> Obviously, though we aim to make this as scalable as possible, which is why 
> we wan

[dpdk-dev] [PATCH] eal: map io resources for non x86 architectures

2015-12-16 Thread Bruce Richardson
On Wed, Dec 16, 2015 at 01:31:04PM +0100, David Marchand wrote:
> x86 requires a special set of instructions to access ioports, but other
> architectures let you remap io resources.
> So let eal remap io resources by accepting IORESOURCE_IO flag for
> architectures other than x86.
> 
> Signed-off-by: David Marchand 
> ---
>  lib/librte_eal/common/include/rte_pci.h |3 ++-
>  lib/librte_eal/linuxapp/eal/eal_pci.c   |   21 +++--
>  2 files changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/librte_eal/common/include/rte_pci.h 
> b/lib/librte_eal/common/include/rte_pci.h
> index 334c12e..8aaab4a 100644
> --- a/lib/librte_eal/common/include/rte_pci.h
> +++ b/lib/librte_eal/common/include/rte_pci.h
> @@ -105,7 +105,8 @@ extern struct pci_device_list pci_device_list; /**< 
> Global list of PCI devices.
>  /** Nb. of values in PCI resource format. */
>  #define PCI_RESOURCE_FMT_NVAL 3
>  
> -/** IO resource type: memory address space */
> +/** IO resource type: */
> +#define IORESOURCE_IO 0x0100
>  #define IORESOURCE_MEM0x0200
>  
>  /**
> diff --git a/lib/librte_eal/linuxapp/eal/eal_pci.c 
> b/lib/librte_eal/linuxapp/eal/eal_pci.c
> index bc5b5be..9c4651d 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_pci.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_pci.c
> @@ -236,12 +236,21 @@ pci_parse_sysfs_resource(const char *filename, struct 
> rte_pci_device *dev)
>   goto error;
>   }
>  
> - if (flags & IORESOURCE_MEM) {
> - dev->mem_resource[i].phys_addr = phys_addr;
> - dev->mem_resource[i].len = end_addr - phys_addr + 1;
> - /* not mapped for now */
> - dev->mem_resource[i].addr = NULL;
> - }
> + /* we only care about IORESOURCE_IO or IORESOURCE_MEM */
> + if (!(flags & IORESOURCE_IO) &&
> + !(flags & IORESOURCE_MEM))
> + continue;
> +
> +#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_I686)
> + /* x86 can not remap ioports, so skip it, remapping code will
> +  * look at dev->mem_resource[i].phys_addr == 0 and skip it */
> + if (flags & IORESOURCE_IO)
> + continue;
> +#endif

As a tangential comment: We maybe could look to make certain preprocessor
defines available as C globals as well. There is no reason that the ifdef here
could not be implemented as a runtime check in C code.

/Bruce
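
A rough sketch of what such a runtime check could look like is below. None of these names exist in DPDK today: ARCH_IS_X86, arch_is_x86 and skip_resource are invented for illustration, and only the RTE_ARCH_* defines and the IORESOURCE_IO value are taken from the patch.

```c
#include <stdbool.h>

/* Collapse the build-time defines into a single 0/1 value. */
#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_I686)
#define ARCH_IS_X86 1
#else
#define ARCH_IS_X86 0
#endif

#define IORESOURCE_IO 0x00000100

/* Hypothetical C global mirroring the preprocessor define. */
static const bool arch_is_x86 = ARCH_IS_X86;

/*
 * Same decision as the #if block in the patch, but written as
 * ordinary C so both branches are always compiled and checked.
 */
static bool
skip_resource(unsigned long flags)
{
	return arch_is_x86 && (flags & IORESOURCE_IO);
}
```

Since arch_is_x86 is a compile-time constant, the compiler can still drop the dead branch, so the generated code should not be worse than the ifdef version.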



[dpdk-dev] [PATCH] eal: map io resources for non x86 architectures

2015-12-16 Thread David Marchand
Bruce,

On Wed, Dec 16, 2015 at 2:15 PM, Bruce Richardson <
bruce.richardson at intel.com> wrote:

>
> > +#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_I686)
> > + /* x86 can not remap ioports, so skip it, remapping code
> will
> > +  * look at dev->mem_resource[i].phys_addr == 0 and skip it
> */
> > + if (flags & IORESOURCE_IO)
> > + continue;
> > +#endif
>
> As a tangential comment: We maybe could look to make certain preprocessor
> defines available as C globals as well. There is no reason that the ifdef
> here
> could not be implemented as a runtime check in C code.
>
>
Well, instead of having the same information as the preprocessor define,
maybe some capability per arch/cpu would be better "arch supports io remap".
Maybe we can extend the cpuflags ?


-- 
David Marchand


[dpdk-dev] [ [PATCH v2] 11/13] virtio_ioport: armv7/v8: mmap virtio iopci bar region

2015-12-16 Thread Yuanhan Liu
On Mon, Dec 14, 2015 at 06:30:30PM +0530, Santosh Shukla wrote:
> Introducing module to mmap iopci bar region. Applicable for linuxapp for 
> non-x86
> archs, Tested for arm64/ThunderX platform for linux. For that adding two 
> global
> api.
> - virtio_ioport_init
> - virtio_ioport_unmap
> 
> Signed-off-by: Santosh Shukla 
> Signed-off-by: Rizwan Ansari 
> Signed-off-by: Rakesh Krishnamurthy 
> ---
>  drivers/net/virtio/Makefile|1 +
>  drivers/net/virtio/virtio_ioport.c |  163 
> 
>  drivers/net/virtio/virtio_ioport.h |   42 ++
>  3 files changed, 206 insertions(+)
>  create mode 100644 drivers/net/virtio/virtio_ioport.c
>  create mode 100644 drivers/net/virtio/virtio_ioport.h
> 
> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> index 25a842d..5cba6d3 100644
> --- a/drivers/net/virtio/Makefile
> +++ b/drivers/net/virtio/Makefile
> @@ -50,6 +50,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtqueue.c
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_pci.c
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
> +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ioport.c
>  SRCS-$(CONFIG_RTE_VIRTIO_INC_VECTOR) += virtio_rxtx_simple.c
>  
>  # this lib depends upon:
> diff --git a/drivers/net/virtio/virtio_ioport.c 
> b/drivers/net/virtio/virtio_ioport.c
> new file mode 100644
> index 000..ffeb8e9
> --- /dev/null
> +++ b/drivers/net/virtio/virtio_ioport.c
> @@ -0,0 +1,163 @@
> +/*
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2015 Cavium Networks. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *   * Redistributions of source code must retain the above copyright
> + * notice, this list of conditions and the following disclaimer.
> + *   * Redistributions in binary form must reproduce the above copyright
> + * notice, this list of conditions and the following disclaimer in
> + * the documentation and/or other materials provided with the
> + * distribution.
> + *   * Neither the name of Intel Corporation nor the names of its
> + * contributors may be used to endorse or promote products derived
> + * from this software without specific prior written permission.
> + *
> + *THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + *
> + */
> +
> +#include "virtio_ioport.h"
> +
> +#if defined(RTE_EXEC_ENV_LINUXAPP) && (defined(RTE_ARCH_ARM) || \
> + defined(RTE_ARCH_ARM64))
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "virtio_logs.h"
> +
> +/* start address of first pci_iobar slot (user-space virtual-addres) */
> +void *ioport_map;

You still forgot "static"?

> +/**
> + * ioport map count,
> + * Use-case: virtio-net-pci.

How about removing the above two lines? They are quite meaningless here,
and a bit redundant.

> + * Keeps track of number of virtio-net-pci device mapped/unmapped. Max device
> + * support by linux kernel is 31, so ioport_map_cnt can not be greater than 
> 31.
> + */
> +static int ioport_map_cnt;
> +
> +static int
> +virtio_map_ioport(void **resource_addr)
> +{
> + int fd;
> + int ret = 0;
> +
> + /* avoid -Werror=unused-parameter, keep compiler happy */
> + (void)resource_addr;

Using __rte_unused is more elegant.

> + fd = open(VIRT_IOPORT_DEV, O_RDWR);
> + if (fd < 0) {
> + PMD_INIT_LOG(ERR, "device file %s open error: %d\n",
> +  DEV_NAME, fd);
> + ret = -1;
> + goto out;
> + }
> +
> + ioport_map = mmap(NULL, PCI_VIRT_IOPORT_SIZE,
> + PROT_EXEC | PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
> +
> + if (ioport_map == MAP_FAILED) {
> + PMD_INIT_LOG(ERR, "mmap: failed to map bar Address=%p\n",
> + *resource_addr);
> + ret = -ENOMEM;
> + goto out1;
> + }
> +
> + PMD_INIT_LOG(INFO, "First pci_iobar mapped at %p

[dpdk-dev] [PATCH] eal: map io resources for non x86 architectures

2015-12-16 Thread David Marchand
Yuanhan,

On Wed, Dec 16, 2015 at 1:48 PM, Yuanhan Liu 
wrote:

> On Wed, Dec 16, 2015 at 01:31:04PM +0100, David Marchand wrote:
> > x86 requires a special set of instructions to access ioports, but other
> > architectures let you remap io resources.
> > So let eal remap io resources by accepting IORESOURCE_IO flag for
> > architectures other than x86.
>
> One question: this patch could be a replacement of the igbuio_iomap patch
> from Santosh? If so, I like it: It's more elegant.
>

Well, yes, unless I missed something since I am no guru :-).

-- 
David Marchand


[dpdk-dev] [PATCH] eal: map io resources for non x86 architectures

2015-12-16 Thread Yuanhan Liu
On Wed, Dec 16, 2015 at 02:34:35PM +0100, David Marchand wrote:
> Yuanhan,
> 
> On Wed, Dec 16, 2015 at 1:48 PM, Yuanhan Liu 
> wrote:
> 
> On Wed, Dec 16, 2015 at 01:31:04PM +0100, David Marchand wrote:
> > x86 requires a special set of instructions to access ioports, but other
> > architectures let you remap io resources.
> > So let eal remap io resources by accepting IORESOURCE_IO flag for
> > architectures other than x86.
> 
> One question: this patch could be a replacement of the igbuio_iomap patch
> from Santosh? If so, I like it: It's more elegant.
> 
> 
> Well, yes, unless I missed something since I am no guru :-).

Great then. If there is a Tested-by, I could give my Ack :)
(I have no arm or other platform for testing).

--yliu


[dpdk-dev] [ [PATCH v2] 05/13] virtio: change io_base datatype from uint32_t to uint64_type

2015-12-16 Thread Yuanhan Liu
On Mon, Dec 14, 2015 at 06:30:24PM +0530, Santosh Shukla wrote:
> In x86 case io_base to store ioport address not more than 65535 ioports. 
> i.e..0
> to  but in non-x86 case in particular arm64 it need to store more than 32
> bit address so changing io_base datatype from 32 to 64.
> 
> Signed-off-by: Santosh Shukla 
> ---
>  drivers/net/virtio/virtio_ethdev.c |2 +-
>  drivers/net/virtio/virtio_pci.h|4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio/virtio_ethdev.c 
> b/drivers/net/virtio/virtio_ethdev.c
> index d928339..620e0d4 100644
> --- a/drivers/net/virtio/virtio_ethdev.c
> +++ b/drivers/net/virtio/virtio_ethdev.c
> @@ -1291,7 +1291,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
>   return -1;
>  
>   hw->use_msix = virtio_has_msix(&pci_dev->addr);
> - hw->io_base = (uint32_t)(uintptr_t)pci_dev->mem_resource[0].addr;
> + hw->io_base = (uint64_t)(uintptr_t)pci_dev->mem_resource[0].addr;

I'd suggest moving the io_base assignment (and cast) into virtio_ioport_init()
so that we can do the correct cast there: say, uint32_t for x86 and
uint64_t for the others.
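
A minimal sketch of that suggestion, not code from the patch. The names
hw_sketch and set_io_base are hypothetical stand-ins for struct virtio_hw
and for the part of virtio_ioport_init() that would do the assignment; the
RTE_ARCH_* macros are the DPDK build macros quoted elsewhere in this thread.

```c
#include <stdint.h>

/* Hypothetical stand-in for the io_base field of struct virtio_hw. */
struct hw_sketch {
	uint64_t io_base;
};

/*
 * Hypothetical helper: keep the architecture-dependent cast in one place
 * instead of in the generic device init path.
 */
static void
set_io_base(struct hw_sketch *hw, void *bar_addr)
{
#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_I686)
	/* x86 I/O port numbers fit in 16 bits, so a 32-bit cast loses nothing */
	hw->io_base = (uint32_t)(uintptr_t)bar_addr;
#else
	/* other architectures store a mapped address, which may need 64 bits */
	hw->io_base = (uint64_t)(uintptr_t)bar_addr;
#endif
}
```

With this shape the field itself can stay uint64_t on every architecture,
and only the helper knows which cast applies.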

--yliu

>  
>   /* Reset the device although not necessary at startup */
>   vtpci_reset(hw);
> diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
> index 3f4ff80..f3e4178 100644
> --- a/drivers/net/virtio/virtio_pci.h
> +++ b/drivers/net/virtio/virtio_pci.h
> @@ -169,7 +169,7 @@ struct virtqueue;
>  
>  struct virtio_hw {
>   struct virtqueue *cvq;
> - uint32_tio_base;
> + uint64_tio_base;
>   uint32_tguest_features;
>   uint32_tmax_tx_queues;
>   uint32_tmax_rx_queues;
> @@ -231,7 +231,7 @@ outl_p(unsigned int data, unsigned int port)
>  #endif
>  
>  #define VIRTIO_PCI_REG_ADDR(hw, reg) \
> - (unsigned short)((hw)->io_base + (reg))
> + (unsigned long)((hw)->io_base + (reg))
>  
>  #define VIRTIO_READ_REG_1(hw, reg) \
>   inb((VIRTIO_PCI_REG_ADDR((hw), (reg
> -- 
> 1.7.9.5


[dpdk-dev] [PATCH] eal: map io resources for non x86 architectures

2015-12-16 Thread Santosh Shukla
On Wed, Dec 16, 2015 at 6:18 PM, Yuanhan Liu
 wrote:
> On Wed, Dec 16, 2015 at 01:31:04PM +0100, David Marchand wrote:
>> x86 requires a special set of instructions to access ioports, but other
>> architectures let you remap io resources.
>> So let eal remap io resources by accepting IORESOURCE_IO flag for
>> architectures other than x86.
>
> One question: this patch could be a replacement of the igbuio_iomap patch
> from Santosh? If so, I like it: It's more elegant.
>
> --yliu
>

I tried something similar in the past, though not in parse_sysfs (such that
mem.resource_addr accepts IORESOURCE_IO types), and observed that
pci_map_resource was not able to map the address, hence a segfault at testpmd
initialization.

I was getting this:
EAL: pci_map_resource(): cannot mmap(19, 0x7fa5c0, 0x20, 0x0):
Invalid argument (0x)

after enabling the RTE_PCI_DRV_NEED_MAPPING flag in virtio_ethdev. I
guess the patch assumes that flag is enabled for the driver, right?
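
For readers following along, the filtering done by the patch quoted below
can be sketched as a standalone function. This is only an illustration
under stated assumptions: res_sketch and parse_resource_line are
hypothetical names, the x86 check is a runtime argument here whereas the
patch uses a compile-time #if, and the input is assumed to be one
"start end flags" line in hex as found in a sysfs resource file.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define IORESOURCE_IO  0x00000100
#define IORESOURCE_MEM 0x00000200

/* Hypothetical stand-in for one dev->mem_resource[i] entry. */
struct res_sketch {
	uint64_t phys_addr;
	uint64_t len;
};

/*
 * Parse one "start end flags" line.
 * Returns 1 if the resource was recorded, 0 if it was skipped,
 * and -1 if the line could not be parsed.
 */
static int
parse_resource_line(const char *line, int is_x86, struct res_sketch *out)
{
	uint64_t start, end, flags;

	if (sscanf(line, "%" SCNx64 " %" SCNx64 " %" SCNx64,
		   &start, &end, &flags) != 3)
		return -1;

	/* only IORESOURCE_IO and IORESOURCE_MEM are of interest */
	if (!(flags & IORESOURCE_IO) && !(flags & IORESOURCE_MEM))
		return 0;

	/* x86 cannot remap ioports, so leave the entry untouched there */
	if (is_x86 && (flags & IORESOURCE_IO))
		return 0;

	out->phys_addr = start;
	out->len = end - start + 1;
	return 1;
}
```

Calling it with an I/O BAR line and is_x86 set to 0 records the range;
the same line with is_x86 set to 1 is skipped.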



>>
>> Signed-off-by: David Marchand 
>> ---
>>  lib/librte_eal/common/include/rte_pci.h |3 ++-
>>  lib/librte_eal/linuxapp/eal/eal_pci.c   |   21 +++--
>>  2 files changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/lib/librte_eal/common/include/rte_pci.h 
>> b/lib/librte_eal/common/include/rte_pci.h
>> index 334c12e..8aaab4a 100644
>> --- a/lib/librte_eal/common/include/rte_pci.h
>> +++ b/lib/librte_eal/common/include/rte_pci.h
>> @@ -105,7 +105,8 @@ extern struct pci_device_list pci_device_list; /**< 
>> Global list of PCI devices.
>>  /** Nb. of values in PCI resource format. */
>>  #define PCI_RESOURCE_FMT_NVAL 3
>>
>> -/** IO resource type: memory address space */
>> +/** IO resource type: */
>> +#define IORESOURCE_IO 0x0100
>>  #define IORESOURCE_MEM0x0200
>>
>>  /**
>> diff --git a/lib/librte_eal/linuxapp/eal/eal_pci.c 
>> b/lib/librte_eal/linuxapp/eal/eal_pci.c
>> index bc5b5be..9c4651d 100644
>> --- a/lib/librte_eal/linuxapp/eal/eal_pci.c
>> +++ b/lib/librte_eal/linuxapp/eal/eal_pci.c
>> @@ -236,12 +236,21 @@ pci_parse_sysfs_resource(const char *filename, struct 
>> rte_pci_device *dev)
>>   goto error;
>>   }
>>
>> - if (flags & IORESOURCE_MEM) {
>> - dev->mem_resource[i].phys_addr = phys_addr;
>> - dev->mem_resource[i].len = end_addr - phys_addr + 1;
>> - /* not mapped for now */
>> - dev->mem_resource[i].addr = NULL;
>> - }
>> + /* we only care about IORESOURCE_IO or IORESOURCE_MEM */
>> + if (!(flags & IORESOURCE_IO) &&
>> + !(flags & IORESOURCE_MEM))
>> + continue;
>> +
>> +#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_I686)
>> + /* x86 can not remap ioports, so skip it, remapping code will
>> +  * look at dev->mem_resource[i].phys_addr == 0 and skip it */
>> + if (flags & IORESOURCE_IO)
>> + continue;
>> +#endif
>> + dev->mem_resource[i].phys_addr = phys_addr;
>> + dev->mem_resource[i].len = end_addr - phys_addr + 1;
>> + /* not mapped for now */
>> + dev->mem_resource[i].addr = NULL;
>>   }
>>   fclose(f);
>>   return 0;
>> --
>> 1.7.10.4


[dpdk-dev] [ [PATCH v2] 05/13] virtio: change io_base datatype from uint32_t to uint64_type

2015-12-16 Thread Santosh Shukla
On Wed, Dec 16, 2015 at 7:18 PM, Yuanhan Liu
 wrote:
> On Mon, Dec 14, 2015 at 06:30:24PM +0530, Santosh Shukla wrote:
>> In x86 case io_base to store ioport address not more than 65535 ioports. 
>> i.e..0
>> to  but in non-x86 case in particular arm64 it need to store more than 32
>> bit address so changing io_base datatype from 32 to 64.
>>
>> Signed-off-by: Santosh Shukla 
>> ---
>>  drivers/net/virtio/virtio_ethdev.c |2 +-
>>  drivers/net/virtio/virtio_pci.h|4 ++--
>>  2 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/virtio/virtio_ethdev.c 
>> b/drivers/net/virtio/virtio_ethdev.c
>> index d928339..620e0d4 100644
>> --- a/drivers/net/virtio/virtio_ethdev.c
>> +++ b/drivers/net/virtio/virtio_ethdev.c
>> @@ -1291,7 +1291,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
>>   return -1;
>>
>>   hw->use_msix = virtio_has_msix(&pci_dev->addr);
>> - hw->io_base = (uint32_t)(uintptr_t)pci_dev->mem_resource[0].addr;
>> + hw->io_base = (uint64_t)(uintptr_t)pci_dev->mem_resource[0].addr;
>
> I'd suggest to move the io_base assignment (and cast) into 
> virtio_ioport_init()
> so that we could do the correct cast there, say cast it to uint32_t for
> X86, and uint64_t for others.
>

Ok.

This was done deliberately: your virtio 1.0 spec patch uses uint64_t
types, and if I plan to use those future patches on arm64, IMO it
makes more sense to keep io_base as uint64_t.

Also, in the x86 case addresses are of the form 0x1000-0x101f and so
forth; changing the data type to uint64_t by default won't affect such
addresses, right? And looking at virtio_pci.h, functions like inb/outb
take the io_base address as unsigned long, which is arch dependent
(4 bytes on 32-bit and 8 bytes on 64-bit), so the lower-level rd/wr
APIs take care of the data types accordingly.
> --yliu
>
>>
>>   /* Reset the device although not necessary at startup */
>>   vtpci_reset(hw);
>> diff --git a/drivers/net/virtio/virtio_pci.h 
>> b/drivers/net/virtio/virtio_pci.h
>> index 3f4ff80..f3e4178 100644
>> --- a/drivers/net/virtio/virtio_pci.h
>> +++ b/drivers/net/virtio/virtio_pci.h
>> @@ -169,7 +169,7 @@ struct virtqueue;
>>
>>  struct virtio_hw {
>>   struct virtqueue *cvq;
>> - uint32_tio_base;
>> + uint64_tio_base;
>>   uint32_tguest_features;
>>   uint32_tmax_tx_queues;
>>   uint32_tmax_rx_queues;
>> @@ -231,7 +231,7 @@ outl_p(unsigned int data, unsigned int port)
>>  #endif
>>
>>  #define VIRTIO_PCI_REG_ADDR(hw, reg) \
>> - (unsigned short)((hw)->io_base + (reg))
>> + (unsigned long)((hw)->io_base + (reg))
>>
>>  #define VIRTIO_READ_REG_1(hw, reg) \
>>   inb((VIRTIO_PCI_REG_ADDR((hw), (reg
>> --
>> 1.7.9.5


[dpdk-dev] make install and RTE_KERNELDIR in dpdk 2.2

2015-12-16 Thread Piotr Bartosiewicz
The new 'make install' wrongly assumes that the kernel module output
directory is always named after 'uname -r', even if RTE_KERNELDIR is passed.

# make install T=... DESTDIR=/tmp/dpdk RTE_KERNELDIR=/lib/modules/3.16.0-4-amd64/build
...
# ls /tmp/dpdk/lib/modules/
4.2.0-18-generic

-- 
Regards
Piotr Bartosiewicz



[dpdk-dev] [PATCH v5 1/2] tools: Add support for handling built-in kernel modules

2015-12-16 Thread Kamil Rytarowski
ping?

On 09.12.2015 at 14:19, Kamil Rytarowski wrote:
> Currently dpdk_nic_bind.py detects Linux kernel modules via reading
> /proc/modules. Built-in ones aren't listed there and therefore they are not
> being found by the script.
>
> Add support for checking built-in modules with parsing the sysfs files.
>
> This commit obsoletes the /proc/modules parsing approach.
>
> Signed-off-by: Kamil Rytarowski 
> Signed-off-by: David Marchand 
> ---
>   tools/dpdk_nic_bind.py | 27 +--
>   1 file changed, 17 insertions(+), 10 deletions(-)
>
> diff --git a/tools/dpdk_nic_bind.py b/tools/dpdk_nic_bind.py
> index f02454e..e161062 100755
> --- a/tools/dpdk_nic_bind.py
> +++ b/tools/dpdk_nic_bind.py
> @@ -156,22 +156,29 @@ def check_modules():
>   '''Checks that igb_uio is loaded'''
>   global dpdk_drivers
>   
> -fd = file("/proc/modules")
> -loaded_mods = fd.readlines()
> -fd.close()
> -
>   # list of supported modules
>   mods =  [{"Name" : driver, "Found" : False} for driver in dpdk_drivers]
>   
>   # first check if module is loaded
> -for line in loaded_mods:
> +try:
> +# Get list of syfs modules, some of them might be builtin and merge 
> with mods
> +sysfs_path = '/sys/module/'
> +
> +# Get the list of directories in sysfs_path
> +sysfs_mods = [os.path.join(sysfs_path,o) for o in 
> os.listdir(sysfs_path) if os.path.isdir(os.path.join(sysfs_path,o))]
> +
> +# Extract the last element of '/sys/module/abc' in the array
> +sysfs_mods = [a.split('/')[-1] for a in sysfs_mods]
> +
> +# special case for vfio_pci (module is named vfio-pci,
> +# but its .ko is named vfio_pci)
> +sysfs_mods = map(lambda a: a if a != 'vfio_pci' else 'vfio-pci', 
> sysfs_mods)
> +
>   for mod in mods:
> -if line.startswith(mod["Name"]):
> -mod["Found"] = True
> -# special case for vfio_pci (module is named vfio-pci,
> -# but its .ko is named vfio_pci)
> -elif line.replace("_", "-").startswith(mod["Name"]):
> +if mod["Found"] == False and (mod["Name"] in sysfs_mods):
>   mod["Found"] = True
> +except:
> +pass
>   
>   # check if we have at least one loaded module
>   if True not in [mod["Found"] for mod in mods] and b_flag is not None:



[dpdk-dev] [ [PATCH v2] 11/13] virtio_ioport: armv7/v8: mmap virtio iopci bar region

2015-12-16 Thread Santosh Shukla
On Wed, Dec 16, 2015 at 6:59 PM, Yuanhan Liu
 wrote:
> On Mon, Dec 14, 2015 at 06:30:30PM +0530, Santosh Shukla wrote:
>> Introducing module to mmap iopci bar region. Applicable for linuxapp for 
>> non-x86
>> archs, Tested for arm64/ThunderX platform for linux. For that adding two 
>> global
>> api.
>> - virtio_ioport_init
>> - virtio_ioport_unmap
>>
>> Signed-off-by: Santosh Shukla 
>> Signed-off-by: Rizwan Ansari 
>> Signed-off-by: Rakesh Krishnamurthy 
>> ---
>>  drivers/net/virtio/Makefile|1 +
>>  drivers/net/virtio/virtio_ioport.c |  163 
>> 
>>  drivers/net/virtio/virtio_ioport.h |   42 ++
>>  3 files changed, 206 insertions(+)
>>  create mode 100644 drivers/net/virtio/virtio_ioport.c
>>  create mode 100644 drivers/net/virtio/virtio_ioport.h
>>
>> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
>> index 25a842d..5cba6d3 100644
>> --- a/drivers/net/virtio/Makefile
>> +++ b/drivers/net/virtio/Makefile
>> @@ -50,6 +50,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtqueue.c
>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_pci.c
>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
>> +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ioport.c
>>  SRCS-$(CONFIG_RTE_VIRTIO_INC_VECTOR) += virtio_rxtx_simple.c
>>
>>  # this lib depends upon:
>> diff --git a/drivers/net/virtio/virtio_ioport.c 
>> b/drivers/net/virtio/virtio_ioport.c
>> new file mode 100644
>> index 000..ffeb8e9
>> --- /dev/null
>> +++ b/drivers/net/virtio/virtio_ioport.c
>> @@ -0,0 +1,163 @@
>> +/*
>> + *   BSD LICENSE
>> + *
>> + *   Copyright(c) 2015 Cavium Networks. All rights reserved.
>> + *   All rights reserved.
>> + *
>> + *   Redistribution and use in source and binary forms, with or without
>> + *   modification, are permitted provided that the following conditions
>> + *   are met:
>> + *
>> + *   * Redistributions of source code must retain the above copyright
>> + * notice, this list of conditions and the following disclaimer.
>> + *   * Redistributions in binary form must reproduce the above copyright
>> + * notice, this list of conditions and the following disclaimer in
>> + * the documentation and/or other materials provided with the
>> + * distribution.
>> + *   * Neither the name of Intel Corporation nor the names of its
>> + * contributors may be used to endorse or promote products derived
>> + * from this software without specific prior written permission.
>> + *
>> + *THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>> + *"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>> + *LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
>> + *A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
>> + *OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
>> + *SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>> + *LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
>> + *DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
>> + *THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> + *(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
>> + *OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>> + *
>> + */
>> +
>> +#include "virtio_ioport.h"
>> +
>> +#if defined(RTE_EXEC_ENV_LINUXAPP) && (defined(RTE_ARCH_ARM) || \
>> + defined(RTE_ARCH_ARM64))
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include "virtio_logs.h"
>> +
>> +/* start address of first pci_iobar slot (user-space virtual-addres) */
>> +void *ioport_map;
>
> You still forgot "static"?
>

I misunderstood the last comment, sorry. Will do.

>> +/**
>> + * ioport map count,
>> + * Use-case: virtio-net-pci.
>
> How about removing above two lines; it's quite meaningless here, but
> instead a bit redundant.
>

ok.

>> + * Keeps track of number of virtio-net-pci device mapped/unmapped. Max 
>> device
>> + * support by linux kernel is 31, so ioport_map_cnt can not be greater than 
>> 31.
>> + */
>> +static int ioport_map_cnt;
>> +
>> +static int
>> +virtio_map_ioport(void **resource_addr)
>> +{
>> + int fd;
>> + int ret = 0;
>> +
>> + /* avoid -Werror=unused-parameter, keep compiler happy */
>> + (void)resource_addr;
>
> Using __rte_unused is more elegant.
>

ok.

>> + fd = open(VIRT_IOPORT_DEV, O_RDWR);
>> + if (fd < 0) {
>> + PMD_INIT_LOG(ERR, "device file %s open error: %d\n",
>> +  DEV_NAME, fd);
>> + ret = -1;
>> + goto out;
>> + }
>> +
>> + ioport_map = mmap(NULL, PCI_VIRT_IOPORT_SIZE,
>> + PROT_EXEC | PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
>> +
>> + if (ioport_map == MAP_FAILED) {
>> 

[dpdk-dev] [ [PATCH v2] 05/13] virtio: change io_base datatype from uint32_t to uint64_type

2015-12-16 Thread Yuanhan Liu
On Wed, Dec 16, 2015 at 07:31:57PM +0530, Santosh Shukla wrote:
> On Wed, Dec 16, 2015 at 7:18 PM, Yuanhan Liu
>  wrote:
> > On Mon, Dec 14, 2015 at 06:30:24PM +0530, Santosh Shukla wrote:
> >> In x86 case io_base to store ioport address not more than 65535 ioports. 
> >> i.e..0
> >> to  but in non-x86 case in particular arm64 it need to store more than 
> >> 32
> >> bit address so changing io_base datatype from 32 to 64.
> >>
> >> Signed-off-by: Santosh Shukla 
> >> ---
> >>  drivers/net/virtio/virtio_ethdev.c |2 +-
> >>  drivers/net/virtio/virtio_pci.h|4 ++--
> >>  2 files changed, 3 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/net/virtio/virtio_ethdev.c 
> >> b/drivers/net/virtio/virtio_ethdev.c
> >> index d928339..620e0d4 100644
> >> --- a/drivers/net/virtio/virtio_ethdev.c
> >> +++ b/drivers/net/virtio/virtio_ethdev.c
> >> @@ -1291,7 +1291,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
> >>   return -1;
> >>
> >>   hw->use_msix = virtio_has_msix(&pci_dev->addr);
> >> - hw->io_base = (uint32_t)(uintptr_t)pci_dev->mem_resource[0].addr;
> >> + hw->io_base = (uint64_t)(uintptr_t)pci_dev->mem_resource[0].addr;
> >
> > I'd suggest to move the io_base assignment (and cast) into 
> > virtio_ioport_init()
> > so that we could do the correct cast there, say cast it to uint32_t for
> > X86, and uint64_t for others.
> >
> 
> Ok.
> 
> This was deliberately done considering your 1.0 virtio spec patch do
> care for uint64_t types and in arm64 case, If I plan to use those
> future patches, IMO it make more sense to me keep it in uint64_t way;

I did a different cast: 32 bit for legacy virtio pci devices, and 64 bit
for modern virtio pci devices.

> Also in x86 case max address could of type 0x1000-101f and so forth;
> changing data-type to uint64_t default wont effect such address,
> right?

Right, but what's the harm of doing the right cast? :)

> And hw->io_base by looking at virtio_pci.h function like
> inb/outb etc.. takes io_base address as unsigned long types which is
> arch dependent; i.e.. 4 byte for 32 bit and 8 for 64 bit so the lower
> level rd/wr apis are taking care of data-types accordingly.

I didn't get that. inb/outb take "unsigned short" arguments, not
"unsigned long".

--yliu


[dpdk-dev] [ [PATCH v2] 11/13] virtio_ioport: armv7/v8: mmap virtio iopci bar region

2015-12-16 Thread Yuanhan Liu
On Wed, Dec 16, 2015 at 07:50:51PM +0530, Santosh Shukla wrote:
...
> >> + *resource_addr = (void *)((char *)ioport_map + 
> >> (ioport_map_cnt)*offset);
> >
> > Redundant (), and the void * cast seems to be unnecessary.
> >
> 
> (void *) is unnecessary, but couldn't get the redundant() part?

I meant the () of "(ioport_map_cnt)*offset".
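
Both points can be seen in a small sketch. slot_addr is a hypothetical
helper, not a function from the patch; it only mirrors the address
computation quoted above.

```c
#include <stddef.h>

/*
 * Hypothetical helper: address of the n-th fixed-size slot inside a
 * mapped region.
 */
static void *
slot_addr(void *base, size_t slot, size_t slot_size)
{
	/*
	 * char * arithmetic is byte-wise, and in C the char * result
	 * converts to void * implicitly, so no (void *) cast and no extra
	 * parentheses around the slot index are needed.
	 */
	return (char *)base + slot * slot_size;
}
```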

> 
> >> + ioport_map_cnt++;
> >> +
> >> + PMD_INIT_LOG(DEBUG, "pci.resource_addr %p ioport_map_cnt %d\n",
> >> + *resource_addr, ioport_map_cnt);
> >> + return ret;
> >> +}
> >> +
> >
> Is it redundant comment or your suggesting to use : r / (void) / __rte_unused?

You should always use __rte_unused instead of a (void) cast. Note that you
may need to check your other patches to make sure you don't miss other such
usages.
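
A minimal sketch of the difference, assuming a GCC or Clang compatible
compiler. SKETCH_UNUSED is a local stand-in defined here for illustration;
in DPDK the __rte_unused macro plays this role. map_ioport_sketch is a
hypothetical function, not the one from the patch.

```c
/* Local stand-in for DPDK's __rte_unused (an "unused" attribute). */
#define SKETCH_UNUSED __attribute__((__unused__))

/*
 * The annotation sits on the parameter itself, so no "(void)param;"
 * statement is needed in the body to silence -Wunused-parameter.
 */
static int
map_ioport_sketch(void **resource_addr SKETCH_UNUSED)
{
	return 0;
}
```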

--yliu


[dpdk-dev] [ [PATCH v2] 05/13] virtio: change io_base datatype from uint32_t to uint64_type

2015-12-16 Thread Santosh Shukla
On Wed, Dec 16, 2015 at 7:53 PM, Yuanhan Liu
 wrote:
> On Wed, Dec 16, 2015 at 07:31:57PM +0530, Santosh Shukla wrote:
>> On Wed, Dec 16, 2015 at 7:18 PM, Yuanhan Liu
>>  wrote:
>> > On Mon, Dec 14, 2015 at 06:30:24PM +0530, Santosh Shukla wrote:
>> >> In x86 case io_base to store ioport address not more than 65535 ioports. 
>> >> i.e..0
>> >> to  but in non-x86 case in particular arm64 it need to store more 
>> >> than 32
>> >> bit address so changing io_base datatype from 32 to 64.
>> >>
>> >> Signed-off-by: Santosh Shukla 
>> >> ---
>> >>  drivers/net/virtio/virtio_ethdev.c |2 +-
>> >>  drivers/net/virtio/virtio_pci.h|4 ++--
>> >>  2 files changed, 3 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/drivers/net/virtio/virtio_ethdev.c 
>> >> b/drivers/net/virtio/virtio_ethdev.c
>> >> index d928339..620e0d4 100644
>> >> --- a/drivers/net/virtio/virtio_ethdev.c
>> >> +++ b/drivers/net/virtio/virtio_ethdev.c
>> >> @@ -1291,7 +1291,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
>> >>   return -1;
>> >>
>> >>   hw->use_msix = virtio_has_msix(&pci_dev->addr);
>> >> - hw->io_base = (uint32_t)(uintptr_t)pci_dev->mem_resource[0].addr;
>> >> + hw->io_base = (uint64_t)(uintptr_t)pci_dev->mem_resource[0].addr;
>> >
>> > I'd suggest to move the io_base assignment (and cast) into 
>> > virtio_ioport_init()
>> > so that we could do the correct cast there, say cast it to uint32_t for
>> > X86, and uint64_t for others.
>> >
>>
>> Ok.
>>
>> This was deliberately done considering your 1.0 virtio spec patch do
>> care for uint64_t types and in arm64 case, If I plan to use those
>> future patches, IMO it make more sense to me keep it in uint64_t way;
>
> I did different cast, 32 bit for legacy virtio pci device, and 64 bit
> for modern virtio pci device.
>
>> Also in x86 case max address could of type 0x1000-101f and so forth;
>> changing data-type to uint64_t default wont effect such address,
>> right?
>
> Right, but what's the harm of doing the right cast? :)
>

Agree.

>> And hw->io_base by looking at virtio_pci.h function like
>> inb/outb etc.. takes io_base address as unsigned long types which is
>> arch dependent; i.e.. 4 byte for 32 bit and 8 for 64 bit so the lower
>> level rd/wr apis are taking care of data-types accordingly.
>
> Didn't get it. inb/outb takes "unsigned short" arguments, but not
> "unsigned long".
>

sys/io.h in the x86 case uses unsigned short int types.

include/asm-generic/io.h for arm64 uses unsigned long (from the Linux
header files).

In that case, keeping
#define VIRTIO_PCI_REG_ADDR(hw, reg) \
(unsigned short)((hw)->io_base + (reg))

would be x86 specific, so what I used in this patch is

#define VIRTIO_PCI_REG_ADDR(hw, reg) \
(unsigned long)((hw)->io_base + (reg))

to avoid #ifdef ARM or non-x86 clutter. I know the data type is not the
right fit for x86 sys/io.h, but considering the possible addresses inside
hw->io_base, it won't affect functionality or performance by any means.
That is why in eth_virtio_dev_init() I chose to keep hw->io_base
as (uint64_t).

Otherwise I'll have to duplicate the VIRTIO_PCI_REG_XXX definitions for
the non-x86 case. Please suggest a better alternative. Thanks




> --yliu


[dpdk-dev] [ [PATCH v2] 11/13] virtio_ioport: armv7/v8: mmap virtio iopci bar region

2015-12-16 Thread Santosh Shukla
On Wed, Dec 16, 2015 at 8:07 PM, Yuanhan Liu
 wrote:
> On Wed, Dec 16, 2015 at 07:50:51PM +0530, Santosh Shukla wrote:
> ...
>> >> + *resource_addr = (void *)((char *)ioport_map + 
>> >> (ioport_map_cnt)*offset);
>> >
>> > Redundant (), and the void * cast seems to be unnecessary.
>> >
>>
>> (void *) is unnecessary, but couldn't get the redundant() part?
>
> I meant the () of "(ioport_map_cnt)*offset".
>

ok.

>>
>> >> + ioport_map_cnt++;
>> >> +
>> >> + PMD_INIT_LOG(DEBUG, "pci.resource_addr %p ioport_map_cnt %d\n",
>> >> + *resource_addr, ioport_map_cnt);
>> >> + return ret;
>> >> +}
>> >> +
>> >
>> Is it redundant comment or your suggesting to use : r / (void) / 
>> __rte_unused?
>
> You should always use __rte_unused instead of (void) cast. Note that you
> may need check your other patches, to make sure you not miss other such
> usage.
>

yup, noted down. Thanks

> --yliu


[dpdk-dev] [ [PATCH v2] 05/13] virtio: change io_base datatype from uint32_t to uint64_type

2015-12-16 Thread Yuanhan Liu
On Wed, Dec 16, 2015 at 08:09:40PM +0530, Santosh Shukla wrote:
> On Wed, Dec 16, 2015 at 7:53 PM, Yuanhan Liu
>  wrote:
> > On Wed, Dec 16, 2015 at 07:31:57PM +0530, Santosh Shukla wrote:
> >> On Wed, Dec 16, 2015 at 7:18 PM, Yuanhan Liu
> >>  wrote:
> >> > On Mon, Dec 14, 2015 at 06:30:24PM +0530, Santosh Shukla wrote:
> >> >> In x86 case io_base to store ioport address not more than 65535 
> >> >> ioports. i.e..0
> >> >> to  but in non-x86 case in particular arm64 it need to store more 
> >> >> than 32
> >> >> bit address so changing io_base datatype from 32 to 64.
> >> >>
> >> >> Signed-off-by: Santosh Shukla 
> >> >> ---
> >> >>  drivers/net/virtio/virtio_ethdev.c |2 +-
> >> >>  drivers/net/virtio/virtio_pci.h|4 ++--
> >> >>  2 files changed, 3 insertions(+), 3 deletions(-)
> >> >>
> >> >> diff --git a/drivers/net/virtio/virtio_ethdev.c 
> >> >> b/drivers/net/virtio/virtio_ethdev.c
> >> >> index d928339..620e0d4 100644
> >> >> --- a/drivers/net/virtio/virtio_ethdev.c
> >> >> +++ b/drivers/net/virtio/virtio_ethdev.c
> >> >> @@ -1291,7 +1291,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
> >> >>   return -1;
> >> >>
> >> >>   hw->use_msix = virtio_has_msix(&pci_dev->addr);
> >> >> - hw->io_base = (uint32_t)(uintptr_t)pci_dev->mem_resource[0].addr;
> >> >> + hw->io_base = (uint64_t)(uintptr_t)pci_dev->mem_resource[0].addr;
> >> >
> >> > I'd suggest to move the io_base assignment (and cast) into 
> >> > virtio_ioport_init()
> >> > so that we could do the correct cast there, say cast it to uint32_t for
> >> > X86, and uint64_t for others.
> >> >
> >>
> >> Ok.
> >>
> >> This was deliberately done considering your 1.0 virtio spec patch do
> >> care for uint64_t types and in arm64 case, If I plan to use those
> >> future patches, IMO it make more sense to me keep it in uint64_t way;
> >
> > I did different cast, 32 bit for legacy virtio pci device, and 64 bit
> > for modern virtio pci device.
> >
> >> Also in x86 case max address could of type 0x1000-101f and so forth;
> >> changing data-type to uint64_t default wont effect such address,
> >> right?
> >
> > Right, but what's the harm of doing the right cast? :)
> >
> 
> Agree.
> 
> >> And hw->io_base by looking at virtio_pci.h function like
> >> inb/outb etc.. takes io_base address as unsigned long types which is
> >> arch dependent; i.e.. 4 byte for 32 bit and 8 for 64 bit so the lower
> >> level rd/wr apis are taking care of data-types accordingly.
> >
> > Didn't get it. inb/outb takes "unsigned short" arguments, but not
> > "unsigned long".
> >
> 
> sys/io.h in x86 case using unsigned short int  types..
> 
> include/asm-generic/io.h for arm64 using it unsigned long (from linux
> header files)
> 
> In such case keeping
> #define VIRTIO_PCI_REG_ADDR(hw, reg) \
> (unsigned short)((hw)->io_base + (reg))
> 
> would be x86 specific and what I thought and used in this patch is
> 
> #define VIRTIO_PCI_REG_ADDR(hw, reg) \
> (unsigned long)((hw)->io_base + (reg))
> 
> to avoid ifdef ARM or non-x86..clutter, I know data-type is not right
> fit for x86 sys/io.h but considering possible address inside
> hw->io_base, wont effect functionality and performance my any mean.
> That is why at virtio_ethdev_init() i choose to keep it in hw->io_base
> = (uint64_t) types.
> 
> Otherwise I'll have to duplicate VIRTIO_PCI_REG_XXX definition for
> non-x86 case, Pl. suggest better alternative. Thanks


My understanding is that if you have done the right cast the first
time (at the io_base assignment), casting from a shorter type to a longer
type will not matter: the upper bits will be filled with zeros.

So, I guess we are fine here. I'm thinking that the extra cast in
VIRTIO_PCI_REG_ADDR() is not necessary, as C will do the right
conversion for each inb(), say to "unsigned short" for x86 and
"unsigned long" for your arm implementation. The same goes for the
other io helpers.
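
A small sketch of that point. The two functions are hypothetical stand-ins
for an x86-style inb() taking "unsigned short" and an arm-style one taking
"unsigned long"; REG_ADDR is an illustrative cast-free variant of
VIRTIO_PCI_REG_ADDR(), not the macro from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for an x86-style helper: 16-bit port number. */
static unsigned short
port_as_seen_by_x86(unsigned short port)
{
	return port;
}

/* Hypothetical stand-in for an arm-style helper: full-width address. */
static unsigned long
port_as_seen_by_arm(unsigned long addr)
{
	return addr;
}

/*
 * No cast in the macro: the sum has the type of io_base, and the
 * prototype of whichever helper is called converts the argument.
 */
#define REG_ADDR(io_base, reg) ((io_base) + (reg))
```

With a uint64_t io_base of 0x1000 and a register offset of 0x12, both
helpers receive 0x1012; the conversion to the parameter type happens at
the call, without help from the macro.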

--yliu


[dpdk-dev] [ [PATCH v2] 05/13] virtio: change io_base datatype from uint32_t to uint64_type

2015-12-16 Thread Santosh Shukla
On Wed, Dec 16, 2015 at 8:28 PM, Yuanhan Liu
 wrote:
> On Wed, Dec 16, 2015 at 08:09:40PM +0530, Santosh Shukla wrote:
>> On Wed, Dec 16, 2015 at 7:53 PM, Yuanhan Liu
>>  wrote:
>> > On Wed, Dec 16, 2015 at 07:31:57PM +0530, Santosh Shukla wrote:
>> >> On Wed, Dec 16, 2015 at 7:18 PM, Yuanhan Liu
>> >>  wrote:
>> >> > On Mon, Dec 14, 2015 at 06:30:24PM +0530, Santosh Shukla wrote:
>> >> >> In x86 case io_base to store ioport address not more than 65535 
>> >> >> ioports. i.e..0
>> >> >> to  but in non-x86 case in particular arm64 it need to store more 
>> >> >> than 32
>> >> >> bit address so changing io_base datatype from 32 to 64.
>> >> >>
>> >> >> Signed-off-by: Santosh Shukla 
>> >> >> ---
>> >> >>  drivers/net/virtio/virtio_ethdev.c |2 +-
>> >> >>  drivers/net/virtio/virtio_pci.h|4 ++--
>> >> >>  2 files changed, 3 insertions(+), 3 deletions(-)
>> >> >>
>> >> >> diff --git a/drivers/net/virtio/virtio_ethdev.c 
>> >> >> b/drivers/net/virtio/virtio_ethdev.c
>> >> >> index d928339..620e0d4 100644
>> >> >> --- a/drivers/net/virtio/virtio_ethdev.c
>> >> >> +++ b/drivers/net/virtio/virtio_ethdev.c
>> >> >> @@ -1291,7 +1291,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
>> >> >>   return -1;
>> >> >>
>> >> >>   hw->use_msix = virtio_has_msix(&pci_dev->addr);
>> >> >> - hw->io_base = (uint32_t)(uintptr_t)pci_dev->mem_resource[0].addr;
>> >> >> + hw->io_base = (uint64_t)(uintptr_t)pci_dev->mem_resource[0].addr;
>> >> >
>> >> > I'd suggest to move the io_base assignment (and cast) into 
>> >> > virtio_ioport_init()
>> >> > so that we could do the correct cast there, say cast it to uint32_t for
>> >> > X86, and uint64_t for others.
>> >> >
>> >>
>> >> Ok.
>> >>
>> >> This was deliberately done considering your 1.0 virtio spec patch do
>> >> care for uint64_t types and in arm64 case, If I plan to use those
>> >> future patches, IMO it make more sense to me keep it in uint64_t way;
>> >
>> > I did different cast, 32 bit for legacy virtio pci device, and 64 bit
>> > for modern virtio pci device.
>> >
>> >> Also in x86 case max address could of type 0x1000-101f and so forth;
>> >> changing data-type to uint64_t default wont effect such address,
>> >> right?
>> >
>> > Right, but what's the harm of doing the right cast? :)
>> >
>>
>> Agree.
>>
>> >> And hw->io_base by looking at virtio_pci.h function like
>> >> inb/outb etc.. takes io_base address as unsigned long types which is
>> >> arch dependent; i.e.. 4 byte for 32 bit and 8 for 64 bit so the lower
>> >> level rd/wr apis are taking care of data-types accordingly.
>> >
>> > Didn't get it. inb/outb takes "unsigned short" arguments, but not
>> > "unsigned long".
>> >
>>
>> sys/io.h in the x86 case uses unsigned short int types.
>>
>> include/asm-generic/io.h for arm64 uses unsigned long (from the linux
>> header files).
>>
>> In such case keeping
>> #define VIRTIO_PCI_REG_ADDR(hw, reg) \
>> (unsigned short)((hw)->io_base + (reg))
>>
>> would be x86 specific and what I thought and used in this patch is
>>
>> #define VIRTIO_PCI_REG_ADDR(hw, reg) \
>> (unsigned long)((hw)->io_base + (reg))
>>
>> to avoid ifdef ARM or non-x86 clutter. I know the data type is not the
>> right fit for x86 sys/io.h, but considering the possible addresses inside
>> hw->io_base, it won't affect functionality or performance by any means.
>> That is why in eth_virtio_dev_init() I chose to keep hw->io_base
>> = (uint64_t).
>>
>> Otherwise I'll have to duplicate the VIRTIO_PCI_REG_XXX definitions for
>> the non-x86 case. Please suggest a better alternative. Thanks
>
>
> My understanding is that if you have done the right cast in the first
> place (at the io_base assignment), converting from a shorter unsigned type
> to a longer type will not matter: the upper bits will be filled with zero.
>
> So, I guess we are fine here. I'm thinking that the extra cast in
> VIRTIO_PCI_REG_ADDR() is not necessary, as C will do the right implicit
> conversion for each inb(), say convert it to "unsigned short" for x86,
> and "unsigned long" for your arm implementation. The same goes for the
> other io helpers.
>

So to summarize, and correct me if I misunderstood:
keep hw->io_base = (uint64_t)
and remove the extra cast {i.e. (unsigned short) for x86 or (unsigned
long) for the non-x86/arm64 case} in VIRTIO_PCI_REG_ADDR().

Did I get everything right?


> --yliu
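
The conversion behaviour discussed in this thread can be checked with a small standalone C sketch. This is not DPDK code: take_ushort() and take_ulong() are hypothetical stand-ins for an x86-style inb() that takes "unsigned short" and an arm64-style helper that takes "unsigned long", and the sketch assumes a 16-bit unsigned short. It only illustrates that a uint64_t io_base needs no explicit cast in a VIRTIO_PCI_REG_ADDR()-like expression, because the argument is converted implicitly at the call.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an x86-style helper taking "unsigned short". */
static uint64_t take_ushort(unsigned short port)
{
	return port;	/* widening an unsigned value zero-fills the upper bits */
}

/* Hypothetical stand-in for an arm64-style helper taking "unsigned long". */
static uint64_t take_ulong(unsigned long addr)
{
	return addr;
}

/* No cast on io_base + reg: C converts the argument to the parameter type. */
uint64_t reg_addr_x86_style(uint64_t io_base, uint32_t reg)
{
	return take_ushort(io_base + reg);
}

uint64_t reg_addr_arm64_style(uint64_t io_base, uint32_t reg)
{
	return take_ulong(io_base + reg);
}

/* Widening an unsigned 16-bit value never sets the upper bits. */
void check_zero_extension(void)
{
	uint16_t narrow = 0xFFFF;
	uint64_t wide = narrow;

	assert(wide == 0xFFFFu);
}
```

For a legacy I/O port base such as 0x1000, both helpers see the same address. A base above 0xFFFF would be truncated silently by the "unsigned short" variant, which is why where the first cast happens still matters.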


[dpdk-dev] VFIO no-iommu

2015-12-16 Thread Burakov, Anatoly
Hi Thomas,

> > On Tue, Dec 15, 2015 at 09:53:18AM -0700, Alex Williamson wrote:
> > So it works. Is it acceptable? Useful? Sufficiently complete? Does
> > it imply deprecating the uio interface? I believe the feature that
> > started this discussion was support for MSI/X interrupts so that VFs
> > can support some kind of interrupt (uio only supports INTx since it
> > doesn't allow DMA). Implementing that would be the ultimate test of
> > whether this provides dpdk with not only a more consistent interface,
> > but the feature dpdk wants that's missing in uio. Thanks,

Ferruh has done a great job so far testing Alex's patch, very few changes from 
DPDK side seem to be required as far as existing functionality goes (not sure 
about VF interrupts mentioned by Alex). However, one thing that concerns me is 
usability. While it is true that no-IOMMU mode in VFIO would mean uio 
interfaces could be deprecated in time, the no-iommu mode is way more hassle 
than using igb_uio/uio_pci_generic because it will require a kernel recompile 
as opposed to simply compiling and insmod'ding an out-of-tree driver. So, in 
essence, if you don't want an IOMMU, it's becoming that much harder to use 
DPDK. Would that be something DPDK is willing to live with in the absence of 
uio interfaces?

Thanks,
Anatoly


[dpdk-dev] VFIO no-iommu

2015-12-16 Thread Alex Williamson
On Wed, 2015-12-16 at 08:35 +, Burakov, Anatoly wrote:
> Hi Alex,
> 
> > On Wed, 2015-12-16 at 04:04 +, Ferruh Yigit wrote:
> > > On Tue, Dec 15, 2015 at 09:53:18AM -0700, Alex Williamson wrote:
> > > I tested the DPDK (HEAD of master) with the patch, with help of
> > > Anatoly, and DPDK works in no-iommu environment with a little
> > > modification.
> > > 
> > > Basically the only modification is adapt new group naming
> > > (noiommu-$)
> > > and
> > 
> > Sorry, forgot to mention that one. The intention with the modified
> > group
> > name is that I want to be very certain that a user intending to
> > only support
> > properly iommu isolated devices doesn't accidentally need to deal
> > with these
> > no-iommu mode devices.
> > 
> > > disable dma mapping (VFIO_IOMMU_MAP_DMA)
> > > 
> > > Also I need to disable VFIO_CHECK_EXTENSION ioctl, because in
> > > vfio
> > > module,
> > > container->noiommu is not set before doing a
> > > vfio_group_set_container()
> > > and vfio_for_each_iommu_driver selects wrong driver.
> > 
> > Running CHECK_EXTENSION on a container without the group attached
> > is
> > only going to tell you what extensions vfio is capable of, not
> > necessarily what
> > extensions are available to you with that group. Is this just a
> > general dpdk-vfio ordering bug?
> 
> Yes, that is how VFIO was implemented in DPDK. I was under the
> impression that checking extension before assigning devices was the
> correct way to do things, so as not to try anything we know would
> fail anyway. Does this imply that CHECK_EXTENSION needs to be called
> on both container and groups (or just on groups)?

Hmm, in Documentation/vfio.txt we do give the following algorithm:

	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
		/* Unknown API version */

	if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
		/* Doesn't support the IOMMU driver we want. */
	...

That's just going to query each iommu driver and we can't yet say
whether the group the user attaches to the container later will
actually support that extension until we try to do it, that would come
at VFIO_SET_IOMMU. So is it perhaps a vfio bug that we're not
advertising no-iommu until the group is attached? After all, we are
capable of it with just an empty container, just like we are with
type1, but we're going to fail SET_IOMMU for the wrong combination.
This is exactly the sort of thing that makes me glad we reverted it
without feedback from a working user driver. Thanks,

Alex


[dpdk-dev] VFIO no-iommu

2015-12-16 Thread Burakov, Anatoly
Hi Alex,

> On Wed, 2015-12-16 at 08:35 +, Burakov, Anatoly wrote:
> > Hi Alex,
> >
> > > On Wed, 2015-12-16 at 04:04 +, Ferruh Yigit wrote:
> > > > On Tue, Dec 15, 2015 at 09:53:18AM -0700, Alex Williamson wrote:
> > > > I tested the DPDK (HEAD of master) with the patch, with help of
> > > > Anatoly, and DPDK works in no-iommu environment with a little
> > > > modification.
> > > >
> > > > Basically the only modification is adapt new group naming
> > > > (noiommu-$)
> > > > and
> > >
> > > Sorry, forgot to mention that one. The intention with the modified
> > > group name is that I want to be very certain that a user intending
> > > to only support properly iommu isolated devices doesn't accidentally
> > > need to deal with these no-iommu mode devices.
> > >
> > > > disable dma mapping (VFIO_IOMMU_MAP_DMA)
> > > >
> > > > Also I need to disable VFIO_CHECK_EXTENSION ioctl, because in vfio
> > > > module,
> > > > container->noiommu is not set before doing a
> > > > vfio_group_set_container()
> > > > and vfio_for_each_iommu_driver selects wrong driver.
> > >
> > > Running CHECK_EXTENSION on a container without the group attached is
> > > only going to tell you what extensions vfio is capable of, not
> > > necessarily what extensions are available to you with that group.
> > > Is this just a general dpdk-vfio ordering bug?
> >
> > Yes, that is how VFIO was implemented in DPDK. I was under the
> > impression that checking extension before assigning devices was the
> > correct way to do things, so as not to try anything we know would
> > fail anyway. Does this imply that CHECK_EXTENSION needs to be called
> > on both container and groups (or just on groups)?
> 
> Hmm, in Documentation/vfio.txt we do give the following algorithm:
> 
> 	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
> 		/* Unknown API version */
> 
> 	if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
> 		/* Doesn't support the IOMMU driver we want. */
> 	...
> 
> That's just going to query each iommu driver and we can't yet say whether
> the group the user attaches to the container later will actually support that
> extension until we try to do it, that would come at VFIO_SET_IOMMU. So is
> it perhaps a vfio bug that we're not advertising no-iommu until the group is
> attached? After all, we are capable of it with just an empty container, just
> like we are with type1, but we're going to fail SET_IOMMU for the wrong
> combination.
> This is exactly the sort of thing that makes me glad we reverted it without
> feedback from a working user driver. Thanks,

Whether it should be considered a "bug" in VFIO or "by design" is up to you, of 
course, but at least according to the VFIO documentation, we are meant to check 
for type 1 extension and then attach devices, so it would be expected to get 
VFIO_NOIOMMU_IOMMU marked as supported even without any devices attached to the 
container (just like we get type 1 as supported without any devices attached). 
Having said that, if it was meant to attach devices first and then check the 
extensions, then perhaps the documentation should also point out that fact (or 
perhaps I missed that detail in my readings of the docs, in which case my 
apologies).

Thanks,
Anatoly


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Matthew Hall
On Wed, Dec 16, 2015 at 11:56:11AM +, Bruce Richardson wrote:
> Having this work with any application is one of our primary targets here. 
> The app author should not have to worry too much about getting basic debug 
> support. Even if it doesn't work at 40G small packet rates, you can get a 
> lot of benefit from a scheme that provides functional debugging for an app. 

I think my issue is that I don't think I buy into this particular set of 
assumptions above.

I don't think a capture mechanism that doesn't work right in the real use 
cases of the apps actually buys us much. If all we care about is quickly 
dumping some frames to a pcap for occasional debugging, I already have some C 
code for that I can donate which is a lot less complicated than the trouble 
being proposed for "basic debug support". Or we could use libpcap's 
equivalent... but it's quite a lot more complicated than the code I have.

If we're going to assign engineers to this it's costing somebody a lot of time 
and money. So I'd prefer to get them focused on something that will always 
work even with high loads, such as real bpfjit support.

Matthew.


[dpdk-dev] [PATCH v6 0/5] ethdev: add speed capabilities and refactor link API

2015-12-16 Thread Marc Sune
2015-10-25 22:59 GMT+01:00 Marc Sune :

> The current rte_eth_dev_info abstraction does not provide any mechanism to
> get the supported speed(s) of an ethdev.
>
> For some drivers (e.g. ixgbe), an educated guess could be done based on the
> driver's name (driver_name in rte_eth_dev_info), see:
>
> http://dpdk.org/ml/archives/dev/2013-August/000412.html
>
> However, i) doing string comparisons is annoying, and can silently
> break existing applications if PMDs change their names, ii) it does not
> provide all the supported capabilities of the ethdev, and iii) for some
> drivers it is impossible for the application to determine the (max) speed
> correctly (e.g. in i40e, distinguishing between XL710 and X710).
>
> In addition, the link APIs do not allow defining a set of advertised link
> speeds for autonegotiation.
>
> This series of patches adds the following capabilities:
>
> * speed_capa bitmap in rte_eth_dev_info, which is filled by the PMDs
>   according to the physical device capabilities.
> * refactors link API in ethdev to allow the definition of the advertised
>   link speeds, a fixed speed (no auto-negotiation) or advertising all
>   supported speeds (default).
>
> WARNING: this patch series, specifically 3/4, is NOT tested for most of the
> PMDs, due to the lack of hardware. Only generic EM is tested (VM).
> Minor bugs expected.
>

I will respin this patch to current HEAD targeting 2.3, but note that
testing of PMDs other than i40e and e1000 (82540EM) is necessary for this
patch to be merged.

I do not have all the HW to test it, so I would like to ask for some help
here. Some (more) peer reviews would also help.

Regards
marc


>
> * * * * *
>
> v2: rebase, converted speed_capa into 32 bits bitmap, fixed alignment
> (checkpatch).
>
> v3: rebase to v2.1. unified ETH_LINK_SPEED and ETH_SPEED_CAP into
> ETH_SPEED. Converted field speed in struct rte_eth_conf to speeds, to
> allow a bitmap for defining the announced speeds, as suggested by
> M. Brorup. Fixed spelling issues.
>
> v4: fixed errata in the documentation of field speeds of rte_eth_conf, and
> commit 1/2 message. rebased to v2.1.0. v3 was incorrectly based on
> ~2.1.0-rc1.
>
> v5: revert to v2 speed capabilities patch. Fixed MLX4 speed capabilities
> (thanks N. Laranjeiro). Refactored link speed API to allow setting
> advertised speeds (3/4). Added NO_AUTONEG option to explicitly disable
> auto-negotiation. Updated 2.2 rel. notes (4/4). Rebased to current HEAD.
>
> v6: Move link_duplex to be part of bitfield. Fixed i40e autoneg flag link
> update code. Added rte_eth_speed_to_bm_flag() to .map file. Fixed other
> spelling issues. Rebased to current HEAD.
>
> Marc Sune (5):
>   ethdev: Added ETH_SPEED_CAP bitmap for ports
>   ethdev: Fill speed capability bitmaps in the PMDs
>   ethdev: redesign link speed config API
>   doc: update with link changes
>   ethdev: add rte_eth_speed_to_bm_flag() to ver. map
>
>  app/test-pmd/cmdline.c | 124
> +++--
>  app/test/virtual_pmd.c |   4 +-
>  doc/guides/rel_notes/release_2_2.rst   |  23 ++
>  drivers/net/af_packet/rte_eth_af_packet.c  |   5 +-
>  drivers/net/bonding/rte_eth_bond_8023ad.c  |  14 ++--
>  drivers/net/cxgbe/base/t4_hw.c |   8 +-
>  drivers/net/e1000/base/e1000_80003es2lan.c |   6 +-
>  drivers/net/e1000/base/e1000_82541.c   |   8 +-
>  drivers/net/e1000/base/e1000_82543.c   |   4 +-
>  drivers/net/e1000/base/e1000_82575.c   |  11 +--
>  drivers/net/e1000/base/e1000_api.c |   2 +-
>  drivers/net/e1000/base/e1000_api.h |   2 +-
>  drivers/net/e1000/base/e1000_defines.h |   4 +-
>  drivers/net/e1000/base/e1000_hw.h  |   2 +-
>  drivers/net/e1000/base/e1000_ich8lan.c |   4 +-
>  drivers/net/e1000/base/e1000_mac.c |   9 ++-
>  drivers/net/e1000/base/e1000_mac.h |   6 +-
>  drivers/net/e1000/base/e1000_vf.c  |   4 +-
>  drivers/net/e1000/base/e1000_vf.h  |   2 +-
>  drivers/net/e1000/em_ethdev.c  | 109 -
>  drivers/net/e1000/igb_ethdev.c | 104 +---
>  drivers/net/fm10k/fm10k_ethdev.c   |   5 +-
>  drivers/net/i40e/i40e_ethdev.c |  78 ++
>  drivers/net/i40e/i40e_ethdev_vf.c  |  11 +--
>  drivers/net/ixgbe/ixgbe_ethdev.c   |  74 -
>  drivers/net/mlx4/mlx4.c|   6 ++
>  drivers/net/mpipe/mpipe_tilegx.c   |   6 +-
>  drivers/net/null/rte_eth_null.c|   5 +-
>  drivers/net/pcap/rte_eth_pcap.c|   9 ++-
>  drivers/net/ring/rte_eth_ring.c|   5 +-
>  drivers/net/virtio/virtio_ethdev.c |   2 +-
>  drivers/net/virtio/virtio_ethdev.h |   2 -
>  drivers/net/vmxnet3/vmxnet3_ethdev.c   |   5 +-
>  drivers/net/xenvirt/rte_eth_xenvirt.c  |   5 +-
>  examples/ip_pipeline/config_parse.c|   3 +-
>  lib/li
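
The speed capability bitmap described in the cover letter can be illustrated with a minimal standalone sketch. The names below (EX_SPEED_*, ex_speed_to_flag() and friends) are hypothetical and are not the flags or functions from the patch series; the sketch only shows the general idea of one bit per link speed, a numeric-speed-to-flag lookup, and masking the requested speeds against what the device reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag names: one bit per supported link speed. */
#define EX_SPEED_100M	(1u << 0)
#define EX_SPEED_1G	(1u << 1)
#define EX_SPEED_10G	(1u << 2)
#define EX_SPEED_40G	(1u << 3)

static const struct {
	uint32_t mbps;
	uint32_t flag;
} ex_speed_table[] = {
	{ 100, EX_SPEED_100M },
	{ 1000, EX_SPEED_1G },
	{ 10000, EX_SPEED_10G },
	{ 40000, EX_SPEED_40G },
};

#define EX_SPEED_TABLE_LEN (sizeof(ex_speed_table) / sizeof(ex_speed_table[0]))

/* Map a numeric speed in Mbps to its bitmap flag, or 0 if unknown. */
uint32_t ex_speed_to_flag(uint32_t mbps)
{
	size_t i;

	for (i = 0; i < EX_SPEED_TABLE_LEN; i++)
		if (ex_speed_table[i].mbps == mbps)
			return ex_speed_table[i].flag;
	return 0;
}

/* Highest speed (in Mbps) present in a capability bitmap, or 0 if empty. */
uint32_t ex_max_speed_mbps(uint32_t capa)
{
	uint32_t max = 0;
	size_t i;

	for (i = 0; i < EX_SPEED_TABLE_LEN; i++)
		if ((capa & ex_speed_table[i].flag) && ex_speed_table[i].mbps > max)
			max = ex_speed_table[i].mbps;
	return max;
}

/* Speeds that can actually be advertised: requested AND supported. */
uint32_t ex_advertised(uint32_t capa, uint32_t wanted)
{
	return capa & wanted;
}
```

With a bitmap like this an application would no longer need to guess the maximum speed from the driver name, which is the problem the cover letter describes.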

[dpdk-dev] [PATCH v6 0/5] ethdev: add speed capabilities and refactor link API

2015-12-16 Thread Olga Shern
We will test on Mellanox NICs

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Marc Sune
Sent: Wednesday, December 16, 2015 10:38 PM
To: dev at dpdk.org
Subject: Re: [dpdk-dev] [PATCH v6 0/5] ethdev: add speed capabilities and 
refactor link API


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Morten Brørup
Bruce,

Matthew presented a very important point a few hours ago: We don't need tcpdump 
support for debugging the application in a lab; we already have plenty of other 
tools for debugging what we are developing. We need tcpdump support for 
debugging network issues in a production network.

In my "hardened network appliance" world, a solution designed purely for legacy 
applications (tcpdump, Wireshark etc.) is useless because the network 
technician doesn't have access to these applications on the appliance.

While a PC system running a DPDK based application might have plenty of spare 
lcores for filtering, the SmartShare appliances are already using all lcores 
for dedicated purposes, so the runtime filtering has to be done by the IO 
lcores (otherwise we would have to rehash everything and reallocate some lcores 
for mirroring, which I strongly oppose). Our non-DPDK firmware has also always 
been filtering directly in the fast path.

If the filter is so complex that it unexpectedly degrades the normal traffic 
forwarding performance, the mirror still reflects all the forwarded network 
traffic, not just some of it. In many real life network debugging scenarios 
this is better than the alternative: keeping the traffic forwarding up at full 
performance and having a network technician trying to understand a mirror 
output where some of the relevant packets are unexpectedly missing.

Although it is generally considered bad design if a system's behavior (or 
performance) changes unexpectedly when debugging features are being used, 
experienced network technicians have already grown accustomed to the 
performance of most non-trivial network equipment depending on the number of 
features enabled and how it is configured, so reality might beat theory here. 
(Still, other companies might prefer to keep their fast path performance 
unaffected and dedicate/reallocate some lcores for filtering.)

I am probably repeating myself here, but I would prefer if the DPDK provided 
the packet capturing framework in the form of a set of efficient libraries for 
1. BPF filtering (e.g. a simple BPF interpreter or a DPDK variant of bpfjit), 
2. scalable packet queueing for the mirrored packets (probably multi producer, 
single or multi consumer), as well as 3. high resolution time stamping 
(preferably easily convertible to the pcap file packet timestamp format). Then 
the DPDK application can take care of interfacing to the attached application 
and outputting the mirrored packets to the appropriate destination, e.g. a pcap 
file, a Wireshark extcap named pipe, a dedicated RSPAN VLAN, or an ERSPAN 
tunnel. And an example application should show how to bind all this together in 
a tcpdump-like scenario for debugging a production network.

A note about timestamps: In theory, the captured packets should be time stamped 
as early as possible. In practice though, it is probably sufficiently accurate 
to time stamp the accepted packets after filtering, especially if they are 
filtered by an IO lcore. Alternatively, they can be time stamped when consumed 
from the mirror output queue.
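
A minimal sketch of the timestamp conversion mentioned above, assuming the capture path stores a raw cycle counter value per packet and knows the counter frequency. The names (ex_pcap_ts, ex_cycles_to_pcap_ts(), epoch_sec_at_zero) are hypothetical; the only external fact relied on is that the classic pcap per-packet header carries a 32-bit seconds field and a 32-bit microseconds field.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds/microseconds pair as used by the classic pcap record header. */
struct ex_pcap_ts {
	uint32_t ts_sec;
	uint32_t ts_usec;
};

/*
 * Convert a free-running cycle counter value to a pcap timestamp.
 * hz is the counter frequency; epoch_sec_at_zero is the wall-clock
 * second that corresponds to counter value 0 (both are assumptions
 * the application has to supply).
 */
struct ex_pcap_ts ex_cycles_to_pcap_ts(uint64_t cycles, uint64_t hz,
				       uint32_t epoch_sec_at_zero)
{
	struct ex_pcap_ts ts;
	uint64_t sec, rem;

	assert(hz != 0);
	sec = cycles / hz;
	rem = cycles % hz;	/* cycles into the current second, < hz */

	ts.ts_sec = (uint32_t)(epoch_sec_at_zero + sec);
	/* rem < hz, so rem * 1000000 fits in 64 bits for hz below ~1.8e13 */
	ts.ts_usec = (uint32_t)(rem * 1000000u / hz);
	return ts;
}
```

Splitting into whole seconds and a remainder first avoids overflowing the 64-bit multiplication for long uptimes, and the division can be done off the fast path when the packet is consumed from the mirror output queue.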

A note about packet ordering: Mirrored packets belonging to different flows are 
probably out of order because of RSS, where multiple lcores contribute to the 
mirror output. This packet ordering inaccuracy could also serve as a reason for 
not being too strict about the accuracy of the timestamps on the mirrored 
packets.


Med venlig hilsen / kind regards
- Morten Brørup


-Original Message-
From: Bruce Richardson [mailto:bruce.richard...@intel.com] 
Sent: 16. december 2015 14:13
To: Morten Brørup
Cc: Matthew Hall; Kyle Larose; dev at dpdk.org
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

On Wed, Dec 16, 2015 at 01:26:11PM +0100, Morten Brørup wrote:
> Bruce,
> 
> Please note that tcpdump is a stupid name for a packet capture application 
> that supports much more than just TCP.
> 
> I had missed the point about ethdev supporting virtual interfaces, so thank 
> you for pointing that out. That covers my concerns about capturing packets 
> inside tunnels.
> 
> I will gladly admit that you Intel guys are probably much more competent in 
> the field of DPDK performance and scalability than I am. So Matthew and I 
> have been asking you to kindly ensure that your solution scales well at very 
> high packet rates too, and pointing out that filtering before copying is 
> probably cheaper than copying before filtering. You mention that it leads to 
> an important choice about which lcores get to do the work of filtering the 
> packets, so that might be worth some discussion.
> 
> :-)
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 

Thanks for your support.

We may look at having a certain amount of flexibility in the configuration of 
the setup, so as to avoid limiting the use of the functionality.

For scalability at very high packet rates, it's something we'll need you guys 
to give us pointers on too - what's acceptable or not

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Matthew Hall
On Wed, Dec 16, 2015 at 11:45:46PM +0100, Morten Brørup wrote:
> Matthew presented a very important point a few hours ago: We don't need 
> tcpdump support for debugging the application in a lab; we already have 
> plenty of other tools for debugging what we are developing. We need tcpdump 
> support for debugging network issues in a production network.

+1

> In my "hardened network appliance" world, a solution designed purely for 
> legacy applications (tcpdump, Wireshark etc.) is useless because the network 
> technician doesn't have access to these applications on the appliance.

Maybe that's true on one exact system. But I've used a whole ton of systems 
including appliances where this was not true. I really do want to find a way 
to support them, but according to my recent discussions w/ Alex Nasonov who 
made bpfjit, I don't think it is possible without really tearing apart 
libpcap. So for now the only good hope is Wireshark's Extcap support.

> While a PC system running a DPDK based application might have plenty of 
> spare lcores for filtering, the SmartShare appliances are already using all 
> lcores for dedicated purposes, so the runtime filtering has to be done by 
> the IO lcores (otherwise we would have to rehash everything and reallocate 
> some lcores for mirroring, which I strongly oppose). Our non-DPDK firmware 
> has also always been filtering directly in the fast path.

The shared process stuff and weird leftover lcore stuff seems way too complex 
for me whether or not there are any spare lcores. To me it seems easier if I 
just call some function and hand it mbufs, and it would quickly check them 
against a linked list of active filters if filters are present, or do nothing 
and return if no filter is active.

> If the filter is so complex that it unexpectedly degrades the normal traffic 
> forwarding performance

If bpfjit is used, I think it is very hard to affect the performance much. 
Unless you do something incredibly crazy.

> Although it is generally considered bad design if a system's behavior (or 
> performance) changes unexpectedly when debugging features are being used, 

I think we can keep the behavior change quite small using something like what 
I described.

> Other companies might prefer to keep their fast path performance unaffected 
> and dedicate/reallocate some lcores for filtering.

It always starts out unaffected... then goes back to accepting a bit of 
slowness when people are forced to re-learn how bad it is with no debugging. I 
have seen it again and again in many companies. Hence my proposal for 
efficient lightweight debugging support from the beginning.

> 1. BPF filtering (... a DPDK variant of bpfjit),

+1

> 2. scalable packet queueing for the mirrored packets (probably multi 
> producer, single or multi consumer)

I hate queueing. Queueing always reduces max possible throughput because 
queueing is inefficient. It is better just to put them where they need to go 
immediately (run to completion) while the mbufs are already prefetched.

> Then the DPDK application can take care of interfacing to 
> the attached application and outputting the mirrored packets to the 
> appropriate destination

Too complicated. Pcap and extcap should be working by default.

> A note about packet ordering: Mirrored packets belonging to different flows 
> are probably out of order because of RSS, where multiple lcores contribute 
> to the mirror output.

Where I worry is weird configurations where a flow can occur in >1 cores. But 
I think most users try not to do this.
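
The "hand it mbufs and check them against a linked list of active filters" idea from this message could look roughly like the sketch below. Everything here is hypothetical: struct ex_pkt stands in for an mbuf, the match callback stands in for a compiled filter (for example one produced by bpfjit), and the list is not protected against concurrent updates, which a real implementation would have to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf. */
struct ex_pkt {
	const uint8_t *data;
	uint32_t len;
};

typedef int (*ex_match_fn)(const struct ex_pkt *pkt, void *arg);
typedef void (*ex_sink_fn)(const struct ex_pkt *pkt, void *arg);

struct ex_filter {
	ex_match_fn match;	/* stand-in for a compiled filter program */
	void *match_arg;
	ex_sink_fn sink;	/* where matching packets go (pcap writer, ...) */
	void *sink_arg;
	struct ex_filter *next;
};

/* NULL while no capture is active, so the fast path is one pointer test. */
static struct ex_filter *ex_active_filters;

void ex_filter_add(struct ex_filter *f)
{
	assert(f != NULL && f->match != NULL && f->sink != NULL);
	f->next = ex_active_filters;
	ex_active_filters = f;
}

void ex_filter_remove(struct ex_filter *f)
{
	struct ex_filter **pp = &ex_active_filters;

	while (*pp != NULL && *pp != f)
		pp = &(*pp)->next;
	if (*pp != NULL)
		*pp = f->next;
}

/* Called from the I/O loop with a burst; returns the number of matches. */
unsigned ex_capture_burst(const struct ex_pkt *const *pkts, unsigned n)
{
	const struct ex_filter *f;
	unsigned i, hits = 0;

	if (ex_active_filters == NULL)
		return 0;	/* no filter active: do nothing and return */

	for (i = 0; i < n; i++)
		for (f = ex_active_filters; f != NULL; f = f->next)
			if (f->match(pkts[i], f->match_arg)) {
				f->sink(pkts[i], f->sink_arg);
				hits++;
			}
	return hits;
}

/* Sample matcher: accept packets of at least *(uint32_t *)arg bytes. */
int ex_match_min_len(const struct ex_pkt *pkt, void *arg)
{
	return pkt->len >= *(const uint32_t *)arg;
}

/* Sample sink: count matching packets in *(unsigned *)arg. */
void ex_sink_count(const struct ex_pkt *pkt, void *arg)
{
	(void)pkt;
	(*(unsigned *)arg)++;
}
```

The sink runs inline in the caller's context, in the run-to-completion spirit argued for above, so no queue sits between the filter and the output.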


[dpdk-dev] [PATCH] log: add missing symbol

2015-12-16 Thread Stephen Hemminger
The rte_get_log_type and rte_get_log_level functions have been available
for many versions. But they are missing from the shared library map
and therefore do not get exported correctly.

Signed-off-by: Stephen Hemminger 
---
 lib/librte_eal/linuxapp/eal/rte_eal_version.map | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/librte_eal/linuxapp/eal/rte_eal_version.map b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
index cbe175f..51a241c 100644
--- a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
+++ b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
@@ -93,7 +93,9 @@ DPDK_2.0 {
 	rte_realloc;
 	rte_set_application_usage_hook;
 	rte_set_log_level;
+	rte_get_log_level;
 	rte_set_log_type;
+	rte_get_log_type;
 	rte_socket_id;
 	rte_strerror;
 	rte_strsplit;
-- 
2.1.4



[dpdk-dev] [PATCH] Unlink existing unused sockets at start up

2015-12-16 Thread Zhihong Wang
This patch unlinks existing unused sockets (which cause new bindings to fail, 
e.g. vHost PMD) to ensure smooth startup.
In a lot of cases DPDK applications are terminated abnormally without proper 
resource release. Therefore, DPDK libs should be able to deal with an unclean 
boot environment.

Signed-off-by: Zhihong Wang 
---
 lib/librte_vhost/vhost_user/vhost-net-user.c | 28 
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 8b7a448..eac0721 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -120,18 +120,38 @@ uds_socket(const char *path)
 	sockfd = socket(AF_UNIX, SOCK_STREAM, 0);
 	if (sockfd < 0)
 		return -1;
-	RTE_LOG(INFO, VHOST_CONFIG, "socket created, fd:%d\n", sockfd);
+	RTE_LOG(INFO, VHOST_CONFIG, "socket created, fd: %d\n", sockfd);

 	memset(&un, 0, sizeof(un));
 	un.sun_family = AF_UNIX;
 	snprintf(un.sun_path, sizeof(un.sun_path), "%s", path);
 	ret = bind(sockfd, (struct sockaddr *)&un, sizeof(un));
 	if (ret == -1) {
-		RTE_LOG(ERR, VHOST_CONFIG, "fail to bind fd:%d, remove file:%s and try again.\n",
+		RTE_LOG(ERR, VHOST_CONFIG,
+			"bind fd: %d to file: %s failed, checking socket...\n",
 			sockfd, path);
-		goto err;
+		ret = connect(sockfd, (struct sockaddr *)&un, sizeof(un));
+		if (ret == -1) {
+			RTE_LOG(INFO, VHOST_CONFIG,
+				"socket: %s is inactive, rebinding after unlink...\n", path);
+			unlink(path);
+			ret = bind(sockfd, (struct sockaddr *)&un, sizeof(un));
+			if (ret == -1) {
+				RTE_LOG(ERR, VHOST_CONFIG,
+					"bind fd: %d to file: %s failed even after unlink\n",
+					sockfd, path);
+				goto err;
+			}
+		} else {
+			RTE_LOG(INFO, VHOST_CONFIG,
+				"socket: %s is alive, remove it and try again\n", path);
+			RTE_LOG(ERR, VHOST_CONFIG,
+				"bind fd: %d to file: %s failed\n", sockfd, path);
+			goto err;
+		}
 	}
-	RTE_LOG(INFO, VHOST_CONFIG, "bind to %s\n", path);
+	RTE_LOG(INFO, VHOST_CONFIG,
+		"bind fd: %d to file: %s successful\n", sockfd, path);

 	ret = listen(sockfd, MAX_VIRTIO_BACKLOG);
 	if (ret == -1)
-- 
2.5.0
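
For readers who want to try the stale-socket check outside of DPDK, here is a minimal standalone sketch of the same idea. It is not the code from the patch: the function names are hypothetical, it probes with a separate socket rather than reusing the one being bound, it only unlinks when the path is really a socket (lstat() plus S_ISSOCK) and the probe fails with ECONNREFUSED, and it uses plain POSIX calls only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill a sockaddr_un; returns -1 if the path does not fit. */
static int ex_fill_addr(struct sockaddr_un *un, const char *path)
{
	size_t len = strlen(path);

	if (len >= sizeof(un->sun_path))
		return -1;
	memset(un, 0, sizeof(*un));
	un->sun_family = AF_UNIX;
	memcpy(un->sun_path, path, len + 1);
	return 0;
}

/* Returns 1 only if the path is a socket file that nobody accepts on. */
static int ex_unix_path_is_stale(const struct sockaddr_un *un)
{
	struct stat st;
	int probe, ret, saved_errno;

	if (lstat(un->sun_path, &st) != 0 || !S_ISSOCK(st.st_mode))
		return 0;	/* missing or not a socket: never touch it */

	probe = socket(AF_UNIX, SOCK_STREAM, 0);
	if (probe < 0)
		return 0;
	ret = connect(probe, (const struct sockaddr *)un, sizeof(*un));
	saved_errno = errno;
	close(probe);
	return ret == -1 && saved_errno == ECONNREFUSED;
}

/*
 * Bind a stream socket to path. If the address is in use by a stale
 * socket file, unlink it and retry once. Returns the bound fd or -1.
 */
int ex_bind_unix_reclaim_stale(const char *path)
{
	struct sockaddr_un un;
	int fd;

	assert(path != NULL);
	if (ex_fill_addr(&un, path) != 0)
		return -1;
	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&un, sizeof(un)) == 0)
		return fd;
	if (errno == EADDRINUSE && ex_unix_path_is_stale(&un) &&
	    unlink(un.sun_path) == 0 &&
	    bind(fd, (struct sockaddr *)&un, sizeof(un)) == 0)
		return fd;
	close(fd);
	return -1;
}
```

This check is still racy: a process that has bound the path but not yet called listen() also refuses connections and would be treated as stale. The objection raised earlier on the list, that a library should not remove a file it did not create, applies to this sketch as well.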