Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.

2021-03-30 Thread Stefan Hajnoczi
On Thu, Mar 25, 2021 at 06:58:56PM +0100, Ilya Maximets wrote:
> On 3/25/21 5:43 PM, Stefan Hajnoczi wrote:
> > On Thu, Mar 25, 2021 at 12:00:11PM +0100, Ilya Maximets wrote:
> >> On 3/25/21 10:35 AM, Stefan Hajnoczi wrote:
> >>> On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> >>>> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> >>>>> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >>>>>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets  
> >>>>>> wrote:
> >>>>>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>> - How to get this fd again after the OVS restart?  CNI will not be 
> >>>> invoked
> >>>>   at this point to pass a new fd.
> >>>>
> >>>> - If application will close the connection for any reason (restart, some
> >>>>   reconfiguration internal to the application) and OVS will be re-started
> >>>>   at the same time, abstract socket will be gone.  Need a persistent 
> >>>> daemon
> >>>>   to hold it.
> >>>
> >>> I remembered that these two points can be solved by sd_notify(3)
> >>> FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
> >>> this is the case (at least in the CNI use case)?
> >>>
> >>> https://www.freedesktop.org/software/systemd/man/sd_notify.html
> >>
> >> IIUC, these file descriptors are only passed on the restart of the service,
> >> so the port-del + port-add scenario is not covered (and this is a very
> >> common use case: users implement some configuration changes this
> >> way, and it is also an internally possible scenario, e.g. this sequence
> >> will be triggered internally to change the OpenFlow port number).
> >> port-del will release all the resources including the listening socket.
> >> Keeping the fd for later use is not an option, because OVS will not know
> >> if this port will be added back or not, and fds are a limited resource.
> > 
> > If users of the CNI plugin are reasonably expected to do this then it
> > sounds like a blocker for the sd_notify(3) approach. Maybe it could be
> > fixed by introducing an atomic port-rename (?) operation, but this is
> > starting to sound too invasive.
> 
> It's hard to implement, actually.  Things like 'port-rename' will
> be internally implemented as del+add in most cases.  Otherwise, it
> will require a significant rework of OVS internals.
> There are things that could be adjusted on the fly, but some
> fundamental parts like OF port number that every other part depends
> on are not easy to change.

I see. In that case the sd_notify(3) approach won't work.

> >> OVS could run as a system pod or as a systemd service.  It differs from
> >> one setup to another.  So it might not be controlled by systemd.
> > 
> > Does the CNI plugin allow both configurations?
> 
> CNI runs as a DaemonSet (pod on each node) by itself, and it doesn't
> matter if OVS is running on the host or in a different pod.

Okay.

> > 
> > It's impossible to come up with one approach that works for everyone in
> > the general case (beyond the CNI plugin, beyond Kubernetes).
> 
> If we're looking for a solution to store abstract sockets somehow
> for OVS then it's hard to come up with something generic.  It will
> have a dependency on a specific init system anyway.
> 
> OTOH, the broker solution will work for all cases. :)  One may think
> of a broker as a service that supplies abstract sockets to processes
> from different namespaces.  These sockets are already connected, for
> convenience.

I'm not sure what we're trying to come up with :). I haven't figured out
how much of what has been discussed is cosmetic and nice-to-have stuff
versus what is a real problem that needs a solution.

From the vhost-user point of view I would prefer to stick to the
existing UNIX domain socket approach. Any additional mechanism adds
extra complexity, won't be supported by all software, requires educating
users and developers, requires building new vhost-user application
container images, etc. IMO it's only worth doing if there is a real
problem with UNIX domain sockets that cannot be solved without
introducing a new connection mechanism.

Stefan


Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.

2021-03-18 Thread Stefan Hajnoczi
On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
Hi,
Some questions to understand the problems that SocketPair Broker solves:

> Even more configuration tricks are required in order to share some
> sockets between different containers and not only with the host, e.g.
> to create service chains.

How does SocketPair Broker solve this? I guess the idea is that
SocketPair Broker must be started before other containers. That way
applications don't need to sleep and reconnect when a socket isn't
available yet.

On the other hand, the SocketPair Broker might be unavailable (OOM
killer, crash, etc), so applications still need to sleep and reconnect
to the broker itself. I'm not sure the problem has actually been solved
unless there is a reason why the broker is always guaranteed to be
available?

> And some housekeeping is usually required for applications in case the
> socket server terminated abnormally and socket files were left on a file
> system:
>  "failed to bind to vhu: Address already in use; remove it and try again"

QEMU avoids this by unlinking before binding. The drawback is that users
might accidentally hijack an existing listen socket, but that can be
solved with a pidfile.

> Additionally, all applications (system and user's!) should follow
> naming conventions and place socket files in a particular location on
> the file system to make things work.

Does SocketPair Broker solve this? Applications now need to use a naming
convention for keys, so it seems like this issue has not been
eliminated.

> This patch-set aims to eliminate most of the inconveniences by
> leveraging an infrastructure service provided by a SocketPair Broker.

I don't understand yet why this is useful for vhost-user, where the
creation of the vhost-user device backend and its use by a VMM are
closely managed by one piece of software:

1. Unlink the socket path.
2. Create, bind, and listen on the socket path.
3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
   RPC, spawn a process, etc) and pass in the listen fd.
4. In the meantime the VMM can open the socket path and call connect(2).
   As soon as the vhost-user device backend calls accept(2) the
   connection will proceed (there is no need for sleeping).

This approach works across containers without a broker.

BTW what is the security model of the broker? Unlike pathname UNIX
domain sockets there is no ownership permission check.

Stefan


Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.

2021-03-19 Thread Stefan Hajnoczi
On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> > On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >> And some housekeeping usually required for applications in case the
> >> socket server terminated abnormally and socket files left on a file
> >> system:
> >>  "failed to bind to vhu: Address already in use; remove it and try again"
> > 
> > QEMU avoids this by unlinking before binding. The drawback is that users
> > might accidentally hijack an existing listen socket, but that can be
> > solved with a pidfile.
> 
> How exactly this could be solved with a pidfile?

A pidfile prevents two instances of the same service from running at the
same time.

The same effect can be achieved by the container orchestrator, systemd,
etc., because they refuse to run the same service twice.

> And what if this is
> a different application that tries to create a socket on the same path?
> e.g. QEMU creates a socket (started in server mode) and a user
> accidentally created a dpdkvhostuser port in Open vSwitch instead of
> dpdkvhostuserclient.  This way the rte_vhost library will try to bind
> to an existing socket file and will fail.  Subsequently, port creation
> in OVS will fail.  We can't allow OVS to unlink files, because that
> way OVS users would have the ability to unlink random sockets that OVS
> has access to, and we also have no idea whether it was QEMU that created
> the file, a virtio-user application, or someone else.

If rte_vhost unlinks the socket then the user will find that networking
doesn't work. They can either hot unplug the QEMU vhost-user-net device
or restart QEMU, depending on whether they need to keep the guest
running or not. This is a misconfiguration that is recoverable.

Regarding letting OVS unlink files, I agree that it shouldn't if this
creates a security issue. I don't know the security model of OVS.

> There are, probably, ways to detect if there is any alive process that
> has this socket open, but that sounds like too much for this purpose,
> also I'm not sure if it's possible if actual user is in a different
> container.
> So I don't see a good reliable way to detect these conditions.  This
> falls on the shoulders of higher-level management software or the user
> to clean these socket files up before adding ports.

Does OVS always run in the same net namespace (pod) as the DPDK
application? If yes, then abstract AF_UNIX sockets can be used. Abstract
AF_UNIX sockets don't have a filesystem path and the socket address
disappears when there is no process listening anymore.

> >> This patch-set aims to eliminate most of the inconveniences by
> >> leveraging an infrastructure service provided by a SocketPair Broker.
> > 
> > I don't understand yet why this is useful for vhost-user, where the
> > creation of the vhost-user device backend and its use by a VMM are
> > closely managed by one piece of software:
> > 
> > 1. Unlink the socket path.
> > 2. Create, bind, and listen on the socket path.
> > 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >RPC, spawn a process, etc) and pass in the listen fd.
> > 4. In the meantime the VMM can open the socket path and call connect(2).
> >As soon as the vhost-user device backend calls accept(2) the
> >connection will proceed (there is no need for sleeping).
> > 
> > This approach works across containers without a broker.
> 
> Not sure if I fully understood a question here, but anyway.
>
> This approach works fine if you know what application to run.
> In the case of a k8s cluster, it might be a random DPDK application
> with virtio-user ports running inside a container that wants to
> have a network connection.  Also, this application needs to run
> virtio-user in server mode, otherwise a restart of OVS will
> require a restart of the application.  So, you basically need to
> rely on a third-party application to create a socket with the right
> name and in a correct location that is shared with the host, so
> OVS can find it and connect.
> 
> In a VM world everything is much simpler, since you have
> libvirt and QEMU that will take care of all of this stuff
> and which are also under the full control of management software
> and the system administrator.
> In case of a container with a "random" DPDK application inside
> there is no such entity that can help.  Of course, some solution
> might be implemented in docker/podman daemon to create and manage
> outside-looking sockets for an application inside the container,
> but that is not available today AFAIK and I'm not sure if it
> ever will.

Wait, whe

Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.

2021-03-19 Thread Stefan Hajnoczi
On Thu, Mar 18, 2021 at 09:14:27PM +0100, Ilya Maximets wrote:
> On 3/18/21 8:47 PM, Ilya Maximets wrote:
> > On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >> BTW what is the security model of the broker? Unlike pathname UNIX
> >> domain sockets there is no ownership permission check.
> > 
> > I thought about this.  Yes, we should allow connection to this socket
> > for a wide group of applications.  That might be a problem.
> > However, 2 applications need to know the 1024 (at most) byte key in
> > order to connect to each other.  This might be considered as a
> > sufficient security model in case these keys are not predictable.
> > Suggestions on how to make this more secure are welcome.
> 
> Digging more into unix sockets, I think that broker might use
> SO_PEERCRED to identify at least a uid and gid of a client.
> This way we can implement policies, e.g. one client might
> request to pair it only with clients from the same group or
> from the same user.
> 
> This is actually a great extension for the SocketPair Broker Protocol.
> 
> Might even use SO_PEERSEC to enforce even stricter policies
> based on selinux context.

Some piece of software or an administrator would need to understand the
pid/uid/gid mappings used by specific containers in order to configure
security policies in the broker like "app1 is allowed to connect to
app2's socket". This is probably harder than it looks (and DBus already
has everything to do this and more).

Stefan


Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.

2021-03-19 Thread Stefan Hajnoczi
Hi Ilya,
By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
is restarted then existing vhost-user connections drop with an error but
QEMU could attempt to reconnect to the UNIX domain socket which the new
OVS instance will set up.

Why is it impossible to reconnect when OVS owns the listen socket?

Stefan


Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.

2021-03-19 Thread Stefan Hajnoczi
On Fri, Mar 19, 2021 at 04:37:01PM +0100, Ilya Maximets wrote:
> On 3/19/21 3:16 PM, Stefan Hajnoczi wrote:
> > On Thu, Mar 18, 2021 at 09:14:27PM +0100, Ilya Maximets wrote:
> >> On 3/18/21 8:47 PM, Ilya Maximets wrote:
> >>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>> BTW what is the security model of the broker? Unlike pathname UNIX
> >>>> domain sockets there is no ownership permission check.
> >>>
> >>> I thought about this.  Yes, we should allow connection to this socket
> >>> for a wide group of applications.  That might be a problem.
> >>> However, 2 applications need to know the 1024 (at most) byte key in
> >>> order to connect to each other.  This might be considered as a
> >>> sufficient security model in case these keys are not predictable.
> >>> Suggestions on how to make this more secure are welcome.
> >>
> >> Digging more into unix sockets, I think that broker might use
> >> SO_PEERCRED to identify at least a uid and gid of a client.
> >> This way we can implement policies, e.g. one client might
> >> request to pair it only with clients from the same group or
> >> from the same user.
> >>
> >> This is actually a great extension for the SocketPair Broker Protocol.
> >>
> >> Might even use SO_PEERSEC to enforce even stricter policies
> >> based on selinux context.
> > 
> > Some piece of software or an administrator would need to understand the
> > pid/uid/gid mappings used by specific containers in order to configure
> > security policies in the broker like "app1 is allowed to connect to
> > app2's socket". This is probably harder than it looks (and DBus already
> > has everything to do this and more).
> 
> AFAIU, none of the orchestration solutions configure different access
> rights for sockets right now.  So it probably should not be a big
> problem for current setups.
>
> I'd expect pid/uid/gid to be mapped to the host namespace if SO_PEERCRED
> is requested from it.  An interesting thing to check, though.
> 
> For DBus, as I mentioned in the other reply, IIUC, it will require
> mounting /run/user/** or its bits and some other stuff to the
> container in order to make it work.  Also it will, probably, require
> running containers in privileged mode which will wipe out most of the
> security.

Flatpak has sandboxed D-Bus for its application containers:
https://docs.flatpak.org/en/latest/sandbox-permissions.html

"Limited access to the session D-Bus instance - an app can only own its
own name on the bus."

I don't know the details of how it works.

Stefan


Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.

2021-03-19 Thread Stefan Hajnoczi
On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> > On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>> And some housekeeping usually required for applications in case the
> >>>> socket server terminated abnormally and socket files left on a file
> >>>> system:
> >>>>  "failed to bind to vhu: Address already in use; remove it and try again"
> >>>
> >>> QEMU avoids this by unlinking before binding. The drawback is that users
> >>> might accidentally hijack an existing listen socket, but that can be
> >>> solved with a pidfile.
> >>
> >> How exactly this could be solved with a pidfile?
> > 
> > A pidfile prevents two instances of the same service from running at the
> > same time.
> > 
> > The same effect can be achieved by the container orchestrator, systemd,
> > etc too because it refuses to run the same service twice.
> 
> Sure. I understand that.  My point was that these could be 2 different
> applications and they might not know which process to look for.
> 
> > 
> >> And what if this is
> >> a different application that tries to create a socket on a same path?
> >> e.g. QEMU creates a socket (started in a server mode) and user
> >> accidentally created dpdkvhostuser port in Open vSwitch instead of
> >> dpdkvhostuserclient.  This way rte_vhost library will try to bind
> >> to an existing socket file and will fail.  Subsequently port creation
> >> in OVS will fail.   We can't allow OVS to unlink files because this
> >> way OVS users will have ability to unlink random sockets that OVS has
> >> access to and we also has no idea if it's a QEMU that created a file
> >> or it was a virtio-user application or someone else.
> > 
> > If rte_vhost unlinks the socket then the user will find that networking
> > doesn't work. They can either hot unplug the QEMU vhost-user-net device
> > or restart QEMU, depending on whether they need to keep the guest
> > running or not. This is a misconfiguration that is recoverable.
> 
> True, it's recoverable, but at a high cost.  Restart of a VM is rarely
> desirable.  And the application inside the guest might not behave well
> after a hot re-plug of a device that it was actively using.  I'd expect
> a DPDK application that runs inside a guest on some virtio-net device
> to crash after this kind of manipulation, especially if it uses an
> older version of DPDK.

This unlink issue is probably something we think differently about.
There are many ways for users to misconfigure things when working with
system tools. If it's possible to catch misconfigurations, that is
preferable. In this case it's just the way pathname AF_UNIX domain
sockets work and IMO it's better not to have problems starting the
service due to stale files than to insist on preventing
misconfigurations. QEMU and DPDK do this differently and both seem to be
successful, so ¯\_(ツ)_/¯.

> > 
> > Regarding letting OVS unlink files, I agree that it shouldn't if this
> > creates a security issue. I don't know the security model of OVS.
> 
> In general, the privileges of the ovs-vswitchd daemon might be completely
> different from the privileges required to invoke control utilities or
> to access the configuration database.  So, yes, we should not allow
> that.

That can be locked down by restricting the socket path to a file beneath
/var/run/ovs/vhost-user/.

> > 
> >> There are, probably, ways to detect if there is any alive process that
> >> has this socket open, but that sounds like too much for this purpose,
> >> also I'm not sure if it's possible if actual user is in a different
> >> container.
> >> So I don't see a good reliable way to detect these conditions.  This
> >> falls on shoulders of a higher level management software or a user to
> >> clean these socket files up before adding ports.
> > 
> > Does OVS always run in the same net namespace (pod) as the DPDK
> > application? If yes, then abstract AF_UNIX sockets can be used. Abstract
> > AF_UNIX sockets don't have a filesystem path and the socket address
> > disappears when there is no process listening anymore.
> 
> OVS is usually started right on the host in a main network namespace.
> In case it's started in a pod, it will run in a separate container but
> configured with a host network.  

Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.

2021-03-24 Thread Stefan Hajnoczi
On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets  wrote:
> 
> > On 3/23/21 6:57 PM, Adrian Moreno wrote:
> > >
> > >
> > > On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> > >> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> > >>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> > >>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> > >>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> > >>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> > >>>>>>> And some housekeeping usually required for applications in case the
> > >>>>>>> socket server terminated abnormally and socket files left on a file
> > >>>>>>> system:
> > >>>>>>>  "failed to bind to vhu: Address already in use; remove it and try
> > again"
> > >>>>>>
> > >>>>>> QEMU avoids this by unlinking before binding. The drawback is that
> > users
> > >>>>>> might accidentally hijack an existing listen socket, but that can be
> > >>>>>> solved with a pidfile.
> > >>>>>
> > >>>>> How exactly this could be solved with a pidfile?
> > >>>>
> > >>>> A pidfile prevents two instances of the same service from running at
> > the
> > >>>> same time.
> > >>>>
> > >>>> The same effect can be achieved by the container orchestrator,
> > systemd,
> > >>>> etc too because it refuses to run the same service twice.
> > >>>
> > >>> Sure. I understand that.  My point was that these could be 2 different
> > >>> applications and they might not know which process to look for.
> > >>>
> > >>>>
> > >>>>> And what if this is
> > >>>>> a different application that tries to create a socket on a same path?
> > >>>>> e.g. QEMU creates a socket (started in a server mode) and user
> > >>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
> > >>>>> dpdkvhostuserclient.  This way rte_vhost library will try to bind
> > >>>>> to an existing socket file and will fail.  Subsequently port creation
> > >>>>> in OVS will fail.   We can't allow OVS to unlink files because this
> > >>>>> way OVS users will have ability to unlink random sockets that OVS has
> > >>>>> access to and we also has no idea if it's a QEMU that created a file
> > >>>>> or it was a virtio-user application or someone else.
> > >>>>
> > >>>> If rte_vhost unlinks the socket then the user will find that
> > networking
> > >>>> doesn't work. They can either hot unplug the QEMU vhost-user-net
> > device
> > >>>> or restart QEMU, depending on whether they need to keep the guest
> > >>>> running or not. This is a misconfiguration that is recoverable.
> > >>>
> > >>> True, it's recoverable, but with a high cost.  Restart of a VM is
> > rarely
> > >>> desirable.  And the application inside the guest might not feel itself
> > >>> well after hot re-plug of a device that it actively used.  I'd expect
> > >>> a DPDK application that runs inside a guest on some virtio-net device
> > >>> to crash after this kind of manipulations.  Especially, if it uses some
> > >>> older versions of DPDK.
> > >>
> > >> This unlink issue is probably something we think differently about.
> > >> There are many ways for users to misconfigure things when working with
> > >> system tools. If it's possible to catch misconfigurations that is
> > >> preferable. In this case it's just the way pathname AF_UNIX domain
> > >> sockets work and IMO it's better not to have problems starting the
> > >> service due to stale files than to insist on preventing
> > >> misconfigurations. QEMU and DPDK do this differently and both seem to be
> > >> successful, so ¯\_(ツ)_/¯.
> > >>
> > >>>>
> > >>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
> > >>>> create a security issue. I don't know the security model o

Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.

2021-03-24 Thread Stefan Hajnoczi
On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> > On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets  wrote:
> >>
> >>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>
> >>>>
> >>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> And some housekeeping usually required for applications in case the
> >>>>>>>>>> socket server terminated abnormally and socket files left on a file
> >>>>>>>>>> system:
> >>>>>>>>>>  "failed to bind to vhu: Address already in use; remove it and try
> >>> again"
> >>>>>>>>>
> >>>>>>>>> QEMU avoids this by unlinking before binding. The drawback is that
> >>> users
> >>>>>>>>> might accidentally hijack an existing listen socket, but that can be
> >>>>>>>>> solved with a pidfile.
> >>>>>>>>
> >>>>>>>> How exactly this could be solved with a pidfile?
> >>>>>>>
> >>>>>>> A pidfile prevents two instances of the same service from running at
> >>> the
> >>>>>>> same time.
> >>>>>>>
> >>>>>>> The same effect can be achieved by the container orchestrator,
> >>> systemd,
> >>>>>>> etc too because it refuses to run the same service twice.
> >>>>>>
> >>>>>> Sure. I understand that.  My point was that these could be 2 different
> >>>>>> applications and they might not know which process to look for.
> >>>>>>
> >>>>>>>
> >>>>>>>> And what if this is
> >>>>>>>> a different application that tries to create a socket on a same path?
> >>>>>>>> e.g. QEMU creates a socket (started in a server mode) and user
> >>>>>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
> >>>>>>>> dpdkvhostuserclient.  This way rte_vhost library will try to bind
> >>>>>>>> to an existing socket file and will fail.  Subsequently port creation
> >>>>>>>> in OVS will fail.   We can't allow OVS to unlink files because this
> >>>>>>>> way OVS users will have ability to unlink random sockets that OVS has
> >>>>>>>> access to and we also has no idea if it's a QEMU that created a file
> >>>>>>>> or it was a virtio-user application or someone else.
> >>>>>>>
> >>>>>>> If rte_vhost unlinks the socket then the user will find that
> >>> networking
> >>>>>>> doesn't work. They can either hot unplug the QEMU vhost-user-net
> >>> device
> >>>>>>> or restart QEMU, depending on whether they need to keep the guest
> >>>>>>> running or not. This is a misconfiguration that is recoverable.
> >>>>>>
> >>>>>> True, it's recoverable, but with a high cost.  Restart of a VM is
> >>> rarely
> >>>>>> desirable.  And the application inside the guest might not feel itself
> >>>>>> well after hot re-plug of a device that it actively used.  I'd expect
> >>>>>> a DPDK application that runs inside a guest on some virtio-net device
> >>>>>> to crash after this kind of manipulations.  Especially, if it uses some
> >>>>>> older versions of DPDK.
> >>>>>
> >>>>> This unlink issue is probably something we think differently about.
> >>>>> There are many ways for users to misconfigure things when working with
> >>>>> system tools. If it's possible to catch misconfigurations that is
> >>>>> preferrable. In this c

Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.

2021-03-25 Thread Stefan Hajnoczi
On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> > On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets  wrote:
> >>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> - How to get this fd again after the OVS restart?  CNI will not be invoked
>   at this point to pass a new fd.
> 
> - If the application closes the connection for any reason (restart, some
>   reconfiguration internal to the application) and OVS is re-started
>   at the same time, the abstract socket will be gone.  A persistent daemon
>   is needed to hold it.

I remembered that these two points can be solved by sd_notify(3)
FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
this is the case (at least in the CNI use case)?

https://www.freedesktop.org/software/systemd/man/sd_notify.html

Stefan


Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.

2021-03-25 Thread Stefan Hajnoczi
On Thu, Mar 25, 2021 at 12:00:11PM +0100, Ilya Maximets wrote:
> On 3/25/21 10:35 AM, Stefan Hajnoczi wrote:
> > On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> >> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> >>> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >>>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets  wrote:
> >>>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >> - How to get this fd again after the OVS restart?  CNI will not be invoked
> >>   at this point to pass a new fd.
> >>
> >> - If application will close the connection for any reason (restart, some
> >>   reconfiguration internal to the application) and OVS will be re-started
> >>   at the same time, abstract socket will be gone.  Need a persistent daemon
> >>   to hold it.
> > 
> > I remembered that these two points can be solved by sd_notify(3)
> > FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
> > this is the case (at least in the CNI use case)?
> > 
> > https://www.freedesktop.org/software/systemd/man/sd_notify.html
> 
> IIUC, these file descriptors are only passed on the restart of the service,
> so the port-del + port-add scenario is not covered (and this is a very
> common use case: users implement some configuration changes this
> way, and it is also an internally possible scenario, e.g. this sequence
> will be triggered internally to change the OpenFlow port number).
> port-del will release all the resources including the listening socket.
> Keeping the fd for later use is not an option, because OVS will not know
> if this port will be added back or not, and fds are a limited resource.

If users of the CNI plugin are reasonably expected to do this, then it
sounds like a blocker for the sd_notify(3) approach. Maybe it could be
fixed by introducing an atomic port-rename (?) operation, but this is
starting to sound too invasive.

> It's also unclear how to map these file descriptors to particular ports
> they belong to after restart.

The first fd would be a memfd containing a description of the remaining
fds plus any other crash recovery state that OVS wants.

> OVS could run as a system pod or as a systemd service.  This differs from
> one setup to another, so it might not be controlled by systemd.

Does the CNI plugin allow both configurations?

It's impossible to come up with one approach that works for everyone in
the general case (beyond the CNI plugin, beyond Kubernetes). I think we
need to enumerate use cases and decide which ones are currently not
addressed satisfactorily.

> Also, it behaves as an old-style daemon, so it closes all the file
> descriptors, forks and so on.  This might be adjusted, though, with
> some rework of the daemonization procedure.

Doesn't sound like fun but may be doable.

> On a side note, it may be interesting to allow the user application to
> create a socket and pass a pollable file descriptor directly to
> rte_vhost_driver_register() instead of a socket path.  This way
> the application may choose to use an abstract socket, a file
> socket or any other future type of socket connection.  It would
> also allow the application to store these sockets somewhere, or
> receive them from systemd/init/other management software.

Yes, sounds useful.

Stefan


[dpdk-dev] [RFC 00/24] vhost: add virtio-vhost-user transport

2018-01-19 Thread Stefan Hajnoczi
 tested again:

  b9d17bfaf WORKAROUND revert virtio-net mq vring deletion
  cadb25e7d WORKAROUND examples/vhost_scsi: avoid broken EVENT_IDX

Stefan Hajnoczi (24):
  vhost: move vring_call() into trans_af_unix.c
  vhost: move AF_UNIX code from socket.c to trans_af_unix.c
  vhost: allocate per-socket transport state
  vhost: move socket_fd and un sockaddr into trans_af_unix.c
  vhost: move start_server/client() calls to trans_af_unix.c
  vhost: move vhost_user_connection to trans_af_unix.c
  vhost: move vhost_user_reconnect_init() into trans_af_unix.c
  vhost: move vhost_user.fdset to trans_af_unix.c
  vhost: pass vhost_transport_ops through vhost_new_device()
  vhost: embed struct virtio_net inside struct vhost_user_connection
  vhost: extract vhost_user.c socket I/O into transport
  vhost: move slave_req_fd field to AF_UNIX transport
  vhost: move mmap/munmap to AF_UNIX transport
  vhost: move librte_vhost to drivers/
  vhost: add virtio pci framework
  vhost: remember a vhost_virtqueue's queue index
  vhost: add virtio-vhost-user transport
  vhost: add RTE_VHOST_USER_VIRTIO_TRANSPORT flag
  net/vhost: add virtio-vhost-user support
  examples/vhost_scsi: add --socket-file argument
  examples/vhost_scsi: add virtio-vhost-user support
  usertools: add virtio-vhost-user devices to dpdk-devbind.py
  WORKAROUND revert virtio-net mq vring deletion
  WORKAROUND examples/vhost_scsi: avoid broken EVENT_IDX

 drivers/Makefile   |2 +
 {lib => drivers}/librte_vhost/Makefile |5 +-
 lib/Makefile   |3 -
 {lib => drivers}/librte_vhost/fd_man.h |0
 {lib => drivers}/librte_vhost/iotlb.h  |0
 {lib => drivers}/librte_vhost/rte_vhost.h  |1 +
 {lib => drivers}/librte_vhost/vhost.h  |  193 +++-
 {lib => drivers}/librte_vhost/vhost_user.h |9 +-
 drivers/librte_vhost/virtio_pci.h  |  267 +
 drivers/librte_vhost/virtio_vhost_user.h   |   18 +
 drivers/librte_vhost/virtqueue.h   |  181 
 {lib => drivers}/librte_vhost/fd_man.c |0
 {lib => drivers}/librte_vhost/iotlb.c  |0
 drivers/librte_vhost/socket.c  |  282 ++
 drivers/librte_vhost/trans_af_unix.c   |  795 +++
 drivers/librte_vhost/trans_virtio_vhost_user.c | 1050 
 {lib => drivers}/librte_vhost/vhost.c  |   18 +-
 {lib => drivers}/librte_vhost/vhost_user.c |  200 +---
 {lib => drivers}/librte_vhost/virtio_net.c |0
 drivers/librte_vhost/virtio_pci.c  |  504 ++
 drivers/net/vhost/rte_eth_vhost.c  |   13 +
 examples/vhost_scsi/vhost_scsi.c   |  104 +-
 lib/librte_vhost/socket.c  |  828 ---
 .../librte_vhost/rte_vhost_version.map |0
 usertools/dpdk-devbind.py  |8 +
 25 files changed, 3455 insertions(+), 1026 deletions(-)
 rename {lib => drivers}/librte_vhost/Makefile (86%)
 rename {lib => drivers}/librte_vhost/fd_man.h (100%)
 rename {lib => drivers}/librte_vhost/iotlb.h (100%)
 rename {lib => drivers}/librte_vhost/rte_vhost.h (99%)
 rename {lib => drivers}/librte_vhost/vhost.h (68%)
 rename {lib => drivers}/librte_vhost/vhost_user.h (93%)
 create mode 100644 drivers/librte_vhost/virtio_pci.h
 create mode 100644 drivers/librte_vhost/virtio_vhost_user.h
 create mode 100644 drivers/librte_vhost/virtqueue.h
 rename {lib => drivers}/librte_vhost/fd_man.c (100%)
 rename {lib => drivers}/librte_vhost/iotlb.c (100%)
 create mode 100644 drivers/librte_vhost/socket.c
 create mode 100644 drivers/librte_vhost/trans_af_unix.c
 create mode 100644 drivers/librte_vhost/trans_virtio_vhost_user.c
 rename {lib => drivers}/librte_vhost/vhost.c (97%)
 rename {lib => drivers}/librte_vhost/vhost_user.c (89%)
 rename {lib => drivers}/librte_vhost/virtio_net.c (100%)
 create mode 100644 drivers/librte_vhost/virtio_pci.c
 delete mode 100644 lib/librte_vhost/socket.c
 rename {lib => drivers}/librte_vhost/rte_vhost_version.map (100%)

-- 
2.14.3



[dpdk-dev] [RFC 01/24] vhost: move vring_call() into trans_af_unix.c

2018-01-19 Thread Stefan Hajnoczi
File descriptor passing is specific to the AF_UNIX vhost-user protocol
transport.  In order to add support for additional transports it is
necessary to extract transport-specific code from the main vhost-user
code.

This patch introduces struct vhost_transport_ops and associates each
device with a transport.  Core vhost-user code calls into
vhost_transport_ops to perform transport-specific operations.

Notifying callfd is a transport-specific operation so it belongs in
trans_af_unix.c.

Several more patches follow this one to complete the task of moving
AF_UNIX transport code out of core vhost-user code.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/Makefile|  2 +-
 lib/librte_vhost/vhost.h | 35 +++-
 lib/librte_vhost/trans_af_unix.c | 49 
 lib/librte_vhost/vhost.c |  1 +
 4 files changed, 80 insertions(+), 7 deletions(-)
 create mode 100644 lib/librte_vhost/trans_af_unix.c

diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
index 065d5c469..ccbbce3af 100644
--- a/lib/librte_vhost/Makefile
+++ b/lib/librte_vhost/Makefile
@@ -21,7 +21,7 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -lrte_ethdev -lrte_net
 
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
-   vhost_user.c virtio_net.c
+   vhost_user.c virtio_net.c trans_af_unix.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index b2bf0e833..53811a8b1 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -200,6 +200,30 @@ struct guest_page {
uint64_t size;
 };
 
+struct virtio_net;
+
+/**
+ * A structure containing function pointers for transport-specific operations.
+ */
+struct vhost_transport_ops {
+   /**
+* Notify the guest that used descriptors have been added to the vring.
+* The VRING_AVAIL_F_NO_INTERRUPT flag has already been checked so this
+* function just needs to perform the notification.
+*
+* @param dev
+*  vhost device
+* @param vq
+*  vhost virtqueue
+* @return
+*  0 on success, -1 on failure
+*/
+   int (*vring_call)(struct virtio_net *dev, struct vhost_virtqueue *vq);
+};
+
+/** The traditional AF_UNIX vhost-user protocol transport. */
+extern const struct vhost_transport_ops af_unix_trans_ops;
+
 /**
  * Device structure contains all configuration information relating
  * to the device.
@@ -226,6 +250,7 @@ struct virtio_net {
uint16_tmtu;
 
struct vhost_device_ops const *notify_ops;
+   struct vhost_transport_ops const *trans_ops;
 
uint32_tnr_guest_pages;
uint32_tmax_guest_pages;
@@ -405,16 +430,14 @@ vhost_vring_call(struct virtio_net *dev, struct vhost_virtqueue *vq)
__func__,
vhost_used_event(vq),
old, new);
-   if (vhost_need_event(vhost_used_event(vq), new, old)
-   && (vq->callfd >= 0)) {
+   if (vhost_need_event(vhost_used_event(vq), new, old)) {
vq->signalled_used = vq->last_used_idx;
-   eventfd_write(vq->callfd, (eventfd_t) 1);
+   dev->trans_ops->vring_call(dev, vq);
}
} else {
/* Kick the guest if necessary. */
-   if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
-   && (vq->callfd >= 0))
-   eventfd_write(vq->callfd, (eventfd_t)1);
+   if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
+   dev->trans_ops->vring_call(dev, vq);
}
 }
 
diff --git a/lib/librte_vhost/trans_af_unix.c b/lib/librte_vhost/trans_af_unix.c
new file mode 100644
index 0..9ed04b7eb
--- /dev/null
+++ b/lib/librte_vhost/trans_af_unix.c
@@ -0,0 +1,49 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Copyright (C) 2017 Red Hat, Inc.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the
+ *   distribution.
+ * * Neither the name of Intel Corporation nor t

[dpdk-dev] [RFC 02/24] vhost: move AF_UNIX code from socket.c to trans_af_unix.c

2018-01-19 Thread Stefan Hajnoczi
The socket.c file serves two purposes:
1. librte_vhost public API entry points, e.g. rte_vhost_driver_register().
2. AF_UNIX socket management.

Move AF_UNIX socket code into trans_af_unix.c so that socket.c only
handles the librte_vhost public API entry points.  This will make it
possible to support other transports besides AF_UNIX.

This patch is a preparatory step that simply moves code from socket.c to
trans_af_unix.c unmodified, besides dropping 'static' qualifiers where
necessary because socket.c now calls into trans_af_unix.c.

A lot of socket.c state is exposed in vhost.h but this is a temporary
measure and will be cleaned up in later patches.  By simply moving code
unmodified in this patch it will be easier to review the actual
refactoring that follows.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h |  65 +
 lib/librte_vhost/socket.c| 501 +--
 lib/librte_vhost/trans_af_unix.c | 451 +++
 3 files changed, 517 insertions(+), 500 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 53811a8b1..8c6d6e524 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -5,6 +5,7 @@
 #ifndef _VHOST_NET_CDEV_H_
 #define _VHOST_NET_CDEV_H_
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -12,12 +13,15 @@
 #include 
 #include 
 #include 
+#include  /* TODO remove when trans_af_unix.c refactoring is done */
 #include 
+#include 
 
 #include 
 #include 
 #include 
 
+#include "fd_man.h"
 #include "rte_vhost.h"
 
 /* Used to indicate that the device is running on a data core */
@@ -259,6 +263,67 @@ struct virtio_net {
int slave_req_fd;
 } __rte_cache_aligned;
 
+/* The vhost_user, vhost_user_socket, vhost_user_connection, and reconnect
+ * declarations are temporary measures for moving AF_UNIX code into
+ * trans_af_unix.c.  They will be cleaned up as socket.c is untangled from
+ * trans_af_unix.c.
+ */
+TAILQ_HEAD(vhost_user_connection_list, vhost_user_connection);
+
+/*
+ * Every time rte_vhost_driver_register() is invoked, an associated
+ * vhost_user_socket struct will be created.
+ */
+struct vhost_user_socket {
+   struct vhost_user_connection_list conn_list;
+   pthread_mutex_t conn_mutex;
+   char *path;
+   int socket_fd;
+   struct sockaddr_un un;
+   bool is_server;
+   bool reconnect;
+   bool dequeue_zero_copy;
+   bool iommu_support;
+
+   /*
+* The "supported_features" indicates the feature bits the
+* vhost driver supports. The "features" indicates the feature
+* bits after the rte_vhost_driver_features_disable/enable().
+* It is also the final feature bits used for vhost-user
+* features negotiation.
+*/
+   uint64_t supported_features;
+   uint64_t features;
+
+   struct vhost_device_ops const *notify_ops;
+};
+
+struct vhost_user_connection {
+   struct vhost_user_socket *vsocket;
+   int connfd;
+   int vid;
+
+   TAILQ_ENTRY(vhost_user_connection) next;
+};
+
+#define MAX_VHOST_SOCKET 1024
+struct vhost_user {
+   struct vhost_user_socket *vsockets[MAX_VHOST_SOCKET];
+   struct fdset fdset;
+   int vsocket_cnt;
+   pthread_mutex_t mutex;
+};
+
+extern struct vhost_user vhost_user;
+
+int create_unix_socket(struct vhost_user_socket *vsocket);
+int vhost_user_start_server(struct vhost_user_socket *vsocket);
+int vhost_user_start_client(struct vhost_user_socket *vsocket);
+
+extern pthread_t reconn_tid;
+
+int vhost_user_reconnect_init(void);
+bool vhost_user_remove_reconnect(struct vhost_user_socket *vsocket);
 
 #define VHOST_LOG_PAGE 4096
 
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 6e3857e7a..d681f9cae 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -4,17 +4,14 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
-#include 
 #include 
 
 #include 
@@ -23,61 +20,7 @@
 #include "vhost.h"
 #include "vhost_user.h"
 
-
-TAILQ_HEAD(vhost_user_connection_list, vhost_user_connection);
-
-/*
- * Every time rte_vhost_driver_register() is invoked, an associated
- * vhost_user_socket struct will be created.
- */
-struct vhost_user_socket {
-   struct vhost_user_connection_list conn_list;
-   pthread_mutex_t conn_mutex;
-   char *path;
-   int socket_fd;
-   struct sockaddr_un un;
-   bool is_server;
-   bool reconnect;
-   bool dequeue_zero_copy;
-   bool iommu_support;
-
-   /*
-* The "supported_features" indicates the feature bits the
-* vhost driver supports. The "features" indicates the feature
-* bits after the rte_vhost_driver_features_disable/enable().
-* It is also the final feat

[dpdk-dev] [RFC 03/24] vhost: allocate per-socket transport state

2018-01-19 Thread Stefan Hajnoczi
vhost-user transports have per-socket state (like file descriptors).
Make it possible for transports to keep state beyond what is included in
struct vhost_user_socket.

This patch makes it possible to move AF_UNIX-specific fields from struct
vhost_user_socket into trans_af_unix.c in later patches.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h | 9 +
 lib/librte_vhost/socket.c| 6 --
 lib/librte_vhost/trans_af_unix.c | 5 +
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 8c6d6e524..e5279a572 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -210,6 +210,9 @@ struct virtio_net;
  * A structure containing function pointers for transport-specific operations.
  */
 struct vhost_transport_ops {
+   /** Size of struct vhost_user_socket-derived per-socket state */
+   size_t socket_size;
+
/**
 * Notify the guest that used descriptors have been added to the vring.
 * The VRING_AVAIL_F_NO_INTERRUPT flag has already been checked so this
@@ -273,6 +276,11 @@ TAILQ_HEAD(vhost_user_connection_list, vhost_user_connection);
 /*
  * Every time rte_vhost_driver_register() is invoked, an associated
  * vhost_user_socket struct will be created.
+ *
+ * Transport-specific per-socket state can be kept by embedding this struct at
+ * the beginning of a transport-specific struct.  Set
+ * vhost_transport_ops->socket_size to the size of the transport-specific
+ * struct.
  */
 struct vhost_user_socket {
struct vhost_user_connection_list conn_list;
@@ -296,6 +304,7 @@ struct vhost_user_socket {
uint64_t features;
 
struct vhost_device_ops const *notify_ops;
+   struct vhost_transport_ops const *trans_ops;
 };
 
 struct vhost_user_connection {
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index d681f9cae..fc663 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -128,6 +128,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 {
int ret = -1;
struct vhost_user_socket *vsocket;
+   const struct vhost_transport_ops *trans_ops = &af_unix_trans_ops;
 
if (!path)
return -1;
@@ -140,10 +141,11 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
goto out;
}
 
-   vsocket = malloc(sizeof(struct vhost_user_socket));
+   vsocket = malloc(trans_ops->socket_size);
if (!vsocket)
goto out;
-   memset(vsocket, 0, sizeof(struct vhost_user_socket));
+   memset(vsocket, 0, trans_ops->socket_size);
+   vsocket->trans_ops = trans_ops;
vsocket->path = strdup(path);
if (vsocket->path == NULL) {
RTE_LOG(ERR, VHOST_CONFIG,
diff --git a/lib/librte_vhost/trans_af_unix.c b/lib/librte_vhost/trans_af_unix.c
index 636f69916..5e3c5ab2a 100644
--- a/lib/librte_vhost/trans_af_unix.c
+++ b/lib/librte_vhost/trans_af_unix.c
@@ -42,6 +42,10 @@
 
 #define MAX_VIRTIO_BACKLOG 128
 
+struct af_unix_socket {
+   struct vhost_user_socket socket; /* must be the first field! */
+};
+
 static void vhost_user_read_cb(int connfd, void *dat, int *remove);
 
 /* return bytes# of read on success or negative val on failure. */
@@ -496,5 +500,6 @@ af_unix_vring_call(struct virtio_net *dev __rte_unused,
 }
 
 const struct vhost_transport_ops af_unix_trans_ops = {
+   .socket_size = sizeof(struct af_unix_socket),
.vring_call = af_unix_vring_call,
 };
-- 
2.14.3



[dpdk-dev] [RFC 05/24] vhost: move start_server/client() calls to trans_af_unix.c

2018-01-19 Thread Stefan Hajnoczi
Introduce a vhost_transport_ops->socket_start() interface so the
transport can begin establishing vhost-user connections.  This is part
of the AF_UNIX transport refactoring and removes AF_UNIX code from
vhost.h and socket.c.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h | 16 +---
 lib/librte_vhost/socket.c|  5 +
 lib/librte_vhost/trans_af_unix.c | 16 ++--
 3 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 3aefe6597..7cbef04ab 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -236,6 +236,19 @@ struct vhost_transport_ops {
 */
void (*socket_cleanup)(struct vhost_user_socket *vsocket);
 
+   /**
+* Start establishing vhost-user connections.  This function is
+* asynchronous and connections may be established after it has
+* returned.  Call vhost_user_add_connection() to register new
+* connections.
+*
+* @param vsocket
+*  vhost socket
+* @return
+*  0 on success, -1 on failure
+*/
+   int (*socket_start)(struct vhost_user_socket *vsocket);
+
/**
 * Notify the guest that used descriptors have been added to the vring.
 * The VRING_AVAIL_F_NO_INTERRUPT flag has already been checked so this
@@ -346,9 +359,6 @@ struct vhost_user {
 
 extern struct vhost_user vhost_user;
 
-int vhost_user_start_server(struct vhost_user_socket *vsocket);
-int vhost_user_start_client(struct vhost_user_socket *vsocket);
-
 extern pthread_t reconn_tid;
 
 int vhost_user_reconnect_init(void);
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 78f847ccc..f8a96ab5f 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -318,8 +318,5 @@ rte_vhost_driver_start(const char *path)
"failed to create fdset handling thread");
}
 
-   if (vsocket->is_server)
-   return vhost_user_start_server(vsocket);
-   else
-   return vhost_user_start_client(vsocket);
+   return vsocket->trans_ops->socket_start(vsocket);
 }
diff --git a/lib/librte_vhost/trans_af_unix.c b/lib/librte_vhost/trans_af_unix.c
index cc8d7ccdc..6c22093a4 100644
--- a/lib/librte_vhost/trans_af_unix.c
+++ b/lib/librte_vhost/trans_af_unix.c
@@ -51,6 +51,8 @@ struct af_unix_socket {
 };
 
 static int create_unix_socket(struct vhost_user_socket *vsocket);
+static int vhost_user_start_server(struct vhost_user_socket *vsocket);
+static int vhost_user_start_client(struct vhost_user_socket *vsocket);
 static void vhost_user_read_cb(int connfd, void *dat, int *remove);
 
 /* return bytes# of read on success or negative val on failure. */
@@ -276,7 +278,7 @@ create_unix_socket(struct vhost_user_socket *vsocket)
return 0;
 }
 
-int
+static int
 vhost_user_start_server(struct vhost_user_socket *vsocket)
 {
struct af_unix_socket *s =
@@ -433,7 +435,7 @@ vhost_user_reconnect_init(void)
return ret;
 }
 
-int
+static int
 vhost_user_start_client(struct vhost_user_socket *vsocket)
 {
struct af_unix_socket *s =
@@ -523,6 +525,15 @@ af_unix_socket_cleanup(struct vhost_user_socket *vsocket)
}
 }
 
+static int
+af_unix_socket_start(struct vhost_user_socket *vsocket)
+{
+   if (vsocket->is_server)
+   return vhost_user_start_server(vsocket);
+   else
+   return vhost_user_start_client(vsocket);
+}
+
 static int
 af_unix_vring_call(struct virtio_net *dev __rte_unused,
   struct vhost_virtqueue *vq)
@@ -536,5 +547,6 @@ const struct vhost_transport_ops af_unix_trans_ops = {
.socket_size = sizeof(struct af_unix_socket),
.socket_init = af_unix_socket_init,
.socket_cleanup = af_unix_socket_cleanup,
+   .socket_start = af_unix_socket_start,
.vring_call = af_unix_vring_call,
 };
-- 
2.14.3



[dpdk-dev] [RFC 04/24] vhost: move socket_fd and un sockaddr into trans_af_unix.c

2018-01-19 Thread Stefan Hajnoczi
The socket file descriptor and AF_UNIX sockaddr are specific to the
AF_UNIX transport, so move them into trans_af_unix.c.

In order to do this we need to begin defining the vhost_transport_ops
interface that will allow librte_vhost to support multiple transports.
This patch adds socket_init() and socket_cleanup() to
vhost_transport_ops.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h | 31 +-
 lib/librte_vhost/socket.c| 10 ++--
 lib/librte_vhost/trans_af_unix.c | 55 
 3 files changed, 72 insertions(+), 24 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index e5279a572..3aefe6597 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -12,8 +12,6 @@
 #include 
 #include 
 #include 
-#include 
-#include  /* TODO remove when trans_af_unix.c refactoring is done */
 #include 
 #include 
 
@@ -205,6 +203,7 @@ struct guest_page {
 };
 
 struct virtio_net;
+struct vhost_user_socket;
 
 /**
  * A structure containing function pointers for transport-specific operations.
@@ -213,6 +212,30 @@ struct vhost_transport_ops {
/** Size of struct vhost_user_socket-derived per-socket state */
size_t socket_size;
 
+   /**
+* Initialize a vhost-user socket that is being created by
+* rte_vhost_driver_register().  This function checks that the flags
+* are valid but does not establish a vhost-user connection.
+*
+* @param vsocket
+*  new socket
+* @param flags
+*  flags argument from rte_vhost_driver_register()
+* @return
+*  0 on success, -1 on failure
+*/
+   int (*socket_init)(struct vhost_user_socket *vsocket, uint64_t flags);
+
+   /**
+* Free resources associated with a socket, including any established
+* connections.  This function calls vhost_destroy_device() to destroy
+* established connections for this socket.
+*
+* @param vsocket
+*  vhost socket
+*/
+   void (*socket_cleanup)(struct vhost_user_socket *vsocket);
+
/**
 * Notify the guest that used descriptors have been added to the vring.
 * The VRING_AVAIL_F_NO_INTERRUPT flag has already been checked so this
@@ -286,8 +309,6 @@ struct vhost_user_socket {
struct vhost_user_connection_list conn_list;
pthread_mutex_t conn_mutex;
char *path;
-   int socket_fd;
-   struct sockaddr_un un;
bool is_server;
bool reconnect;
bool dequeue_zero_copy;
@@ -325,14 +346,12 @@ struct vhost_user {
 
 extern struct vhost_user vhost_user;
 
-int create_unix_socket(struct vhost_user_socket *vsocket);
 int vhost_user_start_server(struct vhost_user_socket *vsocket);
 int vhost_user_start_client(struct vhost_user_socket *vsocket);
 
 extern pthread_t reconn_tid;
 
 int vhost_user_reconnect_init(void);
-bool vhost_user_remove_reconnect(struct vhost_user_socket *vsocket);
 
 #define VHOST_LOG_PAGE 4096
 
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index fc663..78f847ccc 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -191,7 +191,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
} else {
vsocket->is_server = true;
}
-   ret = create_unix_socket(vsocket);
+   ret = trans_ops->socket_init(vsocket, flags);
if (ret < 0) {
goto out_mutex;
}
@@ -231,13 +231,7 @@ rte_vhost_driver_unregister(const char *path)
struct vhost_user_socket *vsocket = vhost_user.vsockets[i];
 
if (!strcmp(vsocket->path, path)) {
-   if (vsocket->is_server) {
-   fdset_del(&vhost_user.fdset, vsocket->socket_fd);
-   close(vsocket->socket_fd);
-   unlink(path);
-   } else if (vsocket->reconnect) {
-   vhost_user_remove_reconnect(vsocket);
-   }
+   vsocket->trans_ops->socket_cleanup(vsocket);
 
pthread_mutex_lock(&vsocket->conn_mutex);
for (conn = TAILQ_FIRST(&vsocket->conn_list);
diff --git a/lib/librte_vhost/trans_af_unix.c b/lib/librte_vhost/trans_af_unix.c
index 5e3c5ab2a..cc8d7ccdc 100644
--- a/lib/librte_vhost/trans_af_unix.c
+++ b/lib/librte_vhost/trans_af_unix.c
@@ -33,6 +33,8 @@
  *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
+#include 
+#include 
 #include 
 
 #include 
@@ -44,8 +46,11 @@
 
 struct af_unix_socket {
struct vhost_user_socket socket; /* must be the first field! */
+   int socket_fd;
+   struct sockaddr_un un;
 };
 
+static int create_unix_socket(struct vhost_user_socket *vsocket);
 static void vhost

[dpdk-dev] [RFC 07/24] vhost: move vhost_user_reconnect_init() into trans_af_unix.c

2018-01-19 Thread Stefan Hajnoczi
The socket reconnection code is highly specific to AF_UNIX, so move the
remaining pieces of it into trans_af_unix.c.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h | 10 +++---
 lib/librte_vhost/socket.c|  4 
 lib/librte_vhost/trans_af_unix.c |  9 +++--
 3 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 734a8721d..8f18c1ed0 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -302,9 +302,9 @@ struct virtio_net {
int slave_req_fd;
 } __rte_cache_aligned;
 
-/* The vhost_user, vhost_user_socket, and reconnect declarations are temporary
- * measures for moving AF_UNIX code into trans_af_unix.c.  They will be cleaned
- * up as socket.c is untangled from trans_af_unix.c.
+/* The vhost_user and vhost_user_socket declarations are temporary measures for
+ * moving AF_UNIX code into trans_af_unix.c.  They will be cleaned up as
+ * socket.c is untangled from trans_af_unix.c.
  */
 /*
  * Every time rte_vhost_driver_register() is invoked, an associated
@@ -346,10 +346,6 @@ struct vhost_user {
 
 extern struct vhost_user vhost_user;
 
-extern pthread_t reconn_tid;
-
-int vhost_user_reconnect_init(void);
-
 #define VHOST_LOG_PAGE 4096
 
 /*
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 4fd86fd5b..f9069fcb1 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -177,10 +177,6 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 
if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
vsocket->reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
-   if (vsocket->reconnect && reconn_tid == 0) {
-   if (vhost_user_reconnect_init() != 0)
-   goto out_free;
-   }
} else {
vsocket->is_server = true;
}
diff --git a/lib/librte_vhost/trans_af_unix.c b/lib/librte_vhost/trans_af_unix.c
index 747fd9690..ad07a8e24 100644
--- a/lib/librte_vhost/trans_af_unix.c
+++ b/lib/librte_vhost/trans_af_unix.c
@@ -347,7 +347,7 @@ struct vhost_user_reconnect_list {
 };
 
 static struct vhost_user_reconnect_list reconn_list;
-pthread_t reconn_tid;
+static pthread_t reconn_tid;
 
 static int
 vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)
@@ -417,7 +417,7 @@ vhost_user_client_reconnect(void *arg __rte_unused)
return NULL;
 }
 
-int
+static int
 vhost_user_reconnect_init(void)
 {
int ret;
@@ -527,6 +527,11 @@ af_unix_socket_init(struct vhost_user_socket *vsocket,
container_of(vsocket, struct af_unix_socket, socket);
int ret;
 
+   if (vsocket->reconnect && reconn_tid == 0) {
+   if (vhost_user_reconnect_init() != 0)
+   return -1;
+   }
+
TAILQ_INIT(&s->conn_list);
ret = pthread_mutex_init(&s->conn_mutex, NULL);
if (ret) {
-- 
2.14.3



[dpdk-dev] [RFC 06/24] vhost: move vhost_user_connection to trans_af_unix.c

2018-01-19 Thread Stefan Hajnoczi
The AF_UNIX transport can accept multiple client connections on a server
socket.  Each connection instantiates a separate vhost-user device,
which is stored as a vhost_user_connection.  This behavior is specific
to AF_UNIX and other transports may not support N connections per
socket endpoint.

Move struct vhost_user_connection to trans_af_unix.c and
conn_list/conn_mutex into struct af_unix_socket.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h | 19 ++---
 lib/librte_vhost/socket.c| 36 ++--
 lib/librte_vhost/trans_af_unix.c | 60 
 3 files changed, 59 insertions(+), 56 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 7cbef04ab..734a8721d 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -302,13 +302,10 @@ struct virtio_net {
int slave_req_fd;
 } __rte_cache_aligned;
 
-/* The vhost_user, vhost_user_socket, vhost_user_connection, and reconnect
- * declarations are temporary measures for moving AF_UNIX code into
- * trans_af_unix.c.  They will be cleaned up as socket.c is untangled from
- * trans_af_unix.c.
+/* The vhost_user, vhost_user_socket, and reconnect declarations are temporary
+ * measures for moving AF_UNIX code into trans_af_unix.c.  They will be cleaned
+ * up as socket.c is untangled from trans_af_unix.c.
  */
-TAILQ_HEAD(vhost_user_connection_list, vhost_user_connection);
-
 /*
  * Every time rte_vhost_driver_register() is invoked, an associated
  * vhost_user_socket struct will be created.
@@ -319,8 +316,6 @@ TAILQ_HEAD(vhost_user_connection_list, vhost_user_connection);
  * struct.
  */
 struct vhost_user_socket {
-   struct vhost_user_connection_list conn_list;
-   pthread_mutex_t conn_mutex;
char *path;
bool is_server;
bool reconnect;
@@ -341,14 +336,6 @@ struct vhost_user_socket {
struct vhost_transport_ops const *trans_ops;
 };
 
-struct vhost_user_connection {
-   struct vhost_user_socket *vsocket;
-   int connfd;
-   int vid;
-
-   TAILQ_ENTRY(vhost_user_connection) next;
-};
-
 #define MAX_VHOST_SOCKET 1024
 struct vhost_user {
struct vhost_user_socket *vsockets[MAX_VHOST_SOCKET];
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index f8a96ab5f..4fd86fd5b 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -153,13 +153,6 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
free(vsocket);
goto out;
}
-   TAILQ_INIT(&vsocket->conn_list);
-   ret = pthread_mutex_init(&vsocket->conn_mutex, NULL);
-   if (ret) {
-   RTE_LOG(ERR, VHOST_CONFIG,
-   "error: failed to init connection mutex\n");
-   goto out_free;
-   }
vsocket->dequeue_zero_copy = flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
 
/*
@@ -186,14 +179,14 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
vsocket->reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
if (vsocket->reconnect && reconn_tid == 0) {
if (vhost_user_reconnect_init() != 0)
-   goto out_mutex;
+   goto out_free;
}
} else {
vsocket->is_server = true;
}
ret = trans_ops->socket_init(vsocket, flags);
if (ret < 0) {
-   goto out_mutex;
+   goto out_free;
}
 
vhost_user.vsockets[vhost_user.vsocket_cnt++] = vsocket;
@@ -201,11 +194,6 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
pthread_mutex_unlock(&vhost_user.mutex);
return ret;
 
-out_mutex:
-   if (pthread_mutex_destroy(&vsocket->conn_mutex)) {
-   RTE_LOG(ERR, VHOST_CONFIG,
-   "error: failed to destroy connection mutex\n");
-   }
 out_free:
free(vsocket->path);
free(vsocket);
@@ -223,7 +211,6 @@ rte_vhost_driver_unregister(const char *path)
 {
int i;
int count;
-   struct vhost_user_connection *conn, *next;
 
pthread_mutex_lock(&vhost_user.mutex);
 
@@ -232,25 +219,6 @@ rte_vhost_driver_unregister(const char *path)
 
if (!strcmp(vsocket->path, path)) {
vsocket->trans_ops->socket_cleanup(vsocket);
-
-   pthread_mutex_lock(&vsocket->conn_mutex);
-   for (conn = TAILQ_FIRST(&vsocket->conn_list);
-conn != NULL;
-conn = next) {
-   next = TAILQ_NEXT(conn, next);
-
-   fdset_del(&vhost_user.fdset, conn->connfd);
-  

[dpdk-dev] [RFC 08/24] vhost: move vhost_user.fdset to trans_af_unix.c

2018-01-19 Thread Stefan Hajnoczi
The fdset is used by the AF_UNIX transport code but other transports may
not need it.  Move it to trans_af_unix.c and then make struct vhost_user
private again since nothing outside socket.c needs it.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h | 15 ---
 lib/librte_vhost/socket.c| 21 +++--
 lib/librte_vhost/trans_af_unix.c | 25 +
 3 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 8f18c1ed0..45167b6a7 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -19,7 +19,6 @@
 #include 
 #include 
 
-#include "fd_man.h"
 #include "rte_vhost.h"
 
 /* Used to indicate that the device is running on a data core */
@@ -302,10 +301,6 @@ struct virtio_net {
int slave_req_fd;
 } __rte_cache_aligned;
 
-/* The vhost_user and vhost_user_socket declarations are temporary measures for
- * moving AF_UNIX code into trans_af_unix.c.  They will be cleaned up as
- * socket.c is untangled from trans_af_unix.c.
- */
 /*
  * Every time rte_vhost_driver_register() is invoked, an associated
  * vhost_user_socket struct will be created.
@@ -336,16 +331,6 @@ struct vhost_user_socket {
struct vhost_transport_ops const *trans_ops;
 };
 
-#define MAX_VHOST_SOCKET 1024
-struct vhost_user {
-   struct vhost_user_socket *vsockets[MAX_VHOST_SOCKET];
-   struct fdset fdset;
-   int vsocket_cnt;
-   pthread_mutex_t mutex;
-};
-
-extern struct vhost_user vhost_user;
-
 #define VHOST_LOG_PAGE 4096
 
 /*
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index f9069fcb1..c46328950 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -20,12 +20,14 @@
 #include "vhost.h"
 #include "vhost_user.h"
 
+#define MAX_VHOST_SOCKET 1024
+struct vhost_user {
+   struct vhost_user_socket *vsockets[MAX_VHOST_SOCKET];
+   int vsocket_cnt;
+   pthread_mutex_t mutex;
+};
+
 struct vhost_user vhost_user = {
-   .fdset = {
-   .fd = { [0 ... MAX_FDS - 1] = {-1, NULL, NULL, NULL, 0} },
-   .fd_mutex = PTHREAD_MUTEX_INITIALIZER,
-   .num = 0
-   },
.vsocket_cnt = 0,
.mutex = PTHREAD_MUTEX_INITIALIZER,
 };
@@ -265,7 +267,6 @@ int
 rte_vhost_driver_start(const char *path)
 {
struct vhost_user_socket *vsocket;
-   static pthread_t fdset_tid;
 
pthread_mutex_lock(&vhost_user.mutex);
vsocket = find_vhost_user_socket(path);
@@ -274,13 +275,5 @@ rte_vhost_driver_start(const char *path)
if (!vsocket)
return -1;
 
-   if (fdset_tid == 0) {
-   int ret = pthread_create(&fdset_tid, NULL, fdset_event_dispatch,
-&vhost_user.fdset);
-   if (ret != 0)
-   RTE_LOG(ERR, VHOST_CONFIG,
-   "failed to create fdset handling thread");
-   }
-
return vsocket->trans_ops->socket_start(vsocket);
 }
diff --git a/lib/librte_vhost/trans_af_unix.c b/lib/librte_vhost/trans_af_unix.c
index ad07a8e24..54c1b2db3 100644
--- a/lib/librte_vhost/trans_af_unix.c
+++ b/lib/librte_vhost/trans_af_unix.c
@@ -39,11 +39,18 @@
 
 #include 
 
+#include "fd_man.h"
 #include "vhost.h"
 #include "vhost_user.h"
 
 #define MAX_VIRTIO_BACKLOG 128
 
+static struct fdset af_unix_fdset = {
+   .fd = { [0 ... MAX_FDS - 1] = {-1, NULL, NULL, NULL, 0} },
+   .fd_mutex = PTHREAD_MUTEX_INITIALIZER,
+   .num = 0
+};
+
 TAILQ_HEAD(vhost_user_connection_list, vhost_user_connection);
 
 struct vhost_user_connection {
@@ -195,7 +202,7 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
conn->connfd = fd;
conn->vsocket = vsocket;
conn->vid = vid;
-   ret = fdset_add(&vhost_user.fdset, fd, vhost_user_read_cb,
+   ret = fdset_add(&af_unix_fdset, fd, vhost_user_read_cb,
NULL, conn);
if (ret < 0) {
RTE_LOG(ERR, VHOST_CONFIG,
@@ -316,7 +323,7 @@ vhost_user_start_server(struct vhost_user_socket *vsocket)
if (ret < 0)
goto err;
 
-   ret = fdset_add(&vhost_user.fdset, fd, vhost_user_server_new_connection,
+   ret = fdset_add(&af_unix_fdset, fd, vhost_user_server_new_connection,
  NULL, vsocket);
if (ret < 0) {
RTE_LOG(ERR, VHOST_CONFIG,
@@ -551,7 +558,7 @@ af_unix_socket_cleanup(struct vhost_user_socket *vsocket)
struct vhost_user_connection *conn, *next;
 
if (vsocket->is_server) {
-   fdset_del(&vhost_user.fdset, s->socket_fd);
+   fdset_del(&af_unix_fdset, s->socket_fd);
close(s->socket_fd);
unlink(vsocket->path);
} el

[dpdk-dev] [RFC 09/24] vhost: pass vhost_transport_ops through vhost_new_device()

2018-01-19 Thread Stefan Hajnoczi
This patch propagates struct vhost_user_socket's vhost_transport_ops
into the newly created vhost device.

This patch completes the initial refactoring of socket.c, with the
AF_UNIX-specific code now in trans_af_unix.c and the librte_vhost API
entrypoints in socket.c.

Now it is time to turn towards vhost_user.c and its mixture of
vhost-user protocol processing and socket I/O.  The socket I/O will be
moved into trans_af_unix.c so that other transports can be added that
don't use file descriptors.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h | 2 +-
 lib/librte_vhost/trans_af_unix.c | 2 +-
 lib/librte_vhost/vhost.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 45167b6a7..ca7507284 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -434,7 +442,7 @@ gpa_to_hpa(struct virtio_net *dev, uint64_t gpa, uint64_t size)
 
 struct virtio_net *get_device(int vid);
 
-int vhost_new_device(void);
+int vhost_new_device(const struct vhost_transport_ops *trans_ops);
 void cleanup_device(struct virtio_net *dev, int destroy);
 void reset_device(struct virtio_net *dev);
 void vhost_destroy_device(int);
diff --git a/lib/librte_vhost/trans_af_unix.c b/lib/librte_vhost/trans_af_unix.c
index 54c1b2db3..c85a162e8 100644
--- a/lib/librte_vhost/trans_af_unix.c
+++ b/lib/librte_vhost/trans_af_unix.c
@@ -176,7 +176,7 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
return;
}
 
-   vid = vhost_new_device();
+   vid = vhost_new_device(vsocket->trans_ops);
if (vid == -1) {
goto err;
}
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 630dbd014..f8754e261 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -261,7 +261,7 @@ reset_device(struct virtio_net *dev)
  * there is a new virtio device being attached).
  */
 int
-vhost_new_device(void)
+vhost_new_device(const struct vhost_transport_ops *trans_ops)
 {
struct virtio_net *dev;
int i;
@@ -287,7 +287,7 @@ vhost_new_device(void)
vhost_devices[i] = dev;
dev->vid = i;
dev->slave_req_fd = -1;
-   dev->trans_ops = &af_unix_trans_ops;
+   dev->trans_ops = trans_ops;
 
return i;
 }
-- 
2.14.3



[dpdk-dev] [RFC 12/24] vhost: move slave_req_fd field to AF_UNIX transport

2018-01-19 Thread Stefan Hajnoczi
The slave request file descriptor is specific to the AF_UNIX transport.
Move the field out of struct virtio_net and into the private struct
vhost_user_connection in trans_af_unix.c.

This change will allow future transports to implement the slave request
fd without relying on socket I/O.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h | 27 --
 lib/librte_vhost/trans_af_unix.c | 41 +++-
 lib/librte_vhost/vhost.c |  3 ++-
 lib/librte_vhost/vhost_user.c| 18 +-
 4 files changed, 68 insertions(+), 21 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index ac9ceefb9..60e4d10bd 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -252,6 +252,16 @@ struct vhost_transport_ops {
 */
int (*socket_start)(struct vhost_user_socket *vsocket);
 
+   /**
+   * Free resources associated with this device.
+   *
+   * @param dev
+   *  vhost device
+   * @param destroy
+   *  0 on device reset, 1 on full cleanup.
+   */
+   void (*cleanup_device)(struct virtio_net *dev, int destroy);
+
/**
 * Notify the guest that used descriptors have been added to the vring.
 * The VRING_AVAIL_F_NO_INTERRUPT flag has already been checked so this
@@ -290,6 +300,21 @@ struct vhost_transport_ops {
 */
int (*send_slave_req)(struct virtio_net *dev,
  struct VhostUserMsg *req);
+
+   /**
+* Process VHOST_USER_SET_SLAVE_REQ_FD message.  After this function
+* succeeds send_slave_req() may be called to submit requests to the
+* master.
+*
+* @param dev
+*  vhost device
+* @param msg
+*  message
+* @return
+*  0 on success, -1 on failure
+*/
+   int (*set_slave_req_fd)(struct virtio_net *dev,
+   struct VhostUserMsg *msg);
 };
 
 /** The traditional AF_UNIX vhost-user protocol transport. */
@@ -331,8 +356,6 @@ struct virtio_net {
uint32_tnr_guest_pages;
uint32_tmax_guest_pages;
struct guest_page   *guest_pages;
-
-   int slave_req_fd;
 } __rte_cache_aligned;
 
 /*
diff --git a/lib/librte_vhost/trans_af_unix.c b/lib/librte_vhost/trans_af_unix.c
index 9e5a5c127..7128e121e 100644
--- a/lib/librte_vhost/trans_af_unix.c
+++ b/lib/librte_vhost/trans_af_unix.c
@@ -57,6 +57,7 @@ struct vhost_user_connection {
struct virtio_net device; /* must be the first field! */
struct vhost_user_socket *vsocket;
int connfd;
+   int slave_req_fd;
 
TAILQ_ENTRY(vhost_user_connection) next;
 };
@@ -173,10 +174,32 @@ af_unix_send_reply(struct virtio_net *dev, struct VhostUserMsg *msg)
 static int
 af_unix_send_slave_req(struct virtio_net *dev, struct VhostUserMsg *msg)
 {
-   return send_fd_message(dev->slave_req_fd, msg,
+   struct vhost_user_connection *conn =
+   container_of(dev, struct vhost_user_connection, device);
+
+   return send_fd_message(conn->slave_req_fd, msg,
   VHOST_USER_HDR_SIZE + msg->size, NULL, 0);
 }
 
+static int
+af_unix_set_slave_req_fd(struct virtio_net *dev, struct VhostUserMsg *msg)
+{
+   struct vhost_user_connection *conn =
+   container_of(dev, struct vhost_user_connection, device);
+   int fd = msg->fds[0];
+
+   if (fd < 0) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "Invalid file descriptor for slave channel (%d)\n",
+   fd);
+   return -1;
+   }
+
+   conn->slave_req_fd = fd;
+
+   return 0;
+}
+
 static void
 vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 {
@@ -194,6 +217,7 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 
conn = container_of(dev, struct vhost_user_connection, device);
conn->connfd = fd;
+   conn->slave_req_fd = -1;
conn->vsocket = vsocket;
 
size = strnlen(vsocket->path, PATH_MAX);
@@ -657,6 +681,19 @@ af_unix_socket_start(struct vhost_user_socket *vsocket)
return vhost_user_start_client(vsocket);
 }
 
+static void
+af_unix_cleanup_device(struct virtio_net *dev __rte_unused,
+  int destroy __rte_unused)
+{
+   struct vhost_user_connection *conn =
+   container_of(dev, struct vhost_user_connection, device);
+
+   if (conn->slave_req_fd >= 0) {
+   close(conn->slave_req_fd);
+   conn->slave_req_fd = -1;
+   }
+}
+
 static int
 af_unix_vring_call(struct virtio_net *dev __rte_unused,
   struct vhost_virtqueue *vq)
@@ -672,7 +709,9 @@ const struct vhost_transport_ops af_unix_trans_ops = {
.socket

[dpdk-dev] [RFC 11/24] vhost: extract vhost_user.c socket I/O into transport

2018-01-19 Thread Stefan Hajnoczi
The core vhost-user protocol code should not do socket I/O because the
details are transport-specific.  Move code to send and receive
vhost-user messages into trans_af_unix.c.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h | 26 
 lib/librte_vhost/vhost_user.h|  6 +--
 lib/librte_vhost/trans_af_unix.c | 70 +++--
 lib/librte_vhost/vhost_user.c| 85 +++-
 4 files changed, 115 insertions(+), 72 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 757e18391..ac9ceefb9 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -203,6 +203,7 @@ struct guest_page {
 
 struct virtio_net;
 struct vhost_user_socket;
+struct VhostUserMsg;
 
 /**
  * A structure containing function pointers for transport-specific operations.
@@ -264,6 +265,31 @@ struct vhost_transport_ops {
 *  0 on success, -1 on failure
 */
int (*vring_call)(struct virtio_net *dev, struct vhost_virtqueue *vq);
+
+   /**
+* Send a reply to the master.
+*
+* @param dev
+*  vhost device
+* @param reply
+*  reply message
+* @return
+*  0 on success, -1 on failure
+*/
+   int (*send_reply)(struct virtio_net *dev, struct VhostUserMsg *reply);
+
+   /**
+* Send a slave request to the master.
+*
+* @param dev
+*  vhost device
+* @param req
+*  request message
+* @return
+*  0 on success, -1 on failure
+*/
+   int (*send_slave_req)(struct virtio_net *dev,
+ struct VhostUserMsg *req);
 };
 
 /** The traditional AF_UNIX vhost-user protocol transport. */
diff --git a/lib/librte_vhost/vhost_user.h b/lib/librte_vhost/vhost_user.h
index d4bd604b9..dec658dff 100644
--- a/lib/librte_vhost/vhost_user.h
+++ b/lib/librte_vhost/vhost_user.h
@@ -110,11 +110,7 @@ typedef struct VhostUserMsg {
 
 
 /* vhost_user.c */
-int vhost_user_msg_handler(int vid, int fd);
+int vhost_user_msg_handler(int vid, const struct VhostUserMsg *msg);
 int vhost_user_iotlb_miss(struct virtio_net *dev, uint64_t iova, uint8_t perm);
 
-/* socket.c */
-int read_fd_message(int sockfd, char *buf, int buflen, int *fds, int fd_num);
-int send_fd_message(int sockfd, char *buf, int buflen, int *fds, int fd_num);
-
 #endif
diff --git a/lib/librte_vhost/trans_af_unix.c b/lib/librte_vhost/trans_af_unix.c
index dde3e76cd..9e5a5c127 100644
--- a/lib/librte_vhost/trans_af_unix.c
+++ b/lib/librte_vhost/trans_af_unix.c
@@ -75,7 +75,7 @@ static int vhost_user_start_client(struct vhost_user_socket *vsocket);
 static void vhost_user_read_cb(int connfd, void *dat, int *remove);
 
 /* return bytes# of read on success or negative val on failure. */
-int
+static int
 read_fd_message(int sockfd, char *buf, int buflen, int *fds, int fd_num)
 {
struct iovec iov;
@@ -117,8 +117,8 @@ read_fd_message(int sockfd, char *buf, int buflen, int *fds, int fd_num)
return ret;
 }
 
-int
-send_fd_message(int sockfd, char *buf, int buflen, int *fds, int fd_num)
+static int
+send_fd_message(int sockfd, void *buf, int buflen, int *fds, int fd_num)
 {
 
struct iovec iov;
@@ -160,6 +160,23 @@ send_fd_message(int sockfd, char *buf, int buflen, int *fds, int fd_num)
return ret;
 }
 
+static int
+af_unix_send_reply(struct virtio_net *dev, struct VhostUserMsg *msg)
+{
+   struct vhost_user_connection *conn =
+   container_of(dev, struct vhost_user_connection, device);
+
+   return send_fd_message(conn->connfd, msg,
+  VHOST_USER_HDR_SIZE + msg->size, NULL, 0);
+}
+
+static int
+af_unix_send_slave_req(struct virtio_net *dev, struct VhostUserMsg *msg)
+{
+   return send_fd_message(dev->slave_req_fd, msg,
+  VHOST_USER_HDR_SIZE + msg->size, NULL, 0);
+}
+
 static void
 vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 {
@@ -234,6 +251,36 @@ vhost_user_server_new_connection(int fd, void *dat, int *remove __rte_unused)
vhost_user_add_connection(fd, vsocket);
 }
 
+/* return bytes# of read on success or negative val on failure. */
+static int
+read_vhost_message(int sockfd, struct VhostUserMsg *msg)
+{
+   int ret;
+
+   ret = read_fd_message(sockfd, (char *)msg, VHOST_USER_HDR_SIZE,
+   msg->fds, VHOST_MEMORY_MAX_NREGIONS);
+   if (ret <= 0)
+   return ret;
+
+   if (msg && msg->size) {
+   if (msg->size > sizeof(msg->payload)) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "invalid msg size: %d\n", msg->size);
+   return -1;
+   }
+   ret = read(sockfd, &msg->payload, msg->size);
+   if (ret <= 0)
+

[dpdk-dev] [RFC 10/24] vhost: embed struct virtio_net inside struct vhost_user_connection

2018-01-19 Thread Stefan Hajnoczi
There is a 1:1 relationship between struct virtio_net and struct
vhost_user_connection.  They share the same lifetime.  struct virtio_net
is the per-device state that is part of the vhost.h API.  struct
vhost_user_connection is the AF_UNIX-specific per-device state and is
private to trans_af_unix.c.  It will be necessary to go between these
two structs.

This patch embeds struct virtio_net within struct vhost_user_connection
so that AF_UNIX transport code can convert a struct virtio_net pointer
into a struct vhost_user_connection pointer.  There is now just a single
malloc/free for both of these structs together.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h | 11 +-
 lib/librte_vhost/trans_af_unix.c | 44 +---
 lib/librte_vhost/vhost.c | 10 -
 3 files changed, 34 insertions(+), 31 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index ca7507284..757e18391 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -211,6 +211,9 @@ struct vhost_transport_ops {
/** Size of struct vhost_user_socket-derived per-socket state */
size_t socket_size;
 
+   /** Size of struct virtio_net-derived per-device state */
+   size_t device_size;
+
/**
 * Initialize a vhost-user socket that is being created by
 * rte_vhost_driver_register().  This function checks that the flags
@@ -269,6 +272,11 @@ extern const struct vhost_transport_ops af_unix_trans_ops;
 /**
  * Device structure contains all configuration information relating
  * to the device.
+ *
+ * Transport-specific per-device state can be kept by embedding this struct at
+ * the beginning of a transport-specific struct.  Set
+ * vhost_transport_ops->device_size to the size of the transport-specific
+ * struct.
  */
 struct virtio_net {
/* Frontend (QEMU) memory and memory region information */
@@ -434,7 +442,7 @@ gpa_to_hpa(struct virtio_net *dev, uint64_t gpa, uint64_t size)
 
 struct virtio_net *get_device(int vid);
 
-int vhost_new_device(const struct vhost_transport_ops *trans_ops);
+struct virtio_net *
+vhost_new_device(const struct vhost_transport_ops *trans_ops);
 void cleanup_device(struct virtio_net *dev, int destroy);
 void reset_device(struct virtio_net *dev);
 void vhost_destroy_device(int);
diff --git a/lib/librte_vhost/trans_af_unix.c b/lib/librte_vhost/trans_af_unix.c
index c85a162e8..dde3e76cd 100644
--- a/lib/librte_vhost/trans_af_unix.c
+++ b/lib/librte_vhost/trans_af_unix.c
@@ -54,9 +54,9 @@ static struct fdset af_unix_fdset = {
 TAILQ_HEAD(vhost_user_connection_list, vhost_user_connection);
 
 struct vhost_user_connection {
+   struct virtio_net device; /* must be the first field! */
struct vhost_user_socket *vsocket;
int connfd;
-   int vid;
 
TAILQ_ENTRY(vhost_user_connection) next;
 };
@@ -165,32 +165,30 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 {
struct af_unix_socket *s =
container_of(vsocket, struct af_unix_socket, socket);
-   int vid;
size_t size;
+   struct virtio_net *dev;
struct vhost_user_connection *conn;
int ret;
 
-   conn = malloc(sizeof(*conn));
-   if (conn == NULL) {
-   close(fd);
+   dev = vhost_new_device(vsocket->trans_ops);
+   if (!dev) {
return;
}
 
-   vid = vhost_new_device(vsocket->trans_ops);
-   if (vid == -1) {
-   goto err;
-   }
+   conn = container_of(dev, struct vhost_user_connection, device);
+   conn->connfd = fd;
+   conn->vsocket = vsocket;
 
size = strnlen(vsocket->path, PATH_MAX);
-   vhost_set_ifname(vid, vsocket->path, size);
+   vhost_set_ifname(dev->vid, vsocket->path, size);
 
if (vsocket->dequeue_zero_copy)
-   vhost_enable_dequeue_zero_copy(vid);
+   vhost_enable_dequeue_zero_copy(dev->vid);
 
-   RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);
+   RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", dev->vid);
 
if (vsocket->notify_ops->new_connection) {
-   ret = vsocket->notify_ops->new_connection(vid);
+   ret = vsocket->notify_ops->new_connection(dev->vid);
if (ret < 0) {
RTE_LOG(ERR, VHOST_CONFIG,
"failed to add vhost user connection with fd %d\n",
@@ -199,9 +197,6 @@ vhost_user_add_connection(int fd, struct vhost_user_socket 
*vsocket)
}
}
 
-   conn->connfd = fd;
-   conn->vsocket = vsocket;
-   conn->vid = vid;
ret = fdset_add(&af_unix_fdset, fd, vhost_user_read_cb,
NULL, conn);
if (ret < 0) {
@@ -210,7 +205,7 @@ vhost_user_add_connection(int fd, 

[dpdk-dev] [RFC 13/24] vhost: move mmap/munmap to AF_UNIX transport

2018-01-19 Thread Stefan Hajnoczi
How mem table regions are mapped is transport-specific, so move the mmap
code into trans_af_unix.c.  The new .map_mem_table()/.unmap_mem_table()
interfaces allow transports to perform the mapping and unmapping.

Drop the "mmap align:" debug output because the alignment is no longer
available from vhost_user.c.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h | 17 +++
 lib/librte_vhost/vhost_user.h|  3 ++
 lib/librte_vhost/trans_af_unix.c | 78 +
 lib/librte_vhost/vhost_user.c| 95 ++--
 4 files changed, 121 insertions(+), 72 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 60e4d10bd..a50b802e7 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -315,6 +315,23 @@ struct vhost_transport_ops {
 */
int (*set_slave_req_fd)(struct virtio_net *dev,
struct VhostUserMsg *msg);
+
+   /**
+* Map memory table regions in dev->mem->regions[].
+*
+* @param dev
+*  vhost device
+*/
+   int (*map_mem_regions)(struct virtio_net *dev);
+
+   /**
+* Unmap memory table regions in dev->mem->regions[] and free any
+* resources, such as file descriptors.
+*
+* @param dev
+*  vhost device
+*/
+   void (*unmap_mem_regions)(struct virtio_net *dev);
 };
 
 /** The traditional AF_UNIX vhost-user protocol transport. */
diff --git a/lib/librte_vhost/vhost_user.h b/lib/librte_vhost/vhost_user.h
index dec658dff..4181f34c9 100644
--- a/lib/librte_vhost/vhost_user.h
+++ b/lib/librte_vhost/vhost_user.h
@@ -110,6 +110,9 @@ typedef struct VhostUserMsg {
 
 
 /* vhost_user.c */
+void vhost_add_guest_pages(struct virtio_net *dev,
+  struct rte_vhost_mem_region *reg,
+  uint64_t page_size);
 int vhost_user_msg_handler(int vid, const struct VhostUserMsg *msg);
 int vhost_user_iotlb_miss(struct virtio_net *dev, uint64_t iova, uint8_t perm);
 
diff --git a/lib/librte_vhost/trans_af_unix.c b/lib/librte_vhost/trans_af_unix.c
index 7128e121e..d3a5519b7 100644
--- a/lib/librte_vhost/trans_af_unix.c
+++ b/lib/librte_vhost/trans_af_unix.c
@@ -34,6 +34,8 @@
  */
 
 #include 
+#include 
+#include 
 #include 
 #include 
 
@@ -703,6 +705,80 @@ af_unix_vring_call(struct virtio_net *dev __rte_unused,
return 0;
 }
 
+static uint64_t
+get_blk_size(int fd)
+{
+   struct stat stat;
+   int ret;
+
+   ret = fstat(fd, &stat);
+   return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
+}
+
+static int
+af_unix_map_mem_regions(struct virtio_net *dev)
+{
+   uint32_t i;
+
+   for (i = 0; i < dev->mem->nregions; i++) {
+   struct rte_vhost_mem_region *reg = &dev->mem->regions[i];
+   uint64_t mmap_size = reg->mmap_size;
+   uint64_t mmap_offset = mmap_size - reg->size;
+   uint64_t alignment;
+   void *mmap_addr;
+
+   /* mmap() without flag of MAP_ANONYMOUS, should be called
+* with length argument aligned with hugepagesz at older
+* longterm version Linux, like 2.6.32 and 3.2.72, or
+* mmap() will fail with EINVAL.
+*
+* to avoid failure, make sure in caller to keep length
+* aligned.
+*/
+   alignment = get_blk_size(reg->fd);
+   if (alignment == (uint64_t)-1) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "couldn't get hugepage size through fstat\n");
+   return -1;
+   }
+   mmap_size = RTE_ALIGN_CEIL(mmap_size, alignment);
+
+   mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
+MAP_SHARED | MAP_POPULATE, reg->fd, 0);
+
+   if (mmap_addr == MAP_FAILED) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "mmap region %u failed.\n", i);
+   return -1;
+   }
+
+   reg->mmap_addr = mmap_addr;
+   reg->mmap_size = mmap_size;
+   reg->host_user_addr = (uint64_t)(uintptr_t)reg->mmap_addr +
+ mmap_offset;
+
+   if (dev->dequeue_zero_copy)
+   vhost_add_guest_pages(dev, reg, alignment);
+   }
+
+   return 0;
+}
+
+static void
+af_unix_unmap_mem_regions(struct virtio_net *dev)
+{
+   uint32_t i;
+   struct rte_vhost_mem_region *reg;
+
+   for (i = 0; i < dev->mem->nregions; i++) {
+   reg = &dev->mem->regions[i];
+   if (reg->host_user_addr) {
+   munmap(reg->mmap_addr, reg->mmap_size);

[dpdk-dev] [RFC 14/24] vhost: move librte_vhost to drivers/

2018-01-19 Thread Stefan Hajnoczi
PCI drivers depend on the rte_bus_pci.h header file which is only
available when drivers/ is built.  Code in lib/ cannot depend on code in
drivers/ because lib/ is built first during make install.

This patch moves librte_vhost into drivers/ so that later patches can
add a virtio-vhost-pci driver to librte_vhost without exporting all
private vhost.h symbols.

Signed-off-by: Stefan Hajnoczi 
---
Is there a better way of giving a PCI driver access to librte_vhost
transport (not a public API that apps should use)?
---
 drivers/Makefile| 2 ++
 {lib => drivers}/librte_vhost/Makefile  | 0
 lib/Makefile| 3 ---
 {lib => drivers}/librte_vhost/fd_man.h  | 0
 {lib => drivers}/librte_vhost/iotlb.h   | 0
 {lib => drivers}/librte_vhost/rte_vhost.h   | 0
 {lib => drivers}/librte_vhost/vhost.h   | 0
 {lib => drivers}/librte_vhost/vhost_user.h  | 0
 {lib => drivers}/librte_vhost/fd_man.c  | 0
 {lib => drivers}/librte_vhost/iotlb.c   | 0
 {lib => drivers}/librte_vhost/socket.c  | 0
 {lib => drivers}/librte_vhost/trans_af_unix.c   | 0
 {lib => drivers}/librte_vhost/vhost.c   | 0
 {lib => drivers}/librte_vhost/vhost_user.c  | 0
 {lib => drivers}/librte_vhost/virtio_net.c  | 0
 {lib => drivers}/librte_vhost/rte_vhost_version.map | 0
 16 files changed, 2 insertions(+), 3 deletions(-)
 rename {lib => drivers}/librte_vhost/Makefile (100%)
 rename {lib => drivers}/librte_vhost/fd_man.h (100%)
 rename {lib => drivers}/librte_vhost/iotlb.h (100%)
 rename {lib => drivers}/librte_vhost/rte_vhost.h (100%)
 rename {lib => drivers}/librte_vhost/vhost.h (100%)
 rename {lib => drivers}/librte_vhost/vhost_user.h (100%)
 rename {lib => drivers}/librte_vhost/fd_man.c (100%)
 rename {lib => drivers}/librte_vhost/iotlb.c (100%)
 rename {lib => drivers}/librte_vhost/socket.c (100%)
 rename {lib => drivers}/librte_vhost/trans_af_unix.c (100%)
 rename {lib => drivers}/librte_vhost/vhost.c (100%)
 rename {lib => drivers}/librte_vhost/vhost_user.c (100%)
 rename {lib => drivers}/librte_vhost/virtio_net.c (100%)
 rename {lib => drivers}/librte_vhost/rte_vhost_version.map (100%)

diff --git a/drivers/Makefile b/drivers/Makefile
index 57e1a48a8..67f110c25 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -12,5 +12,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_CRYPTODEV) += crypto
 DEPDIRS-crypto := bus mempool
 DIRS-$(CONFIG_RTE_LIBRTE_EVENTDEV) += event
 DEPDIRS-event := bus mempool net
+DIRS-$(CONFIG_RTE_LIBRTE_VHOST) += librte_vhost
+DEPDIRS-librte_vhost := bus
 
 include $(RTE_SDK)/mk/rte.subdir.mk
diff --git a/lib/librte_vhost/Makefile b/drivers/librte_vhost/Makefile
similarity index 100%
rename from lib/librte_vhost/Makefile
rename to drivers/librte_vhost/Makefile
diff --git a/lib/Makefile b/lib/Makefile
index 679912a28..dd916901b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -30,9 +30,6 @@ DEPDIRS-librte_security += librte_ether
 DEPDIRS-librte_security += librte_cryptodev
 DIRS-$(CONFIG_RTE_LIBRTE_EVENTDEV) += librte_eventdev
 DEPDIRS-librte_eventdev := librte_eal librte_ring librte_ether librte_hash
-DIRS-$(CONFIG_RTE_LIBRTE_VHOST) += librte_vhost
-DEPDIRS-librte_vhost := librte_eal librte_mempool librte_mbuf librte_ether \
-   librte_net
 DIRS-$(CONFIG_RTE_LIBRTE_HASH) += librte_hash
 DEPDIRS-librte_hash := librte_eal librte_ring
 DIRS-$(CONFIG_RTE_LIBRTE_EFD) += librte_efd
diff --git a/lib/librte_vhost/fd_man.h b/drivers/librte_vhost/fd_man.h
similarity index 100%
rename from lib/librte_vhost/fd_man.h
rename to drivers/librte_vhost/fd_man.h
diff --git a/lib/librte_vhost/iotlb.h b/drivers/librte_vhost/iotlb.h
similarity index 100%
rename from lib/librte_vhost/iotlb.h
rename to drivers/librte_vhost/iotlb.h
diff --git a/lib/librte_vhost/rte_vhost.h b/drivers/librte_vhost/rte_vhost.h
similarity index 100%
rename from lib/librte_vhost/rte_vhost.h
rename to drivers/librte_vhost/rte_vhost.h
diff --git a/lib/librte_vhost/vhost.h b/drivers/librte_vhost/vhost.h
similarity index 100%
rename from lib/librte_vhost/vhost.h
rename to drivers/librte_vhost/vhost.h
diff --git a/lib/librte_vhost/vhost_user.h b/drivers/librte_vhost/vhost_user.h
similarity index 100%
rename from lib/librte_vhost/vhost_user.h
rename to drivers/librte_vhost/vhost_user.h
diff --git a/lib/librte_vhost/fd_man.c b/drivers/librte_vhost/fd_man.c
similarity index 100%
rename from lib/librte_vhost/fd_man.c
rename to drivers/librte_vhost/fd_man.c
diff --git a/lib/librte_vhost/iotlb.c b/drivers/librte_vhost/iotlb.c
similarity index 100%
rename from lib/librte_vhost/iotlb.c
rename to drivers/librte_vhost/iotlb.c
diff --git a/lib/librte_vhost/socket.c b/drivers/librte_vhost/socket.c
similarity index 100%
rename from lib/librte_vhost/socket.c
rename to drivers/librte_vhost/socket.c
diff --git a

[dpdk-dev] [RFC 15/24] vhost: add virtio pci framework

2018-01-19 Thread Stefan Hajnoczi
The virtio-vhost-user transport will involve a virtio pci device driver.
There is currently no librte_virtio API that we can reuse.

This commit is a hack that duplicates the virtio pci code from
drivers/net/.  A better solution would be to extract the code cleanly
from drivers/net/ and share it.  Or perhaps we could backport SPDK's
lib/virtio.  I don't have time to do either right now so I've just
copied the code, removed virtio-net and ethdev parts, and renamed
symbols to avoid link errors.

Signed-off-by: Stefan Hajnoczi 
---
 drivers/librte_vhost/Makefile |   4 +-
 drivers/librte_vhost/virtio_pci.h | 267 
 drivers/librte_vhost/virtqueue.h  | 181 ++
 drivers/librte_vhost/virtio_pci.c | 504 ++
 4 files changed, 955 insertions(+), 1 deletion(-)
 create mode 100644 drivers/librte_vhost/virtio_pci.h
 create mode 100644 drivers/librte_vhost/virtqueue.h
 create mode 100644 drivers/librte_vhost/virtio_pci.c

diff --git a/drivers/librte_vhost/Makefile b/drivers/librte_vhost/Makefile
index ccbbce3af..8a56c32af 100644
--- a/drivers/librte_vhost/Makefile
+++ b/drivers/librte_vhost/Makefile
LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -lrte_ethdev -lrte_net
 
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
-   vhost_user.c virtio_net.c trans_af_unix.c
+   vhost_user.c virtio_net.c \
+   trans_af_unix.c \
+   virtio_pci.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
diff --git a/drivers/librte_vhost/virtio_pci.h b/drivers/librte_vhost/virtio_pci.h
new file mode 100644
index 0..7afc24853
--- /dev/null
+++ b/drivers/librte_vhost/virtio_pci.h
@@ -0,0 +1,267 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+/* XXX This file is based on drivers/net/virtio/virtio_pci.h.  It would be
+ * better to create a shared rte_virtio library instead of duplicating this
+ * code.
+ */
+
+#ifndef _VIRTIO_PCI_H_
+#define _VIRTIO_PCI_H_
+
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+/* Macros for printing using RTE_LOG */
+#define RTE_LOGTYPE_VIRTIO_PCI_CONFIG RTE_LOGTYPE_USER2
+
+struct virtqueue;
+
+/* VirtIO PCI vendor/device ID. */
+#define VIRTIO_PCI_VENDORID 0x1AF4
+#define VIRTIO_PCI_LEGACY_DEVICEID_VHOST_USER 0x1017
+#define VIRTIO_PCI_MODERN_DEVICEID_VHOST_USER 0x1058
+
+/* VirtIO ABI version, this must match exactly. */
+#define VIRTIO_PCI_ABI_VERSION 0
+
+/*
+ * VirtIO Header, located in BAR 0.
+ */
+#define VIRTIO_PCI_HOST_FEATURES  0  /* host's supported features (32bit, RO)*/
+#define VIRTIO_PCI_GUEST_FEATURES 4  /* guest's supported features (32, RW) */
+#define VIRTIO_PCI_QUEUE_PFN  8  /* physical address of VQ (32, RW) */
+#define VIRTIO_PCI_QUEUE_NUM  12 /* number of ring entries (16, RO) */
+#define VIRTIO_PCI_QUEUE_SEL  14 /* current VQ selection (16, RW) */
+#define VIRTIO_PCI_QUEUE_NOTIFY   16 /* notify host regarding VQ (16, RW) */
+#define VIRTIO_PCI_STATUS 18 /* device status register (8, RW) */
+#define VIRTIO_PCI_ISR   19 /* interrupt status register, reading
+ * also clears the register (8, RO) */
+/* Only if MSIX is enabled: */
+#define VIRTIO_MSI_CONFIG_VECTOR  20 /* configuration change vector (16, RW) */
+#define VIRTIO_MSI_QUEUE_VECTOR  22 /* vector for selected VQ notifications
+ (16, RW) */
+
+/* The bit of the ISR which indicates a device has an interrupt. */
+#define VIRTIO_PCI_ISR_INTR   0x1
+/* The bit of the ISR which indicates a device configuration change. */
+#define VIRTIO_PCI_ISR_CONFIG 0x2
+/* Vector value used to disable MSI for queue. */
+#define VIRTIO_MSI_NO_VECTOR 0xFFFF
+
+/* VirtIO device IDs. */
+#define VIRTIO_ID_VHOST_USER  0x18
+
+/* Status byte for guest to report progress. */
+#define VIRTIO_CONFIG_STATUS_RESET 0x00
+#define VIRTIO_CONFIG_STATUS_ACK   0x01
+#define VIRTIO_CONFIG_STATUS_DRIVER0x02
+#define VIRTIO_CONFIG_STATUS_DRIVER_OK 0x04
+#define VIRTIO_CONFIG_STATUS_FEATURES_OK 0x08
+#define VIRTIO_CONFIG_STATUS_FAILED0x80
+
+/*
+ * Each virtqueue indirect descriptor list must be physically contiguous.
+ * To allow us to malloc(9) each list individually, limit the number
+ * supported to what will fit in one page. With 4KB pages, this is a limit
+ * of 256 descriptors. If there is ever a need for more, we can switch to
+ * contigmalloc(9) for the larger allocations, similar to what
+ * bus_dmamem_alloc(9) does.
+ *
+ * Note the sizeof(struct vring_desc) is 16 bytes.
+ */
+#define VIRTIO_MAX_INDIRECT ((int) (PAGE_SIZE / 16))
+
+/* Do we get callbacks when the ring is completely used, even if we've
+ 

[dpdk-dev] [RFC 16/24] vhost: remember a vhost_virtqueue's queue index

2018-01-19 Thread Stefan Hajnoczi
Currently the only way of determining a struct vhost_virtqueue's index
is to search struct virtio_net->virtqueue[] for its address.  Stash the
index in struct vhost_virtqueue so we won't have to search the array.

This new field will be used by virtio-vhost-user.

Signed-off-by: Stefan Hajnoczi 
---
 drivers/librte_vhost/vhost.h | 1 +
 drivers/librte_vhost/vhost.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/librte_vhost/vhost.h b/drivers/librte_vhost/vhost.h
index a50b802e7..08ad874ef 100644
--- a/drivers/librte_vhost/vhost.h
+++ b/drivers/librte_vhost/vhost.h
@@ -72,6 +72,7 @@ struct vhost_virtqueue {
struct vring_avail  *avail;
struct vring_used   *used;
uint32_tsize;
+   uint32_tvring_idx;
 
uint16_tlast_avail_idx;
uint16_tlast_used_idx;
diff --git a/drivers/librte_vhost/vhost.c b/drivers/librte_vhost/vhost.c
index 0d95a4b3a..886444683 100644
--- a/drivers/librte_vhost/vhost.c
+++ b/drivers/librte_vhost/vhost.c
@@ -191,6 +191,8 @@ init_vring_queue(struct virtio_net *dev, uint32_t vring_idx)
 
memset(vq, 0, sizeof(struct vhost_virtqueue));
 
+   vq->vring_idx = vring_idx;
+
vq->kickfd = VIRTIO_UNINITIALIZED_EVENTFD;
vq->callfd = VIRTIO_UNINITIALIZED_EVENTFD;
 
-- 
2.14.3



[dpdk-dev] [RFC 18/24] vhost: add RTE_VHOST_USER_VIRTIO_TRANSPORT flag

2018-01-19 Thread Stefan Hajnoczi
Extend the librte_vhost API to support the virtio-vhost-user transport
as an alternative to the AF_UNIX transport.  The caller provides a PCI
DomBDF address:

  rte_vhost_driver_register("0000:00:04.0",
RTE_VHOST_USER_VIRTIO_TRANSPORT);

Signed-off-by: Stefan Hajnoczi 
---
 drivers/librte_vhost/rte_vhost.h | 1 +
 drivers/librte_vhost/socket.c| 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/librte_vhost/rte_vhost.h b/drivers/librte_vhost/rte_vhost.h
index d33206997..d91d8d992 100644
--- a/drivers/librte_vhost/rte_vhost.h
+++ b/drivers/librte_vhost/rte_vhost.h
@@ -28,6 +28,7 @@ extern "C" {
 #define RTE_VHOST_USER_NO_RECONNECT    (1ULL << 1)
 #define RTE_VHOST_USER_DEQUEUE_ZERO_COPY   (1ULL << 2)
 #define RTE_VHOST_USER_IOMMU_SUPPORT   (1ULL << 3)
+#define RTE_VHOST_USER_VIRTIO_TRANSPORT    (1ULL << 4)
 
 /**
  * Information relating to memory regions including offsets to
diff --git a/drivers/librte_vhost/socket.c b/drivers/librte_vhost/socket.c
index c46328950..265a014cf 100644
--- a/drivers/librte_vhost/socket.c
+++ b/drivers/librte_vhost/socket.c
@@ -132,6 +132,9 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
struct vhost_user_socket *vsocket;
const struct vhost_transport_ops *trans_ops = &af_unix_trans_ops;
 
+   if (flags & RTE_VHOST_USER_VIRTIO_TRANSPORT)
+   trans_ops = &virtio_vhost_user_trans_ops;
+
if (!path)
return -1;
 
-- 
2.14.3



[dpdk-dev] [RFC 17/24] vhost: add virtio-vhost-user transport

2018-01-19 Thread Stefan Hajnoczi
This patch adds a new transport to librte_vhost for the
virtio-vhost-user device.  This device replaces the AF_UNIX socket used
by traditional vhost-user with a virtio device that tunnels vhost-user
protocol messages.  This allows a guest to act as a vhost device backend
for other guests.

The intended use case is for running DPDK inside a guest.  Other guests
can communicate via DPDK's "vhost" vdev driver.

For more information on virtio-vhost-user, see
https://wiki.qemu.org/Features/VirtioVhostUser.

Signed-off-by: Stefan Hajnoczi 
---
 drivers/librte_vhost/Makefile  |1 +
 drivers/librte_vhost/vhost.h   |3 +
 drivers/librte_vhost/virtio_vhost_user.h   |   18 +
 drivers/librte_vhost/trans_virtio_vhost_user.c | 1050 
 4 files changed, 1072 insertions(+)
 create mode 100644 drivers/librte_vhost/virtio_vhost_user.h
 create mode 100644 drivers/librte_vhost/trans_virtio_vhost_user.c

diff --git a/drivers/librte_vhost/Makefile b/drivers/librte_vhost/Makefile
index 8a56c32af..c8cce6fac 100644
--- a/drivers/librte_vhost/Makefile
+++ b/drivers/librte_vhost/Makefile
@@ -23,6 +23,7 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -lrte_ethdev -lrte_net
 SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
vhost_user.c virtio_net.c \
trans_af_unix.c \
+   trans_virtio_vhost_user.c \
virtio_pci.c
 
 # install includes
diff --git a/drivers/librte_vhost/vhost.h b/drivers/librte_vhost/vhost.h
index 08ad874ef..1d4d1d139 100644
--- a/drivers/librte_vhost/vhost.h
+++ b/drivers/librte_vhost/vhost.h
@@ -338,6 +338,9 @@ struct vhost_transport_ops {
 /** The traditional AF_UNIX vhost-user protocol transport. */
 extern const struct vhost_transport_ops af_unix_trans_ops;
 
+/** The virtio-vhost-user PCI vhost-user protocol transport. */
+extern const struct vhost_transport_ops virtio_vhost_user_trans_ops;
+
 /**
  * Device structure contains all configuration information relating
  * to the device.
diff --git a/drivers/librte_vhost/virtio_vhost_user.h b/drivers/librte_vhost/virtio_vhost_user.h
new file mode 100644
index 0..baeaa7494
--- /dev/null
+++ b/drivers/librte_vhost/virtio_vhost_user.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (C) 2018 Red Hat, Inc.
+ */
+
+#ifndef _LINUX_VIRTIO_VHOST_USER_H
+#define _LINUX_VIRTIO_VHOST_USER_H
+
+#include 
+
+struct virtio_vhost_user_config {
+uint32_t status;
+#define VIRTIO_VHOST_USER_STATUS_SLAVE_UP 0
+#define VIRTIO_VHOST_USER_STATUS_MASTER_UP 1
+uint32_t max_vhost_queues;
+uint8_t uuid[16];
+};
+
+#endif /* _LINUX_VIRTIO_VHOST_USER_H */
diff --git a/drivers/librte_vhost/trans_virtio_vhost_user.c b/drivers/librte_vhost/trans_virtio_vhost_user.c
new file mode 100644
index 0..df654eb71
--- /dev/null
+++ b/drivers/librte_vhost/trans_virtio_vhost_user.c
@@ -0,0 +1,1050 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (C) 2018 Red Hat, Inc.
+ */
+
+/*
+ * @file
+ * virtio-vhost-user PCI transport driver
+ *
+ * This vhost-user transport communicates with the vhost-user master process
+ * over the virtio-vhost-user PCI device.
+ *
+ * Interrupts are used since this is the control path, not the data path.  This
+ * way the vhost-user command processing doesn't interfere with packet
+ * processing.  This is similar to the AF_UNIX transport's fdman thread that
+ * processes socket I/O separately.
+ *
+ * This transport replaces the usual vhost-user file descriptor passing with a
+ * PCI BAR that contains doorbell registers for callfd and logfd, and shared
+ * memory for the memory table regions.
+ *
+ * VIRTIO device specification:
+ * https://stefanha.github.io/virtio/vhost-user-slave.html#x1-2830007
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "vhost.h"
+#include "virtio_pci.h"
+#include "virtqueue.h"
+#include "virtio_vhost_user.h"
+#include "vhost_user.h"
+
+/*
+ * Data structures:
+ *
+ * Successfully probed virtio-vhost-user PCI adapters are added to
+ * vvu_pci_device_list as struct vvu_pci_device elements.
+ *
+ * When rte_vhost_driver_register() is called, a struct vvu_socket is created
+ * as the endpoint for future vhost-user connections.  The struct vvu_socket is
+ * associated with the struct vvu_pci_device that will be used for
+ * communication.
+ *
+ * When a vhost-user protocol connection is established, a struct
+ * vvu_connection is created and the application's new_device(int vid) callback
+ * is invoked.
+ */
+
+/** Probed PCI devices for lookup by rte_vhost_driver_register() */
+TAILQ_HEAD(, vvu_pci_device) vvu_pci_device_list =
+   TAILQ_HEAD_INITIALIZER(vvu_pci_device_list);
+
+struct vvu_socket;
+struct vvu_connection;
+
+/** A virtio-vhost-vsock PCI

[dpdk-dev] [RFC 19/24] net/vhost: add virtio-vhost-user support

2018-01-19 Thread Stefan Hajnoczi
The new virtio-transport=0|1 argument enables virtio-vhost-user support:

  testpmd ... --pci-whitelist 0000:00:04.0 \
  --vdev vhost,iface=0000:00:04.0,virtio-transport=1

Signed-off-by: Stefan Hajnoczi 
---
 drivers/net/vhost/rte_eth_vhost.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/net/vhost/rte_eth_vhost.c b/drivers/net/vhost/rte_eth_vhost.c
index 2536ee4a2..a136ce89f 100644
--- a/drivers/net/vhost/rte_eth_vhost.c
+++ b/drivers/net/vhost/rte_eth_vhost.c
@@ -53,6 +53,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
 #define ETH_VHOST_CLIENT_ARG   "client"
 #define ETH_VHOST_DEQUEUE_ZERO_COPY"dequeue-zero-copy"
 #define ETH_VHOST_IOMMU_SUPPORT"iommu-support"
+#define ETH_VHOST_VIRTIO_TRANSPORT "virtio-transport"
 #define VHOST_MAX_PKT_BURST 32
 
 static const char *valid_arguments[] = {
@@ -61,6 +62,7 @@ static const char *valid_arguments[] = {
ETH_VHOST_CLIENT_ARG,
ETH_VHOST_DEQUEUE_ZERO_COPY,
ETH_VHOST_IOMMU_SUPPORT,
+   ETH_VHOST_VIRTIO_TRANSPORT,
NULL
 };
 
@@ -1167,6 +1169,7 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
int client_mode = 0;
int dequeue_zero_copy = 0;
int iommu_support = 0;
+   uint16_t virtio_transport = 0;
 
RTE_LOG(INFO, PMD, "Initializing pmd_vhost for %s\n",
rte_vdev_device_name(dev));
@@ -1224,6 +1227,16 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
flags |= RTE_VHOST_USER_IOMMU_SUPPORT;
}
 
+   if (rte_kvargs_count(kvlist, ETH_VHOST_VIRTIO_TRANSPORT) == 1) {
+   ret = rte_kvargs_process(kvlist, ETH_VHOST_VIRTIO_TRANSPORT,
+&open_int, &virtio_transport);
+   if (ret < 0)
+   goto out_free;
+
+   if (virtio_transport)
+   flags |= RTE_VHOST_USER_VIRTIO_TRANSPORT;
+   }
+
if (dev->device.numa_node == SOCKET_ID_ANY)
dev->device.numa_node = rte_socket_id();
 
-- 
2.14.3



[dpdk-dev] [RFC 20/24] examples/vhost_scsi: add --socket-file argument

2018-01-19 Thread Stefan Hajnoczi
The default filename built into examples/vhost_scsi may not be
convenient.  Allow the user to specify the full UNIX domain socket path
on the command-line.

Signed-off-by: Stefan Hajnoczi 
---
 examples/vhost_scsi/vhost_scsi.c | 93 
 1 file changed, 75 insertions(+), 18 deletions(-)

diff --git a/examples/vhost_scsi/vhost_scsi.c b/examples/vhost_scsi/vhost_scsi.c
index da01ad378..0a8302e02 100644
--- a/examples/vhost_scsi/vhost_scsi.c
+++ b/examples/vhost_scsi/vhost_scsi.c
@@ -2,6 +2,7 @@
  * Copyright(c) 2010-2017 Intel Corporation
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -359,26 +360,10 @@ static const struct vhost_device_ops vhost_scsi_device_ops = {
 };
 
 static struct vhost_scsi_ctrlr *
-vhost_scsi_ctrlr_construct(const char *ctrlr_name)
+vhost_scsi_ctrlr_construct(void)
 {
int ret;
struct vhost_scsi_ctrlr *ctrlr;
-   char *path;
-   char cwd[PATH_MAX];
-
-   /* always use current directory */
-   path = getcwd(cwd, PATH_MAX);
-   if (!path) {
-   fprintf(stderr, "Cannot get current working directory\n");
-   return NULL;
-   }
-   snprintf(dev_pathname, sizeof(dev_pathname), "%s/%s", path, ctrlr_name);
-
-   if (access(dev_pathname, F_OK) != -1) {
-   if (unlink(dev_pathname) != 0)
-   rte_exit(EXIT_FAILURE, "Cannot remove %s.\n",
-dev_pathname);
-   }
 
if (rte_vhost_driver_register(dev_pathname, 0) != 0) {
fprintf(stderr, "socket %s already exists\n", dev_pathname);
@@ -412,6 +397,71 @@ signal_handler(__rte_unused int signum)
exit(0);
 }
 
+static void
+set_dev_pathname(const char *path)
+{
+   if (dev_pathname[0])
+   rte_exit(EXIT_FAILURE, "--socket-file can only be given once.\n");
+
+   snprintf(dev_pathname, sizeof(dev_pathname), "%s", path);
+}
+
+static void
+vhost_scsi_usage(const char *prgname)
+{
+   fprintf(stderr, "%s [EAL options] --\n"
+   "--socket-file PATH: The path of the UNIX domain socket\n",
+   prgname);
+}
+
+static void
+vhost_scsi_parse_args(int argc, char **argv)
+{
+   int opt;
+   int option_index;
+   const char *prgname = argv[0];
+   static struct option long_option[] = {
+   {"socket-file", required_argument, NULL, 0},
+   {NULL, 0, 0, 0},
+   };
+
+   while ((opt = getopt_long(argc, argv, "", long_option,
+ &option_index)) != -1) {
+   switch (opt) {
+   case 0:
+   if (!strcmp(long_option[option_index].name,
+   "socket-file")) {
+   set_dev_pathname(optarg);
+   }
+   break;
+   default:
+   vhost_scsi_usage(prgname);
+   rte_exit(EXIT_FAILURE, "Invalid argument\n");
+   }
+   }
+}
+
+static void
+vhost_scsi_set_default_dev_pathname(void)
+{
+   char *path;
+   char cwd[PATH_MAX];
+
+   /* always use current directory */
+   path = getcwd(cwd, PATH_MAX);
+   if (!path) {
+   rte_exit(EXIT_FAILURE,
+"Cannot get current working directory\n");
+   }
+   snprintf(dev_pathname, sizeof(dev_pathname), "%s/vhost.socket", path);
+
+   if (access(dev_pathname, F_OK) != -1) {
+   if (unlink(dev_pathname) != 0)
+   rte_exit(EXIT_FAILURE, "Cannot remove %s.\n",
+dev_pathname);
+   }
+}
+
 int main(int argc, char *argv[])
 {
int ret;
@@ -422,8 +472,15 @@ int main(int argc, char *argv[])
ret = rte_eal_init(argc, argv);
if (ret < 0)
rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
+   argc -= ret;
+   argv += ret;
 
-   g_vhost_ctrlr = vhost_scsi_ctrlr_construct("vhost.socket");
+   vhost_scsi_parse_args(argc, argv);
+
+   if (!dev_pathname[0])
+   vhost_scsi_set_default_dev_pathname();
+
+   g_vhost_ctrlr = vhost_scsi_ctrlr_construct();
if (g_vhost_ctrlr == NULL) {
fprintf(stderr, "Construct vhost scsi controller failed\n");
return 0;
-- 
2.14.3



[dpdk-dev] [RFC 21/24] examples/vhost_scsi: add virtio-vhost-user support

2018-01-19 Thread Stefan Hajnoczi
The new --virtio-vhost-user-pci command-line argument uses
virtio-vhost-user instead of the default AF_UNIX transport.

Signed-off-by: Stefan Hajnoczi 
---
 examples/vhost_scsi/vhost_scsi.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/examples/vhost_scsi/vhost_scsi.c b/examples/vhost_scsi/vhost_scsi.c
index 0a8302e02..61001cadb 100644
--- a/examples/vhost_scsi/vhost_scsi.c
+++ b/examples/vhost_scsi/vhost_scsi.c
@@ -28,6 +28,7 @@
 
 /* Path to folder where character device will be created. Can be set by user. */
 static char dev_pathname[PATH_MAX] = "";
+static uint64_t dev_flags; /* for rte_vhost_driver_register() */
 
 static struct vhost_scsi_ctrlr *g_vhost_ctrlr;
 static int g_should_stop;
@@ -365,7 +366,7 @@ vhost_scsi_ctrlr_construct(void)
int ret;
struct vhost_scsi_ctrlr *ctrlr;
 
-   if (rte_vhost_driver_register(dev_pathname, 0) != 0) {
+   if (rte_vhost_driver_register(dev_pathname, dev_flags) != 0) {
fprintf(stderr, "socket %s already exists\n", dev_pathname);
return NULL;
}
@@ -401,7 +402,8 @@ static void
 set_dev_pathname(const char *path)
 {
if (dev_pathname[0])
-   rte_exit(EXIT_FAILURE, "--socket-file can only be given once.\n");
+   rte_exit(EXIT_FAILURE, "Only one of --socket-file or "
+"--virtio-vhost-user-pci can be given.\n");
 
snprintf(dev_pathname, sizeof(dev_pathname), "%s", path);
 }
@@ -410,7 +412,8 @@ static void
 vhost_scsi_usage(const char *prgname)
 {
fprintf(stderr, "%s [EAL options] --\n"
-   "--socket-file PATH: The path of the UNIX domain socket\n",
+   "--socket-file PATH: The path of the UNIX domain socket\n"
+   "--virtio-vhost-user-pci DomBDF: PCI adapter address\n",
prgname);
 }
 
@@ -422,6 +425,7 @@ vhost_scsi_parse_args(int argc, char **argv)
const char *prgname = argv[0];
static struct option long_option[] = {
{"socket-file", required_argument, NULL, 0},
+   {"virtio-vhost-user-pci", required_argument, NULL, 0},
{NULL, 0, 0, 0},
};
 
@@ -432,6 +436,10 @@ vhost_scsi_parse_args(int argc, char **argv)
if (!strcmp(long_option[option_index].name,
"socket-file")) {
set_dev_pathname(optarg);
+   } else if (!strcmp(long_option[option_index].name,
+  "virtio-vhost-user-pci")) {
+   set_dev_pathname(optarg);
+   dev_flags = RTE_VHOST_USER_VIRTIO_TRANSPORT;
}
break;
default:
-- 
2.14.3



[dpdk-dev] [RFC 22/24] usertools: add virtio-vhost-user devices to dpdk-devbind.py

2018-01-19 Thread Stefan Hajnoczi
The virtio-vhost-user PCI adapter is not detected in any existing group
of devices supported by dpdk-devbind.py.  Add a new "Others" group for
miscellaneous devices like this one.

Signed-off-by: Stefan Hajnoczi 
---
 usertools/dpdk-devbind.py | 8 
 1 file changed, 8 insertions(+)

diff --git a/usertools/dpdk-devbind.py b/usertools/dpdk-devbind.py
index 894b51969..35e1883b9 100755
--- a/usertools/dpdk-devbind.py
+++ b/usertools/dpdk-devbind.py
@@ -22,11 +22,14 @@
   'SVendor': None, 'SDevice': None}
 cavium_pkx = {'Class': '08', 'Vendor': '177d', 'Device': 'a0dd,a049',
   'SVendor': None, 'SDevice': None}
+virtio_vhost_user = {'Class': '00', 'Vendor': '1af4', 'Device': '1017,1058',
+ 'SVendor': None, 'SDevice': None}
 
 network_devices = [network_class, cavium_pkx]
 crypto_devices = [encryption_class, intel_processor_class]
 eventdev_devices = [cavium_sso]
 mempool_devices = [cavium_fpa]
+other_devices = [virtio_vhost_user]
 
 # global dict ethernet devices present. Dictionary indexed by PCI address.
 # Each device within this is itself a dictionary of device properties
@@ -596,6 +599,9 @@ def show_status():
 if status_dev == "mempool" or status_dev == "all":
 show_device_status(mempool_devices, "Mempool")
 
+if status_dev == 'other' or status_dev == 'all':
+show_device_status(other_devices, "Other")
+
 def parse_args():
 '''Parses the command-line arguments given by the user and takes the
 appropriate action for each'''
@@ -669,6 +675,7 @@ def do_arg_actions():
 get_device_details(crypto_devices)
 get_device_details(eventdev_devices)
 get_device_details(mempool_devices)
+get_device_details(other_devices)
 show_status()
 
 
@@ -681,6 +688,7 @@ def main():
 get_device_details(crypto_devices)
 get_device_details(eventdev_devices)
 get_device_details(mempool_devices)
+get_device_details(other_devices)
 do_arg_actions()
 
 if __name__ == "__main__":
-- 
2.14.3



[dpdk-dev] [RFC 24/24] WORKAROUND examples/vhost_scsi: avoid broken EVENT_IDX

2018-01-19 Thread Stefan Hajnoczi
The EVENT_IDX code in DPDK is broken.  It's missing the
signalled_used_valid flag that handles the corner cases (startup and
wrapping).  Disable it for now.

Signed-off-by: Stefan Hajnoczi 
---
 examples/vhost_scsi/vhost_scsi.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/examples/vhost_scsi/vhost_scsi.c b/examples/vhost_scsi/vhost_scsi.c
index 61001cadb..7106dc6d2 100644
--- a/examples/vhost_scsi/vhost_scsi.c
+++ b/examples/vhost_scsi/vhost_scsi.c
@@ -22,7 +22,6 @@
 #include "scsi_spec.h"
 
 #define VIRTIO_SCSI_FEATURES ((1 << VIRTIO_F_NOTIFY_ON_EMPTY) |\
- (1 << VIRTIO_RING_F_EVENT_IDX) |\
  (1 << VIRTIO_SCSI_F_INOUT) |\
  (1 << VIRTIO_SCSI_F_CHANGE))
 
-- 
2.14.3



[dpdk-dev] [RFC 23/24] WORKAROUND revert virtio-net mq vring deletion

2018-01-19 Thread Stefan Hajnoczi
The virtio-net mq vring deletion code should be in virtio_net.c, not in
the generic vhost_user.c code where it breaks non-virtio-net devices.

Signed-off-by: Stefan Hajnoczi 
---
 drivers/librte_vhost/vhost_user.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/librte_vhost/vhost_user.c b/drivers/librte_vhost/vhost_user.c
index a819684b4..08fab933b 100644
--- a/drivers/librte_vhost/vhost_user.c
+++ b/drivers/librte_vhost/vhost_user.c
@@ -163,6 +163,7 @@ vhost_user_set_features(struct virtio_net *dev, uint64_t features)
(dev->features & (1 << VIRTIO_NET_F_MRG_RXBUF)) ? "on" : "off",
(dev->features & (1ULL << VIRTIO_F_VERSION_1)) ? "on" : "off");
 
+#if 0
if (!(dev->features & (1ULL << VIRTIO_NET_F_MQ))) {
/*
 * Remove all but first queue pair if MQ hasn't been
@@ -181,6 +182,7 @@ vhost_user_set_features(struct virtio_net *dev, uint64_t features)
free_vq(vq);
}
}
+#endif
 
return 0;
 }
-- 
2.14.3



[dpdk-dev] [PATCH 1/2] vhost: add flag for built-in virtio_net.c driver

2018-01-31 Thread Stefan Hajnoczi
The librte_vhost API is used in two ways:
1. As a vhost net device backend via rte_vhost_enqueue/dequeue_burst().
2. As a library for implementing vhost device backends.

There is no distinction between the two at the API level or in the
librte_vhost implementation.  For example, device state is kept in
"struct virtio_net" regardless of whether this is actually a net device
backend or whether the built-in virtio_net.c driver is in use.

The virtio_net.c driver should be a librte_vhost API client just like
the vhost-scsi code and have no special access to vhost.h internals.
Unfortunately, fixing this requires significant librte_vhost API
changes.

This patch takes a different approach: keep the librte_vhost API
unchanged but track whether the built-in virtio_net.c driver is in use.
See the next patch for a bug fix that requires knowledge of whether
virtio_net.c is in use.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h  |  3 +++
 lib/librte_vhost/socket.c | 15 +++
 lib/librte_vhost/vhost.c  | 17 -
 lib/librte_vhost/virtio_net.c | 14 ++
 4 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index ba805843b..8f24d4c7e 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -24,6 +24,8 @@
 #define VIRTIO_DEV_RUNNING 1
 /* Used to indicate that the device is ready to operate */
 #define VIRTIO_DEV_READY 2
+/* Used to indicate that the built-in vhost net device backend is enabled */
+#define VIRTIO_DEV_BUILTIN_VIRTIO_NET 4
 
 /* Backend value set by guest. */
 #define VIRTIO_DEV_STOPPED -1
@@ -353,6 +355,7 @@ int alloc_vring_queue(struct virtio_net *dev, uint32_t vring_idx);
 
 void vhost_set_ifname(int, const char *if_name, unsigned int if_len);
 void vhost_enable_dequeue_zero_copy(int vid);
+void vhost_set_builtin_virtio_net(int vid, bool enable);
 
 struct vhost_device_ops const *vhost_driver_callback_get(const char *path);
 
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 6e3857e7a..83befdced 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -40,6 +40,7 @@ struct vhost_user_socket {
bool reconnect;
bool dequeue_zero_copy;
bool iommu_support;
+   bool use_builtin_virtio_net;
 
/*
 * The "supported_features" indicates the feature bits the
@@ -195,6 +196,8 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
size = strnlen(vsocket->path, PATH_MAX);
vhost_set_ifname(vid, vsocket->path, size);
 
+   vhost_set_builtin_virtio_net(vid, vsocket->use_builtin_virtio_net);
+
if (vsocket->dequeue_zero_copy)
vhost_enable_dequeue_zero_copy(vid);
 
@@ -527,6 +530,12 @@ rte_vhost_driver_disable_features(const char *path, uint64_t features)
 
pthread_mutex_lock(&vhost_user.mutex);
vsocket = find_vhost_user_socket(path);
+
+   /* Note that use_builtin_virtio_net is not affected by this function
+* since callers may want to selectively disable features of the
+* built-in vhost net device backend.
+*/
+
if (vsocket)
vsocket->features &= ~features;
pthread_mutex_unlock(&vhost_user.mutex);
@@ -567,6 +576,11 @@ rte_vhost_driver_set_features(const char *path, uint64_t features)
if (vsocket) {
vsocket->supported_features = features;
vsocket->features = features;
+
+   /* Anyone setting feature bits is implementing their own vhost
+* device backend.
+*/
+   vsocket->use_builtin_virtio_net = false;
}
pthread_mutex_unlock(&vhost_user.mutex);
 
@@ -647,6 +661,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 * rte_vhost_driver_set_features(), which will overwrite following
 * two values.
 */
+   vsocket->use_builtin_virtio_net = true;
vsocket->supported_features = VIRTIO_NET_SUPPORTED_FEATURES;
vsocket->features   = VIRTIO_NET_SUPPORTED_FEATURES;
 
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 1dd9adbc7..a31ca5002 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -251,7 +251,7 @@ reset_device(struct virtio_net *dev)
 
dev->features = 0;
dev->protocol_features = 0;
-   dev->flags = 0;
+   dev->flags &= VIRTIO_DEV_BUILTIN_VIRTIO_NET;
 
for (i = 0; i < dev->nr_vring; i++)
reset_vring_queue(dev, i);
@@ -287,6 +287,7 @@ vhost_new_device(void)
 
vhost_devices[i] = dev;
dev->vid = i;
+   dev->flags = VIRTIO_DEV_BUILTIN_VIRTIO_NET;
dev->slave_req_fd = -1;
 
return i;
@@ -343,6 +344,20 @@ vhost_enable_dequeue_zero_copy(int vid)
dev->dequeue_zero_cop

[dpdk-dev] [PATCH 0/2] vhost: fix VIRTIO_NET_F_MQ vhost_scsi breakage

2018-01-31 Thread Stefan Hajnoczi
These patches fix a recent regression in librte_vhost that breaks the
vhost_scsi example application.  vhost_user.c assumes all devices are vhost net
backends when handling the VIRTIO_NET_F_MQ feature bit.  The code is triggered
by vhost scsi devices and causes virtqueues to be removed.  See Patch 2 for
details.

Patch 1 puts the infrastructure in place to distinguish between the built-in
virtio_net.c driver and generic vhost device backend usage.

Patch 2 fixes the regression by handling VIRTIO_NET_F_MQ only when the built-in
virtio_net.c driver is in use.

Stefan Hajnoczi (2):
  vhost: add flag for built-in virtio_net.c driver
  vhost: only drop vqs with built-in virtio_net.c driver

 lib/librte_vhost/vhost.h  |  3 +++
 lib/librte_vhost/socket.c | 15 +++
 lib/librte_vhost/vhost.c  | 17 -
 lib/librte_vhost/vhost_user.c |  3 ++-
 lib/librte_vhost/virtio_net.c | 14 ++
 5 files changed, 50 insertions(+), 2 deletions(-)

-- 
2.14.3



[dpdk-dev] [PATCH 2/2] vhost: only drop vqs with built-in virtio_net.c driver

2018-01-31 Thread Stefan Hajnoczi
Commit e29109323595beb3884da58126ebb3b878cb66f5 ("vhost: destroy unused
virtqueues when multiqueue not negotiated") broke vhost-scsi by removing
virtqueues when the virtio-net-specific VIRTIO_NET_F_MQ feature bit is
missing.

The vhost_user.c code shouldn't assume all devices are vhost net device
backends.  Use the new VIRTIO_DEV_BUILTIN_VIRTIO_NET flag to check
whether virtio_net.c is being used.

This fixes examples/vhost_scsi.

Cc: Maxime Coquelin 
Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost_user.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 1dd1a61b6..65ee33919 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -187,7 +187,8 @@ vhost_user_set_features(struct virtio_net *dev, uint64_t features)
(dev->features & (1 << VIRTIO_NET_F_MRG_RXBUF)) ? "on" : "off",
(dev->features & (1ULL << VIRTIO_F_VERSION_1)) ? "on" : "off");
 
-   if (!(dev->features & (1ULL << VIRTIO_NET_F_MQ))) {
+   if ((dev->flags & VIRTIO_DEV_BUILTIN_VIRTIO_NET) &&
+   !(dev->features & (1ULL << VIRTIO_NET_F_MQ))) {
/*
 * Remove all but first queue pair if MQ hasn't been
 * negotiated. This is safe because the device is not
-- 
2.14.3



[dpdk-dev] [PATCH] examples/vhost_scsi: drop unimplemented EVENT_IDX feature bit

2018-01-31 Thread Stefan Hajnoczi
The vhost_scsi example application negotiates the
VIRTIO_RING_F_EVENT_IDX feature bit but does not honor it when accessing
vrings.

In particular, commit e37ff954405addb8ea422426a2d162d00dcad196 ("vhost:
support virtqueue interrupt/notification suppression") broke vring call
because vq->last_used_idx is never updated by vhost_scsi.  The
vq->last_used_idx field is not even available via the librte_vhost
public API, so VIRTIO_RING_F_EVENT_IDX is currently only usable by the
built-in virtio_net.c driver in librte_vhost.

This patch drops VIRTIO_RING_F_EVENT_IDX from vhost_scsi so that vring
call works again.

Cc: Changpeng Liu 
Cc: Junjie Chen 
Signed-off-by: Stefan Hajnoczi 
---
 examples/vhost_scsi/vhost_scsi.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/examples/vhost_scsi/vhost_scsi.c b/examples/vhost_scsi/vhost_scsi.c
index da01ad378..3cb4383e9 100644
--- a/examples/vhost_scsi/vhost_scsi.c
+++ b/examples/vhost_scsi/vhost_scsi.c
@@ -21,7 +21,6 @@
 #include "scsi_spec.h"
 
 #define VIRTIO_SCSI_FEATURES ((1 << VIRTIO_F_NOTIFY_ON_EMPTY) |\
- (1 << VIRTIO_RING_F_EVENT_IDX) |\
  (1 << VIRTIO_SCSI_F_INOUT) |\
  (1 << VIRTIO_SCSI_F_CHANGE))
 
-- 
2.14.3



[dpdk-dev] [PATCH 1/8] vhost: add security model documentation to vhost_user.c

2018-02-05 Thread Stefan Hajnoczi
Input validation is not applied consistently in vhost_user.c.  This
suggests that not everyone has the same security model in mind when
working on the code.

Make the security model explicit so that everyone can understand and
follow the same model when modifying the code.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost_user.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 1dd1a61b6..a96afbe84 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -2,6 +2,23 @@
  * Copyright(c) 2010-2016 Intel Corporation
  */
 
+/* Security model
+ * --
+ * The vhost-user protocol connection is an external interface, so it must be
+ * robust against invalid inputs.
+ *
+ * This is important because the vhost-user master is only one step removed
+ * from the guest.  Malicious guests that have escaped will then launch further
+ * attacks from the vhost-user master.
+ *
+ * Even in deployments where guests are trusted, a bug in the vhost-user master
+ * can still cause invalid messages to be sent.  Such messages must not
+ * compromise the stability of the DPDK application by causing crashes, memory
+ * corruption, or other problematic behavior.
+ *
+ * Do not assume received VhostUserMsg fields contain sensible values!
+ */
+
 #include 
 #include 
 #include 
-- 
2.14.3



[dpdk-dev] [PATCH 2/8] vhost: avoid enum fields in VhostUserMsg

2018-02-05 Thread Stefan Hajnoczi
The VhostUserMsg struct binary representation must match the vhost-user
protocol specification since this struct is read from and written to the
socket.

The VhostUserMsg.request union contains enum fields.  Enum binary
representation is implementation-defined according to the C standard and
it is unportable to make assumptions about the representation:

  6.7.2.2 Enumeration specifiers
  ...
  Each enumerated type shall be compatible with char, a signed integer
  type, or an unsigned integer type. The choice of type is
  implementation-defined, but shall be capable of representing the
  values of all the members of the enumeration.

Additionally, librte_vhost relies on the enum type being unsigned when
validating untrusted inputs:

  if (ret <= 0 || msg.request.master >= VHOST_USER_MAX) {

If msg.request.master is signed then negative values pass this check!

Even if we assume gcc on x86_64 (SysV amd64 ABI) and don't care about
portability, the actual enum constants still affect the final type.  For
example, if we add a negative constant then its type changes to signed
int:

  typedef enum VhostUserRequest {
  ...
  VHOST_USER_INVALID = -1,
  };

This is very fragile and it's unlikely that anyone changing the code
would remember this.  A security hole can be introduced accidentally.

This patch switches VhostUserMsg.request fields to uint32_t to avoid the
portability and potential security issues.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost_user.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_vhost/vhost_user.h b/lib/librte_vhost/vhost_user.h
index d4bd604b9..0fafbe6e0 100644
--- a/lib/librte_vhost/vhost_user.h
+++ b/lib/librte_vhost/vhost_user.h
@@ -81,8 +81,8 @@ typedef struct VhostUserLog {
 
 typedef struct VhostUserMsg {
union {
-   VhostUserRequest master;
-   VhostUserSlaveRequest slave;
+   uint32_t master; /* a VhostUserRequest value */
+   uint32_t slave;  /* a VhostUserSlaveRequest value */
} request;
 
 #define VHOST_USER_VERSION_MASK 0x3
-- 
2.14.3



[dpdk-dev] [PATCH 0/8] vhost: input validation enhancements

2018-02-05 Thread Stefan Hajnoczi
This patch series addresses missing input validation that I came across when
reviewing vhost_user.c.

The first patch explains the security model and the rest fixes places with
missing checks.

Now is a good time to discuss the security model if anyone disagrees or has
questions about what Patch 1 says.

Stefan Hajnoczi (8):
  vhost: add security model documentation to vhost_user.c
  vhost: avoid enum fields in VhostUserMsg
  vhost: validate untrusted memory.nregions field
  vhost: clear out unused SCM_RIGHTS file descriptors
  vhost: reject invalid log base mmap_offset values
  vhost: fix msg->payload union typo in vhost_user_set_vring_addr()
  vhost: validate virtqueue size
  vhost: check for memory_size + mmap_offset overflow

 lib/librte_vhost/vhost_user.h |  4 +--
 lib/librte_vhost/socket.c |  8 +-
 lib/librte_vhost/vhost_user.c | 57 +--
 3 files changed, 64 insertions(+), 5 deletions(-)

-- 
2.14.3



[dpdk-dev] [PATCH 3/8] vhost: validate untrusted memory.nregions field

2018-02-05 Thread Stefan Hajnoczi
Check if memory.nregions is valid right away.  This eliminates the
possibility of bugs when memory.nregions is used later on in
vhost_user_set_mem_table().

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost_user.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index a96afbe84..48b493d70 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -662,6 +662,12 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
uint32_t i;
int fd;
 
+   if (memory.nregions > VHOST_MEMORY_MAX_NREGIONS) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "too many memory regions (%u)\n", memory.nregions);
+   return -1;
+   }
+
if (dev->mem && !vhost_memory_changed(&memory, dev->mem)) {
RTE_LOG(INFO, VHOST_CONFIG,
"(%d) memory regions not changed\n", dev->vid);
-- 
2.14.3



[dpdk-dev] [PATCH 6/8] vhost: fix msg->payload union typo in vhost_user_set_vring_addr()

2018-02-05 Thread Stefan Hajnoczi
vhost_user_set_vring_addr() uses the msg->payload.addr union member, not
msg->payload.state.  Luckily the offset of the 'index' field is
identical in both structs, so there was never any buggy behavior.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index fd47404ce..3a58d1082 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -516,7 +516,7 @@ vhost_user_set_vring_addr(struct virtio_net **pdev, VhostUserMsg *msg)
 
if (vq->enabled && (dev->features &
(1ULL << VHOST_USER_F_PROTOCOL_FEATURES))) {
-   dev = translate_ring_addresses(dev, msg->payload.state.index);
+   dev = translate_ring_addresses(dev, msg->payload.addr.index);
if (!dev)
return -1;
 
-- 
2.14.3



[dpdk-dev] [PATCH 7/8] vhost: validate virtqueue size

2018-02-05 Thread Stefan Hajnoczi
Check the virtqueue size constraints so that invalid values don't cause
bugs later on in the code.  For example, sometimes the virtqueue size is
stored as unsigned int and sometimes as uint16_t, so bad things happen
if it is ever larger than 65535.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost_user.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 3a58d1082..7d282cb36 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -237,6 +237,17 @@ vhost_user_set_vring_num(struct virtio_net *dev,
 
vq->size = msg->payload.state.num;
 
+   /* VIRTIO 1.0, 2.4 Virtqueues says:
+*
+*   Queue Size value is always a power of 2. The maximum Queue Size
+*   value is 32768.
+*/
+   if ((vq->size & (vq->size - 1)) || vq->size > 32768) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "invalid virtqueue size %u\n", vq->size);
+   return -1;
+   }
+
if (dev->dequeue_zero_copy) {
vq->nr_zmbuf = 0;
vq->last_zmbuf_idx = 0;
-- 
2.14.3



[dpdk-dev] [PATCH 4/8] vhost: clear out unused SCM_RIGHTS file descriptors

2018-02-05 Thread Stefan Hajnoczi
The number of file descriptors received is not stored by vhost_user.c.
vhost_user_set_mem_table() assumes that memory.nregions matches the
number of file descriptors received, but nothing guarantees this:

  for (i = 0; i < memory.nregions; i++)
  close(pmsg->fds[i]);

Another questionable code snippet is:

  case VHOST_USER_SET_LOG_FD:
  close(msg.fds[0]);

If not enough file descriptors were received then fds[] contains
uninitialized data from the stack (see read_fd_message()).  This might
cause non-vhost file descriptors to be closed if the uninitialized data
happens to match.

Refactoring vhost_user.c to pass around and check the number of file
descriptors everywhere would make the code more complex.  It is simpler
for read_fd_message() to set unused elements in fds[] to -1.  This way
close(-1) is called and no harm is done.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/socket.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 6e3857e7a..4e3f8abb9 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -96,6 +96,7 @@ read_fd_message(int sockfd, char *buf, int buflen, int *fds, int fd_num)
size_t fdsize = fd_num * sizeof(int);
char control[CMSG_SPACE(fdsize)];
struct cmsghdr *cmsg;
+   int got_fds = 0;
int ret;
 
memset(&msgh, 0, sizeof(msgh));
@@ -122,11 +123,16 @@ read_fd_message(int sockfd, char *buf, int buflen, int *fds, int fd_num)
cmsg = CMSG_NXTHDR(&msgh, cmsg)) {
if ((cmsg->cmsg_level == SOL_SOCKET) &&
(cmsg->cmsg_type == SCM_RIGHTS)) {
-   memcpy(fds, CMSG_DATA(cmsg), fdsize);
+   got_fds = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
+   memcpy(fds, CMSG_DATA(cmsg), got_fds * sizeof(int));
break;
}
}
 
+   /* Clear out unused file descriptors */
+   while (got_fds < fd_num)
+   fds[got_fds++] = -1;
+
return ret;
 }
 
-- 
2.14.3



[dpdk-dev] [PATCH 5/8] vhost: reject invalid log base mmap_offset values

2018-02-05 Thread Stefan Hajnoczi
If the log base mmap_offset is larger than mmap_size then it points
outside the mmap region.  We must not write to memory outside the mmap
region, so validate mmap_offset in vhost_user_set_log_base().

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost_user.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 48b493d70..fd47404ce 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -1005,6 +1005,15 @@ vhost_user_set_log_base(struct virtio_net *dev, struct VhostUserMsg *msg)
 
size = msg->payload.log.mmap_size;
off  = msg->payload.log.mmap_offset;
+
+   /* Don't allow mmap_offset to point outside the mmap region */
+   if (off > size) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "log offset %#"PRIx64" exceeds log size %#"PRIx64"\n",
+   off, size);
+   return -1;
+   }
+
RTE_LOG(INFO, VHOST_CONFIG,
"log mmap size: %"PRId64", offset: %"PRId64"\n",
size, off);
-- 
2.14.3



[dpdk-dev] [PATCH 8/8] vhost: check for memory_size + mmap_offset overflow

2018-02-05 Thread Stefan Hajnoczi
If memory_size + mmap_offset overflows then the memory region is bogus.
Do not use the overflowed mmap_size value for mmap().

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost_user.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 7d282cb36..cf742a558 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -729,7 +729,17 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
reg->fd  = fd;
 
mmap_offset = memory.regions[i].mmap_offset;
-   mmap_size   = reg->size + mmap_offset;
+
+   /* Check for memory_size + mmap_offset overflow */
+   if (mmap_offset >= -reg->size) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "mmap_offset (%#"PRIx64") and memory_size "
+   "(%#"PRIx64") overflow\n",
+   mmap_offset, reg->size);
+   goto err_mmap;
+   }
+
+   mmap_size = reg->size + mmap_offset;
 
/* mmap() without flag of MAP_ANONYMOUS, should be called
 * with length argument aligned with hugepagesz at older
-- 
2.14.3



[dpdk-dev] [PATCH] vhost: introduce rte_vhost_vring_call()

2017-12-21 Thread Stefan Hajnoczi
Users of librte_vhost currently implement the vring call operation
themselves.  Each caller performs the operation slightly differently.

This patch introduces a new librte_vhost API called
rte_vhost_vring_call() that performs the operation so that vhost-user
applications don't have to duplicate it.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/rte_vhost.h   | 15 +++
 examples/vhost/virtio_net.c| 11 ++-
 examples/vhost_scsi/vhost_scsi.c   |  6 +++---
 lib/librte_vhost/vhost.c   | 27 +++
 lib/librte_vhost/virtio_net.c  | 29 ++---
 lib/librte_vhost/rte_vhost_version.map |  7 +++
 6 files changed, 60 insertions(+), 35 deletions(-)

diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index fe5c94c69..dde0eeb03 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -85,7 +85,9 @@ struct rte_vhost_vring {
struct vring_used   *used;
uint64_tlog_guest_addr;
 
+   /** Deprecated, use rte_vhost_vring_call() instead. */
int callfd;
+
int kickfd;
uint16_tsize;
 };
@@ -435,6 +437,19 @@ int rte_vhost_get_mem_table(int vid, struct rte_vhost_memory **mem);
 int rte_vhost_get_vhost_vring(int vid, uint16_t vring_idx,
  struct rte_vhost_vring *vring);
 
+/**
+ * Notify the guest that used descriptors have been added to the vring.  This
+ * function acts as a memory barrier.
+ *
+ * @param vid
+ *  vhost device ID
+ * @param vring_idx
+ *  vring index
+ * @return
+ *  0 on success, -1 on failure
+ */
+int rte_vhost_vring_call(int vid, uint16_t vring_idx);
+
 /**
  * Get vhost RX queue avail count.
  *
diff --git a/examples/vhost/virtio_net.c b/examples/vhost/virtio_net.c
index 1ab57f526..252c5b8ce 100644
--- a/examples/vhost/virtio_net.c
+++ b/examples/vhost/virtio_net.c
@@ -207,13 +207,8 @@ vs_enqueue_pkts(struct vhost_dev *dev, uint16_t queue_id,
*(volatile uint16_t *)&vr->used->idx += count;
queue->last_used_idx += count;
 
-   /* flush used->idx update before we read avail->flags. */
-   rte_mb();
+   rte_vhost_vring_call(dev->vid, queue_id);
 
-   /* Kick the guest if necessary. */
-   if (!(vr->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
-   && (vr->callfd >= 0))
-   eventfd_write(vr->callfd, (eventfd_t)1);
return count;
 }
 
@@ -396,9 +391,7 @@ vs_dequeue_pkts(struct vhost_dev *dev, uint16_t queue_id,
 
vr->used->idx += i;
 
-   if (!(vr->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
-   && (vr->callfd >= 0))
-   eventfd_write(vr->callfd, (eventfd_t)1);
+   rte_vhost_vring_call(dev->vid, queue_id);
 
return i;
 }
diff --git a/examples/vhost_scsi/vhost_scsi.c b/examples/vhost_scsi/vhost_scsi.c
index b4f1f8d27..e30e61f6d 100644
--- a/examples/vhost_scsi/vhost_scsi.c
+++ b/examples/vhost_scsi/vhost_scsi.c
@@ -110,7 +110,7 @@ descriptor_is_wr(struct vring_desc *cur_desc)
 }
 
 static void
-submit_completion(struct vhost_scsi_task *task)
+submit_completion(struct vhost_scsi_task *task, uint32_t q_idx)
 {
struct rte_vhost_vring *vq;
struct vring_used *used;
@@ -131,7 +131,7 @@ submit_completion(struct vhost_scsi_task *task)
/* Send an interrupt back to the guest VM so that it knows
 * a completion is ready to be processed.
 */
-   eventfd_write(vq->callfd, (eventfd_t)1);
+   rte_vhost_vring_call(task->bdev->vid, q_idx);
 }
 
 static void
@@ -263,7 +263,7 @@ process_requestq(struct vhost_scsi_ctrlr *ctrlr, uint32_t q_idx)
task->resp->status = 0;
task->resp->resid = 0;
}
-   submit_completion(task);
+   submit_completion(task, q_idx);
rte_free(task);
}
 }
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 4f8b73a09..0d269f12a 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -519,6 +519,33 @@ rte_vhost_get_vhost_vring(int vid, uint16_t vring_idx,
return 0;
 }
 
+int
+rte_vhost_vring_call(int vid, uint16_t vring_idx)
+{
+   struct virtio_net *dev;
+   struct vhost_virtqueue *vq;
+
+   dev = get_device(vid);
+   if (!dev)
+   return -1;
+
+   if (vring_idx >= VHOST_MAX_VRING)
+   return -1;
+
+   vq = dev->virtqueue[vring_idx];
+   if (!vq)
+   return -1;
+
+   /* flush used->idx update before we read avail->flags. */
+   rte_mb();
+
+   /* Kick the guest if necessary. */
+   if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
+   && (vq->callfd >= 0))
+   eventfd_write(vq->callfd, (eventfd_t)1);
+
+   return 0;
+}

[dpdk-dev] [PATCH v2 0/2] vhost: introduce rte_vhost_vring_call()

2018-01-02 Thread Stefan Hajnoczi
v2:
 * Add internal vhost_vring_call() helper function [Maxime]

These patches eliminate code duplication for vhost_virtqueue->callfd users by
introducing rte_vhost_vring_call() (public API) and vhost_vring_call()
(librte_vhost-internal API).

Stefan Hajnoczi (2):
  vhost: add vhost_vring_call() helper
  vhost: introduce rte_vhost_vring_call()

 lib/librte_vhost/rte_vhost.h   | 15 +++
 lib/librte_vhost/vhost.h   | 12 
 examples/vhost/virtio_net.c| 11 ++-
 examples/vhost_scsi/vhost_scsi.c   |  6 +++---
 lib/librte_vhost/vhost.c   | 21 +
 lib/librte_vhost/virtio_net.c  | 23 +++
 lib/librte_vhost/rte_vhost_version.map |  7 +++
 7 files changed, 63 insertions(+), 32 deletions(-)

-- 
2.14.3



[dpdk-dev] [PATCH v2 1/2] vhost: add vhost_vring_call() helper

2018-01-02 Thread Stefan Hajnoczi
Extract the callfd eventfd signal operation so virtio_net.c does not
have to repeat it multiple times.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/vhost.h  | 12 
 lib/librte_vhost/virtio_net.c | 23 +++
 2 files changed, 15 insertions(+), 20 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 04f54cb60..ac81d83bb 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -394,4 +394,16 @@ vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
return __vhost_iova_to_vva(dev, vq, iova, size, perm);
 }
 
+static __rte_always_inline void
+vhost_vring_call(struct vhost_virtqueue *vq)
+{
+   /* Flush used->idx update before we read avail->flags. */
+   rte_mb();
+
+   /* Kick the guest if necessary. */
+   if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
+   && (vq->callfd >= 0))
+   eventfd_write(vq->callfd, (eventfd_t)1);
+}
+
 #endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 79d80f7fd..a92a5181d 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -408,13 +408,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
offsetof(struct vring_used, idx),
sizeof(vq->used->idx));
 
-   /* flush used->idx update before we read avail->flags. */
-   rte_mb();
-
-   /* Kick the guest if necessary. */
-   if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
-   && (vq->callfd >= 0))
-   eventfd_write(vq->callfd, (eventfd_t)1);
+   vhost_vring_call(vq);
 out:
if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
vhost_user_iotlb_rd_unlock(vq);
@@ -701,14 +695,7 @@ virtio_dev_merge_rx(struct virtio_net *dev, uint16_t queue_id,
 
if (likely(vq->shadow_used_idx)) {
flush_shadow_used_ring(dev, vq);
-
-   /* flush used->idx update before we read avail->flags. */
-   rte_mb();
-
-   /* Kick the guest if necessary. */
-   if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
-   && (vq->callfd >= 0))
-   eventfd_write(vq->callfd, (eventfd_t)1);
+   vhost_vring_call(vq);
}
 
 out:
@@ -1107,11 +1094,7 @@ update_used_idx(struct virtio_net *dev, struct vhost_virtqueue *vq,
vq->used->idx += count;
vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
sizeof(vq->used->idx));
-
-   /* Kick guest if required. */
-   if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
-   && (vq->callfd >= 0))
-   eventfd_write(vq->callfd, (eventfd_t)1);
+   vhost_vring_call(vq);
 }
 
 static __rte_always_inline struct zcopy_mbuf *
-- 
2.14.3



[dpdk-dev] [PATCH v2 2/2] vhost: introduce rte_vhost_vring_call()

2018-01-02 Thread Stefan Hajnoczi
Users of librte_vhost currently implement the vring call operation
themselves.  Each caller performs the operation slightly differently.

This patch introduces a new librte_vhost API called
rte_vhost_vring_call() that performs the operation so that vhost-user
applications don't have to duplicate it.

Signed-off-by: Stefan Hajnoczi 
---
 lib/librte_vhost/rte_vhost.h   | 15 +++
 examples/vhost/virtio_net.c| 11 ++-
 examples/vhost_scsi/vhost_scsi.c   |  6 +++---
 lib/librte_vhost/vhost.c   | 21 +
 lib/librte_vhost/rte_vhost_version.map |  7 +++
 5 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index f65364495..890f8a831 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -86,7 +86,9 @@ struct rte_vhost_vring {
struct vring_used   *used;
uint64_tlog_guest_addr;
 
+   /** Deprecated, use rte_vhost_vring_call() instead. */
int callfd;
+
int kickfd;
uint16_tsize;
 };
@@ -436,6 +438,19 @@ int rte_vhost_get_mem_table(int vid, struct rte_vhost_memory **mem);
 int rte_vhost_get_vhost_vring(int vid, uint16_t vring_idx,
  struct rte_vhost_vring *vring);
 
+/**
+ * Notify the guest that used descriptors have been added to the vring.  This
+ * function acts as a memory barrier.
+ *
+ * @param vid
+ *  vhost device ID
+ * @param vring_idx
+ *  vring index
+ * @return
+ *  0 on success, -1 on failure
+ */
+int rte_vhost_vring_call(int vid, uint16_t vring_idx);
+
 /**
  * Get vhost RX queue avail count.
  *
diff --git a/examples/vhost/virtio_net.c b/examples/vhost/virtio_net.c
index 1ab57f526..252c5b8ce 100644
--- a/examples/vhost/virtio_net.c
+++ b/examples/vhost/virtio_net.c
@@ -207,13 +207,8 @@ vs_enqueue_pkts(struct vhost_dev *dev, uint16_t queue_id,
*(volatile uint16_t *)&vr->used->idx += count;
queue->last_used_idx += count;
 
-   /* flush used->idx update before we read avail->flags. */
-   rte_mb();
+   rte_vhost_vring_call(dev->vid, queue_id);
 
-   /* Kick the guest if necessary. */
-   if (!(vr->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
-   && (vr->callfd >= 0))
-   eventfd_write(vr->callfd, (eventfd_t)1);
return count;
 }
 
@@ -396,9 +391,7 @@ vs_dequeue_pkts(struct vhost_dev *dev, uint16_t queue_id,
 
vr->used->idx += i;
 
-   if (!(vr->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
-   && (vr->callfd >= 0))
-   eventfd_write(vr->callfd, (eventfd_t)1);
+   rte_vhost_vring_call(dev->vid, queue_id);
 
return i;
 }
diff --git a/examples/vhost_scsi/vhost_scsi.c b/examples/vhost_scsi/vhost_scsi.c
index b4f1f8d27..e30e61f6d 100644
--- a/examples/vhost_scsi/vhost_scsi.c
+++ b/examples/vhost_scsi/vhost_scsi.c
@@ -110,7 +110,7 @@ descriptor_is_wr(struct vring_desc *cur_desc)
 }
 
 static void
-submit_completion(struct vhost_scsi_task *task)
+submit_completion(struct vhost_scsi_task *task, uint32_t q_idx)
 {
struct rte_vhost_vring *vq;
struct vring_used *used;
@@ -131,7 +131,7 @@ submit_completion(struct vhost_scsi_task *task)
/* Send an interrupt back to the guest VM so that it knows
 * a completion is ready to be processed.
 */
-   eventfd_write(vq->callfd, (eventfd_t)1);
+   rte_vhost_vring_call(task->bdev->vid, q_idx);
 }
 
 static void
@@ -263,7 +263,7 @@ process_requestq(struct vhost_scsi_ctrlr *ctrlr, uint32_t q_idx)
task->resp->status = 0;
task->resp->resid = 0;
}
-   submit_completion(task);
+   submit_completion(task, q_idx);
rte_free(task);
}
 }
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 4f8b73a09..1244d76d4 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -519,6 +519,27 @@ rte_vhost_get_vhost_vring(int vid, uint16_t vring_idx,
return 0;
 }
 
+int
+rte_vhost_vring_call(int vid, uint16_t vring_idx)
+{
+   struct virtio_net *dev;
+   struct vhost_virtqueue *vq;
+
+   dev = get_device(vid);
+   if (!dev)
+   return -1;
+
+   if (vring_idx >= VHOST_MAX_VRING)
+   return -1;
+
+   vq = dev->virtqueue[vring_idx];
+   if (!vq)
+   return -1;
+
+   vhost_vring_call(vq);
+   return 0;
+}
+
 uint16_t
 rte_vhost_avail_entries(int vid, uint16_t queue_id)
 {
diff --git a/lib/librte_vhost/rte_vhost_version.map b/lib/librte_vhost/rte_vhost_version.map
index 1e7049535..b30187601 100644
--- a/lib/librte_vhost/rte_vhost_version.map
+++ b/lib/librte_vhost/rte_vhost_version.map